Production-grade metrics collection for AI financial agents running on Purple Flea. Custom gauges, histograms, and counters for every critical agent signal — trade P&L, order latency, wallet balances, and escrow state.
Standard infrastructure metrics are not enough for AI trading agents. These Purple Flea-specific Prometheus metrics capture the financial signals that matter: risk, execution quality, and capital efficiency.
Use rate() to compute the error rate over time windows for Alertmanager thresholds.

The prometheus_client library handles metric registration, exposition format, and HTTP server creation. The Purple Flea custom collector pattern polls APIs lazily at scrape time, avoiding unnecessary API calls between scrapes.

```python
"""
Purple Flea Prometheus Metrics Module

Defines all custom metrics for AI trading agent observability.
Import this module in your agent's main server file.
"""
from prometheus_client import (
    Gauge, Counter, Histogram, Summary,
    CollectorRegistry, CONTENT_TYPE_LATEST, generate_latest
)

# Use a custom registry to avoid polluting the default with process metrics
REGISTRY = CollectorRegistry(auto_describe=True)

# ============================================================
# FINANCIAL METRICS
# ============================================================

# Trade P&L: updated after each trade settlement
trade_pnl_gauge = Gauge(
    "pf_trade_pnl_usdc",
    "Current trade P&L in USDC (positive=profit, negative=loss)",
    labelnames=["agent_id", "strategy", "pnl_type"],  # pnl_type: realized|unrealized
    registry=REGISTRY
)

# Wallet balance: polled from Purple Flea API at scrape time
wallet_balance_gauge = Gauge(
    "pf_wallet_balance",
    "Current wallet balance in native currency units",
    labelnames=["agent_id", "currency", "wallet_type"],  # wallet_type: hot|cold
    registry=REGISTRY
)

# Drawdown from peak equity
drawdown_gauge = Gauge(
    "pf_drawdown_fraction",
    "Current drawdown from peak equity (0.0 = no drawdown, 1.0 = total loss)",
    labelnames=["agent_id", "window"],  # window: 1h|24h|7d|all_time
    registry=REGISTRY
)

# Casino win rate (rolling)
casino_win_rate_gauge = Gauge(
    "pf_casino_win_rate",
    "Rolling win rate for casino games over last N rounds (0.0 to 1.0)",
    labelnames=["agent_id", "game_type"],  # game_type: crash|coinflip|dice
    registry=REGISTRY
)

# ============================================================
# EXECUTION QUALITY METRICS
# ============================================================

# Order latency histogram: fine-grained buckets for p99 precision
order_latency_histogram = Histogram(
    "pf_order_latency_ms",
    "Order submission to Purple Flea acknowledgement latency (milliseconds)",
    labelnames=["agent_id", "endpoint", "order_type"],
    buckets=[2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500],
    registry=REGISTRY
)

# API error counter
api_errors_total = Counter(
    "pf_api_errors_total",
    "Total number of Purple Flea API errors by status code",
    labelnames=["agent_id", "endpoint", "status_code"],
    registry=REGISTRY
)

# ============================================================
# ESCROW METRICS
# ============================================================

# Escrow contracts initiated (lifetime counter)
escrow_initiated_total = Counter(
    "pf_escrow_initiated_total",
    "Total escrow contracts initiated by this agent",
    labelnames=["agent_id"],
    registry=REGISTRY
)

# Escrow contracts completed (settlement or timeout)
escrow_settled_total = Counter(
    "pf_escrow_settled_total",
    "Total escrow contracts settled or timed out",
    labelnames=["agent_id", "outcome"],  # outcome: settled|timeout|disputed
    registry=REGISTRY
)

# Currently open escrow contracts
escrow_open_gauge = Gauge(
    "pf_escrow_open",
    "Number of currently open escrow contracts for this agent",
    labelnames=["agent_id"],
    registry=REGISTRY
)

# Referral income (passive revenue stream)
referral_income_total = Counter(
    "pf_referral_income_usdc_total",
    "Cumulative USDC earned through the 15% referral program",
    labelnames=["agent_id", "referral_tier"],
    registry=REGISTRY
)
```
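To make the exposition format concrete, here is a minimal sketch of how one labelled gauge sample looks on the wire. It is a simplified stand-in written with the standard library only, not the prometheus_client implementation; the helper names `escape_label_value` and `render_gauge` and the sample values (`agent-1`, `142.5`) are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str,
                 samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one gauge family: a HELP line, a TYPE line, then one line per sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample for the pf_wallet_balance gauge defined above.
text = render_gauge(
    "pf_wallet_balance",
    "Current wallet balance in native currency units",
    [({"agent_id": "agent-1", "currency": "USDC", "wallet_type": "hot"}, 142.5)],
)
print(text)
```

Each distinct combination of label values becomes its own line, and therefore its own time series, which is why label choice matters for fleet size.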
Integrate the Prometheus metrics endpoint into your existing FastAPI agent server. The /metrics route returns Prometheus exposition format — scraped directly by Prometheus on the configured interval.
```python
import os
import time

import httpx
from fastapi import FastAPI, Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

from .metrics import (
    REGISTRY, trade_pnl_gauge, wallet_balance_gauge,
    order_latency_histogram, api_errors_total, escrow_open_gauge,
)

PF_BASE = "https://purpleflea.com/api"
PF_API_KEY = os.environ["PF_API_KEY"]   # pf_live_xxxx
AGENT_ID = os.environ["PF_AGENT_ID"]


async def refresh_wallet_metrics():
    """Pull current wallet balances from Purple Flea API and update gauges."""
    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{PF_BASE}/wallet/{AGENT_ID}/balances",
            headers={"Authorization": f"Bearer {PF_API_KEY}"},
            timeout=8.0
        )
        r.raise_for_status()
        data = r.json()
    for currency, balance in data["balances"].items():
        wallet_balance_gauge.labels(
            agent_id=AGENT_ID, currency=currency, wallet_type="hot"
        ).set(balance)


async def refresh_escrow_metrics():
    """Count open escrow contracts and update gauge."""
    async with httpx.AsyncClient() as client:
        r = await client.get(
            f"{PF_BASE}/escrow?agent_id={AGENT_ID}&status=open",
            headers={"Authorization": f"Bearer {PF_API_KEY}"},
            timeout=8.0
        )
        r.raise_for_status()  # do not parse an error body as a count
        escrow_open_gauge.labels(agent_id=AGENT_ID).set(r.json()["count"])


# Instrument trade execution with latency histogram
async def place_bet(game_type: str, amount: float, params: dict):
    start = time.perf_counter()
    try:
        async with httpx.AsyncClient() as client:
            r = await client.post(
                f"{PF_BASE}/casino/{game_type}",
                json={"agent_id": AGENT_ID, "amount": amount, **params},
                headers={"Authorization": f"Bearer {PF_API_KEY}"},
                timeout=10.0
            )
            r.raise_for_status()

        elapsed_ms = (time.perf_counter() - start) * 1000
        order_latency_histogram.labels(
            agent_id=AGENT_ID, endpoint=f"casino/{game_type}", order_type="bet"
        ).observe(elapsed_ms)

        result = r.json()

        # Update P&L gauge after settlement. Gauge.inc() accepts negative
        # amounts, so there is no need to touch the private _value attribute.
        pnl_delta = result["pnl_delta_usdc"]
        trade_pnl_gauge.labels(
            agent_id=AGENT_ID, strategy=game_type, pnl_type="realized"
        ).inc(pnl_delta)
        return result

    except httpx.HTTPStatusError as e:
        api_errors_total.labels(
            agent_id=AGENT_ID,
            endpoint=f"casino/{game_type}",
            status_code=str(e.response.status_code)
        ).inc()
        raise


app = FastAPI(title="Purple Flea Agent")


@app.get("/metrics")
async def metrics():
    """
    Prometheus metrics endpoint.
    Refreshes live data from Purple Flea APIs before generating output.
    Scrape this endpoint with Prometheus every 15s.
    """
    await refresh_wallet_metrics()
    await refresh_escrow_metrics()
    body = generate_latest(REGISTRY)
    return Response(content=body, media_type=CONTENT_TYPE_LATEST)


@app.get("/health")
async def health():
    return {"status": "ok", "agent_id": AGENT_ID}
```
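Because the handler above calls the Purple Flea API on every request to /metrics, anything that hits the endpoint (a second Prometheus replica, a readiness probe) triggers extra upstream calls. One possible mitigation is a small time-based cache around the fetch. The sketch below is synchronous and standard-library only; `ScrapeCache`, `fake_fetch_balances`, and the 10-second TTL are hypothetical names and values chosen for illustration, not part of the Purple Flea or prometheus_client APIs.

```python
import time
from typing import Callable, Optional


class ScrapeCache:
    """Cache the result of an expensive fetch for ttl_seconds."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> dict:
        """Return the cached value, refetching only when the TTL has expired."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Demo with a fake fetch function and a fake clock so no network is needed.
calls = []

def fake_fetch_balances() -> dict:
    calls.append(1)
    return {"USDC": 142.5}

fake_now = [0.0]
cache = ScrapeCache(fake_fetch_balances, ttl_seconds=10.0, clock=lambda: fake_now[0])
cache.get()
cache.get()            # within the TTL: served from the cache
fake_now[0] = 15.0
cache.get()            # TTL expired: fetches again
print(len(calls))      # 2
```

A TTL slightly below the scrape interval keeps each scrape fresh while absorbing duplicate requests in between.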
These PromQL expressions power the Purple Flea Grafana dashboards. Use them in Grafana panel queries, recording rules, or the Prometheus expression browser for ad-hoc investigation.
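For ad-hoc investigation outside Grafana, the same expressions can be sent to the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The sketch below only builds the request URLs. The expressions are taken from the metric schema in this guide, while `PROMETHEUS_URL`, the dictionary keys, and `instant_query_url` are hypothetical names for illustration.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a locally reachable Prometheus

# Expressions built from the pf_* metrics described in this guide.
DASHBOARD_QUERIES = {
    "pnl_by_strategy": 'sum(pf_trade_pnl_usdc{pnl_type="realized"}) by (strategy)',
    "p99_latency_ms": (
        "histogram_quantile(0.99, "
        "sum(rate(pf_order_latency_ms_bucket[5m])) by (le))"
    ),
    "open_escrows": "sum(pf_escrow_open)",
    "lowest_balance": "min(pf_wallet_balance) by (currency)",
}


def instant_query_url(name: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': DASHBOARD_QUERIES[name]})}"


print(instant_query_url("open_escrows"))
# http://localhost:9090/api/v1/query?query=sum%28pf_escrow_open%29
```

Fetching the URL with any HTTP client returns a JSON document whose `data.result` field holds the matching series.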
PromQL aggregations over large agent fleets can be expensive at query time. Recording rules pre-compute these aggregations on the Prometheus server and store them as new time series — making dashboard queries instantaneous regardless of fleet size.
```yaml
groups:
  - name: purpleflea_fleet_aggregations
    interval: 30s   # Evaluate every 30 seconds
    rules:
      # Total fleet P&L by strategy (for leaderboard panels)
      - record: pf:fleet_pnl_usdc:sum_by_strategy
        expr: |
          sum(pf_trade_pnl_usdc{pnl_type="realized"}) by (strategy)

      # Fleet-wide p95 latency (pre-computed for Grafana stat panel)
      - record: pf:order_latency_p95:fleet
        expr: |
          histogram_quantile(0.95,
            sum(rate(pf_order_latency_ms_bucket[5m])) by (le)
          )

      # Fleet-wide p99 latency
      - record: pf:order_latency_p99:fleet
        expr: |
          histogram_quantile(0.99,
            sum(rate(pf_order_latency_ms_bucket[5m])) by (le)
          )

      # Total active escrow count across all agents
      - record: pf:escrow_open:fleet_total
        expr: sum(pf_escrow_open)

      # Number of agents with drawdown > 10% (risk cohort size)
      - record: pf:agents_at_risk:count
        expr: |
          count(pf_drawdown_fraction{window="all_time"} > 0.10)

      # Average win rate per game type
      - record: pf:casino_win_rate:avg_by_game
        expr: avg(pf_casino_win_rate) by (game_type)

      # Fleet API error rate (pre-computed for alerting).
      # pf_api_errors_total counts only failed calls, so dividing it by itself
      # would always be close to 1. Successful calls are taken from the latency
      # histogram's _count series, which is observed only on success.
      - record: pf:api_error_rate:5m
        expr: |
          sum(rate(pf_api_errors_total{status_code=~"4..|5.."}[5m]))
          /
          (
            sum(rate(pf_api_errors_total[5m]))
            + sum(rate(pf_order_latency_ms_count[5m]))
          )

      # Total referral income velocity (USDC/hr)
      - record: pf:referral_income_usdc_per_hour:fleet
        expr: sum(rate(pf_referral_income_usdc_total[1h])) * 3600

      # Per-agent P&L change over the last hour (use with sort_desc in Grafana).
      # pf_trade_pnl_usdc is a gauge, so delta() applies; increase() is for counters.
      - record: pf:agent_pnl_1h_increase:by_agent
        expr: |
          delta(pf_trade_pnl_usdc{pnl_type="realized"}[1h])
```
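The p95 and p99 rules above rely on histogram_quantile(), which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The following sketch reproduces that idea for classic cumulative buckets so the effect of bucket boundaries is easier to see. It is a simplified illustration, not Prometheus's implementation, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Made-up cumulative counts for the pf_order_latency_ms bucket boundaries.
sample = [(2, 0), (5, 10), (10, 40), (25, 70), (50, 90), (100, 96),
          (250, 98), (500, 99), (1000, 100), (2500, 100), (math.inf, 100)]

p95 = histogram_quantile(0.95, sample)
p99 = histogram_quantile(0.99, sample)
print(f"p95={p95:.1f}ms p99={p99:.1f}ms")
```

Because the estimate can never be more precise than the bucket it lands in, a threshold such as 1000 ms is easiest to reason about when it coincides with a bucket boundary, as it does here.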
These Prometheus alert rules evaluate continuously against your agent metrics. When conditions are met, Alertmanager routes notifications to Slack, PagerDuty, or any webhook endpoint — including automated Purple Flea API calls to halt trading.
Fires when any agent's drawdown stays above 20% of its peak equity for 1 minute. At this level, automated intervention is recommended: halt trading and notify human operators.
pf_drawdown_fraction{window="all_time"} > 0.20
Early warning at 10% drawdown. Fires for 5 minutes before escalating to critical. Allows agents to reduce position size or switch to a more conservative strategy.
pf_drawdown_fraction{window="all_time"} > 0.10
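Both drawdown alerts depend on the agent publishing a correct pf_drawdown_fraction value. A minimal sketch of an all-time drawdown calculation follows, assuming equity is reported as a non-negative number; `DrawdownTracker` and `severity` are hypothetical helpers, and the 10% and 20% cut-offs mirror the two alert thresholds.

```python
class DrawdownTracker:
    """Track peak equity and report drawdown as a fraction of that peak."""

    def __init__(self) -> None:
        self.peak = 0.0

    def update(self, equity: float) -> float:
        """Record a new equity value and return the current drawdown (0.0 to 1.0)."""
        self.peak = max(self.peak, equity)
        if self.peak <= 0:
            return 0.0
        drawdown = (self.peak - equity) / self.peak
        return min(max(drawdown, 0.0), 1.0)


def severity(drawdown: float) -> str:
    """Map a drawdown fraction onto the alert levels described above."""
    if drawdown > 0.20:
        return "critical"
    if drawdown > 0.10:
        return "warning"
    return "ok"


tracker = DrawdownTracker()
history = [(equity, tracker.update(equity)) for equity in (1000.0, 1200.0, 1020.0, 900.0)]
for equity, dd in history:
    print(f"equity={equity:.0f} drawdown={dd:.2f} level={severity(dd)}")
```

The value returned by `update()` is what an agent would pass to `drawdown_gauge.labels(...).set(...)` with `window="all_time"`.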
Fires when an agent's USDC balance drops below 25 — insufficient for minimum bet sizes. Agent should claim from faucet or request emergency transfer to remain operational.
pf_wallet_balance{currency="USDC"} < 25
Pre-emptive warning at 100 USDC balance. Gives agents time to arrange top-up before reaching the critical threshold that halts operations.
pf_wallet_balance{currency="USDC"} < 100
API error rate above 5%, sustained for 2 minutes (the rate itself is computed over a 5-minute window). Usually indicates authentication key expiry, rate limit breach, or Purple Flea service degradation. Check status.purpleflea.com.
pf:api_error_rate:5m > 0.05
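An error rate is a ratio of failed calls to all calls, so its denominator has to include successes as well as failures. The sketch below shows the arithmetic on two consecutive counter samples, including the way a counter that drops is treated as a restart. It assumes a count of successful calls is available (for example from the latency histogram's observation count), ignores the extrapolation that rate() performs, and uses made-up sample values.

```python
def counter_increase(previous: float, current: float) -> float:
    """Increase between two counter samples, treating a drop as a reset to zero."""
    return current if current < previous else current - previous


def error_ratio(errors_prev: float, errors_now: float,
                ok_prev: float, ok_now: float) -> float:
    """Fraction of calls in the window that failed: errors / (errors + successes)."""
    errors = counter_increase(errors_prev, errors_now)
    ok = counter_increase(ok_prev, ok_now)
    total = errors + ok
    return errors / total if total else 0.0


# Made-up samples: 6 new errors and 94 new successes in the window.
ratio = error_ratio(errors_prev=12, errors_now=18, ok_prev=400, ok_now=494)
print(f"{ratio:.2f}", ratio > 0.05)   # 0.06 True

# After an agent restart the error counter starts again from zero.
print(counter_increase(previous=18, current=2))   # 2
```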
p99 order latency above 1 second suggests severe performance degradation — network issues, agent logic bottleneck, or API overload. Investigate immediately to avoid missed opportunities.
pf:order_latency_p99:fleet > 1000
```yaml
groups:
  - name: purpleflea_agent_alerts
    rules:
      - alert: AgentDrawdownBreachCritical
        expr: pf_drawdown_fraction{window="all_time"} > 0.20
        for: 1m
        labels:
          severity: critical
          team: agents
        annotations:
          summary: "Agent {{ $labels.agent_id }} drawdown {{ $value | humanizePercentage }}"
          description: "Drawdown exceeds 20%. Consider halting trading immediately."
          runbook_url: "https://purpleflea.com/docs/risk-management"
          dashboard_url: "https://grafana.yourhost.com/d/pf-agent-fleet-v1"

      - alert: WalletBalanceCriticallyLow
        expr: pf_wallet_balance{currency="USDC"} < 25
        for: 0m   # Fire immediately, no buffer
        labels:
          severity: critical
          team: agents
        annotations:
          summary: "Agent {{ $labels.agent_id }} wallet critically low: {{ $value }} USDC"
          description: "Balance below minimum bet threshold. Claim from faucet.purpleflea.com"

      - alert: HighAPIErrorRate
        expr: pf:api_error_rate:5m > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Purple Flea API error rate {{ $value | humanizePercentage }}"
          description: "Check purpleflea.com/status for service health."

      - alert: OrderLatencyP99Critical
        expr: pf:order_latency_p99:fleet > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "p99 order latency {{ $value }}ms exceeds 1s threshold"
```
```yaml
# Alertmanager does not expand ${VAR} placeholders itself. Substitute them
# (for example with envsubst) before loading this file.
global:
  resolve_timeout: 5m
  slack_api_url: ${SLACK_WEBHOOK_URL}

route:
  receiver: default
  group_by: [alertname, agent_id]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
      continue: true
    - matchers:
        - alertname="AgentDrawdownBreachCritical"
      receiver: halt_trading_webhook

receivers:
  - name: default
    slack_configs:
      - channel: "#pf-agent-alerts"
        title: "[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}"

  - name: halt_trading_webhook
    webhook_configs:
      - url: "http://agent-controller:8080/halt"
        send_resolved: true
        http_config:
          bearer_token_file: /var/run/secrets/halt-token

  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
        description: "{{ .CommonAnnotations.summary }}"
```
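The halt_trading_webhook receiver posts a JSON document to the controller, and with send_resolved enabled it posts again when the alert clears. The sketch below shows how a controller might decide which agents to halt and which to resume from that payload. The field names (`alerts`, `status`, `labels`) follow Alertmanager's webhook format; the function name, the example payload, and the halt/resume policy are assumptions for illustration.

```python
import json

HALT_ALERT = "AgentDrawdownBreachCritical"


def split_halt_and_resume(payload: dict) -> tuple[set[str], set[str]]:
    """Return (agents to halt, agents to resume) from a webhook payload."""
    halt: set[str] = set()
    resume: set[str] = set()
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        agent_id = labels.get("agent_id")
        if labels.get("alertname") != HALT_ALERT or agent_id is None:
            continue  # ignore unrelated alerts and alerts without an agent
        if alert.get("status") == "firing":
            halt.add(agent_id)
        else:
            resume.add(agent_id)
    # An agent that is both firing and resolved in one batch stays halted.
    return halt, resume - halt


# Hypothetical request body, trimmed to the fields this sketch reads.
example_body = """
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "AgentDrawdownBreachCritical", "agent_id": "agent-7"}},
    {"status": "resolved",
     "labels": {"alertname": "AgentDrawdownBreachCritical", "agent_id": "agent-3"}},
    {"status": "firing",
     "labels": {"alertname": "WalletBalanceCriticallyLow", "agent_id": "agent-9"}}
  ]
}
"""

halt, resume = split_halt_and_resume(json.loads(example_body))
print(sorted(halt), sorted(resume))   # ['agent-7'] ['agent-3']
```

Whether a resolved alert should resume trading automatically is a policy decision; many operators would require a human to confirm.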
Running Purple Flea agents on Kubernetes with the Prometheus Operator? A ServiceMonitor custom resource automatically configures Prometheus to discover and scrape all agent pods matching a label selector — no manual scrape config updates needed as you scale the fleet.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: purpleflea-agents
  namespace: agents
  labels:
    app: purpleflea-agent
    release: prometheus-stack   # Must match Prometheus operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - agents
  selector:
    matchLabels:
      app.kubernetes.io/component: purpleflea-agent
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      honorLabels: true
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_label_agent_id]
          targetLabel: agent_id
        - sourceLabels: [__meta_kubernetes_pod_label_strategy]
          targetLabel: strategy
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pf-agent-fleet
  namespace: agents
spec:
  replicas: 10   # Scale to 10 agent pods
  selector:
    matchLabels:
      app.kubernetes.io/component: purpleflea-agent
  template:
    metadata:
      labels:
        app.kubernetes.io/component: purpleflea-agent
        app.kubernetes.io/name: pf-agent
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: agent
          image: your-registry/pf-agent:latest
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 8000
          env:
            - name: PF_API_KEY
              valueFrom:
                secretKeyRef:
                  name: pf-credentials
                  key: api-key   # Stored as pf_live_xxxx in secret
            - name: PF_AGENT_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          resources:
            requests: { cpu: "100m", memory: "128Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
          livenessProbe:
            httpGet: { path: /health, port: http }
            initialDelaySeconds: 10
          readinessProbe:
            httpGet: { path: /metrics, port: metrics }
            initialDelaySeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: pf-agent-metrics
  namespace: agents
  labels:
    app.kubernetes.io/component: purpleflea-agent
spec:
  selector:
    app.kubernetes.io/component: purpleflea-agent
  ports:
    - name: metrics
      port: 8000
      targetPort: metrics
```
AI trading agents have unique observability requirements that generic APM tools miss. Prometheus's pull-based model, label cardinality, and rich PromQL make it the right foundation for financial agent monitoring.
Prometheus scrapes your agent's /metrics endpoint on schedule. If an agent crashes, its metrics disappear from Prometheus — triggering absence-based alerts. No push infrastructure required for Kubernetes deployments.
Label every metric with agent_id, strategy, and game_type without performance penalty at the metric count scales typical agent fleets reach (tens to hundreds of agents).
PromQL's rate(), increase(), histogram_quantile(), and topk() functions map naturally to financial analytics: drawdown velocity, P&L acceleration, latency percentiles, and leaderboard rankings.
The Prometheus data source in Grafana is the most mature integration available. Variable queries, ad-hoc filters, and exemplar links all work out-of-the-box with the Purple Flea metric schema.
ServiceMonitor CRDs, pod annotations, and namespace selectors let you auto-discover every agent pod as it starts. Scale from 1 to 1,000 agent replicas without touching Prometheus config.
Route critical alerts (drawdown breach, low balance) to PagerDuty while sending info-level alerts (referral fee earned) to Slack. Silence rules during planned maintenance windows.
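The point above about labelling at fleet scale can be checked with a rough upper bound: the number of time series for a metric is at most the product of the distinct values of each label. The sketch below applies that to two metrics from this guide. The label value counts (200 agents, 4 endpoints, and so on) are hypothetical, and the real number is usually lower because not every combination occurs.

```python
from math import prod


def series_count(label_values: dict[str, int], series_per_combination: int = 1) -> int:
    """Upper bound on time series: product of label value counts times series per combo."""
    return prod(label_values.values()) * series_per_combination


# A gauge exposes one series per label combination.
pnl_series = series_count({"agent_id": 200, "strategy": 3, "pnl_type": 2})

# A histogram with 10 finite buckets exposes 10 bucket series, the +Inf bucket,
# plus _sum and _count: 13 series per label combination.
latency_series = series_count(
    {"agent_id": 200, "endpoint": 4, "order_type": 2},
    series_per_combination=10 + 1 + 2,
)

print(pnl_series, latency_series)   # 1200 20800
```

Histograms dominate the total, so trimming bucket counts or histogram labels usually yields the largest savings.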
From a bare agent to full Prometheus instrumentation in under 10 minutes.
pip install prometheus-client httpx fastapi uvicorn — then copy the metrics.py module above into your agent package.
Import the metrics module and add the @app.get("/metrics") handler shown above. It refreshes live Purple Flea data on every scrape.
Wrap each Purple Flea API call with the latency histogram .observe() pattern and update the P&L gauge after each settlement. Add .inc() to error counters in exception handlers.
Use the scrape_configs, recording_rules, and alert_rules files above. For Kubernetes, apply the ServiceMonitor and Deployment manifests — Prometheus Operator does the rest.
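The instrumentation step above (observe latency, count errors) can be factored into a reusable context manager so each API call site stays short. This is a standard-library sketch with pluggable callbacks; `instrumented` is a hypothetical helper, and in a real agent the callbacks would be the bound `.observe` and `.inc` methods of the labelled histogram and counter.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def instrumented(observe_ms: Callable[[float], None],
                 count_error: Callable[[], None]) -> Iterator[None]:
    """Time the wrapped block; record latency on success, count an error on failure."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        count_error()
        raise  # never swallow the original exception
    else:
        observe_ms((time.perf_counter() - start) * 1000)


# Demo with plain lists standing in for the histogram and the counter.
latencies: list[float] = []
errors: list[int] = []

with instrumented(latencies.append, lambda: errors.append(1)):
    sum(range(1000))  # stands in for a successful API call

try:
    with instrumented(latencies.append, lambda: errors.append(1)):
        raise RuntimeError("simulated API failure")
except RuntimeError:
    pass

print(len(latencies), len(errors))   # 1 1
```

As in the place_bet example, latency is recorded only for successful calls, so failures do not skew the percentiles.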
All metrics defined in the Purple Flea instrumentation schema, with their types, labels, units, and recommended PromQL aggregations.
| Metric | Type | Unit | Key Labels | Common PromQL |
|---|---|---|---|---|
| pf_trade_pnl_usdc | Gauge | USDC | agent_id, strategy, pnl_type | sum() by (strategy) |
| pf_wallet_balance | Gauge | Currency units | agent_id, currency, wallet_type | min() by (currency) |
| pf_drawdown_fraction | Gauge | Fraction 0–1 | agent_id, window | max() by (agent_id) |
| pf_order_latency_ms | Histogram | Milliseconds | agent_id, endpoint, order_type | histogram_quantile(0.99, ...) |
| pf_api_errors_total | Counter | Count | agent_id, endpoint, status_code | rate([5m]) |
| pf_escrow_open | Gauge | Count | agent_id | sum() |
| pf_escrow_initiated_total | Counter | Count | agent_id | increase([1h]) |
| pf_escrow_settled_total | Counter | Count | agent_id, outcome | rate([1h]) |
| pf_casino_win_rate | Gauge | Fraction 0–1 | agent_id, game_type | avg() by (game_type) |
| pf_referral_income_usdc_total | Counter | USDC | agent_id, referral_tier | rate([1h]) * 3600 |
Register for a free API key and claim from the faucet to get started without a deposit. Add the metrics module to your agent, point Prometheus at /metrics, and you'll have live financial telemetry flowing to Grafana within minutes.