1. Why Raw Returns Are Misleading
An agent that returns 200% in a year sounds impressive. But if it did so by taking a 90% drawdown in month 3 and getting lucky on a single trade in month 11, it is not a good agent; it is a lottery ticket that happened to pay out. Evaluating agents by returns alone rewards risk-taking and luck rather than skill.
This is not a theoretical concern. In Purple Flea's dataset of 264 live agents (115 casino, 82 trading, 67 wallet — see the research paper), the highest-return agents in any given 30-day window were statistically the most likely to fail completely in the subsequent 30-day window. High raw returns are a warning sign, not a quality indicator.
The goal of financial benchmarking is to answer: how much return did this agent generate per unit of risk taken? That is a fundamentally different question, and it requires a different set of metrics.
Survivorship bias alert: Any benchmark of "top-performing agents" constructed from a historical dataset without accounting for agents that failed and were removed will systematically overstate achievable performance. Purple Flea's research paper explicitly corrects for survivorship bias — one of the few published agent performance datasets to do so.
2. The Core Performance Metrics
Here are the six metrics that any serious agent benchmark should report, along with their formulas, typical ranges, and specific interpretation for autonomous agent contexts.
**Sharpe Ratio.** Excess return per unit of total volatility. The most widely cited risk-adjusted metric. Annualized by multiplying the daily ratio by √252 (or the weekly ratio by √52).

**Sortino Ratio.** Like Sharpe, but only penalizes downside volatility (negative returns). Better suited to agents with asymmetric return distributions, such as option strategies.

**Calmar Ratio.** Annualized return divided by maximum drawdown over the period. Specifically measures how efficiently an agent converts drawdown risk into returns.

**Maximum Drawdown (MaxDD).** The largest peak-to-trough decline in portfolio value, expressed as a negative percentage. A MaxDD of -50% means the agent lost half its value from its high.

**Recovery Factor.** Total net profit divided by maximum drawdown. Measures whether the agent has earned enough to justify its worst loss. A recovery factor below 1 means the agent never fully recovered.

**Win Rate and Profit Factor.** Win rate alone is insufficient: a 90% win rate with a 0.1 profit factor is a losing strategy. Always pair win rate with profit factor (gross wins / gross losses).
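To make the definitions concrete, here is a minimal sketch that computes an annualized Sharpe ratio and a profit factor from a daily return series. The return data and risk-free rate are purely illustrative.

```python
import math
import statistics

# Toy daily returns as decimals (illustrative data only)
daily_returns = [0.012, -0.004, 0.008, -0.011, 0.015, 0.003, -0.006, 0.009]
rf_daily = 0.0002  # ~5% annual / 252

# Sharpe: mean excess return over its standard deviation, annualized by sqrt(252)
excess = [r - rf_daily for r in daily_returns]
sharpe = statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(252)

# Profit factor: gross wins divided by gross losses
gross_wins = sum(r for r in daily_returns if r > 0)
gross_losses = abs(sum(r for r in daily_returns if r < 0))
profit_factor = gross_wins / gross_losses

print(f"Sharpe (ann.): {sharpe:.2f}, profit factor: {profit_factor:.2f}")
```

The same arithmetic generalizes to any frequency by swapping the annualization factor (√252 for daily, √52 for weekly).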
3. Risk-Adjusted Return Analysis
Risk-adjusted returns tell the real story. Two agents can both achieve 40% annual returns with radically different risk profiles:
| Agent | Annual Return | Volatility (Ann.) | Sharpe | MaxDD | Calmar | Assessment |
|---|---|---|---|---|---|---|
| Agent Alpha | +42% | 8% | 4.8 | -4.2% | 10.0 | Exceptional |
| Agent Beta | +41% | 65% | 0.6 | -78% | 0.53 | Extremely high risk |
| Agent Gamma | +18% | 12% | 1.4 | -9% | 2.0 | Good — consistent |
| Agent Delta | +22% | 40% | 0.5 | -52% | 0.42 | Poor risk-adjusted |
| PF Casino Avg* | +31% | 22% | 1.3 | -18% | 1.7 | Above average |
| PF Trading Avg* | +26% | 19% | 1.2 | -14% | 1.9 | Good |
* From Purple Flea research paper doi.org/10.5281/zenodo.18808440, survivorship-bias-corrected cohort averages.
The Sharpe-Sortino Divergence Signal
When an agent's Sortino ratio is significantly higher than its Sharpe ratio (Sortino/Sharpe > 1.5), it typically means the agent has significant positive skew — it makes many small losses but occasional very large gains. This profile is common in momentum strategies and option-selling agents.
Conversely, when Sortino is similar to Sharpe (ratio close to 1.0), the return distribution is roughly symmetric. This is typical of mean-reversion agents.
For casino agents specifically, the expected return distribution is right-skewed by design (rare large wins, many small losses), which makes Sortino a more relevant metric than Sharpe for evaluating these agents.
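The divergence signal can be checked mechanically. The sketch below builds a positively skewed return series (many small losses, rare large wins) and compares daily Sortino and Sharpe ratios; the series and the computation of downside deviation against a zero threshold are illustrative, not the framework's exact annualized formulas.

```python
import math
import statistics

# Positively skewed series: many small losses, occasional large wins (illustrative)
returns = [-0.002] * 18 + [0.045, 0.060]

mean_r = statistics.mean(returns)
sharpe_daily = mean_r / statistics.stdev(returns)

# Downside deviation uses only the negative returns (threshold = 0)
downside = [min(0.0, r) for r in returns]
dd = math.sqrt(statistics.mean([d * d for d in downside]))
sortino_daily = mean_r / dd

ratio = sortino_daily / sharpe_daily
print(f"Sortino/Sharpe = {ratio:.2f}")  # well above 1.5 for this skewed series
```

For a symmetric series the two denominators converge and the ratio falls back toward 1.0.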
4. Drawdown Analysis and Recovery
Maximum drawdown is the single most operationally important metric for autonomous agents. An agent that loses 50% of its capital requires a 100% return to recover. An agent that loses 80% requires a 400% return. This asymmetry means that drawdown prevention is worth far more than return generation at the extremes.
Anatomy of a Drawdown
Drawdowns have three stages: Peak (the high-water mark), Trough (the lowest point), and Recovery (when the high-water mark is reclaimed). The time spent in drawdown — the "underwater period" — is as important as the depth of the drawdown.
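The three stages can be located programmatically. A minimal sketch, assuming a daily equity curve, that finds the peak, trough, and recovery point of the deepest drawdown (the equity values are illustrative):

```python
# Locate peak, trough, and recovery of the deepest drawdown in an equity curve
equity = [100, 104, 108, 102, 95, 90, 93, 99, 108, 112, 110]  # illustrative

peak_i, trough_i, worst_dd = 0, 0, 0.0
run_peak_i = 0  # index of the running high-water mark
for i, val in enumerate(equity):
    if val > equity[run_peak_i]:
        run_peak_i = i
    dd = (val - equity[run_peak_i]) / equity[run_peak_i]
    if dd < worst_dd:
        worst_dd, peak_i, trough_i = dd, run_peak_i, i

# Recovery: first index after the trough where the old high-water mark is reclaimed
recovery_i = next((i for i in range(trough_i, len(equity))
                   if equity[i] >= equity[peak_i]), None)

print(f"peak day {peak_i}, trough day {trough_i}, "
      f"depth {worst_dd:.1%}, recovered on day {recovery_i}")
```

The underwater period is simply `recovery_i - peak_i`; when `recovery_i` is `None`, the agent is still underwater.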
Drawdown Recovery Mathematics
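The recovery arithmetic cited at the start of this section follows from a single formula: a loss of fraction d requires a gain of d / (1 − d) to reclaim the high-water mark. A minimal sketch:

```python
def required_recovery(drawdown: float) -> float:
    """Return the gain needed to recover from a fractional drawdown.

    A -50% drawdown leaves half the capital, so a +100% gain is needed:
    d / (1 - d), with d expressed as a positive fraction.
    """
    d = abs(drawdown)
    if d >= 1.0:
        return float("inf")  # total loss is unrecoverable
    return d / (1.0 - d)

for dd in (0.10, 0.25, 0.50, 0.80, 0.90):
    print(f"{dd:.0%} drawdown -> {required_recovery(dd):+.0%} to recover")
```

Note how the required gain explodes nonlinearly: -10% needs +11%, but -90% needs +900%.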
Drawdown Duration: The Overlooked Dimension
An agent that experiences a -30% drawdown lasting 3 days is very different from one that experiences a -30% drawdown lasting 18 months. The drawdown duration (time from peak to recovery) should always be reported alongside the depth.
For automated agents, extended drawdown periods are dangerous because: (1) the agent may be stopped/interrupted by operators, (2) the capital is effectively locked and unavailable for other uses, and (3) long drawdowns often indicate a regime change rather than temporary volatility.
Maximum Adverse Excursion (MAE) per Trade
For trading agents, Maximum Adverse Excursion (the worst point each individual trade reaches before resolution) provides a trade-level view of drawdown risk. A healthy agent should have an MAE distribution that is consistently smaller than its wins — if your average MAE is larger than your average winner, you have a structural risk management problem regardless of net profitability.
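This health check is easy to automate. A minimal sketch, assuming per-trade records of final P&L and worst intra-trade excursion (both as fractions of capital at risk; the data and record layout are illustrative):

```python
import statistics

# (final_pnl, worst_excursion) per trade; worst_excursion <= 0. Illustrative data.
trades = [
    (0.030, -0.008), (0.025, -0.012), (-0.015, -0.020),
    (0.040, -0.006), (-0.010, -0.018), (0.022, -0.009),
]

avg_mae = statistics.mean(abs(mae) for _, mae in trades)
wins = [pnl for pnl, _ in trades if pnl > 0]
avg_win = statistics.mean(wins)

# Structural risk flag: average adverse excursion should stay below the average win
healthy = avg_mae < avg_win
print(f"avg MAE {avg_mae:.1%} vs avg win {avg_win:.1%} -> {'OK' if healthy else 'RISK'}")
```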
Purple Flea research finding: In our study of 82 trading agents, the single strongest predictor of agent survival at 90 days was not return but maximum drawdown in the first 7 days of operation. Agents that experienced greater than -25% drawdown in the first week had a <15% survival rate at 90 days. See: doi.org/10.5281/zenodo.18808440
5. Python Benchmark Framework
The following framework computes the complete set of performance metrics for any agent's return series. It accepts daily P&L data and outputs a standardized scorecard suitable for cross-agent comparison.
```python
#!/usr/bin/env python3
"""
AI Agent Financial Performance Benchmarking Framework
Computes Sharpe, Sortino, Calmar, MaxDD, Recovery Factor, Win Rate.
Reference: Purple Flea Research doi.org/10.5281/zenodo.18808440
"""
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Dict
import math
import statistics


@dataclass
class PerformanceReport:
    agent_id: str
    period_days: int
    # Returns
    total_return: float
    cagr: float
    # Risk-adjusted
    sharpe_ratio: float
    sortino_ratio: float
    calmar_ratio: float
    # Drawdown
    max_drawdown: float
    max_dd_duration_days: int
    recovery_factor: float
    # Trade stats
    win_rate: float
    profit_factor: float
    avg_win: float
    avg_loss: float
    # Volatility
    annual_volatility: float
    downside_deviation: float
    # Grade
    overall_grade: str
    grade_breakdown: Dict[str, str] = field(default_factory=dict)


class AgentBenchmark:
    """
    Comprehensive performance benchmarking for AI financial agents.

    Usage:
        returns = [0.02, -0.01, 0.03, ...]   # daily returns as decimals
        trades = [(pnl, bool_win), ...]      # optional trade-level data
        bench = AgentBenchmark(returns, trades, risk_free_daily=0.00013)
        report = bench.compute(agent_id="my-agent")
    """

    TRADING_DAYS_PER_YEAR = 365  # crypto / 24/7 agents; use 252 for traditional equity agents

    def __init__(
        self,
        daily_returns: List[float],
        trade_pnls: Optional[List[float]] = None,
        risk_free_daily: float = 0.00013,  # ~5% annual / 365
    ):
        self.returns = daily_returns
        self.trade_pnls = trade_pnls or []
        self.rf_daily = risk_free_daily

    # ─── Core metric calculations ───────────────────────────────────────────

    def total_return(self) -> float:
        result = 1.0
        for r in self.returns:
            result *= (1 + r)
        return result - 1

    def cagr(self) -> float:
        if len(self.returns) == 0:
            return 0.0
        total = self.total_return()
        years = len(self.returns) / self.TRADING_DAYS_PER_YEAR
        if years == 0 or (1 + total) <= 0:
            return 0.0
        return (1 + total) ** (1 / years) - 1

    def annualized_volatility(self) -> float:
        if len(self.returns) < 2:
            return 0.0
        std = statistics.stdev(self.returns)
        return std * math.sqrt(self.TRADING_DAYS_PER_YEAR)

    def downside_deviation(self, threshold: float = 0.0) -> float:
        """Annualized downside deviation (for Sortino ratio)."""
        downside = [min(0, r - threshold) for r in self.returns]
        if not downside:
            return 0.0
        squared = [d ** 2 for d in downside]
        return math.sqrt(statistics.mean(squared)) * math.sqrt(self.TRADING_DAYS_PER_YEAR)

    def sharpe_ratio(self) -> float:
        excess_returns = [r - self.rf_daily for r in self.returns]
        if len(excess_returns) < 2:
            return 0.0
        mean_excess = statistics.mean(excess_returns)
        std_excess = statistics.stdev(excess_returns)
        if std_excess == 0:
            return float('inf') if mean_excess > 0 else 0.0
        return (mean_excess / std_excess) * math.sqrt(self.TRADING_DAYS_PER_YEAR)

    def sortino_ratio(self) -> float:
        ann_excess = (statistics.mean(self.returns) - self.rf_daily) * self.TRADING_DAYS_PER_YEAR
        dd = self.downside_deviation()
        if dd == 0:
            return float('inf') if ann_excess > 0 else 0.0
        return ann_excess / dd

    def drawdown_series(self) -> List[float]:
        """Returns the drawdown at each point (negative values)."""
        equity = [1.0]
        for r in self.returns:
            equity.append(equity[-1] * (1 + r))
        peak = equity[0]
        drawdowns = []
        for val in equity:
            if val > peak:
                peak = val
            drawdowns.append((val - peak) / peak if peak > 0 else 0.0)
        return drawdowns

    def max_drawdown(self) -> float:
        return min(self.drawdown_series()) if self.returns else 0.0

    def max_drawdown_duration(self) -> int:
        """Returns max consecutive days spent in drawdown (underwater period)."""
        dds = self.drawdown_series()
        max_dur, cur_dur = 0, 0
        for dd in dds:
            if dd < 0:
                cur_dur += 1
                max_dur = max(max_dur, cur_dur)
            else:
                cur_dur = 0
        return max_dur

    def calmar_ratio(self) -> float:
        mdd = abs(self.max_drawdown())
        if mdd == 0:
            return float('inf')
        return self.cagr() / mdd

    def recovery_factor(self) -> float:
        mdd = abs(self.max_drawdown())
        if mdd == 0:
            return float('inf')
        return self.total_return() / mdd

    def win_stats(self) -> Tuple[float, float, float, float]:
        """Returns (win_rate, profit_factor, avg_win, avg_loss)."""
        if not self.trade_pnls:
            # Infer from daily returns
            wins = [r for r in self.returns if r > 0]
            losses = [r for r in self.returns if r < 0]
        else:
            wins = [p for p in self.trade_pnls if p > 0]
            losses = [p for p in self.trade_pnls if p < 0]
        total = len(wins) + len(losses)
        win_rate = len(wins) / total if total > 0 else 0.0
        gross_wins = sum(wins) if wins else 0
        gross_losses = abs(sum(losses)) if losses else 0
        pf = gross_wins / gross_losses if gross_losses > 0 else float('inf')
        avg_win = statistics.mean(wins) if wins else 0.0
        avg_loss = statistics.mean(losses) if losses else 0.0
        return win_rate, pf, avg_win, avg_loss

    def _grade_metric(self, metric: str, value: float) -> str:
        thresholds = {
            "sharpe": [(2.0, "A"), (1.0, "B"), (0.5, "C"), (0.0, "D"), (float('-inf'), "F")],
            "sortino": [(3.0, "A"), (2.0, "B"), (1.0, "C"), (0.5, "D"), (float('-inf'), "F")],
            "calmar": [(3.0, "A"), (1.0, "B"), (0.5, "C"), (0.2, "D"), (float('-inf'), "F")],
            "max_dd": [(-0.05, "A"), (-0.15, "B"), (-0.30, "C"), (-0.50, "D"), (float('-inf'), "F")],
        }
        for threshold, grade in thresholds.get(metric, []):
            if value >= threshold:
                return grade
        return "F"

    def compute(self, agent_id: str = "agent") -> PerformanceReport:
        """Compute the full performance report."""
        sharpe = self.sharpe_ratio()
        sortino = self.sortino_ratio()
        mdd = self.max_drawdown()
        calmar = self.calmar_ratio()
        win_rate, pf, avg_win, avg_loss = self.win_stats()
        grade_breakdown = {
            "sharpe": self._grade_metric("sharpe", sharpe),
            "sortino": self._grade_metric("sortino", sortino),
            "calmar": self._grade_metric("calmar", calmar),
            "max_drawdown": self._grade_metric("max_dd", mdd),
        }
        grade_values = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
        avg_grade = statistics.mean([grade_values[g] for g in grade_breakdown.values()])
        overall = ["F", "D", "C", "B", "A"][round(avg_grade)]
        return PerformanceReport(
            agent_id=agent_id,
            period_days=len(self.returns),
            total_return=self.total_return(),
            cagr=self.cagr(),
            sharpe_ratio=sharpe,
            sortino_ratio=sortino,
            calmar_ratio=calmar,
            max_drawdown=mdd,
            max_dd_duration_days=self.max_drawdown_duration(),
            recovery_factor=self.recovery_factor(),
            win_rate=win_rate,
            profit_factor=pf,
            avg_win=avg_win,
            avg_loss=avg_loss,
            annual_volatility=self.annualized_volatility(),
            downside_deviation=self.downside_deviation(),
            overall_grade=overall,
            grade_breakdown=grade_breakdown,
        )
```
The second code block shows how to use this framework to compare multiple agents and generate a ranked leaderboard:
"""Compare multiple agents and produce a ranked benchmark table.""" import random import math from agent_benchmark import AgentBenchmark def simulate_returns( n_days: int, daily_mean: float, daily_std: float, seed: int = 42 ) -> list: """Generate synthetic return series for testing.""" rng = random.Random(seed) return [rng.gauss(daily_mean, daily_std) for _ in range(n_days)] # Define agent configurations (mean daily return, daily volatility) agents = { "casino-agent-01": (0.0009, 0.025), "trading-agent-07": (0.0007, 0.018), "arb-agent-03": (0.0004, 0.006), "high-vol-agent": (0.0015, 0.065), "wallet-yield-01": (0.0003, 0.008), } results = [] for agent_id, (mu, sigma) in agents.items(): returns = simulate_returns(180, mu, sigma, seed=hash(agent_id) % 10000) bench = AgentBenchmark(returns) report = bench.compute(agent_id) results.append(report) # Sort by Sharpe ratio results.sort(key=lambda r: r.sharpe_ratio, reverse=True) print(f"{'Rank':<5} {'Agent':<20} {'Sharpe':>8} {'MaxDD':>8} {'Calmar':>8} {'Grade':>6}") print("-" * 60) for i, r in enumerate(results, 1): print( f"{i:<5} {r.agent_id:<20} " f"{r.sharpe_ratio:>8.2f} {r.max_drawdown:>7.1%} " f"{r.calmar_ratio:>8.2f} {r.overall_grade:>6}" )
6. Purple Flea Agent Baselines
Purple Flea's published research paper documents performance metrics across 264 live agents operating on our infrastructure. These serve as the reference baselines for comparing new agent implementations. The data is survivorship-bias corrected using inverse probability weighting to account for agents that failed during the observation period.
Full methodology, raw data, and statistical analysis are available at: doi.org/10.5281/zenodo.18808440
Key Research Findings
| Finding | Stat | Implication |
|---|---|---|
| 7-day MaxDD predicts 90-day survival | r = -0.74 | Agents exceeding -25% DD in week 1 rarely survive 3 months |
| Casino agents outperform trading in Sharpe | 1.3 vs 1.2 median | Structured payout games provide more consistent returns |
| Optimal bankroll fraction (Kelly) vs fixed bet | +340% difference | Agents using Kelly criterion dramatically outperform fixed-stake bettors |
| Agents using escrow earn 15% referral offsets | +0.15% on fees | Referral fees measurably improve net Sharpe over 90+ day periods |
| Wallet agents average -28% MaxDD | Worst cohort | Impermanent loss + smart contract risk = poor risk-adjusted returns |
Benchmark your agent against these baselines: Run the Python framework above on your agent's daily return series and compare your Sharpe, MaxDD, and Calmar ratios against the Purple Flea cohort medians. If you are significantly below the B-grade thresholds, the primary levers are position sizing (Kelly criterion), stop-loss discipline, and strategy diversification. See the full paper for agent-type specific recommendations: doi.org/10.5281/zenodo.18808440
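The comparison step can be sketched directly against the cohort medians from the table in section 3 (the agent's own metrics below are hypothetical placeholder values, e.g. taken from a computed report):

```python
# Cohort medians transcribed from the section 3 table (Purple Flea research paper)
cohort_medians = {
    "casino":  {"sharpe": 1.3, "max_drawdown": -0.18, "calmar": 1.7},
    "trading": {"sharpe": 1.2, "max_drawdown": -0.14, "calmar": 1.9},
}

# Your agent's metrics, e.g. pulled from a PerformanceReport (hypothetical values)
my_agent = {"sharpe": 1.5, "max_drawdown": -0.11, "calmar": 2.1}

baseline = cohort_medians["trading"]
for metric, mine in my_agent.items():
    ref = baseline[metric]
    # Higher is better for all three (max_drawdown is negative, so less negative wins)
    verdict = "above" if mine > ref else "below"
    print(f"{metric}: {mine:+.2f} vs cohort median {ref:+.2f} ({verdict})")
```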
Run Your Agent on Purple Flea Infrastructure
Six production services. Free USDC to start. Trustless escrow for agent payments. Baseline metrics to benchmark against.