Stress Testing Your AI Agent Before Going Live with Real Money
1. Why Stress Test? The Risks You're Not Thinking About
Most developers test their agents for the happy path: the API responds in 200ms, balances are what you expect, orders fill instantly. Production is nothing like this. The failure modes that destroy accounts are the ones nobody simulates:
Flash Crash
Price drops 40% in 90 seconds. Does your agent panic-sell into the worst possible price, or does it hold and recover?
Rate Limit Storm
100 concurrent 429 responses. Does your agent back off gracefully, or does it loop into an infinite retry that drains your quota?
Network Partition
Your agent thinks it placed an order but never got the confirmation. Is it going to place it again — doubling your exposure?
Position Stuck Open
A position opens but the close order never executes. Does your agent detect this, or does it run into margin limits an hour later?
Delayed Responses
API latency spikes to 8 seconds. Does your agent timeout correctly, or does it queue up stale orders that all fire when latency normalizes?
Precision Errors
Floating point arithmetic compounds over thousands of micro-trades. Your P&L report shows +$0.02 but the exchange shows -$1.17.
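This failure mode is easy to reproduce in isolation. A minimal sketch of the drift, using illustrative numbers rather than real trade data:

```python
from decimal import Decimal

# Accumulate a $0.001 P&L delta 100,000 times, as thousands of
# micro-trades would.
float_pnl = 0.0
decimal_pnl = Decimal("0")

for _ in range(100_000):
    float_pnl += 0.001               # binary float: 0.001 is not exactly representable
    decimal_pnl += Decimal("0.001")  # exact decimal arithmetic

print(float_pnl)    # slightly off from 100.0 due to accumulated rounding error
print(decimal_pnl)  # exactly 100.000
```

This is why every money value in the examples below uses `decimal.Decimal` (integer cents work too); reserve floats for non-monetary signals.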
The overwhelming majority of agent failures in production occur in scenarios that were never tested. The purpose of stress testing is to shrink that majority to near zero before real money is on the line.
2. Test Environment Setup: Paper Trading Mode
Purple Flea provides a full paper trading mode accessible to all registered agents. Paper trading uses real market data and real API responses but processes all positions against a simulated balance. No funds move.
To enable paper trading mode, set the mode parameter in your API calls:
```python
import httpx
import os

PURPLE_FLEA_CONFIG = {
    "base_url": "https://purpleflea.com/api/v1",
    "api_key": os.environ["PURPLE_FLEA_API_KEY"],  # pf_live_your_key_here
    "mode": "paper",  # toggle: "paper" | "live"
}

async def place_order(symbol: str, side: str, amount: float, mode: str = "paper"):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{PURPLE_FLEA_CONFIG['base_url']}/trade/order",
            headers={"Authorization": f"Bearer {PURPLE_FLEA_CONFIG['api_key']}"},
            json={
                "symbol": symbol,
                "side": side,
                "amount": amount,
                "mode": mode,  # "paper" — no real funds
            },
        )
        return response.json()

# All of this hits real endpooints with real latency
# but uses a simulated $10,000 paper balance
```
Your paper trading environment should mirror production as closely as possible. Use the same environment variables, the same logging pipeline, and the same deployment infrastructure. The only difference is the mode flag.
A common mistake is maintaining a "test agent" that runs different code than production. If you're not testing the exact production binary, you're not testing anything meaningful. Same code, different mode flag.
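A sketch of that pattern, assuming the mode is driven by a single environment variable (`PURPLE_FLEA_MODE`, the same variable name the CI configuration later in this guide uses):

```python
import os

# One codebase, one binary. Only the environment decides the mode.
MODE = os.environ.get("PURPLE_FLEA_MODE", "paper")  # default to the safe mode

if MODE not in ("paper", "live"):
    raise RuntimeError(f"Invalid PURPLE_FLEA_MODE: {MODE!r}")

# Every order then carries the same flag:
def order_payload(symbol: str, side: str, amount: float) -> dict:
    return {"symbol": symbol, "side": side, "amount": amount, "mode": MODE}
```

Defaulting to `"paper"` means a misconfigured deployment fails safe instead of trading real funds.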
3. Unit Tests for Trading Strategies
Unit tests validate your strategy logic in complete isolation — no network calls, no real API. Every decision function should be independently testable.
```python
import pytest
from decimal import Decimal

from agent.strategy import TradingStrategy, Signal

@pytest.mark.parametrize("price,ma_fast,ma_slow,expected", [
    (Decimal("100"), Decimal("102"), Decimal("98"), Signal.BUY),
    (Decimal("100"), Decimal("97"), Decimal("103"), Signal.SELL),
    (Decimal("100"), Decimal("100"), Decimal("100"), Signal.HOLD),
])
def test_crossover_signal(price, ma_fast, ma_slow, expected):
    strategy = TradingStrategy(ma_period_fast=9, ma_period_slow=21)
    signal = strategy.compute_signal(price=price, ma_fast=ma_fast, ma_slow=ma_slow)
    assert signal == expected

def test_position_size_respects_max_risk():
    """Agent must never risk more than 2% per trade"""
    strategy = TradingStrategy(max_risk_pct=Decimal("0.02"))
    balance = Decimal("5000")
    size = strategy.compute_position_size(balance=balance, price=Decimal("250"))
    max_allowed = balance * Decimal("0.02")
    assert size * Decimal("250") <= max_allowed

def test_stop_loss_triggers_correctly():
    strategy = TradingStrategy(stop_loss_pct=Decimal("0.05"))
    entry_price = Decimal("1000")
    current_price = Decimal("940")  # -6%, below 5% stop
    assert strategy.should_stop_loss(entry=entry_price, current=current_price) is True

def test_precision_on_large_trade_volume():
    """P&L must not drift after 10,000 micro-trades"""
    strategy = TradingStrategy()
    running_pnl = Decimal("0")
    for _ in range(10000):
        running_pnl += strategy.compute_pnl_delta(
            entry=Decimal("1.001"), exit=Decimal("1.002"), size=Decimal("0.1")
        )
    # Per trade: (1.002 - 1.001) * 0.1 = 0.0001
    # Should be exactly 0.0001 * 10000 = 1.0
    assert abs(running_pnl - Decimal("1.0")) < Decimal("0.000001")
```
Coverage requirements
Require 90%+ coverage on all strategy modules. Use a pytest-cov configuration that fails the build below threshold:
```toml
[tool.pytest.ini_options]
addopts = "--cov=agent --cov-report=term-missing --cov-fail-under=90"

[tool.coverage.run]
omit = ["tests/*", "agent/migrations/*"]
```
4. Integration Tests: Mock Purple Flea API Responses
Integration tests verify that your agent handles the full API response cycle correctly — including error codes, timeouts, malformed JSON, and edge-case payloads. Use respx (for httpx) or responses (for requests) to intercept HTTP calls.
```python
import pytest
import respx
import httpx

from agent.client import PurpleFleaClient, ServerError

@pytest.mark.asyncio
async def test_handles_429_rate_limit():
    """Agent must back off and retry on rate limit"""
    with respx.mock(base_url="https://purpleflea.com") as mock:
        # First two calls return 429, third succeeds
        mock.post("/api/v1/trade/order").mock(side_effect=[
            httpx.Response(429, json={"error": "rate_limited", "retry_after": 1}),
            httpx.Response(429, json={"error": "rate_limited", "retry_after": 1}),
            httpx.Response(200, json={"order_id": "abc123", "status": "filled"}),
        ])
        client = PurpleFleaClient(api_key="pf_live_test_key", max_retries=3)
        result = await client.place_order("BTC", "buy", 0.01)
        assert result["order_id"] == "abc123"
        assert mock.calls.call_count == 3

@pytest.mark.asyncio
async def test_handles_500_server_error():
    """Agent must not place duplicate orders on server errors"""
    with respx.mock(base_url="https://purpleflea.com") as mock:
        mock.post("/api/v1/trade/order").respond(500)
        client = PurpleFleaClient(api_key="pf_live_test_key")
        with pytest.raises(ServerError):
            await client.place_order("ETH", "sell", 1.0)
        # Must NOT retry on 500 — order state is unknown
        assert mock.calls.call_count == 1

@pytest.mark.asyncio
async def test_handles_malformed_json_response():
    """API returning garbage must not crash the agent"""
    with respx.mock(base_url="https://purpleflea.com") as mock:
        mock.get("/api/v1/wallet/balance").respond(
            200, content=b"<html>Maintenance</html>"
        )
        client = PurpleFleaClient(api_key="pf_live_test_key")
        balance = await client.get_balance()
        assert balance is None  # graceful degradation
```
5. Scenario Testing: Flash Crash, Rate Storms, Stuck Positions
Scenario tests simulate complete market situations. Unlike unit tests (which test a single function) or integration tests (which test API communication), scenario tests run your agent end-to-end against a mocked market environment.
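The `mock_market` fixture used in the scenarios below is not part of any library; one possible sketch, where the tick shape and the random-walk price model are assumptions:

```python
import random
from decimal import Decimal

class MockMarket:
    """Synthetic tick generator for scenario tests."""

    def __init__(self, start_price: Decimal = Decimal("50000")):
        self.price = start_price

    def normal_ticks(self, n: int):
        # Gentle random walk: +/-0.5% per tick
        for _ in range(n):
            drift = Decimal(str(random.uniform(-0.005, 0.005)))
            self.price *= Decimal("1") + drift
            yield {"symbol": "BTC", "price": self.price}

    def crash_tick(self, drop_pct: float) -> dict:
        # One tick that wipes drop_pct off the price
        self.price *= Decimal("1") - Decimal(str(drop_pct))
        return {"symbol": "BTC", "price": self.price}

# In conftest.py, expose it to the tests below:
#   @pytest.fixture
#   def mock_market():
#       return MockMarket()
```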
Scenario A: Flash Crash
```python
import pytest
from decimal import Decimal

from agent.trading import TradingAgent  # adjust to your package layout

@pytest.mark.asyncio
async def test_agent_survives_flash_crash(mock_market):
    """Agent must not lose more than 5% in a 40% flash crash"""
    starting_balance = Decimal("10000")
    agent = TradingAgent(balance=starting_balance)

    # Agent builds a position over 10 ticks
    for tick in mock_market.normal_ticks(10):
        await agent.on_tick(tick)

    # Flash crash: price drops 40% in one tick
    crash_tick = mock_market.crash_tick(drop_pct=0.40)
    await agent.on_tick(crash_tick)

    final_balance = agent.get_balance()
    drawdown = (starting_balance - final_balance) / starting_balance
    # Stop-loss must have fired; max acceptable loss = 5%
    assert drawdown < Decimal("0.05"), f"Excessive loss: {drawdown:.2%}"

@pytest.mark.asyncio
async def test_no_panic_selling_into_crash(mock_market):
    """Agent must not execute market orders during a crash spike"""
    agent = TradingAgent(use_limit_orders=True)
    crash_tick = mock_market.crash_tick(drop_pct=0.40)
    orders = await agent.on_tick(crash_tick)

    market_orders = [o for o in orders if o["type"] == "market"]
    assert len(market_orders) == 0, "Market orders during crash = guaranteed bad fill"
```
Scenario B: Rate Limit Storm
Simulate your agent receiving 100 consecutive 429 responses and verify it degrades gracefully rather than hammering the API:
```python
import pytest

from agent.trading import TradingAgent  # adjust to your package layout

@pytest.mark.asyncio
async def test_exponential_backoff_on_rate_limit(mock_api):
    mock_api.set_rate_limited(duration_seconds=30)
    agent = TradingAgent()

    await agent.run_for(seconds=30)

    total_requests = mock_api.request_count()
    # With exponential backoff, we should see fewer than 20 total attempts.
    # A naive retry loop would make 100s of requests.
    assert total_requests < 20, f"Too many requests during rate limit: {total_requests}"
```
Scenario C: Position Stuck Open
The close order fires but never gets confirmed. Your agent must detect this via periodic position reconciliation:
```python
import asyncio
import pytest
from decimal import Decimal

from agent.trading import TradingAgent  # adjust to your package layout

@pytest.mark.asyncio
async def test_detects_stuck_open_position(mock_api):
    # Order fires, but confirmation never arrives (timeout)
    mock_api.intercept_close_orders(action="drop")
    agent = TradingAgent(reconcile_interval_seconds=5)

    await agent.open_position("BTC", size=Decimal("0.1"))
    await agent.try_close_position("BTC")
    await asyncio.sleep(6)  # wait for reconciliation cycle

    alerts = agent.get_alerts()
    stuck_alerts = [a for a in alerts if a["type"] == "stuck_position"]
    assert len(stuck_alerts) > 0, "Agent must alert on stuck positions"
```
6. Load Testing: Can Your Agent Handle 100 req/min Sustained?
Purple Flea's default rate limit is 300 requests per minute per API key. Your agent should be designed to use no more than 60% of this capacity under normal operations, leaving headroom for bursts. Load test to confirm this holds under sustained market activity.
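The easiest way to guarantee that 60% budget (180 of the 300 requests per minute) is a client-side limiter. A minimal token-bucket sketch, not part of any Purple Flea SDK:

```python
import asyncio
import time

class TokenBucket:
    """Client-side rate limiter: refills rate_per_min tokens per minute."""

    def __init__(self, rate_per_min: float, burst: int = 10):
        self.rate = rate_per_min / 60.0   # tokens per second
        self.capacity = burst             # short bursts allowed up to this size
        self.tokens = float(burst)
        self.last = time.monotonic()

    async def acquire(self):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep until roughly one token is available
            await asyncio.sleep((1 - self.tokens) / self.rate)

# 60% of the 300 req/min limit:
limiter = TokenBucket(rate_per_min=180)
```

Call `await limiter.acquire()` before every outbound API request; bursts drain the bucket, sustained traffic settles at the configured rate.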
```python
import asyncio
import statistics
import time

from agent.client import PurpleFleaClient

async def load_test(target_rpm: int = 100, duration_seconds: int = 300):
    client = PurpleFleaClient(api_key="pf_live_test_key", mode="paper")
    latencies = []
    errors = 0
    start = time.time()
    interval = 60.0 / target_rpm  # seconds between requests

    while time.time() - start < duration_seconds:
        t0 = time.time()
        try:
            await client.get_balance()
            latencies.append((time.time() - t0) * 1000)
        except Exception:
            errors += 1
        await asyncio.sleep(interval)

    total = len(latencies) + errors
    print(f"Requests: {total}")
    print(f"Errors: {errors} ({errors / total:.1%})")
    print(f"p50 latency: {statistics.median(latencies):.0f}ms")
    print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.0f}ms")
    print(f"p99 latency: {statistics.quantiles(latencies, n=100)[98]:.0f}ms")

asyncio.run(load_test())
```
Always use mode: "paper" for load tests. Running sustained load against live endpoints both costs money and may trigger account-level throttling that affects your real trading.
7. Chaos Engineering: Random Failures and Delayed Responses
Chaos engineering systematically introduces failures into your agent's environment to verify it handles them gracefully. The principle: if you're going to fail in production, fail on purpose in testing first.
```python
import asyncio
import random
from typing import Any, Callable

from agent.client import ServerError  # the error type your client raises on 5xx

class ChaosProxy:
    """Wraps any async callable with configurable failure modes"""

    def __init__(self, fn: Callable, config: dict):
        self.fn = fn
        self.config = config

    async def __call__(self, *args, **kwargs) -> Any:
        # Random connection drop
        if random.random() < self.config.get("drop_rate", 0):
            raise ConnectionError("chaos: connection dropped")

        # Latency injection
        delay_ms = self.config.get("latency_ms", 0)
        if delay_ms > 0:
            jitter = random.uniform(0, delay_ms * 0.5)
            await asyncio.sleep((delay_ms + jitter) / 1000)

        # Occasional 500 errors
        if random.random() < self.config.get("error_rate", 0):
            raise ServerError("chaos: random server error")

        return await self.fn(*args, **kwargs)

# Usage in tests:
chaos_config = {
    "drop_rate": 0.05,   # 5% connection drops
    "latency_ms": 2000,  # 2s base latency
    "error_rate": 0.02,  # 2% server errors
}
client.place_order = ChaosProxy(client.place_order, chaos_config)
```
Chaos test scenarios to run
- DNS failure: Block resolution to purpleflea.com — agent must queue and retry
- Slow network: 5-second latency — timeouts must fire, not hang forever
- Packet loss: 20% of responses dropped — idempotency keys prevent duplicate orders
- Clock skew: System clock jumps forward 5 minutes — signed requests must not expire incorrectly
- Disk full: Log writes fail — agent must not crash because logging throws
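The idempotency-key defence from the packet-loss scenario can be sketched like this. Note that header support on Purple Flea's side is an assumption to verify against the API docs; if it is absent, dedupe on a client-generated order ID instead:

```python
import uuid

def build_order_request(symbol: str, side: str, amount: float):
    """Attach a client-generated idempotency key to an order.

    If the response is lost and the request is retried with the SAME key,
    a server that honors idempotency keys treats the retry as a duplicate
    instead of opening a second position.
    """
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    payload = {"symbol": symbol, "side": side, "amount": amount}
    return headers, payload

# Retries MUST reuse the headers from the first attempt:
headers, payload = build_order_request("BTC", "buy", 0.01)
# attempt 1 ... response lost ...
# attempt 2: send the same `headers`, never a fresh key
```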
8. Regression Testing: Don't Break What Works
Every time your agent's strategy evolves, you need to verify that old behaviors are preserved. Regression tests lock down known-good behavior against future changes.
```python
import json
from pathlib import Path

import pytest

from agent.trading import TradingAgent  # adjust to your package layout

# Load a recorded market session (real data from paper trading)
def load_session(name: str) -> list:
    path = Path(f"tests/fixtures/{name}.json")
    return json.loads(path.read_text())

@pytest.mark.asyncio
async def test_bull_run_2026_03_01():
    """Agent must not over-buy during the March 1 bull run"""
    session = load_session("bull_run_20260301")
    agent = TradingAgent()
    for tick in session:
        await agent.on_tick(tick)
    # Known result from first verified run: 2.3% gain
    assert 0.015 < agent.total_pnl_pct() < 0.035

@pytest.mark.asyncio
async def test_sideways_market_no_churn():
    """Agent must not over-trade in a flat market (churning fees)"""
    session = load_session("sideways_20260215")
    agent = TradingAgent()
    for tick in session:
        await agent.on_tick(tick)
    assert agent.trade_count() < 20, "Too many trades in flat market"
```
When your paper trading agent runs well, save the tick data as a fixture. That session becomes a regression test — any future code change that breaks it signals a potential regression before you deploy.
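Recording those fixtures can be as simple as dumping every tick the paper agent saw. A sketch of a writer that matches the `load_session()` helper's expected format (one JSON array per session; the directory layout is an assumption):

```python
import json
from pathlib import Path

def save_session(name: str, ticks: list, fixtures_dir: str = "tests/fixtures") -> Path:
    """Persist a paper-trading session as a regression fixture.

    Writes one JSON array of ticks, readable by the load_session()
    helper used in the regression tests.
    """
    path = Path(fixtures_dir) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(ticks, indent=2))
    return path

# After a good paper-trading day:
#   save_session("bull_run_20260301", recorded_ticks)
```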
9. Continuous Integration: Test on Every Code Change
Manual test runs are forgotten. Wire your test suite into a CI pipeline so every commit is automatically validated before it can reach production. Here is a complete GitHub Actions configuration:
```yaml
name: Agent Test Suite

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install -e ".[test]"
      - name: Unit tests
        run: pytest tests/unit -v --tb=short
      - name: Integration tests
        run: pytest tests/integration -v
        env:
          PURPLE_FLEA_API_KEY: ${{ secrets.PF_PAPER_API_KEY }}
          PURPLE_FLEA_MODE: paper
      - name: Regression tests
        run: pytest tests/regression -v
      - name: Coverage report
        run: pytest --cov=agent --cov-fail-under=90
      - name: Upload coverage
        uses: codecov/codecov-action@v4

  load-test:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Run 5-minute load test
        run: python load_test.py --duration=300 --rpm=60
        env:
          PURPLE_FLEA_API_KEY: ${{ secrets.PF_PAPER_API_KEY }}
```
Store your paper trading API key in GitHub Secrets as PF_PAPER_API_KEY. Never commit API keys directly to code. Your production key (prefixed pf_live_) should never appear in CI configuration.
10. Go/No-Go Checklist Before Deploying with Real Funds
Before flipping your agent from paper to live, every item on this checklist must be green. If any item is red, do not deploy.
- Unit test coverage above 90% — All strategy functions have passing unit tests, including edge cases and boundary conditions.
- All integration tests pass — 429, 500, timeout, and malformed-response scenarios all handled correctly.
- Paper trading for 7+ days — Agent has run in paper mode through at least one volatile market period. Results are within expected performance bounds.
- Load test passes at 2x normal throughput — Agent handles 200% of expected request volume without errors or degradation.
- Chaos test: all 5 scenarios pass — DNS failure, slow network, packet loss, clock skew, and disk full scenarios all handled gracefully.
- Regression tests pass on current branch — No regressions on any recorded historical session.
- Position reconciliation verified — Stuck-position detection tested and alert confirmed working.
- Max loss circuit breaker configured — Agent has a hard-coded maximum daily loss limit; exceeding it halts all trading and sends an alert.
- API key in environment variable — No API keys hardcoded. Key rotated from paper key to live key only at deployment time.
- Start with 10% of intended capital — Deploy with 10% of planned capital for the first 48 hours. Scale up only after confirmed stability.
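The circuit-breaker item in the checklist is a small amount of code. A sketch, assuming the agent tracks its balance in Decimal (the 3% default limit is illustrative):

```python
from decimal import Decimal

class DailyLossCircuitBreaker:
    """Halts all trading once the daily drawdown passes a hard limit."""

    def __init__(self, starting_balance: Decimal,
                 max_daily_loss_pct: Decimal = Decimal("0.03")):
        # Balance floor below which all trading halts (default: -3% per day)
        self.floor = starting_balance * (Decimal("1") - max_daily_loss_pct)
        self.tripped = False

    def check(self, current_balance: Decimal) -> bool:
        """Return True if trading may continue; latches once tripped."""
        if current_balance <= self.floor:
            self.tripped = True  # also send an alert and cancel open orders here
        return not self.tripped

breaker = DailyLossCircuitBreaker(starting_balance=Decimal("10000"))
# Call breaker.check(balance) before every order. Reset only manually,
# by a human, at the start of the next trading day.
```

The latch matters: a breaker that un-trips on the next price bounce defeats its purpose.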
The most common reason agents fail in production is deploying early and planning to test "once it's live." There is no such thing as a safe way to test with real money. The checklist above exists to prevent expensive lessons.
Ready to test your agent on Purple Flea?
Register for a paper trading account and start stress testing with real market data and zero risk. Claim $1 USDC from the faucet when you're ready to go live.