Stress Testing Your AI Agent Before Going Live with Real Money
1. Why Stress Test? The Risks You're Not Thinking About
Most developers test their agents for the happy path: the API responds in 200ms, balances are what you expect, orders fill instantly. Production is nothing like this. The failure modes that destroy accounts are the ones nobody simulates:
Flash Crash
Price drops 40% in 90 seconds. Does your agent panic-sell into the worst possible price, or does it hold and recover?
Rate Limit Storm
100 concurrent 429 responses. Does your agent back off gracefully, or does it loop into an infinite retry that drains your quota?
Network Partition
Your agent thinks it placed an order but never got the confirmation. Is it going to place it again — doubling your exposure?
Position Stuck Open
A position opens but the close order never executes. Does your agent detect this, or does it run into margin limits an hour later?
Delayed Responses
API latency spikes to 8 seconds. Does your agent timeout correctly, or does it queue up stale orders that all fire when latency normalizes?
Precision Errors
Floating point arithmetic compounds over thousands of micro-trades. Your P&L report shows +$0.02 but the exchange shows -$1.17.
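This failure mode is easy to reproduce in isolation. A minimal sketch of the drift, using illustrative numbers rather than real trade data:

```python
from decimal import Decimal

# Accumulate a $0.001 P&L delta 100,000 times, as thousands of
# micro-trades would.
float_pnl = 0.0
decimal_pnl = Decimal("0")

for _ in range(100_000):
    float_pnl += 0.001               # binary float: 0.001 is not exactly representable
    decimal_pnl += Decimal("0.001")  # exact decimal arithmetic

print(float_pnl)    # slightly off from 100.0 due to accumulated rounding error
print(decimal_pnl)  # exactly 100.000
```

This is why every money value in the examples below uses `decimal.Decimal` (integer cents work too); reserve floats for non-monetary signals.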
The overwhelming majority of agent failures in production occur in scenarios that were never tested. The purpose of stress testing is to shrink that majority to near zero before real money is on the line.
2. Test Environment Setup: Paper Trading Mode
Purple Flea provides a full paper trading mode accessible to all registered agents. Paper trading uses real market data and real API responses but processes all positions against a simulated balance. No funds move.
To enable paper trading mode, set the mode parameter in your API calls:
```python
import httpx
import os

PURPLE_FLEA_CONFIG = {
    "base_url": "https://purpleflea.com/api/v1",
    "api_key": os.environ["PURPLE_FLEA_API_KEY"],  # pf_live_your_key_here
    "mode": "paper",  # toggle: "paper" | "live"
}

async def place_order(symbol: str, side: str, amount: float, mode: str = "paper"):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{PURPLE_FLEA_CONFIG['base_url']}/trade/order",
            headers={"Authorization": f"Bearer {PURPLE_FLEA_CONFIG['api_key']}"},
            json={
                "symbol": symbol,
                "side": side,
                "amount": amount,
                "mode": mode,  # "paper" — no real funds
            },
        )
        return response.json()

# All of this hits real endpooints with real latency
# but uses a simulated $10,000 paper balance
```
Your paper trading environment should mirror production as closely as possible. Use the same environment variables, the same logging pipeline, and the same deployment infrastructure. The only difference is the mode flag.
A common mistake is maintaining a "test agent" that runs different code than production. If you're not testing the exact production binary, you're not testing anything meaningful. Same code, different mode flag.
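A sketch of that pattern, assuming the mode is driven by a single environment variable (`PURPLE_FLEA_MODE`, the same variable name the CI configuration later in this guide uses):

```python
import os

# One codebase, one binary. Only the environment decides the mode.
MODE = os.environ.get("PURPLE_FLEA_MODE", "paper")  # default to the safe mode

if MODE not in ("paper", "live"):
    raise RuntimeError(f"Invalid PURPLE_FLEA_MODE: {MODE!r}")

# Every order then carries the same flag:
def order_payload(symbol: str, side: str, amount: float) -> dict:
    return {"symbol": symbol, "side": side, "amount": amount, "mode": MODE}
```

Defaulting to `"paper"` means a misconfigured deployment fails safe instead of trading real funds.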
3. Unit Tests for Trading Strategies
Unit tests validate your strategy logic in complete isolation — no network calls, no real API. Every decision function should be independently testable.
```python
import pytest
from decimal import Decimal

from agent.strategy import TradingStrategy, Signal

@pytest.mark.parametrize("price,ma_fast,ma_slow,expected", [
    (Decimal("100"), Decimal("102"), Decimal("98"), Signal.BUY),
    (Decimal("100"), Decimal("97"), Decimal("103"), Signal.SELL),
    (Decimal("100"), Decimal("100"), Decimal("100"), Signal.HOLD),
])
def test_crossover_signal(price, ma_fast, ma_slow, expected):
    strategy = TradingStrategy(ma_period_fast=9, ma_period_slow=21)
    signal = strategy.compute_signal(price=price, ma_fast=ma_fast, ma_slow=ma_slow)
    assert signal == expected

def test_position_size_respects_max_risk():
    """Agent must never risk more than 2% per trade"""
    strategy = TradingStrategy(max_risk_pct=Decimal("0.02"))
    balance = Decimal("5000")
    size = strategy.compute_position_size(balance=balance, price=Decimal("250"))
    max_allowed = balance * Decimal("0.02")
    assert size * Decimal("250") <= max_allowed

def test_stop_loss_triggers_correctly():
    strategy = TradingStrategy(stop_loss_pct=Decimal("0.05"))
    entry_price = Decimal("1000")
    current_price = Decimal("940")  # -6%, below 5% stop
    assert strategy.should_stop_loss(entry=entry_price, current=current_price) is True

def test_precision_on_large_trade_volume():
    """P&L must not drift after 10,000 micro-trades"""
    strategy = TradingStrategy()
    running_pnl = Decimal("0")
    for _ in range(10000):
        running_pnl += strategy.compute_pnl_delta(
            entry=Decimal("1.001"), exit=Decimal("1.002"), size=Decimal("0.1")
        )
    # Per trade: (1.002 - 1.001) * 0.1 = 0.0001
    # Should be exactly 0.0001 * 10000 = 1.0
    assert abs(running_pnl - Decimal("1.0")) < Decimal("0.000001")
```
Coverage requirements
Require 90%+ coverage on all strategy modules. Use a pytest-cov configuration that fails the build below threshold:
```toml
[tool.pytest.ini_options]
addopts = "--cov=agent --cov-report=term-missing --cov-fail-under=90"

[tool.coverage.run]
omit = ["tests/*", "agent/migrations/*"]
```
4. Integration Tests: Mock Purple Flea API Responses
Integration tests verify that your agent handles the full API response cycle correctly — including error codes, timeouts, malformed JSON, and edge-case payloads. Use respx (for httpx) or responses (for requests) to intercept HTTP calls.
```python
import pytest
import respx
import httpx

from agent.client import PurpleFleaClient, ServerError

@pytest.mark.asyncio
async def test_handles_429_rate_limit():
    """Agent must back off and retry on rate limit"""
    with respx.mock(base_url="https://purpleflea.com") as mock:
        # First two calls return 429, third succeeds
        mock.post("/api/v1/trade/order").mock(side_effect=[
            httpx.Response(429, json={"error": "rate_limited", "retry_after": 1}),
            httpx.Response(429, json={"error": "rate_limited", "retry_after": 1}),
            httpx.Response(200, json={"order_id": "abc123", "status": "filled"}),
        ])
        client = PurpleFleaClient(api_key="pf_live_test_key", max_retries=3)
        result = await client.place_order("BTC", "buy", 0.01)
        assert result["order_id"] == "abc123"
        assert mock.calls.call_count == 3

@pytest.mark.asyncio
async def test_handles_500_server_error():
    """Agent must not place duplicate orders on server errors"""
    with respx.mock(base_url="https://purpleflea.com") as mock:
        mock.post("/api/v1/trade/order").respond(500)
        client = PurpleFleaClient(api_key="pf_live_test_key")
        with pytest.raises(ServerError):
            await client.place_order("ETH", "sell", 1.0)
        # Must NOT retry on 500 — order state is unknown
        assert mock.calls.call_count == 1

@pytest.mark.asyncio
async def test_handles_malformed_json_response():
    """API returning garbage must not crash the agent"""
    with respx.mock(base_url="https://purpleflea.com") as mock:
        mock.get("/api/v1/wallet/balance").respond(
            200, content=b"<html>Maintenance</html>"
        )
        client = PurpleFleaClient(api_key="pf_live_test_key")
        balance = await client.get_balance()
        assert balance is None  # graceful degradation
```
5. Scenario Testing: Flash Crash, Rate Storms, Stuck Positions
Scenario tests simulate complete market situations. Unlike unit tests (which test a single function) or integration tests (which test API communication), scenario tests run your agent end-to-end against a mocked market environment.
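The `mock_market` fixture used in the scenarios below is not part of any library; one possible sketch, where the tick shape and the random-walk price model are assumptions:

```python
import random
from decimal import Decimal

class MockMarket:
    """Synthetic tick generator for scenario tests."""

    def __init__(self, start_price: Decimal = Decimal("50000")):
        self.price = start_price

    def normal_ticks(self, n: int):
        # Gentle random walk: +/-0.5% per tick
        for _ in range(n):
            drift = Decimal(str(random.uniform(-0.005, 0.005)))
            self.price *= Decimal("1") + drift
            yield {"symbol": "BTC", "price": self.price}

    def crash_tick(self, drop_pct: float) -> dict:
        # One tick that wipes drop_pct off the price
        self.price *= Decimal("1") - Decimal(str(drop_pct))
        return {"symbol": "BTC", "price": self.price}

# In conftest.py, expose it to the tests below:
#   @pytest.fixture
#   def mock_market():
#       return MockMarket()
```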
Scenario A: Flash Crash
```python
import pytest
from decimal import Decimal

from agent.trading import TradingAgent  # adjust to your package layout

@pytest.mark.asyncio
async def test_agent_survives_flash_crash(mock_market):
    """Agent must not lose more than 5% in a 40% flash crash"""
    starting_balance = Decimal("10000")
    agent = TradingAgent(balance=starting_balance)

    # Agent builds a position over 10 ticks
    for tick in mock_market.normal_ticks(10):
        await agent.on_tick(tick)

    # Flash crash: price drops 40% in one tick
    crash_tick = mock_market.crash_tick(drop_pct=0.40)
    await agent.on_tick(crash_tick)

    final_balance = agent.get_balance()
    drawdown = (starting_balance - final_balance) / starting_balance
    # Stop-loss must have fired; max acceptable loss = 5%
    assert drawdown < Decimal("0.05"), f"Excessive loss: {drawdown:.2%}"

@pytest.mark.asyncio
async def test_no_panic_selling_into_crash(mock_market):
    """Agent must not execute market orders during a crash spike"""
    agent = TradingAgent(use_limit_orders=True)
    crash_tick = mock_market.crash_tick(drop_pct=0.40)
    orders = await agent.on_tick(crash_tick)

    market_orders = [o for o in orders if o["type"] == "market"]
    assert len(market_orders) == 0, "Market orders during crash = guaranteed bad fill"
```
Scenario B: Rate Limit Storm
Simulate your agent receiving 100 consecutive 429 responses and verify it degrades gracefully rather than hammering the API:
```python
import pytest

from agent.trading import TradingAgent  # adjust to your package layout

@pytest.mark.asyncio
async def test_exponential_backoff_on_rate_limit(mock_api):
    mock_api.set_rate_limited(duration_seconds=30)
    agent = TradingAgent()

    await agent.run_for(seconds=30)

    total_requests = mock_api.request_count()
    # With exponential backoff, we should see fewer than 20 total attempts.
    # A naive retry loop would make 100s of requests.
    assert total_requests < 20, f"Too many requests during rate limit: {total_requests}"
```
Scenario C: Position Stuck Open
The close order fires but never gets confirmed. Your agent must detect this via periodic position reconciliation:
```python
import asyncio
import pytest
from decimal import Decimal

from agent.trading import TradingAgent  # adjust to your package layout

@pytest.mark.asyncio
async def test_detects_stuck_open_position(mock_api):
    # Order fires, but confirmation never arrives (timeout)
    mock_api.intercept_close_orders(action="drop")
    agent = TradingAgent(reconcile_interval_seconds=5)

    await agent.open_position("BTC", size=Decimal("0.1"))
    await agent.try_close_position("BTC")
    await asyncio.sleep(6)  # wait for reconciliation cycle

    alerts = agent.get_alerts()
    stuck_alerts = [a for a in alerts if a["type"] == "stuck_position"]
    assert len(stuck_alerts) > 0, "Agent must alert on stuck positions"
```
6. Load Testing: Can Your Agent Handle 100 req/min Sustained?
Purple Flea's default rate limit is 300 requests per minute per API key. Your agent should be designed to use no more than 60% of this capacity under normal operations, leaving headroom for bursts. Load test to confirm this holds under sustained market activity.
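The easiest way to guarantee that 60% budget (180 of the 300 requests per minute) is a client-side limiter. A minimal token-bucket sketch, not part of any Purple Flea SDK:

```python
import asyncio
import time

class TokenBucket:
    """Client-side rate limiter: refills rate_per_min tokens per minute."""

    def __init__(self, rate_per_min: float, burst: int = 10):
        self.rate = rate_per_min / 60.0   # tokens per second
        self.capacity = burst             # short bursts allowed up to this size
        self.tokens = float(burst)
        self.last = time.monotonic()

    async def acquire(self):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep until roughly one token is available
            await asyncio.sleep((1 - self.tokens) / self.rate)

# 60% of the 300 req/min limit:
limiter = TokenBucket(rate_per_min=180)
```

Call `await limiter.acquire()` before every outbound API request; bursts drain the bucket, sustained traffic settles at the configured rate.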
```python
import asyncio
import statistics
import time

from agent.client import PurpleFleaClient

async def load_test(target_rpm: int = 100, duration_seconds: int = 300):
    client = PurpleFleaClient(api_key="pf_live_test_key", mode="paper")
    latencies = []
    errors = 0
    start = time.time()
    interval = 60.0 / target_rpm  # seconds between requests

    while time.time() - start < duration_seconds:
        t0 = time.time()
        try:
            await client.get_balance()
            latencies.append((time.time() - t0) * 1000)
        except Exception:
            errors += 1
        await asyncio.sleep(interval)

    total = len(latencies) + errors
    print(f"Requests: {total}")
    print(f"Errors: {errors} ({errors / total:.1%})")
    print(f"p50 latency: {statistics.median(latencies):.0f}ms")
    print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.0f}ms")
    print(f"p99 latency: {statistics.quantiles(latencies, n=100)[98]:.0f}ms")

asyncio.run(load_test())
```
Always use mode: "paper" for load tests. Running sustained load against live endpoints both costs money and may trigger account-level throttling that affects your real trading.
7. Chaos Engineering: Random Failures and Delayed Responses
Chaos engineering systematically introduces failures into your agent's environment to verify it handles them gracefully. The principle: if you're going to fail in production, fail on purpose in testing first.
```python
import asyncio
import random
from typing import Any, Callable

from agent.client import ServerError  # the error type your client raises on 5xx

class ChaosProxy:
    """Wraps any async callable with configurable failure modes"""

    def __init__(self, fn: Callable, config: dict):
        self.fn = fn
        self.config = config

    async def __call__(self, *args, **kwargs) -> Any:
        # Random connection drop
        if random.random() < self.config.get("drop_rate", 0):
            raise ConnectionError("chaos: connection dropped")

        # Latency injection
        delay_ms = self.config.get("latency_ms", 0)
        if delay_ms > 0:
            jitter = random.uniform(0, delay_ms * 0.5)
            await asyncio.sleep((delay_ms + jitter) / 1000)

        # Occasional 500 errors
        if random.random() < self.config.get("error_rate", 0):
            raise ServerError("chaos: random server error")

        return await self.fn(*args, **kwargs)

# Usage in tests:
chaos_config = {
    "drop_rate": 0.05,   # 5% connection drops
    "latency_ms": 2000,  # 2s base latency
    "error_rate": 0.02,  # 2% server errors
}
client.place_order = ChaosProxy(client.place_order, chaos_config)
```
Chaos test scenarios to run
- DNS failure: Block resolution to purpleflea.com — agent must queue and retry
- Slow network: 5-second latency — timeouts must fire, not hang forever
- Packet loss: 20% of responses dropped — idempotency keys prevent duplicate orders
- Clock skew: System clock jumps forward 5 minutes — signed requests must not expire incorrectly
- Disk full: Log writes fail — agent must not crash because logging throws
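The idempotency-key defence from the packet-loss scenario can be sketched like this. Note that header support on Purple Flea's side is an assumption to verify against the API docs; if it is absent, dedupe on a client-generated order ID instead:

```python
import uuid

def build_order_request(symbol: str, side: str, amount: float):
    """Attach a client-generated idempotency key to an order.

    If the response is lost and the request is retried with the SAME key,
    a server that honors idempotency keys treats the retry as a duplicate
    instead of opening a second position.
    """
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    payload = {"symbol": symbol, "side": side, "amount": amount}
    return headers, payload

# Retries MUST reuse the headers from the first attempt:
headers, payload = build_order_request("BTC", "buy", 0.01)
# attempt 1 ... response lost ...
# attempt 2: send the same `headers`, never a fresh key
```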
8. Regression Testing: Don't Break What Works
Every time your agent's strategy evolves, you need to verify that old behaviors are preserved. Regression tests lock down known-good behavior against future changes.
```python
import json
from pathlib import Path

import pytest

from agent.trading import TradingAgent  # adjust to your package layout

# Load a recorded market session (real data from paper trading)
def load_session(name: str) -> list:
    path = Path(f"tests/fixtures/{name}.json")
    return json.loads(path.read_text())

@pytest.mark.asyncio
async def test_bull_run_2026_03_01():
    """Agent must not over-buy during the March 1 bull run"""
    session = load_session("bull_run_20260301")
    agent = TradingAgent()
    for tick in session:
        await agent.on_tick(tick)
    # Known result from first verified run: 2.3% gain
    assert 0.015 < agent.total_pnl_pct() < 0.035

@pytest.mark.asyncio
async def test_sideways_market_no_churn():
    """Agent must not over-trade in a flat market (churning fees)"""
    session = load_session("sideways_20260215")
    agent = TradingAgent()
    for tick in session:
        await agent.on_tick(tick)
    assert agent.trade_count() < 20, "Too many trades in flat market"
```
When your paper trading agent runs well, save the tick data as a fixture. That session becomes a regression test — any future code change that breaks it signals a potential regression before you deploy.
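Recording those fixtures can be as simple as dumping every tick the paper agent saw. A sketch of a writer that matches the `load_session()` helper's expected format (one JSON array per session; the directory layout is an assumption):

```python
import json
from pathlib import Path

def save_session(name: str, ticks: list, fixtures_dir: str = "tests/fixtures") -> Path:
    """Persist a paper-trading session as a regression fixture.

    Writes one JSON array of ticks, readable by the load_session()
    helper used in the regression tests.
    """
    path = Path(fixtures_dir) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(ticks, indent=2))
    return path

# After a good paper-trading day:
#   save_session("bull_run_20260301", recorded_ticks)
```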
9. Continuous Integration: Test on Every Code Change
Manual test runs are forgotten. Wire your test suite into a CI pipeline so every commit is automatically validated before it can reach production. Here is a complete GitHub Actions configuration:
```yaml
name: Agent Test Suite

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install -e ".[test]"
      - name: Unit tests
        run: pytest tests/unit -v --tb=short
      - name: Integration tests
        run: pytest tests/integration -v
        env:
          PURPLE_FLEA_API_KEY: ${{ secrets.PF_PAPER_API_KEY }}
          PURPLE_FLEA_MODE: paper
      - name: Regression tests
        run: pytest tests/regression -v
      - name: Coverage report
        run: pytest --cov=agent --cov-fail-under=90
      - name: Upload coverage
        uses: codecov/codecov-action@v4

  load-test:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Run 5-minute load test
        run: python load_test.py --duration=300 --rpm=60
        env:
          PURPLE_FLEA_API_KEY: ${{ secrets.PF_PAPER_API_KEY }}
```
Store your paper trading API key in GitHub Secrets as PF_PAPER_API_KEY. Never commit API keys directly to code. Your production key (prefixed pf_live_) should never appear in CI configuration.
10. Go/No-Go Checklist Before Deploying with Real Funds
Before flipping your agent from paper to live, every item on this checklist must be green. If any item is red, do not deploy.
- Unit test coverage above 90% — All strategy functions have passing unit tests, including edge cases and boundary conditions.
- All integration tests pass — 429, 500, timeout, and malformed-response scenarios all handled correctly.
- Paper trading for 7+ days — Agent has run in paper mode through at least one volatile market period. Results are within expected performance bounds.
- Load test passes at 2x normal throughput — Agent handles 200% of expected request volume without errors or degradation.
- Chaos test: all 5 scenarios pass — DNS failure, slow network, packet loss, clock skew, and disk full scenarios all handled gracefully.
- Regression tests pass on current branch — No regressions on any recorded historical session.
- Position reconciliation verified — Stuck-position detection tested and alert confirmed working.
- Max loss circuit breaker configured — Agent has a hard-coded maximum daily loss limit; exceeding it halts all trading and sends an alert.
- API key in environment variable — No API keys hardcoded. Key rotated from paper key to live key only at deployment time.
- Start with 10% of intended capital — Deploy with 10% of planned capital for the first 48 hours. Scale up only after confirmed stability.
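The circuit-breaker item in the checklist is a small amount of code. A sketch, assuming the agent tracks its balance in Decimal (the 3% default limit is illustrative):

```python
from decimal import Decimal

class DailyLossCircuitBreaker:
    """Halts all trading once the daily drawdown passes a hard limit."""

    def __init__(self, starting_balance: Decimal,
                 max_daily_loss_pct: Decimal = Decimal("0.03")):
        # Balance floor below which all trading halts (default: -3% per day)
        self.floor = starting_balance * (Decimal("1") - max_daily_loss_pct)
        self.tripped = False

    def check(self, current_balance: Decimal) -> bool:
        """Return True if trading may continue; latches once tripped."""
        if current_balance <= self.floor:
            self.tripped = True  # also send an alert and cancel open orders here
        return not self.tripped

breaker = DailyLossCircuitBreaker(starting_balance=Decimal("10000"))
# Call breaker.check(balance) before every order. Reset only manually,
# by a human, at the start of the next trading day.
```

The latch matters: a breaker that un-trips on the next price bounce defeats its purpose.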
The most common reason agents fail in production is deploying early and planning to test "once it's live." There is no such thing as a safe way to test with real money. The checklist above exists to prevent expensive lessons.
Ready to test your agent on Purple Flea?
Register for a paper trading account and start stress testing with real market data and zero risk. Claim $1 USDC from the faucet when you're ready to go live.