
Agent Upgrade Patterns: Deploying New Agent Versions Without Losing Money

March 6, 2026 · Purple Flea Team · 15 min read

Upgrading a software service is a well-understood engineering problem. Upgrading an AI agent that is actively managing money — open positions, pending escrows, live casino sessions, domain registrations — is a different problem entirely. The cost of downtime isn't measured in user experience; it's measured in missed opportunities, stale state, and in the worst case, double-spending or abandoned funds.

This post covers the deployment patterns that production agent teams have converged on: blue-green, canary, shadow mode, state migration, and rollback. Each pattern trades complexity for risk reduction differently. Choose based on your agent's risk profile.

The Upgrade Risk Triangle

Three risks compound during agent upgrades: (1) Downtime risk — funds stagnate or miss time-sensitive operations. (2) Bug risk — new version has a defect that loses money. (3) State risk — position or balance state is corrupted during transition. Good upgrade patterns minimize all three simultaneously.

1. The Upgrade Problem

A typical software deployment has one constraint: don't serve errors during the transition. An agent deployment managing financial infrastructure has additional constraints that most DevOps literature ignores entirely:

Traditional Service Upgrade:
  v1 running → deploy v2 → v2 running
  (2 min downtime, usually acceptable)

Agent Upgrade with Live Funds:
  v1 managing $500 in positions
  │
  ├── Has open dice game in progress
  ├── Has 3 pending escrows it is monitoring
  ├── Has a crash game it needs to cash out in 30s
  └── Is mid-way through a momentum trading decision

  Naive restart loses all of this context.

The solution is never to "restart" — it's to transition. The patterns below are all variations on the same theme: keep v1 running in a safe state while v2 proves itself, then transfer control gracefully.
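One way to picture "transition, not restart" is a drain step: v1 stops accepting new work but finishes whatever is already in flight before it hands over. The sketch below is a minimal illustration under that assumption. `DrainableAgent` and its method names are hypothetical and not part of any Purple Flea API.

```python
class DrainableAgent:
    """Tracks in-flight actions so a handoff can wait for them to settle."""

    def __init__(self):
        self.draining = False
        self.in_flight = set()

    def start_action(self, action_id: str) -> bool:
        # Refuse new work once draining has begun
        if self.draining:
            return False
        self.in_flight.add(action_id)
        return True

    def finish_action(self, action_id: str) -> None:
        self.in_flight.discard(action_id)

    def begin_drain(self) -> None:
        self.draining = True

    def safe_to_hand_over(self) -> bool:
        # Hand control to v2 only when nothing is mid-execution
        return self.draining and not self.in_flight


agent = DrainableAgent()
agent.start_action("crash-cashout-1")   # v1 is mid-way through a cash-out
agent.begin_drain()                     # operator starts the upgrade
print(agent.start_action("new-bet"))    # new work is refused while draining
print(agent.safe_to_hand_over())        # still waiting on the cash-out
agent.finish_action("crash-cashout-1")
print(agent.safe_to_hand_over())        # nothing in flight, v2 can take over
```

A real agent would also need a timeout on the drain, since a stuck action would otherwise block the upgrade indefinitely.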

2. Blue-Green Deployment

Blue-green runs two identical environments — blue (current) and green (new). At any given time, only one is active (receiving real decisions and executing transactions). The other is warm and ready. Switching from blue to green is instantaneous: a single environment variable or config flag change directs the agent's decision loop to the green instance.

BLUE (v1, active)          GREEN (v2, standby)
┌─────────────────┐        ┌─────────────────┐
│ Running         │        │ Running         │
│ Connected to PF │        │ Connected to PF │
│ Watching state  │        │ Reading state   │
│ EXECUTING bets  │        │ NOT executing   │
└────────┬────────┘        └────────┬────────┘
         │                          │
         └────── shared state ──────┘
              (Redis / DB / file)
                      │
          SWITCH: set ACTIVE=green
                      │
┌─────────────────┐        ┌─────────────────┐
│ Now standby     │        │ Now ACTIVE      │
│ Still running   │        │ EXECUTING bets  │
│ Ready to revert │        │ Full control    │
└─────────────────┘        └─────────────────┘
Python — Blue-green controller
import time

import redis

class AgentController:
    def __init__(self, version: str):
        self.version = version
        self.r = redis.Redis(host="localhost")
        self.active_key = "agent:active_version"

    def is_active(self) -> bool:
        active = self.r.get(self.active_key)
        return active is not None and active.decode() == self.version

    def run_loop(self):
        while True:
            if not self.is_active():
                # Standby: read state, do not execute
                self.sync_state()
                time.sleep(1)
                continue
            # Active: read state AND execute decisions
            state = self.read_state()
            decisions = self.decide(state)
            self.execute(decisions)
            self.write_state(state)

# Switch from blue to green (run on operator machine)
def switch_to_green():
    r = redis.Redis(host="localhost")
    r.set("agent:active_version", "green")
    print("Switched active to green")
Blue-Green Advantage

Rollback is instant — just set ACTIVE=blue again. No restart needed. Both versions are warm so the switch is millisecond-level. The main cost: you're running two instances, using 2x compute. Acceptable for any financial agent where the cost of a bug exceeds the cost of extra compute.
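The switch introduces a subtle race: an instance can pass its active check, then lose active status before it finishes executing a batch of decisions. The sketch below re-checks the flag immediately before each execution. It uses an in-memory stand-in for the shared store so it runs on its own. `FakeStore` and `guarded_execute` are illustrative names, and this narrows the race window rather than closing it.

```python
class FakeStore:
    """In-memory stand-in for the shared key-value store (assumption for the demo)."""

    def __init__(self, active: str):
        self.data = {"agent:active_version": active}

    def get(self, key: str):
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        self.data[key] = value


def guarded_execute(store, version: str, decisions: list, execute) -> int:
    """Execute decisions one at a time, stopping as soon as control moves away."""
    executed = 0
    for decision in decisions:
        if store.get("agent:active_version") != version:
            break  # another version took over; leave the rest to it
        execute(decision)
        executed += 1
    return executed


store = FakeStore(active="blue")
log = []

def execute_and_switch(decision):
    log.append(decision)
    if decision == "bet-2":
        # Simulate the operator flipping to green mid-batch
        store.set("agent:active_version", "green")

count = guarded_execute(store, "blue", ["bet-1", "bet-2", "bet-3"], execute_and_switch)
print(count, log)  # blue stops after bet-2; bet-3 is left for green
```

Fully closing the gap would likely require the check and the execution to be tied together, for example through an idempotency key on each transaction.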

3. Canary Releases: 5% of Capital First

Canary releases split traffic (or in agent terms, capital allocation) between old and new versions. Rather than switching 100% of activity to v2, you route a small fraction — say, 5% of your betting budget or escrow volume — to v2 while v1 handles the rest. If v2 performs as expected for N rounds without errors, you incrementally increase its allocation.

Week 0: v1 = 100% of capital, v2 = 0%
Week 1: v1 = 95%, v2 = 5%   (canary phase)
Week 2: v1 = 80%, v2 = 20%  (expanding)
Week 3: v1 = 50%, v2 = 50%  (equal split)
Week 4: v1 = 0%,  v2 = 100% (complete)

Abort at any phase if v2 performance diverges from v1.
Python — Canary allocation router
import random

class CanaryRouter:
    def __init__(self, v2_fraction: float = 0.05):
        self.v2_fraction = v2_fraction  # 0.0 to 1.0
        self.v1_metrics = []
        self.v2_metrics = []

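    def route(self, bet_amount: float) -> tuple[str, float]:
        # Returns (version, bet). Each bet goes whole to one version: choosing
        # v2 with probability v2_fraction already sends that share of capital
        # to v2 on average, so the amount must not be scaled a second time.
        if random.random() < self.v2_fraction:
            return "v2", bet_amount
        return "v1", bet_amount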

    def record(self, version: str, outcome: float):
        (self.v2_metrics if version == "v2" else self.v1_metrics).append(outcome)

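    def should_promote(self, min_samples: int = 100, tolerance: float = 0.05) -> bool:
        # Need enough v2 samples and at least one v1 sample to compare
        if len(self.v2_metrics) < min_samples or not self.v1_metrics:
            return False
        v1_mean = sum(self.v1_metrics) / len(self.v1_metrics)
        v2_mean = sum(self.v2_metrics) / len(self.v2_metrics)
        # Promote if v2 is within tolerance of v1 or better. The margin uses
        # abs() so the check still holds when v1's mean PnL is negative.
        return v2_mean >= v1_mean - tolerance * abs(v1_mean)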

router = CanaryRouter(v2_fraction=0.05)
version, bet = router.route(bet_amount=10.0)
print(f"Route to {version}, bet ${bet:.2f}")
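The weekly schedule above can be expressed as a small step function that either advances v2's allocation or aborts the canary. This is a sketch only: `SCHEDULE` and `next_fraction` are hypothetical names, and the `healthy` flag stands in for whatever promotion check you use.

```python
# Canary steps taken from the schedule above: 5% -> 20% -> 50% -> 100%
SCHEDULE = [0.05, 0.20, 0.50, 1.00]


def next_fraction(current: float, healthy: bool) -> float:
    """Return the next v2 allocation, or 0.0 to abort the canary."""
    if not healthy:
        return 0.0  # abort: send all capital back to v1
    for step in SCHEDULE:
        if step > current:
            return step
    return 1.0  # already fully promoted


fraction = 0.0
for healthy in [True, True, True, True]:
    fraction = next_fraction(fraction, healthy)
    print(f"v2 allocation: {fraction:.0%}")
```

Keeping the steps in a list makes the schedule easy to review and change without touching the routing code.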

4. Shadow Mode: Watch Before You Act

Shadow mode is the safest testing pattern: the new agent version runs alongside the old, receives the same inputs, computes decisions — but never executes them. All v2 actions are logged as "shadow actions." You can compare what v2 would have done to what v1 actually did, without any real-money risk.

Live Input: Market data, casino odds, escrow events
v1 (Active): Decides and executes
+
v2 (Shadow): Decides but does NOT execute
Compare: Log divergences for review
Python — Shadow mode wrapper
import time

class ShadowAgent:
    def __init__(self, live_agent, shadow_agent):
        self.live = live_agent
        self.shadow = shadow_agent
        self.divergences = []

    def decide_and_execute(self, market_state: dict):
        # Live agent: decide and execute
        live_decision = self.live.decide(market_state)
        self.live.execute(live_decision)

        # Shadow agent: decide only — NO execute
        shadow_decision = self.shadow.decide(market_state)

        # Log if shadow disagrees with live
        if shadow_decision != live_decision:
            self.divergences.append({
                "state": market_state,
                "live": live_decision,
                "shadow": shadow_decision,
                "timestamp": time.time(),
            })
            print(f"Divergence: live={live_decision} shadow={shadow_decision}")

        return live_decision

    def divergence_summary(self) -> str:
        # Reports how often shadow disagreed with live; it does not measure win rate
        total = len(self.divergences)
        return f"Shadow diverged {total} times"

Shadow mode is particularly valuable for strategy changes: if v2 uses a different crash cash-out target or a different dice range selection, you can observe how that would have performed over hundreds of real rounds before committing capital.
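To make the crash cash-out comparison concrete, the sketch below replays a list of observed crash multipliers against two targets. It assumes a simplified payoff model: a round that reaches the target pays `bet * (target - 1)`, and any other round loses the bet, with no fees. The function names and the sample multipliers are invented for illustration.

```python
def crash_pnl(bet: float, target: float, crash_at: float) -> float:
    """PnL of one round under the simplified model described above."""
    return bet * (target - 1) if crash_at >= target else -bet


def replay(bet: float, target: float, crash_points: list) -> float:
    """Total PnL a fixed cash-out target would have earned over observed rounds."""
    return sum(crash_pnl(bet, target, c) for c in crash_points)


# Made-up crash multipliers standing in for real observed rounds
observed = [1.2, 2.5, 3.1, 1.9, 2.2]

live_total = replay(1.0, 2.0, observed)    # what v1 (target 2.0x) earned
shadow_total = replay(1.0, 3.0, observed)  # what v2 (target 3.0x) would have earned
print(f"live: {live_total:+.2f}  shadow: {shadow_total:+.2f}")
```

Five rounds prove nothing on their own. The point is that shadow decisions can be scored against real outcomes once enough rounds have accumulated.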

5. State Migration: Transferring Position and Balance State

Agent state is more complex than database rows. It includes: open casino positions, pending escrow IDs being monitored, cached API responses, rate-limit counters, decision context windows, and any ML model state. Every piece must transfer cleanly or the new agent starts blind — making decisions without context that v1 had accumulated.

State migration checklist:

Casino state:
  [ ] Current balance (API call to verify, don't cache)
  [ ] Any open/pending bets (crash game in progress)
  [ ] Session ID if applicable

Escrow state:
  [ ] List of active escrow IDs being monitored
  [ ] Expected completion times for each
  [ ] Arbitrator agent IDs for arbitrated escrows

Trading state:
  [ ] Open positions and entry prices
  [ ] Current strategy parameters
  [ ] Recent signal history (avoid double-signaling)

Operational state:
  [ ] API rate limit counters (avoid 429s on startup)
  [ ] Last successful action timestamps
  [ ] Decision log (for idempotency checks)
Python — State snapshot and restore
import json, time
from pathlib import Path

class AgentState:
    def snapshot(self) -> dict:
        return {
            "version": "1.0",
            "timestamp": time.time(),
            "escrow_ids": self.active_escrows,
            "open_positions": self.positions,
            "last_action_time": self.last_action,
            "rate_limit_tokens": self.rate_tokens,
            "decision_log": self.recent_decisions[-50:],  # last 50
        }

    def save(self, path: str = "/tmp/agent_state.json"):
        snap = self.snapshot()
        Path(path).write_text(json.dumps(snap, indent=2))
        print(f"State saved: {len(snap['escrow_ids'])} escrows, {len(snap['open_positions'])} positions")

    @classmethod
    def restore(cls, path: str = "/tmp/agent_state.json"):
        data = json.loads(Path(path).read_text())
        age = time.time() - data["timestamp"]
        if age > 300:  # stale after 5 minutes
            raise ValueError(f"State is {age:.0f}s old — too stale to restore safely")
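        state = cls()
        state.active_escrows = data["escrow_ids"]
        state.positions = data["open_positions"]
        state.recent_decisions = data["decision_log"]
        # Restore the operational fields that snapshot() saved as well
        state.last_action = data["last_action_time"]
        state.rate_tokens = data["rate_limit_tokens"]
        return state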
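A snapshot that is half-written when the process dies is worse than no snapshot. One common safeguard is to write to a temporary file in the same directory and rename it into place, so a reader sees either the old file or the new one. The sketch below is a standalone helper under that assumption. `save_atomic` is a hypothetical name, not part of the class above.

```python
import json
import os
import tempfile


def save_atomic(snapshot: dict, path: str) -> None:
    """Write the snapshot to a temp file beside `path`, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(snapshot, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        # os.replace is atomic on POSIX when source and target share a filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "agent_state.json")
    save_atomic({"escrow_ids": ["e1", "e2"], "open_positions": []}, target)
    with open(target) as f:
        print(json.load(f))
```

Creating the temp file in the same directory matters: a rename across filesystems is not atomic.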

6. Rollback Strategy: When and How to Roll Back

Rollback is not failure; it's the correct response to a bug discovered in production. The critical question is not "how to roll back" (that's the easy part in blue-green) but "when to roll back." Clear automatic triggers prevent the losses that come from hesitation.

Trigger Condition                           | Severity | Action                     | Timeframe
--------------------------------------------|----------|----------------------------|----------
Any unhandled exception in execution path   | Critical | Immediate rollback         | <1s
API auth failure (bad key)                  | Critical | Immediate rollback         | <1s
Double-spend detected in decision log       | Critical | Immediate rollback + alert | <1s
PnL deviation >3 std devs from v1 baseline  | High     | Rollback after 10 rounds   | <60s
Decision latency >2x v1 median              | Medium   | Alert, rollback if persists | <300s
Escrow monitoring gap detected              | Medium   | Rollback, audit escrows    | <60s
PnL deviation <1 std dev from v1            | Low      | Continue, log              |
Python — Automatic rollback guard
import redis

class RollbackGuard:
    def __init__(self, baseline_pnl_per_round: float, std_dev: float):
        self.baseline = baseline_pnl_per_round
        self.std = std_dev
        self.rounds = []
        self.rollback_triggered = False

    def check(self, round_pnl: float) -> bool:
        self.rounds.append(round_pnl)
        if len(self.rounds) < 10:
            return False  # need minimum samples
        recent_mean = sum(self.rounds[-10:]) / 10
        deviation = abs(recent_mean - self.baseline) / self.std
        if deviation > 3.0:
            self.rollback_triggered = True
            print(f"ROLLBACK TRIGGERED: {deviation:.1f} std devs from baseline")
            return True
        return False

    def execute_rollback(self):
        # In blue-green: just switch active back to v1
        r = redis.Redis(host="localhost")
        r.set("agent:active_version", "blue")
        print("Rolled back to blue (v1)")
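The latency row of the trigger table ("alert, rollback if persists") needs a little state: the guard has to remember when the breach started. The sketch below is one possible reading, assuming the 300s figure is the persistence window. `LatencyGuard` is a hypothetical name, and timestamps are passed in explicitly so the logic is easy to test.

```python
import statistics


class LatencyGuard:
    """Alerts when latency exceeds 2x the v1 median; rolls back if the breach persists."""

    def __init__(self, v1_latencies: list, persist_s: float = 300.0):
        self.threshold = 2 * statistics.median(v1_latencies)
        self.persist_s = persist_s
        self.breach_since = None  # timestamp of the first latency in the current breach

    def check(self, latency: float, now: float) -> str:
        if latency <= self.threshold:
            self.breach_since = None  # breach over, reset the clock
            return "ok"
        if self.breach_since is None:
            self.breach_since = now
        if now - self.breach_since >= self.persist_s:
            return "rollback"
        return "alert"


guard = LatencyGuard(v1_latencies=[0.1, 0.2, 0.3])  # median 0.2s, threshold 0.4s
for latency, now in [(0.3, 0.0), (0.5, 10.0), (0.6, 200.0), (0.5, 310.0)]:
    print(now, guard.check(latency, now))
```

A single fast decision resets the clock here. A stricter guard might require several consecutive healthy readings before it clears a breach.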

7. Feature Flags: Enable New Strategies Without Redeployment

Feature flags let you enable or disable specific behaviors in a running agent without restarting or deploying a new version. For agents, flags commonly control: which game to play, what bet sizing formula to use, whether to accept new escrow requests, and which trading signals to act on.

Flags stored in Redis or a simple key-value store can be toggled by an operator in real time. The agent checks flags on each decision cycle. This decouples deployment (new binary) from activation (new behavior).

Python — Feature flag manager
import redis

DEFAULTS = {
    "game":              "coinflip",
    "bet_sizing":        "kelly_floor",
    "crash_target":     2.0,
    "accept_escrows":   True,
    "max_session_bets": 50,
    "stop_loss_pct":    0.20,
}

class Flags:
    def __init__(self):
        self.r = redis.Redis(host="localhost")

    def get(self, key: str):
        val = self.r.get(f"flag:{key}")
        if val is None:
            return DEFAULTS.get(key)
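        # Deserialize (bool, int, float, str)
        decoded = val.decode()
        if decoded in ("True", "False"): return decoded == "True"
        try: return int(decoded)  # keeps counts like max_session_bets integral
        except ValueError: pass
        try: return float(decoded)
        except ValueError: return decoded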

    def set(self, key: str, value):
        self.r.set(f"flag:{key}", str(value))

# Usage in decision loop
flags = Flags()
game = flags.get("game")           # "coinflip" or "dice" or "crash"
target = flags.get("crash_target") # 2.0 default
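String-based flag storage forces the reader to guess types on the way out. An alternative is to store each flag as JSON, which round-trips booleans, integers, floats, and strings without guessing. The sketch below uses a plain dict in place of Redis so it runs standalone. `JsonFlags` is a hypothetical name, and the defaults are a subset of those shown above.

```python
import json


class JsonFlags:
    """Feature flags stored as JSON strings so types survive a round trip."""

    def __init__(self, store: dict, defaults: dict):
        self.store = store        # stand-in for the key-value store
        self.defaults = defaults

    def set(self, key: str, value) -> None:
        self.store[f"flag:{key}"] = json.dumps(value)

    def get(self, key: str):
        raw = self.store.get(f"flag:{key}")
        if raw is None:
            return self.defaults.get(key)
        return json.loads(raw)


json_flags = JsonFlags(store={}, defaults={"crash_target": 2.0, "max_session_bets": 50})
json_flags.set("max_session_bets", 25)
json_flags.set("accept_escrows", False)
print(json_flags.get("max_session_bets"))  # stays an int
print(json_flags.get("accept_escrows"))    # stays a bool
print(json_flags.get("crash_target"))      # falls back to the default
```

The trade-off is that an operator toggling flags by hand must write valid JSON, for example `true` rather than `True`.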

8. A/B Testing Agent Strategies in Production

A/B testing applies the canary pattern to strategy rather than version. Run two strategy variants simultaneously, split capital evenly, and measure which performs better over a statistically significant number of rounds. Unlike canary releases (which compare v1 vs v2), A/B testing compares two hypotheses within the same version.

Strategy A: Crash at 2.0x (conservative)
Strategy B: Crash at 3.0x (aggressive)
Capital split: 50% to A, 50% to B
Run for: 1,000 rounds each

Results after 1,000 rounds:
  A: mean PnL -$1.20 / session (expected ~-$1.50 at 3%)
  B: mean PnL -$2.10 / session (worse than expected)

Conclusion: A performs better in current market conditions
Action: Set crash_target flag to 2.0x, retire B
Python — A/B test tracker
import random
import statistics

class ABTest:
    def __init__(self, strategy_a: dict, strategy_b: dict):
        self.a = strategy_a
        self.b = strategy_b
        self.results_a = []
        self.results_b = []

    def assign(self) -> str:
        return "a" if random.random() < 0.5 else "b"

    def record(self, variant: str, pnl: float):
        (self.results_a if variant == "a" else self.results_b).append(pnl)

    def summary(self):
        if not self.results_a or not self.results_b:
            return "insufficient data"
        mean_a = statistics.mean(self.results_a)
        mean_b = statistics.mean(self.results_b)
        winner = "A" if mean_a > mean_b else "B"
        return f"A: {mean_a:+.4f} | B: {mean_b:+.4f} | Winner: {winner}"
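Comparing raw means says nothing about whether the difference is statistically meaningful. One standard check is Welch's t statistic, which scales the difference in means by the combined standard error and does not assume equal variances. The sketch below is a minimal version using only the standard library. `welch_t` is an illustrative helper, and the sample data is made up.

```python
import math
import statistics


def welch_t(a: list, b: list) -> float:
    """Welch's t statistic: difference in means divided by the combined standard error."""
    var_a = statistics.variance(a)  # sample variance, needs at least 2 points
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Made-up per-session PnL samples for two variants
pnl_a = [1.0, 2.0, 3.0, 4.0, 5.0]
pnl_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t = welch_t(pnl_a, pnl_b)
print(f"t = {t:.2f}")
```

As a rough guide for large samples, an absolute value of t above about 2 suggests the difference is unlikely to be noise at the usual 5% level. With small samples the threshold is higher, and a proper test would compute a p-value from the t distribution.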

9. Versioning Your Agent's Decision Logic

Every significant change to an agent's decision logic should be versioned and logged. This isn't just for rollback; it's for auditability. When you need to explain why the agent made a specific bet or released an escrow at a specific time, you need to know exactly which version of the decision logic was running.

Python — Decision versioning
import json
import time

AGENT_VERSION = "2.4.1"
DECISION_LOG_PATH = "/var/log/agent/decisions.jsonl"

def log_decision(
    decision_type: str,
    inputs: dict,
    output: dict,
    reasoning: str = "",
):
    entry = {
        "agent_version": AGENT_VERSION,
        "timestamp": time.time(),
        "decision_type": decision_type,
        "inputs": inputs,
        "output": output,
        "reasoning": reasoning,
    }
    with open(DECISION_LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Usage
log_decision(
    decision_type="crash_cashout",
    inputs={"current_multiplier": 1.87, "target": 2.0, "bet": 5.0},
    output={"action": "hold"},
    reasoning="multiplier below target, EV still positive to hold",
)
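The audit question "which version was running at time T?" can be answered by scanning the JSONL log for the latest entry at or before T. The sketch below assumes each line carries the `agent_version` and `timestamp` fields written above. `version_at` is a hypothetical helper, and the sample lines are invented.

```python
import json


def version_at(lines, ts: float):
    """Return the agent_version of the latest logged decision at or before ts, or None."""
    best = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        if entry["timestamp"] <= ts and (best is None or entry["timestamp"] >= best["timestamp"]):
            best = entry
    return None if best is None else best["agent_version"]


# Invented log lines in the same shape as the entries written above
sample_log = [
    json.dumps({"agent_version": "2.4.0", "timestamp": 100.0, "decision_type": "bet"}),
    json.dumps({"agent_version": "2.4.1", "timestamp": 200.0, "decision_type": "crash_cashout"}),
]
print(version_at(sample_log, 150.0))
print(version_at(sample_log, 250.0))
```

In practice the same function can be fed an open file handle, since iterating a file yields one line at a time.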

10. Deployment Checklist for Agent Upgrades

Before deploying any new agent version to production, verify every item on this checklist. The checklist is opinionated toward Purple Flea infrastructure but applies to any agent managing financial state.

Pre-deployment

During deployment (blue-green)

Post-deployment (30 minutes)

The One Rule

Never deploy a new agent version during a high-volatility market event, an ongoing escrow dispute, or within 30 minutes of a scheduled auto-release. Timing upgrades during quiet periods eliminates the most common class of state migration failures.
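The rule above is simple enough to enforce in a deploy script. The sketch below is one way to encode it, assuming the caller already knows whether a dispute is open, whether the market is volatile, and when the nearest auto-release is scheduled. `safe_to_deploy` and its parameters are hypothetical, and "within 30 minutes" is read as applying on either side of the release.

```python
from typing import Optional

QUIET_BUFFER_S = 30 * 60  # 30 minutes, from the rule above


def safe_to_deploy(
    now: float,
    nearest_auto_release: Optional[float],
    dispute_open: bool,
    high_volatility: bool,
    buffer_s: float = QUIET_BUFFER_S,
) -> bool:
    """Return True only when none of the three blocking conditions holds."""
    if high_volatility or dispute_open:
        return False
    if nearest_auto_release is not None and abs(nearest_auto_release - now) < buffer_s:
        return False  # too close to a scheduled auto-release
    return True


print(safe_to_deploy(now=0.0, nearest_auto_release=3600.0, dispute_open=False, high_volatility=False))
print(safe_to_deploy(now=0.0, nearest_auto_release=1000.0, dispute_open=False, high_volatility=False))
```

How "high volatility" is detected is left to the caller. The function only combines the three conditions.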


Summary

Safely deploying AI agents that manage money requires treating upgrades as state transitions, not restarts.

Test Your Upgrade Pattern

Purple Flea's faucet gives new agents $1 USDC to test with — perfect for validating upgrade patterns before deploying agents with real capital. Start at faucet.purpleflea.com.

Further Reading