Agent Upgrade Patterns: Deploying New Agent Versions Without Losing Money

Upgrading a software service is a well-understood engineering problem. Upgrading an AI agent that is actively managing money — open positions, pending escrows, live casino sessions, domain registrations — is a different problem entirely. The cost of downtime isn't measured in user experience; it's measured in missed opportunities, stale state, and in the worst case, double-spending or abandoned funds.

This post covers the deployment patterns that production agent teams have converged on: blue-green, canary, shadow mode, state migration, and rollback. Each pattern trades complexity for risk reduction differently. Choose based on your agent's risk profile.

The Upgrade Risk Triangle

Three risks compound during agent upgrades: (1) Downtime risk — funds stagnate or miss time-sensitive operations. (2) Bug risk — new version has a defect that loses money. (3) State risk — position or balance state is corrupted during transition. Good upgrade patterns minimize all three simultaneously.

1. The Upgrade Problem

A typical software deployment has one constraint: don't serve errors during the transition. An agent deployment managing financial infrastructure has additional constraints that most DevOps literature ignores entirely:

Idempotency: if the agent crashes mid-operation, restarting must not re-execute a transaction that already executed.
State continuity: the new version must have access to all open positions, pending decisions, and cached state from the old version.
API key handoff: Purple Flea API keys, casino session state, and escrow IDs must transfer cleanly — not be regenerated.
Decision continuity: if the old version had decided to exit a position at 2x, the new version should honor that decision or explicitly override it with a logged reason.
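The idempotency constraint above can be sketched with a small ledger keyed by operation. This is a minimal illustration under assumptions: `IdempotentExecutor`, the key format, and the status strings are hypothetical names, and the ledger is an in-memory dict where a real agent would use a durable store.

```python
from typing import Callable, Dict, Optional


class IdempotentExecutor:
    """Runs each operation at most once per idempotency key (hypothetical helper)."""

    def __init__(self, ledger: Optional[Dict[str, str]] = None):
        # ledger maps key -> "pending" | "done"; pass the persisted ledger in on restart
        self.ledger = {} if ledger is None else ledger

    def run(self, key: str, operation: Callable[[], object]) -> str:
        status = self.ledger.get(key)
        if status == "done":
            return "skipped"            # already executed, never repeat it
        if status == "pending":
            return "needs_reconcile"    # crashed mid-operation: check the API before retrying
        self.ledger[key] = "pending"    # record intent BEFORE touching money
        operation()
        self.ledger[key] = "done"
        return "executed"


calls = []
executor = IdempotentExecutor()
first = executor.run("bet:round-42", lambda: calls.append("bet"))
second = executor.run("bet:round-42", lambda: calls.append("bet"))
```

The "pending" state is the important part: after a crash the agent cannot know whether the transaction went through, so the sketch refuses to guess and asks for reconciliation instead of re-executing.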

Traditional Service Upgrade:
  v1 running → deploy v2 → v2 running
  (2 min downtime, usually acceptable)

Agent Upgrade with Live Funds:
  v1 managing $500 in positions
  │
  ├── Has open dice game in progress
  ├── Has 3 pending escrows it is monitoring
  ├── Has a crash game it needs to cash out in 30s
  └── Is mid-way through a momentum trading decision

  Naive restart loses all of this context.

The solution is never to "restart" — it's to transition. The patterns below are all variations on the same theme: keep v1 running in a safe state while v2 proves itself, then transfer control gracefully.

2. Blue-Green Deployment

Blue-green runs two identical environments — blue (current) and green (new). At any given time, only one is active (receiving real decisions and executing transactions). The other is warm and ready. Switching from blue to green is instantaneous: a single environment variable or config flag change directs the agent's decision loop to the green instance.

 BLUE (v1, active)            GREEN (v2, standby)
┌─────────────────┐          ┌─────────────────┐
│ Running         │          │ Running         │
│ Connected to PF │          │ Connected to PF │
│ Watching state  │          │ Reading state   │
│ EXECUTING bets  │          │ NOT executing   │
└────────┬────────┘          └────────┬────────┘
         │                            │
         └─────── shared state ───────┘
              (Redis / DB / file)
                        │
            SWITCH: set ACTIVE=green
                        │
┌─────────────────┐          ┌─────────────────┐
│ Now standby     │          │ Now ACTIVE      │
│ Still running   │          │ EXECUTING bets  │
│ Ready to revert │          │ Full control    │
└─────────────────┘          └─────────────────┘

Python — Blue-green controller

import time

import redis

class AgentController:
    def __init__(self, version: str):
        self.version = version
        self.r = redis.Redis(host="localhost")
        self.active_key = "agent:active_version"

    def is_active(self) -> bool:
        active = self.r.get(self.active_key)
        return active is not None and active.decode() == self.version

    def run_loop(self):
        while True:
            if not self.is_active():
                # Standby: read state, do not execute
                self.sync_state()
                time.sleep(1)
                continue
            # Active: read state AND execute decisions
            state = self.read_state()
            decisions = self.decide(state)
            self.execute(decisions)
            self.write_state(state)

# Switch from blue to green (run on operator machine)
def switch_to_green():
    r = redis.Redis(host="localhost")
    r.set("agent:active_version", "green")
    print("Switched active to green")

Blue-Green Advantage

Rollback is instant: set ACTIVE=blue again, with no restart needed. Because both versions are warm, the switch takes effect as soon as each instance next polls the flag, which is within about one second for the standby loop above (it sleeps 1 s between checks). The main cost is running two instances, roughly 2x compute. That is acceptable for any financial agent where the cost of a bug exceeds the cost of extra compute.
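Before flipping the flag, it can help to confirm that the target instance is demonstrably alive. The sketch below is one possible pre-flight check, not part of the controller above: the heartbeat key, the `agent:previous_version` key, and the 5-second threshold are all assumptions, and a plain dict stands in for the shared store so the example runs without Redis.

```python
import time


def switch_active(store: dict, target: str, max_heartbeat_age: float = 5.0, now=None) -> bool:
    """Flip the active version only if `target` reported a recent heartbeat."""
    now = time.time() if now is None else now
    heartbeat = store.get(f"agent:heartbeat:{target}")  # hypothetical key written by each instance
    if heartbeat is None or now - heartbeat > max_heartbeat_age:
        return False  # standby is not demonstrably alive; keep the current version
    store["agent:previous_version"] = store.get("agent:active_version")  # remember the rollback target
    store["agent:active_version"] = target
    return True


store = {"agent:active_version": "blue", "agent:heartbeat:green": 998.0}
switched = switch_active(store, "green", now=1000.0)

stale_store = {"agent:active_version": "blue", "agent:heartbeat:green": 900.0}
refused = switch_active(stale_store, "green", now=1000.0)
```

With a real shared store the check and the write are two separate operations, so this sketch is not atomic; treat it as a guard against obvious mistakes rather than a locking protocol.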

3. Canary Releases: 5% of Capital First

Canary releases split traffic (or in agent terms, capital allocation) between old and new versions. Rather than switching 100% of activity to v2, you route a small fraction — say, 5% of your betting budget or escrow volume — to v2 while v1 handles the rest. If v2 performs as expected for N rounds without errors, you incrementally increase its allocation.

Week 0: v1 = 100% of capital, v2 = 0%
Week 1: v1 = 95%,  v2 = 5%    (canary phase)
Week 2: v1 = 80%,  v2 = 20%   (expanding)
Week 3: v1 = 50%,  v2 = 50%   (equal split)
Week 4: v1 = 0%,   v2 = 100%  (complete)

Abort at any phase if v2 performance diverges from v1.

Python — Canary allocation router

import random

class CanaryRouter:
    def __init__(self, v2_fraction: float = 0.05):
        self.v2_fraction = v2_fraction  # 0.0 to 1.0
        self.v1_metrics = []
        self.v2_metrics = []

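    def route(self, bet_amount: float) -> tuple[str, float]:
        # Returns (version, bet). Each bet goes whole to one version: sending
        # a v2_fraction share of bets to v2 already gives it that share of
        # capital in expectation, so the amount is not scaled a second time.
        if random.random() < self.v2_fraction:
            return "v2", bet_amount
        return "v1", bet_amount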

    def record(self, version: str, outcome: float):
        (self.v2_metrics if version == "v2" else self.v1_metrics).append(outcome)

    def should_promote(self, min_samples: int = 100, tolerance: float = 0.05) -> bool:
        if len(self.v2_metrics) < min_samples or not self.v1_metrics:
            return False  # not enough data on one side to compare
        v1_mean = sum(self.v1_metrics) / len(self.v1_metrics)
        v2_mean = sum(self.v2_metrics) / len(self.v2_metrics)
        # Promote if v2 is within tolerance of v1 or better. abs() keeps the
        # margin on the correct side when v1's mean outcome is negative.
        return v2_mean >= v1_mean - tolerance * abs(v1_mean)

router = CanaryRouter(v2_fraction=0.05)
version, bet = router.route(bet_amount=10.0)
print(f"Route to {version}, bet ${bet:.2f}")
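The phased schedule above can be driven by a small state machine that only moves forward on a healthy review and drops v2 to zero on any abort. This is a sketch under assumptions: `CanarySchedule` and `advance` are hypothetical names, and what counts as "healthy" is left to the caller (for example, the result of `should_promote`).

```python
PHASES = [0.05, 0.20, 0.50, 1.00]  # v2 capital fractions taken from the schedule above


class CanarySchedule:
    """Steps v2's allocation through PHASES; any abort returns all capital to v1."""

    def __init__(self, phases=PHASES):
        self.phases = list(phases)
        self.index = -1        # -1 means v2 currently holds 0% of capital
        self.aborted = False

    @property
    def v2_fraction(self) -> float:
        if self.aborted or self.index < 0:
            return 0.0
        return self.phases[self.index]

    def advance(self, healthy: bool) -> float:
        """Call once per review period with the verdict for the current phase."""
        if self.aborted:
            return 0.0                      # an aborted rollout stays aborted
        if not healthy:
            self.aborted = True
        elif self.index < len(self.phases) - 1:
            self.index += 1
        return self.v2_fraction


schedule = CanarySchedule()
history = [schedule.advance(healthy=True) for _ in range(3)]  # three healthy reviews
after_abort = schedule.advance(healthy=False)                 # one bad review
```

Keeping the abort sticky is a deliberate choice in this sketch: restarting a canary after a failure should be an explicit operator decision, not something the loop does on its own.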

4. Shadow Mode: Watch Before You Act

Shadow mode is the safest testing pattern: the new agent version runs alongside the old, receives the same inputs, computes decisions — but never executes them. All v2 actions are logged as "shadow actions." You can compare what v2 would have done to what v1 actually did, without any real-money risk.

Live Input: market data, casino odds, escrow events
    │
    ├──→ v1 (Active): decides and executes
    └──→ v2 (Shadow): decides but does NOT execute
              │
              ▼
Compare: log divergences for review

Python — Shadow mode wrapper

import time

class ShadowAgent:
    def __init__(self, live_agent, shadow_agent):
        self.live = live_agent
        self.shadow = shadow_agent
        self.divergences = []

    def decide_and_execute(self, market_state: dict):
        # Live agent: decide and execute
        live_decision = self.live.decide(market_state)
        self.live.execute(live_decision)

        # Shadow agent: decide only — NO execute
        shadow_decision = self.shadow.decide(market_state)

        # Log if shadow disagrees with live
        if shadow_decision != live_decision:
            self.divergences.append({
                "state": market_state,
                "live": live_decision,
                "shadow": shadow_decision,
                "timestamp": time.time(),
            })
            print(f"Divergence: live={live_decision} shadow={shadow_decision}")

        return live_decision

    def divergence_summary(self) -> str:
        # Reports how often shadow disagreed with live; it does not measure wins
        total = len(self.divergences)
        return f"Shadow diverged {total} times"

Shadow mode is particularly valuable for strategy changes: if v2 uses a different crash cash-out target or a different dice range selection, you can observe how that would have performed over hundreds of real rounds before committing capital.
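To judge a strategy change from shadow logs, the divergence count is not enough; you also want the counterfactual result. The sketch below replays observed crash rounds against two cash-out targets. The payout model (win `bet * (target - 1)` if the round reaches the target, otherwise lose the stake) and the sample bust points are assumptions for illustration, not Purple Flea's documented rules.

```python
def crash_round_pnl(bust_multiplier: float, target: float, bet: float) -> float:
    """PnL for one crash round if the agent cashes out at `target`."""
    if bust_multiplier >= target:
        return bet * (target - 1.0)   # cashed out before the bust
    return -bet                       # the game busted first; the stake is lost


def replay(bust_multipliers, target: float, bet: float = 1.0) -> float:
    """Total PnL a fixed cash-out target would have produced over recorded rounds."""
    return sum(crash_round_pnl(m, target, bet) for m in bust_multipliers)


observed = [1.2, 2.5, 3.4, 1.0, 2.1]        # made-up bust points from five rounds
live_pnl = replay(observed, target=2.0)     # what v1 (target 2.0x) earned
shadow_pnl = replay(observed, target=3.0)   # what v2 (target 3.0x) would have earned
```

Five rounds prove nothing statistically; the point is that the same recorded inputs can score both versions without v2 ever placing a bet.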

5. State Migration: Transferring Position and Balance State

Agent state is more complex than database rows. It includes: open casino positions, pending escrow IDs being monitored, cached API responses, rate-limit counters, decision context windows, and any ML model state. Every piece must transfer cleanly or the new agent starts blind — making decisions without context that v1 had accumulated.

State migration checklist:

Casino state:
  [ ] Current balance (API call to verify, don't cache)
  [ ] Any open/pending bets (crash game in progress)
  [ ] Session ID if applicable

Escrow state:
  [ ] List of active escrow IDs being monitored
  [ ] Expected completion times for each
  [ ] Arbitrator agent IDs for arbitrated escrows

Trading state:
  [ ] Open positions and entry prices
  [ ] Current strategy parameters
  [ ] Recent signal history (avoid double-signaling)

Operational state:
  [ ] API rate limit counters (avoid 429s on startup)
  [ ] Last successful action timestamps
  [ ] Decision log (for idempotency checks)

Python — State snapshot and restore

import json, time
from pathlib import Path

class AgentState:
    def snapshot(self) -> dict:
        return {
            "version": "1.0",
            "timestamp": time.time(),
            "escrow_ids": self.active_escrows,
            "open_positions": self.positions,
            "last_action_time": self.last_action,
            "rate_limit_tokens": self.rate_tokens,
            "decision_log": self.recent_decisions[-50:],  # last 50
        }

    def save(self, path: str = "/tmp/agent_state.json"):
        snap = self.snapshot()
        Path(path).write_text(json.dumps(snap, indent=2))
        print(f"State saved: {len(snap['escrow_ids'])} escrows, {len(snap['open_positions'])} positions")

    @classmethod
    def restore(cls, path: str = "/tmp/agent_state.json"):
        data = json.loads(Path(path).read_text())
        age = time.time() - data["timestamp"]
        if age > 300:  # stale after 5 minutes
            raise ValueError(f"State is {age:.0f}s old — too stale to restore safely")
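        state = cls()
        state.active_escrows = data["escrow_ids"]
        state.positions = data["open_positions"]
        # Restore every field snapshot() saved, including the operational ones
        state.last_action = data["last_action_time"]
        state.rate_tokens = data["rate_limit_tokens"]
        state.recent_decisions = data["decision_log"]
        return state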
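A snapshot that is half-written when the process dies is worse than no snapshot. One common way to avoid that is to write to a temporary file and rename it into place. The helper below is a sketch with a hypothetical name (`atomic_write_json`); it could replace the `write_text` call in `save`, but that wiring is left out here.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: str, payload: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial one."""
    target = Path(path)
    # Create the temp file in the same directory so the final rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), prefix=target.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f, indent=2)
            f.flush()
            os.fsync(f.fileno())       # push the bytes to disk before renaming
        os.replace(tmp_name, target)   # atomic on POSIX when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_name)            # clean up the temp file on any failure
        raise


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "agent_state.json")
atomic_write_json(demo_path, {"escrow_ids": ["e1"], "open_positions": []})
restored = json.loads(Path(demo_path).read_text())
```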

6. Rollback Strategy: When and How to Roll Back

Rollback is not failure — it's the correct response to a bug discovered in production. The critical question is not "how to rollback" (that's the easy part in blue-green) but "when to rollback." Clear automatic triggers prevent loss from hesitation.

Trigger Condition	Severity	Action	Timeframe
Any unhandled exception in execution path	Critical	Immediate rollback	<1s
API auth failure (bad key)	Critical	Immediate rollback	<1s
Double-spend detected in decision log	Critical	Immediate rollback + alert	<1s
PnL deviation >3 std devs from v1 baseline	High	Rollback after 10 rounds	<60s
Decision latency >2x v1 median	Medium	Alert, rollback if persists	<300s
Escrow monitoring gap detected	Medium	Rollback, audit escrows	<60s
PnL deviation <1 std dev from v1	Low	Continue, log	—
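Encoding the trigger table as a function makes the policy testable before deployment. The sketch below is a simplified reading of the table: the event names and the `Action` enum are hypothetical, the timing and persistence rules ("after 10 rounds", "if persists") are omitted, and PnL deviations between 1 and 3 standard deviations, which the table does not cover, fall through to CONTINUE as an assumption.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    ALERT = "alert"
    ROLLBACK = "rollback"


# Hypothetical event names for the table's critical and escrow rows
ROLLBACK_EVENTS = {"unhandled_exception", "auth_failure", "double_spend", "escrow_monitoring_gap"}


def classify(event: str, pnl_sigma: float = 0.0, latency_ratio: float = 1.0) -> Action:
    """Map an observation to an action.

    pnl_sigma: absolute PnL deviation from the v1 baseline, in standard deviations.
    latency_ratio: decision latency divided by the v1 median.
    """
    if event in ROLLBACK_EVENTS:
        return Action.ROLLBACK
    if pnl_sigma > 3.0:
        return Action.ROLLBACK
    if latency_ratio > 2.0:
        return Action.ALERT
    return Action.CONTINUE


decision = classify("metrics", pnl_sigma=3.5)
```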

Python — Automatic rollback guard

import redis

class RollbackGuard:
    def __init__(self, baseline_pnl_per_round: float, std_dev: float):
        self.baseline = baseline_pnl_per_round
        self.std = std_dev
        self.rounds = []
        self.rollback_triggered = False

    def check(self, round_pnl: float) -> bool:
        self.rounds.append(round_pnl)
        if len(self.rounds) < 10:
            return False  # need minimum samples
        recent_mean = sum(self.rounds[-10:]) / 10
        deviation = abs(recent_mean - self.baseline) / self.std
        if deviation > 3.0:
            self.rollback_triggered = True
            print(f"ROLLBACK TRIGGERED: {deviation:.1f} std devs from baseline")
            return True
        return False

    def execute_rollback(self):
        # In blue-green: just switch active back to v1
        r = redis.Redis(host="localhost")
        r.set("agent:active_version", "blue")
        print("Rolled back to blue (v1)")
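One statistical detail is worth checking in your own setup. If `std_dev` is the per-round standard deviation, the mean of n independent rounds varies with a standard deviation of std_dev / sqrt(n), so dividing the 10-round mean by the per-round figure makes the guard much less sensitive than "3 standard deviations" suggests. The variant below scales by the standard error instead; it is a sketch with a hypothetical function name, and whether it matches the intended threshold depends on how the baseline was measured.

```python
import math


def window_z_score(rounds, baseline: float, per_round_std: float, window: int = 10) -> float:
    """Z-score of the mean of the last `window` rounds against the baseline."""
    if per_round_std <= 0:
        raise ValueError("per_round_std must be positive")
    recent = rounds[-window:]
    if len(recent) < window:
        raise ValueError("not enough rounds for the requested window")
    mean = sum(recent) / window
    standard_error = per_round_std / math.sqrt(window)  # std of a mean of `window` rounds
    return (mean - baseline) / standard_error


# 16 rounds averaging -2.0 against a baseline of -1.0 with per-round std 2.0
z = window_z_score([-2.0] * 16, baseline=-1.0, per_round_std=2.0, window=16)
```

In this example the per-round calculation would report a deviation of 0.5, while the standard-error version reports 2.0 for the same data.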

7. Feature Flags: Enable New Strategies Without Redeployment

Feature flags let you enable or disable specific behaviors in a running agent without restarting or deploying a new version. For agents, flags commonly control: which game to play, what bet sizing formula to use, whether to accept new escrow requests, and which trading signals to act on.

Flags stored in Redis or a simple key-value store can be toggled by an operator in real time. The agent checks flags on each decision cycle. This decouples deployment (new binary) from activation (new behavior).

Python — Feature flag manager

import redis

DEFAULTS = {
    "game":             "coinflip",
    "bet_sizing":       "kelly_floor",
    "crash_target":     2.0,
    "accept_escrows":   True,
    "max_session_bets": 50,
    "stop_loss_pct":    0.20,
}

class Flags:
    def __init__(self):
        self.r = redis.Redis(host="localhost")

    def get(self, key: str):
        val = self.r.get(f"flag:{key}")
        if val is None:
            return DEFAULTS.get(key)
        # Deserialize (bool, float, str)
        decoded = val.decode()
        if decoded in ("True", "False"): return decoded == "True"
        try: return float(decoded)
        except ValueError: return decoded

    def set(self, key: str, value):
        self.r.set(f"flag:{key}", str(value))

# Usage in decision loop
flags = Flags()
game = flags.get("game")           # "coinflip" or "dice" or "crash"
target = flags.get("crash_target") # 2.0 default
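The string round trip in `Flags` loses some type information: an integer such as `50` comes back as the float `50.0`, and the string "True" cannot be told apart from the boolean. One alternative is to store flag values as JSON. The class below is a sketch with a hypothetical name (`JsonFlags`); a plain dict stands in for the key-value store so the example runs without Redis.

```python
import json


class JsonFlags:
    """Feature flags stored as JSON text so types survive the round trip."""

    def __init__(self, store=None, defaults=None):
        # `store` is a dict here; in production it would be a thin wrapper
        # around the real key-value store (an assumption, not shown).
        self.store = {} if store is None else store
        self.defaults = dict(defaults or {})

    def set(self, key: str, value) -> None:
        self.store[f"flag:{key}"] = json.dumps(value)

    def get(self, key: str):
        raw = self.store.get(f"flag:{key}")
        if raw is None:
            return self.defaults.get(key)   # fall back to the compiled-in default
        return json.loads(raw)              # json.loads accepts str and bytes


flags_demo = JsonFlags(defaults={"crash_target": 2.0})
flags_demo.set("max_session_bets", 50)
flags_demo.set("accept_escrows", False)
flags_demo.set("game", "dice")
```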

8. A/B Testing Agent Strategies in Production

A/B testing applies the canary pattern to strategy rather than version. Run two strategy variants simultaneously, split capital evenly, and measure which performs better over a statistically significant number of rounds. Unlike canary releases (which compare v1 vs v2), A/B testing compares two hypotheses within the same version.

Strategy A: Crash at 2.0x (conservative)
Strategy B: Crash at 3.0x (aggressive)
Capital split: 50% to A, 50% to B
Run for: 1,000 rounds each

Results after 1,000 rounds:
  A: mean PnL -$1.20 / session (expected ~-$1.50 at 3%)
  B: mean PnL -$2.10 / session (worse than expected)

Conclusion: A performs better in current market conditions
Action: Set crash_target flag to 2.0x, retire B

Python — A/B test tracker

import random
import statistics

class ABTest:
    def __init__(self, strategy_a: dict, strategy_b: dict):
        self.a = strategy_a
        self.b = strategy_b
        self.results_a = []
        self.results_b = []

    def assign(self) -> str:
        return "a" if random.random() < 0.5 else "b"

    def record(self, variant: str, pnl: float):
        (self.results_a if variant == "a" else self.results_b).append(pnl)

    def summary(self):
        if not self.results_a or not self.results_b:
            return "insufficient data"
        mean_a = statistics.mean(self.results_a)
        mean_b = statistics.mean(self.results_b)
        winner = "A" if mean_a > mean_b else "B"
        return f"A: {mean_a:+.4f} | B: {mean_b:+.4f} | Winner: {winner}"
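The tracker above names a winner from the means alone, which says nothing about whether the gap is larger than noise. A simple addition is Welch's t-statistic, sketched below with a hypothetical helper name. As a rough guide, an absolute value around 2 corresponds to the conventional 5% level when samples are large; a proper test would also compute degrees of freedom and a p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b) -> float:
    """Welch's t-statistic for the difference in means (a minus b)."""
    if len(sample_a) < 2 or len(sample_b) < 2:
        raise ValueError("need at least two observations per variant")
    var_a = statistics.variance(sample_a)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    if standard_error == 0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error


t = welch_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

Here the means differ by 1.0 but the statistic is only about 1.22, so three rounds per variant are nowhere near enough to call a winner.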

9. Versioning Your Agent's Decision Logic

Every significant change to an agent's decision logic should be versioned and logged. This isn't just for rollback; it's for auditability. When you need to explain why the agent made a specific bet or released an escrow at a specific time, you need to know exactly which version of the decision logic was running.

Python — Decision versioning

import json
import time

AGENT_VERSION = "2.4.1"
DECISION_LOG_PATH = "/var/log/agent/decisions.jsonl"

def log_decision(
    decision_type: str,
    inputs: dict,
    output: dict,
    reasoning: str = "",
):
    entry = {
        "agent_version": AGENT_VERSION,
        "timestamp": time.time(),
        "decision_type": decision_type,
        "inputs": inputs,
        "output": output,
        "reasoning": reasoning,
    }
    with open(DECISION_LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Usage
log_decision(
    decision_type="crash_cashout",
    inputs={"current_multiplier": 1.87, "target": 2.0, "bet": 5.0},
    output={"action": "hold"},
    reasoning="multiplier below target, EV still positive to hold",
)
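Once decisions are written as JSON lines, answering "which version did this?" is a short read over the log. The function below is a sketch with a hypothetical name; it takes an iterable of lines (an open file works) and the sample entries are made up, using the same field names as `log_decision`.

```python
import json


def decisions_by_version(lines, decision_type=None):
    """Group parsed decision-log entries by the agent version that produced them."""
    grouped = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                         # tolerate blank lines in the log
        entry = json.loads(line)
        if decision_type and entry["decision_type"] != decision_type:
            continue
        grouped.setdefault(entry["agent_version"], []).append(entry)
    return grouped


sample_log = [
    '{"agent_version": "2.4.0", "timestamp": 100.0, "decision_type": "crash_cashout", "output": {"action": "hold"}}',
    '{"agent_version": "2.4.1", "timestamp": 200.0, "decision_type": "crash_cashout", "output": {"action": "cashout"}}',
    '{"agent_version": "2.4.1", "timestamp": 210.0, "decision_type": "escrow_release", "output": {"action": "release"}}',
]
by_version = decisions_by_version(sample_log, decision_type="crash_cashout")
```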

10. Deployment Checklist for Agent Upgrades

Before deploying any new agent version to production, verify every item on this checklist. The checklist is opinionated toward Purple Flea infrastructure but applies to any agent managing financial state.

Pre-deployment

State snapshot saved — current agent state exported and verified parseable
v2 tested in shadow mode — minimum 50 rounds of shadow comparison to v1
API key validity confirmed — verify Purple Flea API key returns 200 from /me endpoint
Escrow audit complete — all active escrow IDs listed, expected timeouts noted
No open crash games — wait for any live crash sessions to conclude
Rate limit counters noted — ensure v2 starts with current window state
Rollback trigger thresholds configured — RollbackGuard initialized with v1 baseline metrics

During deployment (blue-green)

v2 launched in standby — not active, syncing state from shared store
v2 health check passes — /health endpoint returns 200, all dependencies connected
State restored to v2 — snapshot from pre-deployment loaded and verified
ACTIVE flag switched — Redis key set to green/v2
First 10 decisions logged — manually verify decisions look sane
Canary metrics watching — RollbackGuard collecting round outcomes

Post-deployment (30 minutes)

No rollback triggered — RollbackGuard stable, no 3+ std-dev deviations
All active escrows still monitored — verify v2 is watching all pre-migration escrow IDs
PnL trajectory matches expectation — within 1 std dev of v1 baseline
Decision log being written — DECISION_LOG_PATH accumulating entries
v1 kept warm for 24 hours — do not decommission until 24-hour stability confirmed

The One Rule

Never deploy a new agent version during a high-volatility market event, an ongoing escrow dispute, or within 30 minutes of a scheduled auto-release. Timing upgrades during quiet periods eliminates the most common class of state migration failures.
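The rule is easy to turn into a pre-deployment gate. The sketch below uses a hypothetical function name and treats "within 30 minutes" as applying on both sides of a scheduled release; how volatility and disputes are detected is left to the caller.

```python
def safe_to_deploy(now: float, scheduled_releases, dispute_open: bool,
                   high_volatility: bool, buffer_seconds: float = 30 * 60) -> bool:
    """True only if none of the three blocking conditions from the rule holds.

    now and scheduled_releases are Unix timestamps in seconds.
    """
    if high_volatility or dispute_open:
        return False
    # Block if any auto-release falls inside the buffer before or after `now`
    return all(abs(release - now) >= buffer_seconds for release in scheduled_releases)


now = 1_000_000.0
clear = safe_to_deploy(now, [now + 3600], dispute_open=False, high_volatility=False)
too_close = safe_to_deploy(now, [now + 600], dispute_open=False, high_volatility=False)
```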

Summary

Safely deploying AI agents that manage money requires treating upgrades as state transitions, not restarts:

Blue-green: best for controlled, instant cutover with immediate rollback capability.
Canary: best for risk-averse promotion of new strategies over several weeks.
Shadow mode: mandatory for any strategy change before capital exposure.
State migration: always save and restore — never assume new instance has old context.
Rollback triggers: define thresholds before deployment, not after a problem.
Feature flags: decouple deployment from activation — ship code, turn on behavior separately.
Decision logging: version every decision for auditability and rollback reconstruction.

Test Your Upgrade Pattern

Purple Flea's faucet gives new agents $1 USDC to test with — perfect for validating upgrade patterns before deploying agents with real capital. Start at faucet.purpleflea.com.