📈 Strategy

Profit Maximization Algorithms for AI Agents

March 6, 2026
16 min read
Advanced

From convex portfolio optimization to Thompson sampling bandits, this guide covers the full toolkit of profit maximization algorithms for AI agents operating across Purple Flea's six services: casino, trading, wallet, domains, faucet, and escrow.

Table of Contents
01

Problem Formulation: What Are We Maximizing?

Before selecting an algorithm, define the objective precisely. For a Purple Flea agent with access to multiple services, the optimization problem is:

Text problem definition
Maximize:   E[profit(t+T)] - E[risk_penalty(t+T)]
Subject to:
  sum(allocations) = available_balance    # budget constraint
  allocations[i] >= 0                     # no shorting
  allocations[i] <= max_per_service[i]    # service caps
  drawdown(t) <= max_drawdown             # risk constraint
  time_to_execute <= deadline             # latency constraint
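Before choosing an algorithm it is worth encoding these constraints as an explicit feasibility check, so any candidate allocation can be rejected before execution. A minimal sketch (the caps and drawdown figures are illustrative, not Purple Flea defaults):

```python
import numpy as np

def is_feasible(allocations: np.ndarray,
                available_balance: float,
                max_per_service: np.ndarray,
                current_drawdown: float,
                max_drawdown: float,
                tol: float = 1e-9) -> bool:
    """Reject any candidate allocation that violates the constraints above."""
    if abs(allocations.sum() - available_balance) > tol:    # budget constraint
        return False
    if (allocations < -tol).any():                          # no shorting
        return False
    if (allocations > max_per_service + tol).any():         # service caps
        return False
    if current_drawdown > max_drawdown:                     # risk constraint
        return False
    return True

caps  = np.array([200.0, 200.0, 150.0, 100.0, 25.0, 75.0])  # illustrative caps
alloc = np.array([150.0, 150.0, 100.0, 75.0, 25.0, 0.0])    # sums to 500
print(is_feasible(alloc, 500.0, caps, 0.05, 0.20))          # True
```

The latency constraint is omitted here because it depends on execution details rather than on the allocation vector itself.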

Purple Flea offers six services. Each has a different return distribution, time horizon, and risk profile:

Service   Expected Return               Variance          Time Horizon   Liquidity
Casino    -2% to +∞ (game dependent)    Very high         Seconds        Instant
Trading   -5% to +15% / month           Medium            Hours–days     High
Wallet    Holding gains (crypto price)  High              Days–months    High
Escrow    15% referral on 1% fee        Low (fee income)  Per-deal       Medium
Faucet    Free seeding capital          None              One-time       Instant
Domains   Resale margin                 Medium            Weeks          Low
02

Convex Optimization for Portfolio Allocation

When you have historical return estimates and a covariance matrix, convex optimization via scipy.optimize gives the theoretically optimal allocation under a mean-variance framework (Markowitz-style). This is best suited for trading and wallet allocation where returns are approximately stationary.

Python convex_portfolio.py
import numpy as np
from scipy.optimize import minimize
from dataclasses import dataclass
from typing import Optional

@dataclass
class PortfolioResult:
    weights: np.ndarray
    expected_return: float
    volatility: float
    sharpe: float
    service_names: list[str]

def mean_variance_optimize(
    expected_returns: np.ndarray,   # shape (n_services,)
    cov_matrix: np.ndarray,         # shape (n_services, n_services)
    risk_free_rate: float = 0.0,
    max_per_service: Optional[np.ndarray] = None,
    min_per_service: Optional[np.ndarray] = None,
    risk_aversion: float = 2.0,     # lambda: higher = more conservative
    service_names: Optional[list[str]] = None,
) -> PortfolioResult:
    """
    Maximize: E[r] - (risk_aversion / 2) * Var[r]
    Subject to: sum(w) = 1, lb <= w <= ub
    """
    n = len(expected_returns)

    if max_per_service is None:
        max_per_service = np.ones(n)
    if min_per_service is None:
        min_per_service = np.zeros(n)

    def neg_utility(w: np.ndarray) -> float:
        ret = w @ expected_returns
        var = w @ cov_matrix @ w
        return -(ret - (risk_aversion / 2) * var)

    def gradient(w: np.ndarray) -> np.ndarray:
        return -(expected_returns - risk_aversion * cov_matrix @ w)

    # Equality: weights sum to 1
    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1}]

    # Box constraints
    bounds = [
        (min_per_service[i], max_per_service[i]) for i in range(n)
    ]

    # Initial guess: equal weight
    w0 = np.ones(n) / n

    result = minimize(
        neg_utility,
        w0,
        jac=gradient,
        method="SLSQP",
        bounds=bounds,
        constraints=constraints,
        options={"ftol": 1e-9, "maxiter": 1000},
    )

    if not result.success:
        raise RuntimeError(f"Optimization failed: {result.message}")

    w_opt = result.x
    ret   = w_opt @ expected_returns
    vol   = np.sqrt(w_opt @ cov_matrix @ w_opt)
    sharpe = (ret - risk_free_rate) / vol if vol > 0 else 0.0

    return PortfolioResult(
        weights         = w_opt,
        expected_return = float(ret),
        volatility      = float(vol),
        sharpe          = float(sharpe),
        service_names   = service_names or [f"s{i}" for i in range(n)],
    )


# Purple Flea example
# Historical monthly returns (fractions) estimated from 90 days of data
pf_returns = np.array([
    0.08,   # casino (high risk, positive edge with optimal strategy)
    0.06,   # trading
    0.04,   # wallet (crypto hold)
    0.02,   # escrow referral income
    0.00,   # faucet (free capital, no return on the capital itself)
    0.03,   # domains
])

# Estimated covariance (correlated assets — crypto affects casino + wallet)
pf_cov = np.array([
    [0.04,  0.01,  0.015,  0.001,  0.0,   0.005],
    [0.01,  0.025, 0.012,  0.001,  0.0,   0.003],
    [0.015, 0.012, 0.035,  0.001,  0.0,   0.004],
    [0.001, 0.001, 0.001,  0.002,  0.0,   0.001],
    [0.0,   0.0,   0.0,    0.0,    0.0,   0.0],
    [0.005, 0.003, 0.004,  0.001,  0.0,   0.018],
])

result = mean_variance_optimize(
    expected_returns = pf_returns,
    cov_matrix       = pf_cov,
    risk_aversion    = 3.0,           # moderately conservative
    max_per_service  = np.array([0.35, 0.35, 0.30, 0.20, 0.05, 0.15]),
    service_names    = ["casino","trading","wallet","escrow","faucet","domains"],
)

print("Optimal allocation:")
for name, w in zip(result.service_names, result.weights):
    print(f"  {name:12s}: {w*100:5.1f}%")
print(f"Expected return: {result.expected_return*100:.2f}%/month")
print(f"Volatility:      {result.volatility*100:.2f}%")
print(f"Sharpe ratio:    {result.sharpe:.3f}")
ℹ️
When to use: Convex portfolio optimization works best when you have at least 30 days of return history and the return distribution is reasonably stationary. For casino bets (high variance, game-specific edge) pair this with a bandit algorithm to estimate per-game expected returns first.
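The two estimated inputs to mean_variance_optimize, expected_returns and cov_matrix, fall out of sample statistics over a per-service return log. A minimal sketch using a synthetic 90-day log (a real agent would substitute its recorded returns):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a real return log: 90 days x 6 services
daily_returns = rng.normal(loc=0.002, scale=0.01, size=(90, 6))

expected_returns = daily_returns.mean(axis=0)      # feed as expected_returns
cov_matrix = np.cov(daily_returns, rowvar=False)   # feed as cov_matrix

print(expected_returns.round(4))
print(cov_matrix.shape)                            # (6, 6)
```

With short histories the sample covariance can be ill-conditioned; shrinking it toward its diagonal before optimizing is a common remedy.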
03

Linear Programming for Resource Allocation

When the return per unit is known and fixed (no variance modeled), linear programming with scipy.optimize.linprog finds the exact optimal allocation under hard constraints. This makes it ideal for optimizing escrow referral income or for allocating compute time across tasks.

Python linear_programming_allocation.py
import numpy as np
from scipy.optimize import linprog

def lp_resource_allocation(
    returns_per_unit: list[float],      # profit per USDT deployed
    budget: float,
    time_per_unit: list[float],         # hours of agent compute per USDT
    time_budget: float,                 # total agent compute hours
    max_alloc: list[float] = None,
    service_names: list[str] = None,
) -> dict:
    """
    Maximize: sum(returns_per_unit[i] * x[i])
    Subject to:
      sum(x) <= budget
      sum(time_per_unit[i] * x[i]) <= time_budget
      0 <= x[i] <= max_alloc[i]

    linprog minimizes, so negate returns.
    """
    n = len(returns_per_unit)
    c = [-r for r in returns_per_unit]   # negate for minimization

    # Inequality constraints: Ax <= b
    A_ub = [
        [1.0] * n,               # budget: sum(x) <= budget
        time_per_unit,           # time:   sum(t_i * x_i) <= time_budget
    ]
    b_ub = [budget, time_budget]

    bounds = [(0, mx) for mx in (max_alloc or [budget] * n)]

    result = linprog(
        c, A_ub=A_ub, b_ub=b_ub, bounds=bounds,
        method="highs",
    )

    if result.status != 0:
        raise RuntimeError(f"LP failed: {result.message}")

    names = service_names or [f"s{i}" for i in range(n)]
    return {
        "allocations": dict(zip(names, result.x)),
        "total_profit": -result.fun,
        "budget_used":  sum(result.x),
        "time_used":    sum(t * x for t, x in zip(time_per_unit, result.x)),
    }


# Purple Flea agent with 500 USDT budget and 24h compute time
result = lp_resource_allocation(
    returns_per_unit = [0.08, 0.06, 0.04, 0.02, 0.0,  0.03],
    budget           = 500.0,
    time_per_unit    = [0.5,  2.0,  0.1,  1.0,  0.05, 3.0 ],
    time_budget      = 24.0,
    max_alloc        = [200,  200,  150,  100,  25,   75  ],
    service_names    = ["casino","trading","wallet","escrow","faucet","domains"],
)

print("LP optimal allocation:")
for svc, amt in result["allocations"].items():
    print(f"  {svc:12s}: {amt:7.2f} USDT")
print(f"  Total profit: {result['total_profit']:.4f} USDT")
print(f"  Budget used:  {result['budget_used']:.2f} / 500.00 USDT")
print(f"  Time used:    {result['time_used']:.2f} / 24.00 hours")
LP advantage: Linear programming gives a provably globally optimal solution in polynomial time. Unlike gradient methods, it never gets stuck in local optima. Use it whenever your objective and constraints are truly linear.
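It is worth verifying the solver on an instance small enough to solve by hand. With two services returning 0.10 and 0.05 per USDT, a 100 USDT budget, and a 60 USDT cap on the first, the optimum is clearly to fill the better service to its cap and put the remainder in the other:

```python
from scipy.optimize import linprog

# Maximize 0.10*x0 + 0.05*x1  s.t.  x0 + x1 <= 100, 0 <= x0 <= 60, 0 <= x1 <= 100
res = linprog(
    c=[-0.10, -0.05],             # negate: linprog minimizes
    A_ub=[[1.0, 1.0]],
    b_ub=[100.0],
    bounds=[(0, 60), (0, 100)],
    method="highs",
)
print(res.x, -res.fun)            # allocation [60, 40], profit 8.0
```

If the solver disagrees with the hand-computed answer on a toy instance, the model (signs, constraint direction, bounds) is wrong, not the solver.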
04

Dynamic Programming for Sequential Decisions

Many agent decisions are sequential: a bet outcome affects the next bet size; an escrow deal affects available collateral. Dynamic programming (DP) finds the globally optimal policy across a finite sequence of decisions by working backward from the terminal state.

Optimal bet sizing with DP (Kelly-like)

Python dynamic_programming_bets.py
import numpy as np

def dp_optimal_bet_sequence(
    bankroll: float,
    win_prob: float,
    payout_mult: float,    # e.g. 1.95 for near-even bet
    n_bets: int,
    bet_fractions: list[float] = None,  # discrete choices
    risk_aversion: float = 0.5,         # 0=risk-neutral, 1=log-utility
) -> tuple[list[float], float]:
    """
    Find the sequence of bet fractions [f_1, ..., f_n] that maximizes
    E[U(bankroll)] = E[bankroll^(1-risk_aversion)] using backward induction.

    Returns (optimal_fractions, expected_utility).
    """
    if bet_fractions is None:
        # Discretize bet sizes: 1% to 25% of bankroll in 1% steps
        bet_fractions = [i / 100 for i in range(1, 26)]

    # States: discretized bankroll levels
    bankroll_levels = np.linspace(0.01 * bankroll, 3.0 * bankroll, 200)

    def utility(b: float) -> float:
        if b <= 0:
            return -1e9   # ruin
        if risk_aversion == 1.0:
            return np.log(b)
        return (b ** (1 - risk_aversion)) / (1 - risk_aversion)

    def closest_state(b: float) -> int:
        return int(np.argmin(np.abs(bankroll_levels - b)))

    # Terminal value: utility of final bankroll
    V = np.array([utility(b) for b in bankroll_levels])

    # Backward induction
    policies = []
    for _ in range(n_bets):
        V_new = np.empty_like(V)
        policy = np.empty(len(bankroll_levels))

        for s_idx, b in enumerate(bankroll_levels):
            best_val  = -1e18
            best_frac = bet_fractions[0]

            for f in bet_fractions:
                bet_amt = f * b
                b_win   = b + bet_amt * (payout_mult - 1)
                b_lose  = b - bet_amt

                s_win  = closest_state(b_win)
                s_lose = closest_state(b_lose)

                ev = win_prob * V[s_win] + (1 - win_prob) * V[s_lose]
                if ev > best_val:
                    best_val  = ev
                    best_frac = f

            V_new[s_idx] = best_val
            policy[s_idx] = best_frac

        V = V_new
        policies.append(policy)

    # The starting state
    start_idx = closest_state(bankroll)
    # Policies are in reverse order; read forward
    optimal_fractions = [
        float(policies[-(i+1)][start_idx]) for i in range(n_bets)
    ]
    return optimal_fractions, float(V[start_idx])


# Purple Flea casino: 50.5% win prob, 1.95x payout (close to even)
fractions, ev = dp_optimal_bet_sequence(
    bankroll     = 100.0,
    win_prob     = 0.505,
    payout_mult  = 1.95,
    n_bets       = 10,
    risk_aversion = 0.5,
)
print("Optimal bet fractions over 10 bets:")
for i, f in enumerate(fractions, 1):
    print(f"  Bet {i}: {f*100:.0f}% of current bankroll")
print(f"Expected utility: {ev:.4f}")
⚠️
DP complexity note: This grid-based DP is O(S * A * T) where S = state space, A = action space, T = time steps. For 200 bankroll levels, 25 bet fractions, 10 steps: 50,000 operations — instant. For longer horizons (>100 bets) or continuous state spaces, consider approximate DP or Monte Carlo tree search.
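As a sanity check on the DP output, the single-bet optimum has a closed form: the Kelly fraction f* = (b·p - q) / b, where b = payout_mult - 1, p is the win probability, and q = 1 - p. A sketch (the 55% win probability is an assumed positive-edge example, not a Purple Flea figure):

```python
def kelly_fraction(win_prob: float, payout_mult: float) -> float:
    """Closed-form Kelly stake as a fraction of bankroll (0 if edge <= 0)."""
    b = payout_mult - 1            # net odds per unit staked
    q = 1 - win_prob
    f = (b * win_prob - q) / b
    return max(0.0, f)

print(f"{kelly_fraction(0.55, 1.95):.4f}")   # ≈ 0.0763
print(kelly_fraction(0.505, 1.95))           # 0.0 -> no positive-edge stake
```

Note that at the 50.5% / 1.95x parameters used above, the edge is actually negative (0.505 × 0.95 < 0.495), so full Kelly stakes nothing; the grid DP would likewise push toward its minimum allowed fraction at those odds.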
05

Bandit Algorithms: UCB and Thompson Sampling

Bandit algorithms solve the explore-exploit tradeoff: which Purple Flea service (or game within a service) should the agent try next? UCB (Upper Confidence Bound) and Thompson Sampling are the two most effective approaches for an agent that starts with no prior knowledge of return rates.

UCB1: Upper Confidence Bound

Python ucb_bandit.py
import numpy as np
import math

class UCB1Bandit:
    """
    UCB1 bandit for exploring Purple Flea services.
    Arm = service (casino, trading, escrow, etc.)
    Reward = normalized profit from one allocation unit.
    """

    def __init__(self, arm_names: list[str], c: float = 1.414):
        self.arm_names = arm_names
        self.n_arms    = len(arm_names)
        self.c         = c          # exploration constant
        self.counts    = np.zeros(self.n_arms)      # pulls per arm
        self.values    = np.zeros(self.n_arms)      # mean reward per arm
        self.t         = 0          # total pulls

    def select(self) -> int:
        """Return index of arm to pull next."""
        self.t += 1

        # Force-explore unsampled arms first
        for i in range(self.n_arms):
            if self.counts[i] == 0:
                return i

        # UCB score: mean + c * sqrt(ln(t) / n_i)
        ucb = self.values + self.c * np.sqrt(
            math.log(self.t) / self.counts
        )
        return int(np.argmax(ucb))

    def update(self, arm_idx: int, reward: float):
        """Update estimates with observed reward."""
        self.counts[arm_idx] += 1
        n = self.counts[arm_idx]
        self.values[arm_idx] += (reward - self.values[arm_idx]) / n

    def best_arm(self) -> str:
        """Return the name of the current best-estimated arm."""
        return self.arm_names[int(np.argmax(self.values))]

    def report(self) -> dict:
        return {
            name: {
                "pulls":        int(self.counts[i]),
                "mean_reward":  float(self.values[i]),
            }
            for i, name in enumerate(self.arm_names)
        }
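A quick simulated run shows the UCB1 rule converging on the best arm. This sketch inlines the same select/update logic as UCB1Bandit so it runs standalone; the win probabilities are made up, with gaps exaggerated so convergence is visible within 2,000 pulls:

```python
import math
import numpy as np

rng = np.random.default_rng(7)
true_means = {"dice": 0.60, "roulette": 0.35, "slots": 0.25}   # made-up values
names  = list(true_means)
counts = np.zeros(len(names))
values = np.zeros(len(names))

for t in range(1, 2001):
    if (counts == 0).any():                  # force-explore unsampled arms
        arm = int(np.argmax(counts == 0))
    else:                                    # UCB score: mean + c*sqrt(ln t / n)
        arm = int(np.argmax(values + 1.414 * np.sqrt(math.log(t) / counts)))
    reward = float(rng.random() < true_means[names[arm]])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print({n: int(c) for n, c in zip(names, counts)})
print("best arm:", names[int(np.argmax(values))])
```

With smaller gaps (like the 0.52 vs 0.48 games used below), expect to need far more pulls before the pull counts separate cleanly.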

Thompson Sampling (Bayesian bandit)

Python thompson_sampling.py
import numpy as np

class ThompsonSamplingBandit:
    """
    Beta-Bernoulli Thompson Sampling for binary reward arms.
    Use when reward is "profitable session" (1) or "loss" (0).
    """

    def __init__(self, arm_names: list[str],
                 prior_alpha: float = 1.0,
                 prior_beta: float = 1.0):
        self.arm_names   = arm_names
        self.n_arms      = len(arm_names)
        # Beta distribution parameters for each arm
        self.alphas = np.full(self.n_arms, prior_alpha)
        self.betas  = np.full(self.n_arms, prior_beta)

    def select(self) -> int:
        """Sample from each arm's posterior; pick highest sample."""
        samples = np.random.beta(self.alphas, self.betas)
        return int(np.argmax(samples))

    def update(self, arm_idx: int, success: bool):
        """Update posterior with observed binary outcome."""
        if success:
            self.alphas[arm_idx] += 1
        else:
            self.betas[arm_idx]  += 1

    def posterior_mean(self, arm_idx: int) -> float:
        return self.alphas[arm_idx] / (
            self.alphas[arm_idx] + self.betas[arm_idx])

    def report(self) -> dict:
        return {
            name: {
                "alpha":          float(self.alphas[i]),
                "beta":           float(self.betas[i]),
                "posterior_mean": float(self.posterior_mean(i)),
                "confidence":     int(self.alphas[i] + self.betas[i]),
            }
            for i, name in enumerate(self.arm_names)
        }


# Simulation: 200 rounds exploring 4 Purple Flea games
rng = np.random.default_rng(42)
TRUE_WIN_PROBS = {"dice": 0.52, "roulette": 0.48, "blackjack": 0.50, "slots": 0.45}

ts = ThompsonSamplingBandit(arm_names=list(TRUE_WIN_PROBS.keys()))

for _ in range(200):
    arm   = ts.select()
    name  = ts.arm_names[arm]
    win   = rng.random() < TRUE_WIN_PROBS[name]
    ts.update(arm, win)

print("Thompson Sampling results after 200 rounds:")
for name, stats in ts.report().items():
    print(f"  {name:12s}: P(win)≈{stats['posterior_mean']:.3f} "
          f"(true={TRUE_WIN_PROBS[name]:.2f}) "
          f"n={stats['confidence']}")
🟣
UCB vs Thompson: Thompson Sampling typically converges faster in practice and handles correlated arms better. UCB is simpler to implement and more deterministic. For Purple Flea game selection, Thompson Sampling is recommended because game win rates are unknown a priori.
06

Gradient-Based Strategy Optimization

When profit is a differentiable function of strategy parameters, gradient ascent finds the optimal parameters iteratively. This is powerful for continuous strategies like bet-sizing curves, risk thresholds, or multi-service blend ratios.

Python gradient_strategy.py
import numpy as np
from scipy.optimize import minimize

def simulate_strategy(params: np.ndarray,
                      n_rounds: int = 1000,
                      seed: int = 0) -> float:
    """
    Simulate profit for a parameterized strategy.
    params = [casino_fraction, trading_fraction, escrow_fraction,
              casino_bet_size, risk_cutoff]

    Returns mean profit over n_rounds (negative for minimization).
    """
    rng = np.random.default_rng(seed)

    casino_frac, trading_frac, escrow_frac, bet_size, cutoff = params

    # Normalize service fractions (must sum <= 1)
    total = casino_frac + trading_frac + escrow_frac
    if total > 1:
        casino_frac   /= total
        trading_frac  /= total
        escrow_frac   /= total

    profits = []
    bankroll = 100.0

    for _ in range(n_rounds):
        if bankroll < cutoff:
            break   # risk cutoff hit

        # Casino allocation
        casino_alloc = bankroll * casino_frac
        bet          = casino_alloc * max(0.01, min(0.25, bet_size))
        casino_return = bet * 0.95 if rng.random() < 0.505 else -bet

        # Trading allocation (random walk with slight positive drift)
        trading_alloc  = bankroll * trading_frac
        trading_return = trading_alloc * rng.normal(0.002, 0.03)

        # Escrow referral income (deterministic based on volume)
        escrow_alloc   = bankroll * escrow_frac
        escrow_return  = escrow_alloc * 0.0015  # 0.15% per round

        round_profit = float(casino_return + trading_return + escrow_return)
        bankroll    += round_profit
        profits.append(round_profit)

    return -float(np.mean(profits)) if profits else 0.0   # negative for minimization


# Optimize strategy parameters
initial_params = np.array([0.3, 0.4, 0.2, 0.05, 20.0])

bounds = [
    (0.0, 0.6),   # casino_fraction
    (0.0, 0.6),   # trading_fraction
    (0.0, 0.4),   # escrow_fraction
    (0.01, 0.25), # casino_bet_size
    (5.0, 50.0),  # risk_cutoff (USDT)
]

result = minimize(
    simulate_strategy,
    initial_params,
    method   = "L-BFGS-B",
    bounds   = bounds,
    options  = {"maxiter": 200, "ftol": 1e-8},
)

opt = result.x
print("Gradient-optimized strategy:")
print(f"  Casino fraction:   {opt[0]*100:.1f}%")
print(f"  Trading fraction:  {opt[1]*100:.1f}%")
print(f"  Escrow fraction:   {opt[2]*100:.1f}%")
print(f"  Casino bet size:   {opt[3]*100:.1f}% of casino alloc")
print(f"  Risk cutoff:       {opt[4]:.2f} USDT")
print(f"  Mean profit/round: {-result.fun:.4f} USDT")
⚠️
Gradient methods and local optima: L-BFGS-B can get stuck in local optima when the objective surface is non-convex (as here, with simulation noise). Run from multiple initial points and take the best result, or combine with basin-hopping: scipy.optimize.basinhopping.
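A minimal basinhopping sketch on a deliberately two-basin 1-D objective (a toy function, not the strategy simulator) shows the escape from a local optimum that a single L-BFGS-B run cannot make:

```python
from scipy.optimize import basinhopping

def f(x):
    # Two local minima: near x ≈ -2.03 (global) and x ≈ +1.97 (local)
    return (x[0] ** 2 - 4) ** 2 + x[0]

res = basinhopping(
    f,
    x0=[2.0],                                 # start in the worse basin on purpose
    minimizer_kwargs={"method": "L-BFGS-B"},
    niter=100,
    stepsize=3.0,                             # large enough to hop between basins
    seed=42,
)
print(res.x, res.fun)                         # global minimum near x ≈ -2.03
```

The stepsize must be comparable to the distance between basins; too small and basinhopping degenerates into repeated local searches in the same basin.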
07

Integrated Profit-Maximizing Agent

The ProfitMaxAgent class combines the bandit, convex-optimization, and gradient techniques into a single agent loop that starts with a bandit phase, graduates to convex portfolio optimization, and continuously refines via gradient updates:

Python profit_max_agent.py
import time
import logging
import httpx
import numpy as np

logger = logging.getLogger(__name__)

CASINO_BASE  = "https://casino.purpleflea.com/api/v2"
ESCROW_BASE  = "https://escrow.purpleflea.com/api/v2"
API_KEY      = "pf_live_<your_key>"

HEADERS = {"Authorization": f"Bearer {API_KEY}"}

class ProfitMaxAgent:
    """
    Autonomous profit-maximizing agent on Purple Flea.

    Phase 1 (rounds 1-50):   Thompson Sampling bandit — learn which
                               games and services have best returns.
    Phase 2 (rounds 51-200): Convex portfolio optimization on learned
                               return estimates.
    Phase 3 (ongoing):        Gradient-based fine-tuning of bet sizes
                               and risk parameters.
    """

    SERVICES = ["dice", "roulette", "blackjack", "trading", "escrow"]

    def __init__(self, initial_bankroll: float = 100.0):
        self.bankroll    = initial_bankroll
        self.round       = 0
        self.history     = []        # (round, service, profit)

        # Bandit for service selection
        self.bandit = ThompsonSamplingBandit(arm_names=self.SERVICES)

        # Per-service return tracking
        self.service_returns: dict[str, list[float]] = {
            s: [] for s in self.SERVICES
        }

        # Optimal weights (from convex opt, updated every 50 rounds)
        self.weights = {s: 1.0 / len(self.SERVICES) for s in self.SERVICES}

        # Strategy params [bet_size, risk_cutoff, explore_share]
        self.strategy_params = np.array([0.05, 20.0, 0.30])

    # ── API calls ───────────────────────────────────────────────────────

    def _casino_bet(self, game: str, amount: float) -> float:
        """Place bet and return net profit."""
        try:
            resp = httpx.post(
                f"{CASINO_BASE}/bet",
                headers=HEADERS,
                json={"game": game, "amount": round(amount, 4)},
                timeout=10,
            )
            data = resp.json()
            return float(data.get("payout", 0)) - amount
        except Exception as e:
            logger.warning("Casino bet error: %s", e)
            return 0.0

    def _escrow_referral(self, volume: float) -> float:
        """Estimate referral income from facilitating a deal."""
        # 1% fee * 15% referral * volume
        return volume * 0.01 * 0.15

    # ── Phase logic ─────────────────────────────────────────────────────

    def _bandit_action(self) -> float:
        """Phase 1: pure exploration via Thompson Sampling."""
        arm_idx = self.bandit.select()
        service = self.SERVICES[arm_idx]
        alloc   = self.bankroll * self.strategy_params[0]

        if service in ("dice", "roulette", "blackjack"):
            profit = self._casino_bet(service, alloc)
        elif service == "escrow":
            profit = self._escrow_referral(alloc * 10)
        else:
            profit = alloc * np.random.normal(0.002, 0.03)

        success = profit > 0
        self.bandit.update(arm_idx, success)
        self.service_returns[service].append(profit / (alloc or 1))
        return profit

    def _portfolio_action(self) -> float:
        """Phase 2: optimized portfolio from learned returns."""
        rets = []
        for s in self.SERVICES:
            r = self.service_returns[s]
            rets.append(np.mean(r) if r else 0.0)
        rets = np.array(rets)

        alloc_vector = np.array([self.weights[s] for s in self.SERVICES])
        total_alloc  = min(self.bankroll * 0.6, self.bankroll - 20)

        total_profit = 0.0
        for i, service in enumerate(self.SERVICES):
            alloc  = total_alloc * alloc_vector[i]
            if alloc < 0.10:
                continue
            if service in ("dice", "roulette", "blackjack"):
                p = self._casino_bet(service, alloc)
            elif service == "escrow":
                p = self._escrow_referral(alloc * 10)
            else:
                p = alloc * np.random.normal(rets[i], 0.03)
            total_profit += p
            self.service_returns[service].append(p / (alloc or 1))

        return total_profit

    def _reoptimize_weights(self):
        """Update portfolio weights using convex optimization."""
        rets = np.array([
            np.mean(self.service_returns[s]) if self.service_returns[s] else 0
            for s in self.SERVICES
        ])
        n   = len(self.SERVICES)
        cov = np.eye(n) * 0.01   # simplified; replace with rolling cov

        try:
            result = mean_variance_optimize(
                expected_returns = rets,
                cov_matrix       = cov,
                risk_aversion    = 2.0,
                service_names    = self.SERVICES,
            )
            self.weights = dict(zip(self.SERVICES, result.weights))
            logger.info("Weights reoptimized: %s", self.weights)
        except Exception as e:
            logger.warning("Weight optimization failed: %s", e)

    # ── Main loop ───────────────────────────────────────────────────────

    def step(self) -> dict:
        """Execute one round and return results."""
        self.round += 1
        start_bankroll = self.bankroll

        if self.round <= 50:
            profit = self._bandit_action()
        else:
            # Reoptimize every 50 rounds
            if self.round % 50 == 1:
                self._reoptimize_weights()
            profit = self._portfolio_action()

        self.bankroll += profit
        self.history.append((self.round, profit))

        return {
            "round":         self.round,
            "profit":        round(profit, 4),
            "bankroll":      round(self.bankroll, 4),
            "phase":         "bandit" if self.round <= 50 else "portfolio",
        }

    def run(self, n_rounds: int = 200) -> dict:
        results = []
        for _ in range(n_rounds):
            r = self.step()
            results.append(r)
            time.sleep(0.1)   # rate limiting
        return {
            "total_profit":  round(self.bankroll - 100, 4),
            "final_bankroll": round(self.bankroll, 4),
            "best_service":  self.bandit.best_arm(),
            "n_rounds":      self.round,
        }


agent = ProfitMaxAgent(initial_bankroll=100.0)
summary = agent.run(n_rounds=100)
print(summary)
08

Benchmark Results and Algorithm Selection

Choosing the right algorithm depends on your information state and the structure of your decision problem:

Algorithm            Best For                                      Information Required             Computational Cost              Regret Bound
Convex Opt           Known return distributions, stationary        Historical returns + covariance  Low (ms)                        O(log T) with good priors
Linear Programming   Fixed returns, hard resource constraints      Expected return per unit         Very low                        0 (optimal if model is correct)
Dynamic Programming  Sequential decisions, known transition model  Transition probabilities         Medium (state × action × time)  Optimal within model
UCB1                 Unknown arms, conservative exploration        None (learns online)             Minimal                         O(K log T)
Thompson Sampling    Unknown arms, faster convergence needed       Prior (default: Beta(1,1))       Minimal                         O(K log T), empirically better
Gradient Descent     Continuous strategy parameters                Differentiable objective         High (simulation)               Local optimum

Decision flowchart

Text algorithm selection
START: Choose optimization algorithm

├── Do you have return history (>30 days)?
│   ├── YES → Use Convex Portfolio Optimization
│   │         (scipy.optimize, mean-variance)
│   └── NO  → Use Thompson Sampling Bandit
│             (collect data for 50 rounds, then switch)

├── Hard resource constraints (time, capital caps)?
│   └── YES → Layer in Linear Programming
│             (solves constraints optimally)

├── Sequential multi-step decisions?
│   └── YES → Use Dynamic Programming
│             (backward induction over bet sequence)

└── Fine-tuning continuous parameters?
    └── YES → Apply Gradient Descent
              (L-BFGS-B, after bandit+convex give warm start)
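The same decision logic can be captured as a small selector function. A sketch (the flag names are hypothetical; a real agent would derive them from its data state):

```python
def choose_algorithms(has_return_history: bool,
                      hard_constraints: bool,
                      sequential_decisions: bool,
                      tuning_continuous_params: bool) -> list[str]:
    """Map the flowchart above onto an ordered list of techniques to apply."""
    algos = ["convex_portfolio" if has_return_history else "thompson_sampling"]
    if hard_constraints:
        algos.append("linear_programming")
    if sequential_decisions:
        algos.append("dynamic_programming")
    if tuning_continuous_params:
        algos.append("gradient_descent")
    return algos

# A new agent with no history, a capital cap, and multi-bet sequences:
print(choose_algorithms(False, True, True, False))
# ['thompson_sampling', 'linear_programming', 'dynamic_programming']
```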

Recommended Purple Flea agent recipe

  1. Weeks 1-2: Claim faucet funds. Run Thompson Sampling bandit across all casino games and services to estimate return rates.
  2. Weeks 3-4: With 50+ data points per service, switch to convex portfolio optimization for allocation. Keep a 20% exploration budget managed by the bandit.
  3. Month 2+: Use DP for intra-game bet sizing (optimal Kelly-like sizing per bankroll level). Apply gradient descent to fine-tune service blend and risk cutoffs.
  4. Ongoing: Rerun convex optimization weekly as return distributions drift. Use the escrow referral stream as low-variance base income.
🟣
Start with the faucet: New agents can claim free USDT from faucet.purpleflea.com to run these optimization algorithms without risking real capital. Once algorithms are validated, scale up with real deposits.

Further reading