Research · Machine Learning

Reinforcement Learning for Financial AI Agents

March 6, 2026 · 22 min read · Purple Flea Research

Reinforcement learning (RL) represents one of the most powerful paradigms for building autonomous financial agents. Unlike supervised learning — where you need labeled outcomes — RL allows an agent to learn purely from interaction with an environment, maximizing cumulative reward through trial and error. For financial AI agents operating on platforms like Purple Flea, this translates to learning optimal betting, trading, and capital allocation strategies without explicit programming of rules.

This guide covers the full RL stack for finance: from the mathematical foundations of Markov Decision Processes, through classical Q-learning, up to modern Deep Q-Networks (DQN) and policy gradient methods. We finish with a complete Python DQN trading agent trained on Purple Flea's paper trading sandbox.

Why RL for finance? Financial markets are sequential decision problems with delayed, noisy rewards — exactly the class of problems RL was designed for. An RL agent can discover non-obvious strategies that no human would think to hard-code.


1. Markov Decision Process Fundamentals

Every RL problem is formalized as a Markov Decision Process (MDP): a tuple (S, A, P, R, γ) consisting of a state space S, an action space A, transition dynamics P(s′|s, a), a reward function R, and a discount factor γ ∈ [0, 1). The quantity the agent maximizes is the discounted return:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γᵏ R_{t+k+1}

The goal is to find a policy π(a|s) — a mapping from states to actions (or action probabilities) — that maximizes expected discounted return G_t. The Markov property states that the future is conditionally independent of the past given the present state: P(s'|s₀, a₀, ..., sₜ, aₜ) = P(s'|sₜ, aₜ).
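As a quick sanity check on the return definition, here is a short illustrative snippet (not from any Purple Flea library) that computes G_t for a finite reward sequence:

```python
def discounted_return(rewards, gamma: float) -> float:
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite episode."""
    return float(sum(gamma ** k * r for k, r in enumerate(rewards)))

# Three rewards with gamma = 0.9:
# 1 + 0.9*2 + 0.81*3 = 5.23
g = discounted_return([1.0, 2.0, 3.0], gamma=0.9)
```

Note how the third reward already counts for only 81% of its face value; smaller γ shrinks the agent's effective planning horizon further.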

Modeling Purple Flea as an MDP

When your agent operates on Purple Flea, every interaction can be cast as an MDP:

| Component | Casino Agent | Trading Agent |
|---|---|---|
| State | Balance, bet history, win/loss streak, RTP estimate | Price, volume, OHLCV features, wallet balance, open positions |
| Action | Bet size (0–100 units), game choice, cash out | Buy N units, Sell N units, Hold |
| Reward | Net PnL per bet (win − bet_size) | Unrealized + realized PnL delta |
| Transition | Stochastic (provably fair RNG) | Market-driven (partially observable) |
| γ (discount) | 0.95 (near-term focus) | 0.99 (long-horizon planning) |

Tip: For casino agents, the Markov property holds well — each round is nearly independent. For trading, you need rich state representations to approximate the Markov property in a non-Markovian market.
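The two discount factors in the table imply very different planning horizons. A common rule of thumb — a heuristic, not an exact result — is that rewards further than about 1/(1 − γ) steps away are discounted into irrelevance:

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb planning horizon: rewards beyond roughly
    1/(1 - gamma) steps contribute little to the return."""
    return 1.0 / (1.0 - gamma)

print(effective_horizon(0.95))  # ~20 steps: casino agent, near-term focus
print(effective_horizon(0.99))  # ~100 steps: trading agent, long horizon
```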

2. Q-Learning: The Foundation

Q-learning is the bedrock tabular RL algorithm. It learns a state-action value function Q(s, a) — the expected cumulative discounted reward from taking action a in state s, then following the optimal policy thereafter.

Q(s,a) ← Q(s,a) + α [R + γ max_{a'} Q(s',a') − Q(s,a)]

Key parameters: the learning rate α controls how far each update moves Q(s, a) toward the TD target; the discount factor γ weights future rewards; and the exploration rate ε, annealed from 1.0 toward a small floor, governs how often the agent takes a random action instead of the greedy one.

import numpy as np
from collections import defaultdict

class TabularQLearning:
    """Classical Q-learning for discrete state/action spaces."""

    def __init__(self, n_actions: int, alpha: float = 0.05,
                 gamma: float = 0.95, epsilon: float = 1.0,
                 epsilon_decay: float = 0.995, epsilon_min: float = 0.01):
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        # Q-table: defaultdict for sparse state spaces
        self.Q = defaultdict(lambda: np.zeros(n_actions))

    def choose_action(self, state: tuple) -> int:
        """Epsilon-greedy action selection."""
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[state]))

    def update(self, state: tuple, action: int, reward: float,
               next_state: tuple, done: bool):
        """Bellman update."""
        best_next = 0.0 if done else np.max(self.Q[next_state])
        td_target = reward + self.gamma * best_next
        td_error = td_target - self.Q[state][action]
        self.Q[state][action] += self.alpha * td_error

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min,
                           self.epsilon * self.epsilon_decay)

Limitations of Tabular Q-Learning

Tabular Q-learning becomes intractable for real financial environments. A trading agent with 50 price features, discretized into 10 levels each, faces 10^50 distinct states — impossible to enumerate, let alone visit often enough to learn accurate values. This motivates function approximation.
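The blowup is a simple back-of-envelope calculation — the table size is exponential in the number of features:

```python
def tabular_state_count(n_features: int, n_bins: int) -> int:
    """Number of distinct discretized states a Q-table must cover."""
    return n_bins ** n_features

print(tabular_state_count(2, 10))   # 100 states: trivially tabular
print(tabular_state_count(50, 10))  # 10**50 states: hopeless
```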

3. Deep Q-Networks (DQN)

Deep Q-Networks, introduced by DeepMind in 2013, replace the Q-table with a neural network parameterized by θ: Q(s, a; θ) ≈ Q*(s, a). Two key innovations make DQN stable: an experience replay buffer, which breaks the correlation between consecutive samples, and a periodically synced target network θ⁻, which holds the regression target fixed between updates. The loss is:

L(θ) = E[(R + γ max_{a'} Q(s',a';θ⁻) − Q(s,a;θ))²]

DQN Architecture for Financial Agents

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

class DQNNetwork(nn.Module):
    """Feed-forward DQN for financial state representations."""

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: tuple = (256, 256, 128)):
        super().__init__()
        layers = []
        in_dim = state_dim
        for h in hidden_dims:
            layers.extend([
                nn.Linear(in_dim, h),
                nn.LayerNorm(h),
                nn.ReLU(),
                nn.Dropout(0.1)
            ])
            in_dim = h
        layers.append(nn.Linear(in_dim, action_dim))
        self.net = nn.Sequential(*layers)
        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ReplayBuffer:
    """Circular replay buffer with uniform sampling."""

    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.FloatTensor(dones)
        )

    def __len__(self):
        return len(self.buffer)


class DQNAgent:
    """Full DQN agent with experience replay and target network."""

    def __init__(self, state_dim: int, action_dim: int,
                 lr: float = 1e-4, gamma: float = 0.99,
                 epsilon_start: float = 1.0,
                 epsilon_end: float = 0.01,
                 epsilon_decay: int = 10_000,
                 target_update_freq: int = 500,
                 batch_size: int = 64,
                 buffer_size: int = 50_000):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.target_update_freq = target_update_freq
        self.batch_size = batch_size
        self.steps = 0

        self.device = torch.device('cuda' if torch.cuda.is_available()
                                   else 'cpu')

        # Online and target networks
        self.online_net = DQNNetwork(state_dim, action_dim).to(self.device)
        self.target_net = DQNNetwork(state_dim, action_dim).to(self.device)
        self.target_net.load_state_dict(self.online_net.state_dict())
        self.target_net.eval()

        self.optimizer = optim.Adam(self.online_net.parameters(), lr=lr)
        self.replay_buffer = ReplayBuffer(buffer_size)
        self.loss_fn = nn.SmoothL1Loss()  # Huber loss for stability

    @property
    def epsilon(self) -> float:
        return self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
               np.exp(-self.steps / self.epsilon_decay)

    def act(self, state: np.ndarray, training: bool = True) -> int:
        if training and np.random.rand() < self.epsilon:
            return np.random.randint(self.action_dim)
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            q_vals = self.online_net(state_t)
            return int(q_vals.argmax(dim=1).item())

    def remember(self, state, action, reward, next_state, done):
        self.replay_buffer.push(state, action, reward, next_state, done)

    def learn(self) -> float | None:
        if len(self.replay_buffer) < self.batch_size:
            return None

        states, actions, rewards, next_states, dones = \
            self.replay_buffer.sample(self.batch_size)
        states = states.to(self.device)
        actions = actions.to(self.device)
        rewards = rewards.to(self.device)
        next_states = next_states.to(self.device)
        dones = dones.to(self.device)

        # Current Q values
        current_q = self.online_net(states).gather(1, actions.unsqueeze(1))

        # Double DQN: use online net to select action, target net to evaluate
        with torch.no_grad():
            online_next_actions = self.online_net(next_states).argmax(1, keepdim=True)
            target_next_q = self.target_net(next_states).gather(1, online_next_actions)
            target_q = rewards.unsqueeze(1) + \
                       self.gamma * target_next_q * (1 - dones.unsqueeze(1))

        loss = self.loss_fn(current_q, target_q)
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping prevents exploding gradients in finance
        nn.utils.clip_grad_norm_(self.online_net.parameters(), 10.0)
        self.optimizer.step()

        self.steps += 1
        if self.steps % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.online_net.state_dict())

        return loss.item()

4. Reward Shaping for Financial Goals

Raw profit as the reward signal sounds natural, but it creates several pathologies in financial RL: it is sparse and noisy, its scale drifts with bankroll size, and it is blind to risk — an agent can look profitable while taking drawdowns that would eventually wipe out the account. Shaping the reward addresses each of these.

Shaped Reward Function

def compute_reward(
    prev_portfolio_value: float,
    curr_portfolio_value: float,
    trade_cost: float,
    drawdown: float,
    max_drawdown_threshold: float = 0.20,
    risk_penalty_weight: float = 0.5,
    sharpe_window: list = None
) -> float:
    """
    Multi-component reward shaping for financial RL.
    Combines return, risk penalty, and Sharpe approximation.
    """
    # 1. Log return (normalized, handles scale variation)
    log_return = np.log(curr_portfolio_value / (prev_portfolio_value + 1e-8))

    # 2. Transaction cost penalty
    cost_penalty = -trade_cost / prev_portfolio_value

    # 3. Drawdown penalty (quadratic above threshold)
    if drawdown > max_drawdown_threshold:
        dd_penalty = -risk_penalty_weight * (drawdown ** 2)
    else:
        dd_penalty = 0.0

    # 4. Rolling Sharpe approximation (dense reward signal)
    sharpe_bonus = 0.0
    if sharpe_window and len(sharpe_window) >= 10:
        returns = np.array(sharpe_window[-20:])
        if returns.std() > 1e-8:
            sharpe_bonus = 0.1 * (returns.mean() / returns.std())

    return log_return + cost_penalty + dd_penalty + sharpe_bonus

Reward hacking warning: Poorly shaped rewards lead to degenerate strategies. An agent rewarded purely on Sharpe ratio may refuse to trade at all (zero variance = infinite Sharpe). Always include minimum activity requirements or use a combination of metrics.
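One way to implement the minimum activity requirement mentioned above — a sketch, where `min_trades` and `inactivity_penalty` are assumed hyperparameters rather than Purple Flea defaults:

```python
def activity_adjusted_reward(base_reward: float,
                             trades_in_window: int,
                             min_trades: int = 5,
                             inactivity_penalty: float = 0.01) -> float:
    """Subtract a small penalty when the agent trades less often than
    min_trades per window, so 'never trade' stops being a reward-hacking
    optimum. min_trades and inactivity_penalty are assumed values; tune
    them so the penalty is small relative to typical per-step returns."""
    if trades_in_window < min_trades:
        return base_reward - inactivity_penalty * (min_trades - trades_in_window)
    return base_reward
```

The penalty should stay mild: if it dominates the return term, the agent learns to churn trades purely to avoid it, which is just a different degenerate strategy.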

5. State Representation for Financial Environments

Feature engineering is often more impactful than architectural choices in financial RL. A good state vector captures price dynamics at multiple horizons, volatility, technical indicators, portfolio state (balance, drawdown, PnL), and temporal context — all normalized to comparable scales:

import numpy as np

class PurpleFleaStateBuilder:
    """Build normalized state vectors for Purple Flea trading/casino agents."""

    def __init__(self, lookback: int = 30):
        self.lookback = lookback
        self.price_history = []
        self.balance_history = []

    def push(self, price: float, balance: float):
        self.price_history.append(price)
        self.balance_history.append(balance)
        if len(self.price_history) > self.lookback * 2:
            self.price_history.pop(0)
            self.balance_history.pop(0)

    def _returns(self, prices: list, window: int) -> np.ndarray:
        arr = np.array(prices[-window-1:])
        return np.diff(arr) / (arr[:-1] + 1e-8)

    def _rsi(self, prices: list, period: int = 14) -> float:
        if len(prices) < period + 1:
            return 0.5
        deltas = np.diff(prices[-(period+1):])
        gains = np.where(deltas > 0, deltas, 0).mean()
        losses = np.where(deltas < 0, -deltas, 0).mean()
        if losses < 1e-8:
            return 1.0
        rs = gains / losses
        return 1 - 1 / (1 + rs)

    def build(self, current_balance: float,
              initial_balance: float = 1000.0) -> np.ndarray:
        """Returns a normalized state vector."""
        if len(self.price_history) < self.lookback + 1:
            return np.zeros(15)

        prices = self.price_history

        # Price returns at 1, 5, 20 step windows
        r1  = self._returns(prices, 1)[-1]   if len(prices) > 2  else 0.0
        r5  = self._returns(prices, 5).mean() if len(prices) > 6  else 0.0
        r20 = self._returns(prices, 20).mean() if len(prices) > 21 else 0.0

        # Volatility (std of 20-step returns, annualized proxy)
        vol20 = np.std(self._returns(prices, 20)) if len(prices) > 21 else 0.0

        # RSI
        rsi = self._rsi(prices)

        # Simple momentum: price vs 20-period SMA
        sma20 = np.mean(prices[-20:])
        momentum = (prices[-1] - sma20) / (sma20 + 1e-8)

        # Portfolio state (normalized)
        balance_ratio = current_balance / (initial_balance + 1e-8)
        drawdown = max(0, 1 - current_balance / max(self.balance_history[-50:] + [1.0]))
        pnl_pct = (current_balance - initial_balance) / (initial_balance + 1e-8)

        # Temporal (hour of day encoded as sin/cos)
        import datetime
        now = datetime.datetime.now(datetime.timezone.utc)
        hour_sin = np.sin(2 * np.pi * now.hour / 24)
        hour_cos = np.cos(2 * np.pi * now.hour / 24)
        dow_sin  = np.sin(2 * np.pi * now.weekday() / 7)
        dow_cos  = np.cos(2 * np.pi * now.weekday() / 7)

        state = np.array([
            r1, r5, r20, vol20, rsi,
            momentum, balance_ratio, drawdown, pnl_pct,
            hour_sin, hour_cos, dow_sin, dow_cos,
            np.clip(r1 / (vol20 + 1e-8), -3, 3),  # normalized return
            1.0 if balance_ratio > 1 else 0.0       # profitable flag
        ], dtype=np.float32)

        return np.clip(state, -5, 5)

6. Curriculum Learning: Casino → Trading

Curriculum learning is the practice of training an agent on progressively harder tasks. For financial agents, a natural curriculum exists:

| Stage | Environment | State Complexity | Reward Signal | Risk Level |
|---|---|---|---|---|
| 1. Novice | Purple Flea Faucet (free) | Low (balance, bet) | Dense (per-bet) | Zero |
| 2. Casino | Coin Flip / Dice (house edge 1–2%) | Medium (streak, RTP) | Dense | Low |
| 3. Paper Trade | Purple Flea sandbox API | High (30+ features) | Sparse | Zero capital |
| 4. Live Trade | Real Purple Flea markets | High | Real PnL | High |

The key insight: a casino agent learns basic concepts of risk/reward, bankroll management, and exploration vs exploitation with a dense, fast reward signal. These learned representations transfer to the harder sparse-reward trading task.

class CurriculumTrainer:
    """Progressively harder training environments for financial RL."""

    STAGES = ['faucet', 'casino', 'paper_trade', 'live_trade']
    PROMOTION_THRESHOLD = 1.2  # 20% balance growth to advance

    def __init__(self, agent: DQNAgent):
        self.agent = agent
        self.current_stage = 0
        self.stage_episodes = 0
        self.stage_returns = []

    def should_promote(self) -> bool:
        if self.stage_episodes < 100:
            return False
        recent = self.stage_returns[-50:]
        avg_return = np.mean(recent)
        return avg_return >= self.PROMOTION_THRESHOLD and self.current_stage < 3

    def promote(self):
        self.current_stage += 1
        self.stage_episodes = 0
        self.stage_returns = []
        stage_name = self.STAGES[self.current_stage]
        print(f"[Curriculum] Promoted to stage: {stage_name}")
        # Partially reset exploration for new environment
        self.agent.steps = max(0, self.agent.steps - 2000)

    def record_episode(self, total_return: float):
        self.stage_episodes += 1
        self.stage_returns.append(total_return)
        if self.should_promote():
            self.promote()

    @property
    def current_env_name(self) -> str:
        return self.STAGES[self.current_stage]

7. Policy Gradient Methods

DQN works for discrete action spaces, but financial trading often requires continuous actions: "buy $237.50 worth" rather than choosing from a fixed menu. Policy gradient methods — particularly Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) — are better suited here.

Proximal Policy Optimization (PPO)

PPO constrains policy updates to prevent destructively large steps — crucial in financial environments where a single bad update can cause an agent to blow up its account. With probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) and advantage estimate Â_t, the clipped surrogate objective is:

L_CLIP(θ) = E_t [min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)]

import torch
import torch.nn as nn
import torch.optim as optim

class ActorCritic(nn.Module):
    """Shared trunk with separate actor (policy) and critic (value) heads."""

    def __init__(self, state_dim: int, action_dim: int,
                 hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden),   nn.Tanh(),
        )
        # Actor: outputs mean of Gaussian policy (continuous actions)
        self.actor_mean = nn.Linear(hidden, action_dim)
        self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
        # Critic: outputs scalar value estimate
        self.critic = nn.Linear(hidden, 1)

    def forward(self, x):
        feat = self.trunk(x)
        mean = torch.tanh(self.actor_mean(feat))  # bounded [-1, 1]
        std  = self.actor_log_std.exp().clamp(1e-3, 1.0)
        value = self.critic(feat).squeeze(-1)
        return mean, std, value

    def get_action(self, state):
        mean, std, value = self(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)
        return action, log_prob, value


class PPOTrader:
    """Minimal PPO implementation for continuous-action trading."""

    def __init__(self, state_dim: int, action_dim: int,
                 clip_eps: float = 0.2, vf_coef: float = 0.5,
                 ent_coef: float = 0.01, lr: float = 3e-4,
                 n_epochs: int = 10):
        self.model = ActorCritic(state_dim, action_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.clip_eps = clip_eps
        self.vf_coef = vf_coef
        self.ent_coef = ent_coef
        self.n_epochs = n_epochs

    def update(self, rollout: dict) -> dict:
        """
        rollout contains: states, actions, log_probs, returns, advantages
        """
        states     = torch.FloatTensor(rollout['states'])
        actions    = torch.FloatTensor(rollout['actions'])
        old_lp     = torch.FloatTensor(rollout['log_probs'])
        returns    = torch.FloatTensor(rollout['returns'])
        advantages = torch.FloatTensor(rollout['advantages'])

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        total_loss_pg = 0.0
        for _ in range(self.n_epochs):
            mean, std, values = self.model(states)
            dist = torch.distributions.Normal(mean, std)
            new_lp = dist.log_prob(actions).sum(-1)
            entropy = dist.entropy().sum(-1).mean()

            # Policy loss (clipped surrogate objective)
            ratio = (new_lp - old_lp).exp()
            obj1  = ratio * advantages
            obj2  = ratio.clamp(1 - self.clip_eps, 1 + self.clip_eps) * advantages
            pg_loss = -torch.min(obj1, obj2).mean()

            # Value loss
            vf_loss = nn.functional.mse_loss(values, returns)

            # Combined loss
            loss = pg_loss + self.vf_coef * vf_loss - self.ent_coef * entropy

            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)
            self.optimizer.step()
            total_loss_pg += pg_loss.item()

        return {'pg_loss': total_loss_pg / self.n_epochs}

8. Purple Flea Environment Wrapper

The following environment class wraps Purple Flea's REST API into a Gym-compatible interface that any RL agent can train against:

import requests
import numpy as np

PURPLE_FLEA_API = "https://purpleflea.com/api"

class PurpleFleaTradingEnv:
    """
    OpenAI Gym-compatible wrapper for Purple Flea paper trading.
    Uses pf_live_<your_key> for authentication.
    """

    def __init__(self, api_key: str, initial_balance: float = 1000.0,
                 max_steps: int = 500, mode: str = 'paper'):
        self.api_key = api_key
        self.initial_balance = initial_balance
        self.max_steps = max_steps
        self.mode = mode
        self.state_builder = PurpleFleaStateBuilder(lookback=30)
        self.reset()

    @property
    def observation_space_dim(self) -> int:
        return 15

    @property
    def action_space(self) -> dict:
        # Discrete: 0=hold, 1=buy_small, 2=buy_large, 3=sell_small, 4=sell_large
        return {'n': 5, 'type': 'discrete'}

    def _get_price(self) -> float:
        """Fetch current price from Purple Flea API."""
        try:
            r = requests.get(
                f"{PURPLE_FLEA_API}/trading/ticker/PFBTC",
                headers={'Authorization': f'Bearer {self.api_key}'},
                timeout=3
            )
            return float(r.json()['price'])
        except Exception:
            return self.last_price

    def reset(self) -> np.ndarray:
        self.balance = self.initial_balance
        self.position = 0.0
        self.step_count = 0
        self.last_price = 50_000.0
        self.peak_balance = self.initial_balance
        self.state_builder = PurpleFleaStateBuilder(lookback=30)
        price = self._get_price()
        self.last_price = price
        self.state_builder.push(price, self.balance)
        return self.state_builder.build(self.balance, self.initial_balance)

    def step(self, action: int) -> tuple:
        """
        Actions:
          0 = hold
          1 = buy 5% of balance
          2 = buy 20% of balance
          3 = sell 25% of position
          4 = sell 100% of position (close)
        """
        price = self._get_price()
        trade_cost = 0.0

        if action == 1 and self.balance > 10:
            amount = self.balance * 0.05
            self.position += amount / price
            self.balance -= amount * 1.001  # 0.1% fee
            trade_cost = amount * 0.001

        elif action == 2 and self.balance > 10:
            amount = self.balance * 0.20
            self.position += amount / price
            self.balance -= amount * 1.001
            trade_cost = amount * 0.001

        elif action == 3 and self.position > 0:
            sell_qty = self.position * 0.25
            proceeds = sell_qty * price
            self.balance += proceeds * 0.999
            self.position -= sell_qty
            trade_cost = proceeds * 0.001

        elif action == 4 and self.position > 0:
            proceeds = self.position * price
            self.balance += proceeds * 0.999
            trade_cost = proceeds * 0.001
            self.position = 0.0

        portfolio_value = self.balance + self.position * price
        self.peak_balance = max(self.peak_balance, portfolio_value)
        drawdown = 1 - portfolio_value / self.peak_balance

        reward = compute_reward(
            prev_portfolio_value=self.last_price * self.position + self.balance,
            curr_portfolio_value=portfolio_value,
            trade_cost=trade_cost,
            drawdown=drawdown
        )

        self.last_price = price
        self.step_count += 1
        self.state_builder.push(price, self.balance)
        obs = self.state_builder.build(self.balance, self.initial_balance)

        done = (self.step_count >= self.max_steps or
                portfolio_value < self.initial_balance * 0.5)  # 50% stop
        info = {
            'portfolio_value': portfolio_value,
            'balance': self.balance,
            'position': self.position,
            'drawdown': drawdown,
            'step': self.step_count
        }
        return obs, reward, done, info

9. Full Training Loop

import matplotlib.pyplot as plt

def train_dqn_agent(
    api_key: str,
    n_episodes: int = 1000,
    save_path: str = 'dqn_trader.pt'
) -> dict:
    """Complete DQN training pipeline for Purple Flea trading."""

    env = PurpleFleaTradingEnv(api_key=api_key)
    agent = DQNAgent(
        state_dim=env.observation_space_dim,
        action_dim=env.action_space['n'],
        lr=1e-4,
        gamma=0.99,
        epsilon_start=1.0,
        epsilon_end=0.05,
        epsilon_decay=5_000,
        target_update_freq=200,
        batch_size=64,
        buffer_size=30_000
    )
    curriculum = CurriculumTrainer(agent)

    history = {
        'episode_returns': [],
        'portfolio_values': [],
        'losses': [],
        'epsilons': []
    }

    for ep in range(n_episodes):
        obs = env.reset()
        ep_reward = 0.0
        ep_losses = []

        while True:
            action = agent.act(obs, training=True)
            next_obs, reward, done, info = env.step(action)
            agent.remember(obs, action, reward, next_obs, float(done))
            loss = agent.learn()
            if loss is not None:
                ep_losses.append(loss)

            ep_reward += reward
            obs = next_obs
            if done:
                break

        final_pv = info['portfolio_value']
        ep_return = final_pv / env.initial_balance
        curriculum.record_episode(ep_return)

        history['episode_returns'].append(ep_reward)
        history['portfolio_values'].append(final_pv)
        history['losses'].append(np.mean(ep_losses) if ep_losses else 0)
        history['epsilons'].append(agent.epsilon)

        if (ep + 1) % 50 == 0:
            avg_pv = np.mean(history['portfolio_values'][-50:])
            print(f"Episode {ep+1:4d} | Avg Portfolio: ${avg_pv:.2f} | "
                  f"ε: {agent.epsilon:.3f} | Stage: {curriculum.current_env_name}")
            torch.save({
                'online_net': agent.online_net.state_dict(),
                'optimizer': agent.optimizer.state_dict(),
                'steps': agent.steps,
                'episode': ep
            }, save_path)

    return history


# Run training (replace with your actual API key)
# history = train_dqn_agent(api_key="pf_live_<your_key>", n_episodes=500)

10. Evaluation Metrics

Raw returns are insufficient for evaluating financial RL agents. Use these standard metrics:

def evaluate_agent(portfolio_values: list,
                   risk_free_rate: float = 0.0) -> dict:
    """Compute standard financial performance metrics."""
    pvs = np.array(portfolio_values)
    returns = np.diff(pvs) / pvs[:-1]

    # Annualized (assuming hourly steps)
    ann_factor = np.sqrt(365 * 24)

    sharpe = (returns.mean() / (returns.std() + 1e-8)) * ann_factor

    # Maximum drawdown
    peak = np.maximum.accumulate(pvs)
    drawdowns = (pvs - peak) / peak
    max_dd = drawdowns.min()

    # Calmar ratio
    ann_return = (pvs[-1] / pvs[0]) ** (365 * 24 / len(pvs)) - 1
    calmar = ann_return / (abs(max_dd) + 1e-8)

    # Win rate
    win_rate = (returns > 0).mean()

    # Profit factor
    gross_profit = returns[returns > 0].sum()
    gross_loss   = abs(returns[returns < 0].sum())
    profit_factor = gross_profit / (gross_loss + 1e-8)

    return {
        'final_value':     pvs[-1],
        'total_return':    (pvs[-1] / pvs[0] - 1) * 100,
        'sharpe_ratio':    sharpe,
        'max_drawdown':    max_dd * 100,
        'calmar_ratio':    calmar,
        'win_rate':        win_rate * 100,
        'profit_factor':   profit_factor,
        'n_steps':         len(pvs)
    }

| Metric | Poor | Acceptable | Excellent |
|---|---|---|---|
| Sharpe Ratio | < 0.5 | 0.5 – 1.5 | > 2.0 |
| Max Drawdown | > 30% | 10–30% | < 10% |
| Calmar Ratio | < 0.5 | 0.5 – 2.0 | > 3.0 |
| Win Rate | < 40% | 40–55% | > 55% |
| Profit Factor | < 1.0 | 1.0 – 1.5 | > 2.0 |

11. Advanced Techniques

Prioritized Experience Replay

Standard uniform replay wastes capacity on uninformative transitions. Prioritized replay samples experiences in proportion to their TD error — rare, high-error transitions get replayed more often. This is especially valuable in financial environments where market regime changes are infrequent but critical.
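A minimal proportional-prioritization sketch follows. The original paper uses a sum-tree for O(log n) sampling; this O(n) version is only meant to show the sampling rule and the importance-sampling correction:

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized replay (O(n) sampling sketch).
    alpha controls prioritization strength (0 = uniform);
    beta corrects the induced sampling bias via importance weights."""

    def __init__(self, capacity: int = 50_000, alpha: float = 0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.data, self.priorities = [], []
        self.pos = 0  # circular write pointer

    def push(self, transition, td_error: float = 1.0):
        p = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size: int, beta: float = 0.4):
        probs = np.array(self.priorities)
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights undo the non-uniform sampling
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + 1e-6) ** self.alpha
```

In the DQN learn step, the returned `weights` would multiply the per-sample Huber loss, and `update_priorities` would be called with the fresh TD errors after each batch.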

Distributional RL (C51 / QR-DQN)

Instead of learning E[G], distributional RL learns the full return distribution P(G). This is extremely valuable for risk-aware financial agents: an agent can be configured to maximize expected return while maintaining a hard constraint on the 5th percentile of outcomes (CVaR constraint).
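Given samples (or quantile estimates, as QR-DQN produces) from the predicted return distribution, the CVaR check itself is a one-liner — a sketch of the constraint, not a full distributional agent:

```python
import numpy as np

def cvar(returns: np.ndarray, alpha: float = 0.05) -> float:
    """Conditional Value-at-Risk: mean of the worst alpha-fraction of
    outcomes. A risk-aware agent can veto any action whose predicted
    return distribution has CVaR below a hard floor."""
    sorted_r = np.sort(returns)
    k = max(1, int(np.ceil(alpha * len(sorted_r))))
    return float(sorted_r[:k].mean())

samples = np.array([-0.30, -0.10, 0.0, 0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20])
# With alpha=0.10 and 10 samples, CVaR is the single worst outcome
risk = cvar(samples, alpha=0.10)
```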

Recurrent DQN (DRQN)

Markets are non-Markovian — past price history matters beyond what fits in a fixed-size feature vector. Replacing the feed-forward trunk with an LSTM allows the agent to maintain a hidden state representing "market memory" across arbitrarily long sequences.

class DRQNNetwork(nn.Module):
    """Deep Recurrent Q-Network with LSTM for non-Markovian markets."""

    def __init__(self, state_dim: int, action_dim: int,
                 lstm_hidden: int = 128, lstm_layers: int = 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU()
        )
        self.lstm = nn.LSTM(64, lstm_hidden, lstm_layers, batch_first=True)
        self.q_head = nn.Linear(lstm_hidden, action_dim)

    def forward(self, x: torch.Tensor,
                hidden=None) -> tuple:
        # x: (batch, seq_len, state_dim) for sequential input
        # or (batch, state_dim) for single-step inference
        if x.dim() == 2:
            x = x.unsqueeze(1)  # add seq dim
        enc = self.encoder(x)
        lstm_out, hidden = self.lstm(enc, hidden)
        q_vals = self.q_head(lstm_out[:, -1, :])
        return q_vals, hidden

12. Getting Started with Purple Flea

The fastest path to training a real RL agent on Purple Flea:

  1. Claim free tokens via the Faucet: faucet.purpleflea.com — zero risk, immediate balance to start training
  2. Get an API key: purpleflea.com/docs#api-key
  3. Install dependencies: pip install torch numpy requests
  4. Copy the DQN agent above and set api_key="pf_live_<your_key>"
  5. Run the training loop with n_episodes=100 to verify connectivity
  6. Monitor via MCP: Use the Purple Flea MCP server to call training/evaluation tools directly from your LLM agent

Connect via MCP: All Purple Flea services are available as MCP tools at purpleflea.com/mcp-six-services-complete-config. Your RL agent can call place_bet, trade, and get_balance directly from its tool loop.

Conclusion

Reinforcement learning provides a principled framework for building financial AI agents that improve through experience rather than explicit programming. The key takeaways: formalize the problem as an MDP with a carefully shaped reward; start tabular and move to DQN once the state space demands function approximation; reach for policy gradient methods when actions are continuous; train through a curriculum from dense-reward casino environments to sparse-reward trading; and evaluate with risk-adjusted metrics, never raw returns alone.

The full code for this guide is available via the Purple Flea research repository. Start with the Faucet to get free tokens for risk-free experimentation, then scale up to live trading as your agent demonstrates consistent performance in backtesting.