Reinforcement Learning for Financial AI Agents
Reinforcement learning (RL) represents one of the most powerful paradigms for building autonomous financial agents. Unlike supervised learning — where you need labeled outcomes — RL allows an agent to learn purely from interaction with an environment, maximizing cumulative reward through trial and error. For financial AI agents operating on platforms like Purple Flea, this translates to learning optimal betting, trading, and capital allocation strategies without explicit programming of rules.
This guide covers the full RL stack for finance: from the mathematical foundations of Markov Decision Processes, through classical Q-learning, up to modern Deep Q-Networks (DQN) and policy gradient methods. We finish with a complete Python DQN trading agent trained on Purple Flea's paper trading sandbox.
Why RL for finance? Financial markets are sequential decision problems with delayed, noisy rewards — exactly the class of problems RL was designed for. An RL agent can discover non-obvious strategies that no human would think to hard-code.
1. Markov Decision Process Fundamentals
Every RL problem is formalized as a Markov Decision Process (MDP). The core components are:
- State space S: All possible states the agent can observe
- Action space A: All decisions the agent can take
- Transition function T(s, a, s'): Probability of moving from state s to s' after action a
- Reward function R(s, a, s'): Scalar feedback signal
- Discount factor γ ∈ [0,1): How much future rewards are worth today
The goal is to find a policy π(a|s) — a mapping from states to actions (or action probabilities) — that maximizes expected discounted return G_t. The Markov property states that the future is conditionally independent of the past given the present state: P(s'|s₀, a₀, ..., sₜ, aₜ) = P(s'|sₜ, aₜ).
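The discounted return is computed by folding rewards back-to-front; a minimal sketch with illustrative values (the reward sequence and γ are made up for demonstration):

```python
# Discounted return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
def discounted_return(rewards, gamma=0.95):
    g = 0.0
    for r in reversed(rewards):      # fold from the final step backward
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward fold is the same recursion the Bellman equations exploit: the return at step t is the immediate reward plus γ times the return at step t+1.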
Modeling Purple Flea as an MDP
When your agent operates on Purple Flea, every interaction can be cast as an MDP:
| Component | Casino Agent | Trading Agent |
|---|---|---|
| State | Balance, bet history, win/loss streak, RTP estimate | Price, volume, OHLCV features, wallet balance, open positions |
| Action | Bet size (0–100 units), game choice, cash out | Buy N units, Sell N units, Hold |
| Reward | Net PnL per bet (win - bet_size) | Unrealized + realized PnL delta |
| Transition | Stochastic (provably fair RNG) | Market-driven (partially observable) |
| γ (discount) | 0.95 (near-term focus) | 0.99 (long-horizon planning) |
Tip: For casino agents, the Markov property holds well — each round is nearly independent. For trading, you need rich state representations to approximate the Markov property in a non-Markovian market.
2. Q-Learning: The Foundation
Q-learning is the bedrock tabular RL algorithm. It learns a state-action value function Q(s, a) — the expected cumulative discounted reward from taking action a in state s, then following the optimal policy thereafter.
Key parameters:
- α (learning rate): Step size for updates. Too high → oscillation. Too low → slow convergence. Typical: 0.01–0.1
- γ (discount): Balances immediate vs future reward. 0.99 for long-horizon trading.
- ε (exploration rate): Probability of random action (ε-greedy policy). Anneal from 1.0 → 0.01 over training.
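The update itself is one line of arithmetic. A worked example with illustrative values (α = 0.1, γ = 0.95; the Q-values are made up):

```python
alpha, gamma = 0.1, 0.95
q_sa = 2.0            # current estimate Q(s, a)
reward = 1.0          # observed reward
q_next_max = 3.0      # max over a' of Q(s', a')

td_target = reward + gamma * q_next_max   # 1.0 + 0.95 * 3.0 = 3.85
td_error = td_target - q_sa               # 3.85 - 2.0 = 1.85
q_sa += alpha * td_error                  # 2.0 + 0.1 * 1.85
print(q_sa)                               # ≈ 2.185
```

The small step size means each observation nudges the estimate only slightly, which is what lets Q-learning average over stochastic rewards.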
import numpy as np
from collections import defaultdict
class TabularQLearning:
"""Classical Q-learning for discrete state/action spaces."""
def __init__(self, n_actions: int, alpha: float = 0.05,
gamma: float = 0.95, epsilon: float = 1.0,
epsilon_decay: float = 0.995, epsilon_min: float = 0.01):
self.n_actions = n_actions
self.alpha = alpha
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
# Q-table: defaultdict for sparse state spaces
self.Q = defaultdict(lambda: np.zeros(n_actions))
def choose_action(self, state: tuple) -> int:
"""Epsilon-greedy action selection."""
if np.random.rand() < self.epsilon:
return np.random.randint(self.n_actions)
return int(np.argmax(self.Q[state]))
def update(self, state: tuple, action: int, reward: float,
next_state: tuple, done: bool):
"""Bellman update."""
best_next = 0.0 if done else np.max(self.Q[next_state])
td_target = reward + self.gamma * best_next
td_error = td_target - self.Q[state][action]
self.Q[state][action] += self.alpha * td_error
def decay_epsilon(self):
self.epsilon = max(self.epsilon_min,
self.epsilon * self.epsilon_decay)
Limitations of Tabular Q-Learning
Tabular Q-learning becomes intractable for real financial environments. A trading agent with 50 price features, each discretized into 10 levels, faces 10^50 distinct states — far too many to enumerate, let alone visit. This motivates function approximation.
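The combinatorics can be checked directly:

```python
levels, features = 10, 50
n_states = levels ** features     # one table row per joint discretized state
print(n_states)                   # 10**50 — no table fits this
```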
3. Deep Q-Networks (DQN)
Deep Q-Networks, introduced by DeepMind in 2013, replace the Q-table with a neural network parameterized by θ: Q(s, a; θ) ≈ Q*(s, a). Two key innovations make DQN stable:
- Experience Replay: Store transitions (s, a, r, s') in a replay buffer. Sample random mini-batches to break temporal correlations and improve sample efficiency.
- Target Network: A second network Q(s,a;θ⁻) with frozen weights, updated periodically. Prevents the moving-target problem where both the prediction and the target shift simultaneously.
DQN Architecture for Financial Agents
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random
class DQNNetwork(nn.Module):
"""Feed-forward DQN for financial state representations."""
def __init__(self, state_dim: int, action_dim: int,
hidden_dims: list = [256, 256, 128]):
super().__init__()
layers = []
in_dim = state_dim
for h in hidden_dims:
layers.extend([
nn.Linear(in_dim, h),
nn.LayerNorm(h),
nn.ReLU(),
nn.Dropout(0.1)
])
in_dim = h
layers.append(nn.Linear(in_dim, action_dim))
self.net = nn.Sequential(*layers)
self._init_weights()
def _init_weights(self):
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
nn.init.zeros_(m.bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.net(x)
class ReplayBuffer:
"""Circular replay buffer with uniform sampling."""
def __init__(self, capacity: int = 50_000):
self.buffer = deque(maxlen=capacity)
def push(self, state, action, reward, next_state, done):
self.buffer.append((state, action, reward, next_state, done))
def sample(self, batch_size: int):
batch = random.sample(self.buffer, batch_size)
states, actions, rewards, next_states, dones = zip(*batch)
return (
torch.FloatTensor(np.array(states)),
torch.LongTensor(actions),
torch.FloatTensor(rewards),
torch.FloatTensor(np.array(next_states)),
torch.FloatTensor(dones)
)
def __len__(self):
return len(self.buffer)
class DQNAgent:
"""Full DQN agent with experience replay and target network."""
def __init__(self, state_dim: int, action_dim: int,
lr: float = 1e-4, gamma: float = 0.99,
epsilon_start: float = 1.0,
epsilon_end: float = 0.01,
epsilon_decay: int = 10_000,
target_update_freq: int = 500,
batch_size: int = 64,
buffer_size: int = 50_000):
self.action_dim = action_dim
self.gamma = gamma
self.epsilon_start = epsilon_start
self.epsilon_end = epsilon_end
self.epsilon_decay = epsilon_decay
self.target_update_freq = target_update_freq
self.batch_size = batch_size
self.steps = 0
self.device = torch.device('cuda' if torch.cuda.is_available()
else 'cpu')
# Online and target networks
self.online_net = DQNNetwork(state_dim, action_dim).to(self.device)
self.target_net = DQNNetwork(state_dim, action_dim).to(self.device)
self.target_net.load_state_dict(self.online_net.state_dict())
self.target_net.eval()
self.optimizer = optim.Adam(self.online_net.parameters(), lr=lr)
self.replay_buffer = ReplayBuffer(buffer_size)
self.loss_fn = nn.SmoothL1Loss() # Huber loss for stability
@property
def epsilon(self) -> float:
return self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
np.exp(-self.steps / self.epsilon_decay)
def act(self, state: np.ndarray, training: bool = True) -> int:
if training and np.random.rand() < self.epsilon:
return np.random.randint(self.action_dim)
with torch.no_grad():
state_t = torch.FloatTensor(state).unsqueeze(0).to(self.device)
q_vals = self.online_net(state_t)
return int(q_vals.argmax(dim=1).item())
def remember(self, state, action, reward, next_state, done):
self.replay_buffer.push(state, action, reward, next_state, done)
def learn(self) -> float | None:
if len(self.replay_buffer) < self.batch_size:
return None
states, actions, rewards, next_states, dones = \
self.replay_buffer.sample(self.batch_size)
states = states.to(self.device)
actions = actions.to(self.device)
rewards = rewards.to(self.device)
next_states = next_states.to(self.device)
dones = dones.to(self.device)
# Current Q values
current_q = self.online_net(states).gather(1, actions.unsqueeze(1))
# Double DQN: use online net to select action, target net to evaluate
with torch.no_grad():
online_next_actions = self.online_net(next_states).argmax(1, keepdim=True)
target_next_q = self.target_net(next_states).gather(1, online_next_actions)
target_q = rewards.unsqueeze(1) + \
self.gamma * target_next_q * (1 - dones.unsqueeze(1))
loss = self.loss_fn(current_q, target_q)
self.optimizer.zero_grad()
loss.backward()
# Gradient clipping prevents exploding gradients in finance
nn.utils.clip_grad_norm_(self.online_net.parameters(), 10.0)
self.optimizer.step()
self.steps += 1
if self.steps % self.target_update_freq == 0:
self.target_net.load_state_dict(self.online_net.state_dict())
return loss.item()
4. Reward Shaping for Financial Goals
Raw profit as the reward signal sounds natural, but it creates several pathologies in financial RL:
- Reward sparsity: In trending markets, an agent can hold for many steps and receive zero reward until a trade closes.
- Scale instability: A 1 BTC trade and a 0.001 BTC trade have wildly different reward magnitudes.
- Risk blindness: Pure PnL maximization ignores variance, drawdown, and ruin risk.
Shaped Reward Function
def compute_reward(
prev_portfolio_value: float,
curr_portfolio_value: float,
trade_cost: float,
drawdown: float,
max_drawdown_threshold: float = 0.20,
risk_penalty_weight: float = 0.5,
sharpe_window: list = None
) -> float:
"""
Multi-component reward shaping for financial RL.
Combines return, risk penalty, and Sharpe approximation.
"""
# 1. Log return (normalized, handles scale variation)
log_return = np.log(curr_portfolio_value / (prev_portfolio_value + 1e-8))
# 2. Transaction cost penalty
cost_penalty = -trade_cost / prev_portfolio_value
# 3. Drawdown penalty (quadratic above threshold)
if drawdown > max_drawdown_threshold:
dd_penalty = -risk_penalty_weight * (drawdown ** 2)
else:
dd_penalty = 0.0
# 4. Rolling Sharpe approximation (dense reward signal)
sharpe_bonus = 0.0
if sharpe_window and len(sharpe_window) >= 10:
returns = np.array(sharpe_window[-20:])
if returns.std() > 1e-8:
sharpe_bonus = 0.1 * (returns.mean() / returns.std())
return log_return + cost_penalty + dd_penalty + sharpe_bonus
Reward hacking warning: Poorly shaped rewards lead to degenerate strategies. An agent rewarded purely on Sharpe ratio may refuse to trade at all: a near-constant return series drives the ratio's denominator toward zero, so any tiny positive drift produces an enormous score (and exactly zero variance makes the ratio undefined). Always include a minimum-activity requirement or combine several metrics.
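The degenerate case is easy to reproduce, and so is one mitigation. A sketch — the activity-penalty term and its weights are illustrative additions, not part of compute_reward above:

```python
import numpy as np

returns = np.zeros(20)                 # an agent that never trades
std = returns.std()
sharpe_only = 0.0 if std < 1e-8 else returns.mean() / std
print(sharpe_only)                     # 0.0 — inactivity is never punished

# Illustrative fix: penalize inactivity so "do nothing" is not a fixed point
n_trades, min_trades = 0, 5
activity_penalty = -0.01 * max(0, min_trades - n_trades)
print(sharpe_only + activity_penalty)  # negative reward for sitting idle
```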
5. State Representation for Financial Environments
Feature engineering is often more impactful than architectural choices in financial RL. A good state vector captures:
- Price features: Returns at multiple scales (1m, 5m, 1h, 1d), volatility, RSI, MACD
- Portfolio state: Current balance, position size, unrealized PnL, available margin
- Market microstructure: Bid-ask spread, order book imbalance, recent volume
- Temporal features: Hour of day, day of week (market regimes vary by time)
import numpy as np
class PurpleFleaStateBuilder:
"""Build normalized state vectors for Purple Flea trading/casino agents."""
def __init__(self, lookback: int = 30):
self.lookback = lookback
self.price_history = []
self.balance_history = []
def push(self, price: float, balance: float):
self.price_history.append(price)
self.balance_history.append(balance)
if len(self.price_history) > self.lookback * 2:
self.price_history.pop(0)
self.balance_history.pop(0)
def _returns(self, prices: list, window: int) -> np.ndarray:
arr = np.array(prices[-window-1:])
return np.diff(arr) / (arr[:-1] + 1e-8)
def _rsi(self, prices: list, period: int = 14) -> float:
if len(prices) < period + 1:
return 0.5
deltas = np.diff(prices[-(period+1):])
gains = np.where(deltas > 0, deltas, 0).mean()
losses = np.where(deltas < 0, -deltas, 0).mean()
if losses < 1e-8:
return 1.0
rs = gains / losses
return 1 - 1 / (1 + rs)
def build(self, current_balance: float,
initial_balance: float = 1000.0) -> np.ndarray:
"""Returns a normalized state vector."""
if len(self.price_history) < self.lookback + 1:
return np.zeros(15)
prices = self.price_history
# Price returns at 1, 5, 20 step windows
r1 = self._returns(prices, 1)[-1] if len(prices) > 2 else 0.0
r5 = self._returns(prices, 5).mean() if len(prices) > 6 else 0.0
r20 = self._returns(prices, 20).mean() if len(prices) > 21 else 0.0
# Volatility (std of 20-step returns, annualized proxy)
vol20 = np.std(self._returns(prices, 20)) if len(prices) > 21 else 0.0
# RSI
rsi = self._rsi(prices)
# Simple momentum: price vs 20-period SMA
sma20 = np.mean(prices[-20:])
momentum = (prices[-1] - sma20) / (sma20 + 1e-8)
# Portfolio state (normalized)
balance_ratio = current_balance / (initial_balance + 1e-8)
drawdown = max(0, 1 - current_balance / max(self.balance_history[-50:] + [1.0]))
pnl_pct = (current_balance - initial_balance) / (initial_balance + 1e-8)
# Temporal (hour of day encoded as sin/cos)
import datetime
now = datetime.datetime.now(datetime.timezone.utc)  # timezone-aware; utcnow() is deprecated
hour_sin = np.sin(2 * np.pi * now.hour / 24)
hour_cos = np.cos(2 * np.pi * now.hour / 24)
dow_sin = np.sin(2 * np.pi * now.weekday() / 7)
dow_cos = np.cos(2 * np.pi * now.weekday() / 7)
state = np.array([
r1, r5, r20, vol20, rsi,
momentum, balance_ratio, drawdown, pnl_pct,
hour_sin, hour_cos, dow_sin, dow_cos,
np.clip(r1 / (vol20 + 1e-8), -3, 3), # normalized return
1.0 if balance_ratio > 1 else 0.0 # profitable flag
], dtype=np.float32)
return np.clip(state, -5, 5)
6. Curriculum Learning: Casino → Trading
Curriculum learning is the practice of training an agent on progressively harder tasks. For financial agents, a natural curriculum exists:
| Stage | Environment | State Complexity | Reward Signal | Risk Level |
|---|---|---|---|---|
| 1. Novice | Purple Flea Faucet (free) | Low (balance, bet) | Dense (per-bet) | Zero |
| 2. Casino | Coin Flip / Dice (house edge 1–2%) | Medium (streak, RTP) | Dense | Low |
| 3. Paper Trade | Purple Flea sandbox API | High (30+ features) | Sparse | Zero capital |
| 4. Live Trade | Real Purple Flea markets | High | Real PnL | High |
The key insight: a casino agent learns basic concepts of risk/reward, bankroll management, and exploration vs exploitation with a dense, fast reward signal. These learned representations transfer to the harder sparse-reward trading task.
class CurriculumTrainer:
"""Progressively harder training environments for financial RL."""
STAGES = ['faucet', 'casino', 'paper_trade', 'live_trade']
PROMOTION_THRESHOLD = 1.2 # 20% balance growth to advance
def __init__(self, agent: DQNAgent):
self.agent = agent
self.current_stage = 0
self.stage_episodes = 0
self.stage_returns = []
def should_promote(self) -> bool:
if self.stage_episodes < 100:
return False
recent = self.stage_returns[-50:]
avg_return = np.mean(recent)
return avg_return >= self.PROMOTION_THRESHOLD and self.current_stage < 3
def promote(self):
self.current_stage += 1
self.stage_episodes = 0
self.stage_returns = []
stage_name = self.STAGES[self.current_stage]
print(f"[Curriculum] Promoted to stage: {stage_name}")
# Partially reset exploration for new environment
self.agent.steps = max(0, self.agent.steps - 2000)
def record_episode(self, total_return: float):
self.stage_episodes += 1
self.stage_returns.append(total_return)
if self.should_promote():
self.promote()
@property
def current_env_name(self) -> str:
return self.STAGES[self.current_stage]
7. Policy Gradient Methods
DQN works for discrete action spaces, but financial trading often requires continuous actions: "buy $237.50 worth" rather than choosing from a fixed menu. Policy gradient methods — particularly Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) — are better suited here.
Proximal Policy Optimization (PPO)
PPO constrains policy updates to prevent destructively large steps — crucial in financial environments where a single bad update can cause an agent to blow up its account:
import torch
import torch.nn as nn
import torch.optim as optim
class ActorCritic(nn.Module):
"""Shared trunk with separate actor (policy) and critic (value) heads."""
def __init__(self, state_dim: int, action_dim: int,
hidden: int = 256):
super().__init__()
self.trunk = nn.Sequential(
nn.Linear(state_dim, hidden), nn.Tanh(),
nn.Linear(hidden, hidden), nn.Tanh(),
)
# Actor: outputs mean of Gaussian policy (continuous actions)
self.actor_mean = nn.Linear(hidden, action_dim)
self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
# Critic: outputs scalar value estimate
self.critic = nn.Linear(hidden, 1)
def forward(self, x):
feat = self.trunk(x)
mean = torch.tanh(self.actor_mean(feat)) # bounded [-1, 1]
std = self.actor_log_std.exp().clamp(1e-3, 1.0)
value = self.critic(feat).squeeze(-1)
return mean, std, value
def get_action(self, state):
mean, std, value = self(state)
dist = torch.distributions.Normal(mean, std)
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)
return action, log_prob, value
class PPOTrader:
"""Minimal PPO implementation for continuous-action trading."""
def __init__(self, state_dim: int, action_dim: int,
clip_eps: float = 0.2, vf_coef: float = 0.5,
ent_coef: float = 0.01, lr: float = 3e-4,
n_epochs: int = 10):
self.model = ActorCritic(state_dim, action_dim)
self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
self.clip_eps = clip_eps
self.vf_coef = vf_coef
self.ent_coef = ent_coef
self.n_epochs = n_epochs
def update(self, rollout: dict) -> dict:
"""
rollout contains: states, actions, log_probs, returns, advantages
"""
states = torch.FloatTensor(rollout['states'])
actions = torch.FloatTensor(rollout['actions'])
old_lp = torch.FloatTensor(rollout['log_probs'])
returns = torch.FloatTensor(rollout['returns'])
advantages = torch.FloatTensor(rollout['advantages'])
# Normalize advantages
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
total_loss_pg = 0.0
for _ in range(self.n_epochs):
mean, std, values = self.model(states)
dist = torch.distributions.Normal(mean, std)
new_lp = dist.log_prob(actions).sum(-1)
entropy = dist.entropy().sum(-1).mean()
# Policy loss (clipped surrogate objective)
ratio = (new_lp - old_lp).exp()
obj1 = ratio * advantages
obj2 = ratio.clamp(1 - self.clip_eps, 1 + self.clip_eps) * advantages
pg_loss = -torch.min(obj1, obj2).mean()
# Value loss
vf_loss = nn.functional.mse_loss(values, returns)
# Combined loss
loss = pg_loss + self.vf_coef * vf_loss - self.ent_coef * entropy
self.optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)
self.optimizer.step()
total_loss_pg += pg_loss.item()
return {'pg_loss': total_loss_pg / self.n_epochs}
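The update above consumes rollout['advantages'] without showing how they are produced. A common choice is Generalized Advantage Estimation (GAE); a minimal numpy sketch — the function name, λ value, and toy inputs are our own illustrations, not part of the PPOTrader class:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """GAE(γ, λ): advantages as exponentially weighted sums of TD residuals."""
    advantages = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]                       # stop bootstrapping at episode end
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.asarray(values)       # regression targets for the critic
    return advantages, returns

adv, ret = compute_gae(rewards=[1.0, 0.0, 1.0],
                       values=[0.5, 0.5, 0.5],
                       dones=[0.0, 0.0, 1.0])
```

λ interpolates between one-step TD (λ = 0, low variance, high bias) and full Monte Carlo returns (λ = 1, the reverse); 0.95 is a common default.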
8. Purple Flea Environment Wrapper
The following environment class wraps Purple Flea's REST API into a Gym-compatible interface that any RL agent can train against:
import requests
import numpy as np
PURPLE_FLEA_API = "https://purpleflea.com/api"
class PurpleFleaTradingEnv:
"""
OpenAI Gym-compatible wrapper for Purple Flea paper trading.
Uses pf_live_<your_key> for authentication.
"""
def __init__(self, api_key: str, initial_balance: float = 1000.0,
max_steps: int = 500, mode: str = 'paper'):
self.api_key = api_key
self.initial_balance = initial_balance
self.max_steps = max_steps
self.mode = mode
self.state_builder = PurpleFleaStateBuilder(lookback=30)
self.reset()
@property
def observation_space_dim(self) -> int:
return 15
@property
def action_space(self) -> dict:
# Discrete: 0=hold, 1=buy_small, 2=buy_large, 3=sell_small, 4=sell_large
return {'n': 5, 'type': 'discrete'}
def _get_price(self) -> float:
"""Fetch current price from Purple Flea API."""
try:
r = requests.get(
f"{PURPLE_FLEA_API}/trading/ticker/PFBTC",
headers={'Authorization': f'Bearer {self.api_key}'},
timeout=3
)
return float(r.json()['price'])
except Exception:
return self.last_price
def reset(self) -> np.ndarray:
self.balance = self.initial_balance
self.position = 0.0
self.step_count = 0
self.last_price = 50_000.0
self.peak_balance = self.initial_balance
self.state_builder = PurpleFleaStateBuilder(lookback=30)
price = self._get_price()
self.last_price = price
self.state_builder.push(price, self.balance)
return self.state_builder.build(self.balance, self.initial_balance)
def step(self, action: int) -> tuple:
"""
Actions:
0 = hold
1 = buy 5% of balance
2 = buy 20% of balance
3 = sell 25% of position
4 = sell 100% of position (close)
"""
# Capture the pre-trade portfolio value at the previous price,
# before balance/position are mutated below
prev_value = self.balance + self.position * self.last_price
price = self._get_price()
trade_cost = 0.0
if action == 1 and self.balance > 10:
    amount = self.balance * 0.05
    self.position += amount / price
    self.balance -= amount * 1.001  # 0.1% fee
    trade_cost = amount * 0.001
elif action == 2 and self.balance > 10:
    amount = self.balance * 0.20
    self.position += amount / price
    self.balance -= amount * 1.001
    trade_cost = amount * 0.001
elif action == 3 and self.position > 0:
    sell_qty = self.position * 0.25
    proceeds = sell_qty * price
    self.balance += proceeds * 0.999
    self.position -= sell_qty
    trade_cost = proceeds * 0.001
elif action == 4 and self.position > 0:
    proceeds = self.position * price
    self.balance += proceeds * 0.999
    trade_cost = proceeds * 0.001
    self.position = 0.0
portfolio_value = self.balance + self.position * price
self.peak_balance = max(self.peak_balance, portfolio_value)
drawdown = 1 - portfolio_value / self.peak_balance
reward = compute_reward(
    prev_portfolio_value=prev_value,
    curr_portfolio_value=portfolio_value,
    trade_cost=trade_cost,
    drawdown=drawdown
)
self.last_price = price
self.step_count += 1
self.state_builder.push(price, self.balance)
obs = self.state_builder.build(self.balance, self.initial_balance)
done = (self.step_count >= self.max_steps or
portfolio_value < self.initial_balance * 0.5) # 50% stop
info = {
'portfolio_value': portfolio_value,
'balance': self.balance,
'position': self.position,
'drawdown': drawdown,
'step': self.step_count
}
return obs, reward, done, info
9. Full Training Loop
import matplotlib.pyplot as plt
def train_dqn_agent(
api_key: str,
n_episodes: int = 1000,
save_path: str = 'dqn_trader.pt'
) -> dict:
"""Complete DQN training pipeline for Purple Flea trading."""
env = PurpleFleaTradingEnv(api_key=api_key)
agent = DQNAgent(
state_dim=env.observation_space_dim,
action_dim=env.action_space['n'],
lr=1e-4,
gamma=0.99,
epsilon_start=1.0,
epsilon_end=0.05,
epsilon_decay=5_000,
target_update_freq=200,
batch_size=64,
buffer_size=30_000
)
curriculum = CurriculumTrainer(agent)
history = {
'episode_returns': [],
'portfolio_values': [],
'losses': [],
'epsilons': []
}
for ep in range(n_episodes):
obs = env.reset()
ep_reward = 0.0
ep_losses = []
while True:
action = agent.act(obs, training=True)
next_obs, reward, done, info = env.step(action)
agent.remember(obs, action, reward, next_obs, float(done))
loss = agent.learn()
if loss is not None:
ep_losses.append(loss)
ep_reward += reward
obs = next_obs
if done:
break
final_pv = info['portfolio_value']
ep_return = final_pv / env.initial_balance
curriculum.record_episode(ep_return)
history['episode_returns'].append(ep_reward)
history['portfolio_values'].append(final_pv)
history['losses'].append(np.mean(ep_losses) if ep_losses else 0)
history['epsilons'].append(agent.epsilon)
if (ep + 1) % 50 == 0:
avg_pv = np.mean(history['portfolio_values'][-50:])
print(f"Episode {ep+1:4d} | Avg Portfolio: ${avg_pv:.2f} | "
f"ε: {agent.epsilon:.3f} | Stage: {curriculum.current_env_name}")
torch.save({
'online_net': agent.online_net.state_dict(),
'optimizer': agent.optimizer.state_dict(),
'steps': agent.steps,
'episode': ep
}, save_path)
return history
# Run training (replace with your actual API key)
# history = train_dqn_agent(api_key="pf_live_<your_key>", n_episodes=500)
10. Evaluation Metrics
Raw returns are insufficient for evaluating financial RL agents. Use these standard metrics:
def evaluate_agent(portfolio_values: list,
risk_free_rate: float = 0.0) -> dict:
"""Compute standard financial performance metrics."""
pvs = np.array(portfolio_values)
returns = np.diff(pvs) / pvs[:-1]
# Annualized (assuming hourly steps)
ann_factor = np.sqrt(365 * 24)
sharpe = (returns.mean() / (returns.std() + 1e-8)) * ann_factor
# Maximum drawdown
peak = np.maximum.accumulate(pvs)
drawdowns = (pvs - peak) / peak
max_dd = drawdowns.min()
# Calmar ratio
ann_return = (pvs[-1] / pvs[0]) ** (365 * 24 / len(pvs)) - 1
calmar = ann_return / (abs(max_dd) + 1e-8)
# Win rate
win_rate = (returns > 0).mean()
# Profit factor
gross_profit = returns[returns > 0].sum()
gross_loss = abs(returns[returns < 0].sum())
profit_factor = gross_profit / (gross_loss + 1e-8)
return {
'final_value': pvs[-1],
'total_return': (pvs[-1] / pvs[0] - 1) * 100,
'sharpe_ratio': sharpe,
'max_drawdown': max_dd * 100,
'calmar_ratio': calmar,
'win_rate': win_rate * 100,
'profit_factor': profit_factor,
'n_steps': len(pvs)
}
| Metric | Poor | Acceptable | Excellent |
|---|---|---|---|
| Sharpe Ratio | < 0.5 | 0.5 – 1.5 | > 2.0 |
| Max Drawdown | > 30% | 10–30% | < 10% |
| Calmar Ratio | < 0.5 | 0.5 – 2.0 | > 3.0 |
| Win Rate | < 40% | 40–55% | > 55% |
| Profit Factor | < 1.0 | 1.0 – 1.5 | > 2.0 |
11. Advanced Techniques
Prioritized Experience Replay
Standard uniform replay wastes capacity on uninformative transitions. Prioritized replay samples experiences in proportion to their TD error — rare, high-error transitions get replayed more often. This is especially valuable in financial environments where market regime changes are infrequent but critical.
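Proportional prioritization can be sketched in a few lines of numpy — a toy illustration of the sampling rule only, not a drop-in replacement for the ReplayBuffer above (α here is the priority exponent, unrelated to the learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)
td_errors = np.array([0.01, 0.02, 2.0, 0.05])   # one rare, high-error transition
alpha = 0.6                                      # priority exponent
priorities = (np.abs(td_errors) + 1e-6) ** alpha # small constant keeps p > 0
probs = priorities / priorities.sum()

# The high-error transition dominates the sampling distribution
idx = rng.choice(len(td_errors), size=1000, p=probs)
print(np.bincount(idx, minlength=4) / 1000)

# Importance-sampling weights correct the induced bias (β is annealed 0.4 → 1.0)
beta = 0.4
weights = (len(td_errors) * probs) ** (-beta)
weights /= weights.max()
```

Production implementations store priorities in a sum-tree so that sampling and priority updates are O(log N) rather than O(N).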
Distributional RL (C51 / QR-DQN)
Instead of learning E[G], distributional RL learns the full return distribution P(G). This is extremely valuable for risk-aware financial agents: an agent can be configured to maximize expected return while maintaining a hard constraint on the 5th percentile of outcomes (CVaR constraint).
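The tail-risk measure mentioned above is straightforward to compute once you have samples from the return distribution; a sketch with simulated (illustrative) per-step returns:

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.001, scale=0.02, size=100_000)  # simulated returns

var_5 = np.percentile(samples, 5)          # 5th-percentile return (VaR)
cvar_5 = samples[samples <= var_5].mean()  # mean of the worst 5% (CVaR)
print(var_5, cvar_5)

# CVaR is never better than VaR: it averages only the tail beyond it
assert cvar_5 <= var_5
```

A distributional agent would apply the same calculation to its learned return distribution at decision time, rejecting actions whose CVaR breaches the configured limit.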
Recurrent DQN (DRQN)
Markets are non-Markovian — past price history matters beyond what fits in a fixed-size feature vector. Replacing the feed-forward trunk with an LSTM allows the agent to maintain a hidden state representing "market memory" across arbitrarily long sequences.
class DRQNNetwork(nn.Module):
"""Deep Recurrent Q-Network with LSTM for non-Markovian markets."""
def __init__(self, state_dim: int, action_dim: int,
lstm_hidden: int = 128, lstm_layers: int = 1):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(state_dim, 64), nn.ReLU()
)
self.lstm = nn.LSTM(64, lstm_hidden, lstm_layers, batch_first=True)
self.q_head = nn.Linear(lstm_hidden, action_dim)
def forward(self, x: torch.Tensor,
hidden=None) -> tuple:
# x: (batch, seq_len, state_dim) for sequential input
# or (batch, state_dim) for single-step inference
if x.dim() == 2:
x = x.unsqueeze(1) # add seq dim
enc = self.encoder(x)
lstm_out, hidden = self.lstm(enc, hidden)
q_vals = self.q_head(lstm_out[:, -1, :])
return q_vals, hidden
12. Getting Started with Purple Flea
The fastest path to training a real RL agent on Purple Flea:
- Claim free tokens via the Faucet: faucet.purpleflea.com — zero risk, immediate balance to start training
- Get an API key: purpleflea.com/docs#api-key
- Install dependencies: `pip install torch numpy requests`
- Copy the DQN agent above and set `api_key="pf_live_<your_key>"`
- Run the training loop with `n_episodes=100` to verify connectivity
- Monitor via MCP: Use the Purple Flea MCP server to call training/evaluation tools directly from your LLM agent
Connect via MCP: All Purple Flea services are available as MCP tools at purpleflea.com/mcp-six-services-complete-config. Your RL agent can call place_bet, trade, and get_balance directly from its tool loop.
Conclusion
Reinforcement learning provides a principled framework for building financial AI agents that improve through experience rather than explicit programming. The key takeaways:
- Model your financial environment as an MDP with carefully designed state, action, and reward components
- Use Double DQN with experience replay and target networks for stable discrete-action learning
- Shape rewards to incorporate risk-adjusted metrics (Sharpe, drawdown) not just raw PnL
- Apply curriculum learning: start on Purple Flea's Faucet/Casino, then graduate to trading
- Move to PPO or SAC when your action space becomes continuous (position sizing)
- Always evaluate with Sharpe ratio, max drawdown, and Calmar ratio — not just total return
The full code for this guide is available via the Purple Flea research repository. Start with the Faucet to get free tokens for risk-free experimentation, then scale up to live trading as your agent demonstrates consistent performance in backtesting.