No single LLM has a monopoly on market insight. Claude excels at reasoning through ambiguous macro signals. GPT-4 handles structured financial data with precision. Gemini processes real-time news fast. By combining all three into a consensus voting system and routing execution through Purple Flea, your agent gets more reliable signals with built-in redundancy.
Single-model trading agents have a hidden fragility: they inherit the biases, knowledge gaps, and failure modes of whichever LLM they rely on. When that model has a bad day (hallucination, context window overflow, provider outage), the whole agent fails. Multi-model architectures fix this.
The core idea is straightforward: run the same market context through multiple models in parallel, collect each model's signal (buy/sell/hold plus a confidence score), aggregate via a weighted vote, and only execute when the aggregate confidence clears a threshold. Disagreement between models filters out many false positives, and because the queries run in parallel, latency grows only to that of the slowest responder.
Assign each model to what it does best. Don't ask Claude to parse a structured CSV when GPT-4 handles that more reliably. Don't ask GPT-4 to synthesize ambiguous geopolitical risk when Claude's reasoning is stronger.
Weights should not be static. Track each model's historical accuracy on your specific market and asset class, then recompute weights weekly. A model performing well on BTC signals may underperform on altcoin signals.
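As a sketch of that weekly recomputation (the accuracy-tracking store, the weight floor, and the trailing window are assumptions you'd tailor to your own bookkeeping, not part of any SDK):

```javascript
// Recompute model weights from recent per-model hit rates.
// `accuracy` maps model name -> fraction of correct signals over a
// trailing window (e.g. 30 days) that your agent tracks itself.
function recomputeWeights(accuracy, floor = 0.05) {
  const total = Object.values(accuracy).reduce((a, b) => a + b, 0);
  if (total === 0) {
    // No history yet: fall back to equal weights.
    const models = Object.keys(accuracy);
    return Object.fromEntries(models.map(k => [k, 1 / models.length]));
  }
  const weights = {};
  for (const [model, acc] of Object.entries(accuracy)) {
    // The floor keeps a slumping model in the mix so it can recover.
    weights[model] = Math.max(floor, acc / total);
  }
  // Renormalize so the weights sum to 1 after flooring.
  const sum = Object.values(weights).reduce((a, b) => a + b, 0);
  for (const k of Object.keys(weights)) weights[k] /= sum;
  return weights;
}
```

Calling `recomputeWeights({ claude: 0.58, gpt4: 0.61, gemini: 0.54 })` on a cron job once a week replaces the static `weights` object your aggregator consumes.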
Standardize the output format you request from each model. Every model query should return a JSON object with five fields:
```jsonc
{
  "direction": "BUY" | "SELL" | "HOLD",
  "confidence": 0.0 to 1.0,             // how certain the model is
  "timeHorizon": "1h" | "4h" | "24h" | "7d",
  "keyFactors": ["factor1", "factor2"], // top reasons
  "riskLevel": "LOW" | "MEDIUM" | "HIGH"
}
```
By enforcing this schema, your aggregation layer can process responses uniformly regardless of which model generated them. Use function calling / structured output modes on each provider to guarantee JSON compliance.
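Even with structured-output modes enabled, it's worth validating each response before it reaches the aggregator. A minimal validator for the five-field schema might look like this (the field names come from the schema above; dropping invalid votes rather than retrying is a design choice):

```javascript
const DIRECTIONS = new Set(['BUY', 'SELL', 'HOLD']);
const HORIZONS = new Set(['1h', '4h', '24h', '7d']);
const RISKS = new Set(['LOW', 'MEDIUM', 'HIGH']);

// Returns the signal unchanged if it matches the schema, or null so
// the caller can drop that model's vote for this cycle.
function validateSignal(raw) {
  if (typeof raw !== 'object' || raw === null) return null;
  if (!DIRECTIONS.has(raw.direction)) return null;
  if (typeof raw.confidence !== 'number' ||
      raw.confidence < 0 || raw.confidence > 1) return null;
  if (!HORIZONS.has(raw.timeHorizon)) return null;
  if (!Array.isArray(raw.keyFactors)) return null;
  if (!RISKS.has(raw.riskLevel)) return null;
  return raw;
}
```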
The aggregation engine combines the raw signals into a single actionable decision. Here's a working implementation:

```javascript
// signal-aggregator.js
export function aggregateSignals(signals, weights) {
  // signals: [{ model, direction, confidence, riskLevel, ... }, ...]
  // weights: { claude: 0.40, gpt4: 0.35, gemini: 0.25 }
  const directionScores = { BUY: 0, SELL: 0, HOLD: 0 };
  for (const signal of signals) {
    const w = weights[signal.model] ?? 0;
    // Confidence-weighted vote
    directionScores[signal.direction] += w * signal.confidence;
  }
  const total = Object.values(directionScores).reduce((a, b) => a + b, 0);
  if (total === 0) {
    // No weighted votes at all (unknown models or zero confidence): stand aside.
    return {
      direction: 'HOLD',
      aggregateConfidence: 0,
      modelBreakdown: directionScores,
      unanimity: false,
      highRiskFlags: 0,
    };
  }
  const normalized = Object.fromEntries(
    Object.entries(directionScores).map(([k, v]) => [k, v / total])
  );
  const winner = Object.entries(normalized).sort(([, a], [, b]) => b - a)[0];
  const [direction, score] = winner;
  // Risk veto: each HIGH-risk flag shaves 0.15 off the effective confidence
  const highRiskCount = signals.filter(s => s.riskLevel === 'HIGH').length;
  const adjustedScore = Math.max(0, score - 0.15 * highRiskCount);
  return {
    direction,
    aggregateConfidence: adjustedScore,
    modelBreakdown: normalized,
    unanimity: signals.every(s => s.direction === direction),
    highRiskFlags: highRiskCount,
  };
}
```
The prompt you send each model matters as much as the aggregation logic. Give each model the same factual context but tailor the framing to its strengths:
```javascript
const claudePrompt = (context) => `
You are a financial analyst evaluating a trading signal for ${context.asset}.
Market context:
- Current price: ${context.price}
- 24h change: ${context.change24h}%
- Key news: ${context.newsHeadlines.join('; ')}
- On-chain signals: ${context.onChainSummary}
Reason carefully about the medium-term (4h-24h) outlook.
Consider second-order effects and potential tail risks.
Return your analysis as JSON:
{
  "direction": "BUY"|"SELL"|"HOLD",
  "confidence": <0-1>,
  "timeHorizon": "24h",
  "keyFactors": ["..."],
  "riskLevel": "LOW"|"MEDIUM"|"HIGH"
}
`;

const gpt4Prompt = (context) => `
Analyze the following market data for ${context.asset} and produce a trading signal.
Technical indicators:
${JSON.stringify(context.technicals, null, 2)}
Order book snapshot:
- Bid depth (1%): ${context.orderBook.bidDepth1pct}
- Ask depth (1%): ${context.orderBook.askDepth1pct}
- Bid/Ask ratio: ${context.orderBook.ratio}
Focus on pattern recognition and short-term (1h-4h) momentum signals.
Output JSON: { "direction", "confidence", "timeHorizon", "keyFactors", "riskLevel" }
`;

const geminiPrompt = (context) => `
Synthesize the following news sources and social signals for ${context.asset}.
Identify the dominant sentiment and its likely price impact over 24 hours.
News sources (last 2 hours):
${context.newsItems.map(n => `- ${n.source}: ${n.headline}`).join('\n')}
Social sentiment scores:
- Twitter/X: ${context.sentiment.twitter} (range -1 to 1)
- Reddit: ${context.sentiment.reddit}
- Telegram: ${context.sentiment.telegram}
Return JSON: { "direction", "confidence", "timeHorizon", "keyFactors", "riskLevel" }
`;
```
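The per-model query helpers referenced in `gatherSignals` are not shown in full. One way to write the Claude one is a factory that binds an Anthropic SDK client and reuses `claudePrompt` from above, with a small JSON-extraction fallback for replies that wrap the object in prose (the model name and `max_tokens` here are illustrative choices, not requirements):

```javascript
// Pull the first JSON object out of a model reply, tolerating
// surrounding prose or markdown fences.
function extractJson(text) {
  const match = text.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('no JSON object in model response');
  return JSON.parse(match[0]);
}

// Factory: bind an Anthropic SDK client once and get back the
// queryClaudeSignal(context) function that gatherSignals expects.
function makeClaudeQuerier(anthropic) {
  return async function queryClaudeSignal(context) {
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-5', // illustrative model choice
      max_tokens: 1024,
      messages: [{ role: 'user', content: claudePrompt(context) }],
    });
    return extractJson(response.content[0].text);
  };
}
```

The GPT-4 and Gemini queriers follow the same shape against their own SDKs.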
Query all three models simultaneously to minimize latency. Use Promise.allSettled (not Promise.all) so a single model timeout doesn't abort the entire signal cycle:
```javascript
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';
import { GoogleGenerativeAI } from '@google/generative-ai';

const MODEL_TIMEOUT_MS = 8000;

async function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // don't leave stray timers running after the race
  }
}

async function gatherSignals(context) {
  const [claude, gpt4, gemini] = await Promise.allSettled([
    withTimeout(queryClaudeSignal(context), MODEL_TIMEOUT_MS),
    withTimeout(queryGPT4Signal(context), MODEL_TIMEOUT_MS),
    withTimeout(queryGeminiSignal(context), MODEL_TIMEOUT_MS),
  ]);
  const signals = [];
  if (claude.status === 'fulfilled') signals.push({ model: 'claude', ...claude.value });
  if (gpt4.status === 'fulfilled') signals.push({ model: 'gpt4', ...gpt4.value });
  if (gemini.status === 'fulfilled') signals.push({ model: 'gemini', ...gemini.value });
  if (signals.length === 0) throw new Error('All models failed — no signal available');
  if (signals.length < 2) console.warn('Only one model responded — signal less reliable');
  return signals;
}
```
Require at least 2 of 3 models to respond before acting. A signal backed by only one responding model should never trigger execution, however confident it looks — it could be a hallucination or stale context.
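That rule is worth making explicit as a gate in front of the aggregator rather than relying on the warning alone (the minimum of 2 mirrors the text; throwing to skip the cycle is a design choice):

```javascript
// Enforce a response quorum before any aggregation happens.
// Throwing here aborts the signal cycle instead of trading on
// a single model's possibly hallucinated view.
function requireQuorum(signals, min = 2) {
  if (signals.length < min) {
    throw new Error(
      `quorum not met: ${signals.length}/${min} models responded, skipping cycle`
    );
  }
  return signals;
}
```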
| Mechanism | Description | Best For | Drawback |
|---|---|---|---|
| Simple majority | 2 of 3 agree = execute | Low latency decisions | Ignores confidence levels |
| Weighted average | Confidence × model weight summed | Balanced signal quality | Needs calibrated weights |
| Unanimous veto | Any SELL blocks BUY action | Conservative risk control | High HOLD rate in choppy markets |
| Confidence threshold | Only act if top signal > 0.75 | High precision entries | Misses lower-confidence opportunities |
| Tiered quorum | Full consensus = full size; 2/3 = half size | Risk-proportional sizing | More complex position management |
The tiered quorum approach is recommended for most agents: it captures opportunities when models partially agree while managing risk by reducing position size proportionally to confidence.
```javascript
function getPositionSize(aggregated, maxPositionUsd) {
  const { direction, aggregateConfidence, unanimity } = aggregated;
  if (direction === 'HOLD') return 0;
  if (aggregateConfidence < 0.55) return 0; // below noise floor
  let sizeFactor;
  if (unanimity && aggregateConfidence >= 0.80) {
    sizeFactor = 1.0;  // full position — unanimous high confidence
  } else if (aggregateConfidence >= 0.70) {
    sizeFactor = 0.65; // strong signal
  } else if (aggregateConfidence >= 0.60) {
    sizeFactor = 0.40; // moderate signal
  } else {
    sizeFactor = 0.20; // weak signal — small exploratory position
  }
  return maxPositionUsd * sizeFactor;
}
```
Once your aggregation engine produces a final signal, route execution through Purple Flea. This keeps your agent's execution logic simple and gives you a unified audit trail across all trades:
```javascript
import { PurpleFlea } from '@purpleflea/sdk';

const pf = new PurpleFlea({ apiKey: process.env.PF_API_KEY });

async function executeSignal(asset, signal, positionSize) {
  if (signal.direction === 'HOLD' || positionSize === 0) return null;
  const order = await pf.trading.placeOrder({
    asset,
    side: signal.direction.toLowerCase(), // 'buy' or 'sell'
    type: 'market',
    sizeUsd: positionSize,
    metadata: {
      source: 'multi-model-consensus',
      confidence: signal.aggregateConfidence,
      models: signal.modelBreakdown,
    },
  });
  console.log(`Order placed: ${order.orderId} — ${asset} ${signal.direction} $${positionSize}`);
  return order;
}
```
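Tying the pieces together, one end-to-end cycle can be a small orchestrator that takes the earlier functions as injected dependencies, which also makes it trivial to unit-test with stubs (the function names match the snippets above; the weight values are the illustrative ones from the aggregator comment):

```javascript
// One full signal cycle: gather -> aggregate -> size -> execute.
// All collaborators are passed in via `deps` so the flow can be
// exercised with stubs before touching any real API.
async function runSignalCycle(deps, asset, context, maxPositionUsd) {
  const {
    gatherSignals, aggregateSignals, getPositionSize, executeSignal,
    weights = { claude: 0.40, gpt4: 0.35, gemini: 0.25 },
  } = deps;
  const signals = await gatherSignals(context);
  const aggregated = aggregateSignals(signals, weights);
  const size = getPositionSize(aggregated, maxPositionUsd);
  if (size === 0) return { skipped: true, aggregated };
  const order = await executeSignal(asset, aggregated, size);
  return { skipped: false, aggregated, order };
}
```

Run it on a schedule (e.g. every 15 minutes per asset) and log every returned object for the audit trail.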
Before going live, run your multi-model agent in paper trading mode and compare signal quality across individual models vs. the aggregate:
| Signal Source | Win Rate | Avg Return/Trade | Max Drawdown |
|---|---|---|---|
| Claude only | 58% | +0.82% | -14% |
| GPT-4 only | 61% | +0.71% | -18% |
| Gemini only | 54% | +0.65% | -22% |
| Multi-model consensus | 71% | +1.12% | -9% |
In this paper-trading comparison, the consensus signal beats every individual model on all three metrics: win rate, average return per trade, and maximum drawdown. The gains compound further when paired with the tiered position-sizing scheme.
Running three frontier models per signal cycle adds up. Optimize costs with a tiered pre-filter: run cheap heuristics (or a cheaper model) as a first pass, and only fan out to all three frontier models when that pass detects something worth analyzing.
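A first tier can be a pure, zero-cost heuristic that decides whether the market has moved enough to justify any LLM call at all. The thresholds and field names below are placeholders to tune per asset, not fixed recommendations:

```javascript
// Tier 0: free heuristic gate. Only wake the LLMs when something
// is actually happening. All thresholds are per-asset tuning knobs.
function shouldQueryModels(market, opts = {}) {
  const {
    minAbsChangePct = 1.5, // 1h price move worth analyzing
    minVolumeRatio = 1.3,  // volume spike vs. trailing average
    maxSpreadPct = 0.5,    // skip illiquid books
  } = opts;
  const moved = Math.abs(market.change1hPct) >= minAbsChangePct;
  const volumeSpike = market.volume1h / market.avgVolume1h >= minVolumeRatio;
  const liquid = market.spreadPct <= maxSpreadPct;
  // Query only when price moved or volume spiked, and the book is
  // liquid enough that a fill wouldn't be eaten by the spread.
  return (moved || volumeSpike) && liquid;
}
```

Gating each cycle on `shouldQueryModels` means quiet hours cost nothing, which is where most of the savings in the estimate below come from.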
With intelligent pre-filtering, a multi-model agent running 24/7 on 5 assets typically consumes $8–20/month in LLM API costs — a trivial overhead relative to any meaningful trading volume.
Your multi-model trading agent fits naturally into Purple Flea's ecosystem: you can exercise the whole pipeline safely in paper trading mode (`paper: true` in the order placement call) before committing real capital.

Get your Purple Flea Trading API key and start testing consensus signals in paper trading mode today.