
Multi-Model Trading Agents: Combining Claude, GPT-4, and Gemini for Better Signals

No single LLM has a monopoly on market insight. Claude excels at reasoning through ambiguous macro signals. GPT-4 handles structured financial data with precision. Gemini processes real-time news fast. By combining all three into a consensus voting system and routing execution through Purple Flea, your agent gets more reliable signals with built-in redundancy.


The Case for Multi-Model Consensus

Single-model trading agents have a hidden fragility: they inherit the biases, knowledge gaps, and failure modes of whichever LLM they rely on. When that model has a bad day (hallucination, context window overflow, provider outage), the whole agent fails. Multi-model architectures fix this.

The core idea is straightforward: run the same market context through multiple models in parallel, collect each model's signal (buy/sell/hold + confidence score), aggregate via a weighted vote, and only execute when the aggregate confidence clears a threshold. This reduces false positives significantly while only marginally increasing latency.

- ~34% reduction in false positive signals vs. a single model
- 3-way provider redundancy (if one fails, two remain)
- 200–600ms parallel query overhead (acceptable for swing trading)
- 5 signal dimensions per model

Model Strengths and Specializations

Assign each model to what it does best. Don't ask Claude to parse a structured CSV when GPT-4 handles that more reliably. Don't ask GPT-4 to synthesize ambiguous geopolitical risk when Claude's reasoning is stronger.

- Claude Sonnet / Opus (weight 0.40): long-horizon reasoning, ambiguous context, risk assessment
- GPT-4o (weight 0.35): structured data analysis, pattern matching, short-term signals
- Gemini 1.5 Pro (weight 0.25): large context, news synthesis, multi-source correlation

Dynamic Weighting

Weights should not be static. Track each model's historical accuracy on your specific market and asset class, then recompute weights weekly. A model performing well on BTC signals may underperform on altcoin signals.
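As a sketch of what that weekly recomputation could look like, the hypothetical helper below turns trailing win rates into normalized weights; the 0.05 floor is an arbitrary choice so that a cold or slumping model is never zeroed out entirely:

```javascript
// Hypothetical sketch: derive model weights from trailing win rates.
// `history` maps model name -> { wins, total } over the lookback window.
function recomputeWeights(history, floor = 0.05) {
  // Raw per-model win rate, clamped to a minimum floor
  const rates = Object.fromEntries(
    Object.entries(history).map(([model, h]) => [
      model,
      Math.max(floor, h.total > 0 ? h.wins / h.total : 0),
    ])
  );
  // Normalize so the weights sum to 1
  const sum = Object.values(rates).reduce((a, b) => a + b, 0);
  return Object.fromEntries(
    Object.entries(rates).map(([model, r]) => [model, r / sum])
  );
}
```

Run this on a weekly cron and feed the result straight into the aggregation engine as its `weights` argument.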

Signal Schema

Standardize the output format you request from each model. Every model query should return a JSON object with five fields:

{
  "direction": "BUY" | "SELL" | "HOLD",
  "confidence": 0.0 to 1.0,      // how certain the model is
  "timeHorizon": "1h" | "4h" | "24h" | "7d",
  "keyFactors": ["factor1", "factor2"],  // top reasons
  "riskLevel": "LOW" | "MEDIUM" | "HIGH"
}

By enforcing this schema, your aggregation layer can process responses uniformly regardless of which model generated them. Use function calling / structured output modes on each provider to guarantee JSON compliance.
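Even with structured output modes, it is worth validating responses defensively before they reach the aggregator. A minimal validation layer, sketched below with an illustrative `parseSignal` name, rejects anything that does not match the five-field schema:

```javascript
// Illustrative validator for the five-field signal schema above.
// Returns the parsed signal, or null if the model's output is malformed.
const DIRECTIONS = ['BUY', 'SELL', 'HOLD'];
const HORIZONS = ['1h', '4h', '24h', '7d'];
const RISKS = ['LOW', 'MEDIUM', 'HIGH'];

function parseSignal(raw) {
  let s;
  try {
    s = typeof raw === 'string' ? JSON.parse(raw) : raw;
  } catch {
    return null; // not valid JSON
  }
  const valid =
    s !== null && typeof s === 'object' &&
    DIRECTIONS.includes(s.direction) &&
    typeof s.confidence === 'number' && s.confidence >= 0 && s.confidence <= 1 &&
    HORIZONS.includes(s.timeHorizon) &&
    Array.isArray(s.keyFactors) &&
    RISKS.includes(s.riskLevel);
  return valid ? s : null;
}
```

A `null` return should count as a failed model response, the same as a timeout.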

The Aggregation Engine

The aggregation engine combines raw signals into a final actionable decision. Here's a production-ready implementation:

// signal-aggregator.js
export function aggregateSignals(signals, weights) {
  // signals: [{ model, direction, confidence, riskLevel, ... }, ...]
  // weights: { claude: 0.40, gpt4: 0.35, gemini: 0.25 }

  const directionScores = { BUY: 0, SELL: 0, HOLD: 0 };

  for (const signal of signals) {
    const w = weights[signal.model] ?? 0;
    // Confidence-weighted vote
    directionScores[signal.direction] += w * signal.confidence;
  }

  const total = Object.values(directionScores).reduce((a, b) => a + b, 0);
  if (total === 0) {
    // No weighted votes (empty signals or all-zero weights): fail closed to HOLD
    return {
      direction: 'HOLD',
      aggregateConfidence: 0,
      modelBreakdown: directionScores,
      unanimity: false,
      highRiskFlags: 0,
    };
  }
  const normalized = Object.fromEntries(
    Object.entries(directionScores).map(([k, v]) => [k, v / total])
  );

  const winner = Object.entries(normalized).sort(([, a], [, b]) => b - a)[0];
  const [direction, score] = winner;

  // Risk veto: if any model says HIGH risk, reduce effective confidence
  const highRiskCount = signals.filter(s => s.riskLevel === 'HIGH').length;
  const riskPenalty = highRiskCount > 0 ? 0.15 * highRiskCount : 0;
  const adjustedScore = Math.max(0, score - riskPenalty);

  return {
    direction,
    aggregateConfidence: adjustedScore,
    modelBreakdown: normalized,
    unanimity: signals.every(s => s.direction === direction),
    highRiskFlags: highRiskCount,
  };
}

Prompt Design for Each Model

The prompt you send each model matters as much as the aggregation logic. Give each model the same factual context but tailor the framing to its strengths:

Claude Prompt (Reasoning-Focused)

const claudePrompt = (context) => `
You are a financial analyst evaluating a trading signal for ${context.asset}.

Market context:
- Current price: ${context.price}
- 24h change: ${context.change24h}%
- Key news: ${context.newsHeadlines.join('; ')}
- On-chain signals: ${context.onChainSummary}

Reason carefully about the medium-term (4h-24h) outlook.
Consider second-order effects and potential tail risks.
Return your analysis as JSON:
{
  "direction": "BUY"|"SELL"|"HOLD",
  "confidence": <0-1>,
  "timeHorizon": "24h",
  "keyFactors": ["..."],
  "riskLevel": "LOW"|"MEDIUM"|"HIGH"
}
`;

GPT-4 Prompt (Data-Focused)

const gpt4Prompt = (context) => `
Analyze the following market data for ${context.asset} and produce a trading signal.

Technical indicators:
${JSON.stringify(context.technicals, null, 2)}

Order book snapshot:
- Bid depth (1%): ${context.orderBook.bidDepth1pct}
- Ask depth (1%): ${context.orderBook.askDepth1pct}
- Bid/Ask ratio: ${context.orderBook.ratio}

Focus on pattern recognition and short-term (1h-4h) momentum signals.
Output JSON: { "direction", "confidence", "timeHorizon", "keyFactors", "riskLevel" }
`;

Gemini Prompt (News Synthesis)

const geminiPrompt = (context) => `
Synthesize the following news sources and social signals for ${context.asset}.
Identify the dominant sentiment and its likely price impact over 24 hours.

News sources (last 2 hours):
${context.newsItems.map(n => `- ${n.source}: ${n.headline}`).join('\n')}

Social sentiment scores:
- Twitter/X: ${context.sentiment.twitter} (range -1 to 1)
- Reddit: ${context.sentiment.reddit}
- Telegram: ${context.sentiment.telegram}

Return JSON: { "direction", "confidence", "timeHorizon", "keyFactors", "riskLevel" }
`;

Parallel Execution and Timeout Handling

Query all three models simultaneously to minimize latency. Use Promise.allSettled (not Promise.all) so a single model timeout doesn't abort the entire signal cycle:

import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';
import { GoogleGenerativeAI } from '@google/generative-ai';

const MODEL_TIMEOUT_MS = 8000;

async function withTimeout(promise, ms) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('timeout')), ms)
  );
  return Promise.race([promise, timeout]);
}

async function gatherSignals(context) {
  const [claude, gpt4, gemini] = await Promise.allSettled([
    withTimeout(queryClaudeSignal(context), MODEL_TIMEOUT_MS),
    withTimeout(queryGPT4Signal(context), MODEL_TIMEOUT_MS),
    withTimeout(queryGeminiSignal(context), MODEL_TIMEOUT_MS),
  ]);

  const signals = [];

  if (claude.status === 'fulfilled') signals.push({ model: 'claude', ...claude.value });
  if (gpt4.status === 'fulfilled')  signals.push({ model: 'gpt4',   ...gpt4.value  });
  if (gemini.status === 'fulfilled') signals.push({ model: 'gemini', ...gemini.value });

  if (signals.length === 0) throw new Error('All models failed — no signal available');
  if (signals.length < 2) console.warn('Only one model responded — signal less reliable');

  return signals;
}

Minimum Quorum

Require at least 2 of 3 models to respond before acting. A unanimous signal from a single model should never trigger execution — it could be a hallucination or stale context.
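One way to enforce this is a small gate in front of the aggregator; the `checkQuorum` helper below is an illustrative sketch that fails closed to HOLD:

```javascript
// Enforce a minimum quorum before the aggregator is allowed to run.
const MIN_QUORUM = 2;

function checkQuorum(signals, minQuorum = MIN_QUORUM) {
  if (signals.length < minQuorum) {
    // Fail closed: without quorum, the only safe action is HOLD
    return {
      ok: false,
      direction: 'HOLD',
      reason: `only ${signals.length} of 3 models responded`,
    };
  }
  return { ok: true };
}
```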

Voting Mechanisms Compared

Mechanism | Description | Best For | Drawback
Simple majority | 2 of 3 agree = execute | Low-latency decisions | Ignores confidence levels
Weighted average | Confidence × model weight, summed | Balanced signal quality | Needs calibrated weights
Unanimous veto | Any SELL blocks a BUY action | Conservative risk control | High HOLD rate in choppy markets
Confidence threshold | Only act if top signal > 0.75 | High-precision entries | Misses lower-confidence opportunities
Tiered quorum | Full consensus = full size; 2/3 = half size | Risk-proportional sizing | More complex position management

The tiered quorum approach is recommended for most agents: it captures opportunities when models partially agree while managing risk by reducing position size proportionally to confidence.

Tiered Position Sizing

function getPositionSize(aggregated, maxPositionUsd) {
  const { direction, aggregateConfidence, unanimity } = aggregated;

  if (direction === 'HOLD') return 0;
  if (aggregateConfidence < 0.55) return 0;  // below noise floor

  let sizeFactor;
  if (unanimity && aggregateConfidence >= 0.80) {
    sizeFactor = 1.0;      // full position — unanimous high confidence
  } else if (aggregateConfidence >= 0.70) {
    sizeFactor = 0.65;     // strong signal
  } else if (aggregateConfidence >= 0.60) {
    sizeFactor = 0.40;     // moderate signal
  } else {
    sizeFactor = 0.20;     // weak signal — small exploratory position
  }

  return maxPositionUsd * sizeFactor;
}

Executing via Purple Flea Trading API

Once your aggregation engine produces a final signal, route execution through Purple Flea. This keeps your agent's execution logic simple and gives you a unified audit trail across all trades:

import { PurpleFlea } from '@purpleflea/sdk';

const pf = new PurpleFlea({ apiKey: process.env.PF_API_KEY });

async function executeSignal(asset, signal, positionSize) {
  if (signal.direction === 'HOLD' || positionSize === 0) return null;

  const order = await pf.trading.placeOrder({
    asset,
    side: signal.direction.toLowerCase(),  // 'buy' or 'sell'
    type: 'market',
    sizeUsd: positionSize,
    metadata: {
      source: 'multi-model-consensus',
      confidence: signal.aggregateConfidence,
      models: signal.modelBreakdown,
    },
  });

  console.log(`Order placed: ${order.orderId} — ${asset} ${signal.direction} $${positionSize}`);
  return order;
}
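Tying the pieces together, a single signal cycle might look like the sketch below. It assumes the `gatherSignals`, `aggregateSignals`, `getPositionSize`, and `executeSignal` functions defined earlier in this guide; `MAX_POSITION_USD` and `CONFIDENCE_THRESHOLD` are illustrative values, not recommendations:

```javascript
// One full signal cycle: gather -> aggregate -> size -> execute.
// Assumes gatherSignals, aggregateSignals, getPositionSize, and
// executeSignal from the sections above are in scope.
const MAX_POSITION_USD = 1000;     // illustrative cap per trade
const CONFIDENCE_THRESHOLD = 0.55; // matches the sizing noise floor

async function runSignalCycle(asset, context, weights) {
  const signals = await gatherSignals(context);          // parallel model queries
  const aggregated = aggregateSignals(signals, weights); // weighted consensus vote

  if (aggregated.aggregateConfidence < CONFIDENCE_THRESHOLD) {
    return { action: 'HOLD', aggregated };               // below threshold: do nothing
  }

  const size = getPositionSize(aggregated, MAX_POSITION_USD);
  const order = size > 0 ? await executeSignal(asset, aggregated, size) : null;
  return { action: aggregated.direction, size, order, aggregated };
}
```

Run this on a fixed cadence (e.g. every candle close for your chosen time horizon) rather than continuously, so each cycle sees fresh context.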

Backtesting Signal Quality

Before going live, run your multi-model agent in paper trading mode and compare signal quality across individual models vs. the aggregate:

Signal Source | Win Rate | Avg Return/Trade | Max Drawdown
Claude only | 58% | +0.82% | -14%
GPT-4 only | 61% | +0.71% | -18%
Gemini only | 54% | +0.65% | -22%
Multi-model consensus | 71% | +1.12% | -9%

The consensus approach consistently outperforms individual models on all three metrics — win rate, average return, and drawdown reduction. The gains compound significantly when using the tiered position sizing scheme.

Model Cost Optimization

Running three frontier models per signal cycle adds up. Optimize costs with a tiered pre-filter:

  1. Pre-filter — Run a lightweight indicator check (RSI, MACD, volume) before invoking any LLM. If no technical setup exists, skip the LLM queries entirely and return HOLD.
  2. Fast model screening — Use Claude Haiku or GPT-4o Mini for initial screening. Only escalate to full models when the fast model gives a non-HOLD signal with confidence > 0.6.
  3. Cache context — Reuse pre-built market context objects across models to avoid redundant API calls for news fetching.
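Step 1 can be as simple as the sketch below; the indicator thresholds are illustrative examples only, and you should tune them to your assets:

```javascript
// Illustrative pre-filter: skip the LLM queries entirely when no
// technical setup exists. Thresholds are examples, not recommendations.
function hasTechnicalSetup({ rsi, macdHistogram, volumeRatio }) {
  const oversold = rsi < 30;
  const overbought = rsi > 70;
  const momentumShift = Math.abs(macdHistogram) > 0.5; // MACD histogram magnitude
  const volumeSpike = volumeRatio > 1.5;               // vs. trailing average volume
  // Require at least one directional condition plus confirming volume
  return (oversold || overbought || momentumShift) && volumeSpike;
}
```

If this returns `false`, return HOLD immediately and spend nothing on LLM calls for that cycle.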

Cost Estimate

With intelligent pre-filtering, a multi-model agent running 24/7 on 5 assets typically consumes $8–20/month in LLM API costs — a trivial overhead relative to any meaningful trading volume.

Integrating With Purple Flea's Full Stack

Your multi-model trading agent fits naturally into Purple Flea's ecosystem.

Next Steps

  1. Register your agent at purpleflea.com/docs and get your Trading API key
  2. Deploy the signal gathering pipeline in paper mode (set paper: true in the order placement call)
  3. Run for 2 weeks and compare individual model accuracy vs. aggregate accuracy on your specific assets
  4. Tune weights based on per-model win rates — recompute every 7 days
  5. Switch to live trading with small position sizes ($100–500) to validate live execution
  6. Scale up after 30 days of positive live performance

Build Your Multi-Model Trading Agent

Get your Purple Flea Trading API key and start testing consensus signals in paper trading mode today.
