Guide

Multi-Model Agent Systems

March 6, 2026 · 25 min read · Purple Flea Research

A single LLM is a generalist. A multi-model agent system is a specialist team: each model assigned to the tasks where it excels, orchestrated by a lightweight router that cuts costs by 70–80% while improving output quality on high-stakes decisions. This guide walks through the architecture, the Python implementation, and how to wire it into Purple Flea's financial infrastructure.

What you will build

A Python ModelRouter class that classifies incoming tasks by type and complexity, routes each to the optimal LLM (Claude Opus, GPT-4o, Gemini Flash, or local Llama), aggregates ensemble votes for high-stakes trades, and uses model disagreement as an uncertainty signal to pause or reduce position size.

Why Multi-Model Over Single-Model

The instinct to use the best available model for every task is costly and often counterproductive. A frontier model like Claude Opus 4 costs ~$15 per million input tokens ($75 per million output). A fast, cheap model like Gemini 1.5 Flash costs ~$0.075 per million input tokens: 200x cheaper. For most tasks inside a financial agent loop (JSON parsing, simple data transformations, routine API calls), the cheaper model performs identically.

Multi-model systems also expose something single-model systems cannot: disagreement. When three models agree on a trading decision, confidence is high. When they diverge, that divergence itself is a signal that the situation is ambiguous, and the appropriate response is a smaller position size or a human review flag.
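As a minimal sketch of the idea (the full aggregation logic appears in the router implementation below), majority-vote agreement maps directly to an uncertainty score:

from collections import Counter

def disagreement(decisions: list[str]) -> float:
    """Uncertainty = 1 - share of models backing the majority decision."""
    top_votes = Counter(decisions).most_common(1)[0][1]
    return 1.0 - top_votes / len(decisions)

print(disagreement(["buy", "buy", "buy"]))    # 0.0 -> unanimous, high confidence
print(disagreement(["buy", "hold", "sell"]))  # ~0.67 -> ambiguous, shrink or pause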

Cost breakdown across model tiers (per million tokens)

  • Claude Opus 4: $15.00
  • GPT-4o: $10.00
  • Claude Sonnet: $3.00
  • GPT-4o-mini: $0.60
  • Gemini Flash: $0.075
  • Llama 3.1 8B: $0.020

Model Taxonomy for Financial Agents

Different financial agent tasks require different model capabilities. Matching task requirements to model strengths is the core skill of multi-model system design.

Claude Opus 4

$15 / $75 per 1M tokens (input / output)
  • Complex reasoning chains
  • Nuanced risk assessment
  • Long-context financial docs
  • Regulatory interpretation

GPT-4o

$2.50 / $10 per 1M tokens
  • Function calling reliability
  • Structured JSON output
  • Code generation (Python/JS)
  • Tool use chains

Gemini 1.5 Flash

$0.075 / $0.30 per 1M tokens
  • High-throughput classification
  • Simple sentiment scoring
  • Data extraction tasks
  • Routine summarization

Llama 3.1 8B (local)

~$0.02 / $0.02 per 1M tokens
  • Zero-latency classification
  • Private/offline tasks
  • High-frequency routing
  • No data egress

Task-to-model routing table

Task Type | Complexity | Default Model | Escalate To
JSON parsing / extraction | Low | Gemini Flash | none
News sentiment classification | Low | Llama 3.1 8B | Gemini Flash
Market summary generation | Medium | GPT-4o-mini | GPT-4o
Trading signal reasoning | High | Claude Sonnet | Claude Opus 4
Risk analysis (large position) | Very High | Claude Opus 4 | Ensemble
Smart contract analysis | High | GPT-4o | Claude Opus 4
Portfolio optimization | Very High | Ensemble (3 models) | Human review
Routine data transformation | Very Low | Llama 3.1 8B | none

Router Architecture

The router is a lightweight classifier that sits in front of all LLM calls. It accepts the task prompt and metadata (task type, urgency, dollar value at stake) and returns a model selection with an optional ensemble configuration.

Request Flow

Incoming Task → Task Classifier → Complexity Scorer → Model Selector
  • Cheap model (low complexity)
  • Mid model (medium)
  • Frontier model (high)
  • Ensemble (very high)
↓
Response + Confidence + Cost Logged
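The complexity scorer is the one stage the implementation below leaves implicit. Here is a minimal heuristic sketch using the TaskType and Complexity enums defined in the next section; the dollar_value and urgent parameters are hypothetical metadata fields, and the $1,000 threshold is an assumption, not a Purple Flea default.

def score_complexity(task: TaskType, dollar_value: float = 0.0, urgent: bool = False) -> Complexity:
    """Heuristic scorer: task type sets a floor; money at stake escalates, urgency de-escalates."""
    floor = {
        TaskType.CLASSIFICATION: Complexity.TRIVIAL,
        TaskType.EXTRACTION:     Complexity.LOW,
        TaskType.SUMMARIZATION:  Complexity.MEDIUM,
        TaskType.REASONING:      Complexity.MEDIUM,
        TaskType.CODE:           Complexity.MEDIUM,
        TaskType.RISK_ANALYSIS:  Complexity.HIGH,
        TaskType.TRADING_SIGNAL: Complexity.HIGH,
        TaskType.PORTFOLIO_OPT:  Complexity.CRITICAL,
    }[task]
    score = floor.value
    if dollar_value >= 1_000:  # hypothetical threshold: large notional escalates one tier
        score += 1
    if urgent:                 # latency-sensitive: prefer a faster, cheaper tier
        score -= 1
    return Complexity(max(Complexity.TRIVIAL.value, min(score, Complexity.CRITICAL.value)))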

Python ModelRouter Implementation

import asyncio
import json
from collections import Counter
from dataclasses import dataclass
from enum import Enum

import httpx

class TaskType(Enum):
    CLASSIFICATION   = "classification"
    EXTRACTION       = "extraction"
    SUMMARIZATION    = "summarization"
    REASONING        = "reasoning"
    CODE             = "code"
    RISK_ANALYSIS    = "risk_analysis"
    TRADING_SIGNAL   = "trading_signal"
    PORTFOLIO_OPT    = "portfolio_optimization"

class Complexity(Enum):
    TRIVIAL  = 0
    LOW      = 1
    MEDIUM   = 2
    HIGH     = 3
    CRITICAL = 4

@dataclass
class ModelSpec:
    name: str
    provider: str     # "anthropic" | "openai" | "google" | "local"
    model_id: str
    cost_per_1m_in: float   # USD
    cost_per_1m_out: float
    max_context: int
    supports_json_mode: bool = True
    latency_ms_p50: int = 500

MODEL_REGISTRY: dict[str, ModelSpec] = {
    "llama-8b": ModelSpec(
        name="Llama 3.1 8B",
        provider="local",
        model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        cost_per_1m_in=0.02, cost_per_1m_out=0.02,
        max_context=128_000, latency_ms_p50=80
    ),
    "gemini-flash": ModelSpec(
        name="Gemini 1.5 Flash",
        provider="google",
        model_id="gemini-1.5-flash",
        cost_per_1m_in=0.075, cost_per_1m_out=0.30,
        max_context=1_000_000, latency_ms_p50=400
    ),
    "gpt-4o-mini": ModelSpec(
        name="GPT-4o-mini",
        provider="openai",
        model_id="gpt-4o-mini",
        cost_per_1m_in=0.15, cost_per_1m_out=0.60,
        max_context=128_000, latency_ms_p50=600
    ),
    "gpt-4o": ModelSpec(
        name="GPT-4o",
        provider="openai",
        model_id="gpt-4o",
        cost_per_1m_in=2.50, cost_per_1m_out=10.0,
        max_context=128_000, latency_ms_p50=1500
    ),
    "claude-sonnet": ModelSpec(
        name="Claude Sonnet 4",
        provider="anthropic",
        model_id="claude-sonnet-4-5",
        cost_per_1m_in=3.0, cost_per_1m_out=15.0,
        max_context=200_000, latency_ms_p50=1200
    ),
    "claude-opus": ModelSpec(
        name="Claude Opus 4",
        provider="anthropic",
        model_id="claude-opus-4-5",
        cost_per_1m_in=15.0, cost_per_1m_out=75.0,
        max_context=200_000, latency_ms_p50=3000
    ),
}

# Routing rules: (task_type, complexity) -> model_key
ROUTING_TABLE: dict[tuple[TaskType, Complexity], str | list[str]] = {
    (TaskType.CLASSIFICATION, Complexity.TRIVIAL):  "llama-8b",
    (TaskType.CLASSIFICATION, Complexity.LOW):      "gemini-flash",
    (TaskType.EXTRACTION,     Complexity.LOW):      "gemini-flash",
    (TaskType.EXTRACTION,     Complexity.MEDIUM):   "gpt-4o-mini",
    (TaskType.SUMMARIZATION,  Complexity.MEDIUM):   "gpt-4o-mini",
    (TaskType.SUMMARIZATION,  Complexity.HIGH):     "claude-sonnet",
    (TaskType.REASONING,      Complexity.MEDIUM):   "gpt-4o",
    (TaskType.REASONING,      Complexity.HIGH):     "claude-sonnet",
    (TaskType.REASONING,      Complexity.CRITICAL): "claude-opus",
    (TaskType.CODE,           Complexity.MEDIUM):   "gpt-4o",
    (TaskType.CODE,           Complexity.HIGH):     "gpt-4o",
    (TaskType.RISK_ANALYSIS,  Complexity.HIGH):     "claude-sonnet",
    (TaskType.RISK_ANALYSIS,  Complexity.CRITICAL): ["claude-opus", "gpt-4o", "gemini-flash"],
    (TaskType.TRADING_SIGNAL, Complexity.HIGH):     "claude-sonnet",
    (TaskType.TRADING_SIGNAL, Complexity.CRITICAL): ["claude-opus", "gpt-4o", "claude-sonnet"],
    (TaskType.PORTFOLIO_OPT,  Complexity.CRITICAL): ["claude-opus", "gpt-4o", "claude-sonnet"],
}

class ModelRouter:
    def __init__(self, api_keys: dict[str, str]):
        self.keys = api_keys
        self.call_log: list[dict] = []

    def select_model(self, task: TaskType, complexity: Complexity) -> str | list[str]:
        """Look up the routing table; fall back up the complexity ladder."""
        for value in range(complexity.value, Complexity.CRITICAL.value + 1):
            key = (task, Complexity(value))
            if key in ROUTING_TABLE:
                return ROUTING_TABLE[key]
        return "claude-sonnet"  # safe default

    async def call_model(
        self,
        model_key: str,
        prompt: str,
        system: str = "",
        json_mode: bool = False,
    ) -> dict:
        spec = MODEL_REGISTRY[model_key]
        result = {"model": model_key, "response": "", "tokens_in": 0, "tokens_out": 0}

        if spec.provider == "anthropic":
            result.update(await self._call_anthropic(spec, prompt, system, json_mode))
        elif spec.provider == "openai":
            result.update(await self._call_openai(spec, prompt, system, json_mode))
        elif spec.provider == "google":
            result.update(await self._call_google(spec, prompt, system, json_mode))
        elif spec.provider == "local":
            result.update(await self._call_local(spec, prompt, system, json_mode))

        cost = (result["tokens_in"] * spec.cost_per_1m_in +
                result["tokens_out"] * spec.cost_per_1m_out) / 1_000_000
        result["cost_usd"] = cost
        self.call_log.append(result)
        return result

    async def route(
        self,
        prompt: str,
        task: TaskType,
        complexity: Complexity,
        system: str = "",
        json_mode: bool = False,
    ) -> dict:
        """Route a task to the appropriate model(s) and return the result."""
        model = self.select_model(task, complexity)
        if isinstance(model, list):
            # Ensemble: run all in parallel
            tasks = [
                self.call_model(m, prompt, system, json_mode)
                for m in model
            ]
            results = await asyncio.gather(*tasks)
            return self._aggregate_ensemble(results, task)
        else:
            return await self.call_model(model, prompt, system, json_mode)

    def _aggregate_ensemble(self, results: list[dict], task: TaskType) -> dict:
        """Aggregate ensemble results; return disagreement score as uncertainty."""
        responses = [r.get("response", "") for r in results]
        total_cost = sum(r.get("cost_usd", 0) for r in results)

        if task in (TaskType.TRADING_SIGNAL, TaskType.RISK_ANALYSIS):
            # Parse structured decisions (and confidences) from each model
            decisions: list[str] = []
            confidences: list[float] = []
            for resp in responses:
                try:
                    d = json.loads(resp) if isinstance(resp, str) else resp
                    decisions.append(d.get("decision", "hold"))
                    confidences.append(float(d.get("confidence", 0.5)))
                except Exception:
                    decisions.append("hold")
                    confidences.append(0.0)
            vote_counts = Counter(decisions)
            majority_decision, majority_votes = vote_counts.most_common(1)[0]
            agreement_rate = majority_votes / len(decisions)
            uncertainty = 1.0 - agreement_rate
            return {
                "model": "ensemble",
                "response": majority_decision,
                "decisions": decisions,
                "agreement_rate": agreement_rate,
                "mean_confidence": sum(confidences) / len(confidences),
                "uncertainty": uncertainty,
                "cost_usd": total_cost,
                "should_pause": uncertainty > 0.5,  # models disagree too much
            }
        # For other tasks, return the longest/most detailed response
        best = max(results, key=lambda r: len(str(r.get("response", ""))))
        best["cost_usd"] = total_cost
        return best

Provider API Clients

Each provider requires a slightly different API structure. The router abstracts these behind a uniform interface; the methods below continue the ModelRouter class from the previous section.

    async def _call_anthropic(self, spec: ModelSpec, prompt: str, system: str, json_mode: bool) -> dict:
        payload = {
            "model": spec.model_id,
            "max_tokens": 2048,
            "messages": [{"role": "user", "content": prompt}],
        }
        if system:
            payload["system"] = system
        if json_mode:
            payload["system"] = (payload.get("system", "") +
                "\nRespond with valid JSON only. No markdown, no explanation.").strip()

        async with httpx.AsyncClient() as c:
            resp = await c.post(
                "https://api.anthropic.com/v1/messages",
                headers={
                    "x-api-key": self.keys["anthropic"],
                    "anthropic-version": "2023-06-01",
                    "content-type": "application/json",
                },
                json=payload, timeout=60.0
            )
            data = resp.json()
        return {
            "response": data["content"][0]["text"],
            "tokens_in": data["usage"]["input_tokens"],
            "tokens_out": data["usage"]["output_tokens"],
        }

    async def _call_openai(self, spec: ModelSpec, prompt: str, system: str, json_mode: bool) -> dict:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        payload = {"model": spec.model_id, "messages": messages, "max_tokens": 2048}
        if json_mode:
            payload["response_format"] = {"type": "json_object"}

        async with httpx.AsyncClient() as c:
            resp = await c.post(
                "https://api.openai.com/v1/chat/completions",
                headers={"Authorization": f"Bearer {self.keys['openai']}"},
                json=payload, timeout=60.0
            )
            data = resp.json()
        return {
            "response": data["choices"][0]["message"]["content"],
            "tokens_in": data["usage"]["prompt_tokens"],
            "tokens_out": data["usage"]["completion_tokens"],
        }

    async def _call_google(self, spec: ModelSpec, prompt: str, system: str, json_mode: bool = False) -> dict:
        full_prompt = f"{system}\n\n{prompt}" if system else prompt
        if json_mode:
            # Steer Gemini to JSON via the prompt rather than a response-format field
            full_prompt += "\nRespond with valid JSON only. No markdown, no explanation."
        async with httpx.AsyncClient() as c:
            resp = await c.post(
                f"https://generativelanguage.googleapis.com/v1beta/models/{spec.model_id}:generateContent",
                params={"key": self.keys["google"]},
                json={"contents": [{"parts": [{"text": full_prompt}]}]},
                timeout=60.0
            )
            data = resp.json()
        text = data["candidates"][0]["content"]["parts"][0]["text"]
        usage = data.get("usageMetadata", {})
        return {
            "response": text,
            "tokens_in": usage.get("promptTokenCount", 0),
            "tokens_out": usage.get("candidatesTokenCount", 0),
        }

    async def _call_local(self, spec: ModelSpec, prompt: str, system: str, json_mode: bool = False) -> dict:
        """Call a local Ollama instance (adapt the URL and payload for vLLM)."""
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        # Ollama uses its own model tag rather than spec.model_id (an HF-style id)
        payload = {"model": "llama3.1:8b", "messages": messages, "stream": False}
        if json_mode:
            payload["format"] = "json"  # Ollama's JSON output mode
        async with httpx.AsyncClient() as c:
            resp = await c.post(
                "http://localhost:11434/api/chat",
                json=payload,
                timeout=120.0
            )
            data = resp.json()
        return {
            "response": data["message"]["content"],
            "tokens_in": data.get("prompt_eval_count", 0),
            "tokens_out": data.get("eval_count", 0),
        }

Ensemble Decision Making

For high-stakes financial decisions, running the same prompt through multiple models and aggregating their votes is both more accurate and more auditable than a single model call. Model disagreement is a first-class signal.

TRADE_DECISION_PROMPT = """
You are a financial AI agent making a trading decision.

Asset: {symbol}
Current price: ${price}
24h change: {change_24h}%
NLP sentiment score: {nlp_score} (-1 bearish to +1 bullish)
RSI (14): {rsi}
Position size limit: ${max_position} USDC

Based on this data, decide whether to BUY, SELL, or HOLD.
Respond with JSON only:
{{
  "decision": "buy|sell|hold",
  "confidence": 0.0-1.0,
  "size_usdc": numeric,
  "reasoning": "one sentence"
}}
"""

async def ensemble_trade_decision(
    router: ModelRouter,
    symbol: str,
    market_data: dict,
    nlp_score: float,
) -> dict:
    prompt = TRADE_DECISION_PROMPT.format(
        symbol=symbol,
        price=market_data["price"],
        change_24h=market_data["change_24h"],
        nlp_score=round(nlp_score, 3),
        rsi=market_data.get("rsi", 50),
        max_position=500,
    )
    result = await router.route(
        prompt=prompt,
        task=TaskType.TRADING_SIGNAL,
        complexity=Complexity.CRITICAL,
        json_mode=True,
    )

    print(f"[ENSEMBLE] Decision: {result['response']}")
    print(f"[ENSEMBLE] Agreement: {result.get('agreement_rate', 1.0):.0%}")
    print(f"[ENSEMBLE] Uncertainty: {result.get('uncertainty', 0):.2f}")
    print(f"[ENSEMBLE] Should pause: {result.get('should_pause', False)}")
    print(f"[ENSEMBLE] Cost: ${result.get('cost_usd', 0):.4f}")

    return result

When to pause on disagreement

If the ensemble uncertainty exceeds 0.5 (less than 50% agreement), it indicates the market situation is genuinely ambiguous. The agent should either reduce position size by 50% or flag for human review rather than defaulting to the majority vote.
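A minimal sketch of that policy, operating on the result dict returned by _aggregate_ensemble above (the halve_on_disagreement flag is a hypothetical knob, not part of the router):

def apply_disagreement_policy(
    ensemble: dict,
    position_usd: float,
    halve_on_disagreement: bool = True,
) -> tuple[float, bool]:
    """Return (adjusted_position_usd, needs_human_review) from ensemble uncertainty."""
    if ensemble.get("uncertainty", 0.0) <= 0.5:
        return position_usd, False        # majority holds: trade at full size
    if halve_on_disagreement:
        return position_usd * 0.5, False  # ambiguous: cut the position in half
    return 0.0, True                      # or stand down and flag for human review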

Claude vs. GPT-4o vs. Gemini vs. Llama: Financial Task Benchmarks

Different models excel at different financial subtasks. These benchmarks reflect internal Purple Flea testing across 1,000+ financial prompts in early 2026.

Task | Claude Opus 4 | GPT-4o | Gemini Flash | Llama 3.1 8B
Multi-step risk reasoning | 95% | 88% | 61% | 54%
JSON extraction accuracy | 97% | 98% | 94% | 88%
Earnings transcript analysis | 94% | 89% | 71% | 62%
Python code gen (financial) | 91% | 94% | 72% | 76%
News sentiment classification | 89% | 87% | 85% | 80%
Portfolio optimization | 92% | 85% | 58% | 49%
Simple data categorization | 96% | 95% | 93% | 89%
Regulatory text interpretation | 93% | 84% | 60% | 51%

The key insight: for the simple classification tasks (news sentiment and simple data categorization), the quality gap between frontier and cheap models is small (6–10 points), but the cost gap is 750x. Route aggressively to cheaper models for routine work.

LLM Orchestration Patterns

Beyond simple routing, multi-model systems support several powerful orchestration patterns for complex financial workflows.

Sequential Pipeline (Cheap โ†’ Expensive)

A cheap model does a first-pass filter or draft; only if it returns low confidence does the task escalate to a more expensive model. This reduces frontier-model usage by 60–80% on workloads with a high share of easy cases (implemented as sequential_escalation below).

Specialization with Fusion

Route sub-components of a complex task to specialist models (e.g., GPT-4o for code, Claude for reasoning, Gemini for summarization) and fuse the outputs with a final model call. Beats any single model on multi-faceted tasks. A sketch (specialize_and_fuse) appears after the code examples below.

Adversarial Debate

One model argues for a trade, another argues against. A judge model evaluates the debate. Useful for high-conviction decisions where confirmation bias is a risk. Increases cost 3x but catches errors a single model misses.

Self-Consistency Voting

Run the same prompt through the same model multiple times at temperature > 0 and aggregate the responses. Effective when only one model is available but reliability needs improvement; typically 3–5 samples. A sketch (self_consistency_vote) follows the adversarial debate example below.

async def sequential_escalation(
    router: ModelRouter,
    prompt: str,
    task: TaskType,
    confidence_threshold: float = 0.75,
) -> dict:
    """
    Run cheap model first; escalate to expensive only if confidence is low.
    Assumes model response includes a confidence field.
    """
    # Stage 1: cheap model
    result = await router.route(prompt, task, Complexity.LOW, json_mode=True)
    try:
        data = json.loads(result["response"])
        confidence = float(data.get("confidence", 0))
    except Exception:
        confidence = 0.0

    if confidence >= confidence_threshold:
        print(f"[ESCALATION] Cheap model confident ({confidence:.0%}), no escalation")
        result["escalated"] = False
        return result

    # Stage 2: escalate
    print(f"[ESCALATION] Low confidence ({confidence:.0%}), escalating to frontier model")
    result = await router.route(prompt, task, Complexity.CRITICAL, json_mode=True)
    result["escalated"] = True
    return result

async def adversarial_debate(
    router: ModelRouter,
    asset: str,
    trade_thesis: str,
) -> dict:
    """Bull/bear debate between two models; judge decides."""
    bull_prompt = f"Argue STRONGLY for buying {asset}. Thesis to defend: {trade_thesis}. Be concise, 3 bullet points."
    bear_prompt = f"Argue STRONGLY against buying {asset}. Counter this thesis: {trade_thesis}. Be concise, 3 bullet points."

    bull, bear = await asyncio.gather(
        router.call_model("claude-sonnet", bull_prompt),
        router.call_model("gpt-4o", bear_prompt),
    )
    judge_prompt = f"""
Bull case:
{bull['response']}

Bear case:
{bear['response']}

As an impartial judge, evaluate both arguments and return JSON:
{{"winner": "bull|bear|draw", "confidence": 0.0-1.0, "key_reason": "one sentence"}}
"""
    judgment = await router.call_model("claude-opus", judge_prompt, json_mode=True)
    total_cost = bull["cost_usd"] + bear["cost_usd"] + judgment["cost_usd"]
    return {
        "judgment": json.loads(judgment["response"]),
        "bull_argument": bull["response"],
        "bear_argument": bear["response"],
        "total_cost_usd": total_cost,
    }
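
The specialization-with-fusion pattern, as a minimal sketch reusing the ModelRouter above. The sub-task split (summary, analysis code, risk list) and the model assignments are illustrative, not prescriptive:

async def specialize_and_fuse(router: ModelRouter, market_report: str) -> dict:
    """Route sub-tasks to specialist models, then fuse the outputs in one final call."""
    summary, code, risks = await asyncio.gather(
        router.call_model("gemini-flash", f"Summarize in 5 bullets:\n{market_report}"),
        router.call_model("gpt-4o", f"Write Python to compute 14-day RSI for the data described here:\n{market_report}"),
        router.call_model("claude-sonnet", f"List the top 3 risks in this report:\n{market_report}"),
    )
    fusion_prompt = (
        "Combine the following into one coherent trading brief.\n\n"
        f"Summary:\n{summary['response']}\n\n"
        f"Analysis code:\n{code['response']}\n\n"
        f"Risks:\n{risks['response']}"
    )
    fused = await router.call_model("claude-sonnet", fusion_prompt)
    fused["cost_usd"] += summary["cost_usd"] + code["cost_usd"] + risks["cost_usd"]
    return fused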
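And self-consistency voting, sketched under the assumption that repeated calls sample at each provider's default (nonzero) temperature, since call_model does not expose a temperature parameter:

async def self_consistency_vote(
    router: ModelRouter,
    prompt: str,
    model_key: str = "gpt-4o-mini",
    n_samples: int = 5,
) -> dict:
    """Sample the same model n times and majority-vote the parsed decisions."""
    results = await asyncio.gather(*[
        router.call_model(model_key, prompt, json_mode=True) for _ in range(n_samples)
    ])
    decisions = []
    for r in results:
        try:
            decisions.append(json.loads(r["response"]).get("decision", "hold"))
        except Exception:
            decisions.append("hold")
    winner, votes = Counter(decisions).most_common(1)[0]
    return {
        "model": model_key,
        "response": winner,
        "agreement_rate": votes / n_samples,
        "cost_usd": sum(r["cost_usd"] for r in results),
    }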

Purple Flea Integration Example

The following shows a complete integration: the multi-model router combined with Purple Flea's trading API to form a decision-making agent that uses model disagreement to size positions.

import asyncio
import json
from dataclasses import dataclass

import httpx

PURPLE_FLEA_API = "https://api.purpleflea.com"

@dataclass
class AgentConfig:
    pf_api_key: str = "pf_live_<your_key>"
    anthropic_key: str = ""
    openai_key: str = ""
    google_key: str = ""
    base_position_usd: float = 100.0
    max_position_usd: float = 500.0

async def run_decision_cycle(config: AgentConfig, symbol: str):
    """Full cycle: fetch market data โ†’ multi-model decision โ†’ execute trade."""
    router = ModelRouter({
        "anthropic": config.anthropic_key,
        "openai": config.openai_key,
        "google": config.google_key,
    })

    # 1. Fetch market data from Purple Flea
    async with httpx.AsyncClient() as c:
        resp = await c.get(
            f"{PURPLE_FLEA_API}/v1/market/summary/{symbol}",
            headers={"Authorization": f"Bearer {config.pf_api_key}"}
        )
        market = resp.json()

    # 2. Cheap model: classify market regime
    regime_prompt = f"""
Price: {market['price']}, RSI: {market['rsi']}, Volume ratio: {market['volume_ratio']:.2f}
Classify market regime: trending_up | trending_down | ranging | volatile
JSON only: {{"regime": "...", "confidence": 0.0-1.0}}
"""
    regime = await router.route(
        regime_prompt, TaskType.CLASSIFICATION, Complexity.LOW, json_mode=True
    )
    regime_data = json.loads(regime["response"])
    print(f"[REGIME] {regime_data['regime']} ({regime_data['confidence']:.0%}) | cost: ${regime['cost_usd']:.5f}")

    # 3. High-stakes: ensemble trade decision
    ensemble = await ensemble_trade_decision(router, symbol, market, nlp_score=0.0)

    if ensemble.get("should_pause"):
        print("[PAUSED] High model disagreement โ€” no trade")
        return

    # The ensemble's response is the majority-vote decision string, not JSON
    decision = ensemble["response"]
    model_confidence = ensemble.get("mean_confidence", 0.5)
    ensemble_agreement = ensemble.get("agreement_rate", 1.0)

    if decision == "hold":
        print("[HOLD] Ensemble decided to hold")
        return

    # 4. Scale position by (model_confidence * ensemble_agreement)
    conviction = model_confidence * ensemble_agreement
    position_usd = config.base_position_usd + (
        (config.max_position_usd - config.base_position_usd) * conviction
    )
    print(f"[TRADE] {decision.upper()} {symbol} | size=${position_usd:.0f} | conviction={conviction:.0%}")

    # 5. Execute via Purple Flea
    async with httpx.AsyncClient() as c:
        trade_resp = await c.post(
            f"{PURPLE_FLEA_API}/v1/trade/order",
            headers={"Authorization": f"Bearer {config.pf_api_key}"},
            json={
                "symbol": symbol,
                "side": decision,
                "amount_usdc": position_usd,
                "order_type": "market",
                "source": "multi_model_router",
                "metadata": {
                    "regime": regime_data["regime"],
                    "ensemble_agreement": ensemble_agreement,
                    "models_used": ["claude-opus", "gpt-4o", "claude-sonnet"],
                    "total_llm_cost_usd": ensemble.get("cost_usd", 0),
                }
            },
            timeout=15.0
        )
        print(f"[ORDER] {trade_resp.json()}")

# Run
config = AgentConfig(
    pf_api_key="pf_live_<your_key>",
    anthropic_key="sk-ant-...",
    openai_key="sk-...",
    google_key="AI...",
)
asyncio.run(run_decision_cycle(config, "BTC"))

Cost Optimization in Practice

A well-tuned multi-model system running 1,000 decision cycles per day can cost under $5/day in LLM fees by routing aggressively to cheap models. Here is a real breakdown from a Purple Flea test agent over 30 days:

Model | Calls / Day | Avg Tokens | Cost / Day | % of Calls
Llama 3.1 8B (local) | 620 | 180 | $0.002 | 62%
Gemini Flash | 210 | 320 | $0.016 | 21%
GPT-4o-mini | 100 | 500 | $0.038 | 10%
Claude Sonnet | 55 | 800 | $0.66 | 5.5%
Claude Opus (ensemble) | 15 | 1,200 | $1.35 | 1.5%
Total | 1,000 | n/a | $2.07 | 100%

The same workload routed to Claude Opus for every call would cost ~$90/day; multi-model routing achieves a 97.7% cost reduction with only minor quality degradation on routine tasks.

Monitoring and Observability

class RouterObservability:
    """Lightweight cost and quality tracking for the multi-model router."""
    def __init__(self):
        self.totals: dict[str, dict] = {}

    def record(self, result: dict):
        model = result.get("model", "unknown")
        if model not in self.totals:
            self.totals[model] = {"calls": 0, "cost_usd": 0, "tokens_in": 0, "tokens_out": 0}
        t = self.totals[model]
        t["calls"] += 1
        t["cost_usd"] += result.get("cost_usd", 0)
        t["tokens_in"] += result.get("tokens_in", 0)
        t["tokens_out"] += result.get("tokens_out", 0)

    def report(self) -> dict:
        total_cost = sum(v["cost_usd"] for v in self.totals.values())
        total_calls = sum(v["calls"] for v in self.totals.values())
        return {
            "total_cost_usd": round(total_cost, 4),
            "total_calls": total_calls,
            "cost_per_call": round(total_cost / max(total_calls, 1), 6),
            "by_model": {
                m: {**v, "pct_of_cost": f"{v['cost_usd']/max(total_cost,0.001)*100:.1f}%"}
                for m, v in self.totals.items()
            }
        }
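
A brief usage sketch, assuming a populated ModelRouter instance named router: every call_model invocation appends to router.call_log, so the tracker can be fed by replaying that log (or by calling record() directly inside call_model).

obs = RouterObservability()
for call in router.call_log:  # replay every logged LLM call
    obs.record(call)
print(json.dumps(obs.report(), indent=2))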

Conclusion

Multi-model agent systems represent the next evolution in financial AI: not a single generalist model making all decisions, but a coordinated team of specialists, routers, and ensemble voters, each component tuned for cost, speed, and accuracy on its specific task.

The patterns described here (sequential escalation, adversarial debate, disagreement-based uncertainty, and cost-tier routing) can cut LLM operating costs by 90%+ while improving decision quality on high-stakes trades by adding model diversity as a risk management tool.

Build your multi-model agent

Get your Purple Flea API key at purpleflea.com/register. New agents can use the Faucet to claim testnet funds and paper-trade multi-model decisions before going live. The NLP trading signals guide pairs well with the multi-model router for a complete autonomous trading system.