◆ vLLM Integration Guide

Pay vLLM Inference Workers Automatically

Connect Purple Flea Escrow to your vLLM serving stack for trustless per-request billing. Each batch creates an escrow, each completion triggers settlement — no invoices, no manual payouts, no cloud lock-in.

1% Escrow fee
15% Referral on fees
<2s Settlement latency
USDC Settlement currency

vLLM Scales Inference — But Billing Doesn't Scale With It

vLLM's AsyncLLMEngine can serve thousands of requests per second across distributed GPU workers. But billing those workers fairly — especially in multi-tenant or decentralised deployments — still requires spreadsheets, invoices, or expensive cloud payment rails.

Purple Flea Escrow is purpose-built for agent-to-agent payments. An orchestrator agent creates an escrow per batch, worker agents stream tokens, and funds release automatically on verified completion. No human approval step. No bank transfer. No net-30 terms.

Per-Request Granularity

Track cost at the individual request level — not aggregate invoices. Settle each batch as it completes.

🔒

Trustless by Design

Escrow funds are locked before work starts. Workers are guaranteed payment; orchestrators are guaranteed delivery.

🚀

Works With Any Model

LLaMA 3, Mistral, Qwen, Gemma — if vLLM serves it, Purple Flea can bill it. Model-agnostic escrow logic.

Per-Request Metered Billing with Escrow

The billing flow maps cleanly onto vLLM's request lifecycle. Each inference job follows a three-phase escrow pattern: lock, serve, release.

🔒

Phase 1 — Lock

The orchestrator agent calls POST /escrow/create with the estimated cost (tokens × price-per-token). Funds are escrowed before engine.generate() is called.

Phase 2 — Serve

vLLM workers process the request. Streaming output is monitored; token counts are tracked in real time. The escrow ID travels with the request as metadata.

Phase 3 — Release

On stream completion, the orchestrator calls POST /escrow/release with the actual token count. Any overpayment is refunded automatically, and the worker wallet is credited instantly.

📈

Cost Tracking

Every escrow record holds prompt tokens, completion tokens, model ID, and worker wallet address. Query GET /escrow/history for per-model cost analytics.
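The history records can be rolled up client-side for the per-model analytics mentioned above. A minimal sketch, assuming each record returned by GET /escrow/history carries a `model` field and a settled `amount_usdc` field (the exact response shape is an assumption):

```python
# Aggregate settled spend per model ID from escrow history records.
# Field names mirror the escrow metadata described above; the response
# shape of GET /escrow/history is an assumption.
def cost_by_model(records: list[dict]) -> dict[str, float]:
    """Roll escrow history up into total spend per model ID."""
    totals: dict[str, float] = {}
    for rec in records:
        totals[rec["model"]] = totals.get(rec["model"], 0.0) + rec["amount_usdc"]
    return totals

# Example with three settled records:
history = [
    {"model": "llama-3-8b-instruct", "amount_usdc": 0.002048},
    {"model": "llama-3-8b-instruct", "amount_usdc": 0.001024},
    {"model": "mistral-7b", "amount_usdc": 0.004000},
]
print(cost_by_model(history))
```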

Token-level precision: Escrow amounts are calculated as (prompt_tokens + max_completion_tokens) × price_per_token. At release, the actual completion token count is used and the difference refunded within the same API call.
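Worked through with concrete numbers, the precision rule looks like this (the 0.000002 USDC price matches the client default used in the code below; the prompt and token counts are illustrative):

```python
# Worked example of the token-level precision rule above.
PRICE_PER_TOKEN = 0.000002  # USDC per token; the client default used later

def escrow_amount(prompt_tokens: int, max_completion_tokens: int) -> float:
    """Amount locked up front: worst-case token spend."""
    return (prompt_tokens + max_completion_tokens) * PRICE_PER_TOKEN

def release_refund(max_completion_tokens: int, actual_completion_tokens: int) -> float:
    """Refunded at release: the unused completion budget."""
    return (max_completion_tokens - actual_completion_tokens) * PRICE_PER_TOKEN

locked = escrow_amount(120, 512)    # 632 tokens locked up front
refund = release_refund(512, 387)   # 125 unused tokens refunded at release
print(f"locked ${locked:.6f}, refunded ${refund:.6f}")
```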

AsyncLLMEngine + Escrow Integration

Drop this pattern into your vLLM serving layer. The PurpleFleatEscrowClient wraps the escrow REST API and tracks per-request cost automatically.

escrow_client.py
import httpx
import asyncio
from dataclasses import dataclass
from typing import Optional

ESCROW_BASE = "https://escrow.purpleflea.com"

# Replace with your actual Purple Flea API key
API_KEY = "pf_live_your_key_here"

@dataclass
class EscrowRecord:
    escrow_id: str
    worker_wallet: str
    estimated_cost_usdc: float
    prompt_tokens: int
    max_completion_tokens: int
    model: str

class PurpleFleatEscrowClient:
    def __init__(self, api_key: str, price_per_token: float = 0.000002):
        self.api_key = api_key
        self.price_per_token = price_per_token  # USDC per token
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        }

    async def create_escrow(
        self,
        worker_wallet: str,
        prompt_tokens: int,
        max_completion_tokens: int,
        model: str,
    ) -> EscrowRecord:
        total_tokens = prompt_tokens + max_completion_tokens
        estimated_cost = total_tokens * self.price_per_token

        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{ESCROW_BASE}/escrow/create",
                headers=self.headers,
                json={
                    "recipient_wallet": worker_wallet,
                    "amount_usdc": estimated_cost,
                    "metadata": {
                        "model": model,
                        "prompt_tokens": prompt_tokens,
                        "max_completion_tokens": max_completion_tokens,
                    },
                },
            )
            resp.raise_for_status()
            data = resp.json()

        return EscrowRecord(
            escrow_id=data["escrow_id"],
            worker_wallet=worker_wallet,
            estimated_cost_usdc=estimated_cost,
            prompt_tokens=prompt_tokens,
            max_completion_tokens=max_completion_tokens,
            model=model,
        )

    async def release_escrow(
        self,
        escrow_id: str,
        actual_completion_tokens: int,
    ) -> dict:
        # Calculates refund automatically if actual < estimated
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{ESCROW_BASE}/escrow/release",
                headers=self.headers,
                json={
                    "escrow_id": escrow_id,
                    "actual_completion_tokens": actual_completion_tokens,
                    "price_per_token": self.price_per_token,
                },
            )
            resp.raise_for_status()
            return resp.json()

    async def cancel_escrow(self, escrow_id: str) -> dict:
        # Called on error — refunds full amount to orchestrator
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{ESCROW_BASE}/escrow/cancel",
                headers=self.headers,
                json={"escrow_id": escrow_id},
            )
            resp.raise_for_status()
            return resp.json()
vllm_billing_server.py
import asyncio
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.utils import random_uuid
from escrow_client import PurpleFleatEscrowClient

# ── Engine init ──────────────────────────────────────────────
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    max_model_len=8192,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
escrow = PurpleFleatEscrowClient(api_key="pf_live_your_key_here")

# ── Per-request cost tracker ─────────────────────────────────
async def generate_with_billing(
    prompt: str,
    worker_wallet: str,
    max_tokens: int = 512,
    temperature: float = 0.7,
) -> str:
    request_id = random_uuid()
    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
    )

    # Estimate prompt token count (tokenizer call)
    tokenizer = await engine.get_tokenizer()
    prompt_tokens = len(tokenizer.encode(prompt))

    # Lock escrow before work starts
    record = await escrow.create_escrow(
        worker_wallet=worker_wallet,
        prompt_tokens=prompt_tokens,
        max_completion_tokens=max_tokens,
        model="llama-3-8b-instruct",
    )
    print(f"[billing] Escrow {record.escrow_id} locked: ${record.estimated_cost_usdc:.6f} USDC")

    full_output = ""
    completion_tokens = 0

    try:
        # Stream tokens from vLLM
        async for output in engine.generate(prompt, sampling_params, request_id):
            if output.outputs:
                full_output = output.outputs[0].text
                completion_tokens = len(output.outputs[0].token_ids)

        # Release with actual token count — overpayment auto-refunded
        result = await escrow.release_escrow(
            escrow_id=record.escrow_id,
            actual_completion_tokens=completion_tokens,
        )
        print(f"[billing] Released: {completion_tokens} tokens, paid ${result['amount_paid']:.6f} USDC")
        return full_output

    except Exception as e:
        # Cancel escrow on error — full refund
        await escrow.cancel_escrow(record.escrow_id)
        print(f"[billing] Escrow {record.escrow_id} cancelled: {e}")
        raise

# ── Worker settlement loop ───────────────────────────────────
async def batch_inference(requests: list[dict]) -> list[str | BaseException]:
    """Run multiple inference requests, each with independent escrow."""
    tasks = [
        generate_with_billing(
            prompt=req["prompt"],
            worker_wallet=req["worker_wallet"],
            max_tokens=req.get("max_tokens", 512),
        )
        for req in requests
    ]
    return await asyncio.gather(*tasks, return_exceptions=True)
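Because asyncio.gather is called with return_exceptions=True, a cancelled escrow surfaces as an exception object in the result list instead of aborting the whole batch. A small illustrative helper (not part of the client) for separating the two cases:

```python
def partition_results(results: list) -> tuple[list, list]:
    """Split gather(..., return_exceptions=True) output into completed
    outputs and failures, preserving order within each group."""
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed

# A batch where one request's escrow was cancelled mid-flight:
mixed = ["output A", RuntimeError("escrow cancelled"), "output C"]
ok, failed = partition_results(mixed)
print(len(ok), len(failed))
```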

MCP Config for vLLM Agent Infrastructure

vLLM orchestrator agents can use Purple Flea's MCP servers directly — no REST client needed. Add both faucet and escrow MCP servers to your Claude / agent config and get tool calls for create, release, cancel, and history.

{
  "mcpServers": {
    "purpleflea-faucet": {
      "type": "streamable-http",
      "url":  "https://faucet.purpleflea.com/mcp",
      "description": "Claim free USDC for new vLLM worker agents"
    },
    "purpleflea-escrow": {
      "type": "streamable-http",
      "url":  "https://escrow.purpleflea.com/mcp",
      "headers": {
        "Authorization": "Bearer pf_live_your_key_here"
      },
      "description": "Escrow create, release, cancel, history for inference billing"
    }
  }
}

Faucet MCP: register_agent

Register a new worker agent wallet and claim initial USDC balance to bootstrap operations.

Escrow MCP: escrow_create

Lock funds before inference starts. Returns escrow_id to track the job.

Escrow MCP: escrow_release

Release payment to worker on completion. Pass actual token count; overpayment auto-refunds.

Smithery registry: Both MCP servers are listed at smithery.ai/servers/purpleflea/faucet and smithery.ai/servers/purpleflea/escrow. One-click install config available there for Claude Desktop and compatible clients.

Worker Incentive Model — Performance Bonuses via Escrow

Purple Flea Escrow supports conditional release with bonus parameters, enabling orchestrators to reward workers for speed, quality, and throughput SLAs — not just token count.

Throughput Bonus

Workers that sustain >100 tokens/sec receive a 5% bonus on the base escrow amount. Measured from first token to EOS. Tracked in the escrow metadata field.

# Bonus fields on create
json={
  ...,
  "bonus_conditions": {
    "throughput_tps_min": 100,
    "bonus_pct": 5
  }
}
🎯

Quality Score Bonus

Orchestrators can route output through a judge model and submit a quality score (0–1) at release time. Workers scoring >0.9 receive an additional 10% bonus.

# Score-gated release
json={
  "escrow_id": record.escrow_id,
  "quality_score": 0.94,  # 0–1
  "actual_completion_tokens": 387
}
👥

Referral Revenue Share

Worker agents that refer other workers earn 15% of escrow fees on all their referrals' jobs. Pass a referral_code at create time.
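Following the pattern of the create-time snippets above, the referral code would ride along in the create body (the exact field name is an assumption, mirroring the bonus_conditions example):

```
# Referral code on create (field name is assumed)
json={
  ...,
  "referral_code": "pf_ref_worker_1"
}
```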

🔥

Streak Multipliers

Workers maintaining a 7-day uptime streak (no cancelled jobs) qualify for a 2% rolling multiplier on all escrow payouts. Tracked server-side against wallet address.
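Taken together, the three incentives above could stack like this. Whether Purple Flea applies the percentages additively or compounds them is not specified, so additive stacking is an assumption in this sketch:

```python
def payout_with_bonuses(base_usdc: float, sustained_tps: float,
                        quality_score: float, has_streak: bool) -> float:
    """Apply the documented bonus percentages additively (an assumption;
    the API may compound them instead)."""
    pct = 0.0
    if sustained_tps > 100:   # throughput bonus: +5%
        pct += 0.05
    if quality_score > 0.9:   # quality score bonus: +10%
        pct += 0.10
    if has_streak:            # 7-day streak multiplier: +2%
        pct += 0.02
    return base_usdc * (1 + pct)

# All three bonuses earned on a $0.001264 base escrow:
print(f"{payout_with_bonuses(0.001264, 130, 0.94, True):.8f}")
```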

worker_settlement.py
import time
from typing import Optional

from vllm import SamplingParams
from vllm.utils import random_uuid

# Reuses the engine and escrow client initialised in vllm_billing_server.py
from vllm_billing_server import engine, escrow

async def generate_with_performance_bonus(
    prompt: str,
    worker_wallet: str,
    max_tokens: int = 512,
) -> dict:
    """Track throughput and submit quality-gated release."""
    tokenizer = await engine.get_tokenizer()
    prompt_tokens = len(tokenizer.encode(prompt))
    request_id = random_uuid()

    record = await escrow.create_escrow(
        worker_wallet=worker_wallet,
        prompt_tokens=prompt_tokens,
        max_completion_tokens=max_tokens,
        model="llama-3-8b-instruct",
    )

    start_time = time.monotonic()
    first_token_time: Optional[float] = None
    full_output = ""
    completion_tokens = 0

    async for output in engine.generate(
        prompt, SamplingParams(max_tokens=max_tokens), request_id
    ):
        if output.outputs:
            if first_token_time is None:
                first_token_time = time.monotonic()
            full_output = output.outputs[0].text
            completion_tokens = len(output.outputs[0].token_ids)

    elapsed = time.monotonic() - (first_token_time or start_time)
    tps = completion_tokens / elapsed if elapsed > 0 else 0

    # Assess quality with a lightweight judge (optional). assess_quality
    # is a user-supplied coroutine (e.g. a judge-model call), not part of
    # the escrow client.
    quality_score = await assess_quality(prompt, full_output)

    # release_escrow as defined above settles on token count only; to gate
    # on quality_score, include it in a raw POST /escrow/release body as in
    # the score-gated release snippet earlier.
    result = await escrow.release_escrow(
        escrow_id=record.escrow_id,
        actual_completion_tokens=completion_tokens,
    )

    return {
        "output": full_output,
        "tokens_per_sec": round(tps, 1),
        "quality_score": quality_score,
        "amount_paid_usdc": result["amount_paid"],
        "escrow_id": record.escrow_id,
    }

vLLM + Purple Flea vs. Alternatives

How does self-hosted vLLM with Purple Flea billing compare to manual invoicing and managed cloud inference APIs?

| Feature             | vLLM + Purple Flea     | Manual / Invoice    | Cloud APIs (OpenAI, etc.) |
|---------------------|------------------------|---------------------|---------------------------|
| Billing granularity | Per-request, per-token | Monthly invoice     | Per-token                 |
| Settlement speed    | <2 seconds             | Net-30 / net-60     | Monthly credit cycle      |
| Trustless payment   | Escrow-locked funds    | Trust-based         | Centralised               |
| Model choice        | Any open-source model  | Any model           | Provider models only      |
| Performance bonuses | Throughput + quality   | Manual negotiation  | No worker concept         |
| Referral revenue    | 15% on fees            | None                | None                      |
| Escrow cancellation | Full refund on error   | Dispute required    | Rate limit credit         |
| MCP tool support    | Faucet + Escrow MCP    | None                | Limited                   |
| Agent bootstrapping | Free faucet USDC       | None                | Requires credit card      |
| Infrastructure cost | 1% escrow fee only     | Accounting overhead | Markup on tokens          |

Up and Running in 3 Steps

From zero to metered vLLM billing in under ten minutes. You need a vLLM installation, a Purple Flea API key, and a worker wallet address.

1

Register your orchestrator and worker wallets

Hit the faucet to bootstrap both wallets with free USDC. The orchestrator funds escrow; workers receive settlement.

# Register orchestrator agent
curl -X POST https://faucet.purpleflea.com/register \
  -H "Authorization: Bearer pf_live_your_key_here" \
  -d '{"agent_id": "vllm-orchestrator-1"}'

# Claim free USDC for each worker
curl -X POST https://faucet.purpleflea.com/claim \
  -H "Authorization: Bearer pf_live_your_key_here" \
  -d '{"agent_id": "vllm-worker-1"}'
2

Install the escrow client and patch your engine

Copy escrow_client.py from the code section above into your vLLM project directory. Wrap your engine.generate() calls with generate_with_billing().

pip install httpx vllm

# Set your key as an env variable
export PURPLE_FLEA_API_KEY="pf_live_your_key_here"
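Reading the exported key at startup keeps it out of source control and fails fast if it is missing. A minimal sketch (load_api_key is illustrative, not part of any SDK):

```python
import os

def load_api_key(env_var: str = "PURPLE_FLEA_API_KEY") -> str:
    """Fetch the API key from the environment, failing fast rather than
    sending unauthenticated requests to the escrow API."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before starting the server")
    return key
```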
3

Add MCP servers to your agent config (optional)

For agent orchestrators running on Claude or compatible runtimes, add the MCP config block from the section above. Your orchestrator can then call escrow_create and escrow_release as native tool calls — no HTTP client needed.

Test the integration by running a single request and checking escrow.purpleflea.com for the settlement record.

New to Purple Flea? Start with the Agent Faucet to claim free USDC, then explore the Escrow API docs and the Agent Handbook for the full financial infrastructure overview. Research paper: doi.org/10.5281/zenodo.18808440.