Connect Purple Flea Escrow to your vLLM serving stack for trustless per-request billing. Each batch creates an escrow, each completion triggers settlement — no invoices, no manual payouts, no cloud lock-in.
vLLM's AsyncLLMEngine can serve thousands of requests per second across distributed GPU workers.
But billing those workers fairly — especially in multi-tenant or decentralised deployments — still
requires spreadsheets, invoices, or expensive cloud payment rails.
Purple Flea Escrow is purpose-built for agent-to-agent payments. An orchestrator agent creates an escrow per batch, worker agents stream tokens, and funds release automatically on verified completion. No human approval step. No bank transfer. No net-30 terms.
Track cost at the individual request level — not aggregate invoices. Settle each batch as it completes.
Escrow funds are locked before work starts. Workers are guaranteed payment; orchestrators are guaranteed delivery.
LLaMA 3, Mistral, Qwen, Gemma — if vLLM serves it, Purple Flea can bill it. Model-agnostic escrow logic.
The billing flow maps cleanly onto vLLM's request lifecycle. Each inference job follows a three-phase escrow pattern: lock, serve, release.
The orchestrator agent calls POST /escrow/create with
the estimated cost (tokens × price-per-token). Funds are escrowed before
engine.generate() is called.
vLLM workers process the request. Streaming output is monitored; token counts are tracked in real time. The escrow ID travels with the request as metadata.
On stream completion, the orchestrator calls POST /escrow/release
with actual token count. Overpayment refunds automatically; worker wallet credited instantly.
Every escrow record holds prompt tokens, completion tokens, model ID, and worker wallet address.
Query GET /escrow/history for per-model cost analytics.
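As a sketch of what per-model analytics over that history could look like, here is a small aggregator. The record fields (`model`, `prompt_tokens`, `completion_tokens`, `amount_paid_usdc`) follow the escrow fields described above, but the exact `GET /escrow/history` response shape is an assumption:

```python
from collections import defaultdict


def cost_by_model(history: list[dict]) -> dict[str, dict]:
    """Aggregate escrow history records into per-model totals.

    Assumes each record carries `model`, `prompt_tokens`,
    `completion_tokens`, and `amount_paid_usdc`; adjust field
    names to match the actual API response.
    """
    totals: dict[str, dict] = defaultdict(
        lambda: {"requests": 0, "tokens": 0, "spend_usdc": 0.0}
    )
    for rec in history:
        bucket = totals[rec["model"]]
        bucket["requests"] += 1
        bucket["tokens"] += rec["prompt_tokens"] + rec["completion_tokens"]
        bucket["spend_usdc"] += rec["amount_paid_usdc"]
    return dict(totals)
```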
Token-level precision: Escrow amounts are calculated as
(prompt_tokens + max_completion_tokens) × price_per_token.
At release, the actual completion token count is used and the difference refunded within the same API call.
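To make the arithmetic concrete, here is a minimal sketch of the lock and settle calculations, using the 0.000002 USDC/token price that appears as the default elsewhere in this guide (the helper names are illustrative, not part of the API):

```python
PRICE_PER_TOKEN = 0.000002  # USDC per token (illustrative default)


def estimate_escrow(prompt_tokens: int, max_completion_tokens: int) -> float:
    """Amount locked before engine.generate() is called."""
    return (prompt_tokens + max_completion_tokens) * PRICE_PER_TOKEN


def settle(prompt_tokens: int, max_completion_tokens: int,
           actual_completion_tokens: int) -> tuple[float, float]:
    """Return (amount_paid, refund) at release time."""
    locked = estimate_escrow(prompt_tokens, max_completion_tokens)
    paid = (prompt_tokens + actual_completion_tokens) * PRICE_PER_TOKEN
    return paid, locked - paid


# A 200-token prompt capped at 512 completion tokens locks
# (200 + 512) * 0.000002 = 0.001424 USDC up front; if only 387
# tokens are generated, the 125 unused tokens are refunded.
paid, refund = settle(200, 512, 387)
```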
Drop this pattern into your vLLM serving layer. The PurpleFleaEscrowClient
wraps the escrow REST API and tracks per-request cost automatically.
```python
from dataclasses import dataclass

import httpx

ESCROW_BASE = "https://escrow.purpleflea.com"

# Replace with your actual Purple Flea API key
API_KEY = "pf_live_your_key_here"


@dataclass
class EscrowRecord:
    escrow_id: str
    worker_wallet: str
    estimated_cost_usdc: float
    prompt_tokens: int
    max_completion_tokens: int
    model: str


class PurpleFleaEscrowClient:
    def __init__(self, api_key: str, price_per_token: float = 0.000002):
        self.api_key = api_key
        self.price_per_token = price_per_token  # USDC per token
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        }

    async def create_escrow(
        self,
        worker_wallet: str,
        prompt_tokens: int,
        max_completion_tokens: int,
        model: str,
    ) -> EscrowRecord:
        total_tokens = prompt_tokens + max_completion_tokens
        estimated_cost = total_tokens * self.price_per_token
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{ESCROW_BASE}/escrow/create",
                headers=self.headers,
                json={
                    "recipient_wallet": worker_wallet,
                    "amount_usdc": estimated_cost,
                    "metadata": {
                        "model": model,
                        "prompt_tokens": prompt_tokens,
                        "max_completion_tokens": max_completion_tokens,
                    },
                },
            )
            resp.raise_for_status()
            data = resp.json()
        return EscrowRecord(
            escrow_id=data["escrow_id"],
            worker_wallet=worker_wallet,
            estimated_cost_usdc=estimated_cost,
            prompt_tokens=prompt_tokens,
            max_completion_tokens=max_completion_tokens,
            model=model,
        )

    async def release_escrow(
        self,
        escrow_id: str,
        actual_completion_tokens: int,
    ) -> dict:
        # Calculates refund automatically if actual < estimated
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{ESCROW_BASE}/escrow/release",
                headers=self.headers,
                json={
                    "escrow_id": escrow_id,
                    "actual_completion_tokens": actual_completion_tokens,
                    "price_per_token": self.price_per_token,
                },
            )
            resp.raise_for_status()
            return resp.json()

    async def cancel_escrow(self, escrow_id: str) -> dict:
        # Called on error: refunds full amount to orchestrator
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{ESCROW_BASE}/escrow/cancel",
                headers=self.headers,
                json={"escrow_id": escrow_id},
            )
            resp.raise_for_status()
            return resp.json()
```
```python
import asyncio

from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
from vllm.utils import random_uuid

from escrow_client import PurpleFleaEscrowClient

# ── Engine init ──────────────────────────────────────────────
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    max_model_len=8192,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
escrow = PurpleFleaEscrowClient(api_key="pf_live_your_key_here")


# ── Per-request cost tracker ─────────────────────────────────
async def generate_with_billing(
    prompt: str,
    worker_wallet: str,
    max_tokens: int = 512,
    temperature: float = 0.7,
) -> str:
    request_id = random_uuid()
    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
    )

    # Estimate prompt token count (tokenizer call)
    tokenizer = await engine.get_tokenizer()
    prompt_tokens = len(tokenizer.encode(prompt))

    # Lock escrow before work starts
    record = await escrow.create_escrow(
        worker_wallet=worker_wallet,
        prompt_tokens=prompt_tokens,
        max_completion_tokens=max_tokens,
        model="llama-3-8b-instruct",
    )
    print(
        f"[billing] Escrow {record.escrow_id} locked: "
        f"${record.estimated_cost_usdc:.6f} USDC"
    )

    full_output = ""
    completion_tokens = 0
    try:
        # Stream tokens from vLLM
        async for output in engine.generate(prompt, sampling_params, request_id):
            if output.outputs:
                full_output = output.outputs[0].text
                completion_tokens = len(output.outputs[0].token_ids)

        # Release with actual token count; overpayment auto-refunded
        result = await escrow.release_escrow(
            escrow_id=record.escrow_id,
            actual_completion_tokens=completion_tokens,
        )
        print(
            f"[billing] Released: {completion_tokens} tokens, "
            f"paid ${result['amount_paid']:.6f} USDC"
        )
        return full_output
    except Exception as e:
        # Cancel escrow on error: full refund
        await escrow.cancel_escrow(record.escrow_id)
        print(f"[billing] Escrow {record.escrow_id} cancelled: {e}")
        raise


# ── Worker settlement loop ───────────────────────────────────
async def batch_inference(requests: list[dict]) -> list[str | Exception]:
    """Run multiple inference requests, each with independent escrow."""
    tasks = [
        generate_with_billing(
            prompt=req["prompt"],
            worker_wallet=req["worker_wallet"],
            max_tokens=req.get("max_tokens", 512),
        )
        for req in requests
    ]
    return await asyncio.gather(*tasks, return_exceptions=True)
```
vLLM orchestrator agents can use Purple Flea's MCP servers directly — no REST client needed.
Add both faucet and
escrow MCP servers to your
Claude / agent config and get tool calls for create, release, cancel, and history.
```json
{
  "mcpServers": {
    "purpleflea-faucet": {
      "type": "streamable-http",
      "url": "https://faucet.purpleflea.com/mcp",
      "description": "Claim free USDC for new vLLM worker agents"
    },
    "purpleflea-escrow": {
      "type": "streamable-http",
      "url": "https://escrow.purpleflea.com/mcp",
      "headers": {
        "Authorization": "Bearer pf_live_your_key_here"
      },
      "description": "Escrow create, release, cancel, history for inference billing"
    }
  }
}
```
Register a new worker agent wallet and claim initial USDC balance to bootstrap operations.
Lock funds before inference starts. Returns escrow_id to track the job.
Release payment to worker on completion. Pass actual token count; overpayment auto-refunds.
Smithery registry: Both MCP servers are listed at smithery.ai/servers/purpleflea/faucet and smithery.ai/servers/purpleflea/escrow. One-click install config available there for Claude Desktop and compatible clients.
Purple Flea Escrow supports conditional release with bonus parameters, enabling orchestrators to reward workers for speed, quality, and throughput SLAs — not just token count.
Workers that sustain >100 tokens/sec receive a 5% bonus on the base escrow amount. Measured from first token to EOS. Tracked in the escrow metadata field.
```python
# Bonus fields on create
json={
    # ...existing create fields...
    "bonus_conditions": {
        "throughput_tps_min": 100,
        "bonus_pct": 5,
    },
}
```
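To make the gate concrete, here is a sketch of how a throughput-gated payout resolves. The actual calculation happens server-side on Purple Flea; the function name, signature, and stacking behaviour here are illustrative:

```python
def throughput_payout(base_usdc: float, completion_tokens: int,
                      first_token_s: float, eos_s: float,
                      tps_min: float = 100, bonus_pct: float = 5) -> float:
    """Apply the throughput bonus if sustained tokens/sec,
    measured from first token to EOS, clears the gate."""
    elapsed = eos_s - first_token_s
    tps = completion_tokens / elapsed if elapsed > 0 else 0.0
    if tps > tps_min:
        return base_usdc * (1 + bonus_pct / 100)
    return base_usdc


# 512 tokens in 4 s is 128 tok/s, so a 0.001024 USDC base
# escrow earns the 5% bonus; 300 tokens in 4 s (75 tok/s) does not.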
Orchestrators can route output through a judge model and submit a quality score (0–1) at release time. Workers scoring >0.9 receive an additional 10% bonus.
```python
# Score-gated release
json={
    "escrow_id": record.escrow_id,
    "quality_score": 0.94,  # 0-1
    "actual_completion_tokens": 387,
}
```
Worker agents that refer other workers earn 15% of escrow fees on all their referrals' jobs.
Pass a referral_code at create time.
Workers maintaining a 7-day uptime streak (no cancelled jobs) qualify for a 2% rolling multiplier on all escrow payouts. Tracked server-side against wallet address.
```python
import time
from typing import Optional


async def generate_with_performance_bonus(
    prompt: str,
    worker_wallet: str,
    max_tokens: int = 512,
) -> dict:
    """Track throughput and submit quality-gated release."""
    tokenizer = await engine.get_tokenizer()
    prompt_tokens = len(tokenizer.encode(prompt))
    request_id = random_uuid()

    record = await escrow.create_escrow(
        worker_wallet=worker_wallet,
        prompt_tokens=prompt_tokens,
        max_completion_tokens=max_tokens,
        model="llama-3-8b-instruct",
    )

    start_time = time.monotonic()
    first_token_time: Optional[float] = None
    full_output = ""
    completion_tokens = 0

    async for output in engine.generate(
        prompt, SamplingParams(max_tokens=max_tokens), request_id
    ):
        if output.outputs:
            if first_token_time is None:
                first_token_time = time.monotonic()
            full_output = output.outputs[0].text
            completion_tokens = len(output.outputs[0].token_ids)

    elapsed = time.monotonic() - (first_token_time or start_time)
    tps = completion_tokens / elapsed if elapsed > 0 else 0

    # Assess quality with a lightweight judge (optional);
    # assess_quality is your own judge-model helper, not shown here
    quality_score = await assess_quality(prompt, full_output)

    result = await escrow.release_escrow(
        escrow_id=record.escrow_id,
        actual_completion_tokens=completion_tokens,
    )
    return {
        "output": full_output,
        "tokens_per_sec": round(tps, 1),
        "quality_score": quality_score,
        "amount_paid_usdc": result["amount_paid"],
        "escrow_id": record.escrow_id,
    }
```
How does self-hosted vLLM with Purple Flea billing compare to manual invoicing and managed cloud inference APIs?
| Feature | vLLM + Purple Flea | Manual / Invoice | Cloud APIs (OpenAI, etc.) |
|---|---|---|---|
| Billing granularity | ✓ Per-request, per-token | ∼ Monthly invoice | ✓ Per-token |
| Settlement speed | ✓ <2 seconds | ✗ Net-30 / net-60 | ∼ Monthly credit cycle |
| Trustless payment | ✓ Escrow-locked funds | ✗ Trust-based | ✗ Centralised |
| Model choice | ✓ Any open-source model | ✓ Any model | ✗ Provider models only |
| Performance bonuses | ✓ Throughput + quality | ✗ Manual negotiation | ✗ No worker concept |
| Referral revenue | ✓ 15% on fees | ✗ None | ✗ None |
| Escrow cancellation | ✓ Full refund on error | ✗ Dispute required | ∼ Rate limit credit |
| MCP tool support | ✓ Faucet + Escrow MCP | ✗ None | ∼ Limited |
| Agent bootstrapping | ✓ Free faucet USDC | ✗ None | ✗ Requires credit card |
| Infrastructure cost | ✓ 1% escrow fee only | ∼ Accounting overhead | ✗ Markup on tokens |
From zero to metered vLLM billing in under ten minutes. You need a vLLM installation, a Purple Flea API key, and a worker wallet address.
Hit the faucet to bootstrap both wallets with free USDC. The orchestrator funds escrow; workers receive settlement.
```bash
# Register orchestrator agent
curl -X POST https://faucet.purpleflea.com/register \
  -H "Authorization: Bearer pf_live_your_key_here" \
  -d '{"agent_id": "vllm-orchestrator-1"}'

# Claim free USDC for each worker
curl -X POST https://faucet.purpleflea.com/claim \
  -H "Authorization: Bearer pf_live_your_key_here" \
  -d '{"agent_id": "vllm-worker-1"}'
```
Copy escrow_client.py from the code section above into your vLLM project directory.
Wrap your engine.generate() calls with generate_with_billing().
```bash
pip install httpx vllm

# Set your key as an env variable
export PURPLE_FLEA_API_KEY="pf_live_your_key_here"
```
For agent orchestrators running on Claude or compatible runtimes, add the MCP config block
from the section above. Your orchestrator can then call escrow_create and
escrow_release as native tool calls — no HTTP client needed.
Test the integration by running a single request and checking escrow.purpleflea.com for the settlement record.
New to Purple Flea? Start with the Agent Faucet to claim free USDC, then explore the Escrow API docs and the Agent Handbook for the full financial infrastructure overview. Research paper: doi.org/10.5281/zenodo.18808440.