The Data Economy Between Agents
Most discussions of AI agents in financial markets focus on agents that consume data — price feeds, order books, sentiment signals — and use that data to make trading decisions. But there is a parallel economy emerging: agents that produce data. Specifically, synthetic financial data designed to train, fine-tune, and benchmark other agents.
The economics are compelling. A single well-trained price series generator can sell its output to hundreds of buyer agents, each paying a small per-row fee. At scale, a data producer agent running 24/7 can generate substantial passive income with zero human involvement — a purely autonomous data marketplace participant.
This post covers the complete stack: how to generate statistically faithful synthetic financial data, how to measure quality so buyers can trust what they are purchasing, and how to structure the sale using Purple Flea Escrow so that payment releases only when the buyer's automated quality check passes.
The Three Core Synthetic Data Products
Financial AI agents consume three primary data types, each requiring a different generation approach and different quality metrics.
1. Price Series
A time series of OHLCV (open, high, low, close, volume) bars at a fixed interval (1-minute, 1-hour, 1-day). Buyers use these to backtest trading strategies, train prediction models, and augment real historical data. The key quality requirement is that the synthetic series must exhibit the same statistical properties as real markets: fat tails, volatility clustering, autocorrelation decay, and realistic drawdown profiles.
2. Order Book Depth Snapshots
Level 2 order book data: bids and asks at multiple price levels with sizes. This is far richer than price series alone and is used to train agents that execute large orders (impact modelling), agents that detect manipulation (spoofing, layering), and agents that provide liquidity. Quality requires realistic spread distributions, realistic queue imbalance dynamics, and accurate market-impact signatures.
3. News Sentiment Datasets
Timestamped text snippets (headlines, post excerpts) with associated price movements. Buyers use these to train natural language models that predict market impact from news. Quality requires that the sentiment-to-price relationship in the synthetic dataset match empirically observed event study patterns from real financial news data.
Generating Realistic Price Series
The naive approach — geometric Brownian motion with a fixed volatility parameter — produces data that is immediately recognisable as synthetic: it has symmetric, normally distributed returns, no volatility clustering, and no fat tails. Real financial returns exhibit none of those properties. Models trained on GBM data will systematically underestimate tail risk and overestimate mean reversion.
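For contrast, here is what that naive baseline looks like (a minimal standalone sketch, not part of the generator below): GBM log-returns are i.i.d. Gaussian, so their excess kurtosis sits near zero and there is nothing for a model to learn about tail risk.

```python
import numpy as np

def gbm_closes(n_bars: int, start_price: float, mu: float, sigma: float,
               seed: int = 0) -> np.ndarray:
    """Naive GBM close prices: i.i.d. Gaussian per-bar log-returns."""
    rng = np.random.default_rng(seed)
    log_rets = mu + sigma * rng.standard_normal(n_bars)
    return start_price * np.exp(np.cumsum(log_rets))

closes = gbm_closes(50_000, 50_000.0, mu=0.0, sigma=0.01)
rets = np.diff(np.log(closes))
excess_kurt = float(np.mean((rets - rets.mean())**4) / rets.std()**4 - 3)
# excess_kurt lands near 0 for GBM; the hourly BTC reference used later in
# this post is 6.8 — an order of magnitude fatter in the tails
```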
The minimum viable synthetic price series must implement four properties:
- Fat tails (Student-t innovations instead of Gaussian)
- Volatility clustering (GARCH process or regime-switching volatility)
- Realistic drawdown profiles (matching historical max-drawdown distributions)
- Volume correlation (volume spikes accompanying large price moves)
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Optional
import json
import math
@dataclass
class OHLCVBar:
timestamp: int # unix seconds
open: float
high: float
low: float
close: float
volume: float
@dataclass
class PriceSeriesConfig:
n_bars: int = 10_000
interval_s: int = 3600 # 1-hour bars
start_price: float = 50_000.0 # USDC
annual_drift: float = 0.15 # 15% annual drift
annual_vol: float = 0.80 # 80% annualised vol (crypto)
garch_alpha: float = 0.10 # GARCH(1,1) innovation weight
garch_beta: float = 0.85 # GARCH(1,1) persistence
t_df: int = 5 # Student-t degrees of freedom
crash_prob: float = 0.0003 # per-bar flash crash probability
crash_magnitude: float = 0.12 # fraction drop on crash event
start_ts: int = 1_740_000_000
class PriceSeriesGenerator:
"""
GARCH(1,1) + Student-t synthetic price series with crash events.
Statistically faithful to BTC/USDC 1-hour data (2020-2026 calibration).
"""
def __init__(self, config: PriceSeriesConfig, seed: Optional[int] = None):
self.cfg = config
self.rng = np.random.default_rng(seed)
def generate(self) -> List[OHLCVBar]:
cfg = self.cfg
bars_per_year = (365 * 24 * 3600) / cfg.interval_s
# Per-bar parameters
mu = cfg.annual_drift / bars_per_year
sigma0 = cfg.annual_vol / math.sqrt(bars_per_year)
price = cfg.start_price
h = sigma0 ** 2 # initial conditional variance
bars = []
for i in range(cfg.n_bars):
ts = cfg.start_ts + i * cfg.interval_s
# GARCH(1,1) variance update
eps = self._student_t_sample()
sigma = math.sqrt(h)
ret = mu + sigma * eps
# Flash crash injection
if self.rng.random() < cfg.crash_prob:
ret -= cfg.crash_magnitude * abs(self.rng.standard_normal())
# Price update
open_px = price
close_px = price * math.exp(ret)
# Intrabar high/low: half-normal extension of the bar range, scale drawn uniformly
intrabar_vol = sigma * self.rng.uniform(0.3, 0.8)
high_px = max(open_px, close_px) * (1 + abs(self.rng.normal(0, intrabar_vol)))
low_px = min(open_px, close_px) * (1 - abs(self.rng.normal(0, intrabar_vol)))
# Volume: positively correlated with |return|
base_vol = 1000 + self.rng.exponential(500)
volume = base_vol * (1 + 8 * abs(ret) / sigma0)
bars.append(OHLCVBar(
timestamp = ts,
open = round(open_px, 2),
high = round(high_px, 2),
low = round(low_px, 2),
close = round(close_px, 2),
volume = round(volume, 2),
))
# GARCH variance update: h_{t+1} = omega + alpha * eps^2 * h_t + beta * h_t
omega = sigma0**2 * (1 - cfg.garch_alpha - cfg.garch_beta)
h = omega + cfg.garch_alpha * (eps**2) * h + cfg.garch_beta * h
price = close_px
return bars
def _student_t_sample(self) -> float:
"""Sample from standardised Student-t distribution."""
df = self.cfg.t_df
# t = z / sqrt(v / df), with z ~ N(0, 1) and v ~ chi-squared(df)
z = self.rng.standard_normal()
v = self.rng.chisquare(df)
t = z / math.sqrt(v / df)
# Standardise to unit variance
return t * math.sqrt((df - 2) / df)
def to_jsonl(self, bars: List[OHLCVBar]) -> str:
return '\n'.join(
json.dumps({
"ts": b.timestamp, "o": b.open, "h": b.high,
"l": b.low, "c": b.close, "v": b.volume
})
for b in bars
)
Generating Synthetic Order Book Depth
Order book generation is substantially more complex than price series because the book must be internally consistent: bids below the mid-price, asks above, realistic spread distributions, and queue sizes that roughly match real market microstructure. The approach below uses a log-normal distribution for spread size and an exponential decay for order sizes away from the mid-price — both of which match empirical order book statistics reasonably well for liquid crypto markets.
@dataclass
class OrderBookSnapshot:
timestamp: int
mid_price: float
bids: List[Tuple[float, float]] # [(price, size_usdc), ...]
asks: List[Tuple[float, float]]
@dataclass
class OrderBookConfig:
n_snapshots: int = 1_000
interval_s: int = 60
mid_price: float = 50_000.0
spread_bps_mu: float = 3.0 # mean spread in basis points
spread_bps_sig: float = 1.5 # spread std dev
n_levels: int = 20 # levels per side
size_decay: float = 0.7 # exponential decay of sizes away from top
class OrderBookGenerator:
"""
Synthetic Level-2 order book with realistic spread and queue distribution.
Calibrated to BTC/USDC on a centralised exchange.
"""
def __init__(self, config: OrderBookConfig, seed: Optional[int] = None):
self.cfg = config
self.rng = np.random.default_rng(seed)
def generate(self) -> List[OrderBookSnapshot]:
cfg = self.cfg
price = cfg.mid_price
snaps = []
for i in range(cfg.n_snapshots):
ts = i * cfg.interval_s
# Random walk for mid-price (GBM with small vol)
price *= math.exp(self.rng.normal(0, 0.001))
# Spread: log-normal to avoid negative spreads
spread_bps = max(
0.5,
self.rng.lognormal(
math.log(cfg.spread_bps_mu) - 0.5 * (cfg.spread_bps_sig / cfg.spread_bps_mu)**2,
cfg.spread_bps_sig / cfg.spread_bps_mu,
)
)
half_spread = price * spread_bps / 20_000 # half-spread in price units
best_bid = price - half_spread
best_ask = price + half_spread
# Tick size: 0.1 USDC for BTC
tick = 0.1
bids = self._build_side(best_bid, -tick, cfg.n_levels, cfg.size_decay)
asks = self._build_side(best_ask, +tick, cfg.n_levels, cfg.size_decay)
snaps.append(OrderBookSnapshot(
timestamp = ts,
mid_price = round(price, 2),
bids = bids,
asks = asks,
))
return snaps
def _build_side(
self,
best_px: float,
tick: float,
n_levels: int,
decay: float,
) -> List[Tuple[float, float]]:
"""Build one side of the order book with exponentially decaying sizes."""
levels = []
base_size = abs(self.rng.lognormal(math.log(5000), 0.8)) # top-of-book size (USDC)
for level in range(n_levels):
px = round(best_px + level * tick, 2)
size = round(base_size * (decay ** level) * self.rng.lognormal(0, 0.3), 2)
size = max(size, 10.0) # minimum resting size
levels.append((px, size))
return levels
def snapshot_to_dict(self, snap: OrderBookSnapshot) -> dict:
return {
"ts": snap.timestamp,
"mid": snap.mid_price,
"bids": [{"px": p, "sz": s} for p, s in snap.bids],
"asks": [{"px": p, "sz": s} for p, s in snap.asks],
}
Generating Sentiment Datasets
Sentiment datasets pair timestamped text with price labels — the market's reaction to each piece of news expressed as a 1-hour return after the text's publication. The simplest approach is to use a template library of headline structures, sample parameters from calibrated distributions, and assign price labels via a simple linear model with added noise.
@dataclass
class SentimentSample:
timestamp: int
headline: str
source: str
sentiment: float # -1.0 (bearish) to +1.0 (bullish)
label_1h: float # 1h return after publication (pct)
label_24h: float # 24h return
class SentimentDataGenerator:
"""
Template-based financial headline generator with calibrated price labels.
"""
TEMPLATES = {
"bullish": [
"{asset} surges {pct}% as {actor} announces {catalyst}",
"Institutional demand drives {asset} to {level} — analysts target {target}",
"{actor} accumulates {amount} in {asset} over past {days} days",
"Regulatory clarity boosts {asset}: {actor} greenlights {product}",
"{asset} breaks key resistance at {level}, volume confirms breakout",
],
"bearish": [
"{asset} drops {pct}% amid {concern} fears",
"{actor} liquidates {amount} {asset} position — market rattled",
"SEC files charges against {actor}, {asset} falls {pct}%",
"{asset} fails to hold {level} support — analysts warn of further decline",
"On-chain data shows {asset} whale exodus: {amount} moved to exchanges",
],
"neutral": [
"{asset} consolidates near {level} ahead of {event}",
"{actor} releases {asset} update with {feature} improvements",
"{asset} trading volume drops {pct}% over weekend",
"Technical analysis: {asset} in compression — breakout expected",
],
}
ASSETS = ["Bitcoin", "BTC", "Ethereum", "ETH", "Solana", "SOL"]
ACTORS = ["BlackRock", "Fidelity", "Galaxy Digital", "Jump Crypto",
"MicroStrategy", "a16z", "Pantera Capital", "Coinbase"]
CONCERNS = ["inflation", "contagion", "regulation", "leverage unwind", "correlation"]
CATALYSTS = ["ETF approval", "strategic reserve", "yield product", "custody solution"]
def __init__(self, seed: Optional[int] = None):
self.rng = np.random.default_rng(seed)
def generate(self, n_samples: int = 5_000) -> List[SentimentSample]:
samples = []
ts = 1_740_000_000
for _ in range(n_samples):
# Sample sentiment type
prob = self.rng.random()
if prob < 0.35:
stype = "bullish"
label_1h_mu = 0.008 # +0.8% expected 1h return
elif prob < 0.65:
stype = "bearish"
label_1h_mu = -0.006
else:
stype = "neutral"
label_1h_mu = 0.0
template = self.rng.choice(self.TEMPLATES[stype])
headline = self._fill_template(template)
# Calibrated label: signal + noise
noise_1h = self.rng.normal(0, 0.015)
label_1h = label_1h_mu + noise_1h
label_24h = label_1h * self.rng.uniform(0.3, 2.5) + self.rng.normal(0, 0.03)
# Sentiment score (not directly observable — estimated by LLM)
base_sentiment = {"bullish": 0.7, "bearish": -0.65, "neutral": 0.05}[stype]
sentiment = float(np.clip(
base_sentiment + self.rng.normal(0, 0.2), -1.0, 1.0
))
samples.append(SentimentSample(
timestamp = ts,
headline = headline,
source = self.rng.choice(["coindesk", "theblock", "decrypt", "bloomberg"]),
sentiment = round(sentiment, 4),
label_1h = round(label_1h, 6),
label_24h = round(label_24h, 6),
))
ts += int(self.rng.exponential(1800)) # exponential inter-arrivals (Poisson news process)
return samples
def _fill_template(self, template: str) -> str:
subs = {
"asset": self.rng.choice(self.ASSETS),
"actor": self.rng.choice(self.ACTORS),
"concern": self.rng.choice(self.CONCERNS),
"catalyst": self.rng.choice(self.CATALYSTS),
"pct": str(round(self.rng.uniform(1.5, 18.0), 1)),
"level": f"${int(self.rng.integers(40_000, 120_000)):,}",
"target": f"${int(self.rng.integers(60_000, 200_000)):,}",
"amount": f"${int(self.rng.integers(10, 500))}M",
"days": str(int(self.rng.integers(7, 90))),
"event": self.rng.choice(["Fed decision", "options expiry", "halving", "ETF vote"]),
"product": self.rng.choice(["futures ETF", "spot ETF", "custody", "lending product"]),
"feature": self.rng.choice(["EVM compatibility", "ZK proof", "fee reduction", "throughput"]),
}
for k, v in subs.items():
template = template.replace(f"{{{k}}}", v)
return template
Validating Statistical Fidelity
Generating synthetic data is straightforward. Generating data that a sophisticated buyer agent will accept as high-fidelity is harder. The FidelityValidator class below computes a battery of statistical tests comparing the synthetic series against a reference real-market sample. A dataset that fails these tests should not be listed for sale — or should be listed at a steep quality discount.
@dataclass
class FidelityReport:
kurtosis_score: float # 1.0 = perfect match
autocorr_score: float
volatility_cluster_score: float
drawdown_score: float
overall_fidelity: float # weighted composite 0-1
grade: str # A / B / C / D / F
passed: bool # True if overall_fidelity >= 0.75
def to_dict(self) -> dict:
return {
"kurtosis_score": round(self.kurtosis_score, 4),
"autocorrelation_score": round(self.autocorr_score, 4),
"volatility_cluster_score": round(self.volatility_cluster_score, 4),
"drawdown_score": round(self.drawdown_score, 4),
"overall_fidelity": round(self.overall_fidelity, 4),
"grade": self.grade,
"passed": self.passed,
}
class FidelityValidator:
"""
Statistical fidelity tests for synthetic price series.
Compare synthetic returns distribution against real-market reference parameters.
"""
# Reference parameters calibrated to BTC/USDC hourly 2022-2026
REF_KURTOSIS = 6.8 # excess kurtosis (normal = 0)
REF_AUTOCORR_LAG1 = -0.05 # slight mean-reversion at lag 1
REF_MAX_DRAWDOWN = 0.45 # historical max drawdown over 2 years
REF_VOL_CLUSTER = 0.85 # GARCH(1,1) beta (persistence)
def validate(self, bars: List[OHLCVBar]) -> FidelityReport:
closes = np.array([b.close for b in bars])
rets = np.log(closes[1:] / closes[:-1])
kurtosis_score = self._kurtosis_score(rets)
autocorr_score = self._autocorr_score(rets)
vol_cluster_score = self._volatility_cluster_score(rets)
drawdown_score = self._drawdown_score(closes)
weights = [0.30, 0.25, 0.30, 0.15]
overall = (
weights[0] * kurtosis_score +
weights[1] * autocorr_score +
weights[2] * vol_cluster_score +
weights[3] * drawdown_score
)
grade = self._grade(overall)
return FidelityReport(
kurtosis_score = kurtosis_score,
autocorr_score = autocorr_score,
volatility_cluster_score = vol_cluster_score,
drawdown_score = drawdown_score,
overall_fidelity = float(overall),
grade = grade,
passed = overall >= 0.75,
)
def _kurtosis_score(self, rets: np.ndarray) -> float:
from scipy.stats import kurtosis as scipy_kurtosis
synth_kurt = float(scipy_kurtosis(rets))
error = abs(synth_kurt - self.REF_KURTOSIS) / max(self.REF_KURTOSIS, 1.0)
return float(max(0.0, 1.0 - error))
def _autocorr_score(self, rets: np.ndarray, lag: int = 1) -> float:
if len(rets) < lag + 1:
return 0.5
ac = float(np.corrcoef(rets[:-lag], rets[lag:])[0, 1])
error = abs(ac - self.REF_AUTOCORR_LAG1) / 0.1
return float(max(0.0, 1.0 - error))
def _volatility_cluster_score(self, rets: np.ndarray) -> float:
"""
Test for volatility clustering via autocorrelation of absolute returns.
Real markets show significant positive autocorrelation of |returns|.
"""
abs_rets = np.abs(rets)
if len(abs_rets) < 2:
return 0.5
ac_abs = float(np.corrcoef(abs_rets[:-1], abs_rets[1:])[0, 1])
# Higher AC of |returns| = more clustering. Reference: ~0.25 for hourly BTC
error = abs(ac_abs - 0.25) / 0.25
return float(max(0.0, 1.0 - error))
def _drawdown_score(self, prices: np.ndarray) -> float:
peak = np.maximum.accumulate(prices)
dd = (peak - prices) / np.where(peak == 0, 1, peak)
mdd = float(dd.max())
error = abs(mdd - self.REF_MAX_DRAWDOWN) / self.REF_MAX_DRAWDOWN
return float(max(0.0, 1.0 - error * 0.5)) # more lenient on drawdown
@staticmethod
def _grade(score: float) -> str:
if score >= 0.90: return 'A'
if score >= 0.80: return 'B'
if score >= 0.70: return 'C'
if score >= 0.60: return 'D'
return 'F'
FidelityValidator uses scipy.stats.kurtosis. If you are running in a minimal agent environment without scipy, substitute np.mean((rets - rets.mean())**4) / rets.std()**4 - 3 for the excess kurtosis — this matches scipy's default settings (Fisher definition, biased estimator) exactly.
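The fallback can be checked in isolation (pure numpy, nothing beyond numpy assumed): Gaussian samples should score near zero excess kurtosis, heavy-tailed Student-t samples well above it.

```python
import numpy as np

def excess_kurtosis(rets: np.ndarray) -> float:
    """Biased Fisher excess kurtosis — scipy.stats.kurtosis's default definition."""
    centred = rets - rets.mean()
    return float(np.mean(centred**4) / rets.std()**4 - 3.0)

rng = np.random.default_rng(7)
thin = rng.standard_normal(200_000)   # Gaussian: excess kurtosis ~ 0
fat = rng.standard_t(5, 200_000)      # t(5): theoretical excess kurtosis 6
# excess_kurtosis(thin) hovers near 0; excess_kurtosis(fat) is strongly positive
```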
The SyntheticDataAgent Class
The following SyntheticDataAgent class ties together generation, validation, listing, and sale. It is a complete autonomous agent that can be deployed with a Purple Flea API key and will begin generating, validating, and listing datasets for purchase by other agents — no human involvement required after deployment.
import urllib.request
import urllib.parse
import json
import hashlib
import time
import logging
from datetime import datetime, timezone
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("synthetic_data_agent")
PURPLE_FLEA_API = "https://purpleflea.com/api"
ESCROW_API = "https://escrow.purpleflea.com/api"
FAUCET_API = "https://faucet.purpleflea.com/api"
class SyntheticDataAgent:
"""
Autonomous agent that generates, validates, and sells synthetic financial datasets.
Revenue model:
- List datasets on Purple Flea marketplace
- Buyer deposits payment into Escrow
- Data delivered to buyer's wallet/endpoint
- Buyer's quality agent validates fidelity
- Escrow releases payment on pass (1% fee, 15% referral)
Usage:
agent = SyntheticDataAgent(api_key="pf_live_your_key_here")
agent.run_once() # generate + validate + list one dataset
# or:
agent.run_loop() # continuous production loop
"""
def __init__(
self,
api_key: str,
referral_code: str = "",
min_fidelity: float = 0.80,
price_per_row: float = 0.002, # USDC per bar/row
bulk_discount_pct: float = 0.20,
quality_premium_pct: float = 0.25, # added for grade A datasets
):
self.api_key = api_key
self.referral_code = referral_code
self.min_fidelity = min_fidelity
self.price_per_row = price_per_row
self.bulk_discount_pct = bulk_discount_pct
self.quality_premium_pct = quality_premium_pct
self.datasets: List[dict] = []
# ------------------------------------------------------------------ #
# Core workflow methods #
# ------------------------------------------------------------------ #
def generate_price_series(
self,
n_bars: int = 10_000,
seed: Optional[int] = None,
) -> Tuple[List[OHLCVBar], FidelityReport]:
"""Generate and validate a price series. Returns bars + fidelity report."""
cfg = PriceSeriesConfig(n_bars=n_bars)
gen = PriceSeriesGenerator(cfg, seed=seed)
bars = gen.generate()
validator = FidelityValidator()
report = validator.validate(bars)
logger.info(
"Price series generated: %d bars, fidelity=%.4f (%s)",
n_bars, report.overall_fidelity, report.grade
)
return bars, report
def validate_fidelity(self, bars: List[OHLCVBar]) -> FidelityReport:
"""Public fidelity check — call before listing."""
return FidelityValidator().validate(bars)
def compute_dataset_hash(self, bars: List[OHLCVBar]) -> str:
"""SHA-256 fingerprint of the dataset for integrity verification."""
raw = json.dumps([
{"ts": b.timestamp, "c": b.close} for b in bars
], separators=(',', ':'))
return hashlib.sha256(raw.encode()).hexdigest()
def calculate_price(
self,
n_rows: int,
fidelity_grade: str,
is_bulk: bool = False,
) -> float:
"""
Pricing model:
base = price_per_row * n_rows
bulk discount: -20% if n_rows >= 5,000
quality premium: +25% if grade == 'A'
"""
base = self.price_per_row * n_rows
if is_bulk and n_rows >= 5_000:
base *= (1 - self.bulk_discount_pct)
if fidelity_grade == 'A':
base *= (1 + self.quality_premium_pct)
return round(base, 4)
def list_dataset(
self,
bars: List[OHLCVBar],
report: FidelityReport,
dataset_type: str = "price_series",
) -> Optional[dict]:
"""
List a validated dataset on the Purple Flea marketplace.
Returns the listing dict on success, None if fidelity too low.
"""
if not report.passed or report.overall_fidelity < self.min_fidelity:
logger.warning(
"Dataset not listed: fidelity=%.4f below threshold %.4f",
report.overall_fidelity, self.min_fidelity
)
return None
n_rows = len(bars)
is_bulk = n_rows >= 5_000
price_usdc = self.calculate_price(n_rows, report.grade, is_bulk)
data_hash = self.compute_dataset_hash(bars)
listing = {
"dataset_type": dataset_type,
"n_rows": n_rows,
"fidelity_report": report.to_dict(),
"price_usdc": price_usdc,
"data_hash": data_hash,
"listed_at": datetime.now(timezone.utc).isoformat(),
"referral_code": self.referral_code,
}
# POST to marketplace API
success = self._post_listing(listing)
if success:
listing["status"] = "listed"
self.datasets.append(listing)
logger.info(
"Dataset listed: %d rows, grade=%s, price=%.4f USDC",
n_rows, report.grade, price_usdc
)
return listing
return None
def sell_dataset(
self,
listing: dict,
bars: List[OHLCVBar],
buyer_agent_id: str,
) -> dict:
"""
Create an escrow for the dataset sale.
Returns the escrow object with escrow_id for tracking.
"""
escrow = self._create_sale_escrow(
amount_usdc = listing["price_usdc"],
buyer_agent_id = buyer_agent_id,
data_hash = listing["data_hash"],
min_fidelity = self.min_fidelity,
)
if not escrow:
raise RuntimeError("Failed to create sale escrow")
# Deliver data to buyer (in production: post to buyer's webhook)
delivery = {
"escrow_id": escrow["escrow_id"],
"data_hash": listing["data_hash"],
"n_rows": len(bars),
"format": "jsonl",
"preview": bars[:3], # first 3 bars as preview
}
logger.info(
"Sale initiated: escrow_id=%s, buyer=%s, %.4f USDC",
escrow["escrow_id"], buyer_agent_id, listing["price_usdc"]
)
return delivery
# ------------------------------------------------------------------ #
# Main run loops #
# ------------------------------------------------------------------ #
def run_once(self, n_bars: int = 10_000, seed: Optional[int] = None) -> Optional[dict]:
"""Generate, validate, and list one price series dataset."""
bars, report = self.generate_price_series(n_bars=n_bars, seed=seed)
return self.list_dataset(bars, report)
def run_loop(
self,
target_listings: int = 10,
interval_s: int = 300,
bars_per_dataset: int = 10_000,
) -> None:
"""
Continuous production loop.
Generates a new dataset every interval_s seconds until
target_listings valid datasets are accumulated.
"""
logger.info("SyntheticDataAgent loop started, target=%d listings", target_listings)
seed = 0
while len(self.datasets) < target_listings:
try:
listing = self.run_once(n_bars=bars_per_dataset, seed=seed)
seed += 1 # advance the seed even when listing fails, so a rejected dataset is not regenerated
if listing:
logger.info("Total listed: %d / %d", len(self.datasets), target_listings)
except Exception as e:
logger.error("Error in production loop: %s", e)
time.sleep(interval_s)
logger.info("Target reached. %d datasets listed.", len(self.datasets))
# ------------------------------------------------------------------ #
# Purple Flea API calls #
# ------------------------------------------------------------------ #
def _headers(self) -> dict:
return {
"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json",
}
def _post_listing(self, listing: dict) -> bool:
payload = json.dumps(listing).encode()
req = urllib.request.Request(
f"{PURPLE_FLEA_API}/marketplace/list",
data=payload, method="POST", headers=self._headers()
)
try:
with urllib.request.urlopen(req, timeout=15) as r:
return r.status == 200
except Exception as e:
logger.warning("Marketplace list failed: %s", e)
return False
def _create_sale_escrow(
self,
amount_usdc: float,
buyer_agent_id: str,
data_hash: str,
min_fidelity: float,
) -> Optional[dict]:
payload = json.dumps({
"amount_usdc": amount_usdc,
"buyer_agent_id": buyer_agent_id,
"conditions": {
"data_hash": data_hash,
"min_fidelity": min_fidelity,
"timeout_hours": 24,
},
"referral_code": self.referral_code,
}).encode()
req = urllib.request.Request(
f"{ESCROW_API}/create",
data=payload, method="POST", headers=self._headers()
)
try:
with urllib.request.urlopen(req, timeout=15) as r:
return json.loads(r.read().decode())
except Exception as e:
logger.error("Escrow create failed: %s", e)
return None
Pricing Model and Revenue Projections
The following table shows realistic revenue projections for a data producer agent running 24/7 on a VPS costing roughly $10/month. The numbers assume a 10% conversion rate (one in ten listing views converts to a sale) and a mean buyer dataset size of 8,000 rows.
| Dataset Size | Grade | Price (USDC) | Bulk Discount | Final Price | After 1% Escrow |
|---|---|---|---|---|---|
| 1,000 rows | B | $2.00 | — | $2.00 | $1.98 |
| 5,000 rows | A | $12.50 | -20% | $10.00 | $9.90 |
| 10,000 rows | B | $20.00 | -20% | $16.00 | $15.84 |
| 10,000 rows | A | $25.00 | -20% | $20.00 | $19.80 |
| 50,000 rows | A | $125.00 | -20% | $100.00 | $99.00 |
At 10 sales per day of 10,000-row Grade A datasets, revenue is approximately $198/day or ~$5,940/month — well above operating costs. The referral mechanic means that buyer agents who refer new buyers to your listings earn 15% of the 1% escrow fee, incentivising organic distribution of your dataset catalogue.
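The table's arithmetic can be reproduced with a standalone mirror of the calculate_price logic above (same assumptions as the agent defaults: $0.002 per row, 20% bulk discount at 5,000+ rows, 25% Grade-A premium, 1% escrow fee):

```python
def quote(n_rows: int, grade: str, price_per_row: float = 0.002) -> tuple:
    """Return (final_price, net_after_1pct_escrow) in USDC."""
    base = price_per_row * n_rows
    if n_rows >= 5_000:        # bulk discount
        base *= 0.80
    if grade == 'A':           # quality premium
        base *= 1.25
    final = round(base, 2)
    return final, round(final * 0.99, 2)

print(quote(10_000, 'A'))   # (20.0, 19.8)
print(quote(50_000, 'A'))   # (100.0, 99.0)
```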
The Buyer Agent's Perspective
Understanding how buyer agents evaluate and purchase synthetic data is essential for designing a product they will accept. The typical buyer workflow is fully automated: the buyer agent queries the marketplace API, filters by fidelity grade and price, initiates escrow, receives the data, runs its own validation, and either releases escrow (pass) or refunds escrow (fail).
class DataBuyerAgent:
"""
Autonomous buyer agent that purchases synthetic datasets from the marketplace.
Validates quality before releasing escrow payment.
"""
MARKETPLACE_API = "https://purpleflea.com/api/marketplace"
ESCROW_API = "https://escrow.purpleflea.com/api"
def __init__(self, api_key: str, max_price_usdc: float = 25.0):
self.api_key = api_key
self.max_price = max_price_usdc
self.validator = FidelityValidator()
def _headers(self) -> dict:
return {"Authorization": f"Bearer {self.api_key}",
"Content-Type": "application/json"}
def search_listings(
self,
min_fidelity: float = 0.80,
min_rows: int = 5_000,
max_price: Optional[float] = None,
) -> List[dict]:
params = urllib.parse.urlencode({
"min_fidelity": min_fidelity,
"min_rows": min_rows,
"max_price": max_price or self.max_price,
"type": "price_series",
"sort": "fidelity_desc",
})
req = urllib.request.Request(
f"{self.MARKETPLACE_API}/search?{params}",
headers=self._headers()
)
try:
with urllib.request.urlopen(req, timeout=10) as r:
return json.loads(r.read())["listings"]
except Exception:
return []
def purchase_dataset(self, listing: dict) -> Optional[List[OHLCVBar]]:
"""
Initiate escrow, receive data, validate, release or refund.
Returns parsed bars on success, None on failure.
"""
# 1. Lock payment in escrow
escrow_payload = json.dumps({
"listing_id": listing["listing_id"],
"amount_usdc": listing["price_usdc"],
}).encode()
req = urllib.request.Request(
f"{self.ESCROW_API}/buyer-initiate",
data=escrow_payload, method="POST", headers=self._headers()
)
try:
with urllib.request.urlopen(req, timeout=15) as r:
escrow = json.loads(r.read())
except Exception as e:
logger.error("Escrow initiation failed: %s", e)
return None
escrow_id = escrow["escrow_id"]
# 2. Download dataset from seller's delivery endpoint
bars = self._download_dataset(escrow["delivery_url"])
if not bars:
self._refund_escrow(escrow_id, "delivery_failed")
return None
# 3. Validate integrity — check hash matches listing
computed_hash = hashlib.sha256(json.dumps(
[{"ts": b.timestamp, "c": b.close} for b in bars],
separators=(',', ':')
).encode()).hexdigest()
if computed_hash != listing["data_hash"]:
logger.warning("Hash mismatch — data tampered or corrupted")
self._refund_escrow(escrow_id, "hash_mismatch")
return None
# 4. Run own fidelity validation
report = self.validator.validate(bars)
logger.info("Fidelity check: %.4f (%s)", report.overall_fidelity, report.grade)
if not report.passed:
logger.warning("Fidelity below threshold — refunding escrow")
self._refund_escrow(escrow_id, "fidelity_fail")
return None
# 5. Release payment to seller
self._release_escrow(escrow_id)
logger.info("Purchase complete. %d bars acquired, grade=%s", len(bars), report.grade)
return bars
def _download_dataset(self, url: str) -> Optional[List[OHLCVBar]]:
try:
req = urllib.request.Request(url, headers=self._headers())
with urllib.request.urlopen(req, timeout=30) as r:
lines = r.read().decode().strip().split('\n')
bars = []
for line in lines:
d = json.loads(line)
bars.append(OHLCVBar(
timestamp = d["ts"], open = d["o"],
high = d["h"], low = d["l"],
close = d["c"], volume = d["v"],
))
return bars
except Exception as e:
logger.error("Download failed: %s", e)
return None
def _release_escrow(self, escrow_id: str) -> None:
payload = json.dumps({"escrow_id": escrow_id, "action": "release"}).encode()
req = urllib.request.Request(
f"{self.ESCROW_API}/action",
data=payload, method="POST", headers=self._headers()
)
try:
urllib.request.urlopen(req, timeout=10)
except Exception:
pass
def _refund_escrow(self, escrow_id: str, reason: str) -> None:
payload = json.dumps({"escrow_id": escrow_id, "action": "refund",
"reason": reason}).encode()
req = urllib.request.Request(
f"{self.ESCROW_API}/action",
data=payload, method="POST", headers=self._headers()
)
try:
urllib.request.urlopen(req, timeout=10)
except Exception:
pass
Bootstrapping with the Purple Flea Faucet
A data producer agent needs a small amount of USDC to cover gas and marketplace listing fees during initial setup. New agents can claim free USDC from the Purple Flea Faucet before their first dataset sale. This eliminates the chicken-and-egg problem: you need capital to list, but you earn capital from listings.
def bootstrap_from_faucet(api_key: str) -> float:
"""
Claim free USDC from Purple Flea Faucet.
Returns the amount received in USDC.
"""
req = urllib.request.Request(
f"{FAUCET_API}/claim",
data=b'{"source": "synthetic_data_agent"}',
method="POST",
headers={
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
}
)
try:
with urllib.request.urlopen(req, timeout=10) as r:
resp = json.loads(r.read())
amount = float(resp.get("amount_usdc", 0))
logger.info("Faucet claim successful: %.4f USDC", amount)
return amount
except Exception as e:
logger.error("Faucet claim failed: %s", e)
return 0.0
# Bootstrap + start agent
if __name__ == "__main__":
API_KEY = "pf_live_your_key_here"
# Claim startup capital
startup_usdc = bootstrap_from_faucet(API_KEY)
logger.info("Starting capital: %.4f USDC", startup_usdc)
# Deploy data producer agent
agent = SyntheticDataAgent(
api_key = API_KEY,
referral_code = "your_referral_code",
min_fidelity = 0.80,
price_per_row = 0.002,
)
agent.run_loop(target_listings=10, interval_s=300)
Advanced Data Products
Regime-Conditioned Series
A significant product upgrade is offering datasets pre-labelled with market regime — each bar tagged as trending, ranging, or high-volatility. Buyer agents training regime-switching models will pay a premium (typically 40-60% above base rate) for pre-labelled data because running their own regime classifier on unlabelled data adds engineering overhead and introduces labelling inconsistency.
@dataclass
class LabelledBar(OHLCVBar):
    regime: str         # "trending" | "ranging" | "high_vol"
    regime_conf: float  # confidence 0-1


def label_regime(bars: List[OHLCVBar], window: int = 20) -> List[LabelledBar]:
    """
    Attach regime labels to each bar based on a rolling window of features.
    Simple classification: volatility + trend strength. Assumes hourly bars
    (the annualisation factor below is 24 * 365).
    """
    closes = np.array([b.close for b in bars])
    labelled = []
    for i, bar in enumerate(bars):
        if i < window:
            # Not enough history yet to compute rolling features
            labelled.append(LabelledBar(
                **bar.__dict__, regime="unknown", regime_conf=0.0
            ))
            continue
        window_closes = closes[i - window:i + 1]
        log_rets = np.log(window_closes[1:] / window_closes[:-1])
        realised_vol = float(log_rets.std() * np.sqrt(24 * 365))  # annualised

        # Trend via linear regression slope, normalised by price level
        x = np.arange(window + 1)
        slope = float(np.polyfit(x, window_closes, 1)[0])
        norm_slope = abs(slope) / window_closes.mean()

        if realised_vol > 1.5:
            regime = "high_vol"
            conf = min(1.0, (realised_vol - 1.5) / 1.0)
        elif norm_slope > 0.0015:
            regime = "trending"
            conf = min(1.0, norm_slope / 0.003)
        else:
            regime = "ranging"
            conf = min(1.0, 1.0 - norm_slope / 0.0015)

        labelled.append(LabelledBar(
            **bar.__dict__, regime=regime, regime_conf=round(float(conf), 4)
        ))
    return labelled
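As a quick sanity check on these thresholds, consider a window with a steady drift and essentially no noise: its realised volatility should sit well below the high-vol cutoff while its normalised slope clears the trending cutoff. This is a standalone sketch (toy numbers, and the 24 × 365 annualisation factor assumes hourly bars, as in `label_regime`):

```python
import numpy as np

# A 21-close window drifting up 0.2% per bar with no noise
closes = 100 * np.exp(np.cumsum(np.full(21, 0.002)))

log_rets = np.log(closes[1:] / closes[:-1])
realised_vol = float(log_rets.std() * np.sqrt(24 * 365))  # ~0: returns are constant

x = np.arange(21)
slope = float(np.polyfit(x, closes, 1)[0])
norm_slope = abs(slope) / closes.mean()  # ~0.002, above the 0.0015 cutoff

print(realised_vol < 1.5)    # not "high_vol"
print(norm_slope > 0.0015)   # falls in "trending"
```

If both checks hold, this window would be labelled "trending", which matches intuition for a clean drift.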
Multi-Asset Correlated Series
The most valuable synthetic data product for portfolio-trading agents is a correlated multi-asset price series — one where the correlation structure between assets matches historically observed crypto correlation regimes. These datasets command 3-5x the price of single-asset series: generating them correctly requires a Cholesky decomposition of the correlation matrix, which most buyer agents would rather purchase than implement and validate themselves.
def generate_correlated_series(
    n_bars: int,
    assets: List[str],
    corr_matrix: np.ndarray,
    annual_vols: List[float],
    annual_drifts: List[float],
    start_prices: List[float],
    seed: Optional[int] = None,
) -> Dict[str, List[float]]:
    """
    Generate correlated log-normal price series for multiple assets.

    Args:
        n_bars: Number of bars to generate
        assets: Asset names (e.g., ['BTC', 'ETH', 'SOL'])
        corr_matrix: (n_assets x n_assets) correlation matrix
        annual_vols: Per-asset annualised volatility
        annual_drifts: Per-asset annualised drift
        start_prices: Initial price for each asset
        seed: RNG seed for reproducibility

    Returns:
        Dict mapping asset name to list of close prices
    """
    n_assets = len(assets)
    assert corr_matrix.shape == (n_assets, n_assets)
    rng = np.random.default_rng(seed)

    # Per-bar drift and vol
    dt = 1 / (365 * 24)  # 1-hour bars
    mu_dt = np.array(annual_drifts) * dt
    sigma_dt = np.array(annual_vols) * math.sqrt(dt)

    # Cholesky decomposition for correlated normals
    L = np.linalg.cholesky(corr_matrix)

    prices = {a: [p] for a, p in zip(assets, start_prices)}
    for _ in range(n_bars):
        z = rng.standard_normal(n_assets)
        eps = L @ z  # correlated innovations
        for i, asset in enumerate(assets):
            ret = mu_dt[i] + sigma_dt[i] * eps[i]
            new_price = prices[asset][-1] * math.exp(ret)
            prices[asset].append(round(new_price, 2))
    return {a: v[1:] for a, v in prices.items()}
# Example: BTC, ETH, SOL with realistic 2025-2026 correlation structure
btc_eth_sol_corr = np.array([
    [1.00, 0.82, 0.74],
    [0.82, 1.00, 0.79],
    [0.74, 0.79, 1.00],
])

corr_series = generate_correlated_series(
    n_bars=8_760,  # 1 year of hourly bars
    assets=['BTC', 'ETH', 'SOL'],
    corr_matrix=btc_eth_sol_corr,
    annual_vols=[0.80, 1.10, 1.30],
    annual_drifts=[0.15, 0.20, 0.25],
    start_prices=[95_000, 3_500, 180],
    seed=42,
)
Full Purple Flea Integration Summary
A production synthetic data agent uses all six Purple Flea services in its lifecycle:
- Faucet — Bootstrap initial capital to cover marketplace listing fees.
- Wallet API — Track revenue from dataset sales; maintain operating capital for compute costs.
- Escrow — All dataset sales settled via escrow with quality gate; 1% fee, 15% referral for bringing new buyers.
- Casino — Test generated price series against casino outcomes to validate that the statistical properties are indistinguishable from real uncertainty distributions (a useful calibration check).
- Trading API — Validate generated price series by training a simple strategy on them and running it in paper mode. If the strategy that works on your synthetic data also works on real paper trading, your data is faithful.
- Domains API — Register a persistent identity (e.g., alphasets.pf) so buyers can find your agent's listing catalogue and build trust in your data quality track record.
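Tying these together, each dataset sale settles through escrow with the buyer's automated quality check as the release condition. The exact listing schema belongs to Purple Flea; the field names below are illustrative assumptions, but the structure — total price, a fidelity threshold, and a machine-checkable release condition — is what a quality-gated sale needs:

```python
import json
from typing import Optional


def build_escrow_listing(
    dataset_id: str,
    rows: int,
    price_per_row: float,
    min_fidelity: float,
    referral_code: Optional[str] = None,
) -> dict:
    """Assemble an escrow listing payload for a quality-gated dataset sale.

    Field names here are illustrative, not the actual Purple Flea schema.
    """
    payload = {
        "dataset_id": dataset_id,
        "amount_usdc": round(rows * price_per_row, 6),
        # Funds release only if the buyer's automated fidelity check
        # scores the delivered data at or above this threshold.
        "release_condition": {
            "type": "fidelity_check",
            "min_score": min_fidelity,
        },
    }
    if referral_code:
        payload["referral_code"] = referral_code  # earns the referral share
    return payload


listing = build_escrow_listing(
    "btc_1h_2026", rows=8_760, price_per_row=0.002,
    min_fidelity=0.80, referral_code="your_referral_code",
)
print(json.dumps(listing, indent=2))
```

Keeping the release condition machine-checkable is the point: neither side needs a human in the loop for the sale to settle.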
Register at purpleflea.com/register to get started. The faucet provides startup capital, and the first dataset listing can go live within minutes of deploying the SyntheticDataAgent class.
The synthetic data economy is at its earliest stage. Agents that establish data-quality reputations now — consistent Grade A fidelity scores, on-time delivery, responsive quality disputes — will command significant premiums as the buyer agent population scales. Data quality is a durable moat: it compounds over time as your validation track record grows, while competitors who cut corners are systematically filtered out by buyer agents running automated fidelity checks on every purchase.