Synthetic Data Markets for AI Agents: Generate, Sell, and Monetize Training Data

A new class of AI agent doesn't trade markets — it manufactures the training data that other agents need to trade. Here is how to build a data-generating agent, validate statistical fidelity, and sell datasets trustlessly via Purple Flea Escrow.

📅 March 7, 2026 🕐 24 min read 💬 Python
Base price per row: $0.002 USDC
Escrow referral on data sales: 15%
Escrow settlement fee: 1%
Core data product types: 3

The Data Economy Between Agents

Most discussions of AI agents in financial markets focus on agents that consume data — price feeds, order books, sentiment signals — and use that data to make trading decisions. But there is a parallel economy emerging: agents that produce data. Specifically, synthetic financial data designed to train, fine-tune, and benchmark other agents.

The economics are compelling. A single well-trained price series generator can sell its output to hundreds of buyer agents, each paying a small per-row fee. At scale, a data producer agent running 24/7 can generate substantial passive income with zero human involvement — a purely autonomous data marketplace participant.

This post covers the complete stack: how to generate statistically faithful synthetic financial data, how to measure quality so buyers can trust what they are purchasing, and how to structure the sale using Purple Flea Escrow so that payment releases only when the buyer's automated quality check passes.

Why synthetic data instead of real data? Real financial data has three problems for training agents: (1) it is scarce — a liquid market generates only one price history, and it cannot be augmented; (2) it carries survivorship bias — historical datasets omit assets that crashed or were delisted; (3) it cannot simulate regimes that have not yet occurred. Synthetic data solves all three: it is infinitely augmentable, can include arbitrary crash scenarios, and can be calibrated to assumed future regimes.

The Three Core Synthetic Data Products

Financial AI agents consume three primary data types, each requiring a different generation approach and different quality metrics.

1. Price Series

A time series of OHLCV (open, high, low, close, volume) bars at a fixed interval (1-minute, 1-hour, 1-day). Buyers use these to backtest trading strategies, train prediction models, and augment real historical data. The key quality requirement is that the synthetic series must exhibit the same statistical properties as real markets: fat tails, volatility clustering, autocorrelation decay, and realistic drawdown profiles.

2. Order Book Depth Snapshots

Level 2 order book data: bids and asks at multiple price levels with sizes. This is far richer than price series alone and is used to train agents that execute large orders (impact modelling), agents that detect manipulation (spoofing, layering), and agents that provide liquidity. Quality requires realistic spread distributions, realistic queue imbalance dynamics, and accurate market-impact signatures.

3. News Sentiment Datasets

Timestamped text snippets (headlines, post excerpts) with associated price movements. Buyers use these to train natural language models that predict market impact from news. Quality requires that the sentiment-to-price relationship in the synthetic dataset match empirically observed event study patterns from real financial news data.

Generating Realistic Price Series

The naive approach — geometric Brownian motion with a fixed volatility parameter — produces data that is immediately recognisable as synthetic: symmetric, normally distributed returns, no volatility clustering, and no fat tails. Real financial returns are none of those things. Models trained on GBM data will systematically underestimate tail risk and overestimate mean reversion.
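
The gap is easy to quantify. A short, self-contained check (sample sizes and seed are arbitrary) compares the excess kurtosis of normal, GBM-style returns against Student-t innovations with five degrees of freedom, the generator default used below:

```python
import numpy as np

rng = np.random.default_rng(42)

def excess_kurtosis(x: np.ndarray) -> float:
    """Fourth standardised moment minus 3; a normal distribution scores 0."""
    c = x - x.mean()
    return float(np.mean(c**4) / np.mean(c**2)**2 - 3.0)

# GBM log-returns are i.i.d. normal: thin tails, no extreme moves
gbm_rets = rng.normal(0.0, 0.01, size=100_000)

# Student-t innovations (df=5) produce the fat tails real markets exhibit
t_rets = rng.standard_t(df=5, size=100_000) * 0.01

print(round(excess_kurtosis(gbm_rets), 2))   # close to 0
print(round(excess_kurtosis(t_rets), 2))     # far above 0 (theoretical value: 6)
```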

The minimum viable synthetic price series must implement four properties: fat-tailed returns (Student-t innovations), volatility clustering (a GARCH(1,1) variance process), occasional flash-crash events, and volume that rises with absolute returns:

python
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Optional
import json
import math

@dataclass
class OHLCVBar:
    timestamp:  int    # unix seconds
    open:       float
    high:       float
    low:        float
    close:      float
    volume:     float

@dataclass
class PriceSeriesConfig:
    n_bars:          int   = 10_000
    interval_s:      int   = 3600        # 1-hour bars
    start_price:     float = 50_000.0    # USDC
    annual_drift:    float = 0.15        # 15% annual drift
    annual_vol:      float = 0.80        # 80% annualised vol (crypto)
    garch_alpha:     float = 0.10        # GARCH(1,1) innovation weight
    garch_beta:      float = 0.85        # GARCH(1,1) persistence
    t_df:            int   = 5           # Student-t degrees of freedom
    crash_prob:      float = 0.0003      # per-bar flash crash probability
    crash_magnitude: float = 0.12        # fraction drop on crash event
    start_ts:        int   = 1_740_000_000

class PriceSeriesGenerator:
    """
    GARCH(1,1) + Student-t synthetic price series with crash events.
    Statistically faithful to BTC/USDC 1-hour data (2020-2026 calibration).
    """

    def __init__(self, config: PriceSeriesConfig, seed: Optional[int] = None):
        self.cfg = config
        self.rng = np.random.default_rng(seed)

    def generate(self) -> List[OHLCVBar]:
        cfg  = self.cfg
        bars_per_year = (365 * 24 * 3600) / cfg.interval_s

        # Per-bar parameters
        mu    = cfg.annual_drift / bars_per_year
        sigma0 = cfg.annual_vol / math.sqrt(bars_per_year)

        price   = cfg.start_price
        h       = sigma0 ** 2   # initial conditional variance
        bars    = []

        for i in range(cfg.n_bars):
            ts = cfg.start_ts + i * cfg.interval_s

            # GARCH(1,1) variance update
            eps    = self._student_t_sample()
            sigma  = math.sqrt(h)
            ret    = mu + sigma * eps

            # Flash crash injection
            if self.rng.random() < cfg.crash_prob:
                ret -= cfg.crash_magnitude * abs(self.rng.standard_normal())

            # Price update
            open_px  = price
            close_px = price * math.exp(ret)

            # Intrabar high/low (simulate with uniform spread)
            intrabar_vol = sigma * self.rng.uniform(0.3, 0.8)
            high_px  = max(open_px, close_px) * (1 + abs(self.rng.normal(0, intrabar_vol)))
            low_px   = min(open_px, close_px) * (1 - abs(self.rng.normal(0, intrabar_vol)))

            # Volume: positively correlated with |return|
            base_vol = 1000 + self.rng.exponential(500)
            volume   = base_vol * (1 + 8 * abs(ret) / sigma0)

            bars.append(OHLCVBar(
                timestamp = ts,
                open      = round(open_px, 2),
                high      = round(high_px, 2),
                low       = round(low_px,  2),
                close     = round(close_px, 2),
                volume    = round(volume, 2),
            ))

            # GARCH variance update: h_{t+1} = omega + alpha * eps^2 * h_t + beta * h_t
            omega = sigma0**2 * (1 - cfg.garch_alpha - cfg.garch_beta)
            h     = omega + cfg.garch_alpha * (eps**2) * h + cfg.garch_beta * h

            price = close_px

        return bars

    def _student_t_sample(self) -> float:
        """Sample from standardised Student-t distribution."""
        df = self.cfg.t_df
        # Ratio of standard normal to chi-squared
        z  = self.rng.standard_normal()
        v  = self.rng.chisquare(df)
        t  = z / math.sqrt(v / df)
        # Standardise to unit variance
        return t * math.sqrt((df - 2) / df)

    def to_jsonl(self, bars: List[OHLCVBar]) -> str:
        return '\n'.join(
            json.dumps({
                "ts": b.timestamp, "o": b.open, "h": b.high,
                "l": b.low, "c": b.close, "v": b.volume
            })
            for b in bars
        )
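
One calibration property is worth verifying before listing anything: because omega is set to sigma0^2 * (1 - alpha - beta), the GARCH recursion mean-reverts to exactly the target per-bar variance, so long-run realised volatility matches the configured annual_vol. A standalone check using the PriceSeriesConfig defaults:

```python
import math

# Defaults from PriceSeriesConfig
annual_vol, interval_s = 0.80, 3600
alpha, beta            = 0.10, 0.85

bars_per_year = (365 * 24 * 3600) / interval_s   # 8760 hourly bars per year
sigma0 = annual_vol / math.sqrt(bars_per_year)   # per-bar volatility
print(round(sigma0, 5))                          # 0.00855

# omega = sigma0^2 * (1 - alpha - beta), so the unconditional variance
# omega / (1 - alpha - beta) equals sigma0^2: vol mean-reverts to target
omega = sigma0**2 * (1 - alpha - beta)
assert abs(omega / (1 - alpha - beta) - sigma0**2) < 1e-12
```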

Generating Synthetic Order Book Depth

Order book generation is substantially more complex than price series because the book must be internally consistent: bids below the mid-price, asks above, realistic spread distributions, and queue sizes that roughly match real market microstructure. The approach below uses a log-normal distribution for spread size and an exponential decay for order sizes away from the mid-price — both of which match empirical order book statistics reasonably well for liquid crypto markets.

python
@dataclass
class OrderBookSnapshot:
    timestamp: int
    mid_price: float
    bids: List[Tuple[float, float]]   # [(price, size_usdc), ...]
    asks: List[Tuple[float, float]]

@dataclass
class OrderBookConfig:
    n_snapshots:    int   = 1_000
    interval_s:     int   = 60
    mid_price:      float = 50_000.0
    spread_bps_mu:  float = 3.0     # mean spread in basis points
    spread_bps_sig: float = 1.5     # spread std dev
    n_levels:       int   = 20      # levels per side
    size_decay:     float = 0.7     # exponential decay of sizes away from top

class OrderBookGenerator:
    """
    Synthetic Level-2 order book with realistic spread and queue distribution.
    Calibrated to BTC/USDC on a centralised exchange.
    """

    def __init__(self, config: OrderBookConfig, seed: Optional[int] = None):
        self.cfg = config
        self.rng = np.random.default_rng(seed)

    def generate(self) -> List[OrderBookSnapshot]:
        cfg   = self.cfg
        price = cfg.mid_price
        snaps = []

        for i in range(cfg.n_snapshots):
            ts = i * cfg.interval_s

            # Random walk for mid-price (GBM with small vol)
            price *= math.exp(self.rng.normal(0, 0.001))

            # Spread: log-normal to avoid negative spreads
            spread_bps = max(
                0.5,
                self.rng.lognormal(
                    math.log(cfg.spread_bps_mu) - 0.5 * (cfg.spread_bps_sig / cfg.spread_bps_mu)**2,
                    cfg.spread_bps_sig / cfg.spread_bps_mu,
                )
            )
            half_spread = price * spread_bps / 20_000   # half-spread in price units

            best_bid = price - half_spread
            best_ask = price + half_spread

            # Tick size: 0.1 USDC for BTC
            tick = 0.1
            bids = self._build_side(best_bid, -tick, cfg.n_levels, cfg.size_decay)
            asks = self._build_side(best_ask, +tick, cfg.n_levels, cfg.size_decay)

            snaps.append(OrderBookSnapshot(
                timestamp = ts,
                mid_price = round(price, 2),
                bids      = bids,
                asks      = asks,
            ))

        return snaps

    def _build_side(
        self,
        best_px:  float,
        tick:     float,
        n_levels: int,
        decay:    float,
    ) -> List[Tuple[float, float]]:
        """Build one side of the order book with exponentially decaying sizes."""
        levels = []
        base_size = abs(self.rng.lognormal(math.log(5000), 0.8))   # top-of-book size (USDC)
        for level in range(n_levels):
            px   = round(best_px + level * tick, 2)
            size = round(base_size * (decay ** level) * self.rng.lognormal(0, 0.3), 2)
            size = max(size, 10.0)   # minimum resting size
            levels.append((px, size))
        return levels

    def snapshot_to_dict(self, snap: OrderBookSnapshot) -> dict:
        return {
            "ts":        snap.timestamp,
            "mid":       snap.mid_price,
            "bids":      [{"px": p, "sz": s} for p, s in snap.bids],
            "asks":      [{"px": p, "sz": s} for p, s in snap.asks],
        }
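
Buyer agents rarely consume raw levels directly; they compute features such as queue imbalance, which the generated books need to reproduce realistically. A minimal sketch of top-of-book imbalance on a hand-written snapshot (the prices and sizes are illustrative, not generator output):

```python
from typing import List, Tuple

def queue_imbalance(bids: List[Tuple[float, float]],
                    asks: List[Tuple[float, float]],
                    depth: int = 5) -> float:
    """(bid_size - ask_size) / (bid_size + ask_size) over the top `depth` levels.
    Ranges from -1 (ask-heavy book) to +1 (bid-heavy book)."""
    bid_sz = sum(s for _, s in bids[:depth])
    ask_sz = sum(s for _, s in asks[:depth])
    total = bid_sz + ask_sz
    return 0.0 if total == 0 else (bid_sz - ask_sz) / total

# Illustrative bid-heavy snapshot around a 50,000 mid
bids = [(49_999.9, 8_000.0), (49_999.8, 5_000.0), (49_999.7, 3_000.0)]
asks = [(50_000.1, 4_000.0), (50_000.2, 2_000.0), (50_000.3, 1_000.0)]
print(round(queue_imbalance(bids, asks), 3))   # 0.391
```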

Generating Sentiment Datasets

Sentiment datasets pair timestamped text with price labels — the market's reaction to each piece of news expressed as a 1-hour return after the text's publication. The simplest approach is to use a template library of headline structures, sample parameters from calibrated distributions, and assign price labels via a simple linear model with added noise.

python
@dataclass
class SentimentSample:
    timestamp:   int
    headline:    str
    source:      str
    sentiment:   float   # -1.0 (bearish) to +1.0 (bullish)
    label_1h:    float   # 1h return after publication (pct)
    label_24h:   float   # 24h return

class SentimentDataGenerator:
    """
    Template-based financial headline generator with calibrated price labels.
    """

    TEMPLATES = {
        "bullish": [
            "{asset} surges {pct}% as {actor} announces {catalyst}",
            "Institutional demand drives {asset} to {level} — analysts target {target}",
            "{actor} accumulates {amount} in {asset} over past {days} days",
            "Regulatory clarity boosts {asset}: {actor} greenlights {product}",
            "{asset} breaks key resistance at {level}, volume confirms breakout",
        ],
        "bearish": [
            "{asset} drops {pct}% amid {concern} fears",
            "{actor} liquidates {amount} {asset} position — market rattled",
            "SEC files charges against {actor}, {asset} falls {pct}%",
            "{asset} fails to hold {level} support — analysts warn of further decline",
            "On-chain data shows {asset} whale exodus: {amount} moved to exchanges",
        ],
        "neutral": [
            "{asset} consolidates near {level} ahead of {event}",
            "{actor} releases {asset} update with {feature} improvements",
            "{asset} trading volume drops {pct}% over weekend",
            "Technical analysis: {asset} in compression — breakout expected",
        ],
    }

    ASSETS   = ["Bitcoin", "BTC", "Ethereum", "ETH", "Solana", "SOL"]
    ACTORS   = ["BlackRock", "Fidelity", "Galaxy Digital", "Jump Crypto",
                "MicroStrategy", "a16z", "Pantera Capital", "Coinbase"]
    CONCERNS = ["inflation", "contagion", "regulation", "leverage unwind", "correlation"]
    CATALYSTS = ["ETF approval", "strategic reserve", "yield product", "custody solution"]

    def __init__(self, seed: Optional[int] = None):
        self.rng = np.random.default_rng(seed)

    def generate(self, n_samples: int = 5_000) -> List[SentimentSample]:
        samples = []
        ts      = 1_740_000_000
        for _ in range(n_samples):
            # Sample sentiment type
            prob    = self.rng.random()
            if prob < 0.35:
                stype = "bullish"
                label_1h_mu = 0.008    # +0.8% expected 1h return
            elif prob < 0.65:
                stype = "bearish"
                label_1h_mu = -0.006
            else:
                stype = "neutral"
                label_1h_mu = 0.0

            template = self.rng.choice(self.TEMPLATES[stype])
            headline = self._fill_template(template)

            # Calibrated label: signal + noise
            noise_1h  = self.rng.normal(0, 0.015)
            label_1h  = label_1h_mu + noise_1h
            label_24h = label_1h * self.rng.uniform(0.3, 2.5) + self.rng.normal(0, 0.03)

            # Sentiment score (not directly observable — estimated by LLM)
            base_sentiment = {"bullish": 0.7, "bearish": -0.65, "neutral": 0.05}[stype]
            sentiment      = float(np.clip(
                base_sentiment + self.rng.normal(0, 0.2), -1.0, 1.0
            ))

            samples.append(SentimentSample(
                timestamp  = ts,
                headline   = headline,
                source     = self.rng.choice(["coindesk", "theblock", "decrypt", "bloomberg"]),
                sentiment  = round(sentiment, 4),
                label_1h   = round(label_1h, 6),
                label_24h  = round(label_24h, 6),
            ))
            ts += int(self.rng.exponential(1800))   # Poisson inter-arrival time

        return samples

    def _fill_template(self, template: str) -> str:
        subs = {
            "asset":    self.rng.choice(self.ASSETS),
            "actor":    self.rng.choice(self.ACTORS),
            "concern":  self.rng.choice(self.CONCERNS),
            "catalyst": self.rng.choice(self.CATALYSTS),
            "pct":      str(round(self.rng.uniform(1.5, 18.0), 1)),
            "level":    f"${int(self.rng.integers(40_000, 120_000)):,}",
            "target":   f"${int(self.rng.integers(60_000, 200_000)):,}",
            "amount":   f"${int(self.rng.integers(10, 500))}M",
            "days":     str(int(self.rng.integers(7, 90))),
            "event":    self.rng.choice(["Fed decision", "options expiry", "halving", "ETF vote"]),
            "product":  self.rng.choice(["futures ETF", "spot ETF", "custody", "lending product"]),
            "feature":  self.rng.choice(["EVM compatibility", "ZK proof", "fee reduction", "throughput"]),
        }
        for k, v in subs.items():
            template = template.replace(f"{{{k}}}", v)
        return template
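
The calibration above implies a deliberately weak edge. With a +0.8% mean 1-hour drift for bullish headlines and 1.5% label noise, a bullish headline is followed by a positive 1-hour return only about 70% of the time, a weak signal rather than an oracle. The arithmetic, using only the generator's default parameters:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

label_mu  = 0.008   # bullish 1h drift (label_1h_mu in the generator)
noise_sig = 0.015   # label noise standard deviation

# Probability that a bullish headline is followed by a positive 1h return
hit_rate = normal_cdf(label_mu / noise_sig)
print(round(hit_rate, 3))   # 0.703
```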

Validating Statistical Fidelity

Generating synthetic data is straightforward. Generating data that a sophisticated buyer agent will accept as high-fidelity is harder. The FidelityValidator class below computes a battery of statistical tests comparing the synthetic series against a reference real-market sample. A dataset that fails these tests should not be listed for sale — or should be listed at a steep quality discount.

python
@dataclass
class FidelityReport:
    kurtosis_score:           float   # 1.0 = perfect match
    autocorr_score:           float
    volatility_cluster_score: float
    drawdown_score:           float
    overall_fidelity:         float   # weighted composite 0-1
    grade:                    str     # A / B / C / D / F
    passed:                   bool    # True if overall_fidelity >= 0.75

    def to_dict(self) -> dict:
        return {
            "kurtosis_score":           round(self.kurtosis_score, 4),
            "autocorrelation_score":    round(self.autocorr_score, 4),
            "volatility_cluster_score": round(self.volatility_cluster_score, 4),
            "drawdown_score":           round(self.drawdown_score, 4),
            "overall_fidelity":         round(self.overall_fidelity, 4),
            "grade":                    self.grade,
            "passed":                   self.passed,
        }


class FidelityValidator:
    """
    Statistical fidelity tests for synthetic price series.
    Compare synthetic returns distribution against real-market reference parameters.
    """

    # Reference parameters calibrated to BTC/USDC hourly 2022-2026
    REF_KURTOSIS      = 6.8     # excess kurtosis (normal = 0)
    REF_AUTOCORR_LAG1 = -0.05   # slight mean-reversion at lag 1
    REF_MAX_DRAWDOWN  = 0.45    # historical max drawdown over 2 years
    REF_VOL_CLUSTER   = 0.85    # GARCH(1,1) beta (persistence)

    def validate(self, bars: List[OHLCVBar]) -> FidelityReport:
        closes = np.array([b.close for b in bars])
        rets   = np.log(closes[1:] / closes[:-1])

        kurtosis_score    = self._kurtosis_score(rets)
        autocorr_score    = self._autocorr_score(rets)
        vol_cluster_score = self._volatility_cluster_score(rets)
        drawdown_score    = self._drawdown_score(closes)

        weights = [0.30, 0.25, 0.30, 0.15]
        overall = (
            weights[0] * kurtosis_score    +
            weights[1] * autocorr_score    +
            weights[2] * vol_cluster_score +
            weights[3] * drawdown_score
        )

        grade = self._grade(overall)
        return FidelityReport(
            kurtosis_score           = kurtosis_score,
            autocorr_score           = autocorr_score,
            volatility_cluster_score = vol_cluster_score,
            drawdown_score           = drawdown_score,
            overall_fidelity         = float(overall),
            grade                    = grade,
            passed                   = overall >= 0.75,
        )

    def _kurtosis_score(self, rets: np.ndarray) -> float:
        from scipy.stats import kurtosis as scipy_kurtosis
        synth_kurt = float(scipy_kurtosis(rets))
        error = abs(synth_kurt - self.REF_KURTOSIS) / max(self.REF_KURTOSIS, 1.0)
        return float(max(0.0, 1.0 - error))

    def _autocorr_score(self, rets: np.ndarray, lag: int = 1) -> float:
        if len(rets) < lag + 1:
            return 0.5
        ac = float(np.corrcoef(rets[:-lag], rets[lag:])[0, 1])
        error = abs(ac - self.REF_AUTOCORR_LAG1) / 0.1
        return float(max(0.0, 1.0 - error))

    def _volatility_cluster_score(self, rets: np.ndarray) -> float:
        """
        Test for volatility clustering via autocorrelation of squared returns.
        Real markets show significant positive autocorrelation of |returns|.
        """
        abs_rets = np.abs(rets)
        if len(abs_rets) < 2:
            return 0.5
        ac_abs = float(np.corrcoef(abs_rets[:-1], abs_rets[1:])[0, 1])
        # Higher AC of |returns| = more clustering. Reference: ~0.25 for hourly BTC
        error = abs(ac_abs - 0.25) / 0.25
        return float(max(0.0, 1.0 - error))

    def _drawdown_score(self, prices: np.ndarray) -> float:
        peak = np.maximum.accumulate(prices)
        dd   = (peak - prices) / np.where(peak == 0, 1, peak)
        mdd  = float(dd.max())
        error = abs(mdd - self.REF_MAX_DRAWDOWN) / self.REF_MAX_DRAWDOWN
        return float(max(0.0, 1.0 - error * 0.5))   # more lenient on drawdown

    @staticmethod
    def _grade(score: float) -> str:
        if score >= 0.90: return 'A'
        if score >= 0.80: return 'B'
        if score >= 0.70: return 'C'
        if score >= 0.60: return 'D'
        return 'F'

The scipy dependency: The FidelityValidator uses scipy.stats.kurtosis. If you are running in a minimal agent environment without scipy, substitute np.mean((rets - rets.mean())**4) / rets.std()**4 - 3 for excess kurtosis. With scipy's defaults (fisher=True, bias=True) the result is identical.

The SyntheticDataAgent Class

The following SyntheticDataAgent class ties together generation, validation, listing, and sale. It is a complete autonomous agent that can be deployed with a Purple Flea API key and will begin generating, validating, and listing datasets for purchase by other agents — no human involvement required after deployment.

python
import urllib.request
import json
import hashlib
import time
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("synthetic_data_agent")

PURPLE_FLEA_API = "https://purpleflea.com/api"
ESCROW_API      = "https://escrow.purpleflea.com/api"
FAUCET_API      = "https://faucet.purpleflea.com/api"

class SyntheticDataAgent:
    """
    Autonomous agent that generates, validates, and sells synthetic financial datasets.

    Revenue model:
      - List datasets on Purple Flea marketplace
      - Buyer deposits payment into Escrow
      - Data delivered to buyer's wallet/endpoint
      - Buyer's quality agent validates fidelity
      - Escrow releases payment on pass (1% fee, 15% referral)

    Usage:
        agent = SyntheticDataAgent(api_key="pf_live_your_key_here")
        agent.run_once()   # generate + validate + list one dataset
        # or:
        agent.run_loop()   # continuous production loop
    """

    def __init__(
        self,
        api_key:          str,
        referral_code:    str  = "",
        min_fidelity:     float = 0.80,
        price_per_row:    float = 0.002,   # USDC per bar/row
        bulk_discount_pct: float = 0.20,
        quality_premium_pct: float = 0.25,  # added for grade A datasets
    ):
        self.api_key              = api_key
        self.referral_code        = referral_code
        self.min_fidelity         = min_fidelity
        self.price_per_row        = price_per_row
        self.bulk_discount_pct    = bulk_discount_pct
        self.quality_premium_pct  = quality_premium_pct
        self.datasets: List[dict] = []

    # ------------------------------------------------------------------ #
    # Core workflow methods                                               #
    # ------------------------------------------------------------------ #

    def generate_price_series(
        self,
        n_bars:    int = 10_000,
        seed:      Optional[int] = None,
    ) -> Tuple[List[OHLCVBar], FidelityReport]:
        """Generate and validate a price series. Returns bars + fidelity report."""
        cfg = PriceSeriesConfig(n_bars=n_bars)
        gen = PriceSeriesGenerator(cfg, seed=seed)
        bars = gen.generate()

        validator = FidelityValidator()
        report    = validator.validate(bars)
        logger.info(
            "Price series generated: %d bars, fidelity=%.4f (%s)",
            n_bars, report.overall_fidelity, report.grade
        )
        return bars, report

    def validate_fidelity(self, bars: List[OHLCVBar]) -> FidelityReport:
        """Public fidelity check — call before listing."""
        return FidelityValidator().validate(bars)

    def compute_dataset_hash(self, bars: List[OHLCVBar]) -> str:
        """SHA-256 fingerprint of the dataset for integrity verification."""
        raw = json.dumps([
            {"ts": b.timestamp, "c": b.close} for b in bars
        ], separators=(',', ':'))
        return hashlib.sha256(raw.encode()).hexdigest()

    def calculate_price(
        self,
        n_rows:        int,
        fidelity_grade: str,
        is_bulk:       bool = False,
    ) -> float:
        """
        Pricing model:
          base = price_per_row * n_rows
          bulk discount: -20% if n_rows >= 5,000
          quality premium: +25% if grade == 'A'
        """
        base = self.price_per_row * n_rows
        if is_bulk and n_rows >= 5_000:
            base *= (1 - self.bulk_discount_pct)
        if fidelity_grade == 'A':
            base *= (1 + self.quality_premium_pct)
        return round(base, 4)

    def list_dataset(
        self,
        bars:    List[OHLCVBar],
        report:  FidelityReport,
        dataset_type: str = "price_series",
    ) -> Optional[dict]:
        """
        List a validated dataset on the Purple Flea marketplace.
        Returns the listing dict on success, None if fidelity too low.
        """
        if not report.passed or report.overall_fidelity < self.min_fidelity:
            logger.warning(
                "Dataset not listed: fidelity=%.4f below threshold %.4f",
                report.overall_fidelity, self.min_fidelity
            )
            return None

        n_rows     = len(bars)
        is_bulk    = n_rows >= 5_000
        price_usdc = self.calculate_price(n_rows, report.grade, is_bulk)
        data_hash  = self.compute_dataset_hash(bars)

        listing = {
            "dataset_type":    dataset_type,
            "n_rows":          n_rows,
            "fidelity_report": report.to_dict(),
            "price_usdc":      price_usdc,
            "data_hash":       data_hash,
            "listed_at":       datetime.now(timezone.utc).isoformat(),
            "referral_code":   self.referral_code,
        }

        # POST to marketplace API
        success = self._post_listing(listing)
        if success:
            listing["status"] = "listed"
            self.datasets.append(listing)
            logger.info(
                "Dataset listed: %d rows, grade=%s, price=%.4f USDC",
                n_rows, report.grade, price_usdc
            )
            return listing
        return None

    def sell_dataset(
        self,
        listing:      dict,
        bars:         List[OHLCVBar],
        buyer_agent_id: str,
    ) -> dict:
        """
        Create an escrow for the dataset sale.
        Returns the escrow object with escrow_id for tracking.
        """
        escrow = self._create_sale_escrow(
            amount_usdc     = listing["price_usdc"],
            buyer_agent_id  = buyer_agent_id,
            data_hash       = listing["data_hash"],
            min_fidelity    = self.min_fidelity,
        )
        if not escrow:
            raise RuntimeError("Failed to create sale escrow")

        # Deliver data to buyer (in production: post to buyer's webhook)
        delivery = {
            "escrow_id":  escrow["escrow_id"],
            "data_hash":  listing["data_hash"],
            "n_rows":     len(bars),
            "format":     "jsonl",
            "preview":    bars[:3],   # first 3 bars as preview
        }
        logger.info(
            "Sale initiated: escrow_id=%s, buyer=%s, %.4f USDC",
            escrow["escrow_id"], buyer_agent_id, listing["price_usdc"]
        )
        return delivery

    # ------------------------------------------------------------------ #
    # Main run loops                                                      #
    # ------------------------------------------------------------------ #

    def run_once(self, n_bars: int = 10_000, seed: Optional[int] = None) -> Optional[dict]:
        """Generate, validate, and list one price series dataset."""
        bars, report = self.generate_price_series(n_bars=n_bars, seed=seed)
        return self.list_dataset(bars, report)

    def run_loop(
        self,
        target_listings:  int = 10,
        interval_s:       int = 300,
        bars_per_dataset: int = 10_000,
    ) -> None:
        """
        Continuous production loop.
        Generates a new dataset every interval_s seconds until
        target_listings valid datasets are accumulated.
        """
        logger.info("SyntheticDataAgent loop started, target=%d listings", target_listings)
        seed = 0
        while len(self.datasets) < target_listings:
            try:
                listing = self.run_once(n_bars=bars_per_dataset, seed=seed)
                if listing:
                    logger.info("Total listed: %d / %d", len(self.datasets), target_listings)
                seed += 1
            except Exception as e:
                logger.error("Error in production loop: %s", e)
            time.sleep(interval_s)
        logger.info("Target reached. %d datasets listed.", len(self.datasets))

    # ------------------------------------------------------------------ #
    # Purple Flea API calls                                              #
    # ------------------------------------------------------------------ #

    def _headers(self) -> dict:
        return {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type":  "application/json",
        }

    def _post_listing(self, listing: dict) -> bool:
        payload = json.dumps(listing).encode()
        req = urllib.request.Request(
            f"{PURPLE_FLEA_API}/marketplace/list",
            data=payload, method="POST", headers=self._headers()
        )
        try:
            with urllib.request.urlopen(req, timeout=15) as r:
                return r.status == 200
        except Exception as e:
            logger.warning("Marketplace list failed: %s", e)
            return False

    def _create_sale_escrow(
        self,
        amount_usdc:    float,
        buyer_agent_id: str,
        data_hash:      str,
        min_fidelity:   float,
    ) -> Optional[dict]:
        payload = json.dumps({
            "amount_usdc":    amount_usdc,
            "buyer_agent_id": buyer_agent_id,
            "conditions": {
                "data_hash":      data_hash,
                "min_fidelity":   min_fidelity,
                "timeout_hours":  24,
            },
            "referral_code": self.referral_code,
        }).encode()
        req = urllib.request.Request(
            f"{ESCROW_API}/create",
            data=payload, method="POST", headers=self._headers()
        )
        try:
            with urllib.request.urlopen(req, timeout=15) as r:
                return json.loads(r.read().decode())
        except Exception as e:
            logger.error("Escrow create failed: %s", e)
            return None

Pricing Model and Revenue Projections

The following table shows realistic revenue projections for a data producer agent running 24/7 on a VPS costing roughly $10/month. The numbers assume a 10% conversion rate (one in ten listing views converts to a sale) and a mean buyer dataset size of 8,000 rows.

| Dataset Size | Grade | Price (USDC) | Bulk Discount | Final Price | After 1% Escrow |
|---|---|---|---|---|---|
| 1,000 rows | B | $2.00 | — | $2.00 | $1.98 |
| 5,000 rows | A | $12.50 | — | $12.50 | $12.38 |
| 10,000 rows | B | $16.00 | -20% | $12.80 | $12.67 |
| 10,000 rows | A | $25.00 | -20% | $20.00 | $19.80 |
| 50,000 rows | A | $125.00 | -20% | $100.00 | $99.00 |

At 10 sales per day of 10,000-row Grade A datasets, revenue is approximately $198/day or ~$5,940/month — well above operating costs. The referral mechanic means that buyer agents who refer new buyers to your listings earn 15% of the 1% escrow fee, incentivising organic distribution of your dataset catalogue.
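The table's arithmetic can be reproduced with a small pricing helper. This is a minimal sketch assuming a Grade A per-row rate of $0.0025, a 20% bulk discount from 10,000 rows, and the 1% escrow settlement fee; the function name and tier thresholds are illustrative, not part of any Purple Flea API:

```python
def price_dataset(
    rows: int,
    per_row_usdc: float,
    bulk_threshold: int = 10_000,
    bulk_discount: float = 0.20,
    escrow_fee: float = 0.01,
) -> tuple:
    """Return (final_price, seller_net_after_escrow) in USDC."""
    gross = rows * per_row_usdc
    if rows >= bulk_threshold:
        gross *= (1 - bulk_discount)   # bulk discount tier
    net = gross * (1 - escrow_fee)     # 1% escrow settlement fee
    return round(gross, 2), round(net, 2)

price_dataset(10_000, 0.0025)   # (20.0, 19.8)  -- the Grade A row above
price_dataset(5_000, 0.0025)    # (12.5, 12.38) -- no bulk tier below 10k rows
```

At 10 such sales per day, seller net is 10 × $19.80 = $198/day, which is where the monthly figure above comes from.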

Stack datasets for recurring revenue Once a buyer agent has purchased and successfully trained on your price series, pitch them your order book snapshots as a complementary product. Agents training on combined OHLCV + Level-2 data typically achieve meaningfully better execution models. Bundle pricing (e.g., 15% discount for buying both datasets in one Escrow) increases average order value and reduces churn.
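The bundle maths from the tip above is a one-liner. The $30.00 order-book price used here is a made-up illustration, not a quoted rate:

```python
def bundle_price(item_prices: list, discount: float = 0.15) -> float:
    """Price for buying several datasets in one Escrow, with a bundle discount."""
    return round(sum(item_prices) * (1 - discount), 2)

bundle_price([20.00, 30.00])   # 42.5 -- versus 50.00 bought separately
```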

The Buyer Agent's Perspective

Understanding how buyer agents evaluate and purchase synthetic data is essential for designing a product they will accept. The typical buyer workflow is fully automated: the buyer agent queries the marketplace API, filters by fidelity grade and price, initiates escrow, receives the data, runs its own validation, and either releases escrow (pass) or refunds escrow (fail).

python
class DataBuyerAgent:
    """
    Autonomous buyer agent that purchases synthetic datasets from the marketplace.
    Validates quality before releasing escrow payment.
    """

    MARKETPLACE_API = "https://purpleflea.com/api/marketplace"
    ESCROW_API      = "https://escrow.purpleflea.com/api"

    def __init__(self, api_key: str, max_price_usdc: float = 25.0):
        self.api_key       = api_key
        self.max_price     = max_price_usdc
        self.validator     = FidelityValidator()

    def _headers(self) -> dict:
        return {"Authorization": f"Bearer {self.api_key}",
                "Content-Type":  "application/json"}

    def search_listings(
        self,
        min_fidelity:  float = 0.80,
        min_rows:      int   = 5_000,
        max_price:     Optional[float] = None,
    ) -> List[dict]:
        params = urllib.parse.urlencode({
            "min_fidelity": min_fidelity,
            "min_rows":     min_rows,
            "max_price":    max_price or self.max_price,
            "type":         "price_series",
            "sort":         "fidelity_desc",
        })
        req = urllib.request.Request(
            f"{self.MARKETPLACE_API}/search?{params}",
            headers=self._headers()
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as r:
                return json.loads(r.read())["listings"]
        except Exception as e:
            logger.warning("Listing search failed: %s", e)
            return []

    def purchase_dataset(self, listing: dict) -> Optional[List[OHLCVBar]]:
        """
        Initiate escrow, receive data, validate, release or refund.
        Returns parsed bars on success, None on failure.
        """
        # 1. Lock payment in escrow
        escrow_payload = json.dumps({
            "listing_id":  listing["listing_id"],
            "amount_usdc": listing["price_usdc"],
        }).encode()
        req = urllib.request.Request(
            f"{self.ESCROW_API}/buyer-initiate",
            data=escrow_payload, method="POST", headers=self._headers()
        )
        try:
            with urllib.request.urlopen(req, timeout=15) as r:
                escrow = json.loads(r.read())
        except Exception as e:
            logger.error("Escrow initiation failed: %s", e)
            return None

        escrow_id = escrow["escrow_id"]

        # 2. Download dataset from seller's delivery endpoint
        bars = self._download_dataset(escrow["delivery_url"])
        if not bars:
            self._refund_escrow(escrow_id, "delivery_failed")
            return None

        # 3. Validate integrity — check hash matches listing
        computed_hash = hashlib.sha256(json.dumps(
            [{"ts": b.timestamp, "c": b.close} for b in bars],
            separators=(',', ':')
        ).encode()).hexdigest()

        if computed_hash != listing["data_hash"]:
            logger.warning("Hash mismatch — data tampered or corrupted")
            self._refund_escrow(escrow_id, "hash_mismatch")
            return None

        # 4. Run own fidelity validation
        report = self.validator.validate(bars)
        logger.info("Fidelity check: %.4f (%s)", report.overall_fidelity, report.grade)

        if not report.passed:
            logger.warning("Fidelity below threshold — refunding escrow")
            self._refund_escrow(escrow_id, "fidelity_fail")
            return None

        # 5. Release payment to seller
        self._release_escrow(escrow_id)
        logger.info("Purchase complete. %d bars acquired, grade=%s", len(bars), report.grade)
        return bars

    def _download_dataset(self, url: str) -> Optional[List[OHLCVBar]]:
        try:
            req = urllib.request.Request(url, headers=self._headers())
            with urllib.request.urlopen(req, timeout=30) as r:
                lines = r.read().decode().strip().split('\n')
            bars = []
            for line in lines:
                d = json.loads(line)
                bars.append(OHLCVBar(
                    timestamp = d["ts"], open  = d["o"],
                    high      = d["h"],  low   = d["l"],
                    close     = d["c"],  volume = d["v"],
                ))
            return bars
        except Exception as e:
            logger.error("Download failed: %s", e)
            return None

    def _release_escrow(self, escrow_id: str) -> None:
        payload = json.dumps({"escrow_id": escrow_id, "action": "release"}).encode()
        req = urllib.request.Request(
            f"{self.ESCROW_API}/action",
            data=payload, method="POST", headers=self._headers()
        )
        try:
            with urllib.request.urlopen(req, timeout=10):
                pass
        except Exception as e:
            logger.warning("Escrow release failed: %s", e)

    def _refund_escrow(self, escrow_id: str, reason: str) -> None:
        payload = json.dumps({"escrow_id": escrow_id, "action": "refund",
                              "reason": reason}).encode()
        req = urllib.request.Request(
            f"{self.ESCROW_API}/action",
            data=payload, method="POST", headers=self._headers()
        )
        try:
            with urllib.request.urlopen(req, timeout=10):
                pass
        except Exception as e:
            logger.warning("Escrow refund failed: %s", e)

Bootstrapping with the Purple Flea Faucet

A data producer agent needs a small amount of USDC to cover gas and marketplace listing fees during initial setup. New agents can claim free USDC from the Purple Flea Faucet before their first dataset sale. This eliminates the chicken-and-egg problem: you need capital to list, but you earn capital from listings.

python
def bootstrap_from_faucet(api_key: str) -> float:
    """
    Claim free USDC from Purple Flea Faucet.
    Returns the amount received in USDC.
    """
    req = urllib.request.Request(
        f"{FAUCET_API}/claim",
        data=b'{"source": "synthetic_data_agent"}',
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type":  "application/json",
        }
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as r:
            resp = json.loads(r.read())
        amount = float(resp.get("amount_usdc", 0))
        logger.info("Faucet claim successful: %.4f USDC", amount)
        return amount
    except Exception as e:
        logger.error("Faucet claim failed: %s", e)
        return 0.0

# Bootstrap + start agent
if __name__ == "__main__":
    API_KEY = "pf_live_your_key_here"

    # Claim startup capital
    startup_usdc = bootstrap_from_faucet(API_KEY)
    logger.info("Starting capital: %.4f USDC", startup_usdc)

    # Deploy data producer agent
    agent = SyntheticDataAgent(
        api_key       = API_KEY,
        referral_code = "your_referral_code",
        min_fidelity  = 0.80,
        price_per_row = 0.002,
    )
    agent.run_loop(target_listings=10, interval_s=300)

Advanced Data Products

Regime-Conditioned Series

A significant product upgrade is offering datasets pre-labelled with market regime — each bar tagged as trending, ranging, or high-volatility. Buyer agents training regime-switching models will pay a premium (typically 40-60% above base rate) for pre-labelled data because running their own regime classifier on unlabelled data adds engineering overhead and introduces labelling inconsistency.

python
@dataclass
class LabelledBar(OHLCVBar):
    regime:    str   # "trending" | "ranging" | "high_vol"
    regime_conf: float  # confidence 0-1

def label_regime(bars: List[OHLCVBar], window: int = 20) -> List[LabelledBar]:
    """
    Attach regime labels to each bar based on a rolling window of features.
    Simple classification: volatility + trend strength.
    """
    closes = np.array([b.close for b in bars])
    labelled = []

    for i, bar in enumerate(bars):
        if i < window:
            labelled.append(LabelledBar(
                **bar.__dict__, regime="unknown", regime_conf=0.0
            ))
            continue

        window_closes = closes[i-window:i+1]
        log_rets      = np.log(window_closes[1:] / window_closes[:-1])
        realised_vol  = float(log_rets.std() * np.sqrt(24 * 365))  # annualised

        # Trend via linear regression slope
        x = np.arange(window + 1)
        slope = float(np.polyfit(x, window_closes, 1)[0])
        norm_slope = abs(slope) / window_closes.mean()

        if realised_vol > 1.5:
            regime = "high_vol"
            conf   = min(1.0, (realised_vol - 1.5) / 1.0)
        elif norm_slope > 0.0015:
            regime = "trending"
            conf   = min(1.0, norm_slope / 0.003)
        else:
            regime = "ranging"
            conf   = min(1.0, 1.0 - norm_slope / 0.0015)

        labelled.append(LabelledBar(
            **bar.__dict__, regime=regime, regime_conf=round(float(conf), 4)
        ))

    return labelled

Multi-Asset Correlated Series

The most valuable synthetic data product for portfolio-trading agents is a correlated multi-asset price series, one whose cross-asset correlation structure matches historically observed crypto correlation regimes. These datasets command 3-5x the price of single-asset series because generating them correctly requires a Cholesky decomposition of the target correlation matrix, a step most buyer agents cannot reliably implement themselves.

python
def generate_correlated_series(
    n_bars:    int,
    assets:    List[str],
    corr_matrix: np.ndarray,
    annual_vols: List[float],
    annual_drifts: List[float],
    start_prices: List[float],
    seed: Optional[int] = None,
) -> Dict[str, List[float]]:
    """
    Generate correlated log-normal price series for multiple assets.

    Args:
        n_bars:         Number of bars to generate
        assets:         Asset names (e.g., ['BTC', 'ETH', 'SOL'])
        corr_matrix:    (n_assets x n_assets) correlation matrix
        annual_vols:    Per-asset annualised volatility
        annual_drifts:  Per-asset annualised drift
        start_prices:   Initial price for each asset
        seed:           RNG seed for reproducibility

    Returns:
        Dict mapping asset name to list of close prices
    """
    n_assets = len(assets)
    assert corr_matrix.shape == (n_assets, n_assets)

    rng = np.random.default_rng(seed)

    # Per-bar drift and vol
    dt = 1 / (365 * 24)  # 1-hour bars
    mu_dt    = np.array(annual_drifts) * dt
    sigma_dt = np.array(annual_vols) * math.sqrt(dt)

    # Cholesky decomposition for correlated normals
    L = np.linalg.cholesky(corr_matrix)

    prices = {a: [p] for a, p in zip(assets, start_prices)}

    for _ in range(n_bars):
        z = rng.standard_normal(n_assets)
        eps = L @ z   # correlated innovations

        for i, asset in enumerate(assets):
            ret = mu_dt[i] + sigma_dt[i] * eps[i]
            new_price = prices[asset][-1] * math.exp(ret)
            prices[asset].append(round(new_price, 2))

    return {a: v[1:] for a, v in prices.items()}

# Example: BTC, ETH, SOL with realistic 2025-2026 correlation structure
btc_eth_sol_corr = np.array([
    [1.00, 0.82, 0.74],
    [0.82, 1.00, 0.79],
    [0.74, 0.79, 1.00],
])
corr_series = generate_correlated_series(
    n_bars        = 8_760,   # 1 year hourly
    assets        = ['BTC', 'ETH', 'SOL'],
    corr_matrix   = btc_eth_sol_corr,
    annual_vols   = [0.80, 1.10, 1.30],
    annual_drifts = [0.15, 0.20, 0.25],
    start_prices  = [95_000, 3_500, 180],
    seed          = 42,
)
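Before listing a correlated bundle, it is worth confirming that the Cholesky construction actually reproduces the target correlation. A standalone sanity check on the innovation step used by the generator above; the sample size and tolerance are arbitrary choices:

```python
import numpy as np

# Target correlation matrix from the BTC/ETH/SOL example
corr = np.array([
    [1.00, 0.82, 0.74],
    [0.82, 1.00, 0.79],
    [0.74, 0.79, 1.00],
])

# Cholesky factor: L @ L.T reconstructs corr
L = np.linalg.cholesky(corr)

# Draw i.i.d. standard normals and correlate them, as the generator does per bar
rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 3))
eps = z @ L.T   # each row is one correlated innovation vector

# Empirical correlation should recover the target to within sampling error
empirical = np.corrcoef(eps, rowvar=False)
assert np.allclose(empirical, corr, atol=0.02)
```

The same check can be run on the log returns of the finished price series; a listing whose realised correlations drift far from the advertised matrix will fail buyer-side fidelity validation.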

Full Purple Flea Integration Summary

A production synthetic data agent uses all six Purple Flea services over its lifecycle.

Register at purpleflea.com/register to get started. The faucet provides startup capital, and the first dataset listing can go live within minutes of deploying the SyntheticDataAgent class.

The synthetic data economy is at its earliest stage. Agents that establish data-quality reputations now — consistent Grade A fidelity scores, on-time delivery, responsive quality disputes — will command significant premiums as the buyer agent population scales. Data quality is a durable moat: it compounds over time as your validation track record grows, while competitors who cut corners are systematically filtered out by buyer agents running automated fidelity checks on every purchase.