Introduction
NVIDIA NIM (NVIDIA Inference Microservices) lets you run production-grade LLMs on your own GPU infrastructure with an OpenAI-compatible API. Combined with Purple Flea's trading API, you get a fully autonomous, privacy-preserving trading agent where your positions, strategies, and alpha never leave your own hardware.
No cloud API costs, no rate limits, no latency spikes at 3am when a market moves. This guide shows you how to wire up NIM's local inference endpoint to Purple Flea's perpetual futures trading API in roughly 50 lines of Python.
Why Local Inference for Trading Agents
Cloud LLMs work fine for most agent tasks, but trading has unique requirements that push the tradeoffs in favor of local inference:
- No cloud API latency: Local inference adds sub-5ms overhead vs 200-800ms round-trip to OpenAI's servers. When a liquidation cascade is in progress, milliseconds matter.
- Strategy stays private: Your prompts, market context, and position logic never leave your hardware. No AI provider can observe your trading patterns or sell aggregated signal data.
- Consistent throughput: No shared-infrastructure rate limits. Your agent can run inference loops as fast as your GPU allows, which is critical during high-volatility periods when you want faster decision cycles.
- Run multiple models in parallel: Dedicated hardware lets you run a fast model for execution decisions and a slower, larger model for risk analysis simultaneously.
Prerequisites
- NVIDIA GPU (A10G, A100, or H100 recommended; RTX 3090/4090 works for smaller models)
- NVIDIA NIM installed and running locally (see the NVIDIA NIM docs)
- Purple Flea API key from /api-keys
- Python 3.10+ with openai and requests installed
- A model pulled into NIM; meta/llama-3.1-70b-instruct is recommended for trading
Quick start: Install dependencies with pip install openai requests. NIM exposes an OpenAI-compatible endpoint at http://localhost:8000/v1 by default, so no SDK changes are needed.
The Tool Schema
Define Purple Flea's trading capabilities as OpenAI-compatible function tools. The model will decide when and how to call each one:
get_price
Fetch current mark price and 24h change for any perpetual market
open_long
Open a leveraged long position with configurable size in USD
close_trade
Close an existing position by trade ID, full or partial
get_portfolio
Return current open positions, unrealized PnL, and margin usage
tools = [
{
"type": "function",
"function": {
"name": "get_price",
"description": "Get current price for a perpetual market",
"parameters": {
"type": "object",
"properties": {
"symbol": {"type": "string", "description": "e.g. BTC-PERP, ETH-PERP"}
},
"required": ["symbol"]
}
}
},
{
"type": "function",
"function": {
"name": "open_long",
"description": "Open a long position on a perpetual futures market",
"parameters": {
"type": "object",
"properties": {
"symbol": {"type": "string"},
"size_usd": {"type": "number", "description": "Position size in USD"},
"leverage": {"type": "number", "description": "Leverage multiplier, 1-10"}
},
"required": ["symbol", "size_usd"]
}
}
},
{
"type": "function",
"function": {
"name": "get_portfolio",
"description": "Get current portfolio: open positions, PnL, margin",
"parameters": {"type": "object", "properties": {}}
}
},
{
"type": "function",
"function": {
"name": "close_trade",
"description": "Close an open position",
"parameters": {
"type": "object",
"properties": {
"trade_id": {"type": "string"},
"close_pct": {"type": "number", "description": "0-100, percentage to close"}
},
"required": ["trade_id"]
}
}
}
]
The Full Agent Code
Here is the complete trading agent. NIM handles the LLM reasoning; Purple Flea handles execution:
from openai import OpenAI
import requests
import json
# NIM runs locally on OpenAI-compatible endpoint
nim_client = OpenAI(base_url="http://localhost:8000/v1", api_key="nim-local")
PURPLE_FLEA_KEY = "your-pf-api-key"
PF_BASE = "https://purpleflea.com/api/v1"
HEADERS = {"Authorization": f"Bearer {PURPLE_FLEA_KEY}", "Content-Type": "application/json"}
def call_purple_flea(name: str, args: dict) -> dict:
"""Execute a Purple Flea API call based on tool name."""
if name == "get_price":
r = requests.get(f"{PF_BASE}/markets/{args['symbol']}/price", headers=HEADERS)
return r.json()
elif name == "open_long":
payload = {
"symbol": args["symbol"],
"side": "long",
"size": args["size_usd"],
"leverage": args.get("leverage", 2)
}
r = requests.post(f"{PF_BASE}/trade", json=payload, headers=HEADERS)
return r.json()
elif name == "get_portfolio":
r = requests.get(f"{PF_BASE}/portfolio", headers=HEADERS)
return r.json()
elif name == "close_trade":
payload = {"trade_id": args["trade_id"], "close_pct": args.get("close_pct", 100)}
r = requests.post(f"{PF_BASE}/trade/close", json=payload, headers=HEADERS)
return r.json()
return {"error": "unknown tool"}
def run_trading_agent():
messages = [
{
"role": "system",
"content": (
"You are an autonomous crypto trading agent running on Purple Flea's "
"perpetual futures exchange. Your goal: maximize returns while keeping "
"max drawdown under 10%. Check BTC-PERP price, review your portfolio, "
"and make a trading decision. Be concise and decisive."
)
},
{
"role": "user",
"content": "Analyze the market and take action if conditions warrant it."
}
]
# Agentic loop: run until model stops calling tools
while True:
response = nim_client.chat.completions.create(
model="meta/llama-3.1-70b-instruct",
messages=messages,
tools=tools,
tool_choice="auto",
temperature=0.3
)
msg = response.choices[0].message
messages.append(msg)
if not msg.tool_calls:
print("Agent decision:", msg.content)
break
# Execute each tool call
for tc in msg.tool_calls:
fn_name = tc.function.name
fn_args = json.loads(tc.function.arguments)
result = call_purple_flea(fn_name, fn_args)
print(f"Tool: {fn_name}({fn_args}) -> {result}")
messages.append({
"role": "tool",
"tool_call_id": tc.id,
"content": json.dumps(result)
})
if __name__ == "__main__":
run_trading_agent()
Handling Tool Calls in the Loop
The agentic loop above keeps running until the model returns a message with no tool calls; that's when it has finished acting and wants to summarize. A typical execution trace looks like:
- Model calls get_portfolio() → sees $500 balance, no open positions
- Model calls get_price(BTC-PERP) → sees $78,200, down 2.1% in 24h
- Model calls open_long(BTC-PERP, size_usd=50, leverage=2)
- Model returns a summary: "Opened $50 2x BTC long at $78,200. Stop loss mentally at -3%. Awaiting confirmation."
The model sequences these calls itself; you don't need to orchestrate the order. This is the core of tool-calling agents: the LLM decides what information it needs and fetches it iteratively.
Production Tips
Before running this agent with real capital, add these safeguards:
- Hard position limits: Add a pre-check that rejects any open_long call where size_usd exceeds 10% of portfolio balance. Do this in call_purple_flea(), not in the prompt.
- Stop-loss enforcement: After opening a position, spawn a background monitor thread that polls the price every 30s and closes the trade if PnL falls below your threshold.
- Async execution: Use asyncio and aiohttp for the tool execution layer. NIM inference is already fast; don't let network I/O stall the loop.
- Rate limiting: Add a minimum 5-minute interval between consecutive trades to avoid chasing noise. Track the last trade timestamp in a simple dict.
- Logging: Write all tool calls and results to a structured log file. Post-trade analysis is the fastest way to improve your agent's system prompt.
Performance: Local vs Cloud
On an A10G GPU running llama-3.1-70b-instruct via NIM, a full 4-tool-call agent loop completes in roughly 8-12 seconds total, about 2-3 seconds per inference step. The cloud API equivalent is 6-15 seconds once network variance is included.
For latency-critical strategies (scalping, liquidation cascades), the more meaningful optimization is reducing Purple Flea API round-trips and pre-fetching market data into the context window before starting the loop. The LLM reasoning itself is rarely the bottleneck for trade execution timescales.
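Pre-fetching can be as simple as injecting a snapshot message before the loop starts, so the model's first inference step already has prices and skips a get_price round-trip. A minimal sketch, assuming price responses carry mark_price and change_24h fields (illustrative field names; check the actual /markets response shape), with build_snapshot_message as a hypothetical helper:

```python
def build_snapshot_message(prices: dict[str, dict]) -> dict:
    """Turn pre-fetched price data into a user message for the agent.

    `prices` maps symbol -> price payload; the mark_price and change_24h
    keys are assumed here for illustration.
    """
    lines = [
        f"{sym}: mark={p['mark_price']} change_24h={p['change_24h']}%"
        for sym, p in prices.items()
    ]
    return {"role": "user", "content": "Market snapshot:\n" + "\n".join(lines)}

# Before run_trading_agent's loop, prices could be gathered via the same
# tool executor, e.g.:
# prices = {s: call_purple_flea("get_price", {"symbol": s})
#           for s in ("BTC-PERP", "ETH-PERP")}
# messages.append(build_snapshot_message(prices))
```

Appending this message to the initial conversation typically removes one full tool-call iteration from the loop.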
Conclusion
NVIDIA NIM plus Purple Flea gives you a local, private, rate-limit-free autonomous trading stack in under 50 lines. The OpenAI-compatible interface means you can swap any model โ try smaller, faster models like llama-3.1-8b for high-frequency signals and larger models for portfolio-level strategy.
Full trading API docs at /trading-api, API reference at /api-reference, and the NVIDIA NIM integration guide at /for-nvidia-nim.