Six frontier LLMs each received $10,000 of real capital on Hyperliquid to trade crypto perps for 72 hours using the same prompt and inputs. The leaderboard crowned DeepSeek Chat v3.1 (+38%), followed by Grok 4 (+35%) and Claude Sonnet 4.5 (+25%); Qwen 3 Max finished +9%, while GPT-5 and Gemini 2.5 Pro fell −27% and −31%. This piece unpacks how such outcomes can happen, and the many caveats to weigh before you let an AI touch your account.
Could a large language model (LLM) really run a live futures book for three straight days and beat skilled humans? A recent entertainment-style bake-off tested exactly that idea. Six state-of-the-art AI agents were each funded with $10,000 of real capital on Hyperliquid. They received the same trading prompt and the same market inputs, then ran autonomously for 72 hours in crypto perpetual futures (long/short). The goal wasn’t to crown a permanent champion so much as to stress whether modern agents can manage risk, adapt to microstructure, and avoid the classic traps—overtrading, funding bleed, and liquidation cascades.
The Headline Results (Start: $10,000 per model)
| Model | Ending Equity | Return | Approx. P&L |
|---|---|---|---|
| DeepSeek Chat v3.1 | $13,830 | +38% | +$3,830 |
| Grok 4 | $13,481 | +35% | +$3,481 |
| Claude Sonnet 4.5 | $12,506 | +25% | +$2,506 |
| Qwen 3 Max | $10,896 | +9% | +$896 |
| GPT-5 | $7,265 | −27% | −$2,735 |
| Gemini 2.5 Pro | $6,864 | −31% | −$3,136 |
Important: This was a short, controlled trial. It shows what happened over a single, noisy, 72-hour window—not what will happen next week or in a different market regime. The organizers framed it as a fun experiment to probe reaction speed and decision hygiene, not as investment advice.
What Did the Top Three Likely Do Right?
We don’t have full telemetry (tick-by-tick decisions, exact prompts, or internal state), but the outcome suggests three broad advantages:
1. Regime recognition over prediction. In high-beta tapes, agents that react to what is (volatility regime, funding skew, order-book elasticity) tend to beat agents that guess what should be. DeepSeek and Grok likely leaned on simple but robust features—price acceptance above/below intraday value, rising spot participation, and low-latency momentum confirms—then sized modestly until structure strengthened.
2. Funding and basis discipline. Perpetual futures leak value if you sit on the wrong side of funding for hours. The top models probably avoided extended negative carry, clipped reversals in windows when funding converged to neutral, or hedged drift by flattening during hot intervals. That alone can explain a large performance gap across only three days.
3. Fewer, better trades. Overtrading is the silent P&L killer. The leaders likely enforced a trade budget (e.g., max N entries per hour) and a cool-down after a stop, forcing higher selectivity and reducing slippage and fees; a minimal sketch of such a gate follows this list.
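To make the trade-budget idea concrete, here is a minimal Python sketch of such a gate. It is an illustration only, not the harness the organizers used; the class name `TradeGate` and the limits (three entries per rolling hour, a 15-minute cool-down) are assumptions chosen for the sketch.

```python
import time
from collections import deque

class TradeGate:
    """Hypothetical entry gate: caps entries per rolling hour and enforces
    a cool-down after a stop-out. Illustrative; not the actual harness."""

    def __init__(self, max_entries_per_hour=3, cooldown_s=900):
        self.max_entries = max_entries_per_hour
        self.cooldown_s = cooldown_s      # assumed 15-minute cool-down
        self.entries = deque()            # timestamps of recent entries
        self.last_stop_ts = None          # timestamp of the last stop-out

    def record_entry(self):
        self.entries.append(time.time())

    def record_stop(self):
        self.last_stop_ts = time.time()

    def may_enter(self):
        now = time.time()
        # Keep only entries from the trailing hour in the rolling window.
        while self.entries and now - self.entries[0] > 3600:
            self.entries.popleft()
        # Block new risk while cooling down after a stop-out.
        if self.last_stop_ts is not None and now - self.last_stop_ts < self.cooldown_s:
            return False
        return len(self.entries) < self.max_entries
```

An agent would call `may_enter()` before every new position and simply do nothing when it returns `False`; the "do nothing" default is the whole point.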
Why Did Bottom Performers Struggle?
Two failure modes routinely sink autonomous agents:
- Latency-induced whipsaw. When a model chases micro-moves without accounting for API/update latency or order-book depth, it buys tops and sells bottoms. Slippage compounds, then stops trigger, then the agent re-enters: classic churn. (A depth-aware entry check is sketched after this list.)
- Policy drift after losses. Some agents loosen risk rules after drawdowns (“one more trade to get back to even”). In live markets, this transforms a small cold streak into a large loss—precisely the pattern consistent with −27%/−31% outcomes in just 72 hours.
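One cheap defense against latency-chase churn is to estimate market-order slippage from top-of-book depth before entering. The sketch below is a toy illustration under assumed inputs: the function name, the `(price, size)` book format, and the 5 bp budget are choices for the example, not details from the experiment.

```python
def est_slippage_bps(ask_levels, qty):
    """Estimate average slippage, in basis points, for a market buy of `qty`
    by walking hypothetical ask levels [(price, size), ...] sorted best-first.
    Illustrative only; a live agent would use real order-book snapshots."""
    best = ask_levels[0][0]
    remaining, cost = qty, 0.0
    for price, size in ask_levels:
        take = min(remaining, size)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 0:
        return float("inf")                 # book too thin to fill at all
    avg_fill = cost / qty
    return (avg_fill - best) / best * 1e4   # cost vs. best ask, in bps

# Assumed usage: stand aside when the estimate blows a 5 bp budget.
# if est_slippage_bps(asks, order_qty) > 5.0: do not enter
```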
Methodology Snapshot (What We Can Infer)
All agents received the same high-level prompt and identical market data feeds. Orders were placed through the exchange venue using programmatic interfaces; funding, fees, and liquidity conditions were identical ex ante. The differences therefore hinge on policy (signals, sizing, and risk) and on the quality of each model’s internal reasoning plus tooling (execution layer, throttling, and state tracking). We don’t know the leverage caps, exact stop logic, or whether agents used limit vs market orders by default—but the dispersion of results strongly suggests divergent risk engines.
What These Results Do—and Don’t—Mean
It’s tempting to say, “DeepSeek > Grok > Claude; therefore use DeepSeek for trading.” That is the wrong takeaway. Three crucial caveats apply:
- Path dependence. Results over 72 hours are path dependent; a different three-day window (say, low-vol chop or a single directional squeeze) can reshuffle the leaderboard.
- Prompt dependence. Change the prompt—e.g., push the agent to mean-reversion instead of momentum—and you will change the winners. LLMs are highly sensitive to objective framing.
- Tooling dependence. The execution harness (risk caps, throttle, cool-downs, order types) can make a mediocre strategy look good and vice versa. The agent’s IQ matters; the seatbelt matters more.
Inside the P&L: Where 38% vs −31% Comes From
Over three days, a handful of choices dominate equity curves:
- Stop placement by structure, not distance. Stops under/over accepted levels (previous day’s value area, session VWAP bands, or obvious swing points) survive noise better than fixed 0.5%/1% cuts.
- Position tapering during illiquidity. Night-session or event-risk periods invite air pockets. Smart agents reduce size when top-of-book depth thins or spreads widen.
- Funding neutralization. Flatten into funding spikes; re-risk on the other side. Over a 72-hour horizon spanning multiple funding prints, this single habit can separate green from red (a toy flattening rule follows this list).
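As a toy illustration of funding neutralization, the rule below flattens only when the agent is on the paying side of an unusually hot print. The 5 bp per-interval threshold and the sign convention (positive funding means longs pay shorts) are assumptions for the sketch, not parameters from the event.

```python
def funding_action(position_side, funding_rate, hot_threshold=0.0005):
    """Toy flattening rule around funding prints. Assumes positive
    `funding_rate` means longs pay shorts; `hot_threshold` (5 bps per
    interval) is an illustrative choice, not a venue constant."""
    paying = (position_side == "long" and funding_rate > 0) or \
             (position_side == "short" and funding_rate < 0)
    if paying and abs(funding_rate) >= hot_threshold:
        return "flatten_before_print"   # step aside for the hot interval
    return "hold"                       # carry is neutral or in our favor
```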
A Simple Audit Framework (If You Ever Run an AI Agent)
1. Define hard kill-switches. Daily loss cap (e.g., −5%), max concurrent positions, and a cool-down after two consecutive stops (a minimal sketch follows this list).
2. Instrument hygiene. Only trade pairs with proven depth; ban long-tail symbols vulnerable to wick hunts.
3. Stateful memory. Persist P&L, funding, and recent slippage; don’t let the agent re-enter the same losing pattern within N minutes.
4. Execution rules. Prefer limit-first with timeouts that escalate to IOC/market only when liquidity is adequate; cap market order size to a fraction of top-of-book.
5. Transparency. Log prompts, decisions, and fills; print a one-page report per session (trades taken, win rate, average R, slippage, fees, funding paid/received).
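Here is a minimal sketch of the kill-switch layer from item 1, assuming the thresholds above. The class name and interface are hypothetical; a real harness would also need the execution and logging rules from items 4 and 5.

```python
class KillSwitch:
    """Hypothetical hard-stop layer wrapping an agent's order flow.
    Thresholds mirror the checklist above; names are illustrative."""

    def __init__(self, start_equity, daily_loss_cap=0.05,
                 max_positions=2, stop_streak_limit=2):
        self.start_equity = start_equity
        self.daily_loss_cap = daily_loss_cap    # e.g., halt at -5% on the day
        self.max_positions = max_positions      # cap on concurrent positions
        self.stop_streak_limit = stop_streak_limit
        self.stop_streak = 0

    def on_stop_out(self):
        self.stop_streak += 1

    def on_winner(self):
        self.stop_streak = 0                    # a winner resets the streak

    def allow_new_position(self, equity, open_positions):
        # Daily loss cap breached: no new risk for the rest of the session.
        if equity <= self.start_equity * (1 - self.daily_loss_cap):
            return False
        # Concurrency cap keeps correlated exposure bounded.
        if open_positions >= self.max_positions:
            return False
        # Two consecutive stop-outs trigger the mandated cool-down.
        if self.stop_streak >= self.stop_streak_limit:
            return False
        return True
```

The key design choice is that the kill-switch sits outside the model: it vetoes orders regardless of how confident the agent's reasoning sounds.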
Model-by-Model Hypotheses
DeepSeek Chat v3.1 (+38%). Likely leaned into trend-following with volatility filters and disciplined exposure during funding neutrality. The curve suggests good selectivity rather than hyperactivity.
Grok 4 (+35%). A close second indicates similar signal quality, perhaps with slightly higher turnover or a late-session drawdown that clipped the top spot.
Claude Sonnet 4.5 (+25%). Respectable performance consistent with strong risk control but more conservative adds.
Qwen 3 Max (+9%). Positive but cautious—possibly under-risked entries or overreliance on mean-reversion in trending segments.
GPT-5 (−27%) and Gemini 2.5 Pro (−31%). Both appear to have suffered from policy drift or latency-chase behavior—buying strength too late, then stopping out repeatedly, or holding against funding for too long.
What Would a Fair Rematch Require?
- Longer horizon. Run at least two weeks across mixed regimes (trend, chop, event risk) to test adaptability.
- Fixed toolchain. Same execution framework for all models (identical order router, slippage guards, risk caps) so the model is the only variable.
- Pre-registered prompts. Publish prompts and evaluation metrics before the start; freeze them for the entire run.
- Live telemetry. Open dashboards with anonymized trade logs, funding paid/received, and time-in-market so the community can reproduce insights.
Can AI Replace Human Traders?
Not yet—and that’s the wrong framing. The right framing is augmentation. An LLM can monitor dozens of pairs, estimate volatility regimes, remind you of funding/fees, and enforce discipline. But humans still make the calls that matter most: objectives, risk appetite, and when not to trade. Think of the agent as a rules engine with broad pattern recognition—not as an oracle.
Practical Uses Today (That Don’t Blow Up Accounts)
- Pre-trade checklists. Have the model score your setup: trend alignment, liquidity, funding, event risk, and invalidation quality (a toy scorecard follows this list).
- Post-trade forensics. Auto-generate a one-pager with annotated charts and reasons for wins/losses; this compounds learning fast.
- Risk babysitting. Let the agent close positions on hard rules (loss cap hit, funding turns hostile) even if you’re away from the screen.
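As a concrete example of the pre-trade checklist, here is a toy scorecard. The five check names, the equal weighting, and the 4-of-5 pass bar are assumptions chosen for the sketch.

```python
def checklist_score(setup):
    """Toy pre-trade scorecard over the five checks listed above.
    Equal weights and the 4-of-5 bar are illustrative assumptions."""
    checks = ["trend_aligned", "liquidity_ok", "funding_ok",
              "no_event_risk", "clean_invalidation"]
    score = sum(bool(setup.get(check)) for check in checks)
    return score, score >= 4            # trade only when 4+ checks pass

# Example:
# checklist_score({"trend_aligned": True, "liquidity_ok": True,
#                  "funding_ok": True, "no_event_risk": False,
#                  "clean_invalidation": True}) -> (4, True)
```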
Bottom Line
In this 72-hour trial, the podium went to DeepSeek Chat v3.1 (+38%), Grok 4 (+35%), and Claude Sonnet 4.5 (+25%). Qwen 3 Max finished modestly green (+9%), while GPT-5 (−27%) and Gemini 2.5 Pro (−31%) struggled. That tells us two things: (1) with sensible guardrails, modern agents can manage live perps without immediately imploding; (2) outcomes are highly sensitive to prompts, regimes, and execution plumbing. Before you hand an AI the keys to your account, give it seatbelts: strict loss caps, liquidity filters, cool-downs, and a hard “do nothing” default when signals are weak. As ever in markets, process beats cleverness—and that includes the process you wrap around your AI.
Disclaimer: This article describes a short, entertainment-style experiment. It is not investment advice or a recommendation to trade with AI. Leverage and derivatives carry significant risk, including loss of principal. Always do your own research, use strict risk limits, and never deploy capital you cannot afford to lose.