Five weeks ago it was 10,000 backtests; today it's over 40,000. With 4x the data, the "beats Buy & Hold" rate falls from 64% to 52%, four in five runs land below 50% win rate, and one memecoin still holds its 197,923% record. What a bigger sample does to selection bias, on real numbers.

35 days ago we wrote about the first 10,000 backtests. Today the Arena stands at over 40,000. The first ten thousand took 41 days. The next thirty thousand took 35.

That's the boring headline. The interesting one is in the numbers — and especially in how the numbers changed when the sample quadrupled. Because bigger samples aren't just "more of the same." They wash out what small samples flatter.

What changed about the pace

Milestone	Day	Total backtests	Avg/day to date
10k	Day 41	10,154	248
40k	Day 76	40,777	537

The second 30,000 ran at roughly 875 backtests per day, about 3.5 times the launch pace. Honest caveat: a large share of that jump is not user clicks but our systematic coverage engine. Since the 10k post, the daily pipeline that fills the strategy×asset matrix for Strategy Insights and the Edge Library has been running. That's exactly what we promised at 10k, and it changes how you have to read the aggregate numbers.
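The pace figures follow directly from the milestone table. A minimal Python sketch of that arithmetic, using only the numbers above (the variable names are ours, chosen for illustration):

```python
# Milestones from the table above: (day, cumulative backtests).
milestones = {"10k": (41, 10_154), "40k": (76, 40_777)}

day_10k, total_10k = milestones["10k"]
day_40k, total_40k = milestones["40k"]

# Average per day since launch, as in the "Avg/day to date" column.
avg_to_10k = total_10k / day_10k
avg_to_40k = total_40k / day_40k

# Pace of the second stretch alone: new runs divided by new days.
incremental_pace = (total_40k - total_10k) / (day_40k - day_10k)

print(round(avg_to_10k), round(avg_to_40k), round(incremental_pace))
print(f"{incremental_pace / avg_to_10k:.1f}x the launch pace")
```

The to-date average blends both stretches, which is why it sits well below the pace of the second stretch on its own.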

The strategy distribution now measures something else

At 10k the distribution was user-driven: RSI/SMA Cross had nearly 10x the runs of #2. Today:

Strategy	Runs	Share
rsi_sma	6,439	15.8%
stoch_rsi_sma	4,428	10.9%
ema_trend_bias	4,153	10.2%
rsi_ob_os	3,429	8.4%
golden_cross	3,428	8.4%
wma_trend	3,271	8.0%

The distribution flattened. The reason isn't that users suddenly love Stochastic RSI — it's that the coverage engine runs every strategy systematically across the universe. At 10k the distribution measured user taste. At 40k it measures our coverage. Both are valid, but they're not the same thing — and if you don't separate them, you read the wrong story out of the numbers.

The inconvenient truth, now even sharper

The win-rate distribution across all runs with at least 5 trades:

Win rate	Backtests
<30%	6,057
30–50%	14,890
50–60%	2,623
60–70%	1,393
70–80%	650
>80%	579

Of the runs with at least 5 trades, 80% come in under 50% win rate. At 10k it was 60%. Only 4.7% break above 70%: the "70% win-rate strategy" from trading Twitter lives in the top five percent, usually on a handful of trades.
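Both shares can be recomputed from the bucket counts in the table. A short sketch, with the counts copied from above and the dictionary keys being our own labels:

```python
# Win-rate buckets from the table above (runs with at least 5 trades).
buckets = {
    "<30%": 6_057,
    "30-50%": 14_890,
    "50-60%": 2_623,
    "60-70%": 1_393,
    "70-80%": 650,
    ">80%": 579,
}

total = sum(buckets.values())
below_50 = buckets["<30%"] + buckets["30-50%"]
above_70 = buckets["70-80%"] + buckets[">80%"]

print(f"total runs:    {total:,}")
print(f"under 50% win: {below_50 / total:.1%}")
print(f"above 70% win: {above_70 / total:.1%}")
```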

That doesn't mean these strategies are bad. An RSI/SMA strategy with a 30% win rate can be CAGR-positive if the winners are big enough. Win rate alone says nothing — it's the most-quoted and least-informative number in trading.
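Why a low win rate can still pay is easiest to see as per-trade expectancy. The sketch below is an illustration only: the payoff sizes are hypothetical numbers we picked, not platform data, and it ignores compounding and costs.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected return per trade.

    win_rate: fraction of winning trades (0..1)
    avg_win:  average gain of a winning trade, e.g. 0.09 for +9%
    avg_loss: average loss of a losing trade, as a positive number
    """
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


def breakeven_payoff_ratio(win_rate: float) -> float:
    """Win/loss size ratio at which expectancy is exactly zero."""
    return (1.0 - win_rate) / win_rate


# Hypothetical: 30% winners at +9%, losers at -3%.
low_win_rate = expectancy(0.30, avg_win=0.09, avg_loss=0.03)
# Hypothetical counter-example: 70% winners at +2%, losers at -6%.
high_win_rate = expectancy(0.70, avg_win=0.02, avg_loss=0.06)

print(f"30% win rate: {low_win_rate:+.4f} per trade")
print(f"70% win rate: {high_win_rate:+.4f} per trade")
print(f"break-even payoff ratio at 30%: {breakeven_payoff_ratio(0.30):.2f}")
```

With these made-up payoffs the 30% strategy has positive expectancy and the 70% strategy has negative expectancy, which is the point: win rate only means something next to the size of the wins and losses.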

The headline finding: 64% became 52%

At 10k, 64.3% of all comparable backtests (≥5 trades) beat Buy & Hold. Today, across 26,189 comparisons:

Metric	10k	40k
Beats single-point B&H	64.3%	51.6%

The rate dropped by more than twelve points, to almost a coin flip. That's not a regression, it's selection bias washing out. At 10k the dataset was smaller and more heavily shaped by users keeping their winners and deleting their blow-ups. With 4x the data, including many systematic coverage runs that nobody cherry-picked, reality converges on what it always was: active beats passive roughly half the time, before costs.
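How strongly deleted blow-ups can flatter a beat rate can be shown with a toy model. This is a sketch under our own simplifying assumption (every winner is kept, a fixed share of losers is deleted); the inputs are hypothetical and it does not estimate how many runs were actually deleted on the platform.

```python
def observed_beat_rate(true_rate: float, deleted_loser_share: float) -> float:
    """Beat rate seen in the surviving runs when some losers are deleted.

    Toy model (our assumption): every run that beats B&H is kept, and a
    fixed share of the runs that lose to B&H is deleted before counting.
    """
    kept_winners = true_rate
    kept_losers = (1.0 - true_rate) * (1.0 - deleted_loser_share)
    return kept_winners / (kept_winners + kept_losers)


# Hypothetical inputs: a true rate of 50% and a rising share of deleted losers.
for share in (0.0, 0.2, 0.4, 0.6):
    rate = observed_beat_rate(0.5, share)
    print(f"{share:.0%} of losers deleted -> observed beat rate {rate:.1%}")
```

In this model, deleting 40% of the losing runs is enough to turn a true 50% into an observed 62.5%. Systematic coverage runs are never deleted, so adding them pulls the observed rate back toward the true one.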

New at 40k: Avg B&H exposes timing luck

At 10k we had too few runs with a computed Avg B&H (the average of all possible entry points with at least 20% remaining duration — the fairer, harder benchmark). Now there are 2,518:

Comparison	Share
Beats single-point B&H	59.1%
Beats Avg B&H	60.4%
Of the single-B&H beaters: fail Avg B&H	16.0%

One in six backtests that beat a single (lucky) B&H entry fail against the average entry. That 16% is pure timing luck: a run that happened to be measured from the right bottom. This is exactly why we show Avg B&H as the primary benchmark: it strips the entry-timing luck out of the comparison.
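The Avg B&H idea described above can be sketched in a few lines. This is not the platform's implementation: measuring duration in bars, averaging total returns rather than CAGR, and the function names are all assumptions of this example, and the prices are synthetic.

```python
from statistics import mean


def single_point_bh(prices: list[float]) -> float:
    """Total return of buying at the first bar and holding to the last."""
    return prices[-1] / prices[0] - 1.0


def avg_bh(prices: list[float], min_remaining: float = 0.20) -> float:
    """Mean buy-and-hold return over all eligible entry bars.

    An entry bar is eligible if at least `min_remaining` of the full test
    duration is still ahead of it. Duration in bars and total returns
    (not CAGR) are assumptions of this sketch.
    """
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    last = len(prices) - 1
    returns = [
        prices[-1] / prices[i] - 1.0
        for i in range(last)
        if (last - i) / last >= min_remaining
    ]
    return mean(returns)


# Synthetic prices, for illustration only.
prices = [100.0, 50.0, 60.0, 80.0, 100.0, 120.0]
strategy_return = 0.40  # hypothetical strategy result over the same window

print(f"single-point B&H: {single_point_bh(prices):+.1%}")
print(f"Avg B&H:          {avg_bh(prices):+.1%}")
print("beats single-point B&H:", strategy_return > single_point_bh(prices))
print("beats Avg B&H:         ", strategy_return > avg_bh(prices))
```

In this synthetic series the single-point benchmark is +20% and the average over all eligible entries is +66%, so the hypothetical +40% strategy beats the first and fails the second. That is the kind of run counted in the 16%.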

Anecdotes — the returning cast

The memecoin defends its title. In the 10k post, a Chinese-named memecoin with 197,923% CAGR was the wildest run. At 40k: still undefeated. Buy & Hold on the same coin: 2,239% — the strategy beat B&H by a factor of 88. Win rate: 44%. Survivorship luck or real edge? The coin only exists because it pumped — that's the honest answer.

FTM is the more credible star. More interesting than the memecoin: FTMUSDT beats its Buy & Hold (≈130%) across three independent strategies — sha_smooth (259% CAGR), rsi_sma (250%), obv_macd_v2 (247%). When the same edge shows up across multiple uncorrelated logics, that's closer to signal than to luck. A memecoin miracle is an anecdote. Three strategies agreeing is a hint.

TSLA stays the patience king. The longest single run still goes back to 2010: Tesla, EMA Trend Bias, monthly bars, 16.3 years, 32.7% CAGR. Few trades, long holds — easy to romanticize in hindsight, easy to forget the 80% drawdown along the way.

A2Z stays cursed. The coin that killed every strategy in the 10k post is still here: A2ZUSDC with RSI/SMA over 8 years, −95.9% CAGR, 11% win rate. It has company in SAHARAFDUSD (−99.3%). Sometimes the asset is the problem, not the system.

Naked strategy vs. strategy with a filter

The most interesting new question at 40k: do the filters actually do anything? We split every comparable run (≥5 trades) into two camps — naked (no filter) vs. at least one filter active.

Group	Runs	Beats single-point B&H	Avg CAGR	Avg B&H
Naked	10,316	49.0%	23.8%	−0.6%
Filtered	15,873	53.3%	8.1%	2.1%

Filtered strategies beat Buy & Hold slightly more often (53% vs. 49%) — but at a much lower average CAGR (8% vs. 24%). That's not a contradiction, it's the deal: filters trade upside for consistency. They keep you out of the market during dangerous regimes — you dodge more disasters but also miss more moonshots. The high naked average CAGR is pulled up by crypto outliers (see the memecoin); the frequency of beating B&H is the more honest comparison.

Honest framing: the filters apply mostly to crypto, so the groups aren't perfectly comparable (different asset mix, different B&H baseline). This is a hint, not a controlled A/B.
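As a consistency check, the two group rates can be pooled by run count and compared with the overall figure. A small sketch using only the rows of the table above (the dictionary layout is ours):

```python
# Rows from the naked-vs-filtered table: (runs, share beating single-point B&H).
groups = {
    "naked": (10_316, 0.490),
    "filtered": (15_873, 0.533),
}

total_runs = sum(runs for runs, _ in groups.values())
# Weight each group's beat rate by its number of runs.
pooled_rate = sum(runs * rate for runs, rate in groups.values()) / total_runs
filtered_share = groups["filtered"][0] / total_runs

print(f"comparable runs:             {total_runs:,}")
print(f"pooled beat rate:            {pooled_rate:.1%}")
print(f"share of runs with a filter: {filtered_share:.1%}")
```

The pooled rate lands on the 51.6% reported across all 26,189 comparisons. Because the pooled figure is a weighted average, it moves with the mix of naked and filtered runs as well as with the rates themselves.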

And not every filter is equal:

Filter active	Beats single-point B&H
Bullmarket Stage	65.0%
Altcoin Season	62.6%
200-WMA	47.3%
ATR Volatility	42.1%

The regime filters, Bullmarket Stage and Altcoin Season, beat B&H far more often than the ATR volatility filter, which beat B&H in only 42.1% of runs and on average even trailed it. A filter is only as good as the regime question it asks.

Filter adoption — read carefully

Filter (Pro+)	10k	40k
200-WMA	5.6%	9.6%
Altcoin Season	9.2%	10.2%
ATR Volatility	8.6%	27.9%
Bullmarket Stage	0%	12.0%

Usage is up, most notably for ATR Volatility, which more than tripled (8.6% to 27.9%). But the honest footnote applies here too: part of this is our own regime-coverage runs, not just user choices. We separate that internally. Externally, what holds is this: filters are used more than at 10k, but each individual filter is still active in only a minority of runs.

What Backtesting Arena contributes here

We don't claim to be the only honest backtesting platform. What we do: show the aggregate numbers as they are — including the fact that the "beats B&H" rate falls as the sample grows. We separate user runs from systematic coverage instead of blending both into one flattering headline. And with Strategy Insights and the Edge Library we're building exactly the layer that systematically answers the survivorship question from the 10k post: which setups outperform robustly — and which only shone once.

The first three backtests are free. And if you've already run one: the best insights in this collection don't come from us. They come from you.

FAQ

Are the 40,000 real user backtests? Not all of them, and we say so openly. They're all platform runs: user runs plus the systematic coverage pipeline (admin bulk runs) that fills the strategy×asset matrix. Data is data, but we label what's what: the strategy distribution at 40k mostly reflects our coverage, not user taste.

Why does the "beats B&H" rate fall as the platform grows? Because the small 10k sample was flattered by selection bias — users keep winners, delete blow-ups. With 4x the data plus many non-picked coverage runs, the rate converges on the realistic value: about half, before costs.

Does a win rate below 50% mean a strategy is bad? No. A 30% win-rate strategy can be profitable if the winners are much larger than the losers. Win rate without expectancy is theater, which is why we optimize for "beats B&H by CAGR + drawdown," not for a single hit rate.

What is Avg B&H and why is it harder than regular Buy & Hold? Single-point B&H measures against one entry time — often a lucky bottom. Avg B&H averages over all possible entries (with at least 20% remaining duration) and removes the timing luck. 16% of the runs that beat single-point B&H fail against Avg B&H.

Is the 197,923% memecoin sold as proof of your strategies? No. It's an anecdote with explicitly stated survivorship bias: the coin only exists because it pumped. FTM is more credible — it beats its B&H across three independent strategies, and multiple confirmation is closer to signal than a single miracle.

Do the filters actually do anything — naked vs. filtered? Something, yes, but with a trade-off. Filtered runs beat single-point B&H more often (53% vs. 49% naked) but at a lower average CAGR (8% vs. 24%) — filters trade upside for consistency. And not every filter is equal: regime filters (Bullmarket Stage 65%, Altcoin Season 63%) beat B&H far more often than the ATR volatility filter (42%, on average behind B&H). Caveat: filters apply mostly to crypto, so it's not a perfectly controlled comparison.

How do you avoid cherry-picking the anecdotes? We show winners and losers (A2Z at −96%, SAHARAFD at −99%), state the sample size and trade count, and treat anything under 30 trades as an anecdote, not evidence.

Backtesting Arena

40,000 Backtests — What the Data Says Now (and What Changed Since 10k)