Every additional backtest run increases the chance of finding good results by luck. The Deflated Sharpe Ratio (DSR) corrects for exactly this: it measures whether your result is real — or the best noise from N trials. Pro+ feature.

A new Pro+ feature is live: Deflated Sharpe Ratio (DSR).

It solves a problem that almost nobody talks about — but that affects everyone who runs backtests seriously.

The problem: multiple testing

Imagine flipping a coin 10 times. With a bit of luck you get 8 heads. Nobody would call that coin "proven better than random": with so few flips, randomness alone produces results like this, and if you flip enough coins, one of them almost certainly will.
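To put numbers on the coin example, here is a minimal sketch assuming a fair coin and independent flips. The function names are illustrative and not part of the product.

```typescript
// Binomial coefficient C(n, k), built up iteratively to avoid large factorials.
function choose(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// P(at least `minHeads` heads in `flips` tosses of a fair coin).
function tailProbability(flips: number, minHeads: number): number {
  let favorable = 0;
  for (let k = minHeads; k <= flips; k++) {
    favorable += choose(flips, k);
  }
  return favorable / Math.pow(2, flips);
}

// P(at least one of `coins` independent fair coins hits an event of probability p).
function atLeastOne(p: number, coins: number): number {
  return 1 - Math.pow(1 - p, coins);
}

const pSingle = tailProbability(10, 8); // 56 / 1024, about 5.5%
const pAnyOf20 = atLeastOne(pSingle, 20); // about 67.5%
```

One coin showing 8 or more heads is unusual. Among 20 coins, seeing at least one such result is the likely outcome.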

Backtesting has the same dynamic, just more subtle:

The more parameter combinations or strategies you test on the same data, the more likely you are to find one that looks good by chance.

The classic Sharpe ratio ignores this entirely. It measures the outcome of the current run — not how many tries it took to get there.
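The selection effect can be reproduced with pure noise. The sketch below assumes i.i.d. standard normal returns with zero mean, so no strategy has any edge, and uses a small seeded generator so the experiment is repeatable. All names and parameter values are illustrative.

```typescript
// Small seeded PRNG (mulberry32 style) returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via the Box-Muller transform.
function standardNormal(rand: () => number): number {
  const u1 = 1 - rand(); // in (0, 1], keeps the logarithm finite
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Per-period Sharpe: mean divided by standard deviation (1/n estimator).
function sharpe(returns: number[]): number {
  const n = returns.length;
  const mean = returns.reduce((s, r) => s + r, 0) / n;
  const variance = returns.reduce((s, r) => s + (r - mean) * (r - mean), 0) / n;
  return mean / Math.sqrt(variance);
}

// Best Sharpe among `trials` strategies that are pure noise.
function bestNoiseSharpe(trials: number, periods: number, rand: () => number): number {
  let best = -Infinity;
  for (let t = 0; t < trials; t++) {
    const returns = Array.from({ length: periods }, () => standardNormal(rand));
    best = Math.max(best, sharpe(returns));
  }
  return best;
}

// Average of the best-of-N Sharpe over many repetitions.
function averageBest(trials: number, periods: number, reps: number, seed: number): number {
  const rand = mulberry32(seed);
  let sum = 0;
  for (let i = 0; i < reps; i++) sum += bestNoiseSharpe(trials, periods, rand);
  return sum / reps;
}

const single = averageBest(1, 100, 200, 42); // one try: close to zero on average
const bestOf20 = averageBest(20, 100, 200, 42); // best of 20 tries: clearly positive
```

Nothing about the strategies changed between the two calls. Only the number of tries did.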

The fix: Probabilistic and Deflated Sharpe

We implement two corrections from "The Deflated Sharpe Ratio" (Bailey & López de Prado, 2014, Journal of Portfolio Management):

PSR(0) — Probabilistic Sharpe Ratio:

P(true Sharpe > 0)

The estimated probability that the true Sharpe is above zero, given how many returns you observed and how they are distributed. It accounts for skewness and kurtosis of returns, because backtest returns are rarely normally distributed.
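In the paper, PSR against a benchmark SR* is Φ((SR̂ − SR*)·√(n−1) / √(1 − γ₃·SR̂ + (γ₄−1)/4·SR̂²)), with γ₃ the skewness and γ₄ the kurtosis of returns (3 for a normal distribution). The following is a sketch of that formula, not the code in deflatedSharpe.ts. The name probabilisticSharpe and its parameter list are assumptions made for illustration.

```typescript
// Standard normal CDF, Abramowitz & Stegun 26.2.17 polynomial approximation.
function normCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const poly =
    t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  const upperTail = (Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI)) * poly;
  return x >= 0 ? 1 - upperTail : upperTail;
}

// Probability that the true per-period Sharpe exceeds `benchmark`.
// sharpe: observed per-period (not annualized) Sharpe
// n: number of return observations
// skew, kurt: sample skewness and kurtosis (kurt = 3 for normal returns)
function probabilisticSharpe(
  sharpe: number,
  n: number,
  skew: number,
  kurt: number,
  benchmark = 0
): number {
  const denom = Math.sqrt(1 - skew * sharpe + ((kurt - 1) / 4) * sharpe * sharpe);
  const z = ((sharpe - benchmark) * Math.sqrt(n - 1)) / denom;
  return normCdf(z);
}

// Same Sharpe and sample size, different return shapes.
const psrNormal = probabilisticSharpe(0.1, 101, 0, 3); // about 0.84
const psrFatTails = probabilisticSharpe(0.1, 101, -1, 6); // lower than psrNormal
```

Negative skew and fat tails widen the uncertainty around the Sharpe estimate, so the same measured Sharpe earns a lower probability.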

DSR — Deflated Sharpe Ratio:

PSR evaluated at the random benchmark SR̂₀

SR̂₀ is the Sharpe you would statistically expect as the best result from N random trials (no real edge). DSR measures: does your result clear this bar — or is it just noise?
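The paper gives the benchmark as SR̂₀ = √V·((1−γ)·Φ⁻¹(1−1/N) + γ·Φ⁻¹(1−1/(N·e))), where V is the variance of the Sharpe estimates across trials and γ is the Euler-Mascheroni constant. Below is a sketch under stated assumptions: the function names are illustrative, sharpeStd (√V) is passed in by the caller because how the product estimates it is not described here, and the inverse normal uses plain bisection as a simple stand-in rather than Acklam's approximation.

```typescript
const EULER_GAMMA = 0.5772156649015329;

// Standard normal CDF, Abramowitz & Stegun 26.2.17.
function normCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const poly =
    t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  const upperTail = (Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI)) * poly;
  return x >= 0 ? 1 - upperTail : upperTail;
}

// Inverse normal CDF by bisection on normCdf (stand-in for a rational approximation).
function normInv(p: number): number {
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normCdf(mid) < p) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

// Expected best per-period Sharpe among nTrials trials with no real edge.
function benchmarkSharpe(nTrials: number, sharpeStd: number): number {
  if (nTrials < 2) throw new Error("deflation needs at least 2 trials");
  const a = normInv(1 - 1 / nTrials);
  const b = normInv(1 - 1 / (nTrials * Math.E));
  return sharpeStd * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b);
}

// DSR: the PSR formula evaluated at the benchmark instead of at zero.
function deflatedSharpe(
  sharpe: number,
  n: number,
  skew: number,
  kurt: number,
  nTrials: number,
  sharpeStd: number
): number {
  const sr0 = benchmarkSharpe(nTrials, sharpeStd);
  const denom = Math.sqrt(1 - skew * sharpe + ((kurt - 1) / 4) * sharpe * sharpe);
  return normCdf(((sharpe - sr0) * Math.sqrt(n - 1)) / denom);
}
```

With sharpeStd set to 1 the benchmark function returns the bare multiplier: roughly 0.85 for N=3, 1.57 for N=10 and 2.53 for N=100. The bar rises with every additional trial, so the same measured Sharpe yields a lower DSR.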

Verdict: pass (≥95%) / borderline (90–95%) / fail (<90%) / single_trial (first run, deflation not yet possible) / insufficient (too few trades for reliable statistics).
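The verdict rules can be sketched as a small classifier. This is an illustration only: the order of the checks and the MIN_TRADES cutoff are assumptions, since the exact trade count behind "insufficient" is not stated here.

```typescript
type Verdict = "pass" | "borderline" | "fail" | "single_trial" | "insufficient";

interface VerdictInput {
  probability: number; // DSR as a fraction between 0 and 1
  nTrials: number; // runs counted for this strategy/asset/interval/period
  nTrades: number; // trades in the current backtest
}

// Hypothetical cutoff for illustration; the real threshold is not given in the article.
const MIN_TRADES = 30;

function classify({ probability, nTrials, nTrades }: VerdictInput): Verdict {
  if (nTrades < MIN_TRADES) return "insufficient"; // too few trades for reliable statistics
  if (nTrials <= 1) return "single_trial"; // first run, nothing to deflate against
  if (probability >= 0.95) return "pass";
  if (probability >= 0.9) return "borderline";
  return "fail";
}
```

For example, a DSR of 0.92 after 3 trials with enough trades maps to "borderline".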

What you see in the UI

After each backtest (Pro+, excluding reference strategies), a new block appears below the Arena Score:

🛡️ Deflated Sharpe (DSR)

📊 Sharpe (annualized)   🎲 PSR(0)        🛡️ DSR
        1.34               96.3%           92.4%

🟡 Borderline — weak evidence after 3 trials (90–95%).
N=3 runs for this strategy/asset/period from your backtest history.

Below it: a collapsible methodology section that explains PSR and DSR in plain terms.

Why N trials matter

Every backtest run with the same strategy, asset, interval and period but different parameters adds one to the trial count N.

Test Golden Cross with 50/200, then 50/150, then 40/200: that is N=3. With each run, the deflation threshold SR̂₀ rises. Your final result must be credibly better than the best Sharpe you would expect from 3 attempts with no real edge.

On the first run (N=1), there is no deflation yet — the block shows PSR(0) with verdict single_trial.
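The counting rule above can be sketched as follows. The BacktestRun shape and its field names are hypothetical, and treating a repeated, identical parameter set as the same trial is an assumption of this sketch rather than documented behavior.

```typescript
// Hypothetical shape of a stored run; field names are illustrative.
interface BacktestRun {
  strategy: string;
  asset: string;
  interval: string;
  period: string;
  params: Record<string, number>;
}

// Runs belong to the same trial family when these four fields match.
function familyKey(run: BacktestRun): string {
  return [run.strategy, run.asset, run.interval, run.period].join("|");
}

// Order-independent fingerprint of a parameter set.
function paramsKey(params: Record<string, number>): string {
  return Object.keys(params)
    .sort()
    .map((k) => k + "=" + params[k])
    .join(",");
}

// N = number of distinct parameter sets tried within the current run's family.
function countTrials(history: BacktestRun[], current: BacktestRun): number {
  const family = familyKey(current);
  const seen = new Set<string>();
  for (const run of history.concat([current])) {
    if (familyKey(run) === family) seen.add(paramsKey(run.params));
  }
  return seen.size;
}

const base = { strategy: "golden_cross", asset: "BTC", interval: "1d", period: "2020-2024" };
const history: BacktestRun[] = [
  { ...base, params: { fast: 50, slow: 200 } },
  { ...base, params: { fast: 50, slow: 150 } },
  { ...base, asset: "ETH", params: { fast: 50, slow: 200 } }, // different asset: not counted
];
const nTrials = countTrials(history, { ...base, params: { fast: 40, slow: 200 } }); // 3
```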

What changes and what doesn't

DSR does not replace the Arena Score — that remains the primary quality score for robustness, CAGR outperformance and trade frequency.

DSR asks a different question: "Is this result statistically distinguishable from noise?" Arena Score asks: "Is this result practically useful?"

Both questions matter, and the two together are more honest than any single number.

Technical details

Implementation: computeReturnMoments() (biased 1/n estimators) · psr() · expectedMaxSharpe() · computeDsr() — all in src/lib/stats/deflatedSharpe.ts
normCdf: Abramowitz & Stegun 26.2.17 (max error 7.5e-8)
normInv: Peter Acklam (2003)
Periodicity: trade-level returns (pnlPct / 100), annualized for display (√252 / √52 / √12)
sharpe_per_period is now stored for all runs (all asset classes) in backtest_runs
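As a rough illustration of the first and fourth points, the sketch below computes return moments with biased 1/n estimators and annualizes a per-period Sharpe by the square root of the periods per year. It is not the code in deflatedSharpe.ts: the function names and the ReturnMoments shape are assumptions, and which factor (252, 52 or 12) applies to a given run is left to the caller because the mapping is not spelled out here.

```typescript
interface ReturnMoments {
  n: number;
  mean: number;
  std: number; // biased (1/n) standard deviation
  skew: number; // third standardized moment
  kurt: number; // fourth standardized moment (3 for a normal distribution)
}

// Moments of a return series using biased 1/n estimators throughout.
function returnMoments(returns: number[]): ReturnMoments {
  const n = returns.length;
  const mean = returns.reduce((s, r) => s + r, 0) / n;
  let m2 = 0;
  let m3 = 0;
  let m4 = 0;
  for (const r of returns) {
    const d = r - mean;
    m2 += d * d;
    m3 += d * d * d;
    m4 += d * d * d * d;
  }
  m2 /= n;
  m3 /= n;
  m4 /= n;
  const std = Math.sqrt(m2);
  return { n, mean, std, skew: m3 / (std * std * std), kurt: m4 / (m2 * m2) };
}

// Scale a per-period Sharpe for display: 252 daily, 52 weekly, 12 monthly periods.
function annualizeSharpe(sharpePerPeriod: number, periodsPerYear: number): number {
  return sharpePerPeriod * Math.sqrt(periodsPerYear);
}

// Trade-level returns from percentage PnL values (example numbers only).
const pnlPct = [2.0, -1.0, 3.0, -2.0, 1.0];
const tradeReturns = pnlPct.map((p) => p / 100);
const moments = returnMoments(tradeReturns);
const sharpePerPeriod = moments.mean / moments.std;
const sharpeAnnualized = annualizeSharpe(sharpePerPeriod, 252);
```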

For those who want to go deeper on the paper: Bailey & López de Prado, "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality", Journal of Portfolio Management, 2014.

Why your Sharpe ratio lies — and how we correct for it