de Prado Validation
Purged K-Fold, CPCV, Deflated Sharpe Ratio, Probability of Backtest Overfitting
Marcos López de Prado’s Advances in Financial Machine Learning argues that standard ML validation methods systematically overstate performance on financial time series, for three reasons: returns are autocorrelated, naive train/test splits leak information, and trying many strategies inflates apparent performance. His techniques address each failure mode.
The four core techniques
Purged K-Fold
Why it matters
In financial ML, labels often span multiple bars (e.g., “did the price go up within 3 days?”). A test sample’s label may be computed using prices that a training sample also uses: a direct leak.
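Purging removes training samples whose label windows overlap with the label window of any test sample. Embargo additionally drops the training samples that fall in a fixed time gap immediately after each test fold, so that residual serial correlation cannot carry test information into training.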
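To make the two rules concrete, here is a minimal sketch in plain Python. It is not the planned horizon implementation: the function name `purged_train_indices`, the `(start_bar, end_bar)` label-span representation, and the bar-count `embargo` argument are all assumptions made for illustration.

```python
from typing import List, Tuple

def purged_train_indices(
    label_spans: List[Tuple[int, int]],
    test_idx: List[int],
    embargo: int = 0,
) -> List[int]:
    """Return training indices left after purging and embargo.

    label_spans[i] = (start_bar, end_bar) of the prices used to label sample i.
    (Hypothetical representation, chosen for this sketch.)
    """
    test_set = set(test_idx)
    test_spans = [label_spans[i] for i in test_idx]
    test_end = max(end for _, end in test_spans)  # last bar any test label touches

    keep = []
    for i, (start, end) in enumerate(label_spans):
        if i in test_set:
            continue
        # Purge: the label window shares at least one bar with a test label window
        overlaps = any(start <= t_end and t_start <= end
                       for t_start, t_end in test_spans)
        # Embargo: the sample starts within `embargo` bars after the test labels end
        embargoed = test_end < start <= test_end + embargo
        if not overlaps and not embargoed:
            keep.append(i)
    return keep

# 10 samples, one per bar; each label looks 2 bars ahead
spans = [(t, t + 2) for t in range(10)]
train = purged_train_indices(spans, test_idx=[4, 5], embargo=1)
print(train)  # [0, 1, 9]
```

Samples 2, 3, 6 and 7 are purged because their label windows touch bars 4 to 7, and sample 8 is dropped by the one-bar embargo.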
API
from horizon.validate import PurgedKFold # planned
pkf = PurgedKFold(
    n_splits=5,
    embargo_pct=0.02,   # 2% of sample size as post-test embargo
    purge_window="5d",  # purge training samples within 5 days of any test sample
)
result = pkf.run(
    strategy=MyStrategy,
    backtest=BacktestConfig(start="2020-01-01", end="2024-12-31", ...),
    universe=...,
    asset_classes=[Equity],
)
for fold_metrics in result.fold_metrics:
    print(fold_metrics)
print(f"Mean Sharpe: {result.mean_sharpe:.3f}")
print(f"Std Sharpe: {result.std_sharpe:.3f}")
print(f"IQR: {result.sharpe_iqr}")
CPCV: Combinatorial Purged Cross-Validation
Standard walk-forward gives you one backtest path. CPCV gives you many. Given n_splits folds with n_test_groups of them held out at a time, CPCV runs a backtest for every combination of held-out folds, yielding a distribution of performance estimates.
With n_splits=10 and n_test_groups=2, you get C(10,2) = 45 train/test combinations, each with its own Sharpe. (In de Prado’s stricter accounting, those 45 splits recombine into (2/10) × 45 = 9 full-length backtest paths.) The spread of that distribution tells you whether the observed performance is reliable.
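The counting can be checked with the standard library. This sketch only enumerates which folds are held out in each combination; it does not run any backtest, and the variable names are illustrative rather than part of the planned API. The path count uses de Prado's formula (k/N) × C(N,k).

```python
from itertools import combinations
from math import comb

n_splits, n_test_groups = 10, 2

# Every way of holding out n_test_groups of the n_splits folds
test_combos = list(combinations(range(n_splits), n_test_groups))
print(len(test_combos))   # 45
print(test_combos[:3])    # [(0, 1), (0, 2), (0, 3)]

# Each fold is held out in C(9, 1) = 9 combinations, so the out-of-sample
# pieces can be stitched into 9 full-length backtest paths
n_paths = n_test_groups * comb(n_splits, n_test_groups) // n_splits
print(n_paths)            # 9
```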
API
from horizon.validate import CPCV # planned
cpcv = CPCV(
    n_splits=10,
    n_test_groups=2,
    embargo_pct=0.02,
)
result = cpcv.run(strategy=MyStrategy, backtest=my_config, ...)
print(f"Mean Sharpe: {result.sharpe_mean:.3f}")
print(f"Median: {result.sharpe_median:.3f}")
print(f"95% CI: {result.sharpe_ci(0.95)}")
print(f"P5-P95: {result.sharpe_p05:.3f} to {result.sharpe_p95:.3f}")
result.plot_distribution() # histogram of all 45 paths
Deflated Sharpe Ratio (DSR)
The problem: you observe a Sharpe of 1.5 on your best strategy. But you tried 100 variants. How confident should you be that the true Sharpe is above zero?
Bailey and López de Prado’s answer is to deflate the observed Sharpe by accounting for:
- Multiple testing: trying 100 strategies makes it very likely that at least one will look good by chance
- Non-normality: returns have skew and fat tails; standard Sharpe formulas assume Gaussian
- Sample size: fewer observations → noisier Sharpe
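The multiple-testing effect is easy to reproduce. The sketch below simulates 100 strategies whose returns are pure Gaussian noise (true Sharpe of zero) over one year of daily data and reports the best annualized Sharpe among them. The return volatility, the 252-day year, and the seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def annualized_sharpe(daily_returns):
    """Sharpe ratio of daily returns, scaled to a 252-day year."""
    return mean(daily_returns) / stdev(daily_returns) * sqrt(252)

# 100 "strategies" that are pure noise: the true Sharpe of each is zero
n_trials, n_days = 100, 252
sharpes = [
    annualized_sharpe([random.gauss(0.0, 0.01) for _ in range(n_days)])
    for _ in range(n_trials)
]
best_sharpe = max(sharpes)

print(f"Mean Sharpe across trials: {mean(sharpes):.2f}")
print(f"Best Sharpe across trials: {best_sharpe:.2f}")
```

The mean sits near zero, as it should, while the best of the 100 looks like a strong strategy even though none of them has any edge.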
The deflated formula:
DSR = Φ( (SR_observed − SR_null) × √(T−1) / √(1 − γ₃·SR_observed + ((γ₄−1)/4)·SR_observed²) )
Where:
- SR_null is the expected max Sharpe from n_trials random strategies (accounts for multiple testing)
- γ₃, γ₄ are sample skew and kurtosis (account for non-normality)
- T is the number of observations
- Φ is the standard normal CDF
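DSR is interpreted as the probability that the true Sharpe is > 0 once the selection bias from multiple trials is accounted for. A DSR of 0.95 means you’re roughly 95% confident the strategy has real edge. Two details matter when applying the formula: the Sharpe ratios must be expressed at the same frequency as T (a daily Sharpe with T daily observations, not an annualized one), and γ₄ is raw kurtosis, which equals 3 for Gaussian returns.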
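As a worked example, the formula can be evaluated with the standard library alone. This is a sketch under stated assumptions, not the planned `DeflatedSharpe` implementation. The function names are hypothetical, `expected_max_sharpe` uses the approximation from Bailey and López de Prado's paper, and `sr_std` (the standard deviation of Sharpe ratios across the trials) is an input that approximation needs but that the text above does not discuss. All input numbers are made up.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Approximate expected maximum Sharpe among n_trials skill-less strategies.

    Requires n_trials > 1. sr_std is the std of Sharpe ratios across trials.
    """
    return sr_std * (
        (1 - EULER_GAMMA) * STD_NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )

def deflated_sharpe(sr: float, sr_null: float, T: int,
                    skew: float, kurt: float) -> float:
    """DSR as defined above; sr and sr_null are per-observation Sharpe ratios."""
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return STD_NORMAL.cdf((sr - sr_null) * sqrt(T - 1) / denom)

# Hypothetical inputs: daily Sharpe of 0.10 (about 1.6 annualized),
# five years of daily data, mild negative skew, fat tails, 100 trials
sr_null = expected_max_sharpe(n_trials=100, sr_std=0.03)
dsr = deflated_sharpe(sr=0.10, sr_null=sr_null, T=1250, skew=-0.5, kurt=5.0)
psr = deflated_sharpe(sr=0.10, sr_null=0.0, T=1250, skew=-0.5, kurt=5.0)

print(f"SR_null: {sr_null:.4f}")
print(f"DSR (100 trials): {dsr:.3f}")
print(f"Same test ignoring multiple trials: {psr:.3f}")
```

Setting `sr_null` to zero gives the figure you would report if this were the only strategy ever tried; the gap between the two numbers is the cost of the other 99 trials.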
API
from horizon.validate import DeflatedSharpe # planned
ds = DeflatedSharpe(n_trials=100)
result = ds.run(strategy=MyStrategy, backtest=my_config, ...)
print(f"Observed Sharpe: {result.observed_sharpe:.3f}")
print(f"Deflated Sharpe: {result.deflated_sharpe:.3f}")
print(f"P(true Sharpe > 0): {result.dsr_probability:.2%}")
print(f"Skew: {result.skew:.3f}")
print(f"Kurtosis: {result.kurtosis:.3f}")
print(f"Adjustment components: {result.components}")
PBO: Probability of Backtest Overfitting
The problem: you ran 50 parameter combinations and picked the one with the best in-sample Sharpe. How often would this selection process produce a below-median out-of-sample result?
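PBO uses Combinatorially Symmetric Cross-Validation (CSCV) to answer this empirically:
1. Split the data into N equal, contiguous groups (N even).
2. For every way of assigning N/2 groups to in-sample (IS) and the remaining N/2 to out-of-sample (OOS), pick the variant with the best IS Sharpe and record where it ranks among all variants OOS.
3. Compute PBO as the fraction of assignments in which the IS-best variant lands below the OOS median.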
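The CSCV steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the planned `PBO` class: it compares the selected variant against the OOS median directly instead of computing the logit of its relative rank as the paper does, and the names `pbo_cscv` and `sharpe` are hypothetical. The two inputs are simulated.

```python
import random
from itertools import combinations
from statistics import mean, median, stdev

def sharpe(xs):
    """Per-observation Sharpe ratio; 0.0 if the series has no variance."""
    s = stdev(xs)
    return mean(xs) / s if s > 0 else 0.0

def pbo_cscv(returns, n_splits):
    """returns[m] is the return series of variant m; all have equal length."""
    size = len(returns[0]) // n_splits
    groups = [range(g * size, (g + 1) * size) for g in range(n_splits)]
    below = total = 0
    for is_groups in combinations(range(n_splits), n_splits // 2):
        is_idx = [t for g in is_groups for t in groups[g]]
        oos_idx = [t for g in range(n_splits) if g not in is_groups
                   for t in groups[g]]
        is_sr = [sharpe([r[t] for t in is_idx]) for r in returns]
        oos_sr = [sharpe([r[t] for t in oos_idx]) for r in returns]
        best = max(range(len(returns)), key=is_sr.__getitem__)  # IS winner
        if oos_sr[best] < median(oos_sr):  # winner is below OOS median
            below += 1
        total += 1
    return below / total

random.seed(1)
# 20 pure-noise variants: selecting the IS winner is pure overfitting
noise = [[random.gauss(0.0, 1.0) for _ in range(240)] for _ in range(20)]
# Same set, but variant 0 now has a large genuine edge
with_edge = [[x + 1.0 for x in noise[0]]] + noise[1:]

pbo_noise = pbo_cscv(noise, n_splits=6)   # C(6,3) = 20 assignments
pbo_edge = pbo_cscv(with_edge, n_splits=6)
print(f"PBO, noise only: {pbo_noise:.2f}")
print(f"PBO, one real edge: {pbo_edge:.2f}")
```

When one variant has a real and persistent edge, it wins in-sample and stays on top out-of-sample, so PBO drops to zero; with noise only, the in-sample winner has no particular reason to beat the median out-of-sample.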
API
from horizon.validate import PBO # planned
pbo = PBO(
    n_trials=50,  # number of strategy variants you tested
    n_splits=16,  # CSCV split count
)
result = pbo.run(trials=list_of_parameter_configs, backtest=my_config, ...)
print(f"PBO: {result.pbo_probability:.1%}")
print(f"Rank correlation IS vs OOS: {result.rank_correlation:.3f}")
print(f"Slope of OOS vs IS: {result.slope:.3f}")
if result.pbo_probability > 0.3:
    print("⚠️ High probability of overfit")
Using them together
A reasonable research workflow:
from horizon.validate import Bootstrap, DeflatedSharpe, PBO, PurgedKFold
# 1. Basic in-sample metrics
result = hz.run(mode="backtest", strategy=MyStrategy, ...)
print(f"In-sample Sharpe: {result.sharpe:.3f}")
# 2. Block bootstrap CI: is the point estimate reliable?
bs = Bootstrap(metrics=["sharpe"], n_samples=1000)
bs_result = bs.run(returns=extract_returns(result))
print(f"95% CI: {bs_result.ci('sharpe', 0.95)}")
# 3. Purged k-fold: does the strategy generalize?
pkf = PurgedKFold(n_splits=5, embargo_pct=0.02)
pkf_result = pkf.run(strategy=MyStrategy, backtest=my_config, ...)
print(f"Mean fold Sharpe: {pkf_result.mean_sharpe:.3f}")
print(f"Std fold Sharpe: {pkf_result.std_sharpe:.3f}")
# 4. Deflated Sharpe: adjust for multiple testing
ds = DeflatedSharpe(n_trials=N_STRATEGIES_TRIED)
ds_result = ds.run(strategy=MyStrategy, backtest=my_config, ...)
print(f"DSR: {ds_result.dsr_probability:.2%}")
# 5. PBO: probability of overfit via CSCV
pbo = PBO(n_trials=N_STRATEGIES_TRIED)
pbo_result = pbo.run(trials=all_variants, backtest=my_config, ...)
print(f"PBO: {pbo_result.pbo_probability:.1%}")
# User decides whether to deploy
deploy = (
    bs_result.ci('sharpe', 0.95)[0] > 0.5      # CI lower bound > 0.5
    and pkf_result.mean_sharpe > 0.5           # folds agree
    and ds_result.dsr_probability > 0.90       # multiple-testing-adjusted
    and pbo_result.pbo_probability < 0.20      # low overfit probability
)
No single test is sufficient. Combining Bootstrap + PurgedKFold + DSR + PBO gives a robust picture.
Further reading
- “Advances in Financial Machine Learning” by López de Prado (the main reference)
- “The Probability of Backtest Overfitting” by Bailey, Borwein, López de Prado, Zhu (2014)
- “The Deflated Sharpe Ratio” by Bailey, López de Prado (2014)
- “Pseudo-Mathematics and Financial Charlatanism” by Bailey, Borwein, López de Prado, Zhu (2014): the cautionary tale that motivates all of this