de Prado Validation

Purged K-Fold, CPCV, Deflated Sharpe Ratio, Probability of Backtest Overfit

Marcos López de Prado’s Advances in Financial Machine Learning argues that standard ML validation methods systematically overstate performance on financial time series: returns are autocorrelated, naive train/test splits leak information, and trying many strategies inflates apparent performance. de Prado’s techniques address each failure mode.

The four core techniques

Purged K-Fold: Standard k-fold leaks via label overlap. Purged k-fold removes training samples whose labels overlap with test samples, and adds a time **embargo** after each test fold.
CPCV: Combinatorial Purged Cross-Validation. Produces many more backtest paths than walk-forward by testing every combination of held-out folds. Gives a distribution of performance, not a single point estimate.
Deflated Sharpe Ratio: Adjusts the observed Sharpe for non-normality (skew, kurtosis), sample size, and **multiple testing** (how many strategies you tried). Returns the probability that the true Sharpe is greater than zero after those adjustments.
PBO: Probability of Backtest Overfit. Uses Combinatorially Symmetric CV (CSCV) to estimate the probability that your in-sample best strategy will be **below median** out of sample. A PBO near 0.5 means you’re flipping coins.

Purged K-Fold

Why it matters

In financial ML, labels often span multiple bars (e.g., “did the price go up within 3 days?”). A test sample’s label may be computed using prices that a training sample also uses: a direct leak.

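Purging removes training samples whose label windows overlap with any test sample. Embargo additionally drops a fixed stretch of training samples immediately after each test fold, because serially correlated features can carry test-period information forward even when the labels no longer overlap.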
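
The purge and embargo rules can be sketched with plain NumPy. This is an illustration only, not horizon's implementation: the function name purged_kfold_indices and its arguments are hypothetical, samples are assumed to be sorted by time, and each label window is assumed to be given as integer bar positions.

```python
import numpy as np


def purged_kfold_indices(label_start, label_end, n_splits=5, embargo_pct=0.02):
    """Yield (train_idx, test_idx) pairs with purging and a post-test embargo.

    label_start[i] and label_end[i] are the bar positions at which sample i's
    label window opens and closes (hypothetical inputs for this sketch).
    """
    label_start = np.asarray(label_start)
    label_end = np.asarray(label_end)
    n = len(label_start)
    embargo = int(n * embargo_pct)  # embargo length in bars

    for test_idx in np.array_split(np.arange(n), n_splits):
        test_open = label_start[test_idx].min()
        test_close = label_end[test_idx].max()

        # Purge: drop any sample whose label window intersects the test window.
        overlaps = (label_start <= test_close) & (label_end >= test_open)

        # Embargo: also drop samples that start right after the test window.
        embargoed = (label_start > test_close) & (label_start <= test_close + embargo)

        train_mask = ~(overlaps | embargoed)
        train_mask[test_idx] = False
        yield np.flatnonzero(train_mask), test_idx


# Toy data: 100 daily samples, each labelled over the following 3 bars.
starts = np.arange(100)
ends = starts + 3
splits = list(purged_kfold_indices(starts, ends, n_splits=5, embargo_pct=0.02))
```

Each fold loses a few training samples on either side of the test window; that loss is the price of removing the leak.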

API

python
from horizon.validate import PurgedKFold   # planned

pkf = PurgedKFold(
    n_splits=5,
    embargo_pct=0.02,          # 2% of sample size as post-test embargo
    purge_window="5d",          # purge training samples within 5 days of any test sample
)

result = pkf.run(
    strategy=MyStrategy,
    backtest=BacktestConfig(start="2020-01-01", end="2024-12-31", ...),
    universe=...,
    asset_classes=[Equity],
)

for fold_metrics in result.fold_metrics:
    print(fold_metrics)

print(f"Mean Sharpe: {result.mean_sharpe:.3f}")
print(f"Std Sharpe:  {result.std_sharpe:.3f}")
print(f"IQR: {result.sharpe_iqr}")

CPCV: Combinatorial Purged Cross-Validation

Standard walk-forward gives you one backtest path. CPCV gives you many. Given n_splits folds and choosing n_test_groups held out at a time, CPCV runs a backtest for every combination, yielding a distribution of performance estimates.

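With n_splits=10 and n_test_groups=2, you get C(10,2) = 45 train/test combinations. In López de Prado’s construction, each group is held out in 9 of them, so the out-of-sample pieces recombine into 45 × 2/10 = 9 full-length backtest paths. Each has its own Sharpe, and the distribution tells you whether the observed performance is reliable.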
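
The counting can be checked directly with the standard library. In López de Prado’s book the combinations are train/test splits, and the number of full-length paths they recombine into is C(N,k) × k/N; the variable names below are illustrative and not part of any API.

```python
from itertools import combinations
from math import comb

n_splits, n_test_groups = 10, 2

# Every way of holding out n_test_groups of the n_splits groups.
test_group_sets = list(combinations(range(n_splits), n_test_groups))
n_combinations = len(test_group_sets)

# How many combinations hold out each individual group.
appearances = {g: sum(g in s for s in test_group_sets) for g in range(n_splits)}

# Each group is out-of-sample C(N-1, k-1) times, so the out-of-sample pieces
# can be stitched into that many full-length backtest paths.
n_full_paths = comb(n_splits - 1, n_test_groups - 1)

print(n_combinations)  # 45
print(n_full_paths)    # 9
```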

API

python
from horizon.validate import CPCV   # planned

cpcv = CPCV(
    n_splits=10,
    n_test_groups=2,
    embargo_pct=0.02,
)

result = cpcv.run(strategy=MyStrategy, backtest=my_config, ...)

print(f"Mean Sharpe: {result.sharpe_mean:.3f}")
print(f"Median:      {result.sharpe_median:.3f}")
print(f"95% CI:      {result.sharpe_ci(0.95)}")
print(f"P5-P95:      {result.sharpe_p05:.3f} to {result.sharpe_p95:.3f}")
result.plot_distribution()  # histogram of the Sharpe distribution

Deflated Sharpe Ratio (DSR)

The problem: you observe a Sharpe of 1.5 on your best strategy. But you tried 100 variants. How confident should you be that the true Sharpe is above zero?

Bailey and López de Prado’s answer: deflate the observed Sharpe by accounting for:

  1. Multiple testing: when you try 100 strategies, the best of them is very likely to look good by chance alone
  2. Non-normality: returns have skew and fat tails; standard Sharpe formulas assume Gaussian returns
  3. Sample size: fewer observations → noisier Sharpe

The deflated formula:

DSR = Φ((SR_observed - SR_null) × √(T-1) / √(1 - γ₃·SR_observed + (γ₄-1)/4·SR_observed²))

Where:

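  • SR_observed is the per-period (non-annualized) Sharpe of the selected strategy
  • SR_null is the expected max Sharpe from n_trials random strategies (accounts for multiple testing)
  • γ₃, γ₄ are the sample skewness and kurtosis of returns, where γ₄ is raw kurtosis, equal to 3 for a Gaussian (account for non-normality)
  • T is the number of observations
  • Φ is the standard normal CDF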

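DSR is interpreted as the probability that the true Sharpe is > 0 once selection across n_trials is taken into account. A DSR of 0.95 means you’re 95% confident the strategy has real edge rather than being the luckiest of the variants you tried.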
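
The formula can be evaluated with the standard library alone. The sketch below is not horizon's implementation; the function names are hypothetical. It uses the expected-maximum approximation from Bailey and López de Prado’s paper for SR_null, which needs one extra input that the formula above does not show: sharpe_std, the standard deviation of Sharpe ratios across the trials. All Sharpe values are per-period, and n_trials must be greater than 1.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_norm = NormalDist()


def expected_max_sharpe(n_trials, sharpe_std):
    """Expected maximum Sharpe among n_trials strategies with no edge (SR_null)."""
    a = _norm.inv_cdf(1 - 1 / n_trials)
    b = _norm.inv_cdf(1 - 1 / (n_trials * e))
    return sharpe_std * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, skew, kurt, n_trials, sharpe_std):
    """DSR for a per-period Sharpe sr; kurt is raw kurtosis (3 for a Gaussian)."""
    sr_null = expected_max_sharpe(n_trials, sharpe_std)
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return _norm.cdf((sr - sr_null) * sqrt(n_obs - 1) / denom)


# Made-up inputs: daily Sharpe of 0.1 over 1250 days, 100 variants tried.
dsr = deflated_sharpe(sr=0.1, n_obs=1250, skew=-0.5, kurt=5.0,
                      n_trials=100, sharpe_std=0.03)
print(f"DSR: {dsr:.2%}")
```

Raising n_trials or sharpe_std raises SR_null and therefore lowers the DSR, which is the multiple-testing penalty at work.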

API

python
from horizon.validate import DeflatedSharpe   # planned

ds = DeflatedSharpe(n_trials=100)
result = ds.run(strategy=MyStrategy, backtest=my_config, ...)

print(f"Observed Sharpe: {result.observed_sharpe:.3f}")
print(f"Deflated Sharpe: {result.deflated_sharpe:.3f}")
print(f"P(true Sharpe > 0): {result.dsr_probability:.2%}")
print(f"Skew: {result.skew:.3f}")
print(f"Kurtosis: {result.kurtosis:.3f}")
print(f"Adjustment components: {result.components}")

PBO: Probability of Backtest Overfit

The problem: you ran 50 parameter combinations and picked the one with the best in-sample Sharpe. How often would this selection process produce a below-median out-of-sample result?

PBO uses Combinatorially Symmetric Cross-Validation (CSCV) to answer this empirically:

Split data into N equal groups

e.g., 16 sub-periods.

For every way of assigning N/2 groups to IS and the other N/2 to OOS

Rank all trials on IS performance. Rank them again on OOS performance. Record the rank correlation and where the IS-best trial lands in the OOS ranking.

Compute PBO

Fraction of cases where the IS-best trial is below median on OOS. A PBO of 0.5 means IS performance has zero predictive value for OOS: pure overfitting.
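
The steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not horizon's implementation: the function names are hypothetical, the input is assumed to be a (T, n_trials) matrix of per-period returns with one column per trial, and performance is measured by the per-period Sharpe.

```python
from itertools import combinations

import numpy as np


def sharpe(returns):
    """Per-period Sharpe of each column of a (T, n_trials) return matrix."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)


def pbo_cscv(returns, n_groups=8):
    """Estimate PBO by CSCV from a (T, n_trials) matrix of per-period returns."""
    groups = np.array_split(np.arange(returns.shape[0]), n_groups)
    n_trials = returns.shape[1]
    at_or_below_median = []
    for is_ids in combinations(range(n_groups), n_groups // 2):
        oos_ids = [g for g in range(n_groups) if g not in is_ids]
        is_rows = np.concatenate([groups[g] for g in is_ids])
        oos_rows = np.concatenate([groups[g] for g in oos_ids])

        best = np.argmax(sharpe(returns[is_rows]))  # IS-best trial
        oos_sharpe = sharpe(returns[oos_rows])

        # Relative OOS rank of the IS-best trial, strictly between 0 and 1.
        rank = (oos_sharpe < oos_sharpe[best]).sum() + 1
        omega = rank / (n_trials + 1)
        at_or_below_median.append(omega <= 0.5)
    return float(np.mean(at_or_below_median))


rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(800, 20))  # 20 trials with no real edge
pbo_noise = pbo_cscv(noise, n_groups=8)

edge = noise.copy()
edge[:, 0] += 0.01                             # give one trial a large, real edge
pbo_edge = pbo_cscv(edge, n_groups=8)
```

When one trial has a genuine and large edge it wins both in and out of sample in every split, so the estimate drops to zero; with pure noise the IS winner has no reason to stay ahead out of sample.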

API

python
from horizon.validate import PBO   # planned

pbo = PBO(
    n_trials=50,              # number of strategy variants you tested
    n_splits=16,              # CSCV split count
)

result = pbo.run(trials=list_of_parameter_configs, backtest=my_config, ...)

print(f"PBO: {result.pbo_probability:.1%}")
print(f"Rank correlation IS vs OOS: {result.rank_correlation:.3f}")
print(f"Slope of OOS vs IS: {result.slope:.3f}")

if result.pbo_probability > 0.3:
    print("⚠️ High probability of overfit")

Using them together

A reasonable research workflow:

python
from horizon.validate import Bootstrap, DeflatedSharpe, PBO, PurgedKFold

# 1. Basic in-sample metrics
result = hz.run(mode="backtest", strategy=MyStrategy, ...)
print(f"In-sample Sharpe: {result.sharpe:.3f}")

# 2. Block bootstrap CI: is the point estimate reliable?
bs = Bootstrap(metrics=["sharpe"], n_samples=1000)
bs_result = bs.run(returns=extract_returns(result))
print(f"95% CI: {bs_result.ci('sharpe', 0.95)}")

# 3. Purged k-fold: does the strategy generalize?
pkf = PurgedKFold(n_splits=5, embargo_pct=0.02)
pkf_result = pkf.run(strategy=MyStrategy, backtest=my_config, ...)
print(f"Mean fold Sharpe: {pkf_result.mean_sharpe:.3f}")
print(f"Std fold Sharpe:  {pkf_result.std_sharpe:.3f}")

# 4. Deflated Sharpe: adjust for multiple testing
ds = DeflatedSharpe(n_trials=N_STRATEGIES_TRIED)
ds_result = ds.run(strategy=MyStrategy, backtest=my_config, ...)
print(f"DSR: {ds_result.dsr_probability:.2%}")

# 5. PBO: probability of overfit via CSCV
pbo = PBO(n_trials=N_STRATEGIES_TRIED)
pbo_result = pbo.run(trials=all_variants, backtest=my_config, ...)
print(f"PBO: {pbo_result.pbo_probability:.1%}")

# User decides whether to deploy
deploy = (
    bs_result.ci('sharpe', 0.95)[0] > 0.5      # CI lower bound > 0.5
    and pkf_result.mean_sharpe > 0.5            # folds agree
    and ds_result.dsr_probability > 0.90        # multiple-testing-adjusted
    and pbo_result.pbo_probability < 0.20       # low overfit probability
)

No single test is sufficient. Combining Bootstrap + PurgedKFold + DSR + PBO gives a robust picture.
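
The block bootstrap in step 2 resamples contiguous blocks of returns rather than individual returns, so that serial correlation within a block is preserved. A minimal moving-block sketch in NumPy is given here for orientation; the function name, the block length of 20, and the percentile interval are assumptions of this example and say nothing about how horizon's Bootstrap works.

```python
import numpy as np


def block_bootstrap_sharpe_ci(returns, block_len=20, n_samples=1000,
                              level=0.95, seed=0):
    """Percentile CI for the per-period Sharpe via a moving-block bootstrap."""
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block_len)          # ceiling division
    offsets = np.arange(block_len)
    sharpes = np.empty(n_samples)
    for i in range(n_samples):
        # Draw random block start positions, stitch the blocks, trim to n.
        block_starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = (block_starts[:, None] + offsets).ravel()[:n]
        sample = returns[idx]
        sharpes[i] = sample.mean() / sample.std(ddof=1)
    tail = (1 - level) / 2
    lo, hi = np.quantile(sharpes, [tail, 1 - tail])
    return float(lo), float(hi)


rng = np.random.default_rng(1)
daily = rng.normal(0.0005, 0.01, size=1000)  # synthetic daily returns
lo, hi = block_bootstrap_sharpe_ci(daily)
```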

Further reading

  • “Advances in Financial Machine Learning” by López de Prado: the main reference
  • “The Probability of Backtest Overfitting” by Bailey, Borwein, López de Prado, Zhu (2014)
  • “The Deflated Sharpe Ratio” by Bailey, López de Prado (2014)
  • “Pseudo-Mathematics and Financial Charlatanism” by Bailey, Borwein, López de Prado, Zhu (2014): the cautionary tale that motivates all of this
