Validation
Bootstrap confidence intervals, overfitting protection
Validation tests are user-driven: each one is a standalone class you instantiate, run, and inspect. Results are rich metrics, not pass/fail booleans. You decide what the numbers mean.
Built-in tests
from horizon.validate import (
    Bootstrap,     # real block bootstrap CI on metrics
    OutOfSample,   # IS/OOS split (scaffold, runner coming in a future release)
    WalkForward,   # rolling train/test (scaffold, runner coming in a future release)
)
Bootstrap (real and working)
Block bootstrap with confidence intervals on any metric:
from horizon.validate import Bootstrap
# Some return series (daily log returns)
returns = [0.001, -0.002, 0.003, 0.0, -0.001, 0.002, ...]
bs = Bootstrap(
    metrics=["sharpe", "sortino", "max_drawdown", "cagr"],
    n_samples=1000,
    method="block",  # preserves autocorrelation
    block_size=20,
    seed=42,
)
result = bs.run(returns=returns)
# Point estimates
print(result.point_estimates["sharpe"])
# 95% CI on Sharpe
lo, hi = result.ci("sharpe", conf=0.95)
print(f"Sharpe 95% CI: [{lo:.3f}, {hi:.3f}]")
# Full distribution
distribution = result.distribution("sharpe")
How it works
- Resample the return series with the block bootstrap (preserves autocorrelation)
- Recompute each metric on each sample
- Report the distribution, point estimate, median, and quantile-based CI
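The three steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes a moving-block scheme (fixed-size, overlapping blocks drawn uniformly at random), a zero risk-free rate, and 252 periods per year. The helper names `sharpe`, `block_resample`, and `bootstrap_ci` are hypothetical.

```python
import math
import random
import statistics


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def block_resample(returns, block_size, rng):
    """Build one resample by concatenating randomly chosen contiguous blocks."""
    n = len(returns)
    if not 0 < block_size <= n:
        raise ValueError("block_size must be between 1 and len(returns)")
    sample = []
    while len(sample) < n:
        start = rng.randint(0, n - block_size)  # randint is inclusive on both ends
        sample.extend(returns[start:start + block_size])
    return sample[:n]  # trim the final block so the resample has the original length


def bootstrap_ci(returns, metric, n_samples=1000, block_size=20, conf=0.95, seed=42):
    """Return (point_estimate, median, (lo, hi)) for `metric` via block bootstrap."""
    rng = random.Random(seed)
    dist = sorted(
        metric(block_resample(returns, block_size, rng)) for _ in range(n_samples)
    )
    alpha = (1.0 - conf) / 2.0
    lo = dist[int(alpha * (n_samples - 1))]          # lower quantile
    hi = dist[int((1.0 - alpha) * (n_samples - 1))]  # upper quantile
    return metric(returns), statistics.median(dist), (lo, hi)


# Synthetic returns for demonstration only; not real market data.
demo_rng = random.Random(0)
demo_returns = [demo_rng.gauss(0.0005, 0.01) for _ in range(500)]

point, median, (lo, hi) = bootstrap_ci(demo_returns, sharpe, n_samples=200)
print(f"Sharpe {point:.3f}, median {median:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Sampling contiguous blocks rather than individual observations is what keeps short-range autocorrelation intact inside each block; a larger `block_size` preserves longer dependence at the cost of fewer distinct blocks to draw from.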
Threshold-based pass/fail (optional)
If you want a boolean flag:
bs = Bootstrap(
    metrics=["sharpe"],
    n_samples=1000,
    thresholds={"sharpe_ci_lo_min": 0.5},
)
result = bs.run(returns=returns)
if result.passed:
    print("Sharpe lower CI bound is at least 0.5")
Threshold keys take the form {metric}_ci_lo_min (lower CI bound) or {metric}_median_min (median). You supply the threshold value; the framework returns the pass/fail flag.
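To make the key convention concrete, here is a rough sketch of how such keys could be resolved against bootstrap statistics. The function `check_thresholds` and its arguments are hypothetical, and the numbers are illustrative placeholders, not real results; the library's internal logic may differ.

```python
def check_thresholds(thresholds, ci_lo, median):
    """Evaluate `{metric}_ci_lo_min` and `{metric}_median_min` keys.

    `ci_lo` and `median` map a metric name to the corresponding bootstrap
    statistic. Returns a dict of per-threshold booleans.
    """
    suffixes = {"_ci_lo_min": ci_lo, "_median_min": median}
    checks = {}
    for key, minimum in thresholds.items():
        for suffix, stats in suffixes.items():
            if key.endswith(suffix):
                metric = key[: -len(suffix)]  # strip the suffix to get the metric name
                checks[key] = stats[metric] >= minimum
                break
        else:
            raise ValueError(f"unrecognized threshold key: {key!r}")
    return checks


checks = check_thresholds(
    {"sharpe_ci_lo_min": 0.5, "sortino_median_min": 1.0},
    ci_lo={"sharpe": 0.62, "sortino": 0.8},   # illustrative values only
    median={"sharpe": 1.10, "sortino": 0.9},  # illustrative values only
)
# Overall flag: every supplied threshold must hold, and at least one must exist.
passed = bool(checks) and all(checks.values())
print(checks, passed)
```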
OutOfSample (scaffold)
Split history into in-sample and out-of-sample periods, run the strategy on each, and compare the results:
from horizon.validate import OutOfSample
oos = OutOfSample(
    train_pct=0.7,
    thresholds={
        "oos_sharpe_min": 0.5,
        "is_oos_sharpe_ratio_max": 2.0,
    },
)
result = oos.run(
    strategy=MyStrategy,
    backtest=BacktestConfig(initial_cash_usd=100_000),
    universe=my_universe,
    asset_classes=[Equity],
)
print(result.is_metrics)
print(result.oos_metrics)
print(result.degradation)
The runner is currently a scaffold: it returns zero for every metric until it is wired to the real backtest engine. The interface is stable; the implementation is planned for a future release.
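Until the runner is wired up, the underlying idea can be tried by hand on a return series. The sketch below assumes a chronological split at `train_pct` and compares the Sharpe ratio of each part; `split_in_out_of_sample` and `sharpe` are hypothetical helpers, and the "gap" shown is only one possible way to express degradation.

```python
import math
import random
import statistics


def split_in_out_of_sample(series, train_pct):
    """Chronological split: the first `train_pct` of observations is in-sample."""
    if not 0.0 < train_pct < 1.0:
        raise ValueError("train_pct must be strictly between 0 and 1")
    cut = round(len(series) * train_pct)
    return series[:cut], series[cut:]


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio (risk-free rate assumed 0)."""
    sd = statistics.stdev(returns)
    return 0.0 if sd == 0 else statistics.mean(returns) / sd * math.sqrt(periods_per_year)


# Synthetic returns for demonstration only; not real market data.
rng = random.Random(0)
all_returns = [rng.gauss(0.0005, 0.01) for _ in range(500)]

is_returns, oos_returns = split_in_out_of_sample(all_returns, train_pct=0.7)
is_sharpe, oos_sharpe = sharpe(is_returns), sharpe(oos_returns)
gap = is_sharpe - oos_sharpe  # a large positive gap hints at overfitting
print(f"IS Sharpe {is_sharpe:.3f}, OOS Sharpe {oos_sharpe:.3f}, gap {gap:.3f}")
```

The split is never shuffled: out-of-sample data must come strictly after in-sample data, otherwise information from the future leaks into the fit.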
WalkForward (scaffold)
Rolling train/test validation:
from horizon.validate import WalkForward
wf = WalkForward(
    train="2y",
    test="3m",
    step="3m",
    retune_params=["kelly_fraction"],
    tuner=None,  # plug in Optuna in a future release
)
result = wf.run(strategy=MyStrategy, backtest=my_bt_config)
print(result.aggregate_sharpe)
print(result.per_window_sharpe) # list[float]
print(result.per_window_drawdown) # list[float]
As with OutOfSample, the interface is stable and the runner is planned for a future release.
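The rolling train/test layout can be illustrated independently of the runner. This sketch expresses window sizes as observation counts rather than duration strings such as "2y" (how the library parses those strings is not shown here), and `walk_forward_windows` is a hypothetical helper.

```python
def walk_forward_windows(n_obs, train, test, step):
    """Yield (train_slice, test_slice) pairs over range(n_obs).

    All sizes are in observations. Each test window starts where its train
    window ends, and the pair advances by `step` observations per iteration.
    """
    if min(train, test, step) <= 0:
        raise ValueError("train, test, and step must be positive")
    start = 0
    while start + train + test <= n_obs:
        yield (
            slice(start, start + train),
            slice(start + train, start + train + test),
        )
        start += step


# Roughly 2 years of daily bars for training, 3 months for testing (assumed counts).
windows = list(walk_forward_windows(n_obs=1000, train=504, test=63, step=63))
for train_slice, test_slice in windows:
    print(f"train [{train_slice.start}, {train_slice.stop}) "
          f"test [{test_slice.start}, {test_slice.stop})")
```

With `step` equal to `test`, consecutive test windows tile the history without overlap, so per-window metrics can be stitched into one continuous out-of-sample record.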
The result object
Every validation result is a subclass of ValidationResult:
from dataclasses import dataclass
from typing import Any

@dataclass
class ValidationResult:
    test_name: str
    thresholds: dict[str, float]       # what you supplied
    threshold_checks: dict[str, bool]  # per-threshold pass/fail
    metadata: dict[str, Any]

    @property
    def passed(self) -> bool:
        """True iff all thresholds passed. False if no thresholds supplied."""
Plus serialization:
result.to_dict() # JSON-safe dict
result.to_json(indent=2) # JSON string
result.save("report.json")
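For readers who want to see the documented `passed` semantics and the JSON round trip in action, here is a small stand-in class. `SketchResult` is hypothetical and only mirrors the fields listed above; including `passed` in the serialized dict is an assumption of this sketch, not a documented detail of the real class.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class SketchResult:
    test_name: str
    thresholds: dict[str, float] = field(default_factory=dict)
    threshold_checks: dict[str, bool] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # all() of an empty collection is True, so the no-threshold case
        # needs an explicit guard to return False as documented.
        return bool(self.threshold_checks) and all(self.threshold_checks.values())

    def to_dict(self) -> dict[str, Any]:
        return {**asdict(self), "passed": self.passed}

    def to_json(self, indent: Optional[int] = None) -> str:
        return json.dumps(self.to_dict(), indent=indent)


no_thresholds = SketchResult(test_name="bootstrap")
with_thresholds = SketchResult(
    test_name="bootstrap",
    thresholds={"sharpe_ci_lo_min": 0.5},
    threshold_checks={"sharpe_ci_lo_min": True},
)
print(no_thresholds.passed, with_thresholds.passed)
print(with_thresholds.to_json(indent=2))
```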
Philosophy
The framework returns numbers. The user decides meaning.
There’s no hz.validate() that runs a canned suite. There’s a library of tests you compose manually:
from horizon.validate import Bootstrap, OutOfSample
bs_result = Bootstrap(metrics=["sharpe"], n_samples=1000).run(returns=rets)
oos_result = OutOfSample(train_pct=0.7).run(strategy=MyStrategy, backtest=bt)
# Your own decision logic
if (
    bs_result.ci("sharpe", 0.95)[0] > 0.3
    and oos_result.oos_metrics["sharpe"] > 0.5
    and oos_result.is_oos_gap_sharpe < 1.5
):
    print("deploy-worthy")
A junior quant doesn’t want a black-box verdict; they want the numbers so they can make the call themselves. The framework should never pretend to have more certainty than it has.
Roadmap
| Test | Status |
|---|---|
| Bootstrap | ✅ Real, working block bootstrap with CI/median/distribution |
| OutOfSample | ⏸️ Interface ready, needs backtest runner wiring |
| WalkForward | ⏸️ Interface ready, needs runner + Optuna integration |
| PurgedKFold | ⏸️ a future release (de Prado’s purged CV) |
| CPCV | ⏸️ a future release (combinatorial purged CV) |
| MonteCarloPermutation | ⏸️ a future release |
| DeflatedSharpe | ⏸️ a future release |
| PBO | ⏸️ a future release (probability of backtest overfit) |
| ParamStability | ⏸️ a future release |
| LookaheadAudit | ⏸️ a future release (static feature inspection) |