Validation

Bootstrap confidence intervals, overfitting protection

Validation tests are user-driven: each one is a standalone class you instantiate, run, and inspect. Results are rich metrics, not pass/fail booleans. You decide what the numbers mean.

Built-in tests

python
from horizon.validate import (
    Bootstrap,        # real block bootstrap CI on metrics
    OutOfSample,      # IS/OOS split (scaffold, runner coming in a future release)
    WalkForward,      # rolling train/test (scaffold, runner coming in a future release)
)

Bootstrap (real and working)

Block bootstrap with confidence intervals on any metric:

python
from horizon.validate import Bootstrap

# Some return series (daily log returns)
returns = [0.001, -0.002, 0.003, 0.0, -0.001, 0.002, ...]

bs = Bootstrap(
    metrics=["sharpe", "sortino", "max_drawdown", "cagr"],
    n_samples=1000,
    method="block",    # preserves autocorrelation
    block_size=20,
    seed=42,
)

result = bs.run(returns=returns)

# Point estimates
print(result.point_estimates["sharpe"])

# 95% CI on Sharpe
lo, hi = result.ci("sharpe", conf=0.95)
print(f"Sharpe 95% CI: [{lo:.3f}, {hi:.3f}]")

# Full distribution
distribution = result.distribution("sharpe")

How it works

  1. Resample the return series in contiguous blocks (block bootstrap), which preserves autocorrelation within each block
  2. Recompute each requested metric on every resampled series
  3. Report the full distribution, the point estimate on the original series, the median, and a quantile-based CI
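
The three steps above can be sketched in plain Python. This is an illustrative sketch, not horizon's implementation: the helper names (`sharpe`, `block_resample`, `bootstrap_ci`), the moving-block variant, the zero risk-free rate, and the 252-period annualization are all assumptions made for the example.

```python
import math
import random
import statistics


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    if var == 0:
        return 0.0
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def block_resample(returns, block_size, rng):
    """Moving-block bootstrap: concatenate random contiguous blocks, then trim."""
    n = len(returns)
    max_start = n - block_size
    sample = []
    while len(sample) < n:
        start = rng.randint(0, max_start)  # randint is inclusive on both ends
        sample.extend(returns[start:start + block_size])
    return sample[:n]


def bootstrap_ci(returns, metric, n_samples=1000, block_size=20, conf=0.95, seed=42):
    """Return (point_estimate, median, (lo, hi)) for `metric` over block resamples."""
    if not 1 <= block_size <= len(returns):
        raise ValueError("block_size must be between 1 and len(returns)")
    rng = random.Random(seed)
    # Step 2: recompute the metric on every resampled series.
    draws = sorted(
        metric(block_resample(returns, block_size, rng)) for _ in range(n_samples)
    )
    # Step 3: quantile-based interval read off the sorted draws.
    alpha = (1.0 - conf) / 2.0
    lo = draws[int(alpha * (n_samples - 1))]
    hi = draws[int((1.0 - alpha) * (n_samples - 1))]
    return metric(returns), statistics.median(draws), (lo, hi)


# Synthetic returns for illustration only; not real market data.
_rng = random.Random(0)
demo_returns = [_rng.gauss(0.0005, 0.01) for _ in range(500)]
point, median, (lo, hi) = bootstrap_ci(demo_returns, sharpe)
print(f"Sharpe {point:.3f}, median {median:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Fixing the seed makes the interval reproducible across runs, which is presumably why `Bootstrap` accepts a `seed` argument.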

Threshold-based pass/fail (optional)

If you want a boolean flag:

python
bs = Bootstrap(
    metrics=["sharpe"],
    n_samples=1000,
    thresholds={"sharpe_ci_lo_min": 0.5},
)
result = bs.run(returns=returns)

if result.passed:
    print("Sharpe lower CI bound is at least 0.5")

Threshold keys take the form {metric}_ci_lo_min (lower CI bound) or {metric}_median_min (median). You supply the threshold value; the framework returns the pass/fail flag.
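
To make the key convention concrete, here is one way such keys could be evaluated. `check_thresholds` is a hypothetical helper written for this example, not part of horizon, and the `>=` comparison is an assumption based on the "at least 0.5" wording above.

```python
def check_thresholds(thresholds, ci_lo, median):
    """Evaluate `{metric}_ci_lo_min` and `{metric}_median_min` keys.

    `ci_lo` and `median` map a metric name to its observed value.
    Returns a dict mapping each threshold key to True or False.
    """
    suffixes = {"_ci_lo_min": ci_lo, "_median_min": median}
    checks = {}
    for key, minimum in thresholds.items():
        for suffix, observed in suffixes.items():
            if key.endswith(suffix):
                metric = key[: -len(suffix)]  # e.g. "sharpe" or "max_drawdown"
                checks[key] = observed[metric] >= minimum
                break
        else:
            raise ValueError(f"unrecognized threshold key: {key!r}")
    return checks


checks = check_thresholds(
    thresholds={"sharpe_ci_lo_min": 0.5, "sharpe_median_min": 1.0},
    ci_lo={"sharpe": 0.62},   # illustrative numbers, not real results
    median={"sharpe": 0.95},
)
# Overall flag: every check must pass, and at least one check must exist.
passed = bool(checks) and all(checks.values())
print(checks, passed)
```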

OutOfSample (scaffold)

Split history into in-sample and out-of-sample, run the strategy on each, compare:

python
from horizon.validate import OutOfSample

oos = OutOfSample(
    train_pct=0.7,
    thresholds={
        "oos_sharpe_min": 0.5,
        "is_oos_sharpe_ratio_max": 2.0,
    },
)

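# MyStrategy, BacktestConfig, my_universe, and Equity are assumed to be
# defined or imported elsewhere in your project.
result = oos.run(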
    strategy=MyStrategy,
    backtest=BacktestConfig(initial_cash_usd=100_000),
    universe=my_universe,
    asset_classes=[Equity],
)

print(result.is_metrics)
print(result.oos_metrics)
print(result.degradation)

The runner is currently a scaffold: it returns zero-valued metrics until it is wired to the real backtest engine. The interface is stable; the implementation will land in a future release.
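
Until the runner is available, the underlying idea can be sketched by hand. The helpers below (`split_chronologically`, `is_oos_ratio`) are hypothetical, and reading `is_oos_sharpe_ratio_max` as "in-sample Sharpe divided by out-of-sample Sharpe" is an assumption, not documented behaviour.

```python
import math


def split_chronologically(series, train_pct):
    """Split without shuffling: the first `train_pct` of observations is in-sample."""
    if not 0.0 < train_pct < 1.0:
        raise ValueError("train_pct must be strictly between 0 and 1")
    cut = int(len(series) * train_pct)
    return series[:cut], series[cut:]


def is_oos_ratio(is_value, oos_value):
    """In-sample metric over out-of-sample metric; inf when OOS is not positive."""
    if oos_value <= 0:
        return math.inf
    return is_value / oos_value


bars = list(range(10))  # stand-in for a time-ordered history
in_sample, out_of_sample = split_chronologically(bars, train_pct=0.7)
print(in_sample, out_of_sample)

# Illustrative Sharpe values, not real results.
print(is_oos_ratio(is_value=2.0, oos_value=1.0))
```

A ratio well above 1 suggests the strategy performed much better on the data it was fitted to than on unseen data, which is the usual symptom of overfitting.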

WalkForward (scaffold)

Rolling train/test validation:

python
from horizon.validate import WalkForward

wf = WalkForward(
    train="2y",
    test="3m",
    step="3m",
    retune_params=["kelly_fraction"],
    tuner=None,   # plug in Optuna in a future release
)

result = wf.run(strategy=MyStrategy, backtest=my_bt_config)

print(result.aggregate_sharpe)
print(result.per_window_sharpe)        # list[float]
print(result.per_window_drawdown)      # list[float]

As with OutOfSample, the interface is stable and the runner will land in a future release.
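
The rolling scheme itself is simple to sketch. The generator below is a hypothetical illustration that works on bar counts rather than the `"2y"` / `"3m"` strings shown above; the conversion to 504 and 63 trading days assumes roughly 252 trading days per year.

```python
def walk_forward_windows(n_bars, train, test, step):
    """Yield (train_start, train_end, test_end) index triples.

    Each window trains on [train_start, train_end) and tests on
    [train_end, test_end); all end indices are exclusive.
    """
    if min(train, test, step) <= 0:
        raise ValueError("train, test, and step must be positive")
    start = 0
    while start + train + test <= n_bars:
        yield start, start + train, start + train + test
        start += step


# Roughly "2y" train, "3m" test, "3m" step, expressed in trading days.
windows = list(walk_forward_windows(n_bars=1000, train=504, test=63, step=63))
for train_start, train_end, test_end in windows:
    print(f"train [{train_start}, {train_end})  test [{train_end}, {test_end})")
```

With `step` equal to `test`, the test segments tile the history without overlapping, so each bar is evaluated out of sample at most once.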

The result object

Every validation result is a subclass of ValidationResult:

python
from dataclasses import dataclass
from typing import Any

@dataclass
class ValidationResult:
    test_name: str
    thresholds: dict[str, float]        # what you supplied
    threshold_checks: dict[str, bool]   # per-threshold pass/fail
    metadata: dict[str, Any]

    @property
    def passed(self) -> bool:
        """True iff all thresholds passed. False if no thresholds supplied."""

Plus serialization:

python
result.to_dict()         # JSON-safe dict
result.to_json(indent=2) # JSON string
result.save("report.json")
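
For readers who want to see how these pieces could fit together, here is a minimal stand-in. `SketchResult` is a hypothetical class written for this example; the real `ValidationResult` may differ, and including `passed` in the serialized dict is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class SketchResult:
    test_name: str
    thresholds: dict[str, float] = field(default_factory=dict)
    threshold_checks: dict[str, bool] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # all() over an empty collection is True, so the no-threshold
        # case has to be guarded explicitly to return False.
        return bool(self.threshold_checks) and all(self.threshold_checks.values())

    def to_dict(self) -> dict[str, Any]:
        return {**asdict(self), "passed": self.passed}

    def to_json(self, indent=None) -> str:
        return json.dumps(self.to_dict(), indent=indent)

    def save(self, path) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(self.to_json(indent=2))


empty = SketchResult(test_name="bootstrap")
checked = SketchResult(
    test_name="bootstrap",
    thresholds={"sharpe_ci_lo_min": 0.5},
    threshold_checks={"sharpe_ci_lo_min": True},
)
print(empty.passed, checked.passed)
print(checked.to_json(indent=2))
```

Returning False when no thresholds are supplied means a result never reports a pass that nobody asked for.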

Philosophy

The framework returns numbers. The user decides meaning.

There’s no hz.validate() that runs a canned suite. There’s a library of tests you compose manually:

python
from horizon.validate import Bootstrap, OutOfSample

bs_result = Bootstrap(metrics=["sharpe"], n_samples=1000).run(returns=rets)
oos_result = OutOfSample(train_pct=0.7).run(strategy=MyStrategy, backtest=bt)

# Your own decision logic
if (
    bs_result.ci("sharpe", 0.95)[0] > 0.3
    and oos_result.oos_metrics["sharpe"] > 0.5
    and oos_result.is_oos_gap_sharpe < 1.5
):
    print("deploy-worthy")

A junior quant doesn’t want a black-box verdict; they want the numbers so they can make the call themselves. The framework should never pretend to have more certainty than it has.

Roadmap

Test | Status
--- | ---
Bootstrap | ✅ Real, working block bootstrap with CI/median/distribution
OutOfSample | ⏸️ Interface ready, needs backtest runner wiring
WalkForward | ⏸️ Interface ready, needs runner + Optuna integration
PurgedKFold | ⏸️ a future release (de Prado’s purged CV)
CPCV | ⏸️ a future release (combinatorial purged CV)
MonteCarloPermutation | ⏸️ a future release
DeflatedSharpe | ⏸️ a future release
PBO | ⏸️ a future release (probability of backtest overfit)
ParamStability | ⏸️ a future release
LookaheadAudit | ⏸️ a future release (static feature inspection)
