Bootstrap

Block-bootstrap confidence intervals on any metric

Bootstrap is Horizon’s validated statistical test. It resamples trade/return sequences many times, recomputes the metric on each sample, and returns the full distribution, not a point estimate.

Import

python
from horizon.validate import Bootstrap

Signature

python
Bootstrap(
    metrics: list[str] | None = None,
    n_samples: int = 1000,
    method: str = "block",           # "block" | "iid"
    block_size: int = 20,
    seed: int | None = None,
    thresholds: dict[str, float] | None = None,
)
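
metrics: list[str]
Metrics to bootstrap. Defaults to `["sharpe", "sortino", "max_drawdown", "cagr"]`.
n_samples: int
Number of bootstrap resamples. More resamples make the estimated CI bounds more stable (less Monte Carlo noise) but slower to compute. They do not narrow the interval itself, which depends on how much return data you have.
method: str
`"block"` preserves autocorrelation (recommended for financial time series). `"iid"` is naive random sampling.
block_size: int
For block bootstrap, the length of each resampled block.
seed: int | None
RNG seed for reproducibility.
thresholds: dict[str, float] | None
Optional pass/fail thresholds. See "Thresholds (optional pass/fail)" for the supported keys.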

Why bootstrap?

A point estimate of Sharpe = 1.5 tells you one number. Maybe that’s the true Sharpe; maybe your strategy got lucky. You can’t tell from one number.

Bootstrap gives you the distribution. You resample the returns 1000 times, compute Sharpe on each resample, and look at the spread:

  • Median: best single estimate of true Sharpe
  • 95% CI: range you’re 95% confident contains the true Sharpe
  • Percentiles: what’s the p05, p95, p99?

If the 95% CI lower bound is 0.5, you can say, loosely, that the data support a true Sharpe of at least 0.5. If the lower bound is 0.0 or negative, the apparent 1.5 may be indistinguishable from noise.
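
As a rough illustration of the idea (not Horizon's implementation), here is a minimal stdlib-only sketch of a percentile bootstrap for Sharpe. It uses plain IID resampling for brevity. The helper names `sharpe`, `bootstrap_sharpe`, and `percentile_ci` are hypothetical, the Sharpe is per-period rather than annualized, and the returns are synthetic.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean / sample stdev), not annualized."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def bootstrap_sharpe(returns, n_samples=1000, seed=None):
    """Resample returns with replacement and recompute Sharpe each time."""
    rng = random.Random(seed)
    n = len(returns)
    return [sharpe(rng.choices(returns, k=n)) for _ in range(n_samples)]


def percentile_ci(samples, conf=0.95):
    """Percentile interval: cut (1 - conf) / 2 from each tail."""
    ordered = sorted(samples)
    tail = (1.0 - conf) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Synthetic returns, for illustration only
rng = random.Random(0)
returns = [rng.gauss(0.0005, 0.01) for _ in range(500)]

dist = bootstrap_sharpe(returns, n_samples=1000, seed=42)
lo, hi = percentile_ci(dist, conf=0.95)
print(f"point: {sharpe(returns):+.3f}  median: {statistics.median(dist):+.3f}")
print(f"95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

The width of the printed interval is the useful part: a point estimate that looks healthy can still sit inside an interval that straddles zero.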

Block vs IID

Financial returns are often serially dependent: today's return, and especially its volatility, is correlated with yesterday's. Naive (IID) bootstrap samples individual returns independently, which breaks this dependence and typically understates the variance of the metric.

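Block bootstrap samples contiguous blocks of length `block_size`, preserving autocorrelation within each block. The default of 20 is a reasonable compromise: long enough to capture short-range dependence, short enough to leave many distinct blocks to draw from.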
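
The difference between the two resampling schemes can be sketched in a few lines. This is a simplified moving-block variant written for illustration, under the assumption that blocks are drawn uniformly and the last block is trimmed to fit. Horizon's actual block scheme may differ, and `block_resample` and `iid_resample` are hypothetical names.

```python
import random


def block_resample(returns, block_size=20, rng=None):
    """Build one resample of len(returns) from randomly chosen contiguous blocks."""
    if rng is None:
        rng = random.Random()
    n = len(returns)
    if not 1 <= block_size <= n:
        raise ValueError("block_size must be between 1 and len(returns)")
    last_start = n - block_size  # latest index at which a full block still fits
    out = []
    while len(out) < n:
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        out.extend(returns[start:start + block_size])
    return out[:n]  # trim the overshoot from the final block


def iid_resample(returns, rng=None):
    """Naive resample: draw individual returns with replacement."""
    if rng is None:
        rng = random.Random()
    return rng.choices(returns, k=len(returns))


# Stand-in series: consecutive integers make the contiguous blocks easy to see
series = list(range(100))
sample = block_resample(series, block_size=20, rng=random.Random(42))
print(sample[:20])  # twenty consecutive integers from a single block
print(iid_resample(series, rng=random.Random(42))[:20])  # no ordering preserved
```

With `block_size=1` the block version reduces to the IID one, which is one way to see that the block length controls how much serial structure survives resampling.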

Usage

python
import horizon as hz
from horizon.validate import Bootstrap

# Get returns from a backtest
result = hz.run(mode="backtest", ...)
equities = [e for _, e in result.equity_curve]
returns = [(equities[i] / equities[i-1]) - 1 for i in range(1, len(equities))]

# Bootstrap
bs = Bootstrap(metrics=["sharpe", "max_drawdown"], n_samples=1000, seed=42)
bs_result = bs.run(returns=returns)

# Point estimates
print(bs_result.point_estimates)
# {"sharpe": 1.23, "max_drawdown": -0.15}

# Median of the bootstrap
print(f"Sharpe median: {bs_result.median('sharpe'):+.3f}")

# 95% CI
lo, hi = bs_result.ci("sharpe", conf=0.95)
print(f"Sharpe 95% CI: [{lo:+.3f}, {hi:+.3f}]")

# Full distribution
distribution = bs_result.distribution("sharpe")
p05 = sorted(distribution)[int(0.05 * len(distribution))]  # index scales with n_samples
print(f"5th percentile: {p05:+.3f}")
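
If you want several percentiles at once, the standard library can compute them from the distribution list directly. The sketch below uses a synthetic list as a stand-in for the output of `bs_result.distribution("sharpe")`, so the numbers it produces are illustrative only.

```python
import random
import statistics

# Stand-in for bs_result.distribution("sharpe"): synthetic values, illustration only
rng = random.Random(7)
distribution = [rng.gauss(1.2, 0.4) for _ in range(1000)]

# quantiles(n=100) returns the 99 cut points p01..p99; cut point k sits at index k - 1
cuts = statistics.quantiles(distribution, n=100)
p05, p50, p95, p99 = cuts[4], cuts[49], cuts[94], cuts[98]
print(f"p05={p05:+.3f}  p50={p50:+.3f}  p95={p95:+.3f}  p99={p99:+.3f}")

# Share of resamples at or below zero: a rough gauge of how often the edge vanishes
frac_nonpositive = sum(v <= 0 for v in distribution) / len(distribution)
print(f"resamples with Sharpe <= 0: {frac_nonpositive:.1%}")
```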

Result fields

python
@dataclass
class BootstrapResult(ValidationResult):
    samples: dict[str, list[float]]           # per-metric list of bootstrap values
    point_estimates: dict[str, float]          # original estimates
    n_samples: int
    method: str
    block_size: int

    # Method stubs: signatures only
    def ci(self, metric: str, conf: float = 0.95) -> tuple[float, float]: ...
    def distribution(self, metric: str) -> list[float]: ...
    def median(self, metric: str) -> float: ...

Thresholds (optional pass/fail)

Pass `thresholds` to get a pass/fail verdict on the result's `.passed` attribute:

python
bs = Bootstrap(
    metrics=["sharpe"],
    n_samples=1000,
    thresholds={
        "sharpe_ci_lo_min": 0.5,       # lower CI bound must exceed 0.5
        "sharpe_median_min": 1.0,      # median must exceed 1.0
    },
)
result = bs.run(returns=returns)

if result.passed:
    print("All thresholds met")
else:
    for check, ok in result.threshold_checks.items():
        print(f"{check}: {'✓' if ok else '✗'}")

Thresholds are user-declared. Horizon doesn’t have an opinion about what’s acceptable, so you decide.

Threshold keys

  • {metric}_ci_lo_min: lower CI bound at 95% must exceed value
  • {metric}_median_min: median must exceed value
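
To make the key convention concrete, here is one way such keys could be evaluated against bootstrap samples. This is a hypothetical helper written for illustration, not Horizon's code: `check_thresholds` is an invented name, the lower bound is taken as a simple percentile of the samples, and "must exceed" is read as a strict comparison.

```python
import statistics


def check_thresholds(samples, thresholds, conf=0.95):
    """Evaluate '{metric}_ci_lo_min' and '{metric}_median_min' keys (hypothetical helper)."""
    checks = {}
    for key, minimum in thresholds.items():
        if key.endswith("_ci_lo_min"):
            metric = key[: -len("_ci_lo_min")]
            ordered = sorted(samples[metric])
            lo_idx = int((1.0 - conf) / 2.0 * (len(ordered) - 1))
            value = ordered[lo_idx]  # lower percentile bound of the samples
        elif key.endswith("_median_min"):
            metric = key[: -len("_median_min")]
            value = statistics.median(samples[metric])
        else:
            raise KeyError(f"unrecognized threshold key: {key}")
        checks[key] = value > minimum  # strict: the value must exceed the minimum
    return checks


# Made-up samples: 1000 evenly spaced Sharpe values from 0.600 to 1.599
samples = {"sharpe": [0.6 + 0.001 * i for i in range(1000)]}
checks = check_thresholds(samples, {"sharpe_ci_lo_min": 0.5, "sharpe_median_min": 1.0})
passed = all(checks.values())
print(checks, passed)
```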

Tests

Block bootstrap is tested end-to-end against a known return series:

python
bs = Bootstrap(metrics=["sharpe"], n_samples=50)
bs_result = bs.run(returns=[0.001, -0.002, 0.003, 0.0, -0.001, 0.002] * 40)
assert bs_result.median("sharpe") != 0

See tests/test_behavioral_audit.py for the full coverage.
