Bootstrap
Block-bootstrap confidence intervals on any metric
Bootstrap is Horizon’s validated statistical test. It resamples trade/return sequences many times, recomputes the metric on each sample, and returns the full distribution, not a point estimate.
Import
from horizon.validate import Bootstrap
Signature
Bootstrap(
    metrics: list[str] | None = None,
    n_samples: int = 1000,
    method: str = "block",  # "block" | "iid"
    block_size: int = 20,
    seed: int | None = None,
    thresholds: dict[str, float] | None = None,
)
- metrics: list[str]
- n_samples: int
- method: str
- block_size: int
- seed: int | None
Why bootstrap?
A point estimate of Sharpe = 1.5 tells you one number. Maybe that’s the true Sharpe; maybe your strategy got lucky. You can’t tell from one number.
Bootstrap gives you the distribution. You resample the returns 1000 times, compute Sharpe on each resample, and look at the spread:
- Median: best single estimate of true Sharpe
- 95% CI: range you’re 95% confident contains the true Sharpe
- Percentiles: what’s the p05, p95, p99?
If the lower bound of the 95% CI is 0.5, you can be reasonably confident the true Sharpe is at least 0.5. If the lower bound is 0.0 or negative, the apparent 1.5 might be pure noise.
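The idea above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Horizon's implementation: the helper names (sharpe, percentile, bootstrap_sharpe) are hypothetical, the Sharpe ratio is per-period and not annualized, the interval is a simple percentile interval, and the resampling is plain IID for brevity.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean / sample std), not annualized."""
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already-sorted list, q in [0, 1]."""
    idx = min(len(sorted_values) - 1, max(0, int(q * len(sorted_values))))
    return sorted_values[idx]


def bootstrap_sharpe(returns, n_samples=1000, conf=0.95, seed=None):
    """Resample with replacement, recompute Sharpe, summarize the spread."""
    rng = random.Random(seed)
    n = len(returns)
    stats = sorted(
        sharpe([returns[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_samples)
    )
    alpha = (1.0 - conf) / 2.0
    return {
        "median": statistics.median(stats),
        "ci": (percentile(stats, alpha), percentile(stats, 1.0 - alpha)),
    }


# Synthetic daily returns, for illustration only.
data_rng = random.Random(0)
returns = [data_rng.gauss(0.001, 0.01) for _ in range(250)]

summary = bootstrap_sharpe(returns, n_samples=500, seed=42)
lo, hi = summary["ci"]
print(f"median={summary['median']:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]")
```

Fixing the seed makes the resampling reproducible, which is the same role the seed argument plays in the signature above.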
Block vs IID
Financial returns are autocorrelated: today’s return is correlated with yesterday’s. Naive (IID) bootstrap samples individual returns independently, which breaks this correlation and typically understates the variance of the metric.
Block bootstrap samples contiguous blocks of size block_size, preserving autocorrelation within each block. The default block size of 20 is a reasonable compromise: long enough to retain short-range dependence, short enough to leave many distinct blocks to draw from.
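As a rough sketch of what sampling contiguous blocks means, the function below draws one moving-block resample. The source does not say which block scheme Horizon uses (moving, circular, or stationary), so the name block_resample and the moving-block choice are assumptions made for illustration.

```python
import random


def block_resample(returns, block_size=20, rng=None):
    """Draw one moving-block bootstrap resample the same length as `returns`."""
    rng = rng or random.Random()
    n = len(returns)
    block_size = max(1, min(block_size, n))  # clamp to a usable size
    out = []
    while len(out) < n:
        # Pick a random block start, then copy the whole contiguous block.
        start = rng.randrange(n - block_size + 1)
        out.extend(returns[start:start + block_size])
    return out[:n]  # trim the overshoot from the last block


# Using indices as "returns" makes the preserved ordering easy to see.
series = list(range(100))
sample = block_resample(series, block_size=20, rng=random.Random(1))
print(sample[:20])  # one contiguous run of 20 consecutive values
```

Setting block_size=1 reduces this to the IID case, since every block is then a single observation.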
Usage
import horizon as hz
from horizon.validate import Bootstrap
# Get returns from a backtest
result = hz.run(mode="backtest", ...)
equities = [e for _, e in result.equity_curve]
returns = [(equities[i] / equities[i-1]) - 1 for i in range(1, len(equities))]
# Bootstrap
bs = Bootstrap(metrics=["sharpe", "max_drawdown"], n_samples=1000, seed=42)
bs_result = bs.run(returns=returns)
# Point estimates
print(bs_result.point_estimates)
# {"sharpe": 1.23, "max_drawdown": -0.15}
# Median of the bootstrap
print(f"Sharpe median: {bs_result.median('sharpe'):+.3f}")
# 95% CI
lo, hi = bs_result.ci("sharpe", conf=0.95)
print(f"Sharpe 95% CI: [{lo:+.3f}, {hi:+.3f}]")
# Full distribution
distribution = bs_result.distribution("sharpe")
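# Index scales with the number of samples (index 50 when n_samples=1000)
print(f"5th percentile: {sorted(distribution)[int(0.05 * len(distribution))]:+.3f}")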
Result fields
@dataclass
class BootstrapResult(ValidationResult):
    samples: dict[str, list[float]]    # per-metric list of bootstrap values
    point_estimates: dict[str, float]  # original estimates
    n_samples: int
    method: str
    block_size: int

    def ci(self, metric, conf=0.95) -> tuple[float, float]: ...
    def distribution(self, metric) -> list[float]: ...
    def median(self, metric) -> float: ...
Thresholds (optional pass/fail)
Pass user thresholds to get a .passed attribute:
bs = Bootstrap(
    metrics=["sharpe"],
    n_samples=1000,
    thresholds={
        "sharpe_ci_lo_min": 0.5,   # lower CI bound must exceed 0.5
        "sharpe_median_min": 1.0,  # median must exceed 1.0
    },
)
result = bs.run(returns=returns)
if result.passed:
    print("All thresholds met")
else:
    for check, ok in result.threshold_checks.items():
        print(f"{check}: {'✓' if ok else '✗'}")
Thresholds are user-declared. Horizon doesn’t have an opinion about what’s acceptable, so you decide.
Threshold keys
- {metric}_ci_lo_min: lower CI bound at 95% must exceed value
- {metric}_median_min: median must exceed value
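To make the key convention concrete, here is one way such keys could be evaluated against a bootstrap distribution. This is a hypothetical sketch, not Horizon's code: check_thresholds is an invented helper, the lower bound is taken as a simple percentile, and "must exceed" is read as a strict comparison.

```python
import statistics


def check_thresholds(samples, thresholds, conf=0.95):
    """Map each threshold key to True/False based on the bootstrap samples."""
    checks = {}
    for key, minimum in thresholds.items():
        if key.endswith("_ci_lo_min"):
            metric = key[: -len("_ci_lo_min")]
            values = sorted(samples[metric])
            # Lower bound of a two-sided percentile interval.
            idx = int((1.0 - conf) / 2.0 * len(values))
            observed = values[idx]
        elif key.endswith("_median_min"):
            metric = key[: -len("_median_min")]
            observed = statistics.median(samples[metric])
        else:
            raise ValueError(f"unknown threshold key: {key}")
        checks[key] = observed > minimum  # "must exceed": strict comparison
    return checks


# Toy distribution: 200 evenly spaced Sharpe values from 0.00 to 1.99.
samples = {"sharpe": [i / 100 for i in range(200)]}
checks = check_thresholds(
    samples, {"sharpe_ci_lo_min": 0.5, "sharpe_median_min": 0.9}
)
passed = all(checks.values())
print(checks, passed)
```

In this toy case the median check succeeds but the lower-bound check fails, so the overall result does not pass, mirroring how a single failed check would show up in threshold_checks.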
Tests
Block bootstrap is tested end-to-end against a known return series:
bs = Bootstrap(metrics=["sharpe"], n_samples=50)
bs_result = bs.run(returns=[0.001, -0.002, 0.003, 0.0, -0.001, 0.002] * 40)
assert bs_result.median("sharpe") != 0
See tests/test_behavioral_audit.py for the full coverage.