Thresholds

User-driven pass/fail criteria for validation tests

Every validation test accepts an optional thresholds dict. When you provide it, the result gains a .passed boolean and a threshold_checks dict showing which thresholds were met.

Horizon doesn’t have an opinion about what’s acceptable. You declare the criteria; the framework just does the math.

The contract

python
class ValidationTest(ABC):
    def __init__(self, thresholds: dict[str, float] | None = None):
        self.thresholds = thresholds or {}

Pass thresholds at construction:

python
from horizon.validate import Bootstrap

bs = Bootstrap(
    metrics=["sharpe"],
    n_samples=1000,
    thresholds={
        "sharpe_ci_lo_min": 0.5,
        "sharpe_median_min": 1.0,
    },
)

result = bs.run(returns=my_returns)

if result.passed:
    print("All thresholds met")

Threshold key naming

Each test documents its supported threshold keys. The general pattern is {metric}_{criterion}_{bound}:

  • sharpe_ci_lo_min: lower CI bound of Sharpe must be at least this value
  • sharpe_median_min: bootstrap median of Sharpe must be at least this value
  • oos_sharpe_min: OOS Sharpe must be at least this value
  • is_oos_sharpe_ratio_max: IS/OOS Sharpe ratio must not exceed this value
  • aggregate_sharpe_min: stitched walk-forward Sharpe must be at least this value
  • min_window_sharpe: worst walk-forward window Sharpe must be at least this value

Result fields

python
@dataclass
class ValidationResult:
    test_name: str
    thresholds: dict[str, float]            # user-supplied thresholds
    threshold_checks: dict[str, bool]        # per-threshold pass/fail
    metadata: dict[str, Any]

    @property
    def passed(self) -> bool:
        """True iff all user thresholds are met."""

Inspecting the checks

python
result = bs.run(returns=my_returns)

for check_name, passed in result.threshold_checks.items():
    icon = "✓" if passed else "✗"
    print(f"{icon} {check_name}")

Combining multiple tests

python
from horizon.validate import Bootstrap, WalkForward

# Run two tests with different thresholds
bs = Bootstrap(
    metrics=["sharpe"],
    thresholds={"sharpe_ci_lo_min": 0.5},
).run(returns=my_returns)

wf = WalkForward(
    train="2y", test="3m",
    thresholds={"aggregate_sharpe_min": 0.8, "min_window_sharpe": 0.0},
).run(strategy=MyStrategy, backtest=my_bt, ...)

# User's own gate
deploy_worthy = bs.passed and wf.passed

if deploy_worthy:
    print("All validation passed, safe to promote")
else:
    print("Validation failed, need more work")

Why user-driven thresholds?

Because “is a Sharpe of 1.2 good enough?” depends on:

  • Your capital base
  • Your investor expectations
  • The asset class you’re trading
  • How many strategy variants you tested
  • Your confidence in the backtest assumptions

A framework that has an opinion about this is either too strict (blocks everything) or too lax (lets overfits through). Horizon returns the numbers; you decide the meaning.

Next