de Prado Labeling
Triple-barrier labeling, meta-labeling, sample weights
One of the most useful contributions of Advances in Financial Machine Learning is a set of labeling techniques for supervised learning on financial time series. The naive approach (“label +1 if the price went up tomorrow”) discards the price path between entry and horizon and produces overlapping, correlated samples. de Prado’s methods address both issues.
The problem with fixed-horizon labels
Consider labeling “will this stock be up in 5 days”:
t:   0    1    2    3    4    5    6    7
p:   100  101  99   103  102  100  98   95
y5:  ?    ?    ?    ?    ?    ?    ?    ?
The label at t=0 compares p(5) = 100 to p(0) = 100 → label = 0 (flat). But during that 5-day window the price went as high as 103 (profit!) and as low as 99 (stop loss!). The fixed-horizon label ignores what actually happened along the way.
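To make the problem concrete, here is a minimal sketch that computes fixed-horizon labels for the toy series above and compares the t=0 label with the price extremes inside its window. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100, 101, 99, 103, 102, 100, 98, 95], dtype=float)
horizon = 5

# Fixed-horizon return: p(t + horizon) / p(t) - 1.
# Entries whose horizon runs past the end of the data are dropped.
fwd_return = (prices.shift(-horizon) / prices - 1).dropna()
fixed_label = np.sign(fwd_return).astype(int)

# Price extremes inside the window of the t=0 entry (bars 1..horizon).
window = prices.iloc[1 : horizon + 1]
path_high, path_low = window.max(), window.min()

print(fixed_label.tolist())   # labels for t = 0, 1, 2
print(path_high, path_low)    # what the t=0 label never sees
```

The t=0 label comes out flat even though the window contains both a 3% gain and a 1% loss relative to entry.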
The fix: the triple-barrier method.
Triple-barrier labeling
Define three barriers
Assign label by which barrier is hit first
- Upper first → +1 (profit)
- Lower first → −1 (loss)
- Time expires → 0 (flat)
Compute meta information
Why it’s better
The triple-barrier label respects the trading setup: you’d close the position at the first barrier hit, not ride it to the arbitrary horizon. This makes the label economically meaningful and well-aligned with the actual strategy logic.
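As a rough illustration of the first-touch rule, the sketch below labels a single entry on the toy series from the fixed-horizon example. The helper name `first_touch` and the symmetric 2% barriers are assumptions chosen for the example, not part of any library.

```python
import numpy as np

def first_touch(path, entry, upper_pct, lower_pct):
    """Return (label, bars_to_hit) for one entry.

    `path` holds the prices observed after entry, up to the time barrier.
    """
    path = np.asarray(path, dtype=float)
    hit_upper = path >= entry * (1 + upper_pct)
    hit_lower = path <= entry * (1 - lower_pct)
    touched = hit_upper | hit_lower
    if not touched.any():
        return 0, len(path)        # neither barrier touched: time barrier
    k = int(np.argmax(touched))    # index of the first bar that touched a barrier
    return (1 if hit_upper[k] else -1), k + 1

prices = [100, 101, 99, 103, 102, 100, 98, 95]

# Entry at t=0, 5-bar window, illustrative symmetric 2% barriers.
label, bars = first_touch(prices[1:6], entry=prices[0], upper_pct=0.02, lower_pct=0.02)
print(label, bars)
```

With these barriers the trade exits at the profit target on bar 3, whereas the fixed-horizon label for the same entry is flat.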
Triple-barrier recipe
import pandas as pd
import numpy as np

def triple_barrier_labels(
    prices: pd.Series,
    upper_pct: float = 0.02,
    lower_pct: float = 0.01,
    horizon_bars: int = 5,
) -> pd.DataFrame:
    """Triple-barrier labels for every entry point.

    Returns a DataFrame with columns:
        label (+1 / -1 / 0),
        barrier_hit ('upper' / 'lower' / 'time'),
        bars_to_hit,
        realized_return
    """
    out = []
    # Stop early enough that every entry has a full horizon of future bars
    for i in range(len(prices) - horizon_bars):
        entry = prices.iloc[i]
        upper = entry * (1 + upper_pct)
        lower = entry * (1 - lower_pct)

        # Default outcome: neither price barrier is touched before the time barrier
        label = 0
        barrier = "time"
        hit_bar = horizon_bars

        # Walk forward bar by bar and stop at the first barrier touched
        for j in range(1, horizon_bars + 1):
            p = prices.iloc[i + j]
            if p >= upper:
                label = 1
                barrier = "upper"
                hit_bar = j
                break
            if p <= lower:
                label = -1
                barrier = "lower"
                hit_bar = j
                break

        # Return realized at the exit bar (barrier touch or horizon end)
        realized = prices.iloc[i + hit_bar] / entry - 1
        out.append({
            "entry_idx": i,
            "label": label,
            "barrier_hit": barrier,
            "bars_to_hit": hit_bar,
            "realized_return": realized,
        })
    return pd.DataFrame(out)
Use it to build a supervised dataset:
prices = pd.Series(my_price_data)
labels = triple_barrier_labels(prices, upper_pct=0.02, lower_pct=0.01, horizon_bars=5)
features = compute_features_at_entry_points(labels["entry_idx"])
# Now train a classifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(features, labels["label"])
Meta-labeling
The idea: instead of asking a single model to decide both the direction and whether to trade at all, split the work. A primary model (possibly naive) predicts direction, and a meta model decides when to trust the primary model.
Train a primary model
Apply triple-barrier labeling to the primary's signals
Train a secondary (meta) model
Deploy both together
Why it works
- The primary can be simple (e.g., a pure trend follower); it doesn’t need to know when to sit out
- The meta focuses on a very specific binary task: “given this signal, should I trade?”
- Splitting reduces the hypothesis space each model has to cover and typically improves out-of-sample behavior
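The meta model's target is binary: did acting on the primary's signal make money? A minimal sketch of building that target, assuming hypothetical column names `side` (the primary's direction) and `realized_return` (the underlying's return over the trade's barrier window):

```python
import pandas as pd

# Hypothetical toy data: one row per primary signal.
trades = pd.DataFrame({
    "side": [1, 1, -1, -1, 1],                        # +1 long, -1 short
    "realized_return": [0.02, -0.01, -0.015, 0.02, 0.0],
})

# Return earned by following the primary's side.
trades["strategy_return"] = trades["side"] * trades["realized_return"]

# Meta label: 1 if acting on the signal was profitable, 0 otherwise.
trades["meta_y"] = (trades["strategy_return"] > 0).astype(int)

print(trades)
```

Flat outcomes are treated as "do not trade" here; whether a zero return should count as a success is a design choice.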
Recipe
# Step 1: primary signals from a simple strategy
from horizon.quant import TSMomentum

primary = TSMomentum(lookback=20)
primary_signals = run_primary_on_history(primary, ...)

# Step 2: triple-barrier labels for each primary signal
# NOTE: this assumes a variant of the labeler that accepts explicit entry_points;
# the triple_barrier_labels defined earlier labels every bar and has no such argument.
from my_utils import triple_barrier_labels

tb_labels = triple_barrier_labels(
    prices=history,
    entry_points=primary_signals.timestamps,
    upper_pct=0.02,
    lower_pct=0.01,
    horizon_bars=5,
)

# Meta label: 1 if primary's trade was profitable, 0 otherwise
meta_y = (tb_labels["label"] == primary_signals.direction).astype(int)

# Step 3: train a meta classifier
from sklearn.ensemble import RandomForestClassifier

meta_features = compute_features_at(primary_signals.timestamps)  # could be different features
meta_clf = RandomForestClassifier(n_estimators=200, max_depth=5)
meta_clf.fit(meta_features, meta_y)

# Step 4: combined strategy
from dataclasses import replace  # assumes signal objects are dataclasses

class MetaLabeled(Strategy):
    asset_classes = [Equity]
    features = {...}

    def evaluate(self, f, universe):
        primary_signals = primary.evaluate(f, universe)
        result = []
        for sig in primary_signals:
            # Expected to return a single-row 2D array, as predict_proba requires
            meta_features = self._build_meta_features(sig, f)
            # Probability of class 1, i.e. "the primary's trade will be profitable"
            meta_confidence = meta_clf.predict_proba(meta_features)[0][1]
            if meta_confidence > 0.6:
                result.append(replace(sig, confidence=meta_confidence))
        return result
Sample weights
de Prado argues that samples with overlapping labels are correlated and should be down-weighted. Consider two entries one day apart in a 5-day horizon: their labels share 4 days of price data.
from datetime import datetime, timedelta

import numpy as np

def sample_weights_by_overlap(
    label_times: list[datetime],
    label_durations: list[int],
) -> np.ndarray:
    """Compute weights inversely proportional to label overlap."""
    n = len(label_times)
    # End of each label's window; durations are in days
    label_ends = [t + timedelta(days=d) for t, d in zip(label_times, label_durations)]
    weights = np.zeros(n)
    for i in range(n):
        overlap_count = 0
        for j in range(n):
            # Does label j overlap with label i in time? (j == i always counts once)
            if label_times[j] <= label_ends[i] and label_ends[j] >= label_times[i]:
                overlap_count += 1
        weights[i] = 1.0 / overlap_count
    return weights / weights.sum() * n  # normalize so mean weight = 1
Pass these to your classifier’s sample_weight parameter:
clf.fit(X, y, sample_weight=sample_weights_by_overlap(...))
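Counting overlapping labels is a coarse proxy. A finer variant, in the spirit of the average-uniqueness weighting described in the book, counts how many labels are live on each bar and averages the reciprocal over each label's own span. The sketch below assumes labels are given as inclusive integer bar spans; the function name and arguments are illustrative, not an existing API.

```python
import numpy as np

def average_uniqueness_weights(starts, ends, n_bars):
    """Weights from average uniqueness.

    Label i is assumed to span bars starts[i]..ends[i] inclusive.
    """
    # concurrency[t] = number of labels whose span covers bar t
    concurrency = np.zeros(n_bars)
    for s, e in zip(starts, ends):
        concurrency[s : e + 1] += 1

    # Average of 1 / concurrency over each label's own span
    uniq = np.array([
        (1.0 / concurrency[s : e + 1]).mean() for s, e in zip(starts, ends)
    ])
    return uniq / uniq.sum() * len(uniq)  # normalize so mean weight = 1

# Two labels one bar apart (heavily overlapping) and one isolated label.
w = average_uniqueness_weights(starts=[0, 1, 10], ends=[5, 6, 15], n_bars=16)
print(w)
```

The two overlapping labels receive equal, reduced weights, and the isolated label receives the largest weight. The result can be passed to sample_weight in the same way.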
When to use
Pitfalls
Status in Horizon
Source references
- python/horizon/fund/_backtest_runner.py has rudimentary per-trade outcome tracking you can repurpose
- horizon/state/ledger.py records realized P&L per trade; feed that into your labeler
- horizon/data/synthetic.py is useful for testing labelers against known-truth data