de Prado Features
Fractional differentiation + VPIN + information-driven bars
de Prado’s Advances in Financial Machine Learning proposes several feature engineering techniques that address structural issues with naive financial features. This page covers the three most important: fractional differentiation, VPIN, and information-driven bars.
Fractional differentiation
The stationarity dilemma
Price levels carry long memory but are non-stationary, which breaks most statistical models. Returns (first differences) are stationary but erase almost all of that memory. Neither extreme is an ideal model input.
The solution: fractional differences
Instead of taking a full first difference (r(t) = p(t) - p(t-1)), take a fractional one: difference p to order d, where 0 < d < 1. This trades a little stationarity for a lot of retained memory.
The formula, in expanded form:
Δ^d p(t) = p(t) - d·p(t-1) + d(d-1)/2·p(t-2) - d(d-1)(d-2)/6·p(t-3) + ...
As d → 1, you approach full differencing (returns). As d → 0, you approach no differencing (levels). Somewhere in between (typically d ≈ 0.3 to 0.5 for daily equity prices) you get:
- Stationarity (the Augmented Dickey-Fuller test rejects a unit root)
- Preserved memory (correlation with the original level remains high)
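The expansion above is a weighted sum of lagged prices, and its weights follow a simple recursion. A minimal sketch of that recursion (the helper name `frac_weights` is ours for illustration, not part of any library) that shows the d = 1 limit and how slowly the weights decay for a fractional d:

```python
def frac_weights(d: float, n: int) -> list[float]:
    """First n weights of the fractional difference operator of order d.

    w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k, which reproduces the
    coefficients 1, -d, d(d-1)/2, -d(d-1)(d-2)/6, ... of the expansion.
    """
    weights = [1.0]
    for k in range(1, n):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


# d = 1 collapses to a plain first difference: every weight after -1 is zero.
print(frac_weights(1.0, 4))

# For 0 < d < 1 the weights shrink slowly, so distant prices still contribute.
print([round(w, 4) for w in frac_weights(0.4, 5)])
```

The slow decay for fractional d is what preserves memory: old prices never drop out entirely, they are only down-weighted.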
Implementation
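from typing import Sequence


def frac_diff(
    series: Sequence[float],
    d: float,
    threshold: float = 1e-4,
) -> list[float]:
    """Fractional differentiation of a time series.

    Uses the fixed-width window (FFD) method from de Prado Ch. 5: weights
    are computed until their magnitude drops below `threshold`, and the
    same truncated weight vector is applied at every step. The first
    `window - 1` outputs are NaN because the window is not yet full.
    """
    # w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k
    weights = [1.0]
    k = 1
    while True:
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
        k += 1

    window = len(weights)
    if len(series) < window:
        # Not enough history to fill even one window.
        return [float("nan")] * len(series)

    out = [float("nan")] * (window - 1)
    for t in range(window - 1, len(series)):
        # Weighted sum of the most recent `window` values, newest first.
        out.append(sum(w * series[t - i] for i, w in enumerate(weights)))
    return out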
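Finding the right d
- Try many values: d ∈ {0.1, 0.2, ..., 0.9}.
- Test stationarity for each: run the Augmented Dickey-Fuller test on the differenced series.
- Pick the smallest d that passes: the smallest d that gives a stationary series preserves the most memory.
import math

from statsmodels.tsa.stattools import adfuller

# `prices` is your own price series. The lower the d, the more weights
# survive the threshold, so it must be long: several hundred points for
# d = 0.1 at the default threshold.
for d in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    fd = frac_diff(prices, d=d)
    fd_clean = [x for x in fd if not math.isnan(x)]  # drop the warm-up NaNs
    adf_stat, p_value, *_ = adfuller(fd_clean)
    print(f"d={d}: ADF p={p_value:.4f}")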
The smallest d with p_value < 0.05 is the one to use for your data: it is the least differencing that still rejects a unit root at the 5% level.
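The selection rule can be written as a small helper. This is a sketch under our own naming (`select_min_d` is hypothetical); it takes the p-values collected from the ADF loop and makes the "nothing passed" case explicit. The numbers in the example are illustrative only, not results from real data.

```python
from typing import Mapping, Optional


def select_min_d(
    p_values: Mapping[float, float],
    alpha: float = 0.05,
) -> Optional[float]:
    """Return the smallest d whose ADF p-value is below alpha, or None."""
    passing = [d for d, p in p_values.items() if p < alpha]
    return min(passing) if passing else None


# Illustrative p-values only, keyed by the d that produced them.
example = {0.1: 0.62, 0.2: 0.31, 0.3: 0.04, 0.4: 0.01}
print(select_min_d(example))  # → 0.3
```

If the helper returns None, no candidate passed: widen the grid toward d = 1 or revisit the series before relaxing alpha.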
As a Horizon feature
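from horizon.features.base import Feature, PriceHistory
from horizon.context import FeedData


class FracDiff(Feature):
    def __init__(
        self,
        d: float = 0.4,
        window: int = 100,
        market: str | None = None,
        threshold: float = 1e-3,
    ):
        super().__init__(market=market)
        self.d = d
        self.window = window
        # The weight vector must fit inside `window` prices, otherwise
        # frac_diff never has a full window and the feature is always NaN.
        # At d=0.4 the 1e-4 default keeps a few hundred weights; 1e-3 keeps
        # roughly 55, which fits in 100 prices.
        self.threshold = threshold

    def compute(self, market_id, history: PriceHistory, feeds):
        prices = history.last_n_prices(self.window)
        if len(prices) < self.window:
            return float("nan")
        fd = frac_diff(prices, d=self.d, threshold=self.threshold)
        return fd[-1]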
Use in a strategy:
class FracDiffStrat(Strategy):
    features = {"fd": FracDiff(d=0.4, window=100)}

    def evaluate(self, f, universe):
        # The fractional difference is stationary, so you can compare
        # its current value against its own history
        ...
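One way to compare the current value against its own history is a z-score over a trailing window. The sketch below is standalone and makes no assumption about the Horizon `Strategy` API; the helper name and the `lookback` default are ours, and how you store past feature values is up to you.

```python
import math
import statistics
from typing import Sequence


def zscore_vs_history(values: Sequence[float], lookback: int = 50) -> float:
    """Z-score of the latest value against the `lookback` values before it.

    Returns NaN if the trailing window is too short, contains NaNs
    (for example frac-diff warm-up values), or has no variation.
    """
    window = [v for v in values[-lookback - 1:] if not math.isnan(v)]
    if lookback < 2 or len(window) < lookback + 1:
        return float("nan")
    *history, current = window
    spread = statistics.stdev(history)  # sample standard deviation
    if spread == 0:
        return float("nan")
    return (current - statistics.mean(history)) / spread


# The latest value (5) sits three sample standard deviations above 1, 2, 3.
print(zscore_vs_history([1.0, 2.0, 3.0, 5.0], lookback=3))
```

A z-score like this is only meaningful because the fractionally differenced series is stationary; on raw price levels the trailing mean would drift and the score would lose its interpretation.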
VPIN
VPIN is covered in detail in ExecutionIntelligence. Here’s the short version as a feature:
from horizon.fund._execution_intelligence import ExecutionIntelligence
from horizon.features.base import Feature


class VPINFeature(Feature):
    """Volume-synchronized Probability of Informed Trading as a feature."""

    def __init__(self, bucket_volume: float = 10_000, market: str | None = None):
        super().__init__(market=market)
        self._exec_int = ExecutionIntelligence(bucket_volume=bucket_volume)

    def compute(self, market_id, history, feeds):
        # You need to have been feeding trades into self._exec_int
        result = self._exec_int.get_vpin(market_id)
        return result.vpin if result else 0.5  # neutral default
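For intuition about what the number means, VPIN is the average absolute buy/sell volume imbalance across equal-volume buckets, divided by the bucket size. The sketch below is a simplified standalone version, not the ExecutionIntelligence implementation: it assumes every trade already carries a buy/sell flag, whereas the original paper classifies volume probabilistically. The function name and the `n_buckets` default are ours.

```python
from typing import Iterable, Optional, Tuple


def simple_vpin(
    trades: Iterable[Tuple[float, bool]],
    bucket_volume: float,
    n_buckets: int = 50,
) -> Optional[float]:
    """Average order-flow imbalance over the last `n_buckets` volume buckets.

    `trades` yields (volume, is_buy) pairs in time order. Returns None
    until at least one bucket has filled.
    """
    if bucket_volume <= 0 or n_buckets < 1:
        raise ValueError("bucket_volume must be positive and n_buckets >= 1")

    imbalances: list[float] = []
    buy = sell = 0.0
    for volume, is_buy in trades:
        remaining = float(volume)
        while remaining > 0:
            # A trade larger than the room left is split across buckets.
            room = bucket_volume - (buy + sell)
            take = min(remaining, room)
            if is_buy:
                buy += take
            else:
                sell += take
            remaining -= take
            if take == room:  # bucket is full: record it and start a new one
                imbalances.append(abs(buy - sell))
                buy = sell = 0.0

    recent = imbalances[-n_buckets:]
    if not recent:
        return None
    return sum(recent) / (len(recent) * bucket_volume)


# Two buckets of 100: imbalances 20 and 100, so (20 + 100) / (2 * 100).
print(simple_vpin([(60, True), (40, False), (100, True)], bucket_volume=100))
```

Values near 0 mean balanced flow and values near 1 mean one-sided flow, so the right fallback for a missing reading depends on how your strategy interprets the scale.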
Information-driven bars
The problem with time bars
Most charting systems use time bars: a bar per minute, per hour, per day. But trading activity is not uniform in time: some minutes have 100x the volume of others. Time bars treat high-volume minutes the same as dead-quiet minutes, which:
- Produces serial dependence (correlation between consecutive bars)
- Oversamples quiet periods (lunch, overnight), so many bars carry almost no information
- Misses the structure of informed trading, which clusters in volume spikes
Alternatives
Dollar bar example
def _new_bar() -> dict:
    """Return an empty bar accumulator."""
    return {
        "start_time": None,
        "open": None,
        "high": float("-inf"),
        "low": float("inf"),
        "close": None,
        "volume": 0,
        "dollar_volume": 0,
    }


def build_dollar_bars(
    ticks: list[dict],
    dollar_threshold: float = 1_000_000,
) -> list[dict]:
    """Group ticks into dollar bars.

    Each tick is a dict with "price", "volume" and "timestamp" keys.
    """
    bars = []
    current = _new_bar()
    for t in ticks:
        if current["open"] is None:
            # First tick of a new bar.
            current["open"] = t["price"]
            current["start_time"] = t["timestamp"]
        current["high"] = max(current["high"], t["price"])
        current["low"] = min(current["low"], t["price"])
        current["close"] = t["price"]
        current["volume"] += t["volume"]
        current["dollar_volume"] += t["price"] * t["volume"]
        if current["dollar_volume"] >= dollar_threshold:
            # Enough value has traded: close this bar and start the next.
            bars.append(current)
            current = _new_bar()
    # Ticks left in `current` do not yet make a full bar and are dropped.
    return bars
Why it matters
de Prado argues that dollar bars make better training samples for classifiers than time bars:
- Serial correlation is lower (bars are more “independent”)
- Bar returns are better distributed (closer to normal)
- Sample efficiency is higher: fewer bars are needed for significance
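The serial-correlation claim is easy to check on your own data: build both bar types, compute close-to-close returns for each, and compare their lag-1 autocorrelation. A minimal sketch with helper names of our own choosing; which bar type scores closer to zero depends on your instrument and threshold.

```python
from typing import Sequence


def bar_returns(closes: Sequence[float]) -> list[float]:
    """Simple close-to-close returns between consecutive bars."""
    return [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]


def lag1_autocorr(returns: Sequence[float]) -> float:
    """Lag-1 sample autocorrelation; values near 0 mean little dependence."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two returns")
    mean = sum(returns) / n
    dev = [r - mean for r in returns]
    denom = sum(x * x for x in dev)
    if denom == 0:
        return 0.0  # a constant series has no measurable dependence
    return sum(dev[i] * dev[i - 1] for i in range(1, n)) / denom


# A perfectly alternating series is strongly negatively autocorrelated.
print(lag1_autocorr([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]))
```

To use it, pass `[bar["close"] for bar in bars]` through `bar_returns` for each bar set and compare the two results.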
Integration with Horizon
For now, the approach is:
- Run dollar bar construction offline using your tick data
- Save the resulting dollar bars as a CSV / parquet file
- Load via a custom DataSource that yields them in chronological order
- Build features on top of the dollar-bar time series
This gives you de Prado’s information-driven bars within the existing Horizon pipeline.
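The save and load steps can be sketched with the standard library alone. This is not Horizon's DataSource interface, which this page does not show; it is a plain generator you could wrap in one. It assumes bars shaped like the output of the dollar bar example and numeric timestamps (for example epoch seconds); the function names are ours.

```python
import csv
from typing import Iterator

# Assumed bar layout, matching the dollar bar example on this page.
BAR_FIELDS = [
    "start_time", "open", "high", "low", "close", "volume", "dollar_volume",
]


def save_bars_csv(bars: list[dict], path: str) -> None:
    """Write bars to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=BAR_FIELDS)
        writer.writeheader()
        writer.writerows(bars)


def iter_bars_csv(path: str) -> Iterator[dict]:
    """Yield bars from a CSV file in chronological order."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Assumes numeric timestamps; sort so consumers never see the future first.
    rows.sort(key=lambda row: float(row["start_time"]))
    for row in rows:
        yield {field: float(row[field]) for field in BAR_FIELDS}
```

Sorting on load is deliberate: features built on out-of-order bars leak future information into the past.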
Further reading
- “Advances in Financial Machine Learning” Ch. 2 (information-driven bars) and Ch. 5 (fractional differentiation)
- Easley, López de Prado, O’Hara (2012): the original VPIN paper
- López de Prado (2018), “The 10 Reasons Most Machine Learning Funds Fail”: lists feature-engineering mistakes you should avoid