de Prado Features

Fractional differentiation + VPIN + information-driven bars

de Prado’s Advances in Financial Machine Learning proposes several feature engineering techniques that address structural issues with naive financial features. This page covers the three most important: fractional differentiation, VPIN, and information-driven bars.

Fractional differentiation

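The stationarity dilemma

Price levels carry long memory but are non-stationary, and most models assume stationary inputs. Returns (a full first difference) are stationary but discard almost all of that memory. You are forced to choose between a series that models can use and a series that still contains the signal.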

The solution: fractional differences

Instead of taking a full difference (r_t = p_t - p_{t-1}), take a fractional one: difference p to order d, where 0 < d < 1. This trades a little memory for stationarity instead of giving up memory entirely.

The formula, in expanded form:

Δ^d p(t) = p(t) - d·p(t-1) + d(d-1)/2·p(t-2) - d(d-1)(d-2)/6·p(t-3) + ...

As d → 1, you approach full differencing (returns). As d → 0, you approach no differencing (levels). Somewhere in between (typically d ≈ 0.3 to 0.5 for daily equity prices) you get:

  • Stationarity (the Augmented Dickey-Fuller test rejects a unit root)
  • Preserved memory (correlation with the original level remains high)
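
The coefficients in the expansion follow a one-line recurrence: w_0 = 1 and w_k = -w_{k-1}·(d - k + 1)/k. A minimal sketch (the helper name `frac_weights` is illustrative, not part of any library) makes the two limits visible:

```python
def frac_weights(d: float, n: int) -> list[float]:
    """First n weights of the fractional difference operator of order d."""
    weights = [1.0]
    for k in range(1, n):
        # Same recurrence as the expanded formula: each weight is the
        # previous one multiplied by -(d - k + 1) / k
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


for d in (1.0, 0.5, 0.0):
    print(d, frac_weights(d, 4))
```

With d = 1 only the first two weights are non-zero (1 and -1), which is the ordinary first difference. With d = 0 only the first weight survives, so the series is left as levels. For d = 0.5 the weights are 1, -0.5, -0.125, -0.0625: they decay slowly rather than cutting off, which is where the retained memory comes from.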

Implementation

python
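from typing import Sequence

def frac_diff(
    series: Sequence[float],
    d: float,
    threshold: float = 1e-4,
) -> list[float]:
    """Fractional differentiation of a time series.

    Uses the fixed-width window method from de Prado Ch. 5: weights are
    computed until their magnitude drops below `threshold`, and the same
    truncated weight vector is applied at every point. The first
    `len(weights) - 1` outputs are NaN because the window is not yet full.
    """
    # w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k
    weights = [1.0]
    k = 1
    while True:
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
        k += 1

    window = len(weights)
    # Pad the warm-up with NaN, but never beyond the input length
    out = [float("nan")] * min(window - 1, len(series))
    for t in range(window - 1, len(series)):
        # weights[0] applies to the newest price, weights[i] to i steps back
        v = 0.0
        for i, w in enumerate(weights):
            v += w * series[t - i]
        out.append(v)
    return out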
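
Because the window length is set by the weight cutoff, it helps to know how many prices `frac_diff` consumes before it emits its first non-NaN value. The sketch below counts the weights kept for a given d and threshold, using the same recurrence; the helper name `ffd_window` and the `max_weights` safety cap are assumptions added for illustration.

```python
def ffd_window(d: float, threshold: float = 1e-4, max_weights: int = 1_000_000) -> int:
    """Number of leading weights whose magnitude is at least `threshold`."""
    w = 1.0
    k = 1  # w_0 = 1 is always kept
    while k < max_weights:
        w = -w * (d - k + 1) / k
        if abs(w) < threshold:
            break
        k += 1
    return k


for threshold in (1e-3, 1e-4, 1e-5):
    print(f"d=0.4, threshold={threshold:g}: window={ffd_window(0.4, threshold)}")
```

A tighter threshold lengthens the window. If the input series is shorter than the window, every output is NaN, so it is worth checking this number before choosing how much history to pass in.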

Finding the right d

  1. Try many values: run frac_diff for d ∈ {0.1, 0.2, ..., 0.9}.
  2. Test stationarity for each: run Augmented Dickey-Fuller on each series.
  3. Pick the smallest d that passes: the smallest d that gives a stationary series preserves the most memory.

python
import math

from statsmodels.tsa.stattools import adfuller

# `prices` is your price series (a list of floats)
for d in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    fd = frac_diff(prices, d=d)
    # Drop the NaN warm-up values before running the test
    fd_clean = [x for x in fd if not math.isnan(x)]
    adf_stat, p_value, *_ = adfuller(fd_clean)
    print(f"d={d}: ADF p={p_value:.4f}")

The smallest d with p_value < 0.05 is the order to use for your data: it is the least differencing that still rejects a unit root at the 5% level, so it keeps the most memory.
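
The ADF test covers only the stationarity half of the trade-off. To check the memory half, one option is to correlate each differenced series with the original levels. The sketch below assumes Pearson correlation is an acceptable proxy for "preserved memory"; the helper names `pearson` and `memory_retained` are illustrative, and a compact copy of `frac_diff` is inlined so the block runs on its own.

```python
import math
from typing import Sequence


def frac_diff(series: Sequence[float], d: float, threshold: float = 1e-4) -> list[float]:
    """Compact fixed-width-window fractional difference (NaN during warm-up)."""
    weights = [1.0]
    k = 1
    while True:
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
        k += 1
    out = [float("nan")] * min(len(weights) - 1, len(series))
    for t in range(len(weights) - 1, len(series)):
        out.append(sum(w * series[t - i] for i, w in enumerate(weights)))
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def memory_retained(prices: Sequence[float], d: float, threshold: float = 1e-4) -> float:
    """Correlation between the original levels and their fractional difference."""
    fd = frac_diff(prices, d, threshold)
    # Keep only the points where the differenced series is defined
    pairs = [(p, v) for p, v in zip(prices, fd) if not math.isnan(v)]
    if len(pairs) < 2:
        return float("nan")
    levels, diffs = zip(*pairs)
    return pearson(levels, diffs)
```

Reporting this value next to the ADF p-value for each candidate d shows both sides of the trade-off at once.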

As a Horizon feature

python
from horizon.features.base import Feature, PriceHistory
from horizon.context import FeedData

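class FracDiff(Feature):
    def __init__(
        self,
        d: float = 0.4,
        window: int = 100,
        market: str | None = None,
        threshold: float = 1e-3,
    ):
        super().__init__(market=market)
        self.d = d
        self.window = window
        # The weight vector must fit inside `window` prices, otherwise the
        # output is always NaN. At d=0.4 the default 1e-4 cutoff keeps
        # roughly 280 weights, so a looser cutoff is used here.
        self.threshold = threshold

    def compute(self, market_id, history: PriceHistory, feeds):
        prices = history.last_n_prices(self.window)
        if len(prices) < self.window:
            return float("nan")
        fd = frac_diff(prices, d=self.d, threshold=self.threshold)
        # The last element is the fractional difference at the newest price
        return fd[-1]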

Use in a strategy:

python
class FracDiffStrat(Strategy):
    features = {"fd": FracDiff(d=0.4, window=100)}

    def evaluate(self, f, universe):
        # The fractional difference is stationary, so you can compare
        # its current value against its own history
        ...

VPIN

VPIN is covered in detail in ExecutionIntelligence. Here’s the short version as a feature:

python
from horizon.fund._execution_intelligence import ExecutionIntelligence
from horizon.features.base import Feature

class VPINFeature(Feature):
    """Volume-synchronized Probability of Informed Trading as a feature."""

    def __init__(self, bucket_volume: float = 10_000, market: str | None = None):
        super().__init__(market=market)
        self._exec_int = ExecutionIntelligence(bucket_volume=bucket_volume)

    def compute(self, market_id, history, feeds):
        # You need to have been feeding trades into self._exec_int
        result = self._exec_int.get_vpin(market_id)
        return result.vpin if result else 0.5   # neutral default
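
For intuition about the quantity behind this feature, here is a standalone sketch of a VPIN calculation. It is not the ExecutionIntelligence implementation: trades are classified with the simple tick rule rather than the bulk volume classification used in the original paper, and the function name, the (price, volume) trade format, and the `n_buckets` default are assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Optional


def vpin_tick_rule(
    trades: Iterable[tuple[float, float]],
    bucket_volume: float,
    n_buckets: int = 50,
) -> Optional[float]:
    """Average order-flow imbalance over the last `n_buckets` volume buckets.

    `trades` yields (price, volume) pairs in time order. Returns None until
    at least one bucket has been filled.
    """
    if bucket_volume <= 0:
        raise ValueError("bucket_volume must be positive")
    imbalances = deque(maxlen=n_buckets)
    buy = sell = filled = 0.0
    last_price = None
    side = 1  # assumption: the first trade counts as a buy
    for price, volume in trades:
        # Tick rule: uptick = buy, downtick = sell, unchanged = previous side
        if last_price is not None and price != last_price:
            side = 1 if price > last_price else -1
        last_price = price

        remaining = volume
        while remaining > 0:
            room = bucket_volume - filled
            fill = min(room, remaining)  # split large trades across buckets
            if side > 0:
                buy += fill
            else:
                sell += fill
            filled += fill
            remaining -= fill
            if fill >= room:  # bucket is full: record its imbalance
                imbalances.append(abs(buy - sell) / bucket_volume)
                buy = sell = filled = 0.0
    if not imbalances:
        return None
    return sum(imbalances) / len(imbalances)
```

Values near 0 mean buys and sells balance within each bucket; values near 1 mean volume is almost entirely one-sided.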

Information-driven bars

The problem with time bars

Most charting systems use time bars: a bar per minute, per hour, per day. But trading activity is not uniform in time: some minutes have 100x the volume of others. Time bars treat high-volume minutes the same as dead-quiet minutes, which:

  • Produces serial dependence (correlation between consecutive bars)
  • Oversamples quiet periods (lunch, overnight), where bars are mostly noise
  • Misses the structure of informed trading, which clusters in volume spikes

Alternatives

  • Tick bars: a new bar every N trades. Normalizes for trade count but not trade size.
  • Volume bars: a new bar every V units of volume. Normalizes for traded quantity.
  • Dollar bars: a new bar every D dollars of notional. Normalizes for dollar-weighted activity. This is what de Prado recommends for equities.
  • Imbalance bars: a new bar whenever the order-flow imbalance exceeds a threshold. Bars correspond to moments of "information arrival." In de Prado's taxonomy these are the information-driven bars proper; tick, volume, and dollar bars are classed as standard bars.
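
Imbalance bars are the least familiar of the four, so a sketch may help. The version below is a simplification: it signs each tick's volume with the tick rule and closes a bar when the absolute cumulative signed volume reaches a fixed threshold, whereas de Prado's imbalance bars compare against a dynamically estimated expected imbalance. The function name and the tick dict keys ("price", "volume") are assumptions that follow the tick format used in the dollar bar example on this page.

```python
def build_imbalance_bars(ticks: list[dict], imbalance_threshold: float) -> list[dict]:
    """Group ticks into volume imbalance bars using a fixed threshold."""
    bars = []
    current = []  # ticks in the bar being built
    imbalance = 0.0
    last_price = None
    sign = 1  # assumption: the first tick counts as a buy
    for t in ticks:
        # Tick rule: keep the previous sign when the price is unchanged
        if last_price is not None and t["price"] != last_price:
            sign = 1 if t["price"] > last_price else -1
        last_price = t["price"]

        current.append(t)
        imbalance += sign * t["volume"]

        if abs(imbalance) >= imbalance_threshold:
            prices = [x["price"] for x in current]
            bars.append({
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": sum(x["volume"] for x in current),
                "imbalance": imbalance,
                "n_ticks": len(current),
            })
            current = []
            imbalance = 0.0
    # Ticks after the last completed bar are dropped
    return bars
```

One-sided flow closes bars quickly, while balanced flow can run for many ticks without closing one, which is the sense in which bars track information arrival.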

Dollar bar example

python
def _new_bar() -> dict:
    """Empty bar accumulator."""
    return {
        "start_time": None,
        "open": None,
        "high": float("-inf"),
        "low": float("inf"),
        "close": None,
        "volume": 0,
        "dollar_volume": 0,
    }


def build_dollar_bars(
    ticks: list[dict],
    dollar_threshold: float = 1_000_000,
) -> list[dict]:
    """Group ticks into dollar bars.

    Each tick is a dict with "timestamp", "price", and "volume" keys.
    A bar closes once its accumulated notional reaches `dollar_threshold`.
    Ticks left over after the last completed bar are discarded.
    """
    bars = []
    current = _new_bar()
    for t in ticks:
        # The first tick of a bar sets its open and start time
        if current["open"] is None:
            current["open"] = t["price"]
            current["start_time"] = t["timestamp"]
        current["high"] = max(current["high"], t["price"])
        current["low"] = min(current["low"], t["price"])
        current["close"] = t["price"]
        current["volume"] += t["volume"]
        current["dollar_volume"] += t["price"] * t["volume"]

        if current["dollar_volume"] >= dollar_threshold:
            bars.append(current)
            current = _new_bar()
    return bars

Why it matters

Dollar bars tend to give classifiers better-behaved training data than time bars:

  • Serial correlation is lower (bars are closer to independent)
  • Features are better distributed (closer to normal)
  • Sample efficiency is higher: fewer bars are needed for significance
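
The serial-correlation claim is easy to check on your own data. The sketch below computes the lag-1 autocorrelation of log returns from a list of bar closes; running it on time-bar closes and on dollar-bar closes for the same instrument gives a direct comparison. The helper names are illustrative.

```python
import math
from typing import Sequence


def log_returns(closes: Sequence[float]) -> list[float]:
    """Log return between each pair of consecutive closes."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]


def lag1_autocorr(x: Sequence[float]) -> float:
    """Lag-1 autocorrelation; NaN if undefined (too short or constant)."""
    n = len(x)
    if n < 2:
        return float("nan")
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return float("nan")
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return num / denom
```

A value closer to zero for the dollar-bar series than for the time-bar series is what the first bullet predicts.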

Integration with Horizon

For now, the approach is:

  1. Run dollar bar construction offline using your tick data
  2. Save the resulting dollar bars as a CSV / parquet file
  3. Load via a custom DataSource that yields them in chronological order
  4. Build features on top of the dollar-bar time series

This gives you de Prado-style dollar bars within the existing Horizon pipeline.
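
Steps 2 and 3 can be sketched with the standard library alone. The code below writes bars to CSV and reads them back sorted by start time. The function names and column list are assumptions based on the dollar bar example on this page, `start_time` is assumed to be numeric (for example epoch seconds), and the generator is only the part a custom DataSource would wrap, not Horizon's DataSource interface itself.

```python
import csv
from typing import Iterator

# Assumed columns, matching the keys produced by build_dollar_bars
BAR_FIELDS = ["start_time", "open", "high", "low", "close", "volume", "dollar_volume"]


def save_bars(bars: list[dict], path: str) -> None:
    """Write bars to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=BAR_FIELDS)
        writer.writeheader()
        writer.writerows(bars)


def iter_bars(path: str) -> Iterator[dict]:
    """Yield bars in chronological order with numeric fields parsed."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        for key in BAR_FIELDS:
            row[key] = float(row[key])
    # Sort explicitly so the consumer never sees bars out of order
    rows.sort(key=lambda r: r["start_time"])
    yield from rows
```

Sorting on load is a defensive choice: it keeps the downstream feature code correct even if the offline job wrote bars out of order.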

Further reading

  • “Advances in Financial Machine Learning” Ch. 2 (information-driven bars) and Ch. 5 (fractional differentiation)
  • Easley, López de Prado, O’Hara (2012): the original VPIN paper
  • de Prado (2018), “The 10 Reasons Most Machine Learning Funds Fail”: lists feature-engineering mistakes you should avoid
