de Prado Features

Fractional differentiation + VPIN + information-driven bars

de Prado’s Advances in Financial Machine Learning proposes several feature engineering techniques that address structural issues with naive financial features. This page covers the three most important: fractional differentiation, VPIN, and information-driven bars.

Fractional differentiation

The stationarity dilemma

Price levels carry long memory but are non-stationary, so their statistical properties drift over time. Returns (full first differences) are stationary but discard almost all of that memory. The dilemma is that the standard fix for non-stationarity removes much of the signal a model could learn from.

The solution: fractional differences

Instead of taking a full first difference (r_t = p_t - p_{t-1}), differentiate the price series to a fractional order d, where 0 < d < 1. This moves the series toward stationarity while keeping part of its memory.

The formula, in expanded form:

Δ^d p(t) = p(t) - d·p(t-1) + d(d-1)/2·p(t-2) - d(d-1)(d-2)/6·p(t-3) + ...

As d → 1, you approach full differencing (returns). As d → 0, you approach no differencing (levels). Somewhere in between (typically d ≈ 0.3 to 0.5 for daily equity prices) you get:

Stationarity (the Augmented Dickey-Fuller test rejects a unit root)
Preserved memory (correlation with the original level remains high)
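
To make the expansion concrete, the sketch below computes the first few coefficients of the formula for a given d, using the recursion w_k = -w_{k-1}·(d - k + 1)/k with w_0 = 1. The helper name `frac_diff_weights` is illustrative and not part of any library.

```python
def frac_diff_weights(d: float, n: int) -> list[float]:
    """Return the first n (n >= 1) coefficients of the fractional-difference expansion."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


# d = 1 recovers the ordinary first difference: p(t) - p(t-1).
print(frac_diff_weights(1.0, 4))

# d = 0.4 keeps small, slowly decaying weights on older prices.
print([round(w, 4) for w in frac_diff_weights(0.4, 5)])
```

For d = 0.4 the coefficients work out to roughly 1, -0.4, -0.12, -0.064, -0.0416: every past price still contributes, but with shrinking weight. For d = 1 everything after the second coefficient vanishes, which is the full difference.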

Implementation

python

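from typing import Sequence

def frac_diff(
    series: Sequence[float],
    d: float,
    threshold: float = 1e-4,
) -> list[float]:
    """Fractional differentiation of a time series.

    Uses the fixed-width window method from de Prado Ch. 5: weights are
    computed until their magnitude drops below `threshold`, and the same
    truncated weight vector is applied at every step. Outputs are NaN
    until the window is full, so the result has the same length as the input.
    """
    # Binomial-series weights: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    weights = [1.0]
    k = 1
    while True:
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
        k += 1

    # weights[i] multiplies series[t - i], so weights[0] applies to the
    # most recent observation.
    window = len(weights)
    out = [float("nan")] * min(window - 1, len(series))
    for t in range(window - 1, len(series)):
        v = 0.0
        for i, w in enumerate(weights):
            v += w * series[t - i]
        out.append(v)
    return out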

Finding the right `d`

Try many values

Run frac_diff for d ∈ {0.1, 0.2, ..., 0.9}.

Test stationarity for each

Run Augmented Dickey-Fuller on each series.

Pick the smallest d that passes

The smallest d that gives a stationary series preserves the most memory.

python

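import math

from statsmodels.tsa.stattools import adfuller

# `prices` is your price series (a sequence of floats). It must be much
# longer than the weight window, which grows as d shrinks.
for d in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    fd = frac_diff(prices, d=d)
    # Drop the NaN warm-up values before testing.
    fd_clean = [x for x in fd if not math.isnan(x)]
    adf_stat, p_value, *_ = adfuller(fd_clean)
    print(f"d={d}: ADF p={p_value:.4f}")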

The smallest d with p_value < 0.05 is a reasonable fractional difference order for your data: it is the least differencing that still rejects a unit root at the 5% level.

As a Horizon feature

python

from horizon.features.base import Feature, PriceHistory
from horizon.context import FeedData

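class FracDiff(Feature):
    def __init__(
        self,
        d: float = 0.4,
        window: int = 100,
        market: str | None = None,
        threshold: float = 1e-3,
    ):
        super().__init__(market=market)
        self.d = d
        self.window = window
        # The truncated weight vector must fit inside `window` prices,
        # otherwise frac_diff yields only NaN. With d=0.4, the 1e-4 default
        # needs well over 100 weights; 1e-3 needs fewer than 100.
        self.threshold = threshold

    def compute(self, market_id, history: PriceHistory, feeds):
        prices = history.last_n_prices(self.window)
        if len(prices) < self.window:
            return float("nan")
        # frac_diff is the function from the Implementation section.
        fd = frac_diff(prices, d=self.d, threshold=self.threshold)
        return fd[-1]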

Use in a strategy:

python

class FracDiffStrat(Strategy):
    features = {"fd": FracDiff(d=0.4, window=100)}

    def evaluate(self, f, universe):
        # The fractional difference is stationary, so you can compare
        # its current value against its own history
        ...
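
One way to act on the comment above is to standardise the latest fractional-difference value against its recent history. The sketch below is plain Python and independent of Horizon. `zscore_latest` is a hypothetical helper, and how you collect the history of feature values inside `evaluate` depends on the Strategy API, which is not shown here.

```python
import math
from statistics import mean, pstdev


def zscore_latest(values: list[float]) -> float:
    """Z-score of the last value relative to the values before it.

    NaN entries in the history (e.g. warm-up outputs) are ignored. Returns
    NaN when the latest value is NaN, when there are fewer than two usable
    history points, or when the history has zero variance.
    """
    if not values or math.isnan(values[-1]):
        return float("nan")
    latest = values[-1]
    history = [v for v in values[:-1] if not math.isnan(v)]
    if len(history) < 2:
        return float("nan")
    sigma = pstdev(history)
    if sigma == 0:
        return float("nan")
    return (latest - mean(history)) / sigma
```

A large positive or negative z-score says the current value is unusual relative to its own past, which is only a meaningful statement because the fractionally differenced series is stationary.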

VPIN

VPIN is covered in detail in ExecutionIntelligence. Here’s the short version as a feature:

python

from horizon.fund._execution_intelligence import ExecutionIntelligence
from horizon.features.base import Feature

class VPINFeature(Feature):
    """Volume-synchronized Probability of Informed Trading as a feature."""

    def __init__(self, bucket_volume: float = 10_000, market: str | None = None):
        super().__init__(market=market)
        self._exec_int = ExecutionIntelligence(bucket_volume=bucket_volume)

    def compute(self, market_id, history, feeds):
        # You need to have been feeding trades into self._exec_int
        result = self._exec_int.get_vpin(market_id)
        return result.vpin if result else 0.5   # neutral default
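
For readers without the ExecutionIntelligence page at hand, here is a rough standalone sketch of what a VPIN number measures: trades are grouped into equal-volume buckets, each bucket's volume is split into buy and sell volume, and VPIN is the average absolute imbalance over recent buckets as a fraction of bucket volume. The sketch classifies trades with a simple tick rule, which is an assumption made for brevity (the original VPIN work uses bulk volume classification). `vpin_tick_rule` is a hypothetical function and does not reflect how `ExecutionIntelligence` is implemented.

```python
from typing import Iterable, Optional


def vpin_tick_rule(
    trades: Iterable[tuple[float, float]],
    bucket_volume: float,
    n_buckets: int = 50,
) -> Optional[float]:
    """Estimate VPIN from (price, volume) trades using the tick rule.

    Returns None until at least one volume bucket has been completed.
    """
    if bucket_volume <= 0:
        raise ValueError("bucket_volume must be positive")

    imbalances: list[float] = []  # |buy - sell| for each completed bucket
    filled = 0.0                  # volume in the current bucket
    imbalance = 0.0               # signed buy-minus-sell volume in the bucket
    last_price = None
    side = 1  # +1 = buy, -1 = sell; the first trade is treated as a buy

    for price, volume in trades:
        # Tick rule: uptick = buy, downtick = sell, unchanged = previous side.
        if last_price is not None and price != last_price:
            side = 1 if price > last_price else -1
        last_price = price

        remaining = volume
        while remaining > 0:
            # A large trade may spill over into several buckets.
            fill = min(bucket_volume - filled, remaining)
            filled += fill
            imbalance += side * fill
            remaining -= fill
            if filled >= bucket_volume:
                imbalances.append(abs(imbalance))
                filled = imbalance = 0.0

    if not imbalances:
        return None
    recent = imbalances[-n_buckets:]
    return sum(recent) / (len(recent) * bucket_volume)
```

The result lies between 0 (buy and sell volume balanced in every bucket) and 1 (every bucket entirely one-sided).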

Information-driven bars

The problem with time bars

Most charting systems use time bars: a bar per minute, per hour, per day. But trading activity is not uniform in time: some minutes have 100x the volume of others. Time bars treat high-volume minutes the same as dead-quiet minutes, which:

Produces serial dependence (correlation between consecutive bars)
Oversamples quiet periods (lunch, overnight), when bars carry little information
Misses the structure of informed trading, which clusters in volume spikes

Alternatives

Tick bars: A new bar every N trades. Normalizes for trade count but not size.

Volume bars: A new bar every V units of volume. Normalizes for actual activity.

Dollar bars: A new bar every D dollars of notional. Normalizes for dollar-weighted activity. **This is what de Prado recommends for equities.**

Imbalance bars: A new bar whenever the order-flow imbalance exceeds a threshold. Bars correspond to moments of "information arrival."
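
Imbalance bars are the least obvious of the four, so here is a simplified sketch. It signs each tick with the tick rule and closes a bar when the running signed tick count reaches a fixed threshold. The fixed threshold is an assumption made for brevity: de Prado's version estimates the threshold dynamically from expected imbalance. The function name is hypothetical, and ticks are assumed to be dicts with "timestamp", "price", and "volume" keys, as in the dollar bar example on this page.

```python
def build_tick_imbalance_bars(
    ticks: list[dict],
    imbalance_threshold: int = 50,
) -> list[dict]:
    """Group ticks into bars that close when |signed tick count| hits a threshold."""
    bars = []
    current: list[dict] = []
    imbalance = 0
    last_price = None
    sign = 1  # tick-rule sign; carried forward when the price is unchanged

    for t in ticks:
        price = t["price"]
        if last_price is not None and price != last_price:
            sign = 1 if price > last_price else -1
        last_price = price

        current.append(t)
        imbalance += sign

        if abs(imbalance) >= imbalance_threshold:
            prices = [x["price"] for x in current]
            bars.append({
                "start_time": current[0]["timestamp"],
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": sum(x["volume"] for x in current),
                "imbalance": imbalance,
            })
            current, imbalance = [], 0
    return bars
```

A one-sided run of ticks closes a bar quickly, while balanced two-way flow can go on for a long time without producing one. As in the dollar bar example, a trailing incomplete bar is discarded.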

Dollar bar example

python

def _new_bar() -> dict:
    """Return an empty bar accumulator."""
    return {
        "start_time": None,
        "open": None,
        "high": float("-inf"),
        "low": float("inf"),
        "close": None,
        "volume": 0.0,
        "dollar_volume": 0.0,
    }


def build_dollar_bars(
    ticks: list[dict],
    dollar_threshold: float = 1_000_000,
) -> list[dict]:
    """Group ticks into dollar bars.

    Each tick is a dict with "timestamp", "price", and "volume" keys.
    A trailing bar that has not reached the threshold is discarded.
    """
    bars = []
    current = _new_bar()
    for t in ticks:
        if current["open"] is None:
            # First tick of a new bar.
            current["open"] = t["price"]
            current["start_time"] = t["timestamp"]
        current["high"] = max(current["high"], t["price"])
        current["low"] = min(current["low"], t["price"])
        current["close"] = t["price"]
        current["volume"] += t["volume"]
        current["dollar_volume"] += t["price"] * t["volume"]

        if current["dollar_volume"] >= dollar_threshold:
            # Threshold reached: close this bar and start the next one.
            bars.append(current)
            current = _new_bar()
    return bars

Why it matters

De Prado argues that dollar bars give classifiers better-behaved training data than time bars:

Serial correlation is lower (bars are closer to independent)
Feature distributions are better behaved (returns are closer to normal)
Sample efficiency is higher: fewer bars are needed for statistical significance
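
These properties are worth checking on your own data. A minimal sketch for the first one: compute log returns from bar closes and their lag-1 autocorrelation, then compare the number for time bars and dollar bars built from the same ticks. The function names are illustrative.

```python
import math


def log_returns(closes: list[float]) -> list[float]:
    """Log returns between consecutive bar closes."""
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]


def lag1_autocorr(x: list[float]) -> float:
    """Sample lag-1 autocorrelation; NaN if it cannot be computed."""
    n = len(x)
    if n < 2:
        return float("nan")
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        return float("nan")
    num = sum((x[i] - m) * (x[i - 1] - m) for i in range(1, n))
    return num / denom
```

Typical usage would be `lag1_autocorr(log_returns([bar["close"] for bar in bars]))`, run once per bar type. A value closer to zero means consecutive bars are less correlated.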

Integration with Horizon

For now, the approach is:

Run dollar bar construction offline using your tick data
Save the resulting dollar bars as a CSV / parquet file
Load via a custom DataSource that yields them in chronological order
Build features on top of the dollar-bar time series

This gives you de Prado’s information-driven bars within the existing Horizon pipeline.
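
The offline steps above could look roughly like the sketch below, which uses only the standard library's `csv` module. The column list matches the dollar bar example on this page. `save_bars_csv` and `iter_bars_csv` are hypothetical helpers, and wrapping the iterator in a Horizon DataSource is left out because that interface is not shown here.

```python
import csv
from typing import Iterator

BAR_FIELDS = ["start_time", "open", "high", "low", "close", "volume", "dollar_volume"]


def save_bars_csv(bars: list[dict], path: str) -> None:
    """Write bars to a CSV file, one row per bar."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=BAR_FIELDS)
        writer.writeheader()
        writer.writerows(bars)


def iter_bars_csv(path: str) -> Iterator[dict]:
    """Yield bars from a CSV file in chronological order.

    Assumes start_time is numeric (e.g. a Unix timestamp). All columns
    are parsed back to float.
    """
    with open(path, newline="") as f:
        rows = [{k: float(v) for k, v in row.items()} for row in csv.DictReader(f)]
    rows.sort(key=lambda r: r["start_time"])
    yield from rows
```

Sorting on load is a cheap safeguard in case bars from several files or sessions were concatenated out of order.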
