Policy reverse-engineering
Shadow policy (decision tree + gradient-boosted classifier + SHAP) is the default, interpretable path. MaxEnt IRL, AIRL, and GAIL arrive in v0.3; the torch-based fitters need the horizon[flow-irl] extras, while MaxEnt IRL runs in pure numpy.
The policy layer answers “what does this bot do?”. Given an actor’s observed trajectory of (state, action) pairs, it infers the conditions under which the actor acts.
Two paths:
- Shadow policy (default, v0.1). Decision-tree + gradient-boosted classifier trained on observed (state, action) pairs. Produces human-readable rules. Interpretable. Fast. Ships today.
- Inverse RL (v0.3). MaxEnt IRL, AIRL, GAIL. Recovers a reward function rather than a policy. Theoretically stronger claim; computationally heavy; less directly interpretable.
The default is Shadow. Most compliance use cases want interpretability, not reward recovery. Compliance reviewers can read a decision tree; recovering a torch model and explaining what “the implied reward function” means is a harder sell.
State features
Both paths consume the same feature vector. PolicyFeatureExtractor builds it from a MarketEvent stream:
from datetime import datetime, timezone

from horizon.flow.policy.features import PolicyFeatureExtractor, FEATURE_NAMES

ext = PolicyFeatureExtractor()
# stream is the caller's iterable of MarketEvent objects.
for ev in stream:
    ext.observe(ev)

features = ext.featurize(
    actor_id="0xabc...",
    market_id="0xTRUMP...",
    now=datetime.now(timezone.utc),
)
# {'ofi_5s': 0.32, 'ofi_30s': 0.41, 'spread_bps': 12.3, ...}
FEATURE_NAMES is the canonical ordering. Every PolicyModel records the feature names it was fit with; downstream code matching features at inference time must use the exact same order and semantics.
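One way to honor the ordering contract at inference time is to rebuild the vector from the names stored with the model instead of relying on dict iteration order. The sketch below is illustrative only: `to_vector` and the three-name `feature_names` list are hypothetical stand-ins, not part of the horizon API.

```python
from typing import Mapping, Sequence


def to_vector(features: Mapping[str, float], feature_names: Sequence[str]) -> list[float]:
    """Order a feature dict by the names a model was fit with (hypothetical helper)."""
    missing = [name for name in feature_names if name not in features]
    if missing:
        # Fail loudly instead of silently shifting columns.
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in feature_names]


# Stand-in for the names recorded on a fitted model; the real list is FEATURE_NAMES.
feature_names = ["ofi_5s", "ofi_30s", "spread_bps"]
# Dict insertion order differs from the fitted order on purpose.
features = {"spread_bps": 12.3, "ofi_5s": 0.32, "ofi_30s": 0.41}
vector = to_vector(features, feature_names)
```

Raising on a missing name is a design choice in this sketch: a silent default would keep the vector length correct while changing its semantics.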
Feature set:
| Feature | Description |
|---|---|
| ofi_5s, ofi_30s, ofi_5m | Order flow imbalance at three horizons (Cont 2014) |
| spread_bps | Top-of-book spread in bps of mid |
| depth_imbalance | (bid_5 - ask_5) / (bid_5 + ask_5) over the top 5 levels |
| mid_return_5s, mid_return_1m | Short-horizon mid drift |
| realized_vol_5m | Rolling realized vol of log-returns |
| hour_of_day, minute_of_hour | UTC time of decision |
| own_inventory | Running position from observed fills |
| own_recent_trades_1m, own_recent_cancels_1m | Per-actor pace |
| book_snapshot_age_s | Freshness of the depth observation |
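Two of the book-derived features follow directly from the definitions in the table. The functions below are a minimal sketch of those formulas, not the PolicyFeatureExtractor internals; the function names, argument layout, and the empty-book convention are assumptions made for illustration.

```python
def spread_bps(best_bid: float, best_ask: float) -> float:
    """Top-of-book spread expressed in basis points of the mid price."""
    mid = (best_bid + best_ask) / 2.0
    return (best_ask - best_bid) / mid * 1e4


def depth_imbalance(bid_sizes: list[float], ask_sizes: list[float], levels: int = 5) -> float:
    """(bid_5 - ask_5) / (bid_5 + ask_5) summed over the top `levels` levels."""
    bid = sum(bid_sizes[:levels])
    ask = sum(ask_sizes[:levels])
    total = bid + ask
    # Assumption: an empty book maps to 0.0 (no imbalance) rather than raising.
    return 0.0 if total == 0 else (bid - ask) / total


example_spread = spread_bps(best_bid=0.49, best_ask=0.51)
example_imbalance = depth_imbalance([300, 200, 100, 50, 50], [100, 100, 50, 25, 25])
```

With a 0.49 / 0.51 book the spread is about 400 bps of the 0.50 mid, and 700 units of bid depth against 300 of ask depth gives an imbalance of 0.4.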
ShadowPolicyFitter
Default path. horizon.flow.policy.ShadowPolicyFitter fits a decision tree (for rule extraction) and a gradient-boosted classifier (for accuracy) on the same (state, action) pairs, then summarizes via SHAP (if installed) or sklearn feature-importances.
Usage
from horizon.flow.config import FlowConfig
from horizon.flow.policy.features import FEATURE_NAMES
from horizon.flow.policy.shadow import ShadowPolicyFitter

fitter = ShadowPolicyFitter(FlowConfig())
model = fitter.fit(
    actor_id="0xabc...",
    trajectories=[
        # Each entry is (state feature dict, action label).
        ({"ofi_5s": 0.4, "spread_bps": 8.0}, "buy"),    # ... remaining features elided
        ({"ofi_5s": -0.5, "spread_bps": 15.0}, "sell"),  # ... remaining features elided
        # ... at least FlowConfig.policy.policy_min_events of these
    ],
    feature_names=FEATURE_NAMES,
)
# PolicyModel with model_blob (pickled sklearn), top_rule, holdout_accuracy, SHAP summary.
Below FlowConfig.policy.policy_min_events (default 200) the fitter returns None. Small samples produce unreliable rules that would mislead compliance.
Output
PolicyModel.summary contains:
- method. "shap_tree_explainer" if SHAP is installed, "gbdt_feature_importances" otherwise.
- top_features. Ranked list of {"feature", "mean_abs_shap" | "importance"} entries, truncated to shap_top_k (default 10).
- rules. Up to 10 decision-tree leaves as {"description", "support", "positive_fraction", "predicted_action"}. The decision tree is capped at depth 6 so rules stay readable.
- class_labels. Sorted list of observed actions (e.g. ["buy", "hold", "sell"]).
- holdout_accuracy. Score on the reserved holdout slice (default 20%).
PolicyModel.top_rule is the single highest-support decision-tree rule, rendered as a plain-text condition like “action when feature1 ≤ t1 AND feature2 > t2 AND …”. This is exactly the line a compliance reviewer wants.
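To make the shape of top_rule concrete, here is a small sketch of picking the highest-support leaf and rendering its path as text. The leaf dictionaries and the `render_rule` helper are hypothetical; the fitter's actual internal representation is not documented here.

```python
def render_rule(action: str, conditions: list[tuple[str, str, float]]) -> str:
    """Join (feature, operator, threshold) triples into one plain-text rule."""
    clauses = [f"{feature} {op} {threshold:g}" for feature, op, threshold in conditions]
    return f"{action} when " + " AND ".join(clauses)


# Hypothetical leaves: each carries its predicted action, path conditions, and support.
leaves = [
    {"action": "sell", "conditions": [("ofi_5s", "≤", -0.2)], "support": 95},
    {"action": "buy", "conditions": [("ofi_5s", ">", 0.3), ("spread_bps", "≤", 10.0)], "support": 210},
]

# The top rule is the leaf covering the most training samples.
top = max(leaves, key=lambda leaf: leaf["support"])
top_rule = render_rule(top["action"], top["conditions"])
```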
Determinism
FlowConfig.seed drives the train/test split (via random.Random(seed)) and the sklearn estimators’ random_state. Same seed + same trajectories → same model.
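The split half of that guarantee can be sketched with the standard library alone. `split_holdout` below is a hypothetical helper, assuming a shuffle-then-slice split with a 20% holdout; the fitter's real split logic may differ in detail.

```python
import random


def split_holdout(items: list, seed: int, holdout_fraction: float = 0.2) -> tuple[list, list]:
    """Shuffle with a private RNG and reserve the tail as a holdout slice."""
    rng = random.Random(seed)  # private RNG: global random state is left alone
    shuffled = list(items)     # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]


data = list(range(10))
train_a, test_a = split_holdout(data, seed=7)
train_b, test_b = split_holdout(data, seed=7)
```

Because the RNG is constructed from the seed on every call, two runs over the same trajectories produce identical train and holdout slices.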
Dependencies
Requires scikit-learn (in horizon[flow]). Optional shap for SHAP summaries; fallback is sklearn feature_importances_.
Inverse RL
horizon.flow.policy.irl answers the complementary question to shadow policy: not “what does the actor do” but “what does the actor APPEAR TO BE OPTIMIZING.” The output is a reward function under which the observed behavior is the maximum-entropy policy (Ziebart et al., 2008).
MaxEntIRLFitter (v0.3.0, shipped)
Fully implemented in pure numpy. No torch, no extras required. The algorithm:
- Feature selection. The top-K most variable features from the trajectory data become the MDP’s state features. FlowConfig.policy.irl_n_features (default 4) sets K.
- Discretization. Each selected feature is binned into equal-frequency quantile bins. irl_n_bins (default 5) sets the bin count. State space size is n_bins^n_features. Default 5⁴ = 625 states.
- Empirical transition model. P(s' | s, a) is estimated from consecutive trajectory transitions with Laplace smoothing. Unseen (s, a) pairs fall back to uniform over s'.
- Soft value iteration. Q(s, a) = R(s, a) + γ · Σ_{s'} P(s'|s,a) · logsumexp_{a'} Q(s', a') under current θ. Discount γ defaults to 0.9.
- Gradient ascent on θ. The gradient of the L2-regularized MaxEnt log-likelihood is ∇L = f_expert - f_policy - 2·λ·θ, where f_expert is the empirical per-(state, action) visitation rate in expert data and f_policy is the visitation expected under the learned soft-max policy. L2 regularization λ defaults to 0.01.
- Convergence. Stop when ||∇L|| < irl_convergence_tol (default 1e-4) or after irl_max_iterations (default 80) outer steps.
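The soft value iteration step is the least familiar part of the loop, so here is a self-contained sketch of it on a toy MDP. It covers only the soft Bellman backup and the resulting soft-max policy, not feature selection, binning, or the θ update. It is written in plain Python rather than numpy for readability, and all names and the 2-state example are assumptions for illustration.

```python
import math


def logsumexp(values: list[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def soft_value_iteration(reward, transition, gamma=0.9, n_iters=200):
    """Return soft Q-values for reward[s][a] and transition[s][a][s']."""
    n_states, n_actions = len(reward), len(reward[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [logsumexp(row) for row in q]  # soft state value V(s)
        q = [
            [
                reward[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(transition[s][a]))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q


def soft_policy(q):
    """pi(a | s) = exp(Q(s, a) - V(s)), the maximum-entropy policy."""
    return [[math.exp(qa - logsumexp(row)) for qa in row] for row in q]


# Toy 2-state, 2-action MDP: action 0 stays put, action 1 switches state.
transition = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Reward only for acting from state 1 (same for both actions).
reward = [[0.0, 0.0], [1.0, 1.0]]
policy = soft_policy(soft_value_iteration(reward, transition))
```

In this toy setup the policy prefers switching out of state 0 and staying in state 1, while still assigning nonzero probability to the other action. That residual probability mass is what makes the expected visitation f_policy differentiable in θ.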
Usage
import horizon as hz
from horizon.flow.config import FlowConfig
from horizon.flow.policy.features import PolicyFeatureExtractor, FEATURE_NAMES
from horizon.flow.policy.irl import MaxEntIRLFitter

# Build trajectories the same way as for the shadow path.
# event_stream and target_wallet are supplied by the caller.
feat = PolicyFeatureExtractor()
trajectories: list[tuple[dict, str]] = []
for ev in event_stream:
    feat.observe(ev)
    if ev.actor_id == target_wallet and ev.side in ("buy", "sell"):
        state = feat.featurize(
            actor_id=target_wallet,
            market_id=ev.market_id,
            now=ev.timestamp,
        )
        trajectories.append((state, ev.side))

model = MaxEntIRLFitter(FlowConfig()).fit(
    actor_id=target_wallet,
    trajectories=trajectories,
    feature_names=FEATURE_NAMES,
    action_space=["buy", "sell", "hold"],
)

# Output
print(model.top_rule)                   # "buy when OFI_5s in ~[0.32]"
print(model.summary["reward_weights"])  # per-action per-state weights
for row in model.summary["top_rewarding_states"][:5]:
    print(f"{row['action']:5s} @ {row['bin_centers']} reward={row['reward']:.3f}")
Output: PolicyModel.summary for MaxEnt
Key fields the reader consumes:
| Key | Meaning |
|---|---|
| method | "maxent_irl" |
| feature_basis | Which features entered the MDP (top-K by variance). |
| n_bins | Per-feature discretization granularity. |
| bin_edges | Per-feature quantile boundaries (for decoding state indices). |
| reward_weights | {action → list[reward per state index]}. The θ matrix. |
| top_rewarding_states | Top-K (bin_centers, action, reward) triples. The direct compliance-readable answer to “what state-action combos does this actor value most?” |
| log_likelihood | Log-likelihood of expert trajectories under the fitted policy (closer to 0 = better fit). |
| iterations_run / final_gradient_norm | Convergence diagnostics. |
| expert_feature_expectations / policy_feature_expectations | The two visitation vectors whose difference drove the gradient. Should align at convergence. |
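Reading reward_weights requires mapping between feature values and state indices. The sketch below shows one plausible encoding, a mixed-radix index over per-feature bin ids. It assumes bin_edges holds only the interior cut points (n_bins - 1 per feature) and that the first feature is the most significant digit; both are assumptions, so check them against the stored summary before relying on this.

```python
import bisect


def encode_state(values: list[float], bin_edges: list[list[float]], n_bins: int) -> int:
    """Map one value per basis feature to a single mixed-radix state index."""
    index = 0
    for value, edges in zip(values, bin_edges):
        # Interior edges only: n_bins - 1 cut points give bin ids 0 .. n_bins - 1.
        bin_id = bisect.bisect_right(edges, value)
        index = index * n_bins + bin_id
    return index


def decode_state(index: int, n_features: int, n_bins: int) -> list[int]:
    """Invert encode_state: recover the per-feature bin ids."""
    bins = []
    for _ in range(n_features):
        index, bin_id = divmod(index, n_bins)
        bins.append(bin_id)
    return bins[::-1]


# Hypothetical two-feature basis (ofi_5s, spread_bps) with 5 bins each.
bin_edges = [[-0.4, -0.1, 0.1, 0.4], [5.0, 10.0, 15.0, 20.0]]
state = encode_state([0.32, 12.3], bin_edges, n_bins=5)
bins = decode_state(state, n_features=2, n_bins=5)
```

With the default four features and five bins, the same scheme yields indices in the range 0 to 624, matching the 625-state space described above.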
Paper reproduction
tests/flow/test_policy_irl.py::test_gridworld_reward_concentrates_near_goal runs the canonical gridworld sanity check from Ziebart et al. (2008): a 5×5 grid with a goal corner and 400 expert trajectories moving toward the goal. The test verifies that the argmax of the recovered reward lies in the goal quadrant. It passes deterministically with the default config.
When to prefer MaxEnt over shadow policy
- Shadow policy answers “what does the actor do at observed states”. High accuracy on observed data, but no guarantee about behavior in state regions the actor hasn’t visited.
- MaxEnt IRL answers “what reward function makes this behavior optimal”. Lower accuracy on observed data (intentionally: it’s a lower-dimensional model), but it generalizes to unobserved state regions via the learned reward.
Use MaxEnt when you want to REASON about the actor: “if OFI_5s dropped to -0.5 in a regime they haven’t seen, what would they do?” The reward function extrapolates; the shadow-policy classifier doesn’t.
Use shadow when you want maximum observed-state prediction accuracy and interpretable rules.
For compliance defense, both answers are defensible; ship both side-by-side in the policy_models table and let the reader pick.
GAILFitter (v0.3.1, shipped: offline variant)
Ho & Ermon (2016) adversarial imitation, adapted for the offline setting.
The original GAIL assumes an on-policy simulator: you roll out trajectories under the current policy, feed them to a discriminator, and update the policy with TRPO/PPO. Market flow doesn’t have a rewindable simulator, so we ship the OFFLINE variant, which stays faithful to the discriminator-as-reward formulation but drops the policy-optimization loop.
Method. Fit a torch discriminator D(s, a) to distinguish expert demonstrations from random-policy samples (same states, actions drawn uniformly from the action space). The learned reward at each observed state is the standard GAIL expression:
r(s, a) = log D(s, a) - log(1 - D(s, a))
Higher log-odds = more expert-like. The reward is non-linear (MLP), so it catches feature interactions MaxEnt’s linear basis can’t represent.
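The two non-neural pieces of the method, building the random-policy negatives and converting a discriminator output into a reward, are easy to sketch without torch. The helper names and the clipping constant below are assumptions for illustration, and the discriminator training itself is omitted.

```python
import math
import random


def sample_negatives(trajectories, action_space, seed=0):
    """Keep each expert state, but redraw the action uniformly at random."""
    rng = random.Random(seed)
    return [(state, rng.choice(action_space)) for state, _ in trajectories]


def gail_reward(d: float, eps: float = 1e-6) -> float:
    """r = log D - log(1 - D), with D clipped away from 0 and 1."""
    d = min(max(d, eps), 1.0 - eps)  # clipping is an assumption to keep the logs finite
    return math.log(d) - math.log(1.0 - d)


expert = [({"ofi_5s": 0.4}, "buy"), ({"ofi_5s": -0.5}, "sell")]
negatives = sample_negatives(expert, ["buy", "sell", "hold"])
rewards = {d: gail_reward(d) for d in (0.1, 0.5, 0.9)}
```

A discriminator output of 0.5 (cannot tell expert from random) maps to a reward of zero; outputs above 0.5 give positive reward and outputs below give negative reward, symmetric around that point.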
Usage.
from horizon.flow.config import FlowConfig
from horizon.flow.policy.features import FEATURE_NAMES
from horizon.flow.policy.irl import GAILFitter

fitter = GAILFitter(FlowConfig())
model = fitter.fit(
    actor_id="0xabc...",
    # state_features / action_label stand for the same (dict, str) pairs as above.
    trajectories=[(state_features, action_label), ...],
    feature_names=FEATURE_NAMES,
    action_space=["buy", "sell", "hold"],
    epochs=50,           # default
    hidden_dim=32,       # default
    learning_rate=1e-3,  # default
)

print(model.top_rule)
# "buy preferred (mean_reward=0.45); top driver = ofi_5s (|grad|=0.38)"
print(model.summary["action_preferences"])
# {"buy": 0.45, "sell": -0.20, "hold": 0.02}
print(model.summary["feature_gradient_importance"])
# {"ofi_5s": 0.38, "spread_bps": 0.12, ...}
Output PolicyModel.summary.
| Key | Meaning |
|---|---|
| method | "gail_offline" |
| reward_network_dims | Discriminator layer widths. |
| training_loss | Final BCE loss. |
| action_preferences | {action → mean learned reward} across observed states. Identifies which action the discriminator considers most expert-like overall. |
| feature_gradient_importance | {feature → mean \|∂reward/∂feature\|}. Interpretable approximation of which features drive the learned reward. |
| feature_basis, action_space, feat_mean, feat_std | Encoding metadata for loading model_blob. |
model_blob is the torch state dict. Reload with torch.load(io.BytesIO(blob)).
Dependency. torch via the [flow-irl] extras. Without it, .fit() raises ModuleNotFoundError with the install command.
AIRLFitter (v0.3.1, shipped: offline state-only variant)
Fu, Luo, Levine (2018). Same discriminator-as-reward spirit as GAIL, but with the disentangled-reward decomposition:
D(s, a) = exp(f(s)) / (exp(f(s)) + π_baseline(a | s))
where f(s) is a state-only reward network and π_baseline(a | s) = 1 / |A| is the uniform policy. At convergence, f(s) recovers the state-only reward component, which is the transferable part of the actor’s reward in AIRL’s sense.
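The discriminator expression can be evaluated without exponentiating f(s) directly, because it is algebraically a sigmoid of f(s) - log π_baseline. The function below is a sketch of that identity with a uniform baseline; the name `airl_discriminator` is hypothetical and the reward network that produces f(s) is out of scope here.

```python
import math


def airl_discriminator(f_s: float, n_actions: int) -> float:
    """D = exp(f(s)) / (exp(f(s)) + 1/|A|) under a uniform baseline policy."""
    log_pi = -math.log(n_actions)
    # Same ratio rewritten as sigmoid(f(s) - log pi_baseline), which avoids
    # overflow when |f(s)| is large.
    z = f_s - log_pi
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# With three actions, f(s) = log(1/3) makes expert and baseline indistinguishable.
d_neutral = airl_discriminator(math.log(1.0 / 3.0), n_actions=3)
d_high = airl_discriminator(2.0, n_actions=3)
```

Note that D depends on the action only through the baseline term, which is constant under a uniform policy, so the learned signal lives entirely in f(s).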
When to prefer vs GAIL.
- GAIL gives you r(s, a). Action-aware. Best for “what does this actor value” when the action choice itself carries information.
- AIRL gives you f(s). Action-agnostic. Best for “what market states does this actor value” independent of what they chose there. Useful when you want to characterize the actor’s preferred regime.
Output PolicyModel.summary.
| Key | Meaning |
|---|---|
| method | "airl_offline" |
| state_reward_top_k | The top-K states (by inferred reward) from the expert set, sorted descending. Each row has the feature dict + scalar reward. |
| feature_gradient_importance | Same shape as GAIL. Average \|∂f/∂feature\| over expert states. |
The state-reward decomposition is the AIRL claim to fame. A reviewer reads the top-K rows and sees which market states this actor finds most valuable. No action attached.
What the offline variants don’t do
The offline simplification has limits worth documenting:
- No policy network, no rollouts. You get a learned reward, not a trained agent. If you want the agent, that’s the simulator-dependent full GAIL. Out of scope for v0.3.1.
- Reward support = demonstration support. The reward function extrapolates only modestly beyond the observed states. Regions with no expert data produce undefined rewards (the discriminator output is arbitrary).
- Random-baseline assumption. GAIL’s reward interpretation assumes the “generator” is a uniform random policy. In the real on-policy algorithm, the generator is the current learned policy, which converges toward the expert. Offline, we stay at the initial assumption. The reward describes “expert vs uniform-random,” not “expert vs converged-policy.”
For compliance use cases, these limitations rarely matter. The reward is a description of observed behavior, not a prediction of unseen behavior. Together, shadow policy, MaxEnt, GAIL, and AIRL give progressively richer views of the same demonstration set:
- Shadow. Interpretable classification (rules).
- MaxEnt. Tabular reward (linear over binned features).
- GAIL. Action-aware nonlinear reward.
- AIRL. State-only nonlinear reward.
Ship all four side-by-side in the policy_models table and let the reader compare.
Why shadow is the default
A compliance reviewer asking “what does this bot do?” gets the same answer from both approaches but in very different forms:
- Shadow. “this wallet buys when ofi_5s > 0.3 AND spread_bps ≤ 10 (support: 210, positive fraction 0.91).” One line, actionable, auditable.
- IRL. “the reward function that best explains this wallet’s behavior weights signed-volatility at 0.42, spread-compression at 0.28, directional drift at 0.18…” Requires a PhD to evaluate.
For the same compliance purpose, the interpretable answer is the more defensible one. IRL becomes genuinely useful when you care about transfer, that is, predicting how the bot would behave under different market conditions. That’s an alpha-research question, not a compliance one, and it lives behind the extras flag.
Common recipe: fit a policy for every active actor
import horizon as hz
from horizon.flow.policy.features import PolicyFeatureExtractor, FEATURE_NAMES
from horizon.flow.policy.shadow import ShadowPolicyFitter

# config (a FlowConfig), stream, and store are supplied by the caller.
fit = ShadowPolicyFitter(config)
ext = PolicyFeatureExtractor()
trajectories: dict[str, list[tuple[dict, str]]] = {}
for ev in stream:
    ext.observe(ev)
    if ev.event_kind.value == "order.filled" and ev.actor_id:
        state = ext.featurize(actor_id=ev.actor_id, market_id=ev.market_id, now=ev.timestamp)
        trajectories.setdefault(ev.actor_id, []).append((state, ev.side or "hold"))

for actor_id, trajs in trajectories.items():
    model = fit.fit(actor_id=actor_id, trajectories=trajs, feature_names=FEATURE_NAMES)
    if model is not None:  # None means fewer than policy_min_events samples
        store.write_policy(model)
        print(f"{actor_id}: {model.top_rule} (acc={model.holdout_accuracy:.2f})")
Citations
- Pomerleau, D. A. (1988). “ALVINN: An Autonomous Land Vehicle in a Neural Network.” NIPS.
- Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and Regression Trees.
- Chen, T., Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System.” KDD 2016.
- Lundberg, S. M., Lee, S.-I. (2017). “A Unified Approach to Interpreting Model Predictions.” NeurIPS 2017.
- Ng, A. Y., Russell, S. J. (2000). “Algorithms for Inverse Reinforcement Learning.” ICML 2000.
- Ziebart, B. D., Maas, A., Bagnell, J. A., Dey, A. K. (2008). “Maximum Entropy Inverse Reinforcement Learning.” AAAI 2008.
- Ho, J., Ermon, S. (2016). “Generative Adversarial Imitation Learning.” NeurIPS 2016.
- Fu, J., Luo, K., Levine, S. (2018). “Learning Robust Rewards with Adversarial Inverse Reinforcement Learning.” ICLR 2018.