Policy reverse-engineering

Shadow policy (decision tree + gradient-boosted classifier + SHAP) is the default, interpretable path. MaxEnt IRL, AIRL, and GAIL ship in v0.3; the torch-based fitters require the horizon[flow-irl] extras, while MaxEnt IRL runs on pure numpy.

The policy layer answers “what does this bot do?”. Given an actor’s observed trajectory of (state, action) pairs, it infers the conditions under which the actor acts.

Two paths:

  • Shadow policy (default, v0.1). Decision-tree + gradient-boosted classifier trained on observed (state, action) pairs. Produces human-readable rules. Interpretable. Fast. Ships today.
  • Inverse RL (v0.3). MaxEnt IRL, AIRL, GAIL. Recovers a reward function rather than a policy. Theoretically stronger claim; computationally heavy; less directly interpretable.

The default is Shadow. Most compliance use cases want interpretability, not reward recovery. Compliance reviewers can read a decision tree; recovering a torch model and explaining what “the implied reward function” means is a harder sell.

State features

Both paths consume the same feature vector. PolicyFeatureExtractor builds it from a MarketEvent stream:

python
from datetime import datetime, timezone

from horizon.flow.policy.features import PolicyFeatureExtractor, FEATURE_NAMES

ext = PolicyFeatureExtractor()
for ev in stream:  # stream: an iterable of MarketEvent objects
    ext.observe(ev)

features = ext.featurize(
    actor_id="0xabc...",
    market_id="0xTRUMP...",
    now=datetime.now(timezone.utc),
)
# {'ofi_5s': 0.32, 'ofi_30s': 0.41, 'spread_bps': 12.3, ...}

FEATURE_NAMES is the canonical ordering. Every PolicyModel records the feature names it was fit with; downstream code matching features at inference time must use the exact same order and semantics.
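The ordering requirement can be illustrated with a small helper that turns a feature dict into a vector. This is a minimal sketch, not the library's code: `to_vector` and the three-name list are hypothetical, and the real FEATURE_NAMES is longer.

```python
from typing import Mapping, Sequence


def to_vector(features: Mapping[str, float], feature_names: Sequence[str]) -> list:
    """Return feature values in the exact order given by feature_names."""
    missing = [name for name in feature_names if name not in features]
    if missing:
        # Fail loudly instead of silently shifting columns at inference time.
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in feature_names]


# Hypothetical three-feature ordering, for illustration only.
names = ["ofi_5s", "ofi_30s", "spread_bps"]
row = {"spread_bps": 12.3, "ofi_5s": 0.32, "ofi_30s": 0.41}
vector = to_vector(row, names)
```

Building the vector from the recorded names, rather than from dict iteration order, keeps inference aligned with the order the model was fit with.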

Feature set:

| Feature | Description |
| --- | --- |
| ofi_5s, ofi_30s, ofi_5m | Order flow imbalance at three horizons (Cont 2014) |
| spread_bps | Top-of-book spread in bps of mid |
| depth_imbalance | (bid_5 - ask_5) / (bid_5 + ask_5) (top-5 levels) |
| mid_return_5s, mid_return_1m | Short-horizon mid drift |
| realized_vol_5m | Rolling realized vol of log-returns |
| hour_of_day, minute_of_hour | UTC time of decision |
| own_inventory | Running position from observed fills |
| own_recent_trades_1m, own_recent_cancels_1m | Per-actor pace |
| book_snapshot_age_s | Freshness of the depth observation |
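Two of the book-derived features can be sketched directly from their definitions. The snippet below is an illustration under assumptions: the function names are hypothetical, the book is represented as lists of (price, size) tuples sorted best-first, and bid_5 / ask_5 are read as the summed size over the top five levels.

```python
def spread_bps(best_bid: float, best_ask: float) -> float:
    """Top-of-book spread expressed in basis points of the mid price."""
    mid = (best_bid + best_ask) / 2.0
    return (best_ask - best_bid) / mid * 1e4


def depth_imbalance(bids, asks, levels: int = 5) -> float:
    """(bid_depth - ask_depth) / (bid_depth + ask_depth) over the top levels."""
    bid_depth = sum(size for _, size in bids[:levels])
    ask_depth = sum(size for _, size in asks[:levels])
    total = bid_depth + ask_depth
    # An empty book carries no imbalance signal; return a neutral 0.0.
    return 0.0 if total == 0 else (bid_depth - ask_depth) / total


# Hypothetical two-level book.
bids = [(0.50, 100.0), (0.49, 200.0)]
asks = [(0.52, 100.0), (0.53, 100.0)]
imbalance = depth_imbalance(bids, asks)
spread = spread_bps(bids[0][0], asks[0][0])
```

The imbalance is bounded in [-1, 1], with positive values meaning more resting size on the bid side.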

ShadowPolicyFitter

Default path. horizon.flow.policy.ShadowPolicyFitter fits a decision tree (for rule extraction) and a gradient-boosted classifier (for accuracy) on the same (state, action) pairs, then summarizes via SHAP (if installed) or sklearn feature-importances.

Usage

python
from horizon.flow.config import FlowConfig
from horizon.flow.policy.features import FEATURE_NAMES
from horizon.flow.policy.shadow import ShadowPolicyFitter

fitter = ShadowPolicyFitter(FlowConfig())

model = fitter.fit(
    actor_id="0xabc...",
    trajectories=[
        # Each entry is (state features, action); "..." stands for the remaining features.
        ({"ofi_5s": 0.4, "spread_bps": 8.0, ...}, "buy"),
        ({"ofi_5s": -0.5, "spread_bps": 15.0, ...}, "sell"),
        # ... at least FlowConfig.policy.policy_min_events of these
    ],
    feature_names=FEATURE_NAMES,
)
# PolicyModel with model_blob (pickled sklearn), top_rule, holdout_accuracy, SHAP summary.

Below FlowConfig.policy.policy_min_events (default 200) the fitter returns None. Small samples produce unreliable rules that would mislead compliance.

Output

PolicyModel.summary contains:

  • method. "shap_tree_explainer" if SHAP is installed, "gbdt_feature_importances" otherwise.
  • top_features. Ranked list of {"feature", "mean_abs_shap" | "importance"} entries, truncated to shap_top_k (default 10).
  • rules. Up to 10 decision-tree leaves as {"description", "support", "positive_fraction", "predicted_action"}. The decision tree is capped at depth 6 so rules stay readable.
  • class_labels. Sorted list of observed actions (e.g. ["buy", "hold", "sell"]).
  • holdout_accuracy. Score on the reserved holdout slice (default 20%).

PolicyModel.top_rule is the single highest-support decision-tree rule, rendered as a plain-text condition like “action when feature1 ≤ t1 AND feature2 > t2 AND …”, which is exactly the line a compliance reviewer wants.
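How a leaf becomes a rule line with its support and positive fraction can be sketched in plain Python. This is an illustration, not the fitter's code: `render_rule`, `leaf_stats`, and the path representation are hypothetical, and positive_fraction is assumed to mean the share of rows in the leaf whose action matches the predicted one.

```python
import operator

# Comparison operators a decision-tree split can produce.
OPS = {"<=": operator.le, ">": operator.gt}


def render_rule(action, path):
    """Join the (feature, op, threshold) tests on a root-to-leaf path."""
    clauses = [f"{feature} {op} {threshold:g}" for feature, op, threshold in path]
    return f"{action} when " + " AND ".join(clauses)


def leaf_stats(rows, path, action):
    """Return (support, positive_fraction) for the rows that satisfy the path."""
    reached = [
        label
        for state, label in rows
        if all(OPS[op](state[feature], threshold) for feature, op, threshold in path)
    ]
    support = len(reached)
    positive_fraction = reached.count(action) / support if support else 0.0
    return support, positive_fraction


rows = [
    ({"ofi_5s": 0.40, "spread_bps": 8.0}, "buy"),
    ({"ofi_5s": 0.50, "spread_bps": 9.0}, "buy"),
    ({"ofi_5s": 0.35, "spread_bps": 7.0}, "sell"),
    ({"ofi_5s": -0.50, "spread_bps": 15.0}, "sell"),
]
path = [("ofi_5s", ">", 0.3), ("spread_bps", "<=", 10.0)]
rule = render_rule("buy", path)
support, positive_fraction = leaf_stats(rows, path, "buy")
```

Reporting support next to the rule lets a reviewer see how many observations back it before trusting it.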

Determinism

FlowConfig.seed drives the train/test split (via random.Random(seed)) and the sklearn estimators’ random_state. Same seed + same trajectories → same model.
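A seeded split of this kind can be sketched with the standard library alone. The helper below is hypothetical: the text only states that random.Random(seed) drives the split and that 20% is held out by default, so the shuffle-then-slice approach and the rounding are assumptions.

```python
import random


def seeded_split(trajectories, seed, holdout_fraction=0.2):
    """Shuffle indices with a private RNG, then slice off the holdout set."""
    indices = list(range(len(trajectories)))
    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(len(indices) * holdout_fraction))
    holdout = [trajectories[i] for i in indices[:n_holdout]]
    train = [trajectories[i] for i in indices[n_holdout:]]
    return train, holdout


data = list(range(10))
first = seeded_split(data, seed=7)
second = seeded_split(data, seed=7)
```

Both calls return the same partition, which is the property that makes a fitted model reproducible from its seed and trajectories.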

Dependencies

Requires scikit-learn (in horizon[flow]). Optional shap for SHAP summaries; fallback is sklearn feature_importances_.

Inverse RL

horizon.flow.policy.irl answers the complementary question to shadow policy: not “what does the actor do” but “what does the actor APPEAR TO BE OPTIMIZING.” The output is a reward function such that the observed behavior is the maximum-entropy policy under that reward (Ziebart et al., 2008).

MaxEntIRLFitter (v0.3.0, shipped)

Fully implemented in pure numpy. No torch, no extras required. The algorithm:

  1. Feature selection. The top-K most variable features from the trajectory data become the MDP’s state features. FlowConfig.policy.irl_n_features (default 4) sets K.
  2. Discretization. Each selected feature is binned into equal-frequency quantile bins. irl_n_bins (default 5) sets the bin count. State space size is n_bins^n_features. Default 5⁴ = 625 states.
  3. Empirical transition model. P(s' | s, a) is estimated from consecutive trajectory transitions with Laplace smoothing. Unseen (s, a) pairs fall back to uniform over s’.
  4. Soft value iteration. Q(s, a) = R(s, a) + γ · Σ_{s'} P(s'|s,a) · logsumexp_{a'} Q(s', a') under current θ. Discount γ defaults to 0.9.
  5. Gradient ascent on θ. The gradient of the L2-regularized MaxEnt log-likelihood is ∇L = f_expert - f_policy - 2·λ·θ, where f_expert is the empirical per-(state, action) visitation rate in expert data and f_policy is the visitation expected under the learned soft-max policy. L2 regularization λ defaults to 0.01.
  6. Convergence. Stop when ||∇L|| < irl_convergence_tol (default 1e-4) or after irl_max_iterations (default 80) outer steps.
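Step 4 above can be sketched in plain Python on a toy MDP. This is a minimal illustration rather than the shipped numpy implementation: the function names, the fixed iteration count, and the two-state MDP are all assumptions made for the example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def soft_value_iteration(reward, transition, gamma=0.9, n_iters=200):
    """Iterate Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * logsumexp_a' Q(s',a')."""
    n_states, n_actions = len(reward), len(reward[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [logsumexp(row) for row in q]  # soft state values
        q = [
            [
                reward[s][a]
                + gamma * sum(p * v[s2] for s2, p in enumerate(transition[s][a]))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
    return q


def softmax_policy(q):
    """pi(a | s) = exp(Q(s, a) - V(s)) with V(s) = logsumexp_a Q(s, a)."""
    policy = []
    for row in q:
        v = logsumexp(row)
        policy.append([math.exp(x - v) for x in row])
    return policy


# Toy MDP (hypothetical): 2 states, actions 0 = stay, 1 = switch.
# State 1 pays reward 1.0 regardless of action; state 0 pays nothing.
reward = [[0.0, 0.0], [1.0, 1.0]]
transition = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: stay -> 0, switch -> 1
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: stay -> 1, switch -> 0
]
q = soft_value_iteration(reward, transition)
policy = softmax_policy(q)
```

In this toy case the soft-max policy prefers switching out of state 0 and staying in state 1, while still assigning non-zero probability to the other action, which is the maximum-entropy behavior the fitter matches against expert visitation.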

Usage

python
import horizon as hz
from horizon.flow.config import FlowConfig
from horizon.flow.policy.features import PolicyFeatureExtractor, FEATURE_NAMES
from horizon.flow.policy.irl import MaxEntIRLFitter

# Build trajectories the same way as for the shadow path.
# event_stream and target_wallet are supplied by the caller.
feat = PolicyFeatureExtractor()
trajectories: list[tuple[dict, str]] = []
for ev in event_stream:
    feat.observe(ev)
    if ev.actor_id == target_wallet and ev.side in ("buy", "sell"):
        state = feat.featurize(
            actor_id=target_wallet,
            market_id=ev.market_id,
            now=ev.timestamp,
        )
        trajectories.append((state, ev.side))

model = MaxEntIRLFitter(FlowConfig()).fit(
    actor_id=target_wallet,
    trajectories=trajectories,
    feature_names=FEATURE_NAMES,
    action_space=["buy", "sell", "hold"],
)

# Output
print(model.top_rule)  # "buy when OFI_5s in ~[0.32]"
print(model.summary["reward_weights"])  # per-action per-state weights
for row in model.summary["top_rewarding_states"][:5]:
    print(f"{row['action']:5s} @ {row['bin_centers']} reward={row['reward']:.3f}")

Output: PolicyModel.summary for MaxEnt

Key fields the reader consumes:

| Key | Meaning |
| --- | --- |
| method | "maxent_irl" |
| feature_basis | Which features entered the MDP (top-K by variance). |
| n_bins | Per-feature discretization granularity. |
| bin_edges | Per-feature quantile boundaries (for decoding state indices). |
| reward_weights | {action → list[reward per state index]}. The θ matrix. |
| top_rewarding_states | Top-K (bin_centers, action, reward) triples. The direct compliance-readable answer to “what state-action combos does this actor value most?” |
| log_likelihood | Log-likelihood of expert trajectories under the fitted policy (closer to 0 = better fit). |
| iterations_run / final_gradient_norm | Convergence diagnostics. |
| expert_feature_expectations / policy_feature_expectations | The two visitation vectors whose difference drove the gradient. Should align at convergence. |
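Decoding a state index back into per-feature bin centers might look like the sketch below. It rests on assumptions the text does not confirm: bin_edges is taken to be a list of ascending boundary lists (n_bins + 1 entries per feature), state indices are assumed to use a mixed-radix encoding with the first feature as the most significant digit, and all function names are hypothetical.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Map a value to its bin in [0, n_bins - 1], clamping out-of-range values."""
    n_bins = len(edges) - 1
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), n_bins - 1)


def encode_state(values, bin_edges):
    """Combine per-feature bin indices into one integer (mixed radix)."""
    state = 0
    for value, edges in zip(values, bin_edges):
        state = state * (len(edges) - 1) + bin_index(value, edges)
    return state


def decode_state(state, bin_edges):
    """Invert encode_state and return the midpoint of each feature's bin."""
    digits = []
    for edges in reversed(bin_edges):
        state, digit = divmod(state, len(edges) - 1)
        digits.append(digit)
    digits.reverse()
    return [(edges[d] + edges[d + 1]) / 2.0 for d, edges in zip(digits, bin_edges)]


# Hypothetical edges: 3 bins for the first feature, 2 bins for the second.
bin_edges = [[0.0, 1.0, 2.0, 3.0], [0.0, 10.0, 20.0]]
state = encode_state([1.5, 15.0], bin_edges)
centers = decode_state(state, bin_edges)
```

With the default five bins over four features, the same scheme would index the 625 states mentioned above.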

Paper reproduction

tests/flow/test_policy_irl.py::test_gridworld_reward_concentrates_near_goal runs the canonical gridworld sanity check for MaxEnt IRL (Ziebart et al., 2008): on a 5×5 grid with a goal corner and 400 expert trajectories moving toward the goal, it verifies that the argmax of the recovered reward lies in the goal quadrant. The test passes deterministically with the default config.

When to prefer MaxEnt over shadow policy

  • Shadow policy answers “what does the actor do at observed states”. High accuracy on observed data, but no guarantee about behavior in state regions the actor hasn’t visited.
  • MaxEnt IRL answers “what reward function makes this behavior optimal”. Lower accuracy on observed data (intentionally: it’s a lower-dimensional model), but it generalizes to unobserved state regions via the learned reward.

Use MaxEnt when you want to REASON about the actor: “if OFI_5s dropped to -0.5 in a regime they haven’t seen, what would they do?” The reward function extrapolates; the shadow-policy classifier doesn’t.

Use shadow when you want maximum observed-state prediction accuracy and interpretable rules.

For compliance defense, both answers are defensible; ship both side-by-side in the policy_models table and let the reader pick.

GAILFitter (v0.3.1, shipped: offline variant)

Ho & Ermon (2016) adversarial imitation, adapted for the offline setting.

The original GAIL assumes an on-policy simulator: you roll out trajectories under the current policy, feed them to a discriminator, and update the policy with TRPO/PPO. Market flow doesn’t have a rewindable simulator, so we ship the OFFLINE variant, which stays faithful to the discriminator-as-reward formulation but drops the policy-optimization loop.

Method. Fit a torch discriminator D(s, a) to distinguish expert demonstrations from random-policy samples (same states, actions drawn uniformly from the action space). The learned reward at each observed state is the standard GAIL expression:

text
r(s, a) = log D(s, a) - log(1 - D(s, a))

Higher log-odds = more expert-like. The reward is non-linear (MLP), so it catches feature interactions MaxEnt’s linear basis can’t represent.
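The reward expression is the logit of the discriminator output, which a few lines of Python make concrete. This is a standalone sketch: `gail_reward` and the clipping constant `eps` are hypothetical and not part of the library.

```python
import math


def gail_reward(d: float, eps: float = 1e-8) -> float:
    """r = log D - log(1 - D), with D clipped away from 0 and 1 for stability."""
    d = min(max(d, eps), 1.0 - eps)
    return math.log(d) - math.log(1.0 - d)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


neutral = gail_reward(0.5)             # discriminator cannot tell expert from random
expert_like = gail_reward(0.9)         # positive: looks like the expert
recovered = gail_reward(sigmoid(1.2))  # undoes the sigmoid, giving back the pre-sigmoid score
```

If the discriminator ends in a sigmoid, the reward is simply its pre-sigmoid output, so a positive value means "more expert-like than random" and a negative value means the opposite.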

Usage.

python
from horizon.flow.config import FlowConfig
from horizon.flow.policy.features import FEATURE_NAMES
from horizon.flow.policy.irl import GAILFitter

fitter = GAILFitter(FlowConfig())
model = fitter.fit(
    actor_id="0xabc...",
    trajectories=[(state_features, action_label), ...],
    feature_names=FEATURE_NAMES,
    action_space=["buy", "sell", "hold"],
    epochs=50,  # default
    hidden_dim=32,  # default
    learning_rate=1e-3,  # default
)

print(model.top_rule)
# "buy preferred (mean_reward=0.45); top driver = ofi_5s (|grad|=0.38)"

print(model.summary["action_preferences"])
# {"buy": 0.45, "sell": -0.20, "hold": 0.02}

print(model.summary["feature_gradient_importance"])
# {"ofi_5s": 0.38, "spread_bps": 0.12, ...}

Output PolicyModel.summary.

| Key | Meaning |
| --- | --- |
| method | "gail_offline" |
| reward_network_dims | Discriminator layer widths. |
| training_loss | Final BCE loss. |
| action_preferences | {action → mean learned reward} across observed states. Identifies which action the discriminator considers most expert-like overall. |
| feature_gradient_importance | {feature → mean \|∂reward/∂feature\|}. Interpretable approximation of which features drive the learned reward. |
| feature_basis, action_space, feat_mean, feat_std | Encoding metadata for loading model_blob. |
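The feature_gradient_importance quantity, a mean absolute partial derivative per feature, can be illustrated without torch by swapping in central finite differences for autograd. Everything here is a stand-in: `gradient_importance` and `toy_reward` are hypothetical, and the fitter itself presumably differentiates the learned network directly.

```python
def gradient_importance(reward_fn, states, feature_names, h=1e-5):
    """Mean |d reward / d feature| over states, via central finite differences."""
    totals = {name: 0.0 for name in feature_names}
    for state in states:
        for i, name in enumerate(feature_names):
            up, down = list(state), list(state)
            up[i] += h
            down[i] -= h
            totals[name] += abs((reward_fn(up) - reward_fn(down)) / (2.0 * h))
    return {name: total / len(states) for name, total in totals.items()}


def toy_reward(x):
    """Hypothetical reward with an interaction term between the two features."""
    return 2.0 * x[0] - 0.5 * x[1] + x[0] * x[1]


names = ["ofi_5s", "spread_bps"]
states = [[0.4, 8.0], [-0.5, 15.0]]
importance = gradient_importance(toy_reward, states, names)
```

Because of the interaction term, each feature's gradient depends on the other feature's value, which is why the importance is averaged over the observed states instead of being read off a single point.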

model_blob is the torch state dict. Reload with torch.load(io.BytesIO(blob)).

Dependency. torch via the [flow-irl] extras. Without it, .fit() raises ModuleNotFoundError with the install command.

AIRLFitter (v0.3.1, shipped: offline state-only variant)

Fu, Luo, Levine (2018). Same discriminator-as-reward spirit as GAIL, but with the disentangled-reward decomposition:

text
D(s, a) = exp(f(s)) / (exp(f(s)) + π_baseline(a | s))

where f(s) is a state-only reward network and π_baseline(a | s) = 1 / |A| is the uniform policy. At convergence, f(s) recovers the state-only reward component, which is the transferable part of the actor’s reward in AIRL’s sense.
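The decomposition can be checked numerically: with a uniform baseline, the discriminator's log-odds equal f(s) plus the constant log |A|. The sketch below uses hypothetical function names and a scalar f(s) in place of the reward network.

```python
import math


def airl_discriminator(f_s: float, n_actions: int) -> float:
    """D(s, a) = exp(f(s)) / (exp(f(s)) + pi_baseline), with pi_baseline = 1 / |A|."""
    pi_baseline = 1.0 / n_actions
    return math.exp(f_s) / (math.exp(f_s) + pi_baseline)


def log_odds(d: float) -> float:
    return math.log(d) - math.log(1.0 - d)


n_actions = 3
f_s = 0.3  # hypothetical state reward
d = airl_discriminator(f_s, n_actions)
shifted = log_odds(d)  # equals f(s) + log |A|
balanced = airl_discriminator(-math.log(n_actions), n_actions)  # D = 0.5
```

Since the shift log |A| is the same for every state, ranking states by discriminator log-odds and ranking them by f(s) give the same order.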

When to prefer vs GAIL.

  • GAIL gives you r(s, a). Action-aware. Best for “what does this actor value” when the action choice itself carries information.
  • AIRL gives you f(s). Action-agnostic. Best for “what market states does this actor value” independent of what they chose there. Useful when you want to characterize the actor’s preferred regime.

Output PolicyModel.summary.

| Key | Meaning |
| --- | --- |
| method | "airl_offline" |
| state_reward_top_k | The top-K states (by inferred reward) from the expert set, sorted descending. Each row has the feature dict + scalar reward. |
| feature_gradient_importance | Same shape as GAIL. Average \|∂f/∂feature\| over expert states. |

The state-reward decomposition is the AIRL claim to fame. A reviewer reads the top-K rows and sees which market states this actor finds most valuable. No action attached.

What the offline variants don’t do

The offline simplification has limits worth documenting:

  • No policy network, no rollouts. You get a learned reward, not a trained agent. If you want the agent, that’s the simulator-dependent full GAIL. Out of scope for v0.3.1.
  • Reward support = demonstration support. The reward function extrapolates only modestly beyond the observed states. Regions with no expert data produce undefined rewards (the discriminator output is arbitrary).
  • Random-baseline assumption. GAIL’s reward interpretation assumes the “generator” is a uniform random policy. In the real on-policy algorithm, the generator is the current learned policy, which converges toward the expert. Offline, we stay at the initial assumption. The reward describes “expert vs uniform-random,” not “expert vs converged-policy.”

For compliance use cases, these limitations rarely matter: the reward is a description of observed behavior, not a prediction of unseen behavior. Together, shadow policy, MaxEnt, GAIL, and AIRL give progressively richer views of the same demonstration set:

  1. Shadow. Interpretable classification (rules).
  2. MaxEnt. Tabular reward (linear over binned features).
  3. GAIL. Action-aware nonlinear reward.
  4. AIRL. State-only nonlinear reward.

Ship all four side-by-side in the policy_models table and let the reader compare.

Why shadow is the default

A compliance reviewer asking “what does this bot do?” gets the same answer from both approaches but in very different forms:

  • Shadow. “this wallet buys when ofi_5s > 0.3 AND spread_bps ≤ 10 (support: 210, positive fraction 0.91).” One line, actionable, auditable.
  • IRL. “the reward function that best explains this wallet’s behavior weights signed-volatility at 0.42, spread-compression at 0.28, directional drift at 0.18…” Requires a PhD to evaluate.

For the same compliance purpose, the interpretable answer is the more defensible one. IRL becomes genuinely useful when you care about transfer: predicting how the bot would behave under different market conditions. That’s an alpha-research question, not a compliance one, and the torch-based IRL fitters live behind the extras flag.

Common recipe: fit a policy for every active actor

python
from horizon.flow.config import FlowConfig
from horizon.flow.policy.features import PolicyFeatureExtractor, FEATURE_NAMES
from horizon.flow.policy.shadow import ShadowPolicyFitter

# stream (MarketEvent iterable) and store (object exposing write_policy) come from the caller.
fitter = ShadowPolicyFitter(FlowConfig())
ext = PolicyFeatureExtractor()

trajectories: dict[str, list[tuple[dict, str]]] = {}

for ev in stream:
    ext.observe(ev)
    if ev.event_kind.value == "order.filled" and ev.actor_id:
        state = ext.featurize(actor_id=ev.actor_id, market_id=ev.market_id, now=ev.timestamp)
        trajectories.setdefault(ev.actor_id, []).append((state, ev.side or "hold"))

for actor_id, trajs in trajectories.items():
    model = fitter.fit(actor_id=actor_id, trajectories=trajs, feature_names=FEATURE_NAMES)
    if model is not None:  # None means fewer than policy_min_events observations
        store.write_policy(model)
        print(f"{actor_id}: {model.top_rule} (acc={model.holdout_accuracy:.2f})")

Citations

  • Pomerleau, D. A. (1988). “ALVINN: An Autonomous Land Vehicle in a Neural Network.” NIPS 1988.
  • Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and Regression Trees.
  • Chen, T., Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System.” KDD 2016.
  • Lundberg, S. M., Lee, S.-I. (2017). “A Unified Approach to Interpreting Model Predictions.” NeurIPS 2017.
  • Ng, A. Y., Russell, S. J. (2000). “Algorithms for Inverse Reinforcement Learning.” ICML 2000.
  • Ziebart, B. D., Maas, A., Bagnell, J. A., Dey, A. K. (2008). “Maximum Entropy Inverse Reinforcement Learning.” AAAI 2008.
  • Ho, J., Ermon, S. (2016). “Generative Adversarial Imitation Learning.” NeurIPS 2016.
  • Fu, J., Luo, K., Levine, S. (2018). “Learning Robust Rewards with Adversarial Inverse Reinforcement Learning.” ICLR 2018.