Clustering

Three orthogonal lenses for coordinated-actor detection: behavioral (HDBSCAN), temporal (DTW / k-Shape), and network (Louvain).

Wallet clusters are actors believed to represent the same entity (or a coordinated group). Four methods contribute: three statistical / graph-based ones covered here, plus the on-chain wallet heuristics. Each produces a WalletCluster with a method-specific confidence. Cross-method aggregation happens in downstream logic, not inside the clusterer.

BehavioralClusterer

Reference. Campello, R. J., Moulavi, D., Sander, J. (2013). “Density-based clustering based on hierarchical density estimates.” PAKDD. Trader-clustering template: Tumminello, M., Lillo, F., Piilo, J., Mantegna, R. N. (2012). New Journal of Physics, 14(1).

Signal. Actors with similar fingerprints. Same order-to-trade ratio, same cancel behavior, same Hawkes branching, same hour-of-day activity. Are more likely to be the same party or the same algorithmic implementation.

Method. HDBSCAN on the standardized actor feature matrix. HDBSCAN is preferred over DBSCAN because:

  • No $\epsilon$ parameter (hard to pick in mixed-unit feature space).
  • Variable density (some clusters tight, others loose).
  • Native “noise” label for actors that don’t cluster. Honest about non-clusterable outliers.

Features used. order_to_trade_ratio, cancel_before_fill_rate, median_time_to_cancel_ms, maker_ratio, mean_order_size, inter_arrival_median_s, hawkes_branching_ratio, market_entropy_bits. Actors with fewer than 5 populated features are dropped. HDBSCAN imputation on sparse rows produces arbitrary clusters.

python
from horizon.flow.clustering import BehavioralClusterer

clusterer = BehavioralClusterer(config)
clusters = clusterer.cluster(profiles, venue_name="polymarket")
# List[WalletCluster] with method=ClusterMethod.Behavioral

Dependency. hdbscan from the horizon[flow] extras. Without it the clusterer returns []. A missing cluster is better than a wrong cluster for compliance.

Config. ClusteringConfig.hdbscan_min_cluster_size (default 5), hdbscan_min_samples (3), min_actors_for_clustering (10).

TemporalClusterer

Reference. Paparrizos, J., Gravano, L. (2015). “k-Shape.” SIGMOD 2015. Sakoe, H., Chiba, S. (1978). Dynamic Time Warping.

Signal. Many bots share upstream triggers (same oracle, same news wire, same arbitrage opportunity). Their event-timing series are highly correlated even when per-event sizes differ.

Method. For each actor, build a binned rate series (events per $\Delta$ over a rolling window). Pairwise DTW distance between normalized rate series; pairs below dtw_max_distance are linked; connected components → clusters.

python
from horizon.flow.clustering import TemporalClusterer

clusterer = TemporalClusterer(config)
clusters = clusterer.cluster(
 actor_timestamps={
 "0xabc...": [ts1, ts2, ...],
 "0xdef...": [ts1, ts2, ...],
 },
 venue_name="polymarket",
 bin_s=5.0,
 window_s=3600.0,
)

Dependency. dtaidistance from horizon[flow] (fast C-accelerated DTW). A pure-Python fallback is built in for environments without the extras. Slower but keeps tests green.

Config. ClusteringConfig.dtw_max_distance (default 1.0 after L2 normalization). Lower = tighter clusters.

NetworkCommunityClusterer

Reference. Blondel, V. D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E. (2008). “Fast unfolding of communities in large networks.” Journal of Statistical Mechanics. Trader-network template: Cohen-Cole, E., Kirilenko, A., Patacchini, E.

Signal. Actors who trade the same markets at the same time share upstream information or coordination. Building an actor graph (edge weight = co-trade frequency within time windows) and running community detection identifies groups.

Method. Louvain algorithm on the actor × actor graph; edges weighted by co-occurrence count within cooccurrence_window_s-long windows. Falls back to networkx.community.greedy_modularity_communities when python-louvain is not installed.

python
from horizon.flow.clustering import NetworkCommunityClusterer

clusterer = NetworkCommunityClusterer(config)
clusters = clusterer.cluster(
 events=all_market_events,
 venue_name="polymarket",
 cooccurrence_window_s=60.0,
)

Dependencies. networkx always (lightweight); python-louvain optional for the tunable-resolution Louvain. Both live in horizon[flow].

Config. ClusteringConfig.louvain_resolution (default 1.0; standard modularity). Higher = more, smaller communities.

Composite clusters

V0.1 does not ship a cross-method composite clusterer. Each lens emits WalletCluster records with its method set, and downstream application code can reconcile them. V1.0 adds a CompositeClusterer that runs all four lenses and merges overlapping assignments via confidence-weighted union.

Why three lenses

No single method suffices:

  • Behavioral sees “same algorithm implementation” but not “same operator running two strategies.” Two bots from different firms with similar structure end up in the same cluster; the same firm running a market-maker and a directional strategy ends up in different clusters.
  • Temporal sees “triggered by the same external signal” but not “different triggers with identical feature profiles.” Two bots listening to the same oracle cluster; two bots with the same behavior but different reaction times do not.
  • Network sees “co-presence in the same markets” but not “causally related.” Index-rebalancers all touch the same names; so do manipulators coordinating a pump.
  • On-chain heuristics (WalletHeuristicLinker) see “same controller’s wallets” but only on venues that expose addresses.

A claim that survives two independent lenses is worth acting on. Single-lens claims are leads, not conclusions.

Determinism

All three clusterers honor FlowConfig.seed. HDBSCAN is deterministic by construction; Louvain via python-louvain accepts a random_state (seeded); the DTW fallback is deterministic. The same input + the same seed produces the same clusters across runs. A compliance property as much as a testing one.

Citations

  • Campello, R. J., Moulavi, D., Sander, J. (2013). “Density-based clustering based on hierarchical density estimates.” PAKDD.
  • Paparrizos, J., Gravano, L. (2015). “k-Shape: Efficient and Accurate Clustering of Time Series.” SIGMOD 2015.
  • Blondel, V. D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E. (2008). “Fast unfolding of communities in large networks.” Journal of Statistical Mechanics.
  • Tumminello, M., Lillo, F., Piilo, J., Mantegna, R. N. (2012). “Identification of clusters of investors from their real trading activity in a financial market.” New Journal of Physics, 14(1).
  • Sakoe, H., Chiba, S. (1978). “Dynamic programming algorithm optimization for spoken word recognition.” IEEE TASSP.