Dead-letter queue

Capture failed order submissions for inspection and replay instead of losing them.

When a venue submit() raises (network blip, expired credentials, a 4xx the broker returned), the run loop catches the exception and continues. That keeps the loop alive, but the order itself is lost unless a DLQ captures it.

horizon.ops.dlq is the dead-letter queue: a sink that records the original OrderAction plus context (error, timestamp, venue, account, retry count) for an operator to inspect, replay, or dismiss.

Protocol

python
from typing import Protocol


class DLQSink(Protocol):
    def write(self, entry: DeadLetteredOrder) -> None: ...
    def list(self, *, include_dismissed: bool = False) -> list[DeadLetteredOrder]: ...
    def get(self, dlq_id: str) -> DeadLetteredOrder | None: ...
    def mark_dismissed(self, dlq_id: str) -> bool: ...
    def bump_retry(self, dlq_id: str) -> int: ...
    def depth(self) -> int: ...
    def close(self) -> None: ...

Two implementations:

  • InMemoryDLQ. Process-local. Tests and research.
  • SQLiteDLQ. File-backed. Triggers reject UPDATE of identity columns and DELETE of any row.
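To make the Protocol's semantics concrete, here is a minimal sketch of a process-local sink. It is not the library's InMemoryDLQ: the `Entry` dataclass is a cut-down, hypothetical stand-in for DeadLetteredOrder, and the return conventions (False for an unknown id, the new count from bump_retry) are assumptions read off the signatures.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for DeadLetteredOrder (a few fields only)."""
    dlq_id: str
    error: str
    retry_count: int = 0
    dismissed: bool = False


class ToyInMemoryDLQ:
    """Illustrative sink; entries live in a dict for the process lifetime."""

    def __init__(self) -> None:
        # dicts keep insertion order, so iteration is chronological.
        self._entries: dict[str, Entry] = {}

    def write(self, entry: Entry) -> None:
        self._entries[entry.dlq_id] = entry

    def list(self, *, include_dismissed: bool = False) -> list[Entry]:
        return [
            e for e in self._entries.values()
            if include_dismissed or not e.dismissed
        ]

    def get(self, dlq_id: str) -> Entry | None:
        return self._entries.get(dlq_id)

    def mark_dismissed(self, dlq_id: str) -> bool:
        entry = self._entries.get(dlq_id)
        if entry is None:
            return False  # assumption: unknown ids report False
        # Entries are frozen, so store a modified copy.
        self._entries[dlq_id] = replace(entry, dismissed=True)
        return True

    def bump_retry(self, dlq_id: str) -> int:
        entry = self._entries[dlq_id]  # KeyError for unknown ids
        updated = replace(entry, retry_count=entry.retry_count + 1)
        self._entries[dlq_id] = updated
        return updated.retry_count  # assumption: returns the new count

    def depth(self) -> int:
        # Dismissed entries stay stored but do not count toward depth.
        return sum(1 for e in self._entries.values() if not e.dismissed)

    def close(self) -> None:
        pass  # nothing to release for an in-memory sink
```

Because the entry type is frozen, the two mutable fields change by swapping in a copy, which keeps every other field untouched by construction.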

Quickstart

python
from horizon.ops import SQLiteDLQ
import horizon as hz

dlq = SQLiteDLQ("/var/lib/horizon/dlq.db")

hz.run(
    mode="live",
    feed=my_feed,
    venues={"alpaca": venue},
    audit_log=audit_log,
    dlq=dlq,                       # captures every failed submit
    # ... remaining run() arguments
)

Every exception from a venue submit() inside the live loop now:

  1. Writes a DeadLetteredOrder to the DLQ.
  2. Emits an AuditCategory.OrderRejected event with dlq_id in the payload.
  3. Increments horizon_order_rejects_total{layer="venue_exception"} if metrics is configured.
  4. Continues the loop.
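The four steps above can be sketched roughly as follows. This is not the run loop's actual code: the callables passed in are hypothetical stand-ins for the venue, the DLQ sink, the audit log, and the metrics object, and the entry is a plain dict rather than a DeadLetteredOrder. The id format and the 1000-character truncation follow the entry shape documented on this page.

```python
from __future__ import annotations

import secrets
from datetime import datetime, timezone

MAX_ERROR_CHARS = 1000  # truncation limit noted for DeadLetteredOrder.error


def capture_failed_submit(
    submit, action: dict, *, dlq_write, audit_emit, count_reject
) -> str | None:
    """Try one submit. Return None on success, or the dlq_id on failure.

    All callables are hypothetical stand-ins, not horizon APIs.
    """
    try:
        submit(action)
        return None
    except Exception as exc:  # any venue failure must not kill the loop
        dlq_id = "dlq_" + secrets.token_hex(8)  # 8 bytes -> 16 hex chars
        entry = {
            "dlq_id": dlq_id,
            "captured_at": datetime.now(timezone.utc),  # tz-aware
            "error": f"{type(exc).__name__}: {exc}"[:MAX_ERROR_CHARS],
            "action": action,
        }
        dlq_write(entry)                                  # step 1
        audit_emit("OrderRejected", {"dlq_id": dlq_id})   # step 2
        count_reject(layer="venue_exception")             # step 3
        return dlq_id                                     # step 4: caller continues
```

The function returns instead of re-raising, which is what lets the caller move on to the next order.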

Entry shape

python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DeadLetteredOrder:
    dlq_id: str                       # "dlq_<16 hex>"
    captured_at: datetime             # tz-aware
    venue_name: str
    account_id: str | None
    market_id: str
    side: str
    quantity: float
    order_type: str
    price: float | None
    client_order_id: str | None
    error: str                        # truncated to 1000 chars
    retry_count: int = 0
    dismissed: bool = False
    action_json: str = ""             # full OrderAction for replay

Inspecting

python
for e in dlq.list():
    print(f"{e.captured_at.isoformat()} {e.venue_name} "
          f"{e.side} {e.quantity} {e.market_id} @ {e.price}  "
          f"(retries={e.retry_count})")
    print(f"  error: {e.error}")

dlq.list() hides dismissed entries. Pass include_dismissed=True to see them.
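When the queue holds more than a handful of entries, a grouped summary is often a quicker first look than the full listing. The sketch below counts entries per venue and error class. `summarize` is a hypothetical helper, the sample objects stand in for what dlq.list() returns, and it assumes the error string starts with the exception class name followed by a colon, as in the CLI output shown later on this page.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(entries):
    """Count entries per (venue, error class) pair, most frequent first."""
    counts = Counter(
        (e.venue_name, e.error.split(":", 1)[0]) for e in entries
    )
    return counts.most_common()


# Hypothetical stand-ins for DeadLetteredOrder objects.
sample = [
    SimpleNamespace(venue_name="alpaca",
                    error="RuntimeError: HTTP 429 after 3 attempts"),
    SimpleNamespace(venue_name="alpaca",
                    error="RuntimeError: HTTP 429 after 3 attempts"),
    SimpleNamespace(venue_name="kalshi",
                    error="TimeoutError: read timed out"),
]

for (venue, error_class), n in summarize(sample):
    print(f"{n:>3}  {venue}  {error_class}")
```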

Replaying

Resubmit one entry. On success the entry is marked dismissed. On failure the retry count is bumped.

python
from horizon.ops import replay_order

ok, detail = replay_order(dlq, "dlq_abc123", venue, audit_log=audit_log)
if ok:
    print(f"replayed -> venue order {detail}")
else:
    print(f"still failing: {detail}")

Replay reconstructs the OrderAction from action_json and calls venue.submit(). On success, the venue order id is emitted as an Annotation audit event that references the original dlq_id.
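The control flow of a replay can be sketched as below. This is an illustration, not the body of replay_order: `replay_sketch` is a hypothetical name, `submit` stands in for venue.submit(), the action is assumed to be stored as JSON, and the audit event is left out to keep the example short.

```python
from __future__ import annotations

import json


def replay_sketch(dlq, dlq_id: str, submit) -> tuple[bool, str]:
    """Resubmit one entry; return (ok, venue order id or error text)."""
    entry = dlq.get(dlq_id)
    if entry is None:
        return False, f"unknown dlq_id {dlq_id}"

    # Assumption: action_json holds a JSON document describing the order.
    action = json.loads(entry.action_json)
    try:
        order_id = submit(action)
    except Exception as exc:
        dlq.bump_retry(dlq_id)      # failure: record the attempt, keep it pending
        return False, f"{type(exc).__name__}: {exc}"

    dlq.mark_dismissed(dlq_id)      # success: remove it from the pending view
    return True, str(order_id)
```

The entry is only dismissed after submit returns, so a crash mid-replay leaves it pending rather than silently dropped.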

Dismissing

For entries that should not be replayed (the position is no longer valid, the market closed, the strategy was disabled):

python
dlq.mark_dismissed("dlq_abc123")

The entry remains in the sink for the audit record. depth() no longer counts it.

CLI

The horizon CLI wraps the common operations and is backed by the same DLQSink implementation used in the run loop:

$ horizon dlq list --db /var/lib/horizon/dlq.db
dlq_abc123  2026-04-18T14:32:15Z  alpaca  buy 100 AAPL @ 180.0  [pending, retries=0]
  error: RuntimeError: HTTP 429 after 3 attempts

$ horizon dlq list --db /var/lib/horizon/dlq.db --all    # include dismissed
...

$ horizon dlq replay dlq_abc123 --db /var/lib/horizon/dlq.db --venue alpaca --paper
replayed dlq_abc123  -> venue order ord_xyz

$ horizon dlq dismiss dlq_abc123 --db /var/lib/horizon/dlq.db
dismissed dlq_abc123

Supported --venue values: alpaca, kalshi, hyperliquid, polymarket, ibkr, and ccxt:<exchange_id> (for example ccxt:binance). The replay command constructs the venue with credentials from Secrets. Use --paper to route to the venue’s demo / sandbox / paper mode where supported.
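For scripts that wrap the CLI, the --venue value format can be validated up front. The parser below is a hypothetical helper written for this page, not the CLI's own code; it only encodes the list of supported values given above.

```python
from __future__ import annotations

# Venue names taken from the supported --venue values listed above.
KNOWN_VENUES = {"alpaca", "kalshi", "hyperliquid", "polymarket", "ibkr"}


def parse_venue_spec(spec: str) -> tuple[str, str | None]:
    """Split a --venue value into (venue, exchange_id or None)."""
    if spec.startswith("ccxt:"):
        exchange_id = spec[len("ccxt:"):]
        if not exchange_id:
            raise ValueError("ccxt needs an exchange id, e.g. ccxt:binance")
        return "ccxt", exchange_id
    if spec in KNOWN_VENUES:
        return spec, None
    raise ValueError(f"unsupported venue: {spec!r}")
```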

WORM properties

SQLiteDLQ has two triggers:

  • UPDATE of any identity column (dlq_id, captured_at, venue_name, account_id, market_id, side, quantity, order_type, price, client_order_id, error, action_json) raises IntegrityError.
  • DELETE from the table raises IntegrityError.

Only retry_count and dismissed are mutable, and only via the Protocol’s methods. An auditor can run the SQLite file through regulatory review the same way the audit log is reviewed.
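The trigger mechanism can be demonstrated with the standard library alone. The sketch below is not SQLiteDLQ's actual schema: the table is cut down to three identity columns, and the table, trigger names, and messages are assumptions. It relies on SQLite's RAISE(ABORT, ...) inside a trigger, which Python's sqlite3 module surfaces as IntegrityError.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dlq (
    dlq_id      TEXT PRIMARY KEY,
    error       TEXT NOT NULL,
    action_json TEXT NOT NULL,
    retry_count INTEGER NOT NULL DEFAULT 0,
    dismissed   INTEGER NOT NULL DEFAULT 0
);

-- Identity columns are frozen once written.
CREATE TRIGGER dlq_identity_immutable
BEFORE UPDATE OF dlq_id, error, action_json ON dlq
BEGIN
    SELECT RAISE(ABORT, 'dlq identity columns are immutable');
END;

-- Rows are never deleted.
CREATE TRIGGER dlq_no_delete
BEFORE DELETE ON dlq
BEGIN
    SELECT RAISE(ABORT, 'dlq rows cannot be deleted');
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO dlq (dlq_id, error, action_json) VALUES (?, ?, ?)",
    ("dlq_demo", "RuntimeError: HTTP 429", "{}"),
)

# Mutable columns update normally; the trigger does not name them.
conn.execute(
    "UPDATE dlq SET retry_count = retry_count + 1 WHERE dlq_id = ?",
    ("dlq_demo",),
)


def is_rejected(sql: str) -> bool:
    """Run a statement and report whether a trigger aborted it."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


print(is_rejected("UPDATE dlq SET error = 'edited' WHERE dlq_id = 'dlq_demo'"))
print(is_rejected("DELETE FROM dlq WHERE dlq_id = 'dlq_demo'"))
```

Triggers of this kind stop accidental or casual edits through SQL. They do not protect against someone with write access to the file dropping the triggers, so file permissions and backups still matter.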

Storage sizing

As a rough estimate, each entry takes about 1 KB in SQLite plus the serialized OrderAction, which adds 200 to 500 bytes. At 10 failed submits per day that is 3,650 entries a year, or roughly 4 to 6 MB. Sizing is not a concern for advisor-scale deployments.

Retention policy is the firm’s call. Dismissed entries can stay for the audit period (five years under Rule 204-2; six under Rule 17a-4) and be archived like the audit log.

Metrics

Depth is exposed as a gauge for dashboards:

python
dlq_depth_samples = metrics.gauge(
    MetricName.DlqDepth, dlq.depth(), venue="alpaca",
)

The gauge only reflects the value passed in at call time, so update it on a timer or at end of day (EOD).
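One way to drive a periodic update is a small background thread. The helper below is a sketch with hypothetical names: `depth_fn` would be dlq.depth, and `publish` would wrap the metrics.gauge call shown above. Whether depth() may be called from a second thread is an assumption, particularly for the SQLite-backed sink; if it may not, call the same publish step from the run loop instead.

```python
from __future__ import annotations

import threading
from typing import Callable


def start_depth_reporter(
    depth_fn: Callable[[], int],
    publish: Callable[[int], None],
    interval_s: float = 60.0,
) -> tuple[threading.Event, threading.Thread]:
    """Publish depth_fn() immediately, then every interval_s seconds."""
    stop = threading.Event()

    def _loop() -> None:
        while True:
            publish(depth_fn())
            # wait() returns True as soon as stop is set, ending the loop
            # without sleeping out the rest of the interval.
            if stop.wait(interval_s):
                return

    thread = threading.Thread(target=_loop, name="dlq-depth", daemon=True)
    thread.start()
    return stop, thread
```

Call `stop.set()` and then `thread.join()` during shutdown, before closing the sink, so the reporter never reads from a closed DLQ.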

Not in scope

  • Automatic replay. By design. Failed submits need human inspection; a retry loop at the DLQ layer would hide credential expiry, broker bans, or symbol typos that the operator should address.
  • Priority queues. One table, chronological order. If prioritization matters, filter in the operator CLI.
  • Cross-process DLQ. SQLiteDLQ is single-writer. For multi-writer deployments, a Postgres-backed DLQSink lands in L2 on the same Protocol.