Crash recovery
The audit log is the source of truth. Restart, rebuild the ledger and lot books from events, resume.
The watchdog protects while the process is running. Recovery protects across restarts. Once real money is in play and the SDK runs unattended, a crash (OS reboot, OOM kill, network partition, deploy) must not mean lost state.
The audit log is the source of truth. Every fill, wash-sale adjustment, and block allocation is a chain-linked event in an append-only store. Replaying those events reconstructs the ledger, per-account lot books, and PDT counter. The next process resumes from there.
horizon.state.recovery is the replay engine. Read-only. Additive. Not on the hot path.
Quickest start
from horizon.audit import AuditLog, SQLiteSink
from horizon.state import rebuild_from_audit_log
audit_log = AuditLog(sink=SQLiteSink("/var/lib/horizon/audit.db"))
state = rebuild_from_audit_log(audit_log, accounts=registry)
print(state.summary())
# {'last_sequence': 128312, 'n_fills_replayed': 4283,
# 'n_wash_sales_replayed': 12, 'ledger_positions': 47, ...}
result = hz.run(
mode="live",
feed=my_feed,
audit_log=audit_log, # keep writing to the SAME log
accounts=registry,
recover_from=audit_log, # enables recovery
watchdog=LiveWatchdogConfig(...),
strategies=[MyStrategy],
...,
)
With recover_from=, hz.run():
- Walks the audit sink in order.
- Rebuilds
PositionLedgerby replayingOrderFilled. - Rebuilds per-account
LotBookusing each account’stax_lot_election. - Rebuilds the
PDTCounterfrom same-day buy/sell pairs. - Re-applies
WashSaleDetectedevents so adjusted bases carry through. - Seeds the run loop with the recovered state and resumes.
Recovered
| State | Mechanism |
|---|---|
| Ledger positions (quantity, avg cost, realized P&L, fees) | Replay OrderFilled through PositionLedger.apply_fill |
| Per-account lot books (FIFO / LIFO / HIFO / SpecID / AverageCost) | Replay fills using each account’s tax_lot_election |
| Closed lots with LT/ST tags | Re-derived |
| Wash-sale adjustments | Re-applied from WashSaleDetected |
| PDT day-trade counter | Re-derived from replay |
| Trade history (blotter) | From ledger trade records |
Not recovered
- Open orders at the broker. The venue is ground truth for what is alive right now. Call
VenueLedgerReconciler.reconcile()after recovery to tie out against the broker’s positions and open orders. - Feed subscriptions. The live feed starts fresh. Market-data feeds generally do not support replay.
- Risk-engine counters (rate window, stop-loss cooldowns). Session-local defensive state. A fresh session with recovered positions and policies is correct.
- Equity curve samples. Recomputed forward from the recovery point. Historical TWR/MWR comes from the audit log when needed.
- Strategy internal state. Feature buffers, counters, model state must re-warm from data. For checkpointed state, serialize to the audit log’s
Annotationcategory and restore manually. - Parent-order in-flight map. If the process died mid-block-allocation with the parent live at the broker but children unsplit, that parent needs manual reconciliation. Watchdog’s
auto_flatten_on_stopreduces exposure.
Event-sourcing flow
Prior run:
┌──────────────────────────┐
│ audit log (SQLite WORM) │
│ 1 SystemStart │
│ 2 OrderSubmitted AAPL │
│ 3 OrderFilled AAPL │
│ 4 OrderSubmitted MSFT │
│ 5 OrderFilled MSFT │
│ ... │
│ 4283 WashSaleDetected │
│ 4284 <crash> │
└──────────────────────────┘
│
│ rebuild_from_audit_log(log, accounts=registry)
▼
┌──────────────────────────┐
│ RecoveredState │
│ PositionLedger │
│ lot_books[acc_id] │
│ PDTCounter │
│ last_sequence=4283 │
└──────────────────────────┘
│
│ hz.run(..., recover_from=log)
▼
┌──────────────────────────┐
│ new run │
│ 4284 SystemStart │
│ 4285 OrderSubmitted ... │
│ ... │
└──────────────────────────┘
The audit chain is continuous. Sequence numbers grow across restarts. A later auditor cannot tell from the log alone that there was a restart.
SIGINT, stop, restart, recover
import signal
import threading
from horizon.audit import AuditLog, SQLiteSink
from horizon.ops import LiveWatchdogConfig
LOG_PATH = "/var/lib/horizon/audit.db"
def run_session():
audit_log = AuditLog(sink=SQLiteSink(LOG_PATH))
stop = threading.Event()
signal.signal(signal.SIGINT, lambda *_: stop.set())
hz.run(
mode="live",
feed=make_feed(),
strategies=[MyStrategy],
accounts=registry,
audit_log=audit_log,
recover_from=audit_log,
watchdog=LiveWatchdogConfig(
grace_period_s=10,
feed_stale_seconds=15,
auto_flatten_on_stop=True,
),
stop_event=stop,
max_duration_s=3600 * 12,
)
if __name__ == "__main__":
run_session()
First run: the log is empty, recovery is a no-op. Later runs: recovery replays, new events append to the same chain.
Cold analysis
Recovery also answers historical state questions:
state_q1_end = rebuild_from_audit_log(
audit_log, accounts=registry,
start_sequence=1,
end_sequence=128312, # last sequence on 2026-03-31
)
print(state_q1_end.summary())
Useful for statement generation from cold data, post-hoc best-execution analysis, and exam questions like “what did we look like on 2026-03-14 14:32:00?”
Robustness
- Malformed events are tolerated. A fill with a missing
quantityor unknownsideis skipped.n_fills_replayedreflects what actually replayed. - Missing registry is fine.
accounts=Nonerebuilds the ledger; lot books and PDT stay empty. Useful for audit-only analysis. - Append-only safe. Recovery reads only. Multiple recoveries from the same log are idempotent.
- Partial ranges.
start_sequenceandend_sequencereplay a window without rewinding the whole log.
Reconciliation still required
After recovery in live mode, tie the rebuilt ledger against the broker:
from horizon.execution import VenueLedgerReconciler
reconciler = VenueLedgerReconciler(my_alpaca_venue, ledger, audit_log=audit_log)
report = reconciler.reconcile()
if not report.clean:
for m in report.mismatches:
# Investigate before submitting new orders
...
The audit log captures what the process knew. The broker may have filled or canceled orders during the outage. The reconciler catches those gaps and emits ReconcileMismatch events.
Limits
- Recovery handles ungraceful process exits. It does not handle broker desync; that is the reconciler.
- For hosts with persistent WAL / disk, crash recovery works with no coordination. For containerized deployments, ensure
SQLiteSink’s path is mounted on durable storage. - The first seconds after recovery are sensitive. Watchdog
grace_period_slets the feed warm up before enforcement. Do not set 0 in production. - WORM storage is a deployment choice.
SQLiteSinkis append-only within Python (triggers reject UPDATE/DELETE), but filesystem write access can still overwrite the file. For hard WORM, archive to S3 Object Lock or Glacier on a nightly job. See Audit trail.