Crash recovery

The audit log is the source of truth. Restart, rebuild the ledger and lot books from events, resume.

The watchdog protects while the process is running. Recovery protects across restarts. Once real money is in play and the SDK runs unattended, a crash (OS reboot, OOM kill, network partition, deploy) must not mean lost state.

The audit log is the source of truth. Every fill, wash-sale adjustment, and block allocation is a chain-linked event in an append-only store. Replaying those events reconstructs the ledger, per-account lot books, and PDT counter. The next process resumes from there.

horizon.state.recovery is the replay engine. Read-only. Additive. Not on the hot path.

Quickest start

python
from horizon.audit import AuditLog, SQLiteSink
from horizon.state import rebuild_from_audit_log

audit_log = AuditLog(sink=SQLiteSink("/var/lib/horizon/audit.db"))

state = rebuild_from_audit_log(audit_log, accounts=registry)
print(state.summary())
# {'last_sequence': 128312, 'n_fills_replayed': 4283,
#  'n_wash_sales_replayed': 12, 'ledger_positions': 47, ...}

result = hz.run(
    mode="live",
    feed=my_feed,
    audit_log=audit_log,                # keep writing to the SAME log
    accounts=registry,
    recover_from=audit_log,              # enables recovery
    watchdog=LiveWatchdogConfig(...),
    strategies=[MyStrategy],
    ...,
)

With recover_from=, hz.run():

  1. Walks the audit sink in order.
  2. Rebuilds PositionLedger by replaying OrderFilled.
  3. Rebuilds per-account LotBook using each account’s tax_lot_election.
  4. Rebuilds the PDTCounter from same-day buy/sell pairs.
  5. Re-applies WashSaleDetected events so adjusted bases carry through.
  6. Seeds the run loop with the recovered state and resumes.

Recovered

StateMechanism
Ledger positions (quantity, avg cost, realized P&L, fees)Replay OrderFilled through PositionLedger.apply_fill
Per-account lot books (FIFO / LIFO / HIFO / SpecID / AverageCost)Replay fills using each account’s tax_lot_election
Closed lots with LT/ST tagsRe-derived
Wash-sale adjustmentsRe-applied from WashSaleDetected
PDT day-trade counterRe-derived from replay
Trade history (blotter)From ledger trade records

Not recovered

  • Open orders at the broker. The venue is ground truth for what is alive right now. Call VenueLedgerReconciler.reconcile() after recovery to tie out against the broker’s positions and open orders.
  • Feed subscriptions. The live feed starts fresh. Market-data feeds generally do not support replay.
  • Risk-engine counters (rate window, stop-loss cooldowns). Session-local defensive state. A fresh session with recovered positions and policies is correct.
  • Equity curve samples. Recomputed forward from the recovery point. Historical TWR/MWR comes from the audit log when needed.
  • Strategy internal state. Feature buffers, counters, model state must re-warm from data. For checkpointed state, serialize to the audit log’s Annotation category and restore manually.
  • Parent-order in-flight map. If the process died mid-block-allocation with the parent live at the broker but children unsplit, that parent needs manual reconciliation. Watchdog’s auto_flatten_on_stop reduces exposure.

Event-sourcing flow

text
Prior run:
    ┌──────────────────────────┐
    │ audit log (SQLite WORM)  │
    │  1  SystemStart           │
    │  2  OrderSubmitted AAPL   │
    │  3  OrderFilled AAPL      │
    │  4  OrderSubmitted MSFT   │
    │  5  OrderFilled MSFT      │
    │  ...                      │
    │ 4283 WashSaleDetected     │
    │ 4284 <crash>              │
    └──────────────────────────┘
                │
                │  rebuild_from_audit_log(log, accounts=registry)
                ▼
    ┌──────────────────────────┐
    │ RecoveredState            │
    │  PositionLedger           │
    │  lot_books[acc_id]        │
    │  PDTCounter               │
    │  last_sequence=4283       │
    └──────────────────────────┘
                │
                │  hz.run(..., recover_from=log)
                ▼
    ┌──────────────────────────┐
    │ new run                   │
    │  4284 SystemStart         │
    │  4285 OrderSubmitted ...  │
    │  ...                      │
    └──────────────────────────┘

The audit chain is continuous. Sequence numbers grow across restarts. A later auditor cannot tell from the log alone that there was a restart.

SIGINT, stop, restart, recover

python
import signal
import threading
from horizon.audit import AuditLog, SQLiteSink
from horizon.ops import LiveWatchdogConfig

LOG_PATH = "/var/lib/horizon/audit.db"

def run_session():
    audit_log = AuditLog(sink=SQLiteSink(LOG_PATH))
    stop = threading.Event()
    signal.signal(signal.SIGINT, lambda *_: stop.set())

    hz.run(
        mode="live",
        feed=make_feed(),
        strategies=[MyStrategy],
        accounts=registry,
        audit_log=audit_log,
        recover_from=audit_log,
        watchdog=LiveWatchdogConfig(
            grace_period_s=10,
            feed_stale_seconds=15,
            auto_flatten_on_stop=True,
        ),
        stop_event=stop,
        max_duration_s=3600 * 12,
    )

if __name__ == "__main__":
    run_session()

First run: the log is empty, recovery is a no-op. Later runs: recovery replays, new events append to the same chain.

Cold analysis

Recovery also answers historical state questions:

python
state_q1_end = rebuild_from_audit_log(
    audit_log, accounts=registry,
    start_sequence=1,
    end_sequence=128312,            # last sequence on 2026-03-31
)

print(state_q1_end.summary())

Useful for statement generation from cold data, post-hoc best-execution analysis, and exam questions like “what did we look like on 2026-03-14 14:32:00?”

Robustness

  • Malformed events are tolerated. A fill with a missing quantity or unknown side is skipped. n_fills_replayed reflects what actually replayed.
  • Missing registry is fine. accounts=None rebuilds the ledger; lot books and PDT stay empty. Useful for audit-only analysis.
  • Append-only safe. Recovery reads only. Multiple recoveries from the same log are idempotent.
  • Partial ranges. start_sequence and end_sequence replay a window without rewinding the whole log.

Reconciliation still required

After recovery in live mode, tie the rebuilt ledger against the broker:

python
from horizon.execution import VenueLedgerReconciler

reconciler = VenueLedgerReconciler(my_alpaca_venue, ledger, audit_log=audit_log)
report = reconciler.reconcile()
if not report.clean:
    for m in report.mismatches:
        # Investigate before submitting new orders
        ...

The audit log captures what the process knew. The broker may have filled or canceled orders during the outage. The reconciler catches those gaps and emits ReconcileMismatch events.

Limits

  • Recovery handles ungraceful process exits. It does not handle broker desync; that is the reconciler.
  • For hosts with persistent WAL / disk, crash recovery works with no coordination. For containerized deployments, ensure SQLiteSink’s path is mounted on durable storage.
  • The first seconds after recovery are sensitive. Watchdog grace_period_s lets the feed warm up before enforcement. Do not set 0 in production.
  • WORM storage is a deployment choice. SQLiteSink is append-only within Python (triggers reject UPDATE/DELETE), but filesystem write access can still overwrite the file. For hard WORM, archive to S3 Object Lock or Glacier on a nightly job. See Audit trail.