Reliability

AgentPack is tiny. That's the point: the smaller the moving surface, the less there is to break.

Where state lives

Kind                   Lives in                    Backup story
---------------------  --------------------------  --------------------------------------
Rows                   Firestore                   Firebase PITR (point-in-time recovery)
Raw MIME               postbox-mime bucket         Cloud Storage replication
Artifact blobs         agentpack-artifacts bucket  Cloud Storage replication
Signing key (Ed25519)  Go identity service host    Offline cold copy
Service role key       Operator secret manager     Rotated on incident

There is no third place to look. A Firebase PITR restore plus the Ed25519 cold key restores the whole platform.

Retention

A family of agentpack_* cron jobs trims old rows on schedule, reading their retention windows from identity.settings:

  • agentpack_mesh_peer_prune — 04:07 UTC, daily
  • agentpack_postbox_retention — 03:17 UTC, daily
  • agentpack_recall_episodic_retention — 03:37 UTC, daily
  • agentpack_artifacts_reap — 03:47 UTC, daily
  • agentpack_scheduler_reap_runs — 03:57 UTC, daily
  • agentpack_audit_monitor — 04:17 UTC, daily (read-only chain check)

They are staggered to avoid colliding. audit.cron_status() reports each job's last run, next run, and error, filtered by the agentpack_ prefix.
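
The report is also easy to consume programmatically. A minimal TypeScript sketch, assuming an rpc() transport helper and a row shape of { jobname, last_run, next_run, last_error } inferred from the fields just named:

  // Sketch only: rpc() and the CronStatus row shape are assumptions.
  type CronStatus = {
    jobname: string;          // e.g. "agentpack_postbox_retention"
    last_run: string | null;  // ISO timestamp of the last completed run
    next_run: string | null;  // ISO timestamp of the next scheduled run
    last_error: string | null;
  };

  declare function rpc<T>(name: string, args?: object): Promise<T>;

  async function cronReport(): Promise<void> {
    // audit.cron_status() already filters to the agentpack_ prefix.
    const jobs = await rpc<CronStatus[]>("audit.cron_status");
    for (const j of jobs) {
      const flag = j.last_error ? `  ERROR: ${j.last_error}` : "";
      console.log(`${j.jobname}  last=${j.last_run}  next=${j.next_run}${flag}`);
    }
  }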

SLOs

These are the targets the reference deployment aims for. Tune to your Firebase tier.

Metric                        Target
----------------------------  --------------
Edge function p95 latency     < 200 ms
postbox.ingest availability   99.9% / 30d
Audit chain check             100% pass rate
Scheduler dead-letter ratio   < 1% of runs
Postbox outbound bounce rate  < 2%
Mesh /presence freshness      ≤ 120 s

Failure modes and responses

Edge function timeout

Retry with jitter from the client. RPCs are idempotent; duplicate writes collapse via ON CONFLICT or existence checks.
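
Because the RPCs are idempotent, a blind retry loop is safe. A sketch of full-jitter backoff; callRpc() stands in for whatever transport the client uses:

  // Full-jitter retry for idempotent RPCs. callRpc() is a stand-in.
  declare function callRpc<T>(name: string, args: object): Promise<T>;

  async function withRetry<T>(
    name: string,
    args: object,
    attempts = 5,
    baseMs = 250,
  ): Promise<T> {
    for (let i = 0; ; i++) {
      try {
        return await callRpc<T>(name, args);
      } catch (err) {
        if (i + 1 >= attempts) throw err;
        // Sleep a random duration in [0, base * 2^i], capped at 10 s.
        const cap = Math.min(baseMs * 2 ** i, 10_000);
        await new Promise((r) => setTimeout(r, Math.random() * cap));
      }
    }
  }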

Cloud Scheduler stops firing

audit.cron_status() will show next_run in the past and last_run stale. A dashboard alert on (now() - last_run) > interval '1 day' catches this.
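
The same predicate, evaluated client-side when the dashboard reads audit.cron_status() over RPC rather than SQL (a sketch; the 24-hour window mirrors the interval above):

  // Mirrors (now() - last_run) > interval '1 day'.
  function cronIsStale(lastRun: string | null, now = Date.now()): boolean {
    return lastRun === null || now - Date.parse(lastRun) > 24 * 60 * 60 * 1000;
  }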

Audit chain breaks

audit.monitor.last_first_bad_id becomes non-null. Stop writes, snapshot the DB, and investigate. audit.verify_chain() tells you exactly which row diverges.
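
A sketch of that first response wired into a health check. The rpc() helper and both row shapes are assumptions; only the names audit.monitor.last_first_bad_id and audit.verify_chain() come from this page:

  declare function rpc<T>(name: string, args?: object): Promise<T>;

  async function checkAuditChain(): Promise<void> {
    const monitor = await rpc<{ last_first_bad_id: number | null }>("audit.monitor");
    if (monitor.last_first_bad_id === null) return; // chain intact
    // Chain broken: surface the exact divergence, then stop writing.
    const bad = await rpc<{ id: number; expected: string; actual: string }>(
      "audit.verify_chain",
    );
    throw new Error(
      `audit chain broken at row ${bad.id}: expected ${bad.expected}, got ${bad.actual}`,
    );
  }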

Storage quota

Postbox MIME and artifact blobs live in Cloud Storage. The retention crons reap them; if they stop running, the buckets fill. Monitor bucket size.
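
A sketch of a bucket-size watchdog using the Google Cloud Storage client; the 50 GiB threshold is illustrative, and for very large buckets a monitoring metric is cheaper than listing every object:

  import { Storage } from "@google-cloud/storage";

  const storage = new Storage();
  const WATCHED = ["postbox-mime", "agentpack-artifacts"];
  const LIMIT_BYTES = 50 * 1024 ** 3; // illustrative; tune per deployment

  // Sum object sizes by listing the bucket. O(objects): fine for small
  // buckets; prefer a storage metric once listing gets slow.
  async function bucketBytes(name: string): Promise<number> {
    const [files] = await storage.bucket(name).getFiles();
    return files.reduce((sum, f) => sum + Number(f.metadata.size ?? 0), 0);
  }

  async function checkBuckets(): Promise<void> {
    for (const name of WATCHED) {
      const bytes = await bucketBytes(name);
      if (bytes > LIMIT_BYTES) {
        console.error(`${name} at ${bytes} bytes: has retention stopped?`);
      }
    }
  }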

Firebase region outage

Edge functions fail over per the platform's runbook. State is eventually consistent across Firebase regions. Agents should tolerate 5xx with exponential backoff.

Disaster recovery

  1. Identify the blast radius. Is it a row subset, a schema, a bucket, or the project?
  2. Freeze writes. Pause the MTA, disable scheduled webhooks via Cloud Scheduler.unschedule(...) (see the sketch after this list), and revoke the bridge key if compromise is suspected.
  3. PITR restore to a new project; keep the compromised project for forensics.
  4. Rotate the bridge key, the service role key, the Ed25519 signing key, every device key, every delegation, and every outbound webhook secret.
  5. Replay audit on the restored DB to confirm chain integrity.
  6. Re-enroll devices through Pocket's pairing flow.
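
A sketch for step 2's unschedule call, using the Google Cloud Scheduler Node client's pauseJob as the concrete mechanism; project and location are placeholders:

  import { CloudSchedulerClient } from "@google-cloud/scheduler";

  const client = new CloudSchedulerClient();

  // Pause (don't delete) every agentpack_ job so the schedule survives
  // the freeze and can be resumed after the restore.
  async function pauseAgentpackJobs(project: string, location: string) {
    const parent = client.locationPath(project, location);
    const [jobs] = await client.listJobs({ parent });
    for (const job of jobs) {
      const id = job.name?.split("/jobs/")[1];
      if (id?.startsWith("agentpack_")) {
        await client.pauseJob({ name: job.name });
      }
    }
  }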

Operational dashboards

The shape of a good AgentPack dashboard:

  • Postbox: ingest rate, bounce rate, suppression size, outbox queue depth
  • Recall: writes/sec, Firestore vector search p95, episodic bytes per agent
  • Mesh: peers online, PSK age distribution, ACL grant count
  • Gateway: host count, certs in flight, cert-days-remaining histogram
  • Scheduler: runs/min, DLQ depth, late-by-more-than-2× count
  • Audit: chain status, last monitor run, appends/min

All of this is queryable from the existing RPCs — no extra telemetry plane required.
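
As an illustration, one tile from the audit row above, built only from audit.cron_status(); the rpc() helper is the same stand-in used in the earlier sketches:

  declare function rpc<T>(name: string, args?: object): Promise<T>;

  // Audit tile: last monitor run plus overall cron job count.
  async function auditTile() {
    const jobs = await rpc<{ jobname: string; last_run: string | null }[]>(
      "audit.cron_status",
    );
    const monitor = jobs.find((j) => j.jobname === "agentpack_audit_monitor");
    return {
      lastMonitorRun: monitor?.last_run ?? "never",
      cronJobs: jobs.length,
    };
  }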