Reliability
AgentPack is tiny. That's the point: the smaller the moving surface, the less there is to break.
Where state lives
| Kind | Lives in | Backup story |
|---|---|---|
| Rows | Firestore | Firebase PITR (point-in-time recovery) |
| Raw MIME | postbox-mime bucket | Cloud Storage replication |
| Artifact blobs | agentpack-artifacts bucket | Cloud Storage replication |
| Signing key (Ed25519) | Go identity service host | Offline cold copy |
| Service role key | Operator secret manager | Rotated on incident |
There is no third place to look. A Firebase PITR restore plus the Ed25519 cold key restores the whole platform.
Retention
A family of agentpack_* cron jobs trims old rows on schedule,
reading their windows from identity.settings:
- agentpack_mesh_peer_prune: 04:07 UTC, daily
- agentpack_postbox_retention: 03:17 UTC, daily
- agentpack_recall_episodic_retention: 03:37 UTC, daily
- agentpack_artifacts_reap: 03:47 UTC, daily
- agentpack_scheduler_reap_runs: 03:57 UTC, daily
- agentpack_audit_monitor: 04:17 UTC, daily (read-only chain check)
They are staggered to avoid colliding. audit.cron_status() reports each
job's last run, next run, and error, filtered by the agentpack_ prefix.
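A quick health pass over the jobs can be scripted against that RPC. The sketch below assumes audit.cron_status() is reachable over HTTPS; the endpoint URL and row shape are placeholders, not the actual client API.

```ts
// Sketch: list every agentpack_* cron job with its last run, next run, and error.
// RPC_URL and CronStatusRow are illustrative; adapt to however you call AgentPack RPCs.
type CronStatusRow = {
  job_name: string;
  last_run: string | null;  // ISO timestamp of the most recent run
  next_run: string | null;
  last_error: string | null;
};

const RPC_URL = "https://YOUR-PROJECT.example/rpc/audit.cron_status"; // placeholder

async function listAgentpackCrons(): Promise<CronStatusRow[]> {
  const res = await fetch(RPC_URL, { method: "POST" });
  if (!res.ok) throw new Error(`cron_status failed: ${res.status}`);
  const rows: CronStatusRow[] = await res.json();
  // The RPC already filters on the agentpack_ prefix; re-filter defensively anyway.
  return rows.filter((r) => r.job_name.startsWith("agentpack_"));
}
```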
SLOs
These are the targets the reference deployment aims for. Tune to your Firebase tier.
| Metric | Target |
|---|---|
| Edge function p95 latency | < 200 ms |
| postbox.ingest availability | 99.9% / 30d |
| Audit chain check | 100% pass rate |
| Scheduler dead-letter ratio | < 1% of runs |
| Postbox outbound bounce rate | < 2% |
| Mesh /presence freshness | ≤ 120 s |
Failure modes and responses
Edge function timeout
Retry with jitter from the client. RPCs are idempotent; duplicate writes
collapse via ON CONFLICT or existence checks.
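Client-side, the retry can be as small as the sketch below; the attempt count and base delay are arbitrary defaults, not AgentPack requirements.

```ts
// Sketch: retry an idempotent RPC with exponential backoff plus full jitter.
// Safe to call repeatedly because duplicate writes collapse server-side
// (ON CONFLICT / existence checks).
async function withRetry<T>(
  call: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 200,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await call();
    } catch (err) {
      if (i >= attempts - 1) throw err;
      // Full jitter: random delay in [0, baseDelayMs * 2^i).
      const delay = Math.random() * baseDelayMs * 2 ** i;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const result = await withRetry(() => callSomeIdempotentRpc());
```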
Cloud Scheduler stops firing
audit.cron_status() will show next_run in the past and last_run
stale. A dashboard alert on (now() - last_run) > interval '1 day' catches
this.
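As a sketch, the alert reduces to one predicate over the cron_status rows (row shape as assumed earlier):

```ts
// Sketch: flag a cron job whose last_run is more than a day old.
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function isStaleCron(lastRun: string | null, now: Date = new Date()): boolean {
  if (!lastRun) return true; // never ran: treat as stale
  return now.getTime() - new Date(lastRun).getTime() > ONE_DAY_MS;
}
```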
Audit chain breaks
audit.monitor.last_first_bad_id becomes non-null. Stop writes, snapshot
the DB, and investigate. audit.verify_chain() tells you exactly which
row diverges.
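A first-responder script can combine the two signals. In the sketch below, callRpc stands in for however you invoke AgentPack RPCs, and the response field names are assumptions; only the RPC names come from the text above.

```ts
// Sketch: if the monitor reports a bad row, pinpoint the divergence with verify_chain().
async function checkAuditChain(
  callRpc: (name: string) => Promise<Record<string, unknown>>,
): Promise<void> {
  const monitor = await callRpc("audit.monitor");
  if (monitor.last_first_bad_id == null) {
    console.log("audit chain intact");
    return;
  }
  // Chain is broken: freeze writes and snapshot the DB before going further.
  const report = await callRpc("audit.verify_chain");
  console.error(
    "audit chain diverges at row",
    report.first_bad_id ?? monitor.last_first_bad_id,
  );
}
```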
Storage quota
Postbox MIME and artifact blobs live in storage. Retention cron reaps them; if it stops running, storage fills. Monitor bucket size.
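Checking bucket growth needs nothing AgentPack-specific; the sketch below uses the @google-cloud/storage Node client with an arbitrary 50 GiB soft limit, and the bucket names come from the table above.

```ts
import { Storage } from "@google-cloud/storage";

// Sketch: sum object sizes in each AgentPack bucket and warn past a soft limit.
// Note: getFiles() lists every object's metadata, so this suits a periodic
// check but is slow for very large buckets.
const storage = new Storage();

async function bucketBytes(name: string): Promise<number> {
  const [files] = await storage.bucket(name).getFiles();
  return files.reduce((sum, f) => sum + Number(f.metadata.size ?? 0), 0);
}

async function checkStorageGrowth(): Promise<void> {
  const softLimit = 50 * 1024 ** 3; // 50 GiB, pick your own threshold
  for (const name of ["postbox-mime", "agentpack-artifacts"]) {
    const bytes = await bucketBytes(name);
    if (bytes > softLimit) {
      console.warn(`${name} is over the soft limit: ${bytes} bytes`);
    }
  }
}
```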
Firebase region outage
Edge functions fail over per the platform's runbook. State is eventually consistent across Firebase regions. Agents should tolerate 5xx with exponential backoff.
Disaster recovery
- Identify the blast radius. Is it a row subset, a schema, a bucket, or the project?
- Freeze writes. Pause the MTA, disable scheduled webhooks via Cloud Scheduler.unschedule(...), and revoke the bridge key if compromise is suspected.
- PITR restore to a new project; keep the compromised project for forensics.
- Rotate the bridge key, the service role key, the Ed25519 signing key, every device key, every delegation, and every outbound webhook secret.
- Replay audit on the restored DB to confirm chain integrity.
- Re-enroll devices through Pocket's pairing flow.
Operational dashboards
The shape of a good AgentPack dashboard:
- Postbox: ingest rate, bounce rate, suppression size, outbox queue depth
- Recall: writes/sec, Firestore vector search p95, episodic bytes per agent
- Mesh: peers online, PSK age distribution, ACL grant count
- Gateway: host count, certs in flight, cert-days-remaining histogram
- Scheduler: runs/min, DLQ depth, late-by-more-than-2× count
- Audit: chain status, last monitor run, appends/min
All of this is queryable from the existing RPCs — no extra telemetry plane required.