Reliability
AgentPack is tiny. That's the point: the smaller the moving surface, the less there is to break.
Where state lives
| Kind | Lives in | Backup story |
|---|---|---|
| Rows | Firestore | Firebase PITR (point-in-time recovery) |
| Raw MIME | postbox-mime bucket | Cloud Storage replication |
| Artifact blobs | agentpack-artifacts bucket | Cloud Storage replication |
| Signing key (Ed25519) | Go identity service host | Offline cold copy |
| Service role key | Operator secret manager | Rotated on incident |
There is no third place to look. A Firebase PITR restore plus the Ed25519 cold key restores the whole platform.
Retention
A family of agentpack_* cron jobs trims old rows on schedule,
reading their windows from identity.settings:
- agentpack_mesh_peer_prune: 04:07 UTC, daily
- agentpack_postbox_retention: 03:17 UTC, daily
- agentpack_recall_episodic_retention: 03:37 UTC, daily
- agentpack_artifacts_reap: 03:47 UTC, daily
- agentpack_scheduler_reap_runs: 03:57 UTC, daily
- agentpack_audit_monitor: 04:17 UTC, daily (read-only chain check)
They are staggered to avoid colliding. audit.cron_status() reports each
job's last run, next run, and error, filtered by the agentpack_ prefix.
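A quick health pass over the jobs can be scripted against that RPC. The sketch below assumes audit.cron_status() is reachable over HTTPS; the endpoint URL and row shape are placeholders, not the actual client API.

```ts
// Sketch: list every agentpack_* cron job with its last run, next run, and error.
// RPC_URL and CronStatusRow are illustrative; adapt to however you call AgentPack RPCs.
type CronStatusRow = {
  job_name: string;
  last_run: string | null;  // ISO timestamp of the most recent run
  next_run: string | null;
  last_error: string | null;
};

const RPC_URL = "https://YOUR-PROJECT.example/rpc/audit.cron_status"; // placeholder

async function listAgentpackCrons(): Promise<CronStatusRow[]> {
  const res = await fetch(RPC_URL, { method: "POST" });
  if (!res.ok) throw new Error(`cron_status failed: ${res.status}`);
  const rows: CronStatusRow[] = await res.json();
  // The RPC already filters on the agentpack_ prefix; re-filter defensively anyway.
  return rows.filter((r) => r.job_name.startsWith("agentpack_"));
}
```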
SLOs
These are the targets the reference deployment aims for. Tune to your Firebase tier.
| Metric | Target |
|---|---|
| Edge function p95 latency | < 200 ms |
| postbox.ingest availability | 99.9% / 30d |
| Audit chain check | 100% pass rate |
| Scheduler dead-letter ratio | < 1% of runs |
| Postbox outbound bounce rate | < 2% |
| Mesh /presence freshness | ≤ 120 s |
Failure modes and responses
Edge function timeout
Retry with jitter from the client. RPCs are idempotent; duplicate writes
collapse via ON CONFLICT or existence checks.
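Client-side, the retry can be as small as the sketch below; the attempt count and base delay are arbitrary defaults, not AgentPack requirements.

```ts
// Sketch: retry an idempotent RPC with exponential backoff plus full jitter.
// Safe to call repeatedly because duplicate writes collapse server-side
// (ON CONFLICT / existence checks).
async function withRetry<T>(
  call: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 200,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await call();
    } catch (err) {
      if (i >= attempts - 1) throw err;
      // Full jitter: random delay in [0, baseDelayMs * 2^i).
      const delay = Math.random() * baseDelayMs * 2 ** i;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const result = await withRetry(() => callSomeIdempotentRpc());
```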
Cloud Scheduler stops firing
audit.cron_status() will show next_run in the past and last_run
stale. A dashboard alert on (now() - last_run) > interval '1 day' catches
this.
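As a sketch, the alert reduces to one predicate over the cron_status rows (row shape as assumed earlier):

```ts
// Sketch: flag a cron job whose last_run is more than a day old.
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function isStaleCron(lastRun: string | null, now: Date = new Date()): boolean {
  if (!lastRun) return true; // never ran: treat as stale
  return now.getTime() - new Date(lastRun).getTime() > ONE_DAY_MS;
}
```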
Audit chain breaks
audit.monitor.last_first_bad_id becomes non-null. Stop writes, snapshot
the DB, and investigate. audit.verify_chain() tells you exactly which
row diverges.
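A first-responder script can combine the two signals. In the sketch below, callRpc stands in for however you invoke AgentPack RPCs, and the response field names are assumptions; only the RPC names come from the text above.

```ts
// Sketch: if the monitor reports a bad row, pinpoint the divergence with verify_chain().
async function checkAuditChain(
  callRpc: (name: string) => Promise<Record<string, unknown>>,
): Promise<void> {
  const monitor = await callRpc("audit.monitor");
  if (monitor.last_first_bad_id == null) {
    console.log("audit chain intact");
    return;
  }
  // Chain is broken: freeze writes and snapshot the DB before going further.
  const report = await callRpc("audit.verify_chain");
  console.error(
    "audit chain diverges at row",
    report.first_bad_id ?? monitor.last_first_bad_id,
  );
}
```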
Storage quota
Postbox MIME and artifact blobs live in storage. Retention cron reaps them; if it stops running, storage fills. Monitor bucket size.
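Checking bucket growth needs nothing AgentPack-specific; the sketch below uses the @google-cloud/storage Node client with an arbitrary 50 GiB soft limit, and the bucket names come from the table above.

```ts
import { Storage } from "@google-cloud/storage";

// Sketch: sum object sizes in each AgentPack bucket and warn past a soft limit.
// Note: getFiles() lists every object's metadata, so this suits a periodic
// check but is slow for very large buckets.
const storage = new Storage();

async function bucketBytes(name: string): Promise<number> {
  const [files] = await storage.bucket(name).getFiles();
  return files.reduce((sum, f) => sum + Number(f.metadata.size ?? 0), 0);
}

async function checkStorageGrowth(): Promise<void> {
  const softLimit = 50 * 1024 ** 3; // 50 GiB, pick your own threshold
  for (const name of ["postbox-mime", "agentpack-artifacts"]) {
    const bytes = await bucketBytes(name);
    if (bytes > softLimit) {
      console.warn(`${name} is over the soft limit: ${bytes} bytes`);
    }
  }
}
```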
Firebase region outage
Edge functions fail over per the platform's runbook. State is eventually consistent across Firebase regions. Agents should tolerate 5xx with exponential backoff.
Disaster recovery
- Identify the blast radius. Is it a row subset, a schema, a bucket, or the project?
- Freeze writes. Pause the MTA, disable scheduled webhooks via Cloud Scheduler.unschedule(...), and revoke the bridge key if compromise is suspected.
- PITR restore to a new project; keep the compromised project for forensics.
- Rotate the bridge key, the service role key, the Ed25519 signing key, every device key, every delegation, and every outbound webhook secret.
- Replay audit on the restored DB to confirm chain integrity.
- Re-enroll devices through Pocket's pairing flow.
Operational dashboards
The shape of a good AgentPack dashboard:
- Postbox: ingest rate, bounce rate, suppression size, outbox queue depth
- Recall: writes/sec, Firestore vector search p95, episodic bytes per agent
- Mesh: peers online, PSK age distribution, ACL grant count
- Gateway: host count, certs in flight, cert-days-remaining histogram
- Scheduler: runs/min, DLQ depth, late-by-more-than-2× count
- Audit: chain status, last monitor run, appends/min
All of this is queryable from the existing RPCs — no extra telemetry plane required.