STD-OPS-003: Observability & Incident Response

1. Context

This standard exists to ensure "Production Survivability": we must know when a system breaks and how to fix it.

2. The Standard (The Floor)

  • [MUST] Structured Logging: Each log entry MUST be a JSON object, not a free-form string.
      • Required fields: correlation_id, service, level, host.
  • [MUST] Error Taxonomy: Every error MUST be classified into one of four severities.
      • Sev-1 (Critical): Data Loss, Security Breach, Total Outage. (Wake up the Human.)
      • Sev-2 (High): Major feature broken.
      • Sev-3 (Medium): Minor bug / UX issue.
      • Sev-4 (Low): Noise / Warning.
  • [MUST] Incident Playbook: Every Stable module MUST have a RUNBOOK.md describing how to debug its most common failures.
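The logging and taxonomy rules above can be sketched with Python's standard logging module. This is a minimal illustration, not a mandated implementation: the names JsonFormatter, Severity, and pages_human, the extra "message" and "severity" keys, and the rule that only Sev-1 pages a human are assumptions made for the example.

```python
import json
import logging
import socket
from enum import Enum


class Severity(Enum):
    """Error taxonomy from section 2; the member names are illustrative."""
    SEV1 = "Sev-1"  # Critical: data loss, security breach, total outage
    SEV2 = "Sev-2"  # High: major feature broken
    SEV3 = "Sev-3"  # Medium: minor bug / UX issue
    SEV4 = "Sev-4"  # Low: noise / warning


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object carrying the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service
        self.host = socket.gethostname()

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Required by the standard:
            "correlation_id": getattr(record, "correlation_id", None),
            "service": self.service,
            "level": record.levelname,
            "host": self.host,
            # Assumed extras, not required by the standard:
            "message": record.getMessage(),
        }
        severity = getattr(record, "severity", None)
        if severity is not None:
            payload["severity"] = severity.value
        return json.dumps(payload)


def pages_human(severity: Severity) -> bool:
    """Assumption for this sketch: only Sev-1 wakes up the on-call human."""
    return severity is Severity.SEV1
```

With the formatter attached to a handler, a call such as logger.error("payment failed", extra={"correlation_id": "abc-123", "severity": Severity.SEV2}) would emit a single JSON line containing all four required fields.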

3. Best Practices (The Path)

  • [SHOULD] Tracing: Pass x-request-id across all service boundaries.
  • [SHOULD] Alerting: Alert on Symptoms (High Error Rate), not just Causes (High CPU).
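Both practices above can be sketched as small helpers. This is a hedged illustration only: the function names outbound_headers and error_rate_alert, the choice of a UUID for newly minted IDs, and the 5% default threshold are assumptions, not values set by this standard.

```python
import uuid
from typing import Mapping

REQUEST_ID_HEADER = "x-request-id"


def outbound_headers(inbound: Mapping[str, str]) -> dict[str, str]:
    """Reuse the caller's request ID, or mint a new one if none was sent."""
    for name, value in inbound.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == REQUEST_ID_HEADER and value:
            return {REQUEST_ID_HEADER: value}
    return {REQUEST_ID_HEADER: str(uuid.uuid4())}


def error_rate_alert(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Symptom-based check: fire when the error fraction exceeds the threshold."""
    if requests == 0:
        return False  # no traffic in the window, so there is no error rate to judge
    return errors / requests > threshold
```

A service would merge outbound_headers(request_headers) into every downstream call so one ID follows the request end to end. The alert check looks at what users experience (failed requests) rather than at a resource metric such as CPU.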

5. Version History

Version   Date         Author   Change
0.1       2026-01-25   AI       Draft P0 Standard

Version History

Version   Date         Author        Change
0.1.0     2026-01-26   Antigravity   Initial Audit & Metadata Injection