STD-OPS-003: Observability & Incident Response
1. Context
To ensure "Production Survivability". We must know when it breaks and how to fix it.
2. The Standard (The Floor)
- [MUST] Structured Logging: Logs MUST be JSON objects, not strings.
- Required:
correlation_id, service, level, host.
- [MUST] Error Taxonomy:
- Sev-1 (Critical): Data Loss, Security Breach, Total Outage. (Wake up the Human).
- Sev-2 (High): Major feature broken.
- Sev-3 (Medium): Minor bug / UX issue.
- Sev-4 (Low): Noise / Warning.
- [MUST] Incident Playbook: Every
Stable module MUST have a RUNBOOK.md describing how to debug its most common failures.
3. Best Practices (The Path)
- [SHOULD] Tracing: Pass
x-request-id across all service boundaries.
- [SHOULD] Alerting: Alert on Symptoms (High Error Rate), not just Causes (High CPU).
5. Version History
| Version |
Date |
Author |
Change |
| 0.1 |
2026-01-25 |
AI |
Draft P0 Standard |
Version History
| Version |
Date |
Author |
Change |
| 0.1.0 |
2026-01-26 |
Antigravity |
Initial Audit & Metadata Injection |