ADR: Observability Platform Selection (Sentry + Grafana Loki)
Status: ✅ APPROVED
Date: 2026-01-18
Decision Makers: Michael Higgins
ADR Number: 003
Context
Singular Dream requires a comprehensive observability solution to support:
1. Error tracking and exception monitoring
2. Application performance monitoring (APM)
3. Cron job and batch worker monitoring
4. Infrastructure health monitoring
5. Centralized logging for batch jobs and applications
6. Self-healing batch system integration
We needed to formally evaluate monitoring platforms and select the optimal solution(s) for our scale, budget, and technical requirements.
Decision
We will use a hybrid observability approach:
Primary Platform: Sentry
- Purpose: Error tracking, APM, cron monitoring, batch system integration
- Cost: $27/month (Team plan)
- Status: Already integrated and operational
Secondary Platform: Grafana Loki
- Purpose: Centralized logging (high-volume logs)
- Cost: $0/month (within free tier: 50GB/month)
- Status: To be implemented
Rationale
Why Sentry for Monitoring
Strengths:
1. Best-in-class error tracking (10/10)
   - Detailed stack traces with local variables
   - Source code context
   - User impact tracking
   - Error grouping and deduplication
2. Excellent cron monitoring (10/10)
   - Perfect for batch worker health checks
   - Automatic alerting on missed check-ins
   - API for programmatic setup
   - Historical uptime tracking
3. Cost-effective ($27/month vs $100-500/month for competitors)
   - 5,000 errors/month included
   - 10,000 transactions/month included
   - Unlimited cron monitors
   - Transparent pricing
4. Developer-friendly
   - Easy setup and integration
   - Excellent documentation
   - Open-source core (can self-host if needed)
   - Strong community
5. Batch system integration
   - Webhook support for triggering remediation jobs
   - Transaction tracking for job execution
   - Error capture with rich context
   - Self-healing workflow enablement

Weaknesses:
- Not optimal for high-volume centralized logging ($0.20/GB, limited retention)
- Limited infrastructure monitoring (requires custom scripts)
Why Grafana Loki for Logging
Strengths:
1. Cost-effective for logs
   - FREE for our usage (50GB/month tier)
   - $0.50/GB beyond the free tier (Sentry logs: $0.20/GB)
   - No platform fees
2. Purpose-built for logging
   - Label-based indexing (efficient storage)
   - Powerful LogQL query language
   - Long-term retention (configurable)
   - Advanced log parsing and filtering
3. Grafana ecosystem integration
   - Unified dashboards with metrics and traces
   - Excellent visualization
   - Alerting on log patterns
4. Open-source
   - Can self-host if needed
   - No vendor lock-in
   - Active community

Weaknesses:
- Requires separate platform (not unified with Sentry)
- Moderate setup complexity
- No built-in error tracking (use Sentry for this)
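Loki's per-GB rate beyond the free tier is higher than Sentry's, so its cost advantage comes entirely from the 50GB free tier. The sketch below works out where that advantage ends, using only the prices quoted in this ADR. It assumes Sentry bills every GB at the flat $0.20 rate, which this ADR does not confirm, and the function names are hypothetical.

```python
# Illustrative only: prices and the free tier are the figures quoted in this
# ADR. Function names are hypothetical and not part of any vendor SDK.
LOKI_FREE_GB = 50
LOKI_PRICE_PER_GB = 0.50
SENTRY_PRICE_PER_GB = 0.20


def loki_cost(gb: float) -> float:
    """Monthly Loki cost: only volume above the free tier is billed."""
    return max(0.0, gb - LOKI_FREE_GB) * LOKI_PRICE_PER_GB


def sentry_logs_cost(gb: float) -> float:
    """Monthly Sentry log cost, ASSUMING every GB is billed at the flat rate."""
    return gb * SENTRY_PRICE_PER_GB


def break_even_gb() -> float:
    """Volume g where both costs match: 0.20 * g = 0.50 * (g - 50)."""
    return LOKI_FREE_GB * LOKI_PRICE_PER_GB / (LOKI_PRICE_PER_GB - SENTRY_PRICE_PER_GB)
```

Under these assumptions Loki costs nothing at the estimated volume of about 3GB/month and remains the cheaper option up to roughly 83GB/month.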
Alternatives Considered
Datadog
- Pros: Full-stack observability, extensive integrations, excellent APM
- Cons: $92-500/month (too expensive), overkill for our needs
- Verdict: Rejected due to cost
New Relic
- Pros: Robust APM, "Errors Inbox", AI-powered insights
- Cons: $100-200/month (too expensive), steeper learning curve
- Verdict: Rejected due to cost and complexity
Rollbar
- Pros: Specialized error tracking, similar pricing to Sentry
- Cons: No APM, no cron monitoring, fewer features than Sentry
- Verdict: Rejected - Sentry offers more features
Bugsnag
- Pros: Error monitoring with stability scores
- Cons: $59-299/month (more expensive), limited APM
- Verdict: Rejected due to cost and fewer features
Google Cloud Logging (for logs)
- Pros: FREE (50GB/month), native GCP integration, enterprise features
- Cons: Separate from Sentry, less flexible than Loki
- Verdict: Viable alternative to Loki, but Loki preferred for ecosystem
Sentry Logs Only
- Pros: Unified platform, correlated with errors
- Cons: Expensive ($0.20/GB), limited retention, not designed for high-volume
- Verdict: Rejected for centralized logging (use for error-correlated logs only)
Cost Analysis
Our Estimated Usage
- Error events: ~1,000/month
- Transactions: ~10,000/month
- Cron check-ins: ~4,500/month
- Logs: ~3GB/month
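As a rough check, these estimates can be compared with the allowances quoted earlier in this ADR (5,000 errors and 10,000 transactions on the Sentry Team plan, 50GB of logs on the Loki free tier). The sketch is illustrative only. The dictionary and function names are hypothetical, and cron check-ins are left out because the plan lists cron monitors as unlimited.

```python
# Illustrative only: compares this ADR's usage estimates with the plan
# allowances it quotes. Names are hypothetical, not part of any SDK.
ESTIMATED_USAGE = {"errors": 1_000, "transactions": 10_000, "logs_gb": 3}
INCLUDED_QUOTA = {"errors": 5_000, "transactions": 10_000, "logs_gb": 50}


def utilization(usage: dict, quota: dict) -> dict:
    """Return the share of each included quota consumed, as a percentage."""
    return {name: 100 * usage[name] / quota[name] for name in usage}
```

On these figures, errors use 20% and logs 6% of their allowances, while transactions sit at 100%, so transaction volume is the quota to watch.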
Selected Solution Cost
| Component | Monthly Cost |
|---|---|
| Sentry (Team plan) | $27 |
| Grafana Loki (Free tier) | $0 |
| Total | $27/month |
Rejected Alternatives Cost
| Solution | Monthly Cost | Monthly Savings vs Selected Solution |
|---|---|---|
| Datadog | $92-500 | $65-473 |
| New Relic | $100-200 | $73-173 |
| Sentry + Sentry Logs | $27 + $0.60 = $27.60 (3GB at $0.20/GB) | $0.60 |
Annual Savings: $780-5,676 compared to enterprise solutions
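The annual range follows from the monthly figures for Datadog and New Relic. A minimal sketch of the arithmetic, using only the costs in the tables above (the names are hypothetical):

```python
# Illustrative only: reproduces the savings arithmetic from the cost tables.
SELECTED_MONTHLY = 27  # Sentry Team plan ($27) + Grafana Loki free tier ($0)
ALTERNATIVES = {"Datadog": (92, 500), "New Relic": (100, 200)}  # (low, high) $/month


def monthly_savings(low: int, high: int) -> tuple:
    """Monthly savings range versus the selected solution."""
    return (low - SELECTED_MONTHLY, high - SELECTED_MONTHLY)


def annual_savings_range(alternatives: dict) -> tuple:
    """Smallest and largest annual savings across all alternatives."""
    savings = [monthly_savings(low, high) for low, high in alternatives.values()]
    return (12 * min(s[0] for s in savings), 12 * max(s[1] for s in savings))
```

The lower bound is 12 x $65 = $780 and the upper bound is 12 x $473 = $5,676, both taken from the Datadog row.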
Implementation Plan
Phase 1: Sentry (Complete ✅)
- ✅ Sentry SDK integrated
- ✅ Error tracking operational
- ✅ Cron monitors active (Elastic Muscle, Upstash)
- ✅ Infrastructure monitoring via custom scripts
- ✅ Webhooks configured
Phase 2: Batch System Integration (Checkpoint 2)
- [ ] Initialize Sentry in batch worker
- [ ] Add transaction tracking for job execution
- [ ] Capture job errors with context
- [ ] Set up webhook handlers for remediation triggers
- [ ] Test self-healing workflows
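The webhook handler step could look roughly like the sketch below. It assumes the incoming webhook carries an HMAC-SHA256 hex signature of the raw request body (Sentry's integration platform describes this for its `Sentry-Hook-Signature` header, but verify against the current documentation). The `failure_kind` field and the job names are hypothetical placeholders, not part of Sentry's payload or of our batch system.

```python
import hashlib
import hmac
import json

# Hypothetical mapping from a failure category to a remediation job name.
REMEDIATION_JOBS = {
    "queue-stalled": "restart-worker",
    "disk-full": "purge-temp-files",
}


def is_valid_signature(body: bytes, signature: str, secret: str) -> bool:
    """Compare the header value with an HMAC-SHA256 hex digest of the raw body."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def choose_remediation(body: bytes, signature: str, secret: str):
    """Return a remediation job name, or None if the request is rejected or unmapped."""
    if not is_valid_signature(body, signature, secret):
        return None
    payload = json.loads(body)
    # "failure_kind" is an ASSUMED custom field, not a documented payload key.
    return REMEDIATION_JOBS.get(payload.get("failure_kind"))
```

Verifying the signature before parsing keeps unauthenticated requests from triggering remediation jobs. Returning `None` for unmapped failures leaves those cases to a human.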
Phase 3: Grafana Loki Setup (Week 1-2)
- [ ] Sign up for Grafana Cloud (free tier)
- [ ] Configure Loki data source
- [ ] Update batch executor to send logs to Loki
- [ ] Create Grafana dashboards for batch jobs
- [ ] Set up log-based alerts
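For the batch executor step, a minimal sketch of building a log batch is shown below. It assumes the JSON body accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`): a list of streams, each with a label set and `[timestamp in nanoseconds, line]` pairs. The label names are hypothetical. Sending the request is omitted because the URL and credentials depend on the Grafana Cloud account.

```python
import json
import time


def build_loki_payload(labels: dict, lines: list, now_ns=None) -> str:
    """Build a JSON push body with one stream; all lines share one timestamp."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return json.dumps({
        "streams": [{
            # Keep labels low-cardinality (app, environment), not per-job IDs.
            "stream": labels,
            # Each entry is [unix time in nanoseconds as a string, log line].
            "values": [[ts, line] for line in lines],
        }]
    })
```

A call such as `build_loki_payload({"app": "batch-executor", "env": "prod"}, ["job started"])` returns the string to POST with a `Content-Type: application/json` header.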
Phase 4: Optimization (Week 3-4)
- [ ] Fine-tune log retention policies
- [ ] Create unified dashboard (Sentry errors + Loki logs)
- [ ] Optimize costs and performance
- [ ] Document logging strategy
Consequences
Positive
- Cost-effective: $27/month vs $100-500/month for competitors
- Best-in-class error tracking: Sentry's core strength
- Comprehensive logging: Loki handles high-volume logs efficiently
- Self-healing capability: Sentry webhooks trigger batch remediation
- No vendor lock-in: Both platforms are open-source
- Scalable: Can grow with our needs
- Developer-friendly: Excellent DX for both platforms
Negative
- Two platforms to manage: Sentry + Loki (vs single platform like Datadog)
- Integration overhead: Need to correlate logs and errors manually
- Setup complexity: Loki requires initial configuration
- Limited infrastructure monitoring: Requires custom scripts
Neutral
- Learning curve: Team needs to learn both platforms
- Maintenance: Two platforms to maintain and monitor
- Future migration: May need to revisit if requirements change significantly
Mitigation Strategies
Two Platforms Management
- Mitigation:
  - Both have excellent APIs for automation
  - Can create unified dashboards in Grafana
  - Minimal operational overhead once configured
Integration Overhead
- Mitigation:
  - Use consistent job IDs across platforms
  - Create correlation dashboards
  - Automate log-error linking via metadata
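One way to keep job IDs consistent is to generate the ID once per job run and attach it to both the log lines and the error report. The sketch below shows the logging half with the standard library only. The class and function names are hypothetical, and attaching the same values as tags on the Sentry side is assumed rather than shown.

```python
import json
import logging
import uuid


class JobJsonFormatter(logging.Formatter):
    """Render each record as one JSON line that carries the job ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Set by passing extra={"job_id": ...} to the logging call.
            "job_id": getattr(record, "job_id", None),
        })


def new_job_context(job_name: str) -> dict:
    """Create metadata to attach to both the log lines and the error report."""
    return {"job_name": job_name, "job_id": uuid.uuid4().hex}
```

With a handler using this formatter, `logger.info("started", extra={"job_id": ctx["job_id"]})` produces a line that can be found by searching for the same ID that appears on the error. Keeping the ID inside the log line rather than in a Loki label avoids creating one stream per job run.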
Setup Complexity
- Mitigation:
  - Grafana Cloud simplifies Loki setup
  - Comprehensive documentation available
  - Can start with basic setup and iterate
Success Metrics
Monitoring Coverage
- ✅ 100% of errors captured in Sentry
- ✅ 100% of batch jobs monitored via cron checks
- ✅ 100% of batch job logs in Loki
- ✅ <5 minute alert latency for critical errors
Cost Efficiency
- ✅ Stay within $30/month budget
- ✅ Maintain <$0.01/GB effective log cost
- ✅ No unexpected cost spikes
Developer Experience
- ✅ <30 seconds to find relevant logs
- ✅ <5 minutes to debug production errors
- ✅ 90%+ developer satisfaction with tools
System Reliability
- ✅ 99.9% uptime for monitoring platforms
- ✅ <1 hour MTTR (Mean Time To Resolution) for batch failures
- ✅ Self-healing success rate >80%
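The MTTR and self-healing targets can be computed from incident records. The sketch below assumes a hypothetical record shape (`detected`, `resolved`, `auto_resolved`). This ADR does not define where such records would be stored.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    total = sum((i["resolved"] - i["detected"] for i in incidents), timedelta())
    return total / len(incidents)


def self_healing_rate(incidents: list) -> float:
    """Share of incidents resolved by automated remediation, as a percentage."""
    healed = sum(1 for i in incidents if i["auto_resolved"])
    return 100 * healed / len(incidents)
```

Both functions expect a non-empty list. The targets above correspond to `mean_time_to_resolution(...) < timedelta(hours=1)` and `self_healing_rate(...) > 80`.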
Review Schedule
This ADR will be reviewed:
- Quarterly: Cost and usage analysis
- Annually: Full platform evaluation
- Ad-hoc: If requirements change significantly
Next Review: 2026-04-18
References
Related ADRs
- ADR-001: Secrets Management (Doppler)
- ADR-002: Directory Refactor
- ADR-004: Batch System Architecture (pending)
Internal Documents
- architecture/SENTRY_BATCH_INTEGRATION.md - Integration strategy
- architecture/SENTRY_CENTRALIZED_LOGGING.md - Logging architecture
- architecture/BATCH_TWO_TIER_ARCHITECTURE.md - Batch system design
- docs/SENTRY_ACTIVATION_COMPLETE.md - Current Sentry setup
Approval
Approved by: Michael Higgins
Date: 2026-01-18
Status: ✅ APPROVED
Signature: Approved via voice command during architecture review session
Changelog
| Date | Change | Author |
|---|---|---|
| 2026-01-18 | Initial ADR created and approved | Antigravity AI |
| 2026-01-18 | Added Grafana Loki as secondary platform | Antigravity AI |
| 2026-01-18 | Finalized cost analysis and implementation plan | Antigravity AI |