
ADR: Observability Platform Selection (Sentry + Grafana Loki)

Status: ✅ APPROVED
Date: 2026-01-18
Decision Makers: Michael Higgins
ADR Number: 003


Context

Singular Dream requires a comprehensive observability solution to support:

  1. Error tracking and exception monitoring
  2. Application performance monitoring (APM)
  3. Cron job and batch worker monitoring
  4. Infrastructure health monitoring
  5. Centralized logging for batch jobs and applications
  6. Self-healing batch system integration

We needed to formally evaluate monitoring platforms and select the optimal solution(s) for our scale, budget, and technical requirements.


Decision

We will use a hybrid observability approach:

Primary Platform: Sentry

  • Purpose: Error tracking, APM, cron monitoring, batch system integration
  • Cost: $27/month (Team plan)
  • Status: Already integrated and operational

Secondary Platform: Grafana Loki

  • Purpose: Centralized logging (high-volume logs)
  • Cost: $0/month (within free tier: 50GB/month)
  • Status: To be implemented

Rationale

Why Sentry for Monitoring

Strengths:

  1. Best-in-class error tracking (10/10)
     • Detailed stack traces with local variables
     • Source code context
     • User impact tracking
     • Error grouping and deduplication

  2. Excellent cron monitoring (10/10)
     • Perfect for batch worker health checks
     • Automatic alerting on missed check-ins
     • API for programmatic setup
     • Historical uptime tracking

  3. Cost-effective ($27/month vs $100-500/month for competitors)
     • 5,000 errors/month included
     • 10,000 transactions/month included
     • Unlimited cron monitors
     • Transparent pricing

  4. Developer-friendly
     • Easy setup and integration
     • Excellent documentation
     • Open-source core (can self-host if needed)
     • Strong community

  5. Batch system integration
     • Webhook support for triggering remediation jobs
     • Transaction tracking for job execution
     • Error capture with rich context
     • Self-healing workflow enablement

Weaknesses:

  • Not optimal for high-volume centralized logging ($0.20/GB, limited retention)
  • Limited infrastructure monitoring (requires custom scripts)


Why Grafana Loki for Logging

Strengths:

  1. Cost-effective for logs
     • FREE for our usage (50GB/month tier)
     • $0.50/GB beyond free tier (vs $0.20/GB for Sentry)
     • No platform fees

  2. Purpose-built for logging
     • Label-based indexing (efficient storage)
     • Powerful LogQL query language
     • Long-term retention (configurable)
     • Advanced log parsing and filtering

  3. Grafana ecosystem integration
     • Unified dashboards with metrics and traces
     • Excellent visualization
     • Alerting on log patterns

  4. Open-source
     • Can self-host if needed
     • No vendor lock-in
     • Active community

Weaknesses:

  • Requires separate platform (not unified with Sentry)
  • Moderate setup complexity
  • No built-in error tracking (use Sentry for this)


Alternatives Considered

Datadog

  • Pros: Full-stack observability, extensive integrations, excellent APM
  • Cons: $92-500/month (too expensive), overkill for our needs
  • Verdict: Rejected due to cost

New Relic

  • Pros: Robust APM, "Errors Inbox", AI-powered insights
  • Cons: $100-200/month (too expensive), steeper learning curve
  • Verdict: Rejected due to cost and complexity

Rollbar

  • Pros: Specialized error tracking, similar pricing to Sentry
  • Cons: No APM, no cron monitoring, fewer features than Sentry
  • Verdict: Rejected - Sentry offers more features

Bugsnag

  • Pros: Error monitoring with stability scores
  • Cons: $59-299/month (more expensive), limited APM
  • Verdict: Rejected due to cost and fewer features

Google Cloud Logging (for logs)

  • Pros: FREE (50GB/month), native GCP integration, enterprise features
  • Cons: Separate from Sentry, less flexible than Loki
  • Verdict: Viable alternative to Loki, but Loki preferred for ecosystem

Sentry Logs Only

  • Pros: Unified platform, correlated with errors
  • Cons: Expensive ($0.20/GB), limited retention, not designed for high-volume
  • Verdict: Rejected for centralized logging (use for error-correlated logs only)

Cost Analysis

Our Estimated Usage

  • Error events: ~1,000/month
  • Transactions: ~10,000/month
  • Cron check-ins: ~4,500/month
  • Logs: ~3GB/month

Selected Solution Cost

| Component                | Monthly Cost |
|--------------------------|--------------|
| Sentry (Team plan)       | $27          |
| Grafana Loki (Free tier) | $0           |
| Total                    | $27/month    |

Rejected Alternatives Cost

| Solution             | Monthly Cost          | Savings  |
|----------------------|-----------------------|----------|
| Datadog              | $92-500               | $65-473  |
| New Relic            | $100-200              | $73-173  |
| Sentry + Sentry Logs | $27 + $0.40 = $27.40  | $0.40    |

Annual Savings: $780-5,676 compared to enterprise solutions
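
The arithmetic behind these figures can be reproduced with a short script. This is a minimal sketch using only the prices and usage estimates quoted in this ADR; the constant and function names are illustrative, not part of any existing codebase.

```python
# Figures are taken from this ADR; names are illustrative.
SENTRY_TEAM_PLAN = 27        # USD/month
LOKI_FREE_TIER_GB = 50       # GB/month included in the free tier
LOKI_OVERAGE_PER_GB = 0.50   # USD/GB beyond the free tier
ESTIMATED_LOGS_GB = 3        # GB/month


def loki_monthly_cost(log_gb: float) -> float:
    """Loki log cost: free up to the tier limit, per-GB pricing beyond it."""
    overage_gb = max(0.0, log_gb - LOKI_FREE_TIER_GB)
    return overage_gb * LOKI_OVERAGE_PER_GB


def annual_savings(competitor_monthly: float, selected_monthly: float) -> float:
    """Yearly difference between a competitor quote and the selected stack."""
    return (competitor_monthly - selected_monthly) * 12


selected = SENTRY_TEAM_PLAN + loki_monthly_cost(ESTIMATED_LOGS_GB)
low = annual_savings(92, selected)    # lowest Datadog quote
high = annual_savings(500, selected)  # highest Datadog quote
print(f"Selected: ${selected:.0f}/month, annual savings: ${low:.0f}-${high:.0f}")
```

At the estimated 3GB/month the log volume sits far below the 50GB free tier, so the Loki term stays at zero until usage grows roughly seventeen-fold.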


Implementation Plan

Phase 1: Sentry (Complete ✅)

  • ✅ Sentry SDK integrated
  • ✅ Error tracking operational
  • ✅ Cron monitors active (Elastic Muscle, Upstash)
  • ✅ Infrastructure monitoring via custom scripts
  • ✅ Webhooks configured

Phase 2: Batch System Integration (Checkpoint 2)

  • [ ] Initialize Sentry in batch worker
  • [ ] Add transaction tracking for job execution
  • [ ] Capture job errors with context
  • [ ] Setup webhook handlers for remediation triggers
  • [ ] Test self-healing workflows
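
The webhook handler item above could start from something like the following. This is a hedged sketch, not the planned implementation: it assumes the webhook body is signed with HMAC-SHA256 and the hex digest arrives in a request header (Sentry's integration webhooks are documented as doing this via `Sentry-Hook-Signature`, which should be confirmed against current Sentry documentation). The `failure_kind` field and the remediation job names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical mapping from a failure category to a remediation job name.
REMEDIATIONS = {
    "stale-lock": "clear_job_locks",
    "queue-stuck": "restart_queue_worker",
}


def verify_signature(body: bytes, secret: str, signature: str) -> bool:
    """Compare the HMAC-SHA256 hex digest of the raw body with the header value."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def pick_remediation(body: bytes, secret: str, signature: str) -> Optional[str]:
    """Return the remediation job to enqueue, or None if nothing should run."""
    if not verify_signature(body, secret, signature):
        return None  # reject unsigned or tampered requests
    payload = json.loads(body)
    # "failure_kind" is an assumed field that our own batch worker would attach.
    return REMEDIATIONS.get(payload.get("failure_kind"))
```

Returning `None` for both bad signatures and unknown failure kinds keeps the handler from triggering remediation on anything it does not explicitly recognize.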

Phase 3: Grafana Loki Setup (Week 1-2)

  • [ ] Sign up for Grafana Cloud (free tier)
  • [ ] Configure Loki data source
  • [ ] Update batch executor to send logs to Loki
  • [ ] Create Grafana dashboards for batch jobs
  • [ ] Setup log-based alerts
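
For the batch executor step, a minimal sketch of sending logs to Loki over its HTTP push endpoint (`/loki/api/v1/push`) might look like this. The label names, URL, user, and token are placeholders, and basic-auth credentials are an assumption about how the Grafana Cloud account will be set up.

```python
import base64
import json
import time
import urllib.request
from typing import Optional


def build_loki_payload(labels: dict, lines: list, now_ns: Optional[int] = None) -> dict:
    """Build one Loki stream: a label set plus [timestamp_ns, line] pairs."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {"streams": [{"stream": labels, "values": [[ts, line] for line in lines]}]}


def push_to_loki(url: str, user: str, token: str, payload: dict) -> int:
    """POST the payload with HTTP basic auth and return the response status."""
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    request = urllib.request.Request(
        url,  # placeholder, e.g. https://<your-loki-host>/loki/api/v1/push
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_loki_payload({"app": "batch-executor", "env": "prod"}, ["job started"])
```

Keeping the label set small (application and environment rather than per-job values) fits Loki's label-based indexing; per-job identifiers are better carried inside the log line itself.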

Phase 4: Optimization (Week 3-4)

  • [ ] Fine-tune log retention policies
  • [ ] Create unified dashboard (Sentry errors + Loki logs)
  • [ ] Optimize costs and performance
  • [ ] Document logging strategy

Consequences

Positive

  1. Cost-effective: $27/month vs $100-500/month for competitors
  2. Best-in-class error tracking: Sentry's core strength
  3. Comprehensive logging: Loki handles high-volume logs efficiently
  4. Self-healing capability: Sentry webhooks trigger batch remediation
  5. No vendor lock-in: Both platforms are open-source
  6. Scalable: Can grow with our needs
  7. Developer-friendly: Excellent DX for both platforms

Negative

  1. Two platforms to manage: Sentry + Loki (vs single platform like Datadog)
  2. Integration overhead: Need to correlate logs and errors manually
  3. Setup complexity: Loki requires initial configuration
  4. Limited infrastructure monitoring: Requires custom scripts

Neutral

  1. Learning curve: Team needs to learn both platforms
  2. Maintenance: Two platforms to maintain and monitor
  3. Future migration: May need to revisit if requirements change significantly

Mitigation Strategies

Two Platforms Management

  • Mitigation:
  • Both have excellent APIs for automation
  • Can create unified dashboards in Grafana
  • Minimal operational overhead once configured

Integration Overhead

  • Mitigation:
  • Use consistent job IDs across platforms
  • Create correlation dashboards
  • Automate log-error linking via metadata
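
One way to apply the consistent-job-ID idea is to generate the ID once per run and attach it to both the log line and the error event. The sketch below is illustrative only: the field names, the `app` label value, and the helper names are assumptions, and the tags dictionary would be handed to whatever Sentry SDK call the batch worker ends up using.

```python
import json
import uuid


def new_job_context(job_name: str) -> dict:
    """Create the shared identifiers for one job run."""
    return {"job_name": job_name, "job_id": uuid.uuid4().hex}


def log_line(ctx: dict, level: str, message: str) -> str:
    """One JSON log line; Loki's `| json` parser can extract job_id from it."""
    return json.dumps({"level": level, "msg": message, **ctx}, sort_keys=True)


def sentry_tags(ctx: dict) -> dict:
    """Tags to attach to the error event so it is searchable by the same ID."""
    return {"job_id": ctx["job_id"], "job_name": ctx["job_name"]}


def logql_for_job(ctx: dict, app_label: str = "batch-executor") -> str:
    """LogQL query that selects every log line belonging to this job run."""
    return f'{{app="{app_label}"}} | json | job_id="{ctx["job_id"]}"'


ctx = new_job_context("nightly-report")
print(log_line(ctx, "info", "job started"))
print(logql_for_job(ctx))
```

With the same `job_id` on both sides, a Sentry error can be followed into Loki by pasting the generated query into Grafana, which removes most of the manual correlation step.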

Setup Complexity

  • Mitigation:
  • Grafana Cloud simplifies Loki setup
  • Comprehensive documentation available
  • Can start with basic setup and iterate

Success Metrics

Monitoring Coverage

  • ✅ 100% of errors captured in Sentry
  • ✅ 100% of batch jobs monitored via cron checks
  • ✅ 100% of batch job logs in Loki
  • ✅ <5 minute alert latency for critical errors

Cost Efficiency

  • ✅ Stay within $30/month budget
  • ✅ Maintain <$0.01/GB effective log cost
  • ✅ No unexpected cost spikes

Developer Experience

  • ✅ <30 seconds to find relevant logs
  • ✅ <5 minutes to debug production errors
  • ✅ 90%+ developer satisfaction with tools

System Reliability

  • ✅ 99.9% uptime for monitoring platforms
  • ✅ <1 hour MTTR (Mean Time To Resolution) for batch failures
  • ✅ Self-healing success rate >80%
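
The MTTR and self-healing targets can be checked from incident records during the quarterly review. This sketch assumes a simple record shape and defines the self-healing success rate as the share of batch failures resolved by automated remediation without human intervention; the ADR does not fix that definition, so treat it as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    failed_at: datetime
    resolved_at: datetime
    self_healed: bool  # True if automated remediation fixed it without a human


def mttr(incidents: list) -> timedelta:
    """Mean time from failure to resolution across all incidents."""
    total = sum((i.resolved_at - i.failed_at for i in incidents), timedelta())
    return total / len(incidents)


def self_healing_rate(incidents: list) -> float:
    """Fraction of incidents resolved by automated remediation."""
    return sum(i.self_healed for i in incidents) / len(incidents)


def meets_targets(incidents: list) -> bool:
    """Targets from this ADR: MTTR under 1 hour, self-healing rate above 80%."""
    return mttr(incidents) < timedelta(hours=1) and self_healing_rate(incidents) > 0.80
```

Both functions expect a non-empty list; a review period with no batch failures would need to be handled separately.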

Review Schedule

This ADR will be reviewed:

  • Quarterly: Cost and usage analysis
  • Annually: Full platform evaluation
  • Ad-hoc: If requirements change significantly

Next Review: 2026-04-18


References

Documentation

  • ADR-001: Secrets Management (Doppler)
  • ADR-002: Directory Refactor
  • ADR-004: Batch System Architecture (pending)

Internal Documents

  • architecture/SENTRY_BATCH_INTEGRATION.md - Integration strategy
  • architecture/SENTRY_CENTRALIZED_LOGGING.md - Logging architecture
  • architecture/BATCH_TWO_TIER_ARCHITECTURE.md - Batch system design
  • docs/SENTRY_ACTIVATION_COMPLETE.md - Current Sentry setup

Approval

Approved by: Michael Higgins
Date: 2026-01-18
Status: ✅ APPROVED

Signature: Approved via voice command during architecture review session


Changelog

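| Date       | Change                                          | Author         |
|------------|-------------------------------------------------|----------------|
| 2026-01-18 | Initial ADR created and approved                | Antigravity AI |
| 2026-01-18 | Added Grafana Loki as secondary platform        | Antigravity AI |
| 2026-01-18 | Finalized cost analysis and implementation plan | Antigravity AI |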