Sentry-Batch System Integration Strategy
Date: 2026-01-18
Status: Architecture Design
Purpose: Define how Sentry monitoring integrates with the batch processing system
Sentry Capabilities (Current State)
1. Monitoring Role
- ✅ Application error tracking (JavaScript, API errors)
- ✅ Performance monitoring
- ✅ Session replays
- ✅ Infrastructure health checks (Elastic Muscle, Upstash)
- ✅ Cron monitors (10-minute intervals)
2. Alerting Role
- ✅ Email alerts on failures
- ⏳ Slack integration (configurable)
- ⏳ GitHub integration (configurable)
- ✅ Threshold-based alerts
3. Triggering Role
- ⏳ Webhook support (can trigger external systems)
- ⏳ API for programmatic access
- ✅ Cron check-ins (passive monitoring)
4. Error Detection
- ✅ Fatal errors: Application crashes
- ✅ Non-fatal errors: Unhandled exceptions
- ✅ Performance issues: Slow transactions
- ✅ Infrastructure failures: VM down, services unavailable
Critical Integration Points
The Synergy
Sentry detects problems → Batch system fixes problems
Example scenarios:
1. Build failure → Sentry alerts → Batch job auto-fixes
2. Production error → Sentry captures → Batch job analyzes and patches
3. Infrastructure down → Sentry detects → Batch job restarts services
4. Performance degradation → Sentry monitors → Batch job optimizes
Integration Architecture
Level 1: Batch Jobs Report to Sentry
Every batch job sends telemetry to Sentry
```typescript
import * as Sentry from '@sentry/node';

class JobExecutor {
  async execute(job: BatchJobRequest): Promise<BatchJobResult> {
    // Start Sentry transaction
    const transaction = Sentry.startTransaction({
      op: 'batch.job',
      name: `Batch Job: ${job.script}`,
      tags: {
        tier: job.tier,
        namespace: job.namespace,
        class: job.class,
        jobId: job.id,
      },
    });

    try {
      // Execute job
      const result = await this.executeScript(job);

      // Report success
      transaction.setStatus('ok');
      transaction.setData('output', result.output);

      return result;
    } catch (error) {
      // Capture error in Sentry
      Sentry.captureException(error, {
        tags: {
          jobId: job.id,
          script: job.script,
          tier: job.tier,
        },
        contexts: {
          job: {
            id: job.id,
            script: job.script,
            class: job.class,
            tier: job.tier,
            namespace: job.namespace,
          },
        },
      });

      transaction.setStatus('internal_error');
      throw error;
    } finally {
      transaction.finish();
    }
  }
}
```
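The executor assumes the worker process has already initialized the Sentry SDK at startup. A minimal initialization sketch (the DSN source, environment mapping, and sample rate are placeholders to adapt to this project's config):

```typescript
import * as Sentry from '@sentry/node';

// Initialize once at worker startup, before any jobs run.
// SENTRY_DSN and BATCH_TIER are placeholder env var names for this sketch.
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.BATCH_TIER === 'prod' ? 'production' : 'development',
  tracesSampleRate: 1.0, // capture every batch.job transaction; lower if job volume grows
});
```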
Benefits:
- Every batch job failure captured in Sentry
- Performance tracking for long-running jobs
- Context-rich error reports
- Unified monitoring dashboard
Level 2: Sentry Triggers Batch Jobs
Sentry webhooks trigger remediation jobs
```typescript
// apps/devops/src/batch/backend/sentry-webhook.ts
interface SentryWebhook {
  action: 'created' | 'resolved' | 'assigned';
  data: {
    issue: {
      id: string;
      title: string;
      level: 'error' | 'warning' | 'info';
      culprit: string;
      metadata: Record<string, any>;
    };
  };
}

export async function handleSentryWebhook(webhook: SentryWebhook) {
  const { issue } = webhook.data;

  // Determine if this error should trigger a batch job
  const remediation = await determineRemediation(issue);

  if (remediation) {
    // Submit batch job to fix the issue
    await batch.submit({
      tier: remediation.tier,
      namespace: 'sentry-remediation',
      script: remediation.script,
      class: remediation.priority,
      metadata: {
        sentryIssueId: issue.id,
        sentryIssueTitle: issue.title,
        triggeredBy: 'sentry-webhook',
      },
    });

    // Add comment to Sentry issue
    await sentry.addComment(
      issue.id,
      `🤖 Automated remediation job submitted: ${remediation.script}`
    );
  }
}

async function determineRemediation(issue: any) {
  // Build failure → Run build-fix job
  if (issue.title.includes('Build failed')) {
    return {
      tier: 'nonprod',
      script: 'fix-build-errors',
      priority: 'B',
    };
  }

  // Database connection error → Restart connection pool
  if (issue.culprit.includes('database')) {
    return {
      tier: 'prod',
      script: 'restart-db-pool',
      priority: 'A',
    };
  }

  // Memory leak → Restart service
  if (issue.metadata.type === 'OutOfMemoryError') {
    return {
      tier: 'prod',
      script: 'restart-service',
      priority: 'A',
    };
  }

  return null; // No automated remediation
}
```
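Receiving these webhooks requires an HTTP endpoint on the batch backend. A sketch of one possible Express handler is below, reusing the `SentryWebhook` interface and `handleSentryWebhook` from above; the signature check assumes Sentry's documented HMAC-SHA256 `sentry-hook-signature` header signed with the integration's client secret, and the route path and env var name are placeholders:

```typescript
import crypto from 'crypto';
import express from 'express';

const app = express();

// Capture the raw body so the signature is verified against exactly what Sentry sent.
app.post('/webhooks/sentry', express.raw({ type: 'application/json' }), async (req, res) => {
  const expected = crypto
    .createHmac('sha256', process.env.SENTRY_WEBHOOK_SECRET!) // integration client secret (placeholder name)
    .update(req.body)
    .digest('hex');

  if (req.header('sentry-hook-signature') !== expected) {
    return res.status(401).send('invalid signature');
  }

  const webhook = JSON.parse(req.body.toString()) as SentryWebhook;
  await handleSentryWebhook(webhook);
  res.status(200).send('ok');
});
```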
Benefits:
- Automatic error remediation
- Self-healing system
- Reduced manual intervention
- Faster incident response
Level 3: Batch System Monitors via Sentry Crons
Batch workers report health via Sentry cron monitors
```typescript
import * as Sentry from '@sentry/node';

class BatchWorker {
  private sentryMonitorSlug: string;

  constructor(private tier: 'prod' | 'nonprod') {
    this.sentryMonitorSlug = `batch-worker-${tier}`;
  }

  async start() {
    // Create Sentry cron monitor
    await this.setupSentryMonitor();

    // Start polling loop
    while (this.running) {
      // Check in with Sentry (start)
      const checkInId = Sentry.captureCheckIn({
        monitorSlug: this.sentryMonitorSlug,
        status: 'in_progress',
      });

      try {
        // Poll and process jobs
        await this.pollQueues();

        // Check in with Sentry (success)
        Sentry.captureCheckIn({
          checkInId,
          monitorSlug: this.sentryMonitorSlug,
          status: 'ok',
        });
      } catch (error) {
        // Check in with Sentry (failure)
        Sentry.captureCheckIn({
          checkInId,
          monitorSlug: this.sentryMonitorSlug,
          status: 'error',
        });
        Sentry.captureException(error);
      }

      await this.sleep(this.config.pollInterval);
    }
  }

  private async setupSentryMonitor() {
    // Auto-create monitor via Sentry API
    await fetch('https://sentry.io/api/0/organizations/singular-dream/monitors/', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.SENTRY_AUTH_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        project: 'javascript-nextjs',
        name: `Batch Worker (${this.tier})`,
        slug: this.sentryMonitorSlug,
        type: 'cron_job',
        config: {
          schedule_type: 'interval',
          schedule: [1, 'minute'], // Check every minute
          timezone: 'America/Chicago',
          checkin_margin: 2,
          max_runtime: 5,
          failure_issue_threshold: 3, // Alert after 3 failures
        },
      }),
    });
  }
}
```
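Recent @sentry/node releases can also upsert the monitor directly from the check-in call, which avoids the manual API request above. A sketch of that variant inside the polling loop (the schedule values mirror the config shown; the exact option names should be verified against the installed SDK version):

```typescript
// Upsert the cron monitor as part of the check-in itself (newer SDK versions).
const checkInId = Sentry.captureCheckIn(
  {
    monitorSlug: this.sentryMonitorSlug,
    status: 'in_progress',
  },
  {
    schedule: { type: 'interval', value: 1, unit: 'minute' },
    checkinMargin: 2, // minutes of grace before a check-in counts as missed
    maxRuntime: 5,    // minutes before an in-progress check-in counts as failed
    timezone: 'America/Chicago',
  }
);
```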
Benefits:
- Worker health monitoring
- Automatic alerts if worker crashes
- Visual dashboard of worker uptime
- Historical health data
Level 4: Production Error → Batch Analysis
Production errors trigger AI analysis jobs
```typescript
// Sentry webhook handler for production errors
export async function handleProductionError(webhook: SentryWebhook) {
  const { issue } = webhook.data;

  // Only process production errors
  if (issue.metadata.environment !== 'production') return;

  // Submit AI analysis job
  await batch.submit({
    tier: 'prod',
    namespace: 'error-analysis',
    script: 'analyze-production-error',
    class: 'B', // High priority but not critical
    args: [
      `--sentry-issue-id=${issue.id}`,
      `--error-type=${issue.metadata.type}`,
      `--culprit=${issue.culprit}`,
    ],
    metadata: {
      sentryIssueId: issue.id,
      errorTitle: issue.title,
      firstSeen: issue.metadata.firstSeen,
    },
  });
}
```
Analysis script (scripts/analyze-production-error.ts):
```typescript
// Uses M2 AI to analyze error and suggest fixes
import { M2Client } from '@sd/devops';
import * as Sentry from '@sentry/node';

async function analyzeProductionError(sentryIssueId: string) {
  // Fetch error details from Sentry
  const issue = await sentry.getIssue(sentryIssueId);
  const events = await sentry.getEvents(sentryIssueId);

  // Use M2 AI to analyze
  const analysis = await m2.analyze({
    error: issue.title,
    stackTrace: events[0].stacktrace,
    context: events[0].context,
    frequency: issue.count,
    affectedUsers: issue.userCount,
  });

  // Post analysis back to Sentry
  await sentry.addComment(sentryIssueId, `
## 🤖 AI Analysis

**Root Cause**: ${analysis.rootCause}

**Suggested Fix**:
\`\`\`${analysis.language}
${analysis.suggestedFix}
\`\`\`

**Impact**: ${analysis.impact}
**Urgency**: ${analysis.urgency}
**Confidence**: ${analysis.confidence}%
`);

  // If high confidence, create fix PR
  if (analysis.confidence > 80 && analysis.canAutoFix) {
    await batch.submit({
      tier: 'nonprod',
      namespace: 'auto-fix',
      script: 'apply-ai-fix',
      class: 'C',
      args: [`--sentry-issue-id=${sentryIssueId}`],
    });
  }
}
```
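The script relies on `sentry` and `m2` client instances created elsewhere. A minimal sketch of the Sentry side using the REST issue/events endpoints is below (newer API versions scope these under the organization; `addComment` and the M2 client are left to the project's own wrappers, since those paths are not shown here):

```typescript
// Minimal Sentry REST wrapper used by the analysis script (sketch).
const SENTRY_API = 'https://sentry.io/api/0';

const headers = {
  Authorization: `Bearer ${process.env.SENTRY_AUTH_TOKEN}`,
};

export const sentry = {
  async getIssue(issueId: string) {
    // Issue payload includes title, count, userCount, metadata, ...
    const res = await fetch(`${SENTRY_API}/issues/${issueId}/`, { headers });
    return res.json();
  },

  async getEvents(issueId: string) {
    // Latest events for the issue, including stack traces and context
    const res = await fetch(`${SENTRY_API}/issues/${issueId}/events/`, { headers });
    return res.json();
  },
};
```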
Benefits:
- AI-powered error analysis
- Automated fix suggestions
- Reduced time to resolution
- Learning system (improves over time)
Level 5: Build Pipeline Integration
Build failures trigger batch remediation
```typescript
// Sentry captures build errors
Sentry.captureException(new Error('Build failed'), {
  tags: {
    pipeline: 'ci-cd',
    stage: 'build',
    branch: 'main',
  },
  contexts: {
    build: {
      exitCode: 1,
      errors: buildErrors,
      warnings: buildWarnings,
    },
  },
});

// Webhook triggers batch job
await batch.submit({
  tier: 'nonprod',
  namespace: 'ci-cd',
  script: 'fix-build-errors',
  class: 'B',
  metadata: {
    branch: 'main',
    commit: commitSha,
    buildId: buildId,
  },
});
```
Fix script (scripts/fix-build-errors.ts):
```typescript
// M2 AI analyzes build errors and applies fixes
async function fixBuildErrors() {
  // Get build errors from Sentry
  const errors = await getBuildErrors();

  // Use M2 to analyze and fix
  for (const error of errors) {
    const fix = await m2.fixBuildError(error);

    if (fix.canAutoFix) {
      await applyFix(fix);

      // Re-run the build and stop once it passes
      const buildPassed = await runBuild();
      if (buildPassed) {
        await commitFix(fix);
        break; // Success!
      }
    }
  }
}
```
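`runBuild()` and the other helpers are project-specific. As an illustration of the build step only, a sketch using Node's child_process (`pnpm build` is a placeholder for whatever the pipeline actually runs):

```typescript
import { spawnSync } from 'child_process';

// Re-run the build and report whether it exited cleanly.
async function runBuild(): Promise<boolean> {
  const result = spawnSync('pnpm', ['build'], { stdio: 'inherit' });
  return result.status === 0;
}
```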
Monitoring Dashboard Integration
Unified View
Single dashboard showing:
1. Sentry Issues - Application errors
2. Batch Jobs - Remediation status
3. Worker Health - Batch worker uptime
4. Infrastructure - Elastic Muscle, Upstash
Implementation:
```jsonc
// Custom Sentry dashboard widget
{
  "title": "Batch System Health",
  "displayType": "big_number",
  "queries": [
    {
      "name": "Active Jobs",
      "fields": ["count()"],
      "conditions": "transaction.op:batch.job status:ok"
    },
    {
      "name": "Failed Jobs",
      "fields": ["count()"],
      "conditions": "transaction.op:batch.job status:error"
    },
    {
      "name": "Worker Uptime",
      "fields": ["avg(duration)"],
      "conditions": "monitor.slug:batch-worker-*"
    }
  ]
}
```
Alert Routing Strategy
Tier-Based Routing
Production tier:
- Sentry alerts → Slack #production-alerts
- Critical batch job failures → PagerDuty
- All production errors → Sentry dashboard

Non-production tier:
- Sentry alerts → Slack #dev-alerts
- Batch job failures → Email only
- Build failures → GitHub PR comments
Implementation:
```typescript
// Sentry alert rule configuration
const alertRules = {
  production: {
    conditions: [
      {
        id: 'sentry.rules.conditions.event_attribute.EventAttributeCondition',
        attribute: 'environment',
        match: 'eq',
        value: 'production',
      },
    ],
    actions: [
      {
        id: 'sentry.integrations.slack.notify_action.SlackNotifyServiceAction',
        channel: '#production-alerts',
      },
      {
        id: 'sentry.integrations.pagerduty.notify_action.PagerDutyNotifyServiceAction',
        severity: 'critical',
      },
    ],
  },
  nonproduction: {
    conditions: [
      {
        id: 'sentry.rules.conditions.event_attribute.EventAttributeCondition',
        attribute: 'environment',
        match: 'ne',
        value: 'production',
      },
    ],
    actions: [
      {
        id: 'sentry.mail.actions.NotifyEmailAction',
        targetType: 'IssueOwners',
      },
    ],
  },
};
```
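These rules can be created in the Sentry UI or pushed programmatically. A sketch of the API call, assuming the project-scoped issue alert rules endpoint and reusing the org/project slugs from the monitor setup above (verify the required payload fields against current docs):

```typescript
// Create an issue alert rule via the Sentry API (sketch).
async function createAlertRule(rule: Record<string, unknown>) {
  await fetch('https://sentry.io/api/0/projects/singular-dream/javascript-nextjs/rules/', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.SENTRY_AUTH_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(rule),
  });
}

await createAlertRule({
  name: 'Production errors',
  actionMatch: 'any',
  frequency: 30, // minutes between repeat notifications
  ...alertRules.production,
});
```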
Self-Healing Workflows
Example: Database Connection Pool Exhaustion
1. Sentry detects error:
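A hypothetical payload shape, matching the webhook interface from Level 2; the `database` substring in the culprit is what `determineRemediation()` keys on:

```typescript
// Illustrative only: a pool-exhaustion issue as it might arrive at handleSentryWebhook.
const exampleIssue = {
  id: '1234567890',
  title: 'Error: connection pool exhausted - no connections available',
  level: 'error',
  culprit: 'packages/db/pool.acquire (database)',
  metadata: { type: 'Error', environment: 'production' },
};
```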
2. Webhook triggers batch job:
```typescript
await batch.submit({
  tier: 'prod',
  namespace: 'self-healing',
  script: 'restart-db-pool',
  class: 'A', // Critical
});
```
3. Batch job executes:
```typescript
// scripts/restart-db-pool.ts
async function restartDbPool() {
  // Gracefully drain connections
  await db.drain();

  // Restart pool
  await db.restart();

  // Verify health
  await db.healthCheck();

  // Report to Sentry
  Sentry.captureMessage('Database pool restarted successfully', 'info');
}
```
4. Sentry confirms resolution:
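One way to close the loop is to mark the originating issue resolved once the health check passes. A sketch, assuming the standard issue-update endpoint (newer API versions scope it under the organization) and that the issue ID is carried in the job's metadata:

```typescript
// Mark the originating Sentry issue as resolved after successful remediation.
// Assumes SENTRY_AUTH_TOKEN has permission to update issues.
async function resolveSentryIssue(issueId: string) {
  await fetch(`https://sentry.io/api/0/issues/${issueId}/`, {
    method: 'PUT',
    headers: {
      'Authorization': `Bearer ${process.env.SENTRY_AUTH_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ status: 'resolved' }),
  });
}
```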
Implementation Roadmap
Phase 1: Basic Integration (Checkpoint 2)
- [ ] Add Sentry SDK to batch worker
- [ ] Capture batch job errors in Sentry
- [ ] Create cron monitors for workers
- [ ] Test error reporting
Phase 2: Webhook Integration (Week 1)
- [ ] Setup Sentry webhook endpoint
- [ ] Implement remediation logic
- [ ] Test production error → batch job flow
- [ ] Configure alert routing
Phase 3: Self-Healing (Week 2)
- [ ] Implement common remediation scripts
- [ ] Add AI analysis for production errors
- [ ] Setup auto-fix workflows
- [ ] Test end-to-end self-healing
Phase 4: Advanced Monitoring (Week 3)
- [ ] Custom Sentry dashboard
- [ ] Performance tracking
- [ ] Cost optimization
- [ ] Documentation
Benefits Summary
For Operations
- Reduced MTTR (Mean Time To Resolution)
  - Errors detected instantly
  - Automated remediation
  - Faster incident response
- Proactive Monitoring
  - Worker health tracking
  - Infrastructure monitoring
  - Predictive alerts
- Unified Dashboard
  - Single pane of glass
  - Application + Infrastructure
  - Historical trends
For Development
- Automated Error Analysis
  - AI-powered root cause analysis
  - Suggested fixes
  - Learning system
- Build Pipeline Integration
  - Auto-fix build errors
  - Faster CI/CD
  - Less manual intervention
- Better Debugging
  - Context-rich errors
  - Session replays
  - Performance traces
For Business
- Higher Uptime
  - Self-healing system
  - Faster recovery
  - Reduced downtime
- Lower Costs
  - Less manual intervention
  - Automated remediation
  - Efficient resource use
- Better Insights
  - Error trends
  - Performance metrics
  - System health
Conclusion
Sentry + Batch System = Self-Healing Infrastructure
The integration creates a powerful feedback loop:
1. Sentry detects problems
2. Batch system fixes problems
3. Sentry confirms resolution
4. System learns from patterns
This transforms reactive monitoring into proactive self-healing, dramatically reducing operational burden and improving system reliability.
Next Step: Implement Phase 1 (Basic Integration) in Checkpoint 2.