Sentry-Batch System Integration Strategy

Date: 2026-01-18
Status: Architecture Design
Purpose: Define how Sentry monitoring integrates with the batch processing system

Sentry Capabilities (Current State)

1. Monitoring Role

✅ Application error tracking (JavaScript, API errors)
✅ Performance monitoring
✅ Session replays
✅ Infrastructure health checks (Elastic Muscle, Upstash)
✅ Cron monitors (10-minute intervals)

2. Alerting Role

✅ Email alerts on failures
⏳ Slack integration (configurable)
⏳ GitHub integration (configurable)
✅ Threshold-based alerts

3. Triggering Role

⏳ Webhook support (can trigger external systems)
⏳ API for programmatic access
✅ Cron check-ins (passive monitoring)

4. Error Detection

✅ Fatal errors: Application crashes
✅ Non-fatal errors: Unhandled exceptions
✅ Performance issues: Slow transactions
✅ Infrastructure failures: VM down, services unavailable

Critical Integration Points

The Synergy

Sentry detects problems → Batch system fixes problems

Example scenarios:

1. Build failure → Sentry alerts → Batch job auto-fixes
2. Production error → Sentry captures → Batch job analyzes and patches
3. Infrastructure down → Sentry detects → Batch job restarts services
4. Performance degradation → Sentry monitors → Batch job optimizes

Integration Architecture

Level 1: Batch Jobs Report to Sentry

Every batch job sends telemetry to Sentry

import * as Sentry from '@sentry/node';

class JobExecutor {
  async execute(job: BatchJobRequest): Promise<BatchJobResult> {
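    // Start Sentry transaction. Note: startTransaction() is the SDK v7 API;
    // it was removed in v8, where Sentry.startSpan() replaces it.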
    const transaction = Sentry.startTransaction({
      op: 'batch.job',
      name: `Batch Job: ${job.script}`,
      tags: {
        tier: job.tier,
        namespace: job.namespace,
        class: job.class,
        jobId: job.id,
      },
    });

    try {
      // Execute job
      const result = await this.executeScript(job);

      // Report success
      transaction.setStatus('ok');
      transaction.setData('output', result.output);

      return result;
    } catch (error) {
      // Capture error in Sentry
      Sentry.captureException(error, {
        tags: {
          jobId: job.id,
          script: job.script,
          tier: job.tier,
        },
        contexts: {
          job: {
            id: job.id,
            script: job.script,
            class: job.class,
            tier: job.tier,
            namespace: job.namespace,
          },
        },
      });

      transaction.setStatus('internal_error');
      throw error;
    } finally {
      transaction.finish();
    }
  }
}

Benefits:
- Every batch job failure captured in Sentry
- Performance tracking for long-running jobs
- Context-rich error reports
- Unified monitoring dashboard
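The executor above repeats the same job fields in the transaction tags, the error tags, and the error context. A minimal sketch of one way to keep them in sync follows. The `BatchJobRequest` shape is inferred from the fields the executor reads, and `buildJobScope` is a hypothetical helper, not part of the existing codebase.

```typescript
// Hypothetical shape, inferred from the fields used by JobExecutor above.
interface BatchJobRequest {
  id: string;
  script: string;
  class: string;
  tier: 'prod' | 'nonprod';
  namespace: string;
}

interface JobScope {
  tags: Record<string, string>;
  contexts: { job: Record<string, string> };
}

// Builds the tags and the job context once, so every Sentry call
// for a given job reports the same fields.
function buildJobScope(job: BatchJobRequest): JobScope {
  return {
    tags: { jobId: job.id, script: job.script, tier: job.tier },
    contexts: {
      job: {
        id: job.id,
        script: job.script,
        class: job.class,
        tier: job.tier,
        namespace: job.namespace,
      },
    },
  };
}
```

With this in place, the catch block could shrink to `Sentry.captureException(error, buildJobScope(job))`, since the returned object has the same `tags` and `contexts` layout the executor already passes.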

Level 2: Sentry Triggers Batch Jobs

Sentry webhooks trigger remediation jobs

// apps/devops/src/batch/backend/sentry-webhook.ts

interface SentryWebhook {
  action: 'created' | 'resolved' | 'assigned';
  data: {
    issue: {
      id: string;
      title: string;
      level: 'fatal' | 'error' | 'warning' | 'info' | 'debug';
      culprit: string;
      metadata: Record<string, any>;
    };
  };
}

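export async function handleSentryWebhook(webhook: SentryWebhook) {
  // Only newly created issues should trigger remediation; without this guard,
  // 'resolved' and 'assigned' events would submit jobs as well
  if (webhook.action !== 'created') return;

  const { issue } = webhook.data;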

  // Determine if this error should trigger a batch job
  const remediation = await determineRemediation(issue);

  if (remediation) {
    // Submit batch job to fix the issue
    await batch.submit({
      tier: remediation.tier,
      namespace: 'sentry-remediation',
      script: remediation.script,
      class: remediation.priority,
      metadata: {
        sentryIssueId: issue.id,
        sentryIssueTitle: issue.title,
        triggeredBy: 'sentry-webhook',
      },
    });

    // Add comment to Sentry issue
    await sentry.addComment(issue.id, 
      `🤖 Automated remediation job submitted: ${remediation.script}`
    );
  }
}

async function determineRemediation(issue: SentryWebhook['data']['issue']) {
  // Build failure → Run build-fix job
  if (issue.title.includes('Build failed')) {
    return {
      tier: 'nonprod',
      script: 'fix-build-errors',
      priority: 'B',
    };
  }

  // Database connection error → Restart connection pool
  if (issue.culprit.includes('database')) {
    return {
      tier: 'prod',
      script: 'restart-db-pool',
      priority: 'A',
    };
  }

  // Memory leak → Restart service
  if (issue.metadata.type === 'OutOfMemoryError') {
    return {
      tier: 'prod',
      script: 'restart-service',
      priority: 'A',
    };
  }

  return null; // No automated remediation
}

Benefits:
- Automatic error remediation
- Self-healing system
- Reduced manual intervention
- Faster incident response
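Because this endpoint can submit production jobs, it should reject requests that do not come from Sentry. According to Sentry's integration platform documentation, webhook requests carry a `Sentry-Hook-Signature` header containing the hex HMAC-SHA256 of the request body, keyed with the integration's client secret; this is worth confirming for the integration type in use. A minimal sketch of the check, where `verifySentrySignature` is a hypothetical helper name:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Returns true when `signature` is the hex HMAC-SHA256 of `rawBody` under `secret`.
// `rawBody` must be the exact request body string that was signed.
function verifySentrySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex');
  const expectedBytes = Buffer.from(expected, 'utf8');
  const receivedBytes = Buffer.from(signature, 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    expectedBytes.length === receivedBytes.length &&
    timingSafeEqual(expectedBytes, receivedBytes)
  );
}
```

The HTTP handler would call this before `handleSentryWebhook` and respond with 401 on failure. The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.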

Level 3: Batch System Monitors via Sentry Crons

Batch workers report health via Sentry cron monitors

import * as Sentry from '@sentry/node';

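class BatchWorker {
  private sentryMonitorSlug: string;
  private tier: 'prod' | 'nonprod';

  constructor(tier: 'prod' | 'nonprod') {
    // Keep the tier: setupSentryMonitor() uses it for the monitor name
    this.tier = tier;
    this.sentryMonitorSlug = `batch-worker-${tier}`;
  }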

  async start() {
    // Create Sentry cron monitor
    await this.setupSentryMonitor();

    // Start polling loop
    while (this.running) {
      // Check in with Sentry (start)
      const checkInId = Sentry.captureCheckIn({
        monitorSlug: this.sentryMonitorSlug,
        status: 'in_progress',
      });

      try {
        // Poll and process jobs
        await this.pollQueues();

        // Check in with Sentry (success)
        Sentry.captureCheckIn({
          checkInId,
          monitorSlug: this.sentryMonitorSlug,
          status: 'ok',
        });
      } catch (error) {
        // Check in with Sentry (failure)
        Sentry.captureCheckIn({
          checkInId,
          monitorSlug: this.sentryMonitorSlug,
          status: 'error',
        });

        Sentry.captureException(error);
      }

      await this.sleep(this.config.pollInterval);
    }
  }

  private async setupSentryMonitor() {
    // Auto-create monitor via Sentry API
    const response = await fetch('https://sentry.io/api/0/organizations/singular-dream/monitors/', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.SENTRY_AUTH_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        project: 'javascript-nextjs',
        name: `Batch Worker (${this.tier})`,
        slug: this.sentryMonitorSlug,
        type: 'cron_job',
        config: {
          schedule_type: 'interval',
          schedule: [1, 'minute'], // Check every minute
          timezone: 'America/Chicago',
          checkin_margin: 2,
          max_runtime: 5,
          failure_issue_threshold: 3, // Alert after 3 failures
        },
      }),
    });

    // This runs on every start, so a non-2xx response (for example, the
    // monitor already exists) is logged rather than treated as fatal
    if (!response.ok) {
      console.warn(`Sentry monitor setup returned HTTP ${response.status}`);
    }
  }
}

Benefits:
- Worker health monitoring
- Automatic alerts if worker crashes
- Visual dashboard of worker uptime
- Historical health data
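The check-in sequence in the polling loop (in_progress, then ok or error) can be factored into a small wrapper so that other scheduled tasks reuse it. The sketch below takes the capture function as a parameter, which keeps it testable without a Sentry client. `withCheckIn` and the local types are hypothetical names; the `CheckIn` union is modeled on the arguments passed to `Sentry.captureCheckIn` above.

```typescript
// Modeled on the check-in payloads used in the worker above.
type CheckIn =
  | { monitorSlug: string; status: 'in_progress' }
  | { monitorSlug: string; status: 'ok' | 'error'; checkInId: string };

// Same call shape as Sentry.captureCheckIn as used above: returns the check-in ID.
type CaptureCheckIn = (checkIn: CheckIn) => string;

// Runs `work` between an in_progress check-in and a closing ok/error check-in.
// Errors are rethrown so the caller decides how to report them.
async function withCheckIn<T>(
  capture: CaptureCheckIn,
  monitorSlug: string,
  work: () => Promise<T>,
): Promise<T> {
  const checkInId = capture({ monitorSlug, status: 'in_progress' });
  try {
    const result = await work();
    capture({ checkInId, monitorSlug, status: 'ok' });
    return result;
  } catch (error) {
    capture({ checkInId, monitorSlug, status: 'error' });
    throw error;
  }
}
```

In the worker, the loop body could then become something like `await withCheckIn(Sentry.captureCheckIn, this.sentryMonitorSlug, () => this.pollQueues())`, wrapped in a try/catch that calls `Sentry.captureException` so one failed poll does not end the loop.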

Level 4: Production Error → Batch Analysis

Production errors trigger AI analysis jobs

// Sentry webhook handler for production errors
export async function handleProductionError(webhook: SentryWebhook) {
  const { issue } = webhook.data;

  // Only process production errors
  if (issue.metadata.environment !== 'production') return;

  // Submit AI analysis job
  await batch.submit({
    tier: 'prod',
    namespace: 'error-analysis',
    script: 'analyze-production-error',
    class: 'B', // High priority but not critical
    args: [
      `--sentry-issue-id=${issue.id}`,
      `--error-type=${issue.metadata.type}`,
      `--culprit=${issue.culprit}`,
    ],
    metadata: {
      sentryIssueId: issue.id,
      errorTitle: issue.title,
      firstSeen: issue.metadata.firstSeen,
    },
  });
}

Analysis script (scripts/analyze-production-error.ts):

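// Uses M2 AI to analyze error and suggest fixes.
// `m2` (an M2Client instance), `sentry` (a Sentry API client), and `batch`
// are assumed to be created elsewhere; their setup is not shown here.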
import { M2Client } from '@sd/devops';
import * as Sentry from '@sentry/node';

async function analyzeProductionError(sentryIssueId: string) {
  // Fetch error details from Sentry
  const issue = await sentry.getIssue(sentryIssueId);
  const events = await sentry.getEvents(sentryIssueId);
  if (events.length === 0) return; // No events to analyze yet

  // Use M2 AI to analyze
  const analysis = await m2.analyze({
    error: issue.title,
    stackTrace: events[0].stacktrace,
    context: events[0].context,
    frequency: issue.count,
    affectedUsers: issue.userCount,
  });

  // Post analysis back to Sentry
  await sentry.addComment(sentryIssueId, `
## 🤖 AI Analysis

**Root Cause**: ${analysis.rootCause}

**Suggested Fix**:
\`\`\`${analysis.language}
${analysis.suggestedFix}
\`\`\`

**Impact**: ${analysis.impact}
**Urgency**: ${analysis.urgency}
**Confidence**: ${analysis.confidence}%
  `);

  // If high confidence, create fix PR
  if (analysis.confidence > 80 && analysis.canAutoFix) {
    await batch.submit({
      tier: 'nonprod',
      namespace: 'auto-fix',
      script: 'apply-ai-fix',
      class: 'C',
      args: [`--sentry-issue-id=${sentryIssueId}`],
    });
  }
}

Benefits:
- AI-powered error analysis
- Automated fix suggestions
- Reduced time to resolution
- Learning system (improves over time)
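The webhook handler passes the issue ID as `--sentry-issue-id=...`, while `analyzeProductionError` takes it as a plain argument, so the script needs a small entry point that reads the `--key=value` arguments. A minimal sketch, assuming the batch system forwards `args` to the script unchanged; `parseJobArgs` is a hypothetical helper:

```typescript
// Parses `--key=value` arguments into an object.
// Arguments in any other form (positional values, bare flags) are ignored.
function parseJobArgs(argv: string[]): Record<string, string> {
  const parsed: Record<string, string> = {};
  for (const arg of argv) {
    if (arg.slice(0, 2) !== '--') continue;
    const eq = arg.indexOf('=');
    if (eq <= 2) continue; // no '=' at all, or an empty key such as `--=x`
    // Split on the first '=' only, so values may contain '=' themselves.
    parsed[arg.slice(2, eq)] = arg.slice(eq + 1);
  }
  return parsed;
}
```

The script's entry point could then read `parseJobArgs(process.argv.slice(2))['sentry-issue-id']` and exit with an error if the value is missing, rather than calling the Sentry API with `undefined`.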

Level 5: Build Pipeline Integration

Build failures trigger batch remediation

// Sentry captures build errors
Sentry.captureException(new Error('Build failed'), {
  tags: {
    pipeline: 'ci-cd',
    stage: 'build',
    branch: 'main',
  },
  contexts: {
    build: {
      exitCode: 1,
      errors: buildErrors,
      warnings: buildWarnings,
    },
  },
});

// Webhook triggers batch job
await batch.submit({
  tier: 'nonprod',
  namespace: 'ci-cd',
  script: 'fix-build-errors',
  class: 'B',
  metadata: {
    branch: 'main',
    commit: commitSha,
    buildId: buildId,
  },
});

Fix script (scripts/fix-build-errors.ts):

// M2 AI analyzes build errors and applies fixes
async function fixBuildErrors() {
  // Get build errors from Sentry
  const errors = await getBuildErrors();

  // Use M2 to analyze and fix
  for (const error of errors) {
    const fix = await m2.fixBuildError(error);

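    if (fix.canAutoFix) {
      await applyFix(fix);
      await runBuild(); // Re-run build

      // Await the check: if buildSucceeds() is async, an un-awaited call is always truthy
      if (await buildSucceeds()) {
        await commitFix(fix);
        break; // Success!
      }

      // revertFix is a hypothetical helper: undo a fix that did not repair
      // the build so it is not carried into a later commit
      await revertFix(fix);
    }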
  }
}

Monitoring Dashboard Integration

Unified View

Single dashboard showing: 1. Sentry Issues - Application errors 2. Batch Jobs - Remediation status 3. Worker Health - Batch worker uptime 4. Infrastructure - Elastic Muscle, Upstash

Implementation:

// Custom Sentry dashboard widget
{
  "title": "Batch System Health",
  "displayType": "big_number",
  "queries": [
    {
      "name": "Succeeded Jobs",
      "fields": ["count()"],
      "conditions": "transaction.op:batch.job transaction.status:ok"
    },
    {
      "name": "Failed Jobs",
      "fields": ["count()"],
      "conditions": "transaction.op:batch.job transaction.status:internal_error"
    },
    {
      "name": "Worker Uptime",
      "fields": ["avg(duration)"],
      "conditions": "monitor.slug:batch-worker-*"
    }
  ]
}

Alert Routing Strategy

Tier-Based Routing

Production tier:
- Sentry alerts → Slack #production-alerts
- Critical batch job failures → PagerDuty
- All production errors → Sentry dashboard

Non-production tier:
- Sentry alerts → Slack #dev-alerts
- Batch job failures → Email only
- Build failures → GitHub PR comments

Implementation:

// Sentry alert rule configuration
const alertRules = {
  production: {
    conditions: [
      {
        id: 'sentry.rules.conditions.event_attribute.EventAttributeCondition',
        attribute: 'environment',
        match: 'eq',
        value: 'production',
      },
    ],
    actions: [
      {
        id: 'sentry.integrations.slack.notify_action.SlackNotifyServiceAction',
        channel: '#production-alerts',
      },
      {
        id: 'sentry.integrations.pagerduty.notify_action.PagerDutyNotifyServiceAction',
        severity: 'critical',
      },
    ],
  },
  nonproduction: {
    conditions: [
      {
        id: 'sentry.rules.conditions.event_attribute.EventAttributeCondition',
        attribute: 'environment',
        match: 'ne',
        value: 'production',
      },
    ],
    actions: [
      {
        id: 'sentry.mail.actions.NotifyEmailAction',
        targetType: 'IssueOwners',
      },
    ],
  },
};
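Sentry alert rules cover only the alerts Sentry sends itself. The batch system still has to decide where its own failure notifications go. A minimal sketch of the tier-based routing table as a pure function follows; the type names, the destination strings, and `routeAlert` are hypothetical. Combinations the routing list does not mention, such as non-critical production batch failures, return an empty list here rather than a guessed destination.

```typescript
type Tier = 'prod' | 'nonprod';
type AlertKind = 'sentry-alert' | 'batch-job-failure' | 'build-failure';

// Returns the destinations for an alert, following the tier-based routing list.
// `critical` only matters for production batch job failures.
function routeAlert(tier: Tier, kind: AlertKind, critical: boolean = false): string[] {
  if (tier === 'prod') {
    if (kind === 'sentry-alert') return ['slack:#production-alerts'];
    if (kind === 'batch-job-failure' && critical) return ['pagerduty'];
    return []; // not specified by the routing list
  }
  if (kind === 'sentry-alert') return ['slack:#dev-alerts'];
  if (kind === 'batch-job-failure') return ['email'];
  return ['github-pr-comment']; // non-production build failure
}
```

Keeping the table in one function makes the unspecified cases visible, which is a useful prompt for deciding them before Phase 2.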

Self-Healing Workflows

Example: Database Connection Pool Exhaustion

1. Sentry detects error:

Error: Connection pool exhausted
Environment: production
Frequency: 50 events/minute

2. Webhook triggers batch job:

await batch.submit({
  tier: 'prod',
  namespace: 'self-healing',
  script: 'restart-db-pool',
  class: 'A', // Critical
});

3. Batch job executes:

// scripts/restart-db-pool.ts
async function restartDbPool() {
  // Gracefully drain connections
  await db.drain();

  // Restart pool
  await db.restart();

  // Verify health
  await db.healthCheck();

  // Report to Sentry
  Sentry.captureMessage('Database pool restarted successfully', 'info');
}

4. Sentry confirms resolution:

✅ Issue auto-resolved
📊 Downtime: 45 seconds
🤖 Remediated by: batch-job-12345
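A workflow like this can retrigger itself: if restarting the pool does not clear the error, every new Sentry event could submit another restart. One way to bound that is a per-remediation cooldown checked before `batch.submit`. The sketch below is in-memory and single-process, which is an assumption; with several webhook handlers, the timestamps would need a shared store. `RemediationCooldown` is a hypothetical name.

```typescript
// Allows each remediation key to run at most once per cooldown window.
class RemediationCooldown {
  private lastRun: Map<string, number> = new Map();
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true and records the attempt if `key` is outside its cooldown
  // window; returns false, recording nothing, if it is still cooling down.
  tryAcquire(key: string, now: number = Date.now()): boolean {
    const last = this.lastRun.get(key);
    if (last !== undefined && now - last < this.cooldownMs) {
      return false;
    }
    this.lastRun.set(key, now);
    return true;
  }
}
```

The webhook handler could call `cooldown.tryAcquire(remediation.script)` and, on `false`, skip the job and leave the issue for a human, since a remediation that has to run repeatedly is itself a signal worth escalating.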

Implementation Roadmap

Phase 1: Basic Integration (Checkpoint 2)

[ ] Add Sentry SDK to batch worker
[ ] Capture batch job errors in Sentry
[ ] Create cron monitors for workers
[ ] Test error reporting

Phase 2: Webhook Integration (Week 1)

[ ] Setup Sentry webhook endpoint
[ ] Implement remediation logic
[ ] Test production error → batch job flow
[ ] Configure alert routing

Phase 3: Self-Healing (Week 2)

[ ] Implement common remediation scripts
[ ] Add AI analysis for production errors
[ ] Setup auto-fix workflows
[ ] Test end-to-end self-healing

Phase 4: Advanced Monitoring (Week 3)

[ ] Custom Sentry dashboard
[ ] Performance tracking
[ ] Cost optimization
[ ] Documentation

Benefits Summary

For Operations

- Reduced MTTR (Mean Time To Resolution)
  - Errors detected instantly
  - Automated remediation
  - Faster incident response
- Proactive Monitoring
  - Worker health tracking
  - Infrastructure monitoring
  - Predictive alerts
- Unified Dashboard
  - Single pane of glass
  - Application + Infrastructure
  - Historical trends

For Development

- Automated Error Analysis
  - AI-powered root cause analysis
  - Suggested fixes
  - Learning system
- Build Pipeline Integration
  - Auto-fix build errors
  - Faster CI/CD
  - Less manual intervention
- Better Debugging
  - Context-rich errors
  - Session replays
  - Performance traces

For Business

- Higher Uptime
  - Self-healing system
  - Faster recovery
  - Reduced downtime
- Lower Costs
  - Less manual intervention
  - Automated remediation
  - Efficient resource use
- Better Insights
  - Error trends
  - Performance metrics
  - System health

Conclusion

Sentry + Batch System = Self-Healing Infrastructure

The integration creates a powerful feedback loop:

1. Sentry detects problems
2. Batch system fixes problems
3. Sentry confirms resolution
4. System learns from patterns

This turns reactive monitoring into proactive self-healing: fewer incidents need manual intervention, and recovery starts as soon as Sentry detects the problem.

Next Step: Implement Phase 1 (Basic Integration) in Checkpoint 2.