Batch System Architecture - Two-Tier Strategy (APPROVED)

Date: 2026-01-18
Status: ✅ APPROVED - Production/Non-Production Split
Decision: Two-tier batch system with flexible namespacing

Architecture Decision

Two-Tier Approach

Tier 1: Production
- Spans all production systems and environments
- Dedicated Redis namespace
- Priority workers
- Strict resource allocation
- Possible dedicated mini-server

Tier 2: Non-Production
- Spans dev, staging, and all non-production environments
- Shared Redis namespace (with sub-namespaces)
- Shared workers
- Flexible resource allocation
- Runs on Elastic Muscle

Infrastructure

Redis: Upstash (Cloud Managed)

Why Upstash:
- ✅ Serverless Redis
- ✅ Pay-per-request pricing
- ✅ Global edge network
- ✅ REST API (no connection pooling needed)
- ✅ TLS encryption
- ✅ Automatic scaling

Configuration:

// Two separate Upstash databases
const REDIS_CONFIG = {
  production: {
    url: process.env.UPSTASH_REDIS_PROD_URL,
    token: process.env.UPSTASH_REDIS_PROD_TOKEN,
  },
  nonProduction: {
    url: process.env.UPSTASH_REDIS_NONPROD_URL,
    token: process.env.UPSTASH_REDIS_NONPROD_TOKEN,
  },
};
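
The configuration above reads four environment variables, and a missing one would only surface later as a failed Redis call. A minimal sketch of a fail-fast lookup follows. The helper name `loadRedisCredentials` and the `ENV_PREFIX` table are assumptions for illustration, not part of the approved design; only the variable names come from the config above.

```typescript
type Tier = 'prod' | 'nonprod';

interface RedisCredentials {
  url: string;
  token: string;
}

// Environment variable prefix per tier, matching the names in REDIS_CONFIG.
const ENV_PREFIX: Record<Tier, string> = {
  prod: 'UPSTASH_REDIS_PROD',
  nonprod: 'UPSTASH_REDIS_NONPROD',
};

// Hypothetical helper: resolve the credentials for one tier and throw
// immediately if either variable is missing or empty.
function loadRedisCredentials(
  tier: Tier,
  env: Record<string, string | undefined>,
): RedisCredentials {
  const prefix = ENV_PREFIX[tier];
  const url = env[`${prefix}_URL`];
  const token = env[`${prefix}_TOKEN`];
  if (!url || !token) {
    throw new Error(`Missing ${prefix}_URL or ${prefix}_TOKEN for tier "${tier}"`);
  }
  return { url, token };
}
```

In a Node.js process the second argument would typically be `process.env`; taking it as a parameter keeps the helper easy to test.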

Upstash Pricing:
- Free tier: 10,000 commands/day
- Pay-as-you-go: $0.20 per 100K commands
- No idle charges
- No connection limits

Namespace Strategy

Flexible Namespacing

Purpose: Allow different applications to define namespace meaning

Structure:

{tier}:{namespace}:queue:batch:class-{class}

Examples:
prod:platform:queue:batch:class-A
prod:marketing:queue:batch:class-B
nonprod:dev:queue:batch:class-C
nonprod:stg:queue:batch:class-D

Flexibility:
- Platform app: namespace = environment (dev/stg/prod)
- Marketing app: namespace = deployment (preview/staging/production)
- Batch app: namespace = purpose (scheduled/adhoc/maintenance)
- Custom apps: namespace = whatever makes sense

Implementation:

interface BatchJobRequest {
  id: string;
  tier: 'prod' | 'nonprod';      // Required: Which tier
  namespace: string;              // Flexible: App-defined meaning
  script: string;
  class: JobClass;
  args?: string[];
  scheduledFor?: Date;
  metadata?: Record<string, unknown>;
}
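
Because namespaces are app-defined strings and `:` is the key delimiter, it may help to build and parse queue keys in one place. The sketch below is illustrative only: `buildQueueKey`, `parseQueueKey`, the namespace pattern, and the `JobClass` values `A` to `D` (inferred from the examples above) are all assumptions.

```typescript
type Tier = 'prod' | 'nonprod';
type JobClass = 'A' | 'B' | 'C' | 'D'; // assumed from the class-A..class-D examples

// Assumed rule: lowercase letters, digits, and hyphens only, so a namespace
// can never contain the ':' delimiter.
const NAMESPACE_PATTERN = /^[a-z0-9][a-z0-9-]*$/;

// Hypothetical helper: build a key such as prod:platform:queue:batch:class-A.
function buildQueueKey(tier: Tier, namespace: string, jobClass: JobClass): string {
  if (!NAMESPACE_PATTERN.test(namespace)) {
    throw new Error(`Invalid namespace: "${namespace}"`);
  }
  return `${tier}:${namespace}:queue:batch:class-${jobClass}`;
}

interface ParsedQueueKey {
  tier: Tier;
  namespace: string;
  jobClass: JobClass;
}

// Hypothetical inverse: returns null for keys that do not follow the layout.
function parseQueueKey(key: string): ParsedQueueKey | null {
  const match = /^(prod|nonprod):([a-z0-9][a-z0-9-]*):queue:batch:class-([A-D])$/.exec(key);
  if (!match) {
    return null;
  }
  return {
    tier: match[1] as Tier,
    namespace: match[2] as string,
    jobClass: match[3] as JobClass,
  };
}
```

Rejecting `:` inside a namespace keeps a key like `nonprod:preview-123:queue:batch:class-D` unambiguous when it is split back into its parts.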

Worker Allocation

Production Tier

Dedicated Resources:
- Location: Possibly dedicated mini-server (TBD)
- Workers: 2-4 dedicated workers
- Priority: ALWAYS preempts non-production
- Concurrency: Lower (2 concurrent max) for safety
- Monitoring: Enhanced monitoring and alerting

Worker Configuration:

const PROD_WORKER_CONFIG: WorkerConfig = {
  tier: 'prod',
  pollInterval: 5000,           // 5 seconds
  maxConcurrent: 2,              // Conservative for production
  healthCheckInterval: 30000,    // 30 seconds
  priority: 1,                   // Highest priority
};

Non-Production Tier

Shared Resources:
- Location: Elastic Muscle
- Workers: 4-8 shared workers
- Priority: Lower than production
- Concurrency: Higher (4-8 concurrent) for throughput
- Monitoring: Standard monitoring

Worker Configuration:

const NONPROD_WORKER_CONFIG: WorkerConfig = {
  tier: 'nonprod',
  pollInterval: 10000,          // 10 seconds (less aggressive)
  maxConcurrent: 8,              // Higher throughput
  healthCheckInterval: 60000,    // 1 minute
  priority: 2,                   // Lower priority
};
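
Both configurations reference a `WorkerConfig` type that this document does not define. One possible shape, inferred from the fields used above, is sketched here together with a hypothetical `validateWorkerConfig` check; the validation rules are assumptions, not agreed constraints.

```typescript
type Tier = 'prod' | 'nonprod';

// Assumed shape, inferred from PROD_WORKER_CONFIG and NONPROD_WORKER_CONFIG.
interface WorkerConfig {
  tier: Tier;
  pollInterval: number;        // milliseconds between queue polls
  maxConcurrent: number;       // upper bound on jobs running at once
  healthCheckInterval: number; // milliseconds between health checks
  priority: number;            // lower number = higher priority
}

const WORKER_CONFIGS: Record<Tier, WorkerConfig> = {
  prod: { tier: 'prod', pollInterval: 5000, maxConcurrent: 2, healthCheckInterval: 30000, priority: 1 },
  nonprod: { tier: 'nonprod', pollInterval: 10000, maxConcurrent: 8, healthCheckInterval: 60000, priority: 2 },
};

// Hypothetical sanity checks; returns a list of problems (empty when valid).
function validateWorkerConfig(config: WorkerConfig): string[] {
  const problems: string[] = [];
  if (config.pollInterval <= 0) {
    problems.push('pollInterval must be positive');
  }
  if (config.maxConcurrent < 1 || Math.floor(config.maxConcurrent) !== config.maxConcurrent) {
    problems.push('maxConcurrent must be a positive integer');
  }
  if (config.healthCheckInterval <= 0) {
    problems.push('healthCheckInterval must be positive');
  }
  if (config.tier === 'prod' && config.priority !== 1) {
    problems.push('prod workers must use priority 1');
  }
  return problems;
}
```

Running the check at worker startup would catch a mistyped interval or a production worker configured with the wrong priority before it begins polling.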

Priority Management

Production Always Wins

Rule: Production jobs ALWAYS preempt non-production jobs

Implementation:

class BatchWorker {
  private async pollQueues(): Promise<void> {
    while (this.running) {
      // ALWAYS check production tier first
      if (this.tier === 'prod' || this.canProcessProduction()) {
        const prodJob = await this.dequeueProduction();
        if (prodJob) {
          await this.processJob(prodJob);
          continue; // Skip non-prod if prod job found
        }
      }

      // Only process non-prod if no prod jobs waiting
      if (this.tier === 'nonprod') {
        const nonprodJob = await this.dequeueNonProduction();
        if (nonprodJob) {
          await this.processJob(nonprodJob);
        }
      }

      await this.sleep(this.config.pollInterval);
    }
  }
}

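Preemption Strategy:
- Production jobs can interrupt non-production jobs
- Non-production jobs pause when production jobs arrive
- Non-production jobs resume after production queue clears

Note: the polling loop above only enforces ordering between the two queues. Interrupting a non-production job that is already running is the preemption logic scheduled for Phase 4.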
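
One way the pause-and-resume behaviour could work is cooperative preemption: a non-production job is split into steps and checks for waiting production work between them. The sketch below is an assumption about how this might be done, not the approved mechanism; `runWithPreemption`, `Step`, and `PausedJob` are hypothetical names, and the steps are synchronous to keep the example short.

```typescript
// A job is modelled as an ordered list of small, restartable steps.
type Step = () => void;

interface PausedJob {
  id: string;
  remaining: Step[]; // steps still to run when the job resumes
}

// Hypothetical runner: executes steps in order, but yields as soon as
// prodPending() reports waiting production work. Returns null when the job
// ran to completion, or a PausedJob describing what is left.
function runWithPreemption(
  id: string,
  steps: Step[],
  prodPending: () => boolean,
): PausedJob | null {
  for (let i = 0; i < steps.length; i++) {
    if (prodPending()) {
      return { id, remaining: steps.slice(i) };
    }
    steps[i]();
  }
  return null;
}
```

A worker would call `runWithPreemption` again with `remaining` once the production queue is empty. This only works for jobs whose steps can safely be separated by a pause; a single long-running script would need a different mechanism, such as process signals.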

Production Mini-Server (Optional)

Rationale

Why a dedicated server for production:
1. Isolation: Complete separation from dev/test workloads
2. Reliability: No resource contention
3. Security: Separate credentials and access
4. Monitoring: Dedicated monitoring and alerting
5. Compliance: Easier to audit and certify

Architecture

Option A: Dedicated GCP VM

Production Mini-Server (GCP VM)
├── Redis Client (Upstash Prod)
├── Batch Workers (2-4 dedicated)
├── Monitoring Agent
└── Alerting System

Option B: Cloud Run (Serverless)

Cloud Run Service (Production Batch)
├── Auto-scaling workers
├── Upstash Redis connection
├── Cloud Monitoring (formerly Stackdriver)
└── Automatic failover

Option C: Shared Elastic Muscle (Isolated)

Elastic Muscle
├── Production Workers (isolated process)
│   └── Separate Upstash connection
└── Non-Production Workers
    └── Separate Upstash connection

Recommendation: Start with Option C (isolated on Elastic Muscle), migrate to Option A (dedicated VM) if needed.

Upstash Integration

Setup

1. Create Two Upstash Databases:

# Production database
upstash-redis-prod
  Region: us-east-1
  TLS: Enabled
  Eviction: noeviction

# Non-production database
upstash-redis-nonprod
  Region: us-east-1
  TLS: Enabled
  Eviction: allkeys-lru (can evict old jobs)

2. Store Credentials in Doppler:

# Production Doppler config
UPSTASH_REDIS_PROD_URL=https://...
UPSTASH_REDIS_PROD_TOKEN=...

# Non-production Doppler config
UPSTASH_REDIS_NONPROD_URL=https://...
UPSTASH_REDIS_NONPROD_TOKEN=...

3. Update QueueManager:

import { Redis } from '@upstash/redis';

export class QueueManager {
  private redis: Redis;

  constructor(tier: 'prod' | 'nonprod') {
    const config = tier === 'prod' 
      ? {
          url: process.env.UPSTASH_REDIS_PROD_URL!,
          token: process.env.UPSTASH_REDIS_PROD_TOKEN!,
        }
      : {
          url: process.env.UPSTASH_REDIS_NONPROD_URL!,
          token: process.env.UPSTASH_REDIS_NONPROD_TOKEN!,
        };

    this.redis = new Redis(config);
  }

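  // Queue operations use tier-aware keys
  async enqueue(job: BatchJobRequest): Promise<void> {
    const queueKey = `${job.tier}:${job.namespace}:queue:batch:class-${job.class}`;
    // Store the payload first so a worker never dequeues an ID without data.
    // @upstash/redis takes hash fields as an object: hset(key, { field: value }).
    await this.redis.hset(`${job.tier}:job:${job.id}`, { data: JSON.stringify(job) });
    await this.redis.lpush(queueKey, job.id);
  }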
}
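
`enqueue` pushes job IDs onto the head of a list with LPUSH. If workers pop from the tail with RPOP, the list behaves as a first-in, first-out queue. The in-memory sketch below illustrates that ordering; `QueueStore`, `InMemoryQueueStore`, and `drain` are hypothetical names, and the interface is synchronous for readability, whereas the real Upstash client is asynchronous.

```typescript
// Assumed minimal interface: the two list operations a FIFO queue needs.
interface QueueStore {
  lpush(key: string, value: string): number;
  rpop(key: string): string | null;
}

// In-memory stand-in for Redis lists, for illustration and tests only.
class InMemoryQueueStore implements QueueStore {
  private lists: { [key: string]: string[] } = {};

  // Insert at the head of the list and return the new length, like LPUSH.
  lpush(key: string, value: string): number {
    const list = this.lists[key] || [];
    list.unshift(value);
    this.lists[key] = list;
    return list.length;
  }

  // Remove and return the tail element, or null when empty, like RPOP.
  rpop(key: string): string | null {
    const list = this.lists[key];
    if (!list || list.length === 0) {
      return null;
    }
    const value = list.pop();
    return value === undefined ? null : value;
  }
}

// Pop job IDs until the queue is empty, returning them in dequeue order.
function drain(store: QueueStore, queueKey: string): string[] {
  const ids: string[] = [];
  let id = store.rpop(queueKey);
  while (id !== null) {
    ids.push(id);
    id = store.rpop(queueKey);
  }
  return ids;
}
```

Jobs pushed as `job-1`, `job-2`, `job-3` are drained in that same order, which is the behaviour a worker would rely on when processing a queue oldest-first.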

Namespace Examples

Platform Application

Namespaces = Environments:

// Development
await batch.submit({
  tier: 'nonprod',
  namespace: 'dev',
  script: 'test-suite',
  class: 'C',
});

// Staging
await batch.submit({
  tier: 'nonprod',
  namespace: 'stg',
  script: 'integration-test',
  class: 'B',
});

// Production
await batch.submit({
  tier: 'prod',
  namespace: 'platform',
  script: 'backup-database',
  class: 'A',
});

Marketing Application

Namespaces = Deployments:

// Preview deployment
await batch.submit({
  tier: 'nonprod',
  namespace: 'preview-123',
  script: 'lighthouse-audit',
  class: 'D',
});

// Production deployment
await batch.submit({
  tier: 'prod',
  namespace: 'marketing-prod',
  script: 'generate-sitemap',
  class: 'B',
});

Custom Application

Namespaces = Purpose:

// Scheduled maintenance
await batch.submit({
  tier: 'prod',
  namespace: 'scheduled',
  script: 'cleanup-old-logs',
  class: 'C',
});

// Ad-hoc analysis
await batch.submit({
  tier: 'nonprod',
  namespace: 'adhoc',
  script: 'analyze-performance',
  class: 'D',
});

Security & Access Control

Tier-Based Access

Production Tier:
- Requires elevated permissions
- Audit logging mandatory
- Limited to authorized users
- Approval workflow for critical jobs

Non-Production Tier:
- Standard permissions
- Basic logging
- Open to developers
- No approval required

Implementation:

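import { randomUUID } from 'node:crypto';

// API server validates tier access
async function submitJob(req, res) {
  // `class` is a reserved word, so it is renamed while destructuring
  const { tier, namespace, script, class: jobClass } = req.body;

  // Check tier access
  if (tier === 'prod' && !(await hasProductionAccess(req.user))) {
    return res.status(403).json({ error: 'Production access required' });
  }

  // Audit production submissions
  if (tier === 'prod') {
    await auditLog.record({
      user: req.user,
      action: 'batch_submit',
      tier,
      namespace,
      script,
    });
  }

  // BatchJobRequest requires an id, so generate one before enqueueing
  const id = randomUUID();
  await queueManager.enqueue({ id, tier, namespace, script, class: jobClass });

  // Respond so the request does not hang; 202 signals the job was queued
  return res.status(202).json({ id });
}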
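
The handler above relies on `hasProductionAccess`, which this document does not define. A minimal role-based sketch of the tier rules follows. The `User` shape, the role name `batch:prod-submit`, and the function `authorizeSubmission` are assumptions for illustration; a real check would probably be asynchronous and backed by an identity provider.

```typescript
type Tier = 'prod' | 'nonprod';

// Hypothetical user shape; the actual user model is not specified here.
interface User {
  id: string;
  roles: string[];
}

interface SubmissionDecision {
  allowed: boolean;
  audit: boolean;   // whether the submission must be written to the audit log
  reason?: string;  // set only when the submission is rejected
}

// Hypothetical role name granting production submission rights.
const PROD_SUBMIT_ROLE = 'batch:prod-submit';

// Non-production is open to any user; production requires the role and is
// always audited when accepted, mirroring the handler above.
function authorizeSubmission(user: User, tier: Tier): SubmissionDecision {
  if (tier === 'nonprod') {
    return { allowed: true, audit: false };
  }
  const hasAccess = user.roles.indexOf(PROD_SUBMIT_ROLE) !== -1;
  if (!hasAccess) {
    return { allowed: false, audit: false, reason: 'Production access required' };
  }
  return { allowed: true, audit: true };
}
```

Returning a decision object rather than a boolean keeps the audit requirement next to the access rule, so the handler cannot accept a production job and forget to log it.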

Monitoring & Alerting

Production Monitoring

Enhanced monitoring for production:
- Queue depth alerts (> 10 jobs)
- Worker health checks (every 30s)
- Job failure alerts (immediate)
- Execution time tracking
- Resource utilization

Alerting:

// Alert on production job failure
if (tier === 'prod' && result.status === 'failed') {
  await alerting.send({
    severity: 'high',
    message: `Production batch job failed: ${job.script}`,
    job: job.id,
    error: result.error,
  });
}

Non-Production Monitoring

Standard monitoring:
- Queue depth (informational)
- Worker health (every 60s)
- Job failures (logged, not alerted)
- Basic metrics
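
The two monitoring profiles differ mainly in what triggers an alert. A small sketch of that decision is shown below, assuming a periodic snapshot of queue depth and failure count per tier; `QueueSnapshot`, `Action`, and `evaluateSnapshot` are hypothetical names, and the threshold of 10 comes from the production list above.

```typescript
type Tier = 'prod' | 'nonprod';
type Action = 'alert' | 'log' | 'none';

// Hypothetical snapshot a monitoring loop might collect for one tier.
interface QueueSnapshot {
  tier: Tier;
  queueDepth: number;
  failedJobs: number;
}

// Production alerts when more than 10 jobs are waiting.
const PROD_QUEUE_DEPTH_LIMIT = 10;

function evaluateSnapshot(snapshot: QueueSnapshot): Action {
  if (snapshot.tier === 'prod') {
    // Production: alert on a deep queue or on any failure.
    const tooDeep = snapshot.queueDepth > PROD_QUEUE_DEPTH_LIMIT;
    return tooDeep || snapshot.failedJobs > 0 ? 'alert' : 'none';
  }
  // Non-production: queue depth is informational; failures are only logged.
  return snapshot.failedJobs > 0 ? 'log' : 'none';
}
```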

Migration Path

Phase 1: Single Tier (Current)

Status: Checkpoint 1 complete
- Single Redis (local or cloud)
- No tier separation
- Basic worker

Phase 2: Two-Tier Infrastructure

Tasks:
1. Create two Upstash databases (prod/nonprod)
2. Update types to include tier and namespace
3. Modify QueueManager for tier-aware operations
4. Update BatchWorker with priority logic
5. Add tier validation to API server

Phase 3: Production Isolation

Tasks:
1. Deploy production workers (isolated or dedicated)
2. Set up production monitoring
3. Configure alerting
4. Implement access control
5. Test failover

Phase 4: Optimization

Tasks:
1. Fine-tune worker allocation
2. Optimize queue polling
3. Add preemption logic
4. Performance testing
5. Cost optimization

Cost Estimation

Upstash Redis

Production:
- Estimated: 100K commands/day
- Cost: ~$0.20/day = $6/month
- Free tier covers development

Non-Production:
- Estimated: 50K commands/day
- Cost: ~$0.10/day = $3/month
- Mostly within free tier

Total: ~$9/month for Redis
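
The Redis figures above follow from the pay-as-you-go rate of $0.20 per 100K commands and a 30-day month. The sketch below reproduces that arithmetic; it deliberately ignores the free tier, as the estimates above do, and the function name `monthlyRedisCostUsd` is hypothetical.

```typescript
const PRICE_PER_100K_COMMANDS_USD = 0.2; // pay-as-you-go rate quoted above
const DAYS_PER_MONTH = 30;               // assumed month length for the estimate

// Monthly cost in USD for a steady daily command volume, rounded to cents.
function monthlyRedisCostUsd(commandsPerDay: number): number {
  const dailyCost = (commandsPerDay / 100000) * PRICE_PER_100K_COMMANDS_USD;
  return Math.round(dailyCost * DAYS_PER_MONTH * 100) / 100;
}

const prodMonthlyUsd = monthlyRedisCostUsd(100000);   // 6
const nonprodMonthlyUsd = monthlyRedisCostUsd(50000); // 3
const totalMonthlyUsd = prodMonthlyUsd + nonprodMonthlyUsd; // 9
```

Because the cost scales linearly with command volume, doubling the polling frequency of every worker would roughly double these figures.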

Compute

Production Mini-Server (if needed):
- GCP e2-micro: $7/month
- Or Cloud Run: Pay-per-use (~$5/month)

Non-Production (Elastic Muscle):
- Already paid for
- No additional cost

Total Infrastructure: ~$15-20/month

Implementation Checklist

Immediate (Checkpoint 2)

[ ] Sign up for Upstash
[ ] Create prod and nonprod databases
[ ] Add credentials to Doppler
[ ] Update BatchJobRequest type with tier and namespace
[ ] Modify QueueManager for Upstash
[ ] Test basic two-tier operation

Short-term (Week 1)

[ ] Implement priority logic in BatchWorker
[ ] Add tier validation to API server
[ ] Set up production monitoring
[ ] Deploy workers to Elastic Muscle
[ ] End-to-end testing

Medium-term (Week 2-3)

[ ] Evaluate production mini-server need
[ ] Implement access control
[ ] Set up alerting
[ ] Performance optimization
[ ] Documentation

Decision Summary

✅ Approved Architecture:

Two-tier: Production vs Non-Production
Redis: Upstash (cloud managed)
Namespacing: Flexible, app-defined
Priority: Production always preempts non-production
Production isolation: Possibly dedicated mini-server

Next Steps:
1. Set up Upstash databases
2. Update types and code for two-tier
3. Test and validate
4. Deploy to Checkpoint 2