Batch System Architecture - Two-Tier Strategy (APPROVED)
Date: 2026-01-18
Status: ✅ APPROVED - Production/Non-Production Split
Decision: Two-tier batch system with flexible namespacing
Architecture Decision
Two-Tier Approach
Tier 1: Production
- Spans all production systems and environments
- Dedicated Redis namespace
- Priority workers
- Strict resource allocation
- Possible dedicated mini-server
Tier 2: Non-Production
- Spans dev, staging, and all non-production environments
- Shared Redis namespace (with sub-namespaces)
- Shared workers
- Flexible resource allocation
- Runs on Elastic Muscle
Infrastructure
Redis: Upstash (Cloud Managed)
Why Upstash:
- ✅ Serverless Redis
- ✅ Pay-per-request pricing
- ✅ Global edge network
- ✅ REST API (no connection pooling needed)
- ✅ TLS encryption
- ✅ Automatic scaling
Configuration:
// Two separate Upstash databases
const REDIS_CONFIG = {
  production: {
    url: process.env.UPSTASH_REDIS_PROD_URL,
    token: process.env.UPSTASH_REDIS_PROD_TOKEN,
  },
  nonProduction: {
    url: process.env.UPSTASH_REDIS_NONPROD_URL,
    token: process.env.UPSTASH_REDIS_NONPROD_TOKEN,
  },
};
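A minimal sketch of how the configuration above might be resolved at startup, so that a missing variable fails fast instead of surfacing later as an opaque connection error. The `resolveRedisConfig` helper and its `env` parameter are hypothetical and not part of the approved design; passing the environment in explicitly (for example `process.env`) simply keeps the function easy to test.

```typescript
type ConfigTier = 'production' | 'nonProduction';

interface RedisCredentials {
  url: string;
  token: string;
}

// Environment variable names taken from the configuration above.
const ENV_NAMES: Record<ConfigTier, { url: string; token: string }> = {
  production: {
    url: 'UPSTASH_REDIS_PROD_URL',
    token: 'UPSTASH_REDIS_PROD_TOKEN',
  },
  nonProduction: {
    url: 'UPSTASH_REDIS_NONPROD_URL',
    token: 'UPSTASH_REDIS_NONPROD_TOKEN',
  },
};

// Hypothetical helper: returns the credentials for a tier, or throws an
// error that names every missing variable.
function resolveRedisConfig(
  tier: ConfigTier,
  env: Record<string, string | undefined>,
): RedisCredentials {
  const names = ENV_NAMES[tier];
  const url = env[names.url];
  const token = env[names.token];

  const missing: string[] = [];
  if (!url) missing.push(names.url);
  if (!token) missing.push(names.token);

  if (!url || !token) {
    throw new Error(`Missing Redis credentials: ${missing.join(', ')}`);
  }
  return { url, token };
}
```

A caller would typically run `resolveRedisConfig('production', process.env)` once during boot and pass the result to the Redis client.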
Upstash Pricing:
- Free tier: 10,000 commands/day
- Pay-as-you-go: $0.20 per 100K commands
- No idle charges
- No connection limits
Namespace Strategy
Flexible Namespacing
Purpose: Allow different applications to define namespace meaning
Structure:
{tier}:{namespace}:queue:batch:class-{class}
Examples:
prod:platform:queue:batch:class-A
prod:marketing:queue:batch:class-B
nonprod:dev:queue:batch:class-C
nonprod:stg:queue:batch:class-D
Flexibility:
- Platform app: namespace = environment (dev/stg/prod)
- Marketing app: namespace = deployment (preview/staging/production)
- Batch app: namespace = purpose (scheduled/adhoc/maintenance)
- Custom apps: namespace = whatever makes sense
Implementation:
interface BatchJobRequest {
  id: string;
  tier: 'prod' | 'nonprod'; // Required: Which tier
  namespace: string; // Flexible: App-defined meaning
  script: string;
  class: JobClass;
  args?: string[];
  scheduledFor?: Date;
  metadata?: Record<string, unknown>;
}
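The key structure in this section could be built and parsed with a pair of small helpers, sketched below. The function names, the single-letter `JobClass` union (inferred from the A to D classes in the examples), and the rule that a namespace may contain only lowercase letters, digits, and hyphens are all assumptions made for illustration; the rule exists because `:` is the key separator.

```typescript
type Tier = 'prod' | 'nonprod';

// Assumption: job classes are the single letters used in the examples.
type JobClass = 'A' | 'B' | 'C' | 'D';

// Assumed namespace rule: lowercase letters, digits, hyphens; no ':'.
const NAMESPACE_PATTERN = /^[a-z0-9][a-z0-9-]*$/;

// Builds a key such as "prod:platform:queue:batch:class-A".
function buildQueueKey(tier: Tier, namespace: string, jobClass: JobClass): string {
  if (!NAMESPACE_PATTERN.test(namespace)) {
    throw new Error(`Invalid namespace: "${namespace}"`);
  }
  return `${tier}:${namespace}:queue:batch:class-${jobClass}`;
}

// Splits a queue key back into its parts, or returns null if it does not
// follow the expected structure.
function parseQueueKey(
  key: string,
): { tier: Tier; namespace: string; jobClass: JobClass } | null {
  const match = /^(prod|nonprod):([a-z0-9][a-z0-9-]*):queue:batch:class-([A-D])$/.exec(key);
  if (!match) return null;
  return {
    tier: match[1] as Tier,
    namespace: match[2],
    jobClass: match[3] as JobClass,
  };
}
```

Centralizing key construction in one place keeps producers and workers from drifting apart on the format.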
Worker Allocation
Production Tier
Dedicated Resources:
- Location: Possibly dedicated mini-server (TBD)
- Workers: 2-4 dedicated workers
- Priority: ALWAYS preempts non-production
- Concurrency: Lower (2 concurrent max) for safety
- Monitoring: Enhanced monitoring and alerting
Worker Configuration:
const PROD_WORKER_CONFIG: WorkerConfig = {
  tier: 'prod',
  pollInterval: 5000, // 5 seconds
  maxConcurrent: 2, // Conservative for production
  healthCheckInterval: 30000, // 30 seconds
  priority: 1, // Highest priority
};
Non-Production Tier
Shared Resources:
- Location: Elastic Muscle
- Workers: 4-8 shared workers
- Priority: Lower than production
- Concurrency: Higher (4-8 concurrent) for throughput
- Monitoring: Standard monitoring
Worker Configuration:
const NONPROD_WORKER_CONFIG: WorkerConfig = {
  tier: 'nonprod',
  pollInterval: 10000, // 10 seconds (less aggressive)
  maxConcurrent: 8, // Higher throughput
  healthCheckInterval: 60000, // 1 minute
  priority: 2, // Lower priority
};
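Both configurations reference a `WorkerConfig` type that this document does not define. The sketch below shows one shape it might take, inferred only from the fields used above, together with a hypothetical `validateWorkerConfig` check. The rule that the health check interval should not be shorter than the poll interval is an assumption, not a stated requirement.

```typescript
type Tier = 'prod' | 'nonprod';

// Hypothetical shape, inferred from the fields used in the two configs.
interface WorkerConfig {
  tier: Tier;
  pollInterval: number; // milliseconds between queue polls
  maxConcurrent: number; // maximum jobs running at once
  healthCheckInterval: number; // milliseconds between health checks
  priority: number; // lower number = higher priority
}

const WORKER_CONFIGS: Record<Tier, WorkerConfig> = {
  prod: {
    tier: 'prod',
    pollInterval: 5000,
    maxConcurrent: 2,
    healthCheckInterval: 30000,
    priority: 1,
  },
  nonprod: {
    tier: 'nonprod',
    pollInterval: 10000,
    maxConcurrent: 8,
    healthCheckInterval: 60000,
    priority: 2,
  },
};

// Returns a list of human-readable problems; an empty list means the
// config passed every check.
function validateWorkerConfig(config: WorkerConfig): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(config.maxConcurrent) || config.maxConcurrent < 1) {
    problems.push('maxConcurrent must be a positive integer');
  }
  if (config.pollInterval <= 0) {
    problems.push('pollInterval must be greater than zero');
  }
  // Assumed rule: health checks are never more frequent than polling.
  if (config.healthCheckInterval < config.pollInterval) {
    problems.push('healthCheckInterval should not be shorter than pollInterval');
  }
  return problems;
}
```

Keying the configs by tier lets a worker select its settings with `WORKER_CONFIGS[tier]` rather than importing two separate constants.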
Priority Management
Production Always Wins
Rule: Production jobs ALWAYS preempt non-production jobs
Implementation:
class BatchWorker {
  private async pollQueues(): Promise<void> {
    while (this.running) {
      // ALWAYS check production tier first
      if (this.tier === 'prod' || this.canProcessProduction()) {
        const prodJob = await this.dequeueProduction();
        if (prodJob) {
          await this.processJob(prodJob);
          continue; // Skip non-prod if prod job found
        }
      }
      // Only process non-prod if no prod jobs waiting
      if (this.tier === 'nonprod') {
        const nonprodJob = await this.dequeueNonProduction();
        if (nonprodJob) {
          await this.processJob(nonprodJob);
        }
      }
      await this.sleep(this.config.pollInterval);
    }
  }
}
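Preemption Strategy:
- Production jobs can interrupt non-production jobs
- Non-production jobs pause when production jobs arrive
- Non-production jobs resume after production queue clears
Note: the polling loop above only gives production jobs first pick at dequeue time; it does not interrupt a non-production job that is already running. Pausing and resuming in-flight jobs is the preemption logic listed under Phase 4.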
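The pause-and-resume behavior described in the preemption strategy can be illustrated with a small synchronous simulation. This is a sketch under simplifying assumptions, not the worker implementation: `TwoTierScheduler`, `SimJob`, and the idea of measuring work in `stepsRemaining` are all hypothetical, and a real worker would need a cooperative checkpoint inside each job in order to pause it.

```typescript
type Tier = 'prod' | 'nonprod';

interface SimJob {
  id: string;
  tier: Tier;
  stepsRemaining: number; // hypothetical unit of work
}

class TwoTierScheduler {
  private prodQueue: SimJob[] = [];
  private nonprodQueue: SimJob[] = [];
  private paused: SimJob[] = []; // preempted non-production jobs
  private current: SimJob | null = null;
  readonly log: string[] = [];

  submit(job: SimJob): void {
    (job.tier === 'prod' ? this.prodQueue : this.nonprodQueue).push(job);
  }

  // Runs one step of work. Returns false when nothing is left to do.
  tick(): boolean {
    let job = this.current;

    // Preempt: a waiting production job pauses a running non-prod job.
    if (job !== null && job.tier === 'nonprod' && this.prodQueue.length > 0) {
      this.log.push(`pause ${job.id}`);
      this.paused.push(job);
      job = null;
    }

    if (job === null) {
      job = this.pickNext();
      if (job === null) {
        this.current = null;
        return false;
      }
    }

    job.stepsRemaining -= 1;
    this.log.push(`run ${job.id}`);
    if (job.stepsRemaining <= 0) {
      this.log.push(`done ${job.id}`);
      job = null;
    }
    this.current = job;
    return true;
  }

  // Order: production first, then paused jobs, then new non-prod jobs.
  private pickNext(): SimJob | null {
    const prod = this.prodQueue.shift();
    if (prod) return prod;
    const resumed = this.paused.shift();
    if (resumed) {
      this.log.push(`resume ${resumed.id}`);
      return resumed;
    }
    return this.nonprodQueue.shift() ?? null;
  }
}
```

Resuming paused jobs before starting new non-production work is a design choice in this sketch; it keeps preempted jobs from being starved by later arrivals.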
Production Mini-Server (Optional)
Rationale
Why a dedicated server for production:
1. Isolation: Complete separation from dev/test workloads
2. Reliability: No resource contention
3. Security: Separate credentials and access
4. Monitoring: Dedicated monitoring and alerting
5. Compliance: Easier to audit and certify
Architecture
Option A: Dedicated GCP VM
Production Mini-Server (GCP VM)
├── Redis Client (Upstash Prod)
├── Batch Workers (2-4 dedicated)
├── Monitoring Agent
└── Alerting System
Option B: Cloud Run (Serverless)
Cloud Run Service (Production Batch)
├── Auto-scaling workers
├── Upstash Redis connection
├── Stackdriver monitoring
└── Automatic failover
Option C: Shared Elastic Muscle (Isolated)
Elastic Muscle
├── Production Workers (isolated process)
│ └── Separate Upstash connection
└── Non-Production Workers
└── Separate Upstash connection
Recommendation: Start with Option C (isolated on Elastic Muscle), migrate to Option A (dedicated VM) if needed.
Upstash Integration
Setup
1. Create Two Upstash Databases:
# Production database
upstash-redis-prod
Region: us-east-1
TLS: Enabled
Eviction: noeviction
# Non-production database
upstash-redis-nonprod
Region: us-east-1
TLS: Enabled
Eviction: allkeys-lru (can evict old jobs)
2. Store Credentials in Doppler:
# Production Doppler config
UPSTASH_REDIS_PROD_URL=https://...
UPSTASH_REDIS_PROD_TOKEN=...
# Non-production Doppler config
UPSTASH_REDIS_NONPROD_URL=https://...
UPSTASH_REDIS_NONPROD_TOKEN=...
3. Update QueueManager:
import { Redis } from '@upstash/redis';
export class QueueManager {
  private redis: Redis;
  constructor(tier: 'prod' | 'nonprod') {
    const config = tier === 'prod'
      ? {
          url: process.env.UPSTASH_REDIS_PROD_URL!,
          token: process.env.UPSTASH_REDIS_PROD_TOKEN!,
        }
      : {
          url: process.env.UPSTASH_REDIS_NONPROD_URL!,
          token: process.env.UPSTASH_REDIS_NONPROD_TOKEN!,
        };
    this.redis = new Redis(config);
  }
  // Queue operations use tier-aware keys
  async enqueue(job: BatchJobRequest): Promise<void> {
    const queueKey = `${job.tier}:${job.namespace}:queue:batch:class-${job.class}`;
    // Store the payload first so a worker never dequeues an id that has no data yet.
    // @upstash/redis takes the hash fields as an object.
    await this.redis.hset(`${job.tier}:job:${job.id}`, { data: JSON.stringify(job) });
    await this.redis.lpush(queueKey, job.id);
  }
}
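The QueueManager above only shows the enqueue side. Because it adds job ids with `LPUSH`, which inserts at the head of the list, a worker that wants first-in, first-out order would remove ids from the tail with `RPOP`. The sketch below demonstrates that pairing against an in-memory stand-in; `InMemoryListStore`, `enqueueId`, and `dequeueId` are hypothetical names, and the methods are synchronous only for brevity, whereas calls to a real Redis client are asynchronous.

```typescript
// In-memory stand-in for the two Redis list commands the queue relies on.
// Assumption: it mirrors LPUSH and RPOP semantics for a single list key.
class InMemoryListStore {
  private lists = new Map<string, string[]>();

  // LPUSH: insert at the head of the list, return the new length.
  lpush(key: string, value: string): number {
    const list = this.lists.get(key) ?? [];
    list.unshift(value);
    this.lists.set(key, list);
    return list.length;
  }

  // RPOP: remove and return the tail of the list, or null if empty.
  rpop(key: string): string | null {
    const list = this.lists.get(key);
    if (!list || list.length === 0) return null;
    return list.pop() ?? null;
  }
}

// Producer side: same direction as QueueManager.enqueue.
function enqueueId(store: InMemoryListStore, queueKey: string, jobId: string): void {
  store.lpush(queueKey, jobId);
}

// Consumer side: popping from the opposite end yields FIFO order.
function dequeueId(store: InMemoryListStore, queueKey: string): string | null {
  return store.rpop(queueKey);
}
```

Popping from the same end that was pushed to would instead produce last-in, first-out order, which is rarely what a batch queue wants.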
Namespace Examples
Platform Application
Namespaces = Environments:
// Development
await batch.submit({
  tier: 'nonprod',
  namespace: 'dev',
  script: 'test-suite',
  class: 'C',
});
// Staging
await batch.submit({
  tier: 'nonprod',
  namespace: 'stg',
  script: 'integration-test',
  class: 'B',
});
// Production
await batch.submit({
  tier: 'prod',
  namespace: 'platform',
  script: 'backup-database',
  class: 'A',
});
Marketing Application
Namespaces = Deployments:
// Preview deployment
await batch.submit({
  tier: 'nonprod',
  namespace: 'preview-123',
  script: 'lighthouse-audit',
  class: 'D',
});
// Production deployment
await batch.submit({
  tier: 'prod',
  namespace: 'marketing-prod',
  script: 'generate-sitemap',
  class: 'B',
});
Custom Application
Namespaces = Purpose:
// Scheduled maintenance
await batch.submit({
  tier: 'prod',
  namespace: 'scheduled',
  script: 'cleanup-old-logs',
  class: 'C',
});
// Ad-hoc analysis
await batch.submit({
  tier: 'nonprod',
  namespace: 'adhoc',
  script: 'analyze-performance',
  class: 'D',
});
Security & Access Control
Tier-Based Access
Production Tier:
- Requires elevated permissions
- Audit logging mandatory
- Limited to authorized users
- Approval workflow for critical jobs
Non-Production Tier:
- Standard permissions
- Basic logging
- Open to developers
- No approval required
Implementation:
// API server validates tier access
async submitJob(req, res) {
const { tier, namespace, script, class } = req.body;
// Check tier access
if (tier === 'prod' && !await hasProductionAccess(req.user)) {
return res.status(403).json({ error: 'Production access required' });
}
// Audit production submissions
if (tier === 'prod') {
await auditLog.record({
user: req.user,
action: 'batch_submit',
tier,
namespace,
script,
});
}
await queueManager.enqueue({ tier, namespace, script, class });
}
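The handler above calls `hasProductionAccess` without defining it. One way the tier check could be expressed is as a pure decision function, sketched below. The `User` shape, the role name `batch:prod-submit`, and the `decideAccess` helper are hypothetical. The sketch also flags denied production attempts for auditing, which the handler above does not do; that is a design choice made here, not a requirement stated in this document.

```typescript
type Tier = 'prod' | 'nonprod';

// Hypothetical user shape; the real user model is not defined here.
interface User {
  id: string;
  roles: string[];
}

// Assumed role name for production submissions.
const PROD_SUBMIT_ROLE = 'batch:prod-submit';

interface AccessDecision {
  allowed: boolean;
  status: 202 | 403; // HTTP status the API would return
  audit: boolean; // whether to write an audit record
}

function hasProductionAccess(user: User): boolean {
  return user.roles.includes(PROD_SUBMIT_ROLE);
}

function decideAccess(user: User, tier: Tier): AccessDecision {
  // Non-production: open to developers, no audit record required.
  if (tier === 'nonprod') {
    return { allowed: true, status: 202, audit: false };
  }
  // Production: role check, and every attempt is audited.
  const allowed = hasProductionAccess(user);
  return { allowed, status: allowed ? 202 : 403, audit: true };
}
```

Separating the decision from the HTTP handler makes the access rules testable without a running server.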
Monitoring & Alerting
Production Monitoring
Enhanced monitoring for production:
- Queue depth alerts (> 10 jobs)
- Worker health checks (every 30s)
- Job failure alerts (immediate)
- Execution time tracking
- Resource utilization
Alerting:
// Alert on production job failure
if (tier === 'prod' && result.status === 'failed') {
await alerting.send({
severity: 'high',
message: `Production batch job failed: ${job.script}`,
job: job.id,
error: result.error,
});
}
Non-Production Monitoring
Standard monitoring:
- Queue depth (informational)
- Worker health (every 60s)
- Job failures (logged, not alerted)
- Basic metrics
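The different treatment of queue depth in the two tiers could be captured in one small function, sketched below. The threshold of 10 comes from the production monitoring list; the `DepthReport` shape and the `evaluateQueueDepth` name are hypothetical, and how the report is delivered (pager, log, dashboard) is left open.

```typescript
type Tier = 'prod' | 'nonprod';

interface DepthReport {
  tier: Tier;
  queueKey: string;
  depth: number;
  severity: 'high' | 'info';
  shouldAlert: boolean; // true = notify someone, false = log only
}

// From the production monitoring list: alert when depth exceeds 10 jobs.
const PROD_QUEUE_DEPTH_THRESHOLD = 10;

function evaluateQueueDepth(tier: Tier, queueKey: string, depth: number): DepthReport {
  const overThreshold = tier === 'prod' && depth > PROD_QUEUE_DEPTH_THRESHOLD;
  return {
    tier,
    queueKey,
    depth,
    // Non-production depth is informational regardless of size.
    severity: overThreshold ? 'high' : 'info',
    shouldAlert: overThreshold,
  };
}
```

A monitoring loop might call this once per queue on each health check interval and forward only the reports where `shouldAlert` is true.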
Migration Path
Phase 1: Single Tier (Current)
Status: Checkpoint 1 complete
- Single Redis (local or cloud)
- No tier separation
- Basic worker
Phase 2: Two-Tier Infrastructure
Tasks:
1. Create two Upstash databases (prod/nonprod)
2. Update types to include tier and namespace
3. Modify QueueManager for tier-aware operations
4. Update BatchWorker with priority logic
5. Add tier validation to API server
Phase 3: Production Isolation
Tasks:
1. Deploy production workers (isolated or dedicated)
2. Setup production monitoring
3. Configure alerting
4. Implement access control
5. Test failover
Phase 4: Optimization
Tasks:
1. Fine-tune worker allocation
2. Optimize queue polling
3. Add preemption logic
4. Performance testing
5. Cost optimization
Cost Estimation
Upstash Redis
Production:
- Estimated: 100K commands/day
- Cost: ~$0.20/day = $6/month
- Free tier covers development
Non-Production:
- Estimated: 50K commands/day
- Cost: ~$0.10/day = $3/month
- Mostly within free tier
Total: ~$9/month for Redis
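The Redis figures above follow from the pay-as-you-go rate of $0.20 per 100K commands. The sketch below reproduces the arithmetic; the function name, the 30-day month, and the optional `freeCommandsPerDay` parameter are assumptions. The estimates in this section do not subtract any free allowance, so the parameter defaults to zero, and whether a free allowance applies to a paid database is not stated here.

```typescript
const PRICE_PER_100K_COMMANDS_USD = 0.2; // pay-as-you-go rate quoted above
const DAYS_PER_MONTH = 30; // assumption used for the monthly figures

// Estimated monthly cost in USD, rounded to the nearest cent.
function estimateMonthlyRedisCostUsd(
  commandsPerDay: number,
  freeCommandsPerDay: number = 0,
): number {
  const billablePerDay = Math.max(0, commandsPerDay - freeCommandsPerDay);
  const dailyCost = (billablePerDay / 100000) * PRICE_PER_100K_COMMANDS_USD;
  return Math.round(dailyCost * DAYS_PER_MONTH * 100) / 100;
}

const productionCost = estimateMonthlyRedisCostUsd(100000); // 100K commands/day
const nonProductionCost = estimateMonthlyRedisCostUsd(50000); // 50K commands/day
const totalRedisCost = productionCost + nonProductionCost;
```

With the volumes estimated in this section, the three values come out to roughly $6, $3, and $9 per month, matching the totals above.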
Compute
Production Mini-Server (if needed):
- GCP e2-micro: $7/month
- Or Cloud Run: Pay-per-use (~$5/month)
Non-Production (Elastic Muscle):
- Already paid for
- No additional cost
Total Infrastructure: ~$15-20/month
Implementation Checklist
Immediate (Checkpoint 2)
- [ ] Sign up for Upstash
- [ ] Create prod and nonprod databases
- [ ] Add credentials to Doppler
- [ ] Update BatchJobRequest type with tier and namespace
- [ ] Modify QueueManager for Upstash
- [ ] Test basic two-tier operation
Short-term (Week 1)
- [ ] Implement priority logic in BatchWorker
- [ ] Add tier validation to API server
- [ ] Setup production monitoring
- [ ] Deploy workers to Elastic Muscle
- [ ] End-to-end testing
Medium-term (Week 2-3)
- [ ] Evaluate production mini-server need
- [ ] Implement access control
- [ ] Setup alerting
- [ ] Performance optimization
- [ ] Documentation
Decision Summary
✅ Approved Architecture:
- Two-tier: Production vs Non-Production
- Redis: Upstash (cloud managed)
- Namespacing: Flexible, app-defined
- Priority: Production always preempts non-production
- Production isolation: Possibly dedicated mini-server
Next Steps:
1. Setup Upstash databases
2. Update types and code for two-tier
3. Test and validate
4. Deploy to Checkpoint 2