SD-DEVOPS Usage Guide

Unified DevOps Platform for Singular Dream


Overview

SD-DEVOPS is a unified platform that consolidates all development operations tooling into a single application with consistent interfaces. It provides batch job processing, automated scheduling, AI-powered code fixing, and system monitoring through a cohesive architecture.

Access Methods:

  • CLI: Command-line interface for human operators
  • SDK: Programmatic TypeScript/JavaScript API
  • MCP: Model Context Protocol for AI agents (Antigravity)
  • REST API: HTTP endpoints for external integrations


Subsystems

1. Batch System

Purpose: Execute arbitrary jobs with priority queuing and tier isolation

Capabilities:

  • Submit jobs to priority queues (A=critical, B=high, C=normal, D=low)
  • Tier isolation (production vs non-production)
  • Namespace support for logical grouping
  • Job status tracking and cancellation
  • Queue statistics and monitoring

Functions:

  • submit(job) - Submit a job to the queue
  • status(jobId) - Get job execution status
  • list(filters) - List jobs with filtering
  • cancel(jobId) - Cancel a pending/running job
  • queueStats(tier, namespace) - Get queue depth statistics

Use Cases:

  • Overnight refactoring operations
  • Database migrations
  • Report generation
  • Bulk data processing
  • Long-running builds
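
The functions above are exposed through the SDK's BatchClient (see the SDK section below). As a minimal sketch of the two operations not demonstrated later, cancel and queueStats, assuming they map one-to-one onto client methods and that the return shapes commented here are assumptions:

import { BatchClient } from '@sd/devops';

const batch = new BatchClient('http://localhost:3001');

// Cancel a pending/running job by id (method name assumed from the function list above)
await batch.cancel('job-123');

// Get queue depth statistics for a tier/namespace (return shape is an assumption)
const stats = await batch.queueStats('nonprod', 'dev');
console.log(stats); // e.g. per-class queue depths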


2. Scheduler

Purpose: Time-based job orchestration with cron expressions

Capabilities:

  • Create reusable job templates
  • Schedule jobs with cron syntax
  • Timezone-aware scheduling
  • Pause/resume schedules
  • Automatic batch job creation

Functions:

  • createTemplate(template) - Create reusable job definition
  • createSchedule(cron, templateId) - Schedule a job
  • listSchedules(filters) - List active schedules
  • pauseSchedule(id) - Pause a schedule
  • resumeSchedule(id) - Resume a schedule
  • deleteSchedule(id) - Delete a schedule

Use Cases:

  • Nightly builds
  • Daily cleanup tasks
  • Weekly reports
  • Monthly maintenance
  • Recurring data synchronization
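
The SchedulerClient usage further below covers createTemplate, createSchedule, and pauseSchedule. As a short sketch of two remaining functions, assuming they map directly onto client methods and that the filter and return shapes shown are assumptions:

import { SchedulerClient } from '@sd/devops';

const scheduler = new SchedulerClient('http://localhost:3001');

// List active schedules (filter and return shapes are assumptions based on the CLI's --status flag)
const schedules = await scheduler.listSchedules({ status: 'active' });

// Delete a schedule that is no longer needed
await scheduler.deleteSchedule(schedules[0].id);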


3. aiAutoFix (Planned)

Purpose: AI-powered automated testing, fixing, and debugging

Capabilities:

  • Automated test execution
  • AI error analysis (Gemini)
  • Code fix generation and application
  • Build-test-fix loop orchestration
  • Integration with batch for long-running operations

Functions (To be implemented):

  • test(suite) - Run automated tests
  • analyze(errors) - AI-powered error analysis
  • fix(suggestions) - Apply AI-generated fixes
  • loop(maxIterations) - Run build-test-fix cycle

Use Cases:

  • Automated error fixing
  • Continuous code improvement
  • Test failure resolution
  • Build error debugging
  • Code quality enhancement
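
Because this subsystem is still planned, the following is purely illustrative: a hypothetical sketch of how the listed functions might compose into one fix cycle. The client shape, its types, and everything beyond the documented function names are assumptions, not an existing API.

// Hypothetical composition of the planned functions (nothing here exists yet)
async function runFixCycle(aiAutoFix: {
  test: (suite: string) => Promise<{ errors: string[] }>;
  analyze: (errors: string[]) => Promise<string[]>;
  fix: (suggestions: string[]) => Promise<void>;
}) {
  const { errors } = await aiAutoFix.test('unit');     // run automated tests
  if (errors.length === 0) return;                     // nothing to fix
  const suggestions = await aiAutoFix.analyze(errors); // AI error analysis (Gemini)
  await aiAutoFix.fix(suggestions);                    // apply generated fixes
}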


4. Monitoring (Planned)

Purpose: System health checks and alerting

Capabilities:

  • Service health monitoring
  • Queue depth tracking
  • Worker utilization metrics
  • Alert routing and notifications
  • Status dashboards

Functions (To be implemented):

  • health() - Check system health
  • metrics(subsystem) - Get subsystem metrics
  • alert(config) - Configure alerts

Use Cases:

  • System health monitoring
  • Performance tracking
  • Proactive alerting
  • Capacity planning
  • Incident response
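
As with aiAutoFix, this subsystem is not yet implemented; the sketch below is hypothetical and the client shape is an assumption based only on the function names above.

// Hypothetical usage of the planned monitoring functions (shape assumed)
async function checkBatchHealth(monitoring: {
  health: () => Promise<{ ok: boolean }>;
  metrics: (subsystem: string) => Promise<Record<string, number>>;
}) {
  const { ok } = await monitoring.health();               // overall system health
  const batchMetrics = await monitoring.metrics('batch'); // e.g. queue depth, worker utilization
  return { ok, batchMetrics };
}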


Access Methods

CLI (Command-Line Interface)

Installation:

cd apps/devops
pnpm install
pnpm build

Usage:

# Batch operations
sd-devops batch submit <script> --class <A|B|C|D> --tier <prod|nonprod>
sd-devops batch status <job-id>
sd-devops batch list --class A --status running
sd-devops batch cancel <job-id>
sd-devops batch queue-stats --tier nonprod

# Scheduler operations
sd-devops scheduler template create --name "Build" --script "build-all" --tier nonprod
sd-devops scheduler create --template <id> --cron "0 2 * * *"
sd-devops scheduler list --status active
sd-devops scheduler pause <schedule-id>
sd-devops scheduler resume <schedule-id>


SDK (Programmatic API)

Installation:

pnpm add @sd/devops

Usage:

import { BatchClient, SchedulerClient } from '@sd/devops';

// Batch operations
const batch = new BatchClient('http://localhost:3001');

await batch.submit({
  id: 'job-123',
  tier: 'nonprod',
  namespace: 'dev',
  script: 'refactor-overnight',
  class: 'C',
  args: ['--target', 'platform'],
});

const status = await batch.status('job-123');
const jobs = await batch.list({ class: 'A', status: 'running' });

// Scheduler operations
const scheduler = new SchedulerClient('http://localhost:3001');

const template = await scheduler.createTemplate({
  name: 'Nightly Build',
  script: 'build-all',
  tier: 'nonprod',
  namespace: 'platform',
  class: 'B',
});

const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: '0 2 * * *',  // 2 AM daily
  timezone: 'America/Chicago',
});

await scheduler.pauseSchedule(schedule.id);


MCP (Model Context Protocol)

Configuration (.cursor/mcp.json):

{
  "mcpServers": {
    "sd-devops": {
      "command": "node",
      "args": ["apps/devops/dist/mcp/index.js"]
    }
  }
}

Available Tools:

  • batch_submit - Submit a batch job
  • batch_status - Check job status
  • batch_list - List jobs
  • batch_cancel - Cancel a job
  • batch_queue_stats - Get queue statistics
  • scheduler_template_create - Create job template
  • scheduler_create - Create schedule
  • scheduler_list - List schedules
  • scheduler_pause - Pause schedule
  • scheduler_resume - Resume schedule

Usage (for AI agents):

// Antigravity can call these tools directly
await use_mcp_tool('sd-devops', 'batch_submit', {
  script: 'refactor-overnight',
  class: 'C',
  tier: 'nonprod',
  namespace: 'dev',
});

await use_mcp_tool('sd-devops', 'scheduler_create', {
  templateId: 'template-123',
  cron: '0 2 * * *',
});


REST API

Base URL: http://localhost:3001/api

Batch Endpoints:

# Submit job
POST /batch/submit
{
  "id": "job-123",
  "tier": "nonprod",
  "namespace": "dev",
  "script": "my-script",
  "class": "C"
}

# Get status
GET /batch/status/:jobId?tier=nonprod

# List jobs
GET /batch/list?tier=nonprod&namespace=dev&class=A

# Cancel job
DELETE /batch/cancel/:jobId?tier=nonprod

# Queue stats
GET /batch/queue-stats?tier=nonprod&namespace=dev
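
A minimal TypeScript sketch of calling the batch endpoints above with fetch; only Content-Type is set, so any authentication headers a given deployment requires are an assumption left out here:

const base = 'http://localhost:3001/api';

// Submit a job
await fetch(`${base}/batch/submit`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    id: 'job-123',
    tier: 'nonprod',
    namespace: 'dev',
    script: 'my-script',
    class: 'C',
  }),
});

// Check its status
const res = await fetch(`${base}/batch/status/job-123?tier=nonprod`);
const status = await res.json();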

Scheduler Endpoints:

# Create template
POST /scheduler/templates
{
  "name": "Build",
  "script": "build-all",
  "tier": "nonprod",
  "namespace": "platform",
  "class": "B"
}

# Create schedule
POST /scheduler/schedules
{
  "templateId": "template-123",
  "cron": "0 2 * * *",
  "timezone": "America/Chicago"
}

# List schedules
GET /scheduler/schedules?status=active

# Pause schedule
PATCH /scheduler/schedules/:id
{
  "status": "paused"
}


Inter-Module Relationships

Scheduler → Batch Integration

The scheduler creates batch jobs automatically when schedules trigger:

┌─────────────────┐
│   Scheduler     │
│   Engine        │
└────────┬────────┘
         │ Cron triggers
┌─────────────────┐
│  Create Batch   │
│  Job from       │
│  Template       │
└────────┬────────┘
         │
┌─────────────────┐
│  Batch Queue    │
│  (Redis)        │
└────────┬────────┘
         │
┌─────────────────┐
│  Batch Worker   │
│  Executes Job   │
└─────────────────┘

Example Flow:

  1. User creates template: "Nightly Build" → build-all script
  2. User creates schedule: "0 2 * * *" (2 AM daily)
  3. At 2 AM, the scheduler creates a BatchJobRequest from the template config
  4. Batch job is queued in Redis under nonprod:platform:queue:batch:class-B (see the key sketch below)
  5. Batch worker picks up the job and executes the build-all script
  6. Results are stored and the job status is updated
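
The queue key in step 4 follows a <tier>:<namespace>:queue:batch:class-<class> pattern. A small sketch of building such a key; the pattern is inferred from the single key shown above, so treat it as an assumption rather than a spec:

// Build a batch queue key from tier, namespace, and job class
// (format inferred from "nonprod:platform:queue:batch:class-B" above)
function batchQueueKey(tier: 'prod' | 'nonprod', namespace: string, jobClass: 'A' | 'B' | 'C' | 'D'): string {
  return `${tier}:${namespace}:queue:batch:class-${jobClass}`;
}

batchQueueKey('nonprod', 'platform', 'B'); // "nonprod:platform:queue:batch:class-B"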

aiAutoFix → Batch Integration (Planned)

Long-running AI operations will run as batch jobs:

┌─────────────────┐
│   aiAutoFix     │
│   Request       │
└────────┬────────┘
         │
┌─────────────────┐
│  Submit as      │
│  Batch Job      │
│  (class B)      │
└────────┬────────┘
         │
┌─────────────────┐
│  Batch Worker   │
│  Runs AI Loop   │
└─────────────────┘

Monitoring → All Subsystems (Planned)

Monitoring observes all subsystems:

┌─────────────────┐
│   Monitoring    │
│   Subsystem     │
└────────┬────────┘
         ├──────────► Batch (health, queue depth)
         ├──────────► Scheduler (active schedules)
         └──────────► aiAutoFix (fix success rate)

Shared Infrastructure

All subsystems use common infrastructure:

Queue Manager (Upstash Redis)

  • Unified queue operations
  • Tier-aware key prefixes
  • Atomic operations
  • Pub/sub for events

Job Executor

  • Single execution engine
  • Script sandboxing
  • Resource limits
  • Error handling

Event Bus (Redis Pub/Sub)

  • Cross-subsystem communication
  • Event-driven architecture
  • Loose coupling

Telemetry (Sentry)

  • Error tracking
  • Performance monitoring
  • Cron monitoring
  • Alert routing

Architecture Principles

1. Subsystem Modularity

Each subsystem is self-contained with clear boundaries:

  • Own types, core logic, SDK, CLI commands
  • Independent deployment
  • Isolated testing

2. Shared Infrastructure

Common services prevent duplication:

  • Single Redis instance
  • Unified job executor
  • Shared telemetry
  • Common configuration

3. Unified Interface

Consistent API across all access methods:

  • Same operations via CLI/SDK/MCP/API
  • Predictable naming: <subsystem>_<action>
  • Type-safe SDK with full IntelliSense

4. Event-Driven Communication

Subsystems communicate via events, as sketched after this list:

  • Scheduler triggers → Batch executes
  • aiAutoFix completes → Monitoring notifies
  • Loose coupling enables flexibility
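
A conceptual sketch of that event flow over Redis pub/sub. The ioredis client, connection string, channel name, and event payload shown here are assumptions for illustration; the real event bus wiring lives in the shared infrastructure:

import Redis from 'ioredis';

// Separate connections: a Redis connection in subscriber mode cannot issue other commands
const publisher = new Redis('redis://localhost:6379');  // connection string is illustrative
const subscriber = new Redis('redis://localhost:6379');

// Scheduler side: announce that a schedule fired (channel and payload are illustrative)
await publisher.publish('events:scheduler', JSON.stringify({ type: 'schedule.triggered', templateId: 'template-123' }));

// Batch side: react to scheduler events without a direct dependency on the scheduler
await subscriber.subscribe('events:scheduler');
subscriber.on('message', (_channel, message) => {
  const event = JSON.parse(message);
  if (event.type === 'schedule.triggered') {
    // submit a batch job built from the template referenced in the event
  }
});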


Deployment

Local Development

cd apps/devops
pnpm install
pnpm dev:api      # Start batch API
pnpm dev:worker   # Start batch worker
pnpm dev:scheduler # Start scheduler

Production (PM2)

# Deploy to dev server
./scripts/deploy-batch-system.sh

# Services managed by PM2:
# - batch-api (port 3001)
# - batch-worker-nonprod (2 instances)
# - scheduler (1 instance)

Environment Variables

# Upstash Redis
UPSTASH_REDIS_PROD_URL=https://...
UPSTASH_REDIS_PROD_TOKEN=...
UPSTASH_REDIS_NONPROD_URL=https://...
UPSTASH_REDIS_NONPROD_TOKEN=...

# Sentry
SENTRY_DSN=https://...

# Configuration
BATCH_TIER=nonprod
BATCH_NAMESPACE=platform
PORT=3001
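
A small sketch of how a worker might pick its Redis credentials from these variables based on BATCH_TIER; the selection logic is an assumption, only the variable names come from the list above:

// Choose the Upstash credentials matching the configured tier (selection logic assumed)
const tier = process.env.BATCH_TIER ?? 'nonprod';
const redisUrl =
  tier === 'prod' ? process.env.UPSTASH_REDIS_PROD_URL : process.env.UPSTASH_REDIS_NONPROD_URL;
const redisToken =
  tier === 'prod' ? process.env.UPSTASH_REDIS_PROD_TOKEN : process.env.UPSTASH_REDIS_NONPROD_TOKEN;

if (!redisUrl || !redisToken) {
  throw new Error(`Missing Upstash credentials for tier "${tier}"`);
}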

Examples

Example 1: Scheduled Overnight Build

// 1. Create template
const template = await scheduler.createTemplate({
  name: 'Full Platform Build',
  script: 'build-all',
  tier: 'nonprod',
  namespace: 'platform',
  class: 'B',
  args: ['--clean', '--verbose'],
});

// 2. Schedule for 2 AM daily
const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: '0 2 * * *',
  timezone: 'America/Chicago',
});

// 3. Scheduler automatically creates batch job at 2 AM
// 4. Batch worker executes build-all script
// 5. Results available via batch.status(jobId)

Example 2: On-Demand Refactoring

// Submit immediate batch job
const job = await batch.submit({
  id: `refactor-${Date.now()}`,
  tier: 'nonprod',
  namespace: 'dev',
  script: 'refactor-codebase',
  class: 'C',
  args: ['--target', 'auth-module'],
});

// Poll for completion (also stop on failure so the loop cannot spin forever;
// the 'failed' status value is an assumption alongside the documented 'completed')
while (true) {
  const status = await batch.status(job.id);
  if (status.status === 'completed' || status.status === 'failed') break;
  await new Promise(r => setTimeout(r, 5000));
}

Example 3: AI-Powered Fix (Future)

// Run build-test-fix loop as batch job
const job = await batch.submit({
  id: `autofix-${Date.now()}`,
  tier: 'nonprod',
  namespace: 'test',
  script: 'aiautofix/build-test-fix-loop',
  class: 'B',
  args: ['--max-iterations', '10'],
});

// AI will:
// 1. Run tests
// 2. Analyze errors with Gemini
// 3. Generate fixes
// 4. Apply and verify
// 5. Repeat until clean or max iterations

Best Practices

Job Classification

  • Class A (Critical): Production hotfixes, urgent deployments
  • Class B (High): Scheduled builds, important migrations
  • Class C (Normal): Regular refactoring, reports
  • Class D (Low): Cleanup, optimization, non-urgent tasks

Tier Usage

  • Production: Live system operations, real data
  • Non-Production: Development, testing, staging

Namespace Organization

  • platform - Main application
  • dev - Development experiments
  • test - Automated testing
  • ops - Operational tasks

Error Handling

  • All operations return structured results
  • Use try/catch for SDK calls (see the sketch after this list)
  • Check job status for completion
  • Monitor Sentry for system errors
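
Expanding on the try/catch guidance above, a minimal sketch of wrapping an SDK call; the error shape is whatever the SDK throws, so the handling here is generic rather than SDK-specific:

import { BatchClient } from '@sd/devops';

const batch = new BatchClient('http://localhost:3001');

try {
  const status = await batch.status('job-123');
  if (status.status !== 'completed') {
    // job is still pending/running (or has failed); decide whether to wait, retry, or alert
  }
} catch (error) {
  // no structured result came back; surface the failure (Sentry captures system errors separately)
  console.error('batch.status failed', error);
}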

Troubleshooting

Job Not Executing

  1. Check queue stats: sd-devops batch queue-stats --tier nonprod
  2. Verify workers running: pm2 list
  3. Check worker logs: pm2 logs batch-worker-nonprod

Schedule Not Triggering

  1. Verify schedule status: sd-devops scheduler list
  2. Check scheduler logs: pm2 logs scheduler
  3. Validate cron expression: Use crontab.guru

Connection Errors

  1. Verify Upstash credentials in Doppler
  2. Check Redis connectivity
  3. Ensure services are running

Future Roadmap

Phase 1: Current ✅

  • Batch system
  • Scheduler
  • Unified SDK/CLI

Phase 2: In Progress

  • aiAutoFix migration
  • MCP server completion
  • Monitoring subsystem

Phase 3: Planned

  • Production tier workers
  • Advanced scheduling (dependencies, retries)
  • Real-time dashboard
  • Slack/email notifications
  • Job templates library

For implementation details, see ARCHITECTURE.md