SD-DEVOPS Usage Guide

Unified DevOps Platform for Singular Dream


Overview

SD-DEVOPS is a unified platform that consolidates all development operations tooling into a single application with consistent interfaces. It provides batch job processing, automated scheduling, AI-powered code fixing, and system monitoring through a cohesive architecture.

Access Methods:

  • CLI: Command-line interface for human operators
  • SDK: Programmatic TypeScript/JavaScript API
  • MCP: Model Context Protocol for AI agents (Antigravity)
  • REST API: HTTP endpoints for external integrations

Subsystems

1. Batch System

Purpose: Execute arbitrary jobs with priority queuing and tier isolation

Capabilities:

  • Submit jobs to priority queues (A=critical, B=high, C=normal, D=low)
  • Tier isolation (production vs non-production)
  • Namespace support for logical grouping
  • Job status tracking and cancellation
  • Queue statistics and monitoring

Functions:

  • submit(job) - Submit a job to the queue
  • status(jobId) - Get job execution status
  • list(filters) - List jobs with filtering
  • cancel(jobId) - Cancel a pending/running job
  • queueStats(tier, namespace) - Get queue depth statistics

Use Cases:

  • Overnight refactoring operations
  • Database migrations
  • Report generation
  • Bulk data processing
  • Long-running builds

2. Scheduler

Purpose: Time-based job orchestration with cron expressions

Capabilities:

  • Create reusable job templates
  • Schedule jobs with cron syntax
  • Timezone-aware scheduling
  • Pause/resume schedules
  • Automatic batch job creation

Functions:

  • createTemplate(template) - Create reusable job definition
  • createSchedule(cron, templateId) - Schedule a job
  • listSchedules(filters) - List active schedules
  • pauseSchedule(id) - Pause a schedule
  • resumeSchedule(id) - Resume a schedule
  • deleteSchedule(id) - Delete a schedule

Use Cases:

  • Nightly builds
  • Daily cleanup tasks
  • Weekly reports
  • Monthly maintenance
  • Recurring data synchronization

3. aiAutoFix (Planned)

Purpose: AI-powered automated testing, fixing, and debugging

Capabilities:

  • Automated test execution
  • AI error analysis (Gemini)
  • Code fix generation and application
  • Build-test-fix loop orchestration
  • Integration with batch for long-running operations

Functions (To be implemented):

  • test(suite) - Run automated tests
  • analyze(errors) - AI-powered error analysis
  • fix(suggestions) - Apply AI-generated fixes
  • loop(maxIterations) - Run build-test-fix cycle

Use Cases:

  • Automated error fixing
  • Continuous code improvement
  • Test failure resolution
  • Build error debugging
  • Code quality enhancement

4. Monitoring (Planned)

Purpose: System health checks and alerting

Capabilities:

  • Service health monitoring
  • Queue depth tracking
  • Worker utilization metrics
  • Alert routing and notifications
  • Status dashboards

Functions (To be implemented):

  • health() - Check system health
  • metrics(subsystem) - Get subsystem metrics
  • alert(config) - Configure alerts

Use Cases:

  • System health monitoring
  • Performance tracking
  • Proactive alerting
  • Capacity planning
  • Incident response

Access Methods

CLI (Command-Line Interface)

Installation:

cd apps/devops
pnpm install
pnpm build

Usage:

# Batch operations
sd-devops batch submit <script> --class <A|B|C|D> --tier <prod|nonprod>
sd-devops batch status <job-id>
sd-devops batch list --class A --status running
sd-devops batch cancel <job-id>
sd-devops batch queue-stats --tier nonprod

# Scheduler operations
sd-devops scheduler template create --name "Build" --script "build-all" --tier nonprod
sd-devops scheduler create --template <id> --cron "0 2 * * *"
sd-devops scheduler list --status active
sd-devops scheduler pause <schedule-id>
sd-devops scheduler resume <schedule-id>

SDK (Programmatic API)

Installation:

pnpm add @sd/devops

Usage:

import { BatchClient, SchedulerClient } from "@sd/devops";

// Batch operations
const batch = new BatchClient("http://localhost:3001");

await batch.submit({
  id: "job-123",
  tier: "nonprod",
  namespace: "dev",
  script: "refactor-overnight",
  class: "C",
  args: ["--target", "platform"],
});

const status = await batch.status("job-123");
const jobs = await batch.list({ class: "A", status: "running" });

// Scheduler operations
const scheduler = new SchedulerClient("http://localhost:3001");

const template = await scheduler.createTemplate({
  name: "Nightly Build",
  script: "build-all",
  tier: "nonprod",
  namespace: "platform",
  class: "B",
});

const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: "0 2 * * *", // 2 AM daily
  timezone: "America/Chicago",
});

await scheduler.pauseSchedule(schedule.id);

MCP (Model Context Protocol)

Configuration (.cursor/mcp.json):

{
  "mcpServers": {
    "sd-devops": {
      "command": "node",
      "args": ["apps/devops/dist/mcp/index.js"]
    }
  }
}

Available Tools:

  • batch_submit - Submit a batch job
  • batch_status - Check job status
  • batch_list - List jobs
  • batch_cancel - Cancel a job
  • batch_queue_stats - Get queue statistics
  • scheduler_template_create - Create job template
  • scheduler_create - Create schedule
  • scheduler_list - List schedules
  • scheduler_pause - Pause schedule
  • scheduler_resume - Resume schedule

Usage (for AI agents):

// Antigravity can call these tools directly
await use_mcp_tool("sd-devops", "batch_submit", {
  script: "refactor-overnight",
  class: "C",
  tier: "nonprod",
  namespace: "dev",
});

await use_mcp_tool("sd-devops", "scheduler_create", {
  templateId: "template-123",
  cron: "0 2 * * *",
});

REST API

Base URL: http://localhost:3001/api

Batch Endpoints:

# Submit job
POST /batch/submit
{
  "id": "job-123",
  "tier": "nonprod",
  "namespace": "dev",
  "script": "my-script",
  "class": "C"
}

# Get status
GET /batch/status/:jobId?tier=nonprod

# List jobs
GET /batch/list?tier=nonprod&namespace=dev&class=A

# Cancel job
DELETE /batch/cancel/:jobId?tier=nonprod

# Queue stats
GET /batch/queue-stats?tier=nonprod&namespace=dev
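The endpoints above can be exercised from TypeScript with small request builders. The sketch below only constructs URLs and JSON bodies from the routes shown; the helper names (`submitRequest`, `listJobsUrl`, `HttpRequest`) are assumptions for this illustration, not part of the documented SDK.

```typescript
// Minimal, illustrative request builders for the batch REST routes above.
// They build URLs/bodies only; sending them (e.g. via fetch) is left to the caller.

const BASE = "http://localhost:3001/api";

interface BatchJobRequest {
  id: string;
  tier: "prod" | "nonprod";
  namespace: string;
  script: string;
  class: "A" | "B" | "C" | "D";
  args?: string[];
}

// Simple request shape used by this sketch (not a documented type).
interface HttpRequest {
  url: string;
  method: "GET" | "POST" | "DELETE";
  body?: string;
}

// POST /batch/submit with the job as the JSON body
function submitRequest(job: BatchJobRequest): HttpRequest {
  return {
    url: `${BASE}/batch/submit`,
    method: "POST",
    body: JSON.stringify(job),
  };
}

// GET /batch/list with filters encoded as query parameters
function listJobsUrl(filters: Record<string, string>): string {
  return `${BASE}/batch/list?${new URLSearchParams(filters)}`;
}
```

Keeping the builders pure makes them easy to unit-test without a running API server.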

Scheduler Endpoints:

# Create template
POST /scheduler/templates
{
  "name": "Build",
  "script": "build-all",
  "tier": "nonprod",
  "namespace": "platform",
  "class": "B"
}

# Create schedule
POST /scheduler/schedules
{
  "templateId": "template-123",
  "cron": "0 2 * * *",
  "timezone": "America/Chicago"
}

# List schedules
GET /scheduler/schedules?status=active

# Pause schedule
PATCH /scheduler/schedules/:id
{
  "status": "paused"
}

Inter-Module Relationships

Scheduler → Batch Integration

The scheduler creates batch jobs automatically when schedules trigger:

┌─────────────────┐
│   Scheduler     │
│   Engine        │
└────────┬────────┘
         │ Cron triggers
         ▼
┌─────────────────┐
│  Create Batch   │
│  Job from       │
│  Template       │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Batch Queue    │
│  (Redis)        │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Batch Worker   │
│  Executes Job   │
└─────────────────┘

Example Flow:

  1. User creates template: "Nightly Build" → build-all script
  2. User creates schedule: "0 2 * * *" (2 AM daily)
  3. At 2 AM, scheduler creates: BatchJobRequest with template config
  4. Batch job queued in Redis: nonprod:platform:queue:batch:class-B
  5. Batch worker picks up job and executes build-all script
  6. Results stored and status updated
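The queue key in step 4 follows a `tier:namespace:queue:batch:class-<X>` pattern. A hypothetical helper that builds such keys is sketched below; the key scheme is inferred from this example flow, not a documented contract.

```typescript
// Builds a Redis queue key matching the pattern in step 4,
// e.g. "nonprod:platform:queue:batch:class-B".
// The scheme is inferred from the example flow above.

type Tier = "prod" | "nonprod";
type JobClass = "A" | "B" | "C" | "D";

function batchQueueKey(tier: Tier, namespace: string, jobClass: JobClass): string {
  return `${tier}:${namespace}:queue:batch:class-${jobClass}`;
}
```

Centralizing key construction in one helper keeps producers (scheduler) and consumers (workers) agreeing on the same layout.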

aiAutoFix → Batch Integration (Planned)

Long-running AI operations will run as batch jobs:

┌─────────────────┐
│   aiAutoFix     │
│   Request       │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Submit as      │
│  Batch Job      │
│  (class B)      │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Batch Worker   │
│  Runs AI Loop   │
└─────────────────┘

Monitoring → All Subsystems (Planned)

Monitoring observes all subsystems:

┌─────────────────┐
│   Monitoring    │
│   Subsystem     │
└────────┬────────┘
         ├──────────► Batch (health, queue depth)
         ├──────────► Scheduler (active schedules)
         └──────────► aiAutoFix (fix success rate)

Shared Infrastructure

All subsystems use common infrastructure:

Queue Manager (Upstash Redis)

  • Unified queue operations
  • Tier-aware key prefixes
  • Atomic operations
  • Pub/sub for events

Job Executor

  • Single execution engine
  • Script sandboxing
  • Resource limits
  • Error handling

Event Bus (Redis Pub/Sub)

  • Cross-subsystem communication
  • Event-driven architecture
  • Loose coupling

Telemetry (Google Cloud Monitoring & Logging)

  • Log Ingestion: Automated JSON ingestion via Google Cloud Ops Agent.
  • Error Discovery: Native Google Cloud Error Reporting (Standard 119).
  • Service Performance: Cloud Monitoring metrics (CPU, Memory, Latency).
  • Alert Routing: GCP Alert Policies connected to Incident Response channels.

Architecture Principles

1. Subsystem Modularity

Each subsystem is self-contained with clear boundaries:

  • Own types, core logic, SDK, CLI commands
  • Independent deployment
  • Isolated testing

2. Shared Infrastructure

Common services prevent duplication:

  • Single Redis instance
  • Unified job executor
  • Shared telemetry
  • Common configuration

3. Unified Interface

Consistent API across all access methods:

  • Same operations via CLI/SDK/MCP/API
  • Predictable naming: <subsystem>_<action>
  • Type-safe SDK with full IntelliSense

4. Event-Driven Communication

Subsystems communicate via events:

  • Scheduler triggers → Batch executes
  • aiAutoFix completes → Monitoring notifies
  • Loose coupling enables flexibility

Deployment

Local Development

cd apps/devops
pnpm install
pnpm dev:api      # Start batch API
pnpm dev:worker   # Start batch worker
pnpm dev:scheduler # Start scheduler

Production (PM2)

# Deploy to dev server
./scripts/deploy-batch-system.sh

# Services managed by PM2:
# - batch-api (port 3001)
# - batch-worker-nonprod (2 instances)
# - scheduler (1 instance)

Environment Variables

# Upstash Redis
UPSTASH_REDIS_PROD_URL=https://...
UPSTASH_REDIS_PROD_TOKEN=...
UPSTASH_REDIS_NONPROD_URL=https://...
UPSTASH_REDIS_NONPROD_TOKEN=...

# Google Cloud
GCLOUD_PROJECT=singular-dream
GCLOUD_REGION=us-central1

# Configuration
BATCH_TIER=nonprod
BATCH_NAMESPACE=platform
PORT=3001
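A service can load these variables with a small validated loader. The sketch below is illustrative: the defaults (`nonprod`, `platform`, `3001`) mirror the values shown above, but the validation rules are assumptions, not documented behavior.

```typescript
// Illustrative loader for the configuration variables above.
// Defaults mirror the example values; validation is an assumption.

interface DevopsConfig {
  tier: "prod" | "nonprod";
  namespace: string;
  port: number;
}

function loadConfig(env: Record<string, string | undefined>): DevopsConfig {
  const tier = env.BATCH_TIER ?? "nonprod";
  if (tier !== "prod" && tier !== "nonprod") {
    throw new Error(`Invalid BATCH_TIER: ${tier}`);
  }
  return {
    tier,
    namespace: env.BATCH_NAMESPACE ?? "platform",
    port: Number(env.PORT ?? 3001),
  };
}

// Typically called once at startup: loadConfig(process.env)
```

Failing fast on an invalid tier prevents a worker from silently draining the wrong queue.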

Examples

Example 1: Scheduled Overnight Build

// 1. Create template
const template = await scheduler.createTemplate({
  name: "Full Platform Build",
  script: "build-all",
  tier: "nonprod",
  namespace: "platform",
  class: "B",
  args: ["--clean", "--verbose"],
});

// 2. Schedule for 2 AM daily
const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: "0 2 * * *",
  timezone: "America/Chicago",
});

// 3. Scheduler automatically creates batch job at 2 AM
// 4. Batch worker executes build-all script
// 5. Results available via batch.status(jobId)

Example 2: On-Demand Refactoring

// Submit immediate batch job
const job = await batch.submit({
  id: `refactor-${Date.now()}`,
  tier: "nonprod",
  namespace: "dev",
  script: "refactor-codebase",
  class: "C",
  args: ["--target", "auth-module"],
});

// Poll until the job reaches a terminal state
// (assumes "failed" is also a terminal status)
while (true) {
  const status = await batch.status(job.id);
  if (status.status === "completed") break;
  if (status.status === "failed") throw new Error(`Job ${job.id} failed`);
  await new Promise((r) => setTimeout(r, 5000)); // check every 5 seconds
}

Example 3: AI-Powered Fix (Future)

// Run build-test-fix loop as batch job
const job = await batch.submit({
  id: `autofix-${Date.now()}`,
  tier: "nonprod",
  namespace: "test",
  script: "aiautofix/build-test-fix-loop",
  class: "B",
  args: ["--max-iterations", "10"],
});

// AI will:
// 1. Run tests
// 2. Analyze errors with Gemini
// 3. Generate fixes
// 4. Apply and verify
// 5. Repeat until clean or max iterations

Best Practices

Job Classification

  • Class A (Critical): Production hotfixes, urgent deployments
  • Class B (High): Scheduled builds, important migrations
  • Class C (Normal): Regular refactoring, reports
  • Class D (Low): Cleanup, optimization, non-urgent tasks
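The mapping above can be encoded so call sites pick a class by intent rather than by letter. The helper below is hypothetical, not part of the SDK; it simply mirrors the table.

```typescript
// Maps the urgency levels described above to batch queue classes.
// Purely illustrative; the mapping mirrors the classification table.

type Urgency = "critical" | "high" | "normal" | "low";
type JobClass = "A" | "B" | "C" | "D";

const CLASS_FOR_URGENCY: Record<Urgency, JobClass> = {
  critical: "A", // production hotfixes, urgent deployments
  high: "B",     // scheduled builds, important migrations
  normal: "C",   // regular refactoring, reports
  low: "D",      // cleanup, optimization, non-urgent tasks
};

function classFor(urgency: Urgency): JobClass {
  return CLASS_FOR_URGENCY[urgency];
}
```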

Tier Usage

  • Production: Live system operations, real data
  • Non-Production: Development, testing, staging

Namespace Organization

  • platform - Main application
  • dev - Development experiments
  • test - Automated testing
  • ops - Operational tasks

Error Handling

  • All operations return structured results
  • Use try/catch for SDK calls
  • Check job status for completion
  • Monitor Google Cloud Error Reporting for system-level exceptions

Troubleshooting

Job Not Executing

  1. Check queue stats: sd-devops batch queue-stats --tier nonprod
  2. Verify workers running: pm2 list
  3. Check worker logs: pm2 logs batch-worker-nonprod

Schedule Not Triggering

  1. Verify schedule status: sd-devops scheduler list
  2. Check scheduler logs: pm2 logs scheduler
  3. Validate cron expression: Use crontab.guru
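As a quick sanity check before reaching for crontab.guru, a rough structural validator for five-field cron expressions can be sketched. This only checks field count and allowed characters, not full cron semantics (ranges, steps, names), and is not part of the scheduler itself.

```typescript
// Rough structural check for a standard five-field cron expression
// (minute hour day-of-month month day-of-week). Verifies only the
// field count and allowed characters — not ranges or step semantics.

function looksLikeCron(expr: string): boolean {
  const fields = expr.trim().split(/\s+/);
  if (fields.length !== 5) return false;
  return fields.every((f) => /^[\d*,\-\/]+$/.test(f));
}
```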

Connection Errors

  1. Verify Upstash credentials in Doppler
  2. Check Redis connectivity
  3. Ensure services are running

Future Roadmap

Phase 1: Current ✅

  • Batch system
  • Scheduler
  • Unified SDK/CLI

Phase 2: In Progress

  • aiAutoFix migration
  • MCP server completion
  • Monitoring subsystem

Phase 3: Planned

  • Production tier workers
  • Advanced scheduling (dependencies, retries)
  • Real-time dashboard
  • Slack/email notifications
  • Job templates library

For implementation details, see ARCHITECTURE.md

Version History

Version  Date        Author       Change
0.1.0    2026-01-26  Antigravity  Initial Audit & Metadata Injection