SD-DEVOPS Usage Guide

Unified DevOps Platform for Singular Dream


Overview

SD-DEVOPS is a unified platform that consolidates all development operations tooling into a single application with consistent interfaces. It provides batch job processing, automated scheduling, AI-powered code fixing, and system monitoring through a cohesive architecture.

Access Methods:

  • CLI: Command-line interface for human operators
  • SDK: Programmatic TypeScript/JavaScript API
  • MCP: Model Context Protocol for AI agents (Antigravity)
  • REST API: HTTP endpoints for external integrations

Subsystems

1. Batch System

Purpose: Execute arbitrary jobs with priority queuing and tier isolation

Capabilities:

  • Submit jobs to priority queues (A=critical, B=high, C=normal, D=low)
  • Tier isolation (production vs non-production)
  • Namespace support for logical grouping
  • Job status tracking and cancellation
  • Queue statistics and monitoring

Functions:

  • submit(job) - Submit a job to the queue
  • status(jobId) - Get job execution status
  • list(filters) - List jobs with filtering
  • cancel(jobId) - Cancel a pending/running job
  • queueStats(tier, namespace) - Get queue depth statistics

Use Cases:

  • Overnight refactoring operations
  • Database migrations
  • Report generation
  • Bulk data processing
  • Long-running builds

2. Scheduler

Purpose: Time-based job orchestration with cron expressions

Capabilities:

  • Create reusable job templates
  • Schedule jobs with cron syntax
  • Timezone-aware scheduling
  • Pause/resume schedules
  • Automatic batch job creation

Functions:

  • createTemplate(template) - Create reusable job definition
  • createSchedule(cron, templateId) - Schedule a job
  • listSchedules(filters) - List active schedules
  • pauseSchedule(id) - Pause a schedule
  • resumeSchedule(id) - Resume a schedule
  • deleteSchedule(id) - Delete a schedule

Use Cases:

  • Nightly builds
  • Daily cleanup tasks
  • Weekly reports
  • Monthly maintenance
  • Recurring data synchronization

3. aiAutoFix (Planned)

Purpose: AI-powered automated testing, fixing, and debugging

Capabilities:

  • Automated test execution
  • AI error analysis (Gemini)
  • Code fix generation and application
  • Build-test-fix loop orchestration
  • Integration with batch for long-running operations

Functions (To be implemented):

  • test(suite) - Run automated tests
  • analyze(errors) - AI-powered error analysis
  • fix(suggestions) - Apply AI-generated fixes
  • loop(maxIterations) - Run build-test-fix cycle

Use Cases:

  • Automated error fixing
  • Continuous code improvement
  • Test failure resolution
  • Build error debugging
  • Code quality enhancement

4. Monitoring (Planned)

Purpose: System health checks and alerting

Capabilities:

  • Service health monitoring
  • Queue depth tracking
  • Worker utilization metrics
  • Alert routing and notifications
  • Status dashboards

Functions (To be implemented):

  • health() - Check system health
  • metrics(subsystem) - Get subsystem metrics
  • alert(config) - Configure alerts

Use Cases:

  • System health monitoring
  • Performance tracking
  • Proactive alerting
  • Capacity planning
  • Incident response

Access Methods

CLI (Command-Line Interface)

Installation:

cd apps/devops
pnpm install
pnpm build

Usage:

# Batch operations
sd-devops batch submit <script> --class <A|B|C|D> --tier <prod|nonprod>
sd-devops batch status <job-id>
sd-devops batch list --class A --status running
sd-devops batch cancel <job-id>
sd-devops batch queue-stats --tier nonprod

# Scheduler operations
sd-devops scheduler template create --name "Build" --script "build-all" --tier nonprod
sd-devops scheduler create --template <id> --cron "0 2 * * *"
sd-devops scheduler list --status active
sd-devops scheduler pause <schedule-id>
sd-devops scheduler resume <schedule-id>

SDK (Programmatic API)

Installation:

pnpm add @sd/devops

Usage:

import { BatchClient, SchedulerClient } from "@sd/devops";

// Batch operations
const batch = new BatchClient("http://localhost:3001");

await batch.submit({
  id: "job-123",
  tier: "nonprod",
  namespace: "dev",
  script: "refactor-overnight",
  class: "C",
  args: ["--target", "platform"],
});

const status = await batch.status("job-123");
const jobs = await batch.list({ class: "A", status: "running" });

// Scheduler operations
const scheduler = new SchedulerClient("http://localhost:3001");

const template = await scheduler.createTemplate({
  name: "Nightly Build",
  script: "build-all",
  tier: "nonprod",
  namespace: "platform",
  class: "B",
});

const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: "0 2 * * *", // 2 AM daily
  timezone: "America/Chicago",
});

await scheduler.pauseSchedule(schedule.id);

MCP (Model Context Protocol)

Configuration (.cursor/mcp.json):

{
  "mcpServers": {
    "sd-devops": {
      "command": "node",
      "args": ["apps/devops/dist/mcp/index.js"]
    }
  }
}

Available Tools:

  • batch_submit - Submit a batch job
  • batch_status - Check job status
  • batch_list - List jobs
  • batch_cancel - Cancel a job
  • batch_queue_stats - Get queue statistics
  • scheduler_template_create - Create job template
  • scheduler_create - Create schedule
  • scheduler_list - List schedules
  • scheduler_pause - Pause schedule
  • scheduler_resume - Resume schedule

Usage (for AI agents):

// Antigravity can call these tools directly
await use_mcp_tool("sd-devops", "batch_submit", {
  script: "refactor-overnight",
  class: "C",
  tier: "nonprod",
  namespace: "dev",
});

await use_mcp_tool("sd-devops", "scheduler_create", {
  templateId: "template-123",
  cron: "0 2 * * *",
});

REST API

Base URL: http://localhost:3001/api

Batch Endpoints:

# Submit job
POST /batch/submit
{
  "id": "job-123",
  "tier": "nonprod",
  "namespace": "dev",
  "script": "my-script",
  "class": "C"
}

# Get status
GET /batch/status/:jobId?tier=nonprod

# List jobs
GET /batch/list?tier=nonprod&namespace=dev&class=A

# Cancel job
DELETE /batch/cancel/:jobId?tier=nonprod

# Queue stats
GET /batch/queue-stats?tier=nonprod&namespace=dev
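The endpoints above can be exercised from TypeScript with small request builders. The sketch below only constructs URLs and JSON bodies from the routes shown; the helper names (`submitRequest`, `listJobsUrl`, `HttpRequest`) are assumptions for this illustration, not part of the documented SDK.

```typescript
// Minimal, illustrative request builders for the batch REST routes above.
// They build URLs/bodies only; sending them (e.g. via fetch) is left to the caller.

const BASE = "http://localhost:3001/api";

interface BatchJobRequest {
  id: string;
  tier: "prod" | "nonprod";
  namespace: string;
  script: string;
  class: "A" | "B" | "C" | "D";
  args?: string[];
}

// Simple request shape used by this sketch (not a documented type).
interface HttpRequest {
  url: string;
  method: "GET" | "POST" | "DELETE";
  body?: string;
}

// POST /batch/submit with the job as the JSON body
function submitRequest(job: BatchJobRequest): HttpRequest {
  return {
    url: `${BASE}/batch/submit`,
    method: "POST",
    body: JSON.stringify(job),
  };
}

// GET /batch/list with filters encoded as query parameters
function listJobsUrl(filters: Record<string, string>): string {
  return `${BASE}/batch/list?${new URLSearchParams(filters)}`;
}
```

Keeping the builders pure makes them easy to unit-test without a running API server.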

Scheduler Endpoints:

# Create template
POST /scheduler/templates
{
  "name": "Build",
  "script": "build-all",
  "tier": "nonprod",
  "namespace": "platform",
  "class": "B"
}

# Create schedule
POST /scheduler/schedules
{
  "templateId": "template-123",
  "cron": "0 2 * * *",
  "timezone": "America/Chicago"
}

# List schedules
GET /scheduler/schedules?status=active

# Pause schedule
PATCH /scheduler/schedules/:id
{
  "status": "paused"
}

Inter-Module Relationships

Scheduler → Batch Integration

The scheduler creates batch jobs automatically when schedules trigger:

┌─────────────────┐
│   Scheduler     │
│   Engine        │
└────────┬────────┘
         │ Cron triggers
         ▼
┌─────────────────┐
│  Create Batch   │
│  Job from       │
│  Template       │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Batch Queue    │
│  (Redis)        │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Batch Worker   │
│  Executes Job   │
└─────────────────┘

Example Flow:

  1. User creates template: "Nightly Build" → build-all script
  2. User creates schedule: "0 2 * * *" (2 AM daily)
  3. At 2 AM, scheduler creates: BatchJobRequest with template config
  4. Batch job queued in Redis: nonprod:platform:queue:batch:class-B
  5. Batch worker picks up job and executes build-all script
  6. Results stored and status updated
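The queue key in step 4 follows a `tier:namespace:queue:batch:class-<X>` pattern. A hypothetical helper that builds such keys is sketched below; the key scheme is inferred from this example flow, not a documented contract.

```typescript
// Builds a Redis queue key matching the pattern in step 4,
// e.g. "nonprod:platform:queue:batch:class-B".
// The scheme is inferred from the example flow above.

type Tier = "prod" | "nonprod";
type JobClass = "A" | "B" | "C" | "D";

function batchQueueKey(tier: Tier, namespace: string, jobClass: JobClass): string {
  return `${tier}:${namespace}:queue:batch:class-${jobClass}`;
}
```

Centralizing key construction in one helper keeps producers (scheduler) and consumers (workers) agreeing on the same layout.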

aiAutoFix → Batch Integration (Planned)

Long-running AI operations will run as batch jobs:

┌─────────────────┐
│   aiAutoFix     │
│   Request       │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Submit as      │
│  Batch Job      │
│  (class B)      │
└────────┬────────┘
         ▼
┌─────────────────┐
│  Batch Worker   │
│  Runs AI Loop   │
└─────────────────┘

Monitoring → All Subsystems (Planned)

Monitoring observes all subsystems:

┌─────────────────┐
│   Monitoring    │
│   Subsystem     │
└────────┬────────┘
         ├──────────► Batch (health, queue depth)
         ├──────────► Scheduler (active schedules)
         └──────────► aiAutoFix (fix success rate)

Shared Infrastructure

All subsystems use common infrastructure:

Queue Manager (Upstash Redis)

  • Unified queue operations
  • Tier-aware key prefixes
  • Atomic operations
  • Pub/sub for events

Job Executor

  • Single execution engine
  • Script sandboxing
  • Resource limits
  • Error handling

Event Bus (Redis Pub/Sub)

  • Cross-subsystem communication
  • Event-driven architecture
  • Loose coupling

Telemetry (Google Cloud Monitoring & Logging)

  • Log Ingestion: Automated JSON ingestion via Google Cloud Ops Agent.
  • Error Discovery: Native Google Cloud Error Reporting (Standard 119).
  • Service Performance: Cloud Monitoring metrics (CPU, Memory, Latency).
  • Alert Routing: GCP Alert Policies connected to Incident Response channels.

Architecture Principles

1. Subsystem Modularity

Each subsystem is self-contained with clear boundaries:

  • Own types, core logic, SDK, CLI commands
  • Independent deployment
  • Isolated testing

2. Shared Infrastructure

Common services prevent duplication:

  • Single Redis instance
  • Unified job executor
  • Shared telemetry
  • Common configuration

3. Unified Interface

Consistent API across all access methods:

  • Same operations via CLI/SDK/MCP/API
  • Predictable naming: <subsystem>_<action>
  • Type-safe SDK with full IntelliSense

4. Event-Driven Communication

Subsystems communicate via events:

  • Scheduler triggers → Batch executes
  • aiAutoFix completes → Monitoring notifies
  • Loose coupling enables flexibility

Deployment

Local Development

cd apps/devops
pnpm install
pnpm dev:api      # Start batch API
pnpm dev:worker   # Start batch worker
pnpm dev:scheduler # Start scheduler

Production (PM2)

# Deploy to dev server
./scripts/deploy-batch-system.sh

# Services managed by PM2:
# - batch-api (port 3001)
# - batch-worker-nonprod (2 instances)
# - scheduler (1 instance)

Environment Variables

# Upstash Redis
UPSTASH_REDIS_PROD_URL=https://...
UPSTASH_REDIS_PROD_TOKEN=...
UPSTASH_REDIS_NONPROD_URL=https://...
UPSTASH_REDIS_NONPROD_TOKEN=...

# Google Cloud
GCLOUD_PROJECT=singular-dream
GCLOUD_REGION=us-central1

# Configuration
BATCH_TIER=nonprod
BATCH_NAMESPACE=platform
PORT=3001
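A service can load these variables with a small validated loader. The sketch below is illustrative: the defaults (`nonprod`, `platform`, `3001`) mirror the values shown above, but the validation rules are assumptions, not documented behavior.

```typescript
// Illustrative loader for the configuration variables above.
// Defaults mirror the example values; validation is an assumption.

interface DevopsConfig {
  tier: "prod" | "nonprod";
  namespace: string;
  port: number;
}

function loadConfig(env: Record<string, string | undefined>): DevopsConfig {
  const tier = env.BATCH_TIER ?? "nonprod";
  if (tier !== "prod" && tier !== "nonprod") {
    throw new Error(`Invalid BATCH_TIER: ${tier}`);
  }
  return {
    tier,
    namespace: env.BATCH_NAMESPACE ?? "platform",
    port: Number(env.PORT ?? 3001),
  };
}

// Typically called once at startup: loadConfig(process.env)
```

Failing fast on an invalid tier prevents a worker from silently draining the wrong queue.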

Examples

Example 1: Scheduled Overnight Build

// 1. Create template
const template = await scheduler.createTemplate({
  name: "Full Platform Build",
  script: "build-all",
  tier: "nonprod",
  namespace: "platform",
  class: "B",
  args: ["--clean", "--verbose"],
});

// 2. Schedule for 2 AM daily
const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: "0 2 * * *",
  timezone: "America/Chicago",
});

// 3. Scheduler automatically creates batch job at 2 AM
// 4. Batch worker executes build-all script
// 5. Results available via batch.status(jobId)

Example 2: On-Demand Refactoring

// Submit immediate batch job
const job = await batch.submit({
  id: `refactor-${Date.now()}`,
  tier: "nonprod",
  namespace: "dev",
  script: "refactor-codebase",
  class: "C",
  args: ["--target", "auth-module"],
});

// Poll until the job reaches a terminal state
// (assumes "failed" is also a terminal status)
while (true) {
  const status = await batch.status(job.id);
  if (status.status === "completed") break;
  if (status.status === "failed") throw new Error(`Job ${job.id} failed`);
  await new Promise((r) => setTimeout(r, 5000)); // check every 5 seconds
}

Example 3: AI-Powered Fix (Future)

// Run build-test-fix loop as batch job
const job = await batch.submit({
  id: `autofix-${Date.now()}`,
  tier: "nonprod",
  namespace: "test",
  script: "aiautofix/build-test-fix-loop",
  class: "B",
  args: ["--max-iterations", "10"],
});

// AI will:
// 1. Run tests
// 2. Analyze errors with Gemini
// 3. Generate fixes
// 4. Apply and verify
// 5. Repeat until clean or max iterations

Best Practices

Job Classification

  • Class A (Critical): Production hotfixes, urgent deployments
  • Class B (High): Scheduled builds, important migrations
  • Class C (Normal): Regular refactoring, reports
  • Class D (Low): Cleanup, optimization, non-urgent tasks
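The mapping above can be encoded so call sites pick a class by intent rather than by letter. The helper below is hypothetical, not part of the SDK; it simply mirrors the table.

```typescript
// Maps the urgency levels described above to batch queue classes.
// Purely illustrative; the mapping mirrors the classification table.

type Urgency = "critical" | "high" | "normal" | "low";
type JobClass = "A" | "B" | "C" | "D";

const CLASS_FOR_URGENCY: Record<Urgency, JobClass> = {
  critical: "A", // production hotfixes, urgent deployments
  high: "B",     // scheduled builds, important migrations
  normal: "C",   // regular refactoring, reports
  low: "D",      // cleanup, optimization, non-urgent tasks
};

function classFor(urgency: Urgency): JobClass {
  return CLASS_FOR_URGENCY[urgency];
}
```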

Tier Usage

  • Production: Live system operations, real data
  • Non-Production: Development, testing, staging

Namespace Organization

  • platform - Main application
  • dev - Development experiments
  • test - Automated testing
  • ops - Operational tasks

Error Handling

  • All operations return structured results
  • Use try/catch for SDK calls
  • Check job status for completion
  • Monitor Google Cloud Error Reporting for system-level exceptions

Troubleshooting

Job Not Executing

  1. Check queue stats: sd-devops batch queue-stats --tier nonprod
  2. Verify workers running: pm2 list
  3. Check worker logs: pm2 logs batch-worker-nonprod

Schedule Not Triggering

  1. Verify schedule status: sd-devops scheduler list
  2. Check scheduler logs: pm2 logs scheduler
  3. Validate cron expression: Use crontab.guru
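As a quick sanity check before reaching for crontab.guru, a rough structural validator for five-field cron expressions can be sketched. This only checks field count and allowed characters, not full cron semantics (ranges, steps, names), and is not part of the scheduler itself.

```typescript
// Rough structural check for a standard five-field cron expression
// (minute hour day-of-month month day-of-week). Verifies only the
// field count and allowed characters — not ranges or step semantics.

function looksLikeCron(expr: string): boolean {
  const fields = expr.trim().split(/\s+/);
  if (fields.length !== 5) return false;
  return fields.every((f) => /^[\d*,\-\/]+$/.test(f));
}
```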

Connection Errors

  1. Verify Upstash credentials in Doppler
  2. Check Redis connectivity
  3. Ensure services are running

Future Roadmap

Phase 1: Current ✅

  • Batch system
  • Scheduler
  • Unified SDK/CLI

Phase 2: In Progress

  • aiAutoFix migration
  • MCP server completion
  • Monitoring subsystem

Phase 3: Planned

  • Production tier workers
  • Advanced scheduling (dependencies, retries)
  • Real-time dashboard
  • Slack/email notifications
  • Job templates library

For implementation details, see ARCHITECTURE.md

Version History

Version  Date        Author       Change
0.1.0    2026-01-26  Antigravity  Initial Audit & Metadata Injection