SD-DEVOPS Usage Guide
Unified DevOps Platform for Singular Dream
Overview
SD-DEVOPS is a unified platform that consolidates development operations tooling into a single application with consistent interfaces. It currently provides batch job processing and automated scheduling; AI-powered code fixing and system monitoring are planned.
Access Methods:
- CLI: Command-line interface for human operators
- SDK: Programmatic TypeScript/JavaScript API
- MCP: Model Context Protocol for AI agents (Antigravity)
- REST API: HTTP endpoints for external integrations
Subsystems
1. Batch System
Purpose: Execute arbitrary jobs with priority queuing and tier isolation
Capabilities:
- Submit jobs to priority queues (A=critical, B=high, C=normal, D=low)
- Tier isolation (production vs non-production)
- Namespace support for logical grouping
- Job status tracking and cancellation
- Queue statistics and monitoring
Functions:
- submit(job) - Submit a job to the queue
- status(jobId) - Get job execution status
- list(filters) - List jobs with filtering
- cancel(jobId) - Cancel a pending/running job
- queueStats(tier, namespace) - Get queue depth statistics
Use Cases:
- Overnight refactoring operations
- Database migrations
- Report generation
- Bulk data processing
- Long-running builds
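The class ordering described above (A before B before C before D) can be sketched as a small selection function. This is a minimal illustration rather than the platform's implementation: `QueueDepths` and `nextClassToServe` are hypothetical names, and the strict-priority policy is an assumption that this guide does not confirm.

```typescript
// Hypothetical sketch: strict-priority selection across the four job classes.
type JobClass = 'A' | 'B' | 'C' | 'D';
type QueueDepths = Record<JobClass, number>;

// Highest priority first, as listed in this guide (A=critical ... D=low).
const CLASS_ORDER: readonly JobClass[] = ['A', 'B', 'C', 'D'];

// Returns the highest-priority class with pending jobs, or null if every queue is empty.
function nextClassToServe(depths: QueueDepths): JobClass | null {
  for (const cls of CLASS_ORDER) {
    if (depths[cls] > 0) return cls;
  }
  return null;
}
```

With depths of `{ A: 0, B: 2, C: 5, D: 1 }`, this sketch would serve class B first, even though class C has more jobs waiting.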
2. Scheduler
Purpose: Time-based job orchestration with cron expressions
Capabilities:
- Create reusable job templates
- Schedule jobs with cron syntax
- Timezone-aware scheduling
- Pause/resume schedules
- Automatic batch job creation
Functions:
- createTemplate(template) - Create reusable job definition
- createSchedule(options) - Schedule a job from a template (options: templateId, cron, timezone)
- listSchedules(filters) - List active schedules
- pauseSchedule(id) - Pause a schedule
- resumeSchedule(id) - Resume a schedule
- deleteSchedule(id) - Delete a schedule
Use Cases:
- Nightly builds
- Daily cleanup tasks
- Weekly reports
- Monthly maintenance
- Recurring data synchronization
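To illustrate what timezone-aware scheduling means in practice, the sketch below checks whether a given instant is 2:00 AM wall-clock time in a named time zone, using the standard `Intl.DateTimeFormat` API. The helper names `localHourMinute` and `firesAtTwoAm` are hypothetical, and the scheduler's actual matching logic may differ.

```typescript
// Hypothetical helper: wall-clock hour and minute of an instant in an IANA time zone.
function localHourMinute(instant: Date, timeZone: string): { hour: number; minute: number } {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: 'numeric',
    minute: 'numeric',
    hour12: false,
  }).formatToParts(instant);
  const read = (type: string): number =>
    Number(parts.find((p) => p.type === type)?.value ?? NaN);
  // Some runtimes report midnight as hour 24 with hour12: false; normalize to 0.
  return { hour: read('hour') % 24, minute: read('minute') };
}

// Would a daily "0 2 * * *" schedule in this zone fire at this instant?
function firesAtTwoAm(instant: Date, timeZone: string): boolean {
  const { hour, minute } = localHourMinute(instant, timeZone);
  return hour === 2 && minute === 0;
}
```

For example, 08:00 UTC on 15 January 2024 is 2:00 AM in America/Chicago (UTC-6 in winter), so the same cron expression fires at different absolute times depending on the schedule's time zone.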
3. aiAutoFix (Planned)
Purpose: AI-powered automated testing, fixing, and debugging
Capabilities:
- Automated test execution
- AI error analysis (Gemini)
- Code fix generation and application
- Build-test-fix loop orchestration
- Integration with batch for long-running operations
Functions (To be implemented):
- test(suite) - Run automated tests
- analyze(errors) - AI-powered error analysis
- fix(suggestions) - Apply AI-generated fixes
- loop(maxIterations) - Run build-test-fix cycle
Use Cases:
- Automated error fixing
- Continuous code improvement
- Test failure resolution
- Build error debugging
- Code quality enhancement
4. Monitoring (Planned)
Purpose: System health checks and alerting
Capabilities:
- Service health monitoring
- Queue depth tracking
- Worker utilization metrics
- Alert routing and notifications
- Status dashboards
Functions (To be implemented):
- health() - Check system health
- metrics(subsystem) - Get subsystem metrics
- alert(config) - Configure alerts
Use Cases:
- System health monitoring
- Performance tracking
- Proactive alerting
- Capacity planning
- Incident response
Access Methods
CLI (Command-Line Interface)
Installation:
Usage:
# Batch operations
sd-devops batch submit <script> --class <A|B|C|D> --tier <prod|nonprod>
sd-devops batch status <job-id>
sd-devops batch list --class A --status running
sd-devops batch cancel <job-id>
sd-devops batch queue-stats --tier nonprod
# Scheduler operations
sd-devops scheduler template create --name "Build" --script "build-all" --tier nonprod
sd-devops scheduler create --template <id> --cron "0 2 * * *"
sd-devops scheduler list --status active
sd-devops scheduler pause <schedule-id>
sd-devops scheduler resume <schedule-id>
SDK (Programmatic API)
Installation:
Usage:
import { BatchClient, SchedulerClient } from '@sd/devops';
// Batch operations
const batch = new BatchClient('http://localhost:3001');
await batch.submit({
  id: 'job-123',
  tier: 'nonprod',
  namespace: 'dev',
  script: 'refactor-overnight',
  class: 'C',
  args: ['--target', 'platform'],
});
const status = await batch.status('job-123');
const jobs = await batch.list({ class: 'A', status: 'running' });
// Scheduler operations
const scheduler = new SchedulerClient('http://localhost:3001');
const template = await scheduler.createTemplate({
  name: 'Nightly Build',
  script: 'build-all',
  tier: 'nonprod',
  namespace: 'platform',
  class: 'B',
});
const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: '0 2 * * *', // 2 AM daily
  timezone: 'America/Chicago',
});
await scheduler.pauseSchedule(schedule.id);
MCP (Model Context Protocol)
Configuration (.cursor/mcp.json):
Available Tools:
- batch_submit - Submit a batch job
- batch_status - Check job status
- batch_list - List jobs
- batch_cancel - Cancel a job
- batch_queue_stats - Get queue statistics
- scheduler_template_create - Create job template
- scheduler_create - Create schedule
- scheduler_list - List schedules
- scheduler_pause - Pause schedule
- scheduler_resume - Resume schedule
Usage (for AI agents):
// Antigravity can call these tools directly
await use_mcp_tool('sd-devops', 'batch_submit', {
  script: 'refactor-overnight',
  class: 'C',
  tier: 'nonprod',
  namespace: 'dev',
});
await use_mcp_tool('sd-devops', 'scheduler_create', {
  templateId: 'template-123',
  cron: '0 2 * * *',
});
REST API
Base URL: http://localhost:3001/api
Batch Endpoints:
# Submit job
POST /batch/submit
{
  "id": "job-123",
  "tier": "nonprod",
  "namespace": "dev",
  "script": "my-script",
  "class": "C"
}
# Get status
GET /batch/status/:jobId?tier=nonprod
# List jobs
GET /batch/list?tier=nonprod&namespace=dev&class=A
# Cancel job
DELETE /batch/cancel/:jobId?tier=nonprod
# Queue stats
GET /batch/queue-stats?tier=nonprod&namespace=dev
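As a rough sketch of how a client might assemble the Submit job call listed above, the function below builds the URL, headers, and JSON body without sending anything. `BatchJobBody`, `HttpRequestSpec`, and `buildSubmitRequest` are hypothetical names; the field names come from the JSON body in this section, and the `Content-Type` header is an assumption.

```typescript
// Hypothetical types: field names are taken from the JSON body shown in this guide.
interface BatchJobBody {
  id: string;
  tier: 'prod' | 'nonprod';
  namespace: string;
  script: string;
  class: 'A' | 'B' | 'C' | 'D';
}

interface HttpRequestSpec {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) the POST /batch/submit request.
function buildSubmitRequest(baseUrl: string, job: BatchJobBody): HttpRequestSpec {
  return {
    // Strip trailing slashes so the path is joined with exactly one "/".
    url: `${baseUrl.replace(/\/+$/, '')}/batch/submit`,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }, // assumed, not confirmed by this guide
    body: JSON.stringify(job),
  };
}

const submitSpec = buildSubmitRequest('http://localhost:3001/api', {
  id: 'job-123',
  tier: 'nonprod',
  namespace: 'dev',
  script: 'my-script',
  class: 'C',
});
```

The resulting object can be handed to any HTTP client, for example `fetch(submitSpec.url, { method: submitSpec.method, headers: submitSpec.headers, body: submitSpec.body })`.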
Scheduler Endpoints:
# Create template
POST /scheduler/templates
{
  "name": "Build",
  "script": "build-all",
  "tier": "nonprod",
  "namespace": "platform",
  "class": "B"
}
# Create schedule
POST /scheduler/schedules
{
  "templateId": "template-123",
  "cron": "0 2 * * *",
  "timezone": "America/Chicago"
}
# List schedules
GET /scheduler/schedules?status=active
# Pause schedule
PATCH /scheduler/schedules/:id
{
  "status": "paused"
}
Inter-Module Relationships
Scheduler → Batch Integration
The scheduler creates batch jobs automatically when schedules trigger:
┌─────────────────┐
│    Scheduler    │
│     Engine      │
└────────┬────────┘
         │ Cron triggers
         │
         ▼
┌─────────────────┐
│  Create Batch   │
│    Job from     │
│    Template     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Batch Queue   │
│     (Redis)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Batch Worker   │
│  Executes Job   │
└─────────────────┘
Example Flow:
1. User creates template: "Nightly Build" → build-all script
2. User creates schedule: "0 2 * * *" (2 AM daily)
3. At 2 AM, scheduler creates: BatchJobRequest with template config
4. Batch job queued in Redis: nonprod:platform:queue:batch:class-B
5. Batch worker picks up job and executes build-all script
6. Results stored and status updated
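Steps 3 and 4 of the flow above can be sketched as two small functions: one that turns a template into a concrete job request, and one that derives the Redis queue key. The interfaces and function names are hypothetical, the job id format is an assumption, and the key pattern is inferred from the single example key shown in step 4.

```typescript
// Hypothetical shapes; field names mirror the template and job examples in this guide.
interface JobTemplate {
  id: string;
  script: string;
  tier: 'prod' | 'nonprod';
  namespace: string;
  class: 'A' | 'B' | 'C' | 'D';
  args?: string[];
}

interface BatchJobRequest {
  id: string;
  script: string;
  tier: 'prod' | 'nonprod';
  namespace: string;
  class: 'A' | 'B' | 'C' | 'D';
  args: string[];
}

// Step 3: turn a template into a concrete job. The id format is an assumption.
function jobFromTemplate(template: JobTemplate, firedAt: Date): BatchJobRequest {
  return {
    id: `${template.id}-${firedAt.getTime()}`,
    script: template.script,
    tier: template.tier,
    namespace: template.namespace,
    class: template.class,
    args: template.args ?? [],
  };
}

// Step 4: derive the queue key, following the pattern of the example key above.
function queueKeyFor(job: BatchJobRequest): string {
  return `${job.tier}:${job.namespace}:queue:batch:class-${job.class}`;
}
```

Because tier and namespace lead the key, jobs from different tiers or namespaces would never share a queue under this scheme, which is one plausible way to realize the tier isolation described earlier.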
aiAutoFix → Batch Integration (Planned)
Long-running AI operations will run as batch jobs:
┌─────────────────┐
│    aiAutoFix    │
│     Request     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Submit as    │
│    Batch Job    │
│    (class B)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Batch Worker   │
│  Runs AI Loop   │
└─────────────────┘
Monitoring → All Subsystems (Planned)
Monitoring observes all subsystems:
┌─────────────────┐
│   Monitoring    │
│    Subsystem    │
└────────┬────────┘
         │
         ├──────────► Batch (health, queue depth)
         │
         ├──────────► Scheduler (active schedules)
         │
         └──────────► aiAutoFix (fix success rate)
Shared Infrastructure
All subsystems use common infrastructure:
Queue Manager (Upstash Redis)
- Unified queue operations
- Tier-aware key prefixes
- Atomic operations
- Pub/sub for events
Job Executor
- Single execution engine
- Script sandboxing
- Resource limits
- Error handling
Event Bus (Redis Pub/Sub)
- Cross-subsystem communication
- Event-driven architecture
- Loose coupling
Telemetry (Sentry)
- Error tracking
- Performance monitoring
- Cron monitoring
- Alert routing
Architecture Principles
1. Subsystem Modularity
Each subsystem is self-contained with clear boundaries:
- Own types, core logic, SDK, CLI commands
- Independent deployment
- Isolated testing
2. Shared Infrastructure
Common services prevent duplication:
- Single Redis instance
- Unified job executor
- Shared telemetry
- Common configuration
3. Unified Interface
Consistent API across all access methods:
- Same operations via CLI/SDK/MCP/API
- Predictable naming: <subsystem>_<action>
- Type-safe SDK with full IntelliSense
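The `<subsystem>_<action>` convention can be illustrated by deriving MCP tool names from the words of the matching CLI command. `mcpToolName` is a hypothetical helper written for this illustration; it is not part of the SDK.

```typescript
// Hypothetical helper: derive an MCP tool name from CLI-style command words.
// Hyphens inside a word (for example "queue-stats") become underscores.
function mcpToolName(subsystem: string, ...actionWords: string[]): string {
  return [subsystem, ...actionWords]
    .map((word) => word.replace(/-/g, '_'))
    .join('_');
}

const queueStatsTool = mcpToolName('batch', 'queue-stats');
const templateCreateTool = mcpToolName('scheduler', 'template', 'create');
```

Here `queueStatsTool` is `batch_queue_stats` and `templateCreateTool` is `scheduler_template_create`, matching the tool names listed in the MCP section.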
4. Event-Driven Communication
Subsystems communicate via events:
- Scheduler triggers → Batch executes
- aiAutoFix completes → Monitoring notifies
- Loose coupling enables flexibility
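The loose coupling described here can be sketched with a tiny in-memory event bus. This stands in for the Redis pub/sub bus purely for illustration: `TinyEventBus`, the `schedule.triggered` event name, and its payload are all assumptions, not the platform's actual event contract.

```typescript
// Hypothetical in-memory stand-in for the Redis pub/sub event bus.
type Handler<T> = (payload: T) => void;

class TinyEventBus<Events> {
  private handlers = new Map<keyof Events, Array<Handler<any>>>();

  // Register a handler for one event type.
  on<K extends keyof Events>(event: K, handler: Handler<Events[K]>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Deliver a payload to every handler registered for the event.
  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// Event name and payload are assumptions for illustration.
type DevopsEvents = { 'schedule.triggered': { templateId: string } };

const bus = new TinyEventBus<DevopsEvents>();
const submittedTemplates: string[] = [];

// The batch side reacts to scheduler events without the scheduler knowing about it.
bus.on('schedule.triggered', (event) => {
  submittedTemplates.push(event.templateId);
});
bus.emit('schedule.triggered', { templateId: 'template-123' });
```

The emitter never references the subscriber directly, so a new listener (such as a monitoring subsystem) could be added without changing the scheduler side.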
Deployment
Local Development
cd apps/devops
pnpm install
pnpm dev:api # Start batch API
pnpm dev:worker # Start batch worker
pnpm dev:scheduler # Start scheduler
Production (PM2)
# Deploy to dev server
./scripts/deploy-batch-system.sh
# Services managed by PM2:
# - batch-api (port 3001)
# - batch-worker-nonprod (2 instances)
# - scheduler (1 instance)
Environment Variables
# Upstash Redis
UPSTASH_REDIS_PROD_URL=https://...
UPSTASH_REDIS_PROD_TOKEN=...
UPSTASH_REDIS_NONPROD_URL=https://...
UPSTASH_REDIS_NONPROD_TOKEN=...
# Sentry
SENTRY_DSN=https://...
# Configuration
BATCH_TIER=nonprod
BATCH_NAMESPACE=platform
PORT=3001
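One way a service might pick the right Upstash credentials for its tier is sketched below. The variable names come from the list above; `redisCredentialsFor` and `RedisCredentials` are hypothetical, and the environment is passed in as a plain object (for example `process.env` in Node) to keep the sketch self-contained.

```typescript
type Tier = 'prod' | 'nonprod';

interface RedisCredentials {
  url: string;
  token: string;
}

// Picks the Upstash credentials for a tier from an environment-like object.
// Throws if either variable is missing or empty, so misconfiguration fails fast.
function redisCredentialsFor(
  env: Record<string, string | undefined>,
  tier: Tier,
): RedisCredentials {
  const prefix = tier === 'prod' ? 'UPSTASH_REDIS_PROD' : 'UPSTASH_REDIS_NONPROD';
  const url = env[`${prefix}_URL`];
  const token = env[`${prefix}_TOKEN`];
  if (!url || !token) {
    throw new Error(`Missing ${prefix}_URL or ${prefix}_TOKEN`);
  }
  return { url, token };
}
```

Failing at startup with a named variable tends to be easier to diagnose than a connection error later in a worker.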
Examples
Example 1: Scheduled Overnight Build
// 1. Create template
const template = await scheduler.createTemplate({
  name: 'Full Platform Build',
  script: 'build-all',
  tier: 'nonprod',
  namespace: 'platform',
  class: 'B',
  args: ['--clean', '--verbose'],
});
// 2. Schedule for 2 AM daily
const schedule = await scheduler.createSchedule({
  templateId: template.id,
  cron: '0 2 * * *',
  timezone: 'America/Chicago',
});
// 3. Scheduler automatically creates batch job at 2 AM
// 4. Batch worker executes build-all script
// 5. Results available via batch.status(jobId)
Example 2: On-Demand Refactoring
// Submit immediate batch job
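const job = await batch.submit({
  id: `refactor-${Date.now()}`,
  tier: 'nonprod',
  namespace: 'dev',
  script: 'refactor-codebase',
  class: 'C',
  args: ['--target', 'auth-module'],
});
// Poll for completion (bounded so a failed or stuck job cannot loop forever)
const maxAttempts = 120; // about 10 minutes at 5-second intervals
for (let attempt = 0; attempt < maxAttempts; attempt++) {
  const status = await batch.status(job.id);
  if (status.status === 'completed') break;
  // Other terminal states (such as a failed or cancelled job) should also end
  // the loop; their exact names are not listed in this guide.
  await new Promise(r => setTimeout(r, 5000));
}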
Example 3: AI-Powered Fix (Future)
// Run build-test-fix loop as batch job
const job = await batch.submit({
  id: `autofix-${Date.now()}`,
  tier: 'nonprod',
  namespace: 'test',
  script: 'aiautofix/build-test-fix-loop',
  class: 'B',
  args: ['--max-iterations', '10'],
});
// AI will:
// 1. Run tests
// 2. Analyze errors with Gemini
// 3. Generate fixes
// 4. Apply and verify
// 5. Repeat until clean or max iterations
Best Practices
Job Classification
- Class A (Critical): Production hotfixes, urgent deployments
- Class B (High): Scheduled builds, important migrations
- Class C (Normal): Regular refactoring, reports
- Class D (Low): Cleanup, optimization, non-urgent tasks
Tier Usage
- Production: Live system operations, real data
- Non-Production: Development, testing, staging
Namespace Organization
- platform - Main application
- dev - Development experiments
- test - Automated testing
- ops - Operational tasks
Error Handling
- All operations return structured results
- Use try/catch for SDK calls
- Check job status for completion
- Monitor Sentry for system errors
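The try/catch advice above can be wrapped in a small helper that converts thrown errors into a structured result. `Result` and `attempt` are hypothetical names for this sketch; the SDK's own result types are not documented here and may differ.

```typescript
// Hypothetical structured result: either a value or a normalized Error.
type Result<T> = { ok: true; value: T } | { ok: false; error: Error };

// Runs an async operation and never throws; failures come back as { ok: false }.
async function attempt<T>(operation: () => Promise<T>): Promise<Result<T>> {
  try {
    return { ok: true, value: await operation() };
  } catch (err) {
    // Non-Error throwables (strings, objects) are wrapped so callers always get an Error.
    return { ok: false, error: err instanceof Error ? err : new Error(String(err)) };
  }
}
```

A caller could then write `const result = await attempt(() => batch.status('job-123'));` and branch on `result.ok` instead of nesting try/catch blocks.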
Troubleshooting
Job Not Executing
- Check queue stats: sd-devops batch queue-stats --tier nonprod
- Verify workers running: pm2 list
- Check worker logs: pm2 logs batch-worker-nonprod
Schedule Not Triggering
- Verify schedule status: sd-devops scheduler list
- Check scheduler logs: pm2 logs scheduler
- Validate cron expression: Use crontab.guru
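For a quick local sanity check before creating a schedule, a five-field cron expression can be pre-validated with a sketch like the one below. It is a simplified, hypothetical checker (`isValidCron` is not part of the SDK): it handles `*`, numbers, ranges, lists, and steps, but not names such as MON or JAN, and the scheduler's own parser may accept or reject different inputs.

```typescript
// Allowed numeric range per field: minute, hour, day of month, month, day of week.
const CRON_FIELD_RANGES: ReadonlyArray<[number, number]> = [
  [0, 59],
  [0, 23],
  [1, 31],
  [1, 12],
  [0, 7], // 0 and 7 both commonly mean Sunday
];

// Validates one field: comma-separated parts of "*", "n", or "a-b", each with an optional "/step".
function isValidCronField(field: string, min: number, max: number): boolean {
  return field.split(',').every((part) => {
    const match = /^(\*|\d+(?:-\d+)?)(?:\/(\d+))?$/.exec(part);
    if (!match) return false;
    if (match[2] !== undefined && Number(match[2]) < 1) return false; // step must be >= 1
    if (match[1] === '*') return true;
    const [low, high = low] = match[1].split('-').map(Number);
    return low >= min && high <= max && low <= high;
  });
}

// Validates a whole expression: exactly five whitespace-separated fields, each in range.
function isValidCron(expression: string): boolean {
  const fields = expression.trim().split(/\s+/);
  return (
    fields.length === 5 &&
    fields.every((f, i) => isValidCronField(f, CRON_FIELD_RANGES[i][0], CRON_FIELD_RANGES[i][1]))
  );
}
```

Under these rules `0 2 * * *` passes, while `0 25 * * *` (hour out of range) and `0 2 * *` (only four fields) are rejected.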
Connection Errors
- Verify Upstash credentials in Doppler
- Check Redis connectivity
- Ensure services are running
Future Roadmap
Phase 1: Current ✅
- Batch system
- Scheduler
- Unified SDK/CLI
Phase 2: In Progress
- aiAutoFix migration
- MCP server completion
- Monitoring subsystem
Phase 3: Planned
- Production tier workers
- Advanced scheduling (dependencies, retries)
- Real-time dashboard
- Slack/email notifications
- Job templates library
For implementation details, see ARCHITECTURE.md