SD-DEVOPS Usage Guide
Unified DevOps Platform for Singular Dream
Overview
SD-DEVOPS is a unified platform that consolidates development operations tooling into a single application with consistent interfaces. It currently provides batch job processing and automated scheduling; AI-powered code fixing and system monitoring are planned (see Future Roadmap).
Access Methods:
- CLI: Command-line interface for human operators
- SDK: Programmatic TypeScript/JavaScript API
- MCP: Model Context Protocol for AI agents (Antigravity)
- REST API: HTTP endpoints for external integrations
Subsystems
1. Batch System
Purpose: Execute arbitrary jobs with priority queuing and tier isolation
Capabilities:
- Submit jobs to priority queues (A=critical, B=high, C=normal, D=low)
- Tier isolation (production vs non-production)
- Namespace support for logical grouping
- Job status tracking and cancellation
- Queue statistics and monitoring
Functions:
- submit(job) - Submit a job to the queue
- status(jobId) - Get job execution status
- list(filters) - List jobs with filtering
- cancel(jobId) - Cancel a pending/running job
- queueStats(tier, namespace) - Get queue depth statistics
Use Cases:
- Overnight refactoring operations
- Database migrations
- Report generation
- Bulk data processing
- Long-running builds
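The priority classes and tier isolation described above can be illustrated with a small in-memory model. This is only a sketch: `InMemoryBatchQueue`, its `next` method, and the `BatchJob` type are assumptions made for illustration (the field names are borrowed from the SDK example later in this guide), and the real system queues jobs in Redis rather than in process memory.

```typescript
// Hypothetical types: field names mirror the SDK example in this guide,
// but the real definitions may differ.
type JobClass = "A" | "B" | "C" | "D";
type Tier = "prod" | "nonprod";

interface BatchJob {
  id: string;
  tier: Tier;
  namespace: string;
  script: string;
  class: JobClass;
  args?: string[];
}

// Lower rank means higher priority: A=critical, B=high, C=normal, D=low.
const CLASS_RANK: Record<JobClass, number> = { A: 0, B: 1, C: 2, D: 3 };

class InMemoryBatchQueue {
  private pending: BatchJob[] = [];

  submit(job: BatchJob): void {
    this.pending.push(job);
  }

  // Removes and returns the highest-priority pending job for one tier.
  // Jobs of the same class come out in submission order; other tiers are never touched.
  next(tier: Tier): BatchJob | undefined {
    let best = -1;
    for (let i = 0; i < this.pending.length; i++) {
      const job = this.pending[i];
      if (job.tier !== tier) continue;
      if (best === -1 || CLASS_RANK[job.class] < CLASS_RANK[this.pending[best].class]) {
        best = i;
      }
    }
    return best === -1 ? undefined : this.pending.splice(best, 1)[0];
  }

  // Counts pending jobs per class for one tier and namespace.
  queueStats(tier: Tier, namespace: string): Record<JobClass, number> {
    const stats: Record<JobClass, number> = { A: 0, B: 0, C: 0, D: 0 };
    for (const job of this.pending) {
      if (job.tier === tier && job.namespace === namespace) stats[job.class]++;
    }
    return stats;
  }
}
```

In this model a class A job submitted after a class C job is still dequeued first, and a `prod` job is never handed to a caller asking for `nonprod` work.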
2. Scheduler
Purpose: Time-based job orchestration with cron expressions
Capabilities:
- Create reusable job templates
- Schedule jobs with cron syntax
- Timezone-aware scheduling
- Pause/resume schedules
- Automatic batch job creation
Functions:
- createTemplate(template) - Create reusable job definition
- createSchedule(cron, templateId) - Schedule a job
- listSchedules(filters) - List active schedules
- pauseSchedule(id) - Pause a schedule
- resumeSchedule(id) - Resume a schedule
- deleteSchedule(id) - Delete a schedule
Use Cases:
- Nightly builds
- Daily cleanup tasks
- Weekly reports
- Monthly maintenance
- Recurring data synchronization
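For readers unfamiliar with cron syntax, a five-field expression lists minute, hour, day of month, month, and day of week, so `0 2 * * *` means 02:00 every day. The sketch below shows how such an expression can be checked against a timestamp. It is deliberately simplified and is not the scheduler's implementation: `parseCron` and `matchesUtc` are hypothetical helpers, only `*` and single integers are supported (no ranges, lists, or steps), and matching is done in UTC rather than in a configured timezone.

```typescript
interface CronFields {
  minute: string;
  hour: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
}

// Splits a five-field cron expression into named fields.
function parseCron(expr: string): CronFields {
  const parts = expr.trim().split(/\s+/);
  if (parts.length !== 5) {
    throw new Error(`Expected 5 cron fields, got ${parts.length}`);
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] = parts;
  return { minute, hour, dayOfMonth, month, dayOfWeek };
}

// A field matches when it is the wildcard or equals the value exactly.
function fieldMatches(field: string, value: number): boolean {
  return field === "*" || Number(field) === value;
}

// Returns true when the UTC time of `date` satisfies every field of `expr`.
function matchesUtc(expr: string, date: Date): boolean {
  const f = parseCron(expr);
  return (
    fieldMatches(f.minute, date.getUTCMinutes()) &&
    fieldMatches(f.hour, date.getUTCHours()) &&
    fieldMatches(f.dayOfMonth, date.getUTCDate()) &&
    fieldMatches(f.month, date.getUTCMonth() + 1) && // JS months are 0-based
    fieldMatches(f.dayOfWeek, date.getUTCDay())
  );
}
```

A production scheduler would normally rely on an established cron library, which also handles timezones and the full expression grammar.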
3. aiAutoFix (Planned)
Purpose: AI-powered automated testing, fixing, and debugging
Capabilities:
- Automated test execution
- AI error analysis (Gemini)
- Code fix generation and application
- Build-test-fix loop orchestration
- Integration with batch for long-running operations
Functions (To be implemented):
- test(suite) - Run automated tests
- analyze(errors) - AI-powered error analysis
- fix(suggestions) - Apply AI-generated fixes
- loop(maxIterations) - Run build-test-fix cycle
Use Cases:
- Automated error fixing
- Continuous code improvement
- Test failure resolution
- Build error debugging
- Code quality enhancement
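Because aiAutoFix is not implemented yet, the control flow of the build-test-fix cycle can only be sketched. The example below assumes the three planned steps are injected as functions; `LoopSteps`, `buildTestFixLoop`, and the return shape are hypothetical names, and the real `test`, `analyze`, and `fix` signatures may differ.

```typescript
// Hypothetical step signatures for the planned subsystem.
interface LoopSteps {
  test: () => Promise<string[]>; // error messages; empty when everything passes
  analyze: (errors: string[]) => Promise<string[]>; // suggested fixes
  fix: (suggestions: string[]) => Promise<void>; // applies the suggestions
}

interface LoopResult {
  clean: boolean; // true when the final test run reported no errors
  fixesApplied: number; // number of fix rounds that ran
}

// Runs test -> analyze -> fix until the tests pass or maxIterations
// fix rounds have been applied, whichever comes first.
async function buildTestFixLoop(
  steps: LoopSteps,
  maxIterations: number
): Promise<LoopResult> {
  let fixesApplied = 0;
  while (true) {
    const errors = await steps.test();
    if (errors.length === 0) return { clean: true, fixesApplied };
    if (fixesApplied >= maxIterations) return { clean: false, fixesApplied };
    const suggestions = await steps.analyze(errors);
    await steps.fix(suggestions);
    fixesApplied++;
  }
}
```

Bounding the loop matters here: if a fix never resolves the failure, the cycle stops after `maxIterations` rounds instead of consuming a worker indefinitely.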
4. Monitoring (Planned)
Purpose: System health checks and alerting
Capabilities:
- Service health monitoring
- Queue depth tracking
- Worker utilization metrics
- Alert routing and notifications
- Status dashboards
Functions (To be implemented):
- health() - Check system health
- metrics(subsystem) - Get subsystem metrics
- alert(config) - Configure alerts
Use Cases:
- System health monitoring
- Performance tracking
- Proactive alerting
- Capacity planning
- Incident response
Access Methods
CLI (Command-Line Interface)
Installation:
Usage:
# Batch operations
sd-devops batch submit <script> --class <A|B|C|D> --tier <prod|nonprod>
sd-devops batch status <job-id>
sd-devops batch list --class A --status running
sd-devops batch cancel <job-id>
sd-devops batch queue-stats --tier nonprod
# Scheduler operations
sd-devops scheduler template create --name "Build" --script "build-all" --tier nonprod
sd-devops scheduler create --template <id> --cron "0 2 * * *"
sd-devops scheduler list --status active
sd-devops scheduler pause <schedule-id>
sd-devops scheduler resume <schedule-id>
SDK (Programmatic API)
Installation:
Usage:
import { BatchClient, SchedulerClient } from "@sd/devops";
// Batch operations
const batch = new BatchClient("http://localhost:3001");
await batch.submit({
id: "job-123",
tier: "nonprod",
namespace: "dev",
script: "refactor-overnight",
class: "C",
args: ["--target", "platform"],
});
const status = await batch.status("job-123");
const jobs = await batch.list({ class: "A", status: "running" });
// Scheduler operations
const scheduler = new SchedulerClient("http://localhost:3001");
const template = await scheduler.createTemplate({
name: "Nightly Build",
script: "build-all",
tier: "nonprod",
namespace: "platform",
class: "B",
});
const schedule = await scheduler.createSchedule({
templateId: template.id,
cron: "0 2 * * *", // 2 AM daily
timezone: "America/Chicago",
});
await scheduler.pauseSchedule(schedule.id);
MCP (Model Context Protocol)
Configuration (.cursor/mcp.json):
Available Tools:
- batch_submit - Submit a batch job
- batch_status - Check job status
- batch_list - List jobs
- batch_cancel - Cancel a job
- batch_queue_stats - Get queue statistics
- scheduler_template_create - Create job template
- scheduler_create - Create schedule
- scheduler_list - List schedules
- scheduler_pause - Pause schedule
- scheduler_resume - Resume schedule
Usage (for AI agents):
// Antigravity can call these tools directly
await use_mcp_tool("sd-devops", "batch_submit", {
script: "refactor-overnight",
class: "C",
tier: "nonprod",
namespace: "dev",
});
await use_mcp_tool("sd-devops", "scheduler_create", {
templateId: "template-123",
cron: "0 2 * * *",
});
REST API
Base URL: http://localhost:3001/api
Batch Endpoints:
# Submit job
POST /batch/submit
{
"id": "job-123",
"tier": "nonprod",
"namespace": "dev",
"script": "my-script",
"class": "C"
}
# Get status
GET /batch/status/:jobId?tier=nonprod
# List jobs
GET /batch/list?tier=nonprod&namespace=dev&class=A
# Cancel job
DELETE /batch/cancel/:jobId?tier=nonprod
# Queue stats
GET /batch/queue-stats?tier=nonprod&namespace=dev
Scheduler Endpoints:
# Create template
POST /scheduler/templates
{
"name": "Build",
"script": "build-all",
"tier": "nonprod",
"namespace": "platform",
"class": "B"
}
# Create schedule
POST /scheduler/schedules
{
"templateId": "template-123",
"cron": "0 2 * * *",
"timezone": "America/Chicago"
}
# List schedules
GET /scheduler/schedules?status=active
# Pause schedule
PATCH /scheduler/schedules/:id
{
"status": "paused"
}
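When calling these endpoints from code, it helps to build URLs in one place so that query values are always encoded. The helper below is a minimal sketch, not part of the SDK: `buildUrl` is a hypothetical function, and the paths and query parameters are taken from the endpoint listing above.

```typescript
type Query = Record<string, string | undefined>;

// Joins a base URL, a path, and optional query parameters.
// Undefined values are skipped; keys and values are percent-encoded.
function buildUrl(baseUrl: string, path: string, query: Query = {}): string {
  const pairs: string[] = [];
  for (const key of Object.keys(query)) {
    const value = query[key];
    if (value !== undefined) {
      pairs.push(`${encodeURIComponent(key)}=${encodeURIComponent(value)}`);
    }
  }
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return pairs.length > 0 ? `${base}${path}?${pairs.join("&")}` : `${base}${path}`;
}

const BASE_URL = "http://localhost:3001/api";

// GET /batch/list?tier=nonprod&namespace=dev&class=A
const listUrl = buildUrl(BASE_URL, "/batch/list", {
  tier: "nonprod",
  namespace: "dev",
  class: "A",
});

// GET /batch/status/:jobId?tier=nonprod
const statusUrl = buildUrl(
  BASE_URL,
  `/batch/status/${encodeURIComponent("job-123")}`,
  { tier: "nonprod" }
);
```

The resulting strings can be passed to any HTTP client, for example `fetch(listUrl)`.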
Inter-Module Relationships
Scheduler → Batch Integration
The scheduler creates batch jobs automatically when schedules trigger:
┌─────────────────┐
│ Scheduler │
│ Engine │
└────────┬────────┘
│ Cron triggers
│
▼
┌─────────────────┐
│ Create Batch │
│ Job from │
│ Template │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Batch Queue │
│ (Redis) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Batch Worker │
│ Executes Job │
└─────────────────┘
Example Flow:
- User creates template: "Nightly Build" → build-all script
- User creates schedule: "0 2 * * *" (2 AM daily)
- At 2 AM, scheduler creates: BatchJobRequest with template config
- Batch job queued in Redis: nonprod:platform:queue:batch:class-B
- Batch worker picks up job and executes build-all script
- Results stored and status updated
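The middle steps of this flow can be sketched as two small functions: one that turns a template into a job request when a schedule fires, and one that derives the Redis queue key. Both are assumptions for illustration. The key pattern is inferred from the single example key above, and `jobFromTemplate`, the `JobTemplate` and `BatchJobRequest` shapes, and the generated job id format are hypothetical.

```typescript
type JobClass = "A" | "B" | "C" | "D";
type Tier = "prod" | "nonprod";

interface JobTemplate {
  id: string;
  name: string;
  script: string;
  tier: Tier;
  namespace: string;
  class: JobClass;
  args?: string[];
}

interface BatchJobRequest {
  id: string;
  tier: Tier;
  namespace: string;
  script: string;
  class: JobClass;
  args: string[];
}

// Builds a queue key of the form <tier>:<namespace>:queue:batch:class-<class>.
function queueKey(tier: Tier, namespace: string, jobClass: JobClass): string {
  return `${tier}:${namespace}:queue:batch:class-${jobClass}`;
}

// Copies the template's settings into a new job request.
// The id combines the template id and the trigger time (illustrative format).
function jobFromTemplate(template: JobTemplate, firedAt: Date): BatchJobRequest {
  return {
    id: `${template.id}-${firedAt.toISOString()}`,
    tier: template.tier,
    namespace: template.namespace,
    script: template.script,
    class: template.class,
    args: template.args ?? [],
  };
}
```

With a template for `build-all` in tier `nonprod`, namespace `platform`, class `B`, `queueKey` yields the key shown in the flow above.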
aiAutoFix → Batch Integration (Planned)
Long-running AI operations will run as batch jobs:
┌─────────────────┐
│ aiAutoFix │
│ Request │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Submit as │
│ Batch Job │
│ (class B) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Batch Worker │
│ Runs AI Loop │
└─────────────────┘
Monitoring → All Subsystems (Planned)
Monitoring observes all subsystems:
┌─────────────────┐
│ Monitoring │
│ Subsystem │
└────────┬────────┘
│
├──────────► Batch (health, queue depth)
│
├──────────► Scheduler (active schedules)
│
└──────────► aiAutoFix (fix success rate)
Shared Infrastructure
All subsystems use common infrastructure:
Queue Manager (Upstash Redis)
- Unified queue operations
- Tier-aware key prefixes
- Atomic operations
- Pub/sub for events
Job Executor
- Single execution engine
- Script sandboxing
- Resource limits
- Error handling
Event Bus (Redis Pub/Sub)
- Cross-subsystem communication
- Event-driven architecture
- Loose coupling
Telemetry (Google Cloud Monitoring & Logging)
- Log Ingestion: Automated JSON ingestion via Google Cloud Ops Agent.
- Error Discovery: Native Google Cloud Error Reporting (Standard 119).
- Service Performance: Cloud Monitoring metrics (CPU, Memory, Latency).
- Alert Routing: GCP Alert Policies connected to Incident Response channels.
Architecture Principles
1. Subsystem Modularity
Each subsystem is self-contained with clear boundaries:
- Own types, core logic, SDK, CLI commands
- Independent deployment
- Isolated testing
2. Shared Infrastructure
Common services prevent duplication:
- Single Redis instance per tier (prod and nonprod use separate Upstash credentials)
- Unified job executor
- Shared telemetry
- Common configuration
3. Unified Interface
Consistent API across all access methods:
- Same operations via CLI/SDK/MCP/API
- Predictable naming: <subsystem>_<action>
- Type-safe SDK with full IntelliSense
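The `<subsystem>_<action>` convention makes tool names easy to route mechanically. As an illustration, the hypothetical `parseToolName` helper below splits an MCP tool name at the first underscore and rejects unknown subsystems. The subsystem list contains only the two implemented subsystems and is an assumption, not a published registry.

```typescript
// Subsystems that currently expose tools (assumed from this guide).
const SUBSYSTEMS = ["batch", "scheduler"] as const;
type Subsystem = (typeof SUBSYSTEMS)[number];

interface ParsedTool {
  subsystem: Subsystem;
  action: string;
}

// Splits "<subsystem>_<action>" at the first underscore.
// The action may itself contain underscores, e.g. "queue_stats".
function parseToolName(tool: string): ParsedTool {
  const idx = tool.indexOf("_");
  if (idx <= 0 || idx === tool.length - 1) {
    throw new Error(`Not a <subsystem>_<action> name: ${tool}`);
  }
  const subsystem = tool.slice(0, idx);
  const action = tool.slice(idx + 1);
  if ((SUBSYSTEMS as readonly string[]).indexOf(subsystem) === -1) {
    throw new Error(`Unknown subsystem: ${subsystem}`);
  }
  return { subsystem: subsystem as Subsystem, action };
}
```

For example, `batch_queue_stats` parses to subsystem `batch` and action `queue_stats`.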
4. Event-Driven Communication
Subsystems communicate via events:
- Scheduler triggers → Batch executes
- aiAutoFix completes → Monitoring notifies
- Loose coupling enables flexibility
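The loose coupling described here can be pictured with a minimal in-process publish/subscribe bus. This is a sketch under stated assumptions: the real event bus runs over Redis Pub/Sub, and the `EventBus` class, the event names (`schedule.triggered`, `job.completed`), and the payload shapes are all hypothetical.

```typescript
// Hypothetical event catalogue; the real event names and payloads may differ.
interface DevOpsEvents {
  "schedule.triggered": { scheduleId: string; templateId: string };
  "job.completed": { jobId: string; status: string };
}

type EventName = keyof DevOpsEvents;
type Handler<K extends EventName> = (payload: DevOpsEvents[K]) => void;

class EventBus {
  private handlers: { [name: string]: Array<(payload: unknown) => void> } = {};

  // Registers a handler for one event type.
  subscribe<K extends EventName>(event: K, handler: Handler<K>): void {
    const list = this.handlers[event] || [];
    list.push(handler as (payload: unknown) => void);
    this.handlers[event] = list;
  }

  // Delivers the payload to every handler registered for the event.
  publish<K extends EventName>(event: K, payload: DevOpsEvents[K]): void {
    for (const handler of this.handlers[event] || []) {
      handler(payload);
    }
  }
}

// The publisher (scheduler) and subscriber (batch) only share the event contract.
const bus = new EventBus();
const submittedTemplates: string[] = [];
bus.subscribe("schedule.triggered", (e) => submittedTemplates.push(e.templateId));
bus.publish("schedule.triggered", { scheduleId: "sched-1", templateId: "template-123" });
```

Neither side holds a reference to the other, which is what lets subsystems be deployed and tested independently.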
Deployment
Local Development
cd apps/devops
pnpm install
pnpm dev:api # Start batch API
pnpm dev:worker # Start batch worker
pnpm dev:scheduler # Start scheduler
Production (PM2)
# Deploy to dev server
./scripts/deploy-batch-system.sh
# Services managed by PM2:
# - batch-api (port 3001)
# - batch-worker-nonprod (2 instances)
# - scheduler (1 instance)
Environment Variables
# Upstash Redis
UPSTASH_REDIS_PROD_URL=https://...
UPSTASH_REDIS_PROD_TOKEN=...
UPSTASH_REDIS_NONPROD_URL=https://...
UPSTASH_REDIS_NONPROD_TOKEN=...
# Google Cloud
GCLOUD_PROJECT=singular-dream
GCLOUD_REGION=us-central1
# Configuration
BATCH_TIER=nonprod
BATCH_NAMESPACE=platform
PORT=3001
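As an illustration of how these variables fit together, the sketch below selects the Redis credentials that match `BATCH_TIER` and fails fast when something is missing. `loadConfig`, `requireVar`, and the `DevOpsConfig` shape are hypothetical; the variable names come from the listing above, and falling back to port 3001 when `PORT` is unset is an assumption.

```typescript
type Tier = "prod" | "nonprod";
type Env = Record<string, string | undefined>;

interface DevOpsConfig {
  tier: Tier;
  namespace: string;
  port: number;
  redisUrl: string;
  redisToken: string;
}

// Returns the variable's value or throws if it is unset or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) throw new Error(`Missing environment variable: ${name}`);
  return value;
}

// Reads the configuration, choosing Redis credentials by tier.
function loadConfig(env: Env): DevOpsConfig {
  const tier = requireVar(env, "BATCH_TIER");
  if (tier !== "prod" && tier !== "nonprod") {
    throw new Error(`Invalid BATCH_TIER: ${tier}`);
  }
  const suffix = tier === "prod" ? "PROD" : "NONPROD";
  const port = Number(env["PORT"] ?? "3001"); // assumed default
  if (isNaN(port)) throw new Error(`Invalid PORT: ${env["PORT"]}`);
  return {
    tier,
    namespace: requireVar(env, "BATCH_NAMESPACE"),
    port,
    redisUrl: requireVar(env, `UPSTASH_REDIS_${suffix}_URL`),
    redisToken: requireVar(env, `UPSTASH_REDIS_${suffix}_TOKEN`),
  };
}
```

In a Node.js service this would typically be called once at startup as `loadConfig(process.env)`.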
Examples
Example 1: Scheduled Overnight Build
// 1. Create template
const template = await scheduler.createTemplate({
name: "Full Platform Build",
script: "build-all",
tier: "nonprod",
namespace: "platform",
class: "B",
args: ["--clean", "--verbose"],
});
// 2. Schedule for 2 AM daily
const schedule = await scheduler.createSchedule({
templateId: template.id,
cron: "0 2 * * *",
timezone: "America/Chicago",
});
// 3. Scheduler automatically creates batch job at 2 AM
// 4. Batch worker executes build-all script
// 5. Results available via batch.status(jobId)
Example 2: On-Demand Refactoring
// Submit immediate batch job
const job = await batch.submit({
id: `refactor-${Date.now()}`,
tier: "nonprod",
namespace: "dev",
script: "refactor-codebase",
class: "C",
args: ["--target", "auth-module"],
});
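// Poll for completion, giving up after a bounded number of attempts
// so a failed or stuck job cannot block this loop forever
const maxAttempts = 120; // 120 polls x 5 s = 10 minutes (illustrative limit)
for (let attempt = 0; attempt < maxAttempts; attempt++) {
const status = await batch.status(job.id);
if (status.status === "completed") break;
await new Promise((r) => setTimeout(r, 5000));
}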
Example 3: AI-Powered Fix (Future)
// Run build-test-fix loop as batch job
const job = await batch.submit({
id: `autofix-${Date.now()}`,
tier: "nonprod",
namespace: "test",
script: "aiautofix/build-test-fix-loop",
class: "B",
args: ["--max-iterations", "10"],
});
// AI will:
// 1. Run tests
// 2. Analyze errors with Gemini
// 3. Generate fixes
// 4. Apply and verify
// 5. Repeat until clean or max iterations
Best Practices
Job Classification
- Class A (Critical): Production hotfixes, urgent deployments
- Class B (High): Scheduled builds, important migrations
- Class C (Normal): Regular refactoring, reports
- Class D (Low): Cleanup, optimization, non-urgent tasks
Tier Usage
- Production: Live system operations, real data
- Non-Production: Development, testing, staging
Namespace Organization
- platform - Main application
- dev - Development experiments
- test - Automated testing
- ops - Operational tasks
Error Handling
- All operations return structured results
- Use try/catch for SDK calls
- Check job status for completion
- Monitor Google Cloud Error Reporting for system-level exceptions
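One way to apply the try/catch advice consistently is to wrap SDK calls in a helper that converts thrown errors into a structured result. The `attempt` function and `Result` type below are hypothetical and not part of the SDK; they only illustrate the pattern.

```typescript
// A structured outcome: either a value or an error message.
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

// Runs an async operation and captures any thrown error as a Result.
async function attempt<T>(operation: () => Promise<T>): Promise<Result<T>> {
  try {
    return { ok: true, value: await operation() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}
```

A call such as `const result = await attempt(() => batch.status("job-123"));` can then be checked with `if (!result.ok)` before the job status is read.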
Troubleshooting
Job Not Executing
- Check queue stats: batch queue-stats --tier nonprod
- Verify workers running: pm2 list
- Check worker logs: pm2 logs batch-worker-nonprod
Schedule Not Triggering
- Verify schedule status: scheduler list
- Check scheduler logs: pm2 logs scheduler
- Validate cron expression: Use crontab.guru
Connection Errors
- Verify Upstash credentials in Doppler
- Check Redis connectivity
- Ensure services are running
Future Roadmap
Phase 1: Current ✅
- Batch system
- Scheduler
- Unified SDK/CLI
Phase 2: In Progress
- aiAutoFix migration
- MCP server completion
- Monitoring subsystem
Phase 3: Planned
- Production tier workers
- Advanced scheduling (dependencies, retries)
- Real-time dashboard
- Slack/email notifications
- Job templates library
For implementation details, see ARCHITECTURE.md
Version History
| Version | Date | Author | Change |
|---|---|---|---|
| 0.1.0 | 2026-01-26 | Antigravity | Initial Audit & Metadata Injection |