
DevOps Tooling Architecture - Recommendation

Status: ✅ IN PROGRESS - Batch subsystem implemented, M2 migration pending
Priority: #2 (Proceeding independently - M2 and the batch subsystem don't depend on the Property refactor)
Date Created: 2026-01-18
Last Updated: 2026-01-18
Decision: Create apps/devops as third application

✅ IMPLEMENTATION UPDATE: Initial implementation complete! Created apps/devops/ with full batch processing subsystem. M2 migration and remaining script consolidation still pending.

Implementation Progress

Completed (2026-01-18):

  • ✅ Created apps/devops/ directory structure
  • ✅ Batch subsystem fully implemented:
      • Types & configuration (job classes A/B/C/D)
      • SDK client (BatchClient)
      • Queue manager (Redis)
      • Job executor
      • Batch worker (polling & execution)
      • CLI commands (submit, status, list, cancel, queue-stats)
      • MCP server (5 tools for Antigravity)
  • ✅ Unified SDK entry point
  • ✅ CLI with Commander.js
  • ✅ Package.json and TypeScript config

Pending:

  • ⏳ M2 AI module migration
  • ⏳ Migration of 180+ scripts from scripts/ directory
  • ⏳ Audit, seed, verify, deploy, monitor modules
  • ⏳ Full deployment to Elastic Muscle

See: apps/devops/ARCHITECTURE.md for implementation details

⚠️ ORIGINAL PRIORITY NOTE: This refactoring was originally blocked by Sovereign Property/Ownership module consolidation. However, M2 and batch systems don't depend on that refactor, so we proceeded independently.


Project Priorities

| Priority | Project | Status | Blocking |
|---|---|---|---|
| #1 | Sovereign Property/Ownership Module | 🔴 In Progress | Platform development |
| #2 | DevOps Tooling Application | 🟡 Approved (Concept) | Priority #1 completion |
| #3 | Platform Feature Development | ⏸️ Paused | Priority #1 completion |

Execution Order:

  1. Complete Sovereign Property/Ownership module consolidation (Directory + Property → Unified)
  2. Resume platform feature development
  3. Plan detailed design for DevOps tooling application
  4. Execute DevOps tooling refactor (when bandwidth allows)


Current State Analysis

Scripts Directory Inventory

  • Total Scripts: 180+ files
  • TypeScript Scripts: 145+ .ts files
  • Shell Scripts: 30+ .sh files
  • Subdirectories: 6 (checks, git-hooks, jobs, legacy, lib, utils)

Script Categories Identified

| Category | Count | Examples |
|---|---|---|
| Audit Tools | 12 | audit-capabilities.ts, audit-rbac.ts, audit-schema.ts |
| Data Seeders | 8 | dataseeder-gold.seed.ts, dataseeder-chaos.seed.ts |
| AutoTesters | 4 | autotester-overnight.ts, autotester-sprint.ts |
| Verification | 30+ | verify-*.ts (finance, inventory, permissions, etc.) |
| Infrastructure | 15+ | check-elastic-muscle.ts, monitor-*.sh, setup-*.sh |
| Deployment | 8 | deploy-marketing.ts, sync-doppler-to-vercel.sh |
| Development | 10+ | ghostbuster.ts, dev-mode.ts, sunrise.ts, goodnight.ts |
| Testing | 15+ | test-*.ts (governance, finance, operations) |
| Debug Tools | 10+ | debug-*.ts, probe-*.ts |
| Batch Processing | 5 | process-batch-queue.ts, submit-batch-job.ts |

Problems with Current Structure

1. Lack of Organization

  • 180+ files in a single flat directory
  • No clear separation of concerns
  • Difficult to discover and maintain tools

2. No Dependency Management

  • Scripts import from apps/platform directly
  • Circular dependency risks
  • No shared utilities between scripts

3. No Type Safety for Shared Code

  • lib/ and utils/ subdirectories exist but are ad-hoc
  • No consistent patterns for reusable code
  • Duplication across scripts

4. Deployment Challenges

  • Scripts run locally or on Elastic Muscle
  • No bundling or optimization
  • Manual dependency tracking

5. Testing Gaps

  • Scripts are not unit tested
  • No integration tests for tooling
  • Hard to verify script behavior

6. Documentation Scattered

  • Some scripts have inline docs
  • No central registry of tools
  • Hard to know what tools exist

Recommendation: Create apps/devops

Architecture

apps/
├── platform/          # Main application (existing)
├── marketing/         # Marketing site (existing)
└── devops/            # NEW: DevOps tooling application
    ├── src/
    │   ├── commands/  # CLI commands (organized by domain)
    │   │   ├── audit/
    │   │   │   ├── capabilities.ts
    │   │   │   ├── rbac.ts
    │   │   │   └── schema.ts
    │   │   ├── seed/
    │   │   │   ├── gold.ts
    │   │   │   ├── chaos.ts
    │   │   │   └── finance.ts
    │   │   ├── verify/
    │   │   │   ├── finance.ts
    │   │   │   ├── inventory.ts
    │   │   │   └── permissions.ts
    │   │   ├── deploy/
    │   │   │   ├── marketing.ts
    │   │   │   └── sync-env.ts
    │   │   ├── monitor/
    │   │   │   ├── elastic-muscle.ts
    │   │   │   └── infrastructure.ts
    │   │   ├── dev/
    │   │   │   ├── ghostbuster.ts
    │   │   │   ├── sunrise.ts
    │   │   │   └── goodnight.ts
    │   │   └── batch/
    │   │       ├── submit.ts      # Submit jobs to queue
    │   │       ├── status.ts      # Check job status
    │   │       └── cancel.ts      # Cancel pending jobs
    │   ├── batch/     # Batch processing subsystem
    │   │   ├── core/
    │   │   │   ├── job-classes.ts      # A/B/C/D priority classes
    │   │   │   ├── job-scheduler.ts    # Scheduling logic
    │   │   │   ├── queue-manager.ts    # Redis queue operations
    │   │   │   └── executor.ts         # Job execution engine
    │   │   ├── jobs/
    │   │   │   ├── build.ts            # Build job
    │   │   │   ├── deploy-marketing.ts # Marketing deploy job
    │   │   │   ├── test.ts             # Test job
    │   │   │   ├── refactor.ts         # Refactor job
    │   │   │   └── index.ts            # Job registry
    │   │   ├── dispatcher/
    │   │   │   ├── worker.ts           # Job worker process
    │   │   │   ├── scheduler.ts        # Scheduled job processor
    │   │   │   └── monitor.ts          # Health monitoring
    │   │   └── types/
    │   │       ├── job-request.ts
    │   │       ├── job-result.ts
    │   │       └── queue-config.ts
    │   ├── lib/       # Shared utilities
    │   │   ├── firebase/
    │   │   ├── doppler/
    │   │   ├── vercel/
    │   │   ├── redis/
    │   │   ├── logger/
    │   │   └── cli/
    │   ├── types/     # Shared types
    │   └── index.ts   # Main CLI entry point
    ├── package.json
    ├── tsconfig.json
    └── README.md

Benefits

1. Clear Separation of Concerns

  • DevOps tools separate from application code
  • Platform code doesn't mix with tooling
  • Marketing site remains independent

2. Proper Dependency Management

  • DevOps app can depend on @sd/modules/* and @sd/foundation/*
  • No reverse dependencies (platform doesn't depend on devops)
  • Shared utilities properly packaged

3. Better Developer Experience

  • Single CLI entry point: pnpm devops <command>
  • Auto-completion and help text
  • Organized command structure

4. Type Safety

  • Full TypeScript support
  • Shared types between commands
  • Compile-time validation

5. Testing

  • Unit tests for each command
  • Integration tests for workflows
  • CI/CD validation

6. Deployment Options

  • Bundle as standalone CLI
  • Deploy to Elastic Muscle as service
  • Package for distribution

7. Documentation

  • Auto-generated command docs
  • Centralized tool registry
  • Examples and usage guides

Batch Processing Subsystem

The batch processing system is a critical subsystem that deserves dedicated organization.

Current Batch Components

| Component | Purpose | Current Location |
|---|---|---|
| Job Submitter | Submit jobs to queue | scripts/submit-batch-job.ts |
| Job Processor | Execute queued jobs | scripts/process-batch-queue.ts |
| Status Checker | Check job status | scripts/check-batch-status.ts |
| Job Definitions | Actual job logic | scripts/jobs/*.ts |
| Queue Manager | Redis queue operations | scripts/lib/redis-init.ts |
| Batch Client | Client library | scripts/lib/batch-client.ts |

Batch Architecture in DevOps App

apps/devops/src/batch/
├── core/                      # Core batch logic
│   ├── job-classes.ts         # A/B/C/D priority system
│   │   - CRITICAL (A): Production deploys, emergency fixes
│   │   - HIGH (B): Time-sensitive, high-priority features
│   │   - NORMAL (C): Regular builds, tests, refactors
│   │   - LOW (D): Cleanup, maintenance tasks
│   ├── job-scheduler.ts       # Scheduling engine
│   │   - Immediate execution
│   │   - Scheduled execution (cron-like)
│   │   - Recurring jobs
│   ├── queue-manager.ts       # Redis queue operations
│   │   - Priority queues (queue:batch:class-A/B/C/D)
│   │   - Scheduled queue (queue:batch:scheduled)
│   │   - Job lifecycle management
│   └── executor.ts            # Job execution engine
│       - Resource allocation
│       - CPU priority (nice values)
│       - Instance type selection (on-demand vs spot)
│       - Tier selection (micro/mid/extreme)
│
├── jobs/                      # Job implementations
│   ├── build.ts               # Build jobs
│   ├── deploy-marketing.ts    # Marketing deployment
│   ├── test.ts                # Test execution
│   ├── refactor.ts            # Code refactoring
│   ├── lighthouse.ts          # Performance audits
│   └── index.ts               # Job registry & factory
│
├── dispatcher/                # Job dispatching
│   ├── worker.ts              # Worker process
│   │   - Poll queues by priority
│   │   - Execute jobs
│   │   - Update status in Firestore
│   ├── scheduler.ts           # Scheduled job processor
│   │   - Check scheduled queue
│   │   - Move jobs to execution queue
│   ├── monitor.ts             # Health monitoring
│   │   - Worker health checks
│   │   - Queue depth monitoring
│   │   - Cost tracking
│   └── server.ts              # Batch server (runs on Elastic Muscle)
│
└── types/                     # Type definitions
    ├── job-request.ts         # BatchJobRequest interface
    ├── job-result.ts          # BatchJobResult interface
    ├── queue-config.ts        # Queue configuration
    └── job-class-config.ts    # Priority class configuration
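
To make the A/B/C/D priority system and the "poll queues by priority" worker behavior more concrete, here is a minimal sketch. It assumes one FIFO queue per class under the queue:batch:class-<X> key pattern shown above and uses an in-memory Map as a stand-in for Redis. Every name in it (JOB_CLASSES, queueKeyFor, enqueue, pollNext) is hypothetical and not taken from the actual apps/devops code.

```typescript
// Hypothetical sketch: names and shapes are illustrative assumptions.

/** The four priority classes, listed highest priority first. */
const JOB_CLASSES = ["A", "B", "C", "D"] as const;
type JobClass = (typeof JOB_CLASSES)[number];

/** Human-readable labels matching the class descriptions above. */
const JOB_CLASS_LABELS: Record<JobClass, string> = {
  A: "CRITICAL",
  B: "HIGH",
  C: "NORMAL",
  D: "LOW",
};

/** Queue key for a class, following the queue:batch:class-<X> pattern. */
function queueKeyFor(jobClass: JobClass): string {
  return `queue:batch:class-${jobClass}`;
}

/** In-memory stand-in for the Redis lists, keyed by queue name. */
type QueueStore = Map<string, string[]>;

/** Append a job id to the end of its class queue. */
function enqueue(store: QueueStore, jobClass: JobClass, jobId: string): void {
  const key = queueKeyFor(jobClass);
  const queue = store.get(key) ?? [];
  queue.push(jobId);
  store.set(key, queue);
}

/** Take the oldest job from the highest-priority non-empty queue. */
function pollNext(
  store: QueueStore,
): { jobClass: JobClass; label: string; jobId: string } | undefined {
  for (const jobClass of JOB_CLASSES) {
    const jobId = store.get(queueKeyFor(jobClass))?.shift();
    if (jobId !== undefined) {
      return { jobClass, label: JOB_CLASS_LABELS[jobClass], jobId };
    }
  }
  return undefined; // every queue is empty
}

// Usage: a class A job jumps ahead of class C jobs submitted earlier.
const store: QueueStore = new Map();
enqueue(store, "C", "build-1");
enqueue(store, "A", "hotfix-1");
enqueue(store, "C", "build-2");

const firstJob = pollNext(store);  // hotfix-1 (class A)
const secondJob = pollNext(store); // build-1 (class C, submitted first)
```

Strict priority polling like this can starve class D jobs while higher classes stay busy, so a real worker may want an aging or fairness rule on top of it.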

Batch CLI Commands

# Submit jobs
pnpm devops batch submit build --class C --now
pnpm devops batch submit deploy-marketing --class B --now
pnpm devops batch submit refactor --class C --at "2026-01-19T02:00:00Z"

# Check status
pnpm devops batch status <job-id>
pnpm devops batch list --class A
pnpm devops batch list --status running

# Cancel jobs
pnpm devops batch cancel <job-id>
pnpm devops batch cancel --class D --status pending

# Monitor
pnpm devops batch monitor queues
pnpm devops batch monitor workers
pnpm devops batch monitor costs
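
The --now and --at options above imply two paths: immediate jobs go straight to a class queue, while scheduled jobs wait until they are due. The sketch below shows one way that split could work, under the assumption that each scheduled job carries a due time in epoch milliseconds. ScheduledJob, resolveRunAt, and promoteDueJobs are hypothetical names, and a plain array stands in for the queue:batch:scheduled structure.

```typescript
// Hypothetical sketch: field names and helpers are assumptions for illustration.

type JobClass = "A" | "B" | "C" | "D";

interface ScheduledJob {
  jobId: string;
  jobType: string;   // e.g. "build" or "refactor"
  jobClass: JobClass;
  runAtMs: number;   // epoch milliseconds at which the job becomes due
}

/** Convert an --at value to epoch ms; undefined is treated as --now. */
function resolveRunAt(at: string | undefined, nowMs: number): number {
  if (at === undefined) {
    return nowMs;
  }
  const parsed = Date.parse(at);
  if (Number.isNaN(parsed)) {
    throw new Error(`Invalid --at timestamp: ${at}`);
  }
  return parsed;
}

/** Split scheduled jobs into those due at nowMs (earliest first) and the rest. */
function promoteDueJobs(
  scheduled: ScheduledJob[],
  nowMs: number,
): { due: ScheduledJob[]; waiting: ScheduledJob[] } {
  const due = scheduled
    .filter((job) => job.runAtMs <= nowMs)
    .sort((a, b) => a.runAtMs - b.runAtMs);
  const waiting = scheduled.filter((job) => job.runAtMs > nowMs);
  return { due, waiting };
}

// Usage: at midnight UTC, the --now job is due and the 02:00 job keeps waiting.
const midnight = Date.parse("2026-01-19T00:00:00Z");
const scheduledJobs: ScheduledJob[] = [
  { jobId: "j1", jobType: "build", jobClass: "C", runAtMs: resolveRunAt(undefined, midnight) },
  { jobId: "j2", jobType: "refactor", jobClass: "C", runAtMs: resolveRunAt("2026-01-19T02:00:00Z", midnight) },
];
const promotion = promoteDueJobs(scheduledJobs, midnight);
// promotion.due contains j1; promotion.waiting contains j2
```

With Redis, a sorted set scored by due time is a common fit for this kind of scheduled queue, but whether the current queue manager uses one is not stated here.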

Batch Server Deployment

The batch server runs on Elastic Muscle:

# Deploy batch server to Elastic Muscle
pnpm devops deploy batch-server

# Start batch worker
ssh elastic-muscle "cd /opt/singular-dream && pnpm devops batch worker start"

# Monitor batch server
pnpm devops monitor batch-server

Implementation Plan

Phase 1: Setup (Week 1)

  • [ ] Create apps/devops directory structure
  • [ ] Setup package.json with CLI framework (Commander.js or Yargs)
  • [ ] Create base CLI entry point
  • [ ] Setup TypeScript configuration
  • [ ] Add to TurboRepo pipeline

Phase 2: Migrate Core Tools (Week 2)

  • [ ] Migrate dev-mode.ts, ghostbuster.ts, sunrise.ts, goodnight.ts
  • [ ] Create commands/dev/ structure
  • [ ] Extract shared utilities to lib/
  • [ ] Update root package.json scripts to use new CLI

Phase 3: Migrate Audit Tools (Week 3)

  • [ ] Migrate all audit-*.ts scripts
  • [ ] Create commands/audit/ structure
  • [ ] Standardize audit output format
  • [ ] Add tests
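
As a starting point for the "standardize audit output format" item, the sketch below shows one possible shared report shape that every audit command could return, plus a helper that turns a set of reports into a process exit code. AuditFinding, AuditReport, summarize, and exitCodeFor are hypothetical names, not an existing format.

```typescript
// Hypothetical sketch of a shared audit report shape; not an existing format.

type Severity = "info" | "warning" | "error";

interface AuditFinding {
  severity: Severity;
  message: string;
  location?: string; // optional file path or identifier
}

interface AuditReport {
  audit: string; // e.g. "rbac" or "schema"
  findings: AuditFinding[];
}

/** Count findings per severity; a report passes when it has no errors. */
function summarize(report: AuditReport): {
  passed: boolean;
  counts: Record<Severity, number>;
} {
  const counts: Record<Severity, number> = { info: 0, warning: 0, error: 0 };
  for (const finding of report.findings) {
    counts[finding.severity] += 1;
  }
  return { passed: counts.error === 0, counts };
}

/** Exit code for a combined run: 0 if every audit passed, otherwise 1. */
function exitCodeFor(reports: AuditReport[]): number {
  return reports.every((report) => summarize(report).passed) ? 0 : 1;
}

// Usage: warnings alone do not fail the run, a single error does.
const rbacReport: AuditReport = {
  audit: "rbac",
  findings: [{ severity: "warning", message: "Role has no members" }],
};
const schemaReport: AuditReport = {
  audit: "schema",
  findings: [{ severity: "error", message: "Missing schema", location: "finance" }],
};
const rbacOnlyExit = exitCodeFor([rbacReport]);                // 0
const combinedExit = exitCodeFor([rbacReport, schemaReport]);  // 1
```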

Phase 4: Migrate Data Seeders (Week 4)

  • [ ] Migrate all dataseeder-*.seed.ts scripts
  • [ ] Create commands/seed/ structure
  • [ ] Standardize seeding patterns
  • [ ] Add idempotency checks
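
One simple form the idempotency check could take is "create only if the ID is not already present", so that running a seeder twice leaves the data unchanged. The sketch below assumes each seed record has a stable, deterministic ID. InMemoryStore stands in for the real database, and all names are hypothetical.

```typescript
// Hypothetical sketch: an in-memory store stands in for the real database.

interface SeedRecord {
  id: string; // stable, deterministic ID so reruns target the same document
  data: Record<string, unknown>;
}

class InMemoryStore {
  private readonly docs = new Map<string, Record<string, unknown>>();

  has(id: string): boolean {
    return this.docs.has(id);
  }

  set(id: string, data: Record<string, unknown>): void {
    this.docs.set(id, data);
  }

  get size(): number {
    return this.docs.size;
  }
}

/** Write only records whose ID is not already present, and report what happened. */
function seedIdempotently(
  store: InMemoryStore,
  records: SeedRecord[],
): { created: number; skipped: number } {
  let created = 0;
  let skipped = 0;
  for (const record of records) {
    if (store.has(record.id)) {
      skipped += 1;
      continue;
    }
    store.set(record.id, record.data);
    created += 1;
  }
  return { created, skipped };
}

// Usage: the second run creates nothing and skips both records.
const seedStore = new InMemoryStore();
const goldRecords: SeedRecord[] = [
  { id: "org-gold-1", data: { name: "Gold Org" } },
  { id: "user-gold-1", data: { name: "Gold User" } },
];
const firstRun = seedIdempotently(seedStore, goldRecords);  // { created: 2, skipped: 0 }
const secondRun = seedIdempotently(seedStore, goldRecords); // { created: 0, skipped: 2 }
```

Skipping existing records keeps reruns safe but does not repair records that have drifted. An upsert would be the alternative if seeders are expected to reset data.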

Phase 5: Migrate Verification Tools (Week 5-6)

  • [ ] Migrate all verify-*.ts scripts
  • [ ] Create commands/verify/ structure
  • [ ] Standardize verification patterns
  • [ ] Add comprehensive tests

Phase 6: Migrate Infrastructure Tools (Week 7)

  • [ ] Migrate monitoring, deployment, and setup scripts
  • [ ] Create commands/monitor/, commands/deploy/, commands/setup/
  • [ ] Add health check dashboards

Phase 7: Documentation & Polish (Week 8)

  • [ ] Generate command documentation
  • [ ] Create usage guides
  • [ ] Add examples
  • [ ] Update Engineering Handbook

CLI Interface Design

Proposed Command Structure

# Development commands
pnpm devops dev start           # Start dev environment
pnpm devops dev ghostbuster     # Kill zombie processes
pnpm devops dev sunrise         # Morning protocol
pnpm devops dev goodnight       # Evening protocol

# Audit commands
pnpm devops audit capabilities  # Audit capability registry
pnpm devops audit rbac          # Audit RBAC implementation
pnpm devops audit schema        # Audit Zod schemas
pnpm devops audit all           # Run all audits

# Seeding commands
pnpm devops seed gold           # Seed golden data
pnpm devops seed chaos          # Seed chaos data
pnpm devops seed finance        # Seed finance data
pnpm devops seed all            # Seed all data

# Verification commands
pnpm devops verify finance      # Verify finance module
pnpm devops verify inventory    # Verify inventory module
pnpm devops verify all          # Run all verifications

# Deployment commands
pnpm devops deploy marketing    # Deploy marketing site
pnpm devops deploy platform     # Deploy platform
pnpm devops sync env            # Sync environment variables

# Monitoring commands
pnpm devops monitor elastic     # Check Elastic Muscle health
pnpm devops monitor infra       # Check infrastructure
pnpm devops monitor costs       # Check Vercel costs

# Testing commands
pnpm devops test overnight      # Run overnight tests
pnpm devops test sprint         # Run sprint certification
pnpm devops test sanctuary      # Run production health checks
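
The domain/command layout above maps naturally onto a two-level registry. The sketch below illustrates that shape with plain Maps and deliberately avoids any CLI library so it stays self-contained. In the real CLI, Commander.js subcommands would play this role. The handler bodies and the dispatch function are hypothetical.

```typescript
// Hypothetical sketch of two-level command dispatch; not the Commander.js setup itself.

type Handler = (args: string[]) => string;

/** domain -> command -> handler, mirroring "pnpm devops <domain> <command>". */
const registry = new Map<string, Map<string, Handler>>([
  [
    "dev",
    new Map<string, Handler>([
      ["sunrise", () => "running morning protocol"],
      ["goodnight", () => "running evening protocol"],
    ]),
  ],
  [
    "audit",
    new Map<string, Handler>([
      ["rbac", () => "auditing RBAC"],
      ["schema", () => "auditing schemas"],
    ]),
  ],
]);

/** Resolve argv such as ["audit", "rbac"] to a handler and run it. */
function dispatch(argv: string[]): string {
  const [domain, command, ...rest] = argv;
  const group = domain === undefined ? undefined : registry.get(domain);
  if (group === undefined) {
    return `Unknown domain. Available: ${[...registry.keys()].join(", ")}`;
  }
  const handler = command === undefined ? undefined : group.get(command);
  if (handler === undefined) {
    return `Unknown command. Available in ${domain}: ${[...group.keys()].join(", ")}`;
  }
  return handler(rest);
}

// Usage
const sunriseOutput = dispatch(["dev", "sunrise"]); // "running morning protocol"
const unknownOutput = dispatch(["deploy"]);         // "Unknown domain. Available: dev, audit"
```

Keeping each domain in its own map mirrors the commands/<domain>/ directories, so adding a command means touching only one domain.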

Migration Strategy

Backward Compatibility

Keep existing scripts/ directory during migration:

// Root package.json (comments are for illustration; real JSON does not allow them)
{
  "scripts": {
    // Old entry (deprecated; replaced in place because JSON keys must be unique):
    //   "sunrise": "tsx scripts/sunrise.ts"

    // New (preferred)
    "sunrise": "pnpm devops dev sunrise",

    // Alias for transition
    "dev:sunrise": "pnpm devops dev sunrise"
  }
}

Gradual Migration

  1. Week 1-2: Core dev tools (high usage)
  2. Week 3-4: Audit and seed tools (medium usage)
  3. Week 5-6: Verification tools (low usage)
  4. Week 7-8: Infrastructure and monitoring (specialized)

Deprecation Timeline

  • Month 1: Both old and new work
  • Month 2: Warnings added to old scripts
  • Month 3: Old scripts retired from scripts/ and archived under scripts/legacy/
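
For the Month 2 step, each old script could print a warning before doing its usual work. The sketch below shows one way to wrap a legacy entry point. The Deprecation shape, deprecationWarning, and withDeprecation are hypothetical helpers, and the warn callback is injected so the wrapper stays easy to test.

```typescript
// Hypothetical sketch of a deprecation wrapper for scripts in scripts/.

interface Deprecation {
  oldScript: string;    // path of the legacy script
  replacement: string;  // new CLI command to use instead
  removalPhase: string; // e.g. "Month 3" from the timeline above
}

/** Build the warning text shown when a legacy script runs. */
function deprecationWarning(d: Deprecation): string {
  return `[deprecated] ${d.oldScript} will be removed in ${d.removalPhase}. Use "${d.replacement}" instead.`;
}

/** Emit the warning, then run the legacy script body unchanged. */
function withDeprecation<T>(
  d: Deprecation,
  warn: (message: string) => void,
  run: () => T,
): T {
  warn(deprecationWarning(d));
  return run();
}

// Usage: collect warnings in an array here; a real script might pass console.warn.
const warnings: string[] = [];
const legacyResult = withDeprecation(
  {
    oldScript: "scripts/sunrise.ts",
    replacement: "pnpm devops dev sunrise",
    removalPhase: "Month 3",
  },
  (message) => {
    warnings.push(message);
  },
  () => "sunrise complete",
);
```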

Alternative: Package-Based Approach

Instead of apps/devops, could create packages/devops-cli:

Pros

  • Lighter weight
  • Can be imported by other packages
  • Easier to version independently

Cons

  • Not a deployable application
  • Harder to bundle for distribution
  • Less clear separation from library code

Recommendation: Stick with apps/devops for clearer separation and deployment options.


Technical Stack

CLI Framework

Recommendation: Commander.js

  • Ships with TypeScript type definitions
  • Subcommand organization
  • Help text generation
  • Shell auto-completion is not built in and would need an add-on package

Alternatives Considered

  • Yargs: More complex API
  • Oclif: Overkill for internal tools
  • Custom: Reinventing the wheel

Shared Libraries

  • @sd/foundation-firebase - Firebase utilities
  • @sd/foundation-doppler - Doppler integration (new)
  • @sd/foundation-logger - Structured logging (new)
  • @sd/foundation-cli - CLI helpers (new)

Success Metrics

Developer Experience

  • [ ] Time to discover a tool: < 30 seconds
  • [ ] Time to run a command: < 5 seconds
  • [ ] Command success rate: > 95%

Code Quality

  • [ ] Test coverage: > 80%
  • [ ] Type safety: 100% (no any)
  • [ ] Documentation: Every command documented

Maintenance

  • [ ] Reduce script duplication by 50%
  • [ ] Centralize shared utilities
  • [ ] Standardize error handling

Risks & Mitigation

Risk 1: Migration Disruption

Mitigation: Gradual migration with backward compatibility

Risk 2: Learning Curve

Mitigation: Clear documentation, examples, and training

Risk 3: Over-Engineering

Mitigation: Start simple, add complexity only when needed

Risk 4: Dependency Bloat

Mitigation: Careful dependency management, tree-shaking


Recommendation

✅ YES - Create apps/devops

Why?

  1. 180+ scripts is too many for a flat directory
  2. Clear separation between application and tooling
  3. Better DX with organized CLI
  4. Proper testing and type safety
  5. Future-proof for deployment and distribution

Next Steps

  1. Review and approve this plan
  2. Create apps/devops skeleton
  3. Migrate core dev tools (sunrise, ghostbuster, goodnight)
  4. Update root scripts to use new CLI
  5. Continue phased migration

Questions for Review

  1. Naming: Is apps/devops the right name? Alternatives: apps/cli, apps/tools, apps/ops
  2. Scope: Should this include ALL scripts or just TypeScript ones?
  3. Timeline: Is 8 weeks reasonable or should we accelerate/slow down?
  4. CLI Framework: Commander.js vs. alternatives?
  5. Backward Compatibility: How long should we maintain old scripts?