Multi-model consensus swarm orchestration for the Copilot CLI. Launch 50–250+ AI agents across 15 models with Shadow Score Spec L2 validation — from one command.
Learn more and see the website here: dubsopenhub.github.io/swarm-command
Never used the CLI before? No problem.
- Open your terminal
- Paste this:
curl -fsSL https://raw.githubusercontent.com/DUBSOpenHub/swarm-command/main/quickstart.sh | bash
- When Copilot opens, type:
swarm command
Requires an active Copilot subscription.
Swarm Command is for tasks that are too big, risky, or cross-cutting for one model:
- Need one answer from many perspectives? It fans your task out across a layered swarm.
- Need confidence, not vibes? It uses cross-review + consensus scoring.
- Need hidden quality checks? It validates bundles with sealed acceptance criteria.
- Need speed at scale? Designed for parallel execution — agents work simultaneously, not sequentially.
- Need zero setup? No servers, no API keys, no build step.
If your task spans architecture + implementation + testing + docs + integration, this is exactly what Swarm Command is built for.
Swarm Command is a multi-model swarm orchestration skill for the Copilot CLI that launches 50 to 250+ AI agents across 15 different models to solve complex tasks through hierarchical fan-out, cross-family review, and consensus-gated synthesis.
Give it a task — architecture, refactoring, testing, docs, or integration — and it decomposes the mission into domains, dispatches Commanders, Squad Leads, and Workers, validates outputs against sealed acceptance criteria, and synthesizes a final answer from collective intelligence instead of single-model intuition.
One model gives you one perspective.
For small tasks, that's perfect. For high-stakes tasks, it's fragile:
- the model may miss cross-cutting risks,
- the task may exceed one context window,
- the output may sound confident without being complete,
- and you have no independent check that the answer actually satisfies the mission.
Swarm Command solves that by turning one request into a structured swarm process: split, parallelize, review, validate, converge.
Swarm Command, Stampede, and Havoc Hackathon are complementary, not competitors.
| If you need to... | Use | Why |
|---|---|---|
| Solve one complex task with layered consensus inside your current Copilot CLI session | Swarm Command | Best when you want decomposition, cross-model review, shadow validation, and one synthesized answer |
| Run parallel coding workstreams across terminals or branches | Stampede | Best when the goal is execution throughput across independent task lanes |
| Run a many-model tournament to pressure-test ideas and rank options | Havoc Hackathon | Best when you want competitive ideation, elimination rounds, and judged synthesis |
Rule of thumb:
- Choose Swarm Command for consensus execution.
- Choose Stampede for parallel implementation.
- Choose Havoc Hackathon for idea tournaments and comparative judging.
- 🐝 True swarm — 50 to 250+ agents, not 3–5
- 🏗️ 5-layer hierarchy — Nexus → Commander → Squad Lead → Worker → Reviewer
- 🔀 Cross-model diversity — Claude + GPT families mixed within every pod
- 🗳️ Consensus scoring — 4-stage gate-then-rank with CONSENSUS / MAJORITY / CONFLICT tiers
- 👻 Shadow Score — Shadow Score Spec L2 conformance. Sealed acceptance criteria generated before commanders execute, validated after, hardened on failure.
- 🛡️ Depth Guard — 5 laws + 3-layer enforcement prevent runaway agent spawning
- ⚡ Circuit breaker — 3-state FSM with 5-level recovery escalation
- 📉 Parallel by design — agents execute concurrently with hierarchical fan-out and pipeline overlap
- 💰 Cost-controlled — 1024:1 token compression, wave deployment, hard cost ceilings, and cheap workers
- 📦 Zero infrastructure — no servers, no API keys, no build step
Running 250+ agents sounds expensive. It isn't — because every layer is engineered to minimize spend.
Context shrinks at every layer. The Nexus holds 128K tokens; by the time instructions reach a worker, they're 128 tokens. Parents strip rationale, narrow file scope, and tighten constraints so children only receive the bytes they need.
Nexus       128K tokens ──► 4K-token task brief
Commander    64K tokens ──► 2K-token context capsule
Squad Lead   32K tokens ──► 512-token shard
Worker        8K tokens ──► 128-token micro-brief
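The compression figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming "K" means 1,024 tokens; `LAYERS` and `compression_ratio` are illustrative names and not part of Swarm Command itself.

```python
# Budgets copied from the layer list above; "K" is ASSUMED to mean 1,024 tokens.
# LAYERS and compression_ratio are illustrative names, not part of the skill.
K = 1024
LAYERS = [
    ("Nexus", 128 * K, 4 * K),
    ("Commander", 64 * K, 2 * K),
    ("Squad Lead", 32 * K, 512),
    ("Worker", 8 * K, 128),
]


def compression_ratio(context_tokens: int, brief_tokens: int) -> int:
    """Return how many context tokens stand behind each token handed down."""
    return context_tokens // brief_tokens


for name, context, brief in LAYERS:
    print(f"{name}: {context} -> {brief} tokens ({compression_ratio(context, brief)}:1)")

# End to end: the Nexus budget versus the brief a worker finally receives.
end_to_end = compression_ratio(LAYERS[0][1], LAYERS[-1][2])
print(f"Nexus context to worker micro-brief: {end_to_end}:1")
```

Under that assumption the end-to-end ratio works out to 1024:1, which matches the compression figure quoted in the feature list.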
A three-state circuit-breaker FSM (CLOSED → OPEN → HALF-OPEN) monitors every layer. If the failure rate at a layer crosses the 50–60% threshold, the breaker trips: no new agents spawn, costs stop climbing, and a recovery probe must succeed before the swarm resumes.
5-level recovery escalation: Retry → Simplify → Model Swap → Scope Reduce → Graceful Degrade.
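The breaker and its escalation ladder can be pictured as a small state machine. The sketch below is illustrative only: the class, the method names, the 0.5 default threshold (the low end of the quoted 50–60% range), and the `min_samples` guard are assumptions, not the logic defined in protocols/circuit-breaker.md.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal operation, agents may spawn
    OPEN = "open"            # tripped, no new agents
    HALF_OPEN = "half-open"  # a recovery probe is in flight


# Escalation order taken from the README.
RECOVERY_LADDER = ["retry", "simplify", "model_swap", "scope_reduce", "graceful_degrade"]


class CircuitBreaker:
    """Hypothetical breaker; names and defaults are assumptions for illustration."""

    def __init__(self, failure_threshold: float = 0.5, min_samples: int = 4):
        self.failure_threshold = failure_threshold
        self.min_samples = min_samples  # assumption: do not trip on one early failure
        self.state = State.CLOSED
        self.failures = 0
        self.total = 0

    def allow_spawn(self) -> bool:
        # Only a closed breaker lets ordinary agents launch.
        return self.state is State.CLOSED

    def start_probe(self) -> None:
        if self.state is State.OPEN:
            self.state = State.HALF_OPEN

    def record(self, success: bool) -> None:
        if self.state is State.HALF_OPEN:
            # The probe result alone decides whether to resume or stay tripped.
            self.state = State.CLOSED if success else State.OPEN
            self.failures = self.total = 0
            return
        if self.state is State.OPEN:
            return  # results arriving while tripped are ignored in this sketch
        self.total += 1
        self.failures += 0 if success else 1
        rate = self.failures / self.total
        if self.total >= self.min_samples and rate >= self.failure_threshold:
            self.state = State.OPEN


def next_recovery(level: int) -> str:
    """Return the recovery step for a 0-based level, capped at the last one."""
    return RECOVERY_LADDER[min(level, len(RECOVERY_LADDER) - 1)]
```

A caller would check `allow_spawn()` before each launch, call `record()` as results arrive, and use `start_probe()` to move a tripped breaker into HALF-OPEN.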
Agents don't all launch at once. Each pod deploys in three waves with health gates between them:
- Wave 1 (Canary) — 1 agent verifies the task is feasible
- Wave 2 (Probe) — 3 agents test for rate limits and bulk viability
- Wave 3 (Remainder) — full pod only if gates pass
If the canary fails, the full pod never deploys. One cheap test prevents many expensive failures.
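The three-wave rollout can be sketched as a loop that stops at the first failed gate. This is a simplified illustration: `deploy_pod` and `run_agent` are hypothetical names, agents run one after another here rather than in parallel, and the gate rule (every agent in a wave must succeed) is an assumption.

```python
from typing import Callable, Dict, List


def deploy_pod(
    tasks: List[str],
    run_agent: Callable[[str], bool],
    probe_size: int = 3,
) -> Dict[str, object]:
    """Launch a pod in canary / probe / remainder waves, halting on a failed gate."""
    waves = [
        ("canary", tasks[:1]),                    # 1 agent checks feasibility
        ("probe", tasks[1:1 + probe_size]),       # 3 agents check bulk viability
        ("remainder", tasks[1 + probe_size:]),    # everything else
    ]
    launched: List[str] = []
    for name, wave in waves:
        results = [run_agent(task) for task in wave]
        launched.extend(wave)
        # ASSUMED gate rule: any failure in a wave blocks the waves after it.
        if not all(results):
            return {"status": f"halted_after_{name}", "launched": launched}
    return {"status": "complete", "launched": launched}


if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(10)]
    print(deploy_pod(tasks, lambda task: False)["launched"])  # only the canary ran
```

With a failing canary, only one of the ten tasks is ever launched, which is the cost-saving behavior the wave design is after.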
| Guard | What it does |
|---|---|
| Timeout cascade | 90s → 60s → 40s → 30s per layer — children always finish before parents |
| Token ceiling | 128K / 64K / 32K / 8K for L0 / L1 / L2 / L3 |
| Output size cap | 4K / 1K / 512 / 256 tokens for L0 / L1 / L2 / L3 |
| Retry budget | Workers: 0 retries. Squad Leads: 1 retry. |
| Concurrent agent cap | Max 50 agents launching simultaneously |
| Cost ceiling | $5 / $10 / $20 hard cap for SS-50 / SS-100 / SS-250; kills all agents if breached |
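Two of these guards reduce to simple checks: the timeout cascade must shrink at every layer, and spend must stay under the hard cap for the chosen scale. The sketch below uses the README's numbers, but the function names and the choice to abort at exactly the cap are assumptions.

```python
from typing import List

# Values from the README: per-layer timeouts (parent first) and per-scale hard caps.
TIMEOUT_CASCADE_S = [90, 60, 40, 30]
COST_CAPS_USD = {"ss-50": 5.0, "ss-100": 10.0, "ss-250": 20.0}


def cascade_is_valid(timeouts: List[int]) -> bool:
    """True when every child layer has a strictly shorter timeout than its parent."""
    return all(child < parent for parent, child in zip(timeouts, timeouts[1:]))


def must_abort(scale: str, spent_usd: float) -> bool:
    """True when spend has reached the hard cap (ASSUMED: reaching the cap counts)."""
    return spent_usd >= COST_CAPS_USD[scale]


print(cascade_is_valid(TIMEOUT_CASCADE_S))
print(must_abort("ss-100", 5.50))
```

A strictly decreasing cascade is what lets a parent collect or abandon its children's results before its own deadline expires.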
| Scale | Agents | Typical Cost | Hard Cap | Wall-Clock |
|---|---|---|---|---|
| SS-50 | ~36–52 | $2.50 | $5 | ~30s |
| SS-100 | ~89 | $5.50 | $10 | ~45s |
| SS-250 | ~316 | $10 | $20 | ~65–90s |
- Workers are the cheapest models — Haiku and GPT-Mini at L3, 10× cheaper than Opus
- Expensive reasoning stays at the top — Opus and Sonnet only at Commander/Nexus level
- Context compresses monotonically — each layer receives a fraction of its parent's tokens
- Failed work stops early — circuit breakers and canary gates prevent runaway spend
Before the full diagrams, here's the mental model:
- Nexus reads the mission and splits it into domains.
- Commanders own each domain and dispatch sub-work.
- Workers do tiny atomic tasks in parallel.
- Reviewers + Shadow Score decide what survives into the final answer.
You ask one question
↓
Nexus decomposes the mission
↓
Commanders split by domain
↓
Workers execute atomic tasks in parallel
↓
Reviewers score + Shadow Score validates
↓
Nexus emits one final bundle
If you want the visual deep dive, jump to docs/architecture.md or docs/architecture-diagrams.md.
┌─────────────────┐
L0 │ NEXUS (1) │ claude-opus-4.6
│ 128K ctx budget │ Task decomposition + final synthesis
└────────┬────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐
L1 │ CMD-ARCH │ │ CMD-IMPL │ ... │ CMD-INTG │ × 5 Commanders
│ 64K ctx │ │ 64K ctx │ │ 64K ctx │ Domain specialists
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
┌────────┼────────┐ │ │
│ │ │ │ │
┌──┴──┐ ┌──┴──┐ ┌──┴──┐
L2 │SQ-1│ │SQ-2│ ... │SQ-10│ × 10 per Commander = 50 Squad Leads
│32K │ │32K │ │32K │ Micro-task decomposition + canary deploy
└──┬──┘ └──┬──┘ └──┬──┘
│ │ │
┌──┴──┐ ┌──┴──┐ ┌──┴──┐
L3 │W×5 │ │W×5 │ │W×5 │ × 5 per Squad Lead = 250 Workers
│ 8K │ │ 8K │ │ 8K │ Atomic execution (LEAF — no spawning)
└─────┘ └─────┘ └─────┘
┌──────────────┐
L4 │ REVIEWERS×10 │ Cross-review mesh (pipeline overlap)
│ 16K ctx │ 4-axis sealed scoring + consensus tiers
└──────────────┘
+ SHADOW SCORING (sealed acceptance criteria, Shadow Score Spec L2)
T+0s T+2s T+5s T+12s T+45s T+65s T+80s T+90s
│ │ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
┌────┐ ┌──────┐ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌───────┐ ┌────┐ ┌────┐
│NEXUS│→ │CMDs │→ │SQUAD │→ │WORKERS │ │REVIEW │ │MERGE │ │VOTE│ │EMIT│
│BOOT │ │SPAWN │ │LEADS │ │EXECUTE │ │MESH │ │RESULTS│ │ │ │ │
│ │ │ │ │+ CANARY │ │(parallel)│ │(overlap│ │ │ │ │ │ │
│ │ │ │ │VERIFY │ │ │ │start) │ │ │ │ │ │ │
└────┘ └──────┘ └─────────┘ └──────────┘ └────────┘ └───────┘ └────┘ └────┘
2s 3s 7s 33s 20s 15s 10s 5s
CONTEXT DOWN (shrinking) RESULTS UP (compressing)
======================== ========================
L0 Full Task Brief ─── 4K tokens ───► Final Report ◄── 4K tokens
│ ▲
L1 Context Capsule ─── 2K tokens ───► Bundle ◄── 1K tokens
│ ▲
L2 Shard ─── 512 tokens ──► Atom Set ◄── 512 tokens
│ ▲
L3 Micro-Brief ─── 128 tokens ──► Atom ◄── 256 tokens
│ ▲
L4 Review Capsule ─── 1K tokens ───► Score Card ◄── 512 tokens
| Scale | Agents | Commanders | Workers | Reviewers | Best For | Wall-Clock |
|---|---|---|---|---|---|---|
| SS-50 | ~36–52 | 2–3 | 30–45 | 3 | Fast bounded tasks | ~30s |
| SS-100 | ~89 | 5 | 75 | 8 | Multi-file features and reviews | ~45s |
| SS-250 | ~316 | 5 | 250 | 10 | Repo-wide or high-stakes work | ~65–90s |
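The agent totals in the table follow from adding up the roles. The sketch below assumes one Nexus per run and that only SS-250 adds the 50 Squad Leads shown in the hierarchy diagram (the model table lists Squad Leads as SS-250 only); `total_agents` is an illustrative helper.

```python
def total_agents(commanders: int, workers: int, reviewers: int,
                 squad_leads: int = 0, nexus: int = 1) -> int:
    """Sum every role in a run; squad_leads defaults to 0 for the smaller scales."""
    return nexus + commanders + squad_leads + workers + reviewers


ss50_low = total_agents(commanders=2, workers=30, reviewers=3)
ss50_high = total_agents(commanders=3, workers=45, reviewers=3)
ss100 = total_agents(commanders=5, workers=75, reviewers=8)
ss250 = total_agents(commanders=5, workers=250, reviewers=10, squad_leads=50)

print(f"SS-50: {ss50_low} to {ss50_high} agents")
print(f"SS-100: {ss100} agents")
print(f"SS-250: {ss250} agents")
```

Under those assumptions the sums reproduce the table's ~36–52, ~89, and ~316 figures.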
Do you need a fast second opinion on 1–2 files?
→ SS-50
Do you need a serious answer for a multi-file feature or subsystem?
→ SS-100
Do you need repo-wide coverage, compliance-grade review, or maximum consensus?
→ SS-250
Default is SS-100. Say swarm command ss-250 for full deployment or swarm command ss-50 for quick tasks.
See docs/scaling.md for cost breakdowns, chooser guidance, and a deeper decision matrix.
Curated highlights — see docs/use-cases.md for the full gallery.
🔥 Stack Trace Whisperer
swarm command ss-50 "Diagnose this error — 3 most likely root causes with fixes: [paste error]"
Three fast expert panels race on runtime, dependency, and logic hypotheses. You get ranked diagnoses, not a single guess.
🔍 Explain Like I Own It
swarm command ss-50 "I just inherited this codebase. Explain src/core/ — what does each piece do, where are the landmines?"
Great for onboarding: architecture map, event flow, and hidden footguns in one brief.
⚡ Performance Profiler's Shortcut
swarm command ss-50 "Find the performance bottlenecks in this file with optimized versions: [paste hot-path file]"
Ideal when you need a prioritized hit list before opening a profiler.
🔐 Zero-Downtime Auth Rewrite
swarm command "Migrate our session auth to JWT + refresh tokens across API, web app, DB, and tests"
Architecture, implementation, testing, docs, and rollout risk all get separate ownership before synthesis.
🏗️ Legacy Service Extraction
swarm command "Extract the billing module from our monolith into a service with minimal downtime"
Produces migration phases, interface boundaries, contract tests, and rollback paths.
📱 Offline Sync Feature
swarm command "Design offline-first sync for our field app: local cache, conflict resolution, API changes, UX, and tests"
Covers data model, UX states, conflict semantics, and integration testing in parallel.
🛡️ Zero-Day Security Sweep
swarm command ss-250 "Full security audit: every file, every dependency, every injection surface — CVSS-scored vulnerability report"
Best for broad-surface analysis where missing even one category matters.
⚖️ Compliance Fortress
swarm command ss-250 "Audit for GDPR, HIPAA, SOC2, PCI-DSS compliance — every gap, every control, remediation tickets"
Turns a giant policy problem into parallel control checks with one synthesized risk summary.
🗺️ Living Runbook Generator
swarm command ss-250 "Read every service, every pipeline, every config — generate the complete operations manual"
Excellent when tribal knowledge has to become documentation fast.
- "What's the CLI flag for X?" → Ask a single agent
- Rename one variable → Manual edit or single agent
- Prod is down and seconds matter → Follow the human runbook first
- Writing a single-voice email → One persona is better than a committee
- Step-through debugging → Sequential work beats consensus here
mkdir -p ~/.copilot/skills/swarm-command ~/.copilot/agents && \
curl -sL https://raw.githubusercontent.com/DUBSOpenHub/swarm-command/main/skills/swarm-command/SKILL.md \
-o ~/.copilot/skills/swarm-command/SKILL.md && \
curl -sL https://raw.githubusercontent.com/DUBSOpenHub/swarm-command/main/agents/swarm-command.agent.md \
-o ~/.copilot/agents/swarm-command.agent.md && \
echo "✅ Swarm Command installed — open Copilot CLI and type: swarm command"
Verify integrity (optional):
shasum -a 256 ~/.copilot/skills/swarm-command/SKILL.md
shasum -a 256 ~/.copilot/agents/swarm-command.agent.md
💡 Security note: We recommend inspecting quickstart.sh before piping to bash. You can also use the manual install above instead.
git clone https://github.com/DUBSOpenHub/swarm-command.git
cd swarm-command
chmod +x quickstart.sh && ./quickstart.sh
If you're new, read in this order:
- This README — what it is, when to use it, and how to run it
- docs/learning-path.md — beginner, operator, and architect reading tracks
- docs/architecture.md — the conceptual system model
- docs/scaling.md — which scale to choose and what it costs
- docs/use-cases.md — vivid prompts and expected outcomes
- docs/consensus.md + docs/shadow-scoring.md — the deep mechanics
- I just want to try it: README → install → run swarm command
- I want to operate it well: README → learning path → scaling → use cases
- I want to understand the design: README → architecture → consensus → shadow scoring
Swarm Command needs no API keys or extra infrastructure. It runs through your active Copilot subscription, with no separate servers, queues, or key management required.
Use SS-50 for bounded, fast tasks. Use SS-100 for most real software work. Use SS-250 when the task is repo-wide, high-stakes, or needs maximum coverage and consensus.
Append a personality mode after the scale to adjust how the swarm operates:
swarm command ss-100 thorough "audit auth module"
swarm command ss-250 fast "quick scan of README"
| Mode | Workers | Timeout | Models | Retry | Best For |
|---|---|---|---|---|---|
| balanced (default) | 5 per squad | 1.0× | mixed | 1 | Most tasks |
| thorough | 5 per squad | 1.5× | opus/sonnet | 2 | High-stakes, complex analysis |
| fast | 3 per squad | 0.6× | haiku only | 0 | Quick iteration, cost-sensitive |
| creative | 4 per squad | 1.0× | max diversity | 1 | Brainstorming, novel problems |
| cautious | 5 per squad | 1.2× | sonnet | 2 | Ambiguous tasks, high conflict risk |
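To see what the timeout multipliers could mean in practice, the sketch below applies each mode's multiplier to the 90s → 60s → 40s → 30s cascade. Applying the multiplier uniformly to every layer is an assumption, as are the `MODES` layout and the `scaled_cascade` helper; the README does not spell out how the multiplier is applied.

```python
from typing import Dict, List

BASE_CASCADE_S = [90, 60, 40, 30]  # per-layer timeouts from the guard table

# Mode values copied from the table above; the dict layout is illustrative.
MODES: Dict[str, Dict[str, float]] = {
    "balanced": {"workers_per_squad": 5, "timeout_mult": 1.0, "retries": 1},
    "thorough": {"workers_per_squad": 5, "timeout_mult": 1.5, "retries": 2},
    "fast": {"workers_per_squad": 3, "timeout_mult": 0.6, "retries": 0},
    "creative": {"workers_per_squad": 4, "timeout_mult": 1.0, "retries": 1},
    "cautious": {"workers_per_squad": 5, "timeout_mult": 1.2, "retries": 2},
}


def scaled_cascade(mode: str) -> List[float]:
    """ASSUMPTION: the mode multiplier scales every layer's timeout equally."""
    mult = MODES[mode]["timeout_mult"]
    return [round(t * mult, 1) for t in BASE_CASCADE_S]


for mode in MODES:
    print(mode, scaled_cascade(mode))
```

Because every layer is scaled by the same factor, the cascade stays strictly decreasing in every mode, so children would still time out before their parents.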
Swarm Command uses multiple model families because diversity helps. Different model families catch different failure modes, and mixing them intentionally means agreement counts for more than self-consistency.
Disagreement is preserved, scored, and escalated. Squad Leads and Commanders mark results as CONSENSUS, MAJORITY, CONFLICT, or UNIQUE, then Nexus arbitrates the unresolved pieces.
Shadow Score is a hidden acceptance test: criteria are generated before execution, kept sealed from the swarm, then used to validate outputs afterward.
Swarm Command can produce plans, analyses, patches, documentation, tests, and rollout guidance depending on how you invoke it, but the point is not blind automation. The point is reviewable, consensus-backed output.
Avoid Swarm Command for tiny edits, urgent incident response where every second matters, or tasks that need one strong voice rather than many perspectives.
Swarm Command came out of a simple question: what if one Copilot CLI session could behave less like one assistant and more like a disciplined organization?
The design evolved from SwarmSpeed 250 experiments into a layered system with:
- a single Nexus orchestrator,
- domain-owning Commanders,
- decomposing Squad Leads,
- leaf-node Workers,
- and independent Reviewers.
The turning point was a self-analysis run later documented in docs/shadow-scoring.md: sealed judges rated a design highly even though it contained critical arithmetic errors. That exposed a core truth of multi-agent systems: review alone is not validation.
That failure drove the big ideas that now define this repo:
- Shadow scoring so hidden criteria can catch what the swarm forgot to optimize for
- Depth Guard so recursion never turns into agent explosion
- Token compression so higher-level intent survives while lower layers stay cheap
- Cross-family review so agreement means more than “the same model said it twice”
In other words: Swarm Command is not just a big swarm. It is a swarm that learned from its own failure modes.
See what a completed swarm run looks like → Example Output
🐝 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
S W A R M C O M P L E T E
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## Results Summary
- Domains completed: 5/5
- Consensus tier: CONSENSUS (4) · MAJORITY (1)
- Overall confidence: 0.77
- Agents deployed: 89
- Wall-clock time: 72s
- Shadow Score: 20.0% 🟡 Moderate (8 pass · 2 fail)
Swarm Command implements Shadow Score Spec L2 conformance — sealed acceptance criteria generated before commanders execute, validated after, hardened on failure.
Formula: Shadow Score = (sealed_failures / sealed_total) × 100
| Shadow Score | Level | Action |
|---|---|---|
| 0% | ✅ Perfect | All sealed criteria passed |
| 1–15% | 🟢 Minor | Proceed normally |
| 16–30% | 🟡 Moderate | Attach Gap Report, warn |
| 31–50% | 🟠 Significant | Quarantine bundle, hardening cycle |
| > 50% | 🔴 Critical | Reject bundle from synthesis |
Sealed-envelope protocol:
- Phase 1.5 — Nexus generates sealed acceptance criteria from the task
- Phases 2–5 — Commanders execute without seeing those criteria
- Phase 6 — Validate outputs, compute Shadow Score, produce Gap Report
- Hardening — If score > 15%, share failure messages only for one fix cycle
See docs/shadow-scoring.md for the full protocol.
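The formula and level table above translate directly into code. This is a minimal sketch: the function names are illustrative, and treating fractional scores between bands (for example 15.5%) as belonging to the higher band is an assumption, since the table lists whole-number ranges only.

```python
def shadow_score(sealed_failures: int, sealed_total: int) -> float:
    """Shadow Score = (sealed_failures / sealed_total) x 100."""
    if sealed_total <= 0:
        raise ValueError("sealed_total must be positive")
    return 100 * sealed_failures / sealed_total


def shadow_level(score: float) -> str:
    """Map a score to the level names in the table (fractional edges are ASSUMED)."""
    if score == 0:
        return "Perfect"
    if score <= 15:
        return "Minor"
    if score <= 30:
        return "Moderate"
    if score <= 50:
        return "Significant"
    return "Critical"


def needs_hardening(score: float) -> bool:
    """The protocol triggers one fix cycle when the score exceeds 15%."""
    return score > 15


score = shadow_score(sealed_failures=2, sealed_total=10)
print(score, shadow_level(score), needs_hardening(score))
```

The sample values (8 pass, 2 fail) mirror the example run summary, which reports a 20.0% Moderate score.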
A 4-stage consensus pipeline merges the best work from hundreds of agents:
- Worker Self-Score — Each worker emits confidence + self-score with its atom
- Squad Lead Local Merge — Groups atoms by sub-task, classifies as CONSENSUS / MAJORITY / CONFLICT
- Commander Domain Merge — Trimmed mean across squads, applies the consensus formula
- Nexus Cross-Domain Synthesis — Median-of-3 judging and final arbitration
Consensus formula:
score = 0.40 × confidence + 0.30 × evidence + 0.15 × scope + 0.15 × coverage − min(0.30, conflict_rate × 0.30)
| Tier | Condition | Action |
|---|---|---|
| CONSENSUS | ≥ 70% agreement | Auto-accept |
| MAJORITY | ≥ 50% agreement | Accept with dissent note |
| CONFLICT | < 50% agreement | Nexus arbitration |
| UNIQUE | No overlap | Keep if evidence ≥ 7/10 |
See docs/consensus.md for the full mechanics.
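The consensus formula and tier table can be exercised with a short sketch. It assumes all five inputs are normalized to the range 0 to 1 and that a result with no overlapping peers is represented by `None`; the function names are illustrative, not taken from the skill.

```python
from typing import Optional


def consensus_score(confidence: float, evidence: float, scope: float,
                    coverage: float, conflict_rate: float) -> float:
    """Weighted score from the README formula; inputs ASSUMED to lie in [0, 1]."""
    penalty = min(0.30, conflict_rate * 0.30)
    return (0.40 * confidence + 0.30 * evidence
            + 0.15 * scope + 0.15 * coverage - penalty)


def consensus_tier(agreement: Optional[float]) -> str:
    """Classify by agreement fraction; None stands for 'no overlap' (assumption)."""
    if agreement is None:
        return "UNIQUE"
    if agreement >= 0.70:
        return "CONSENSUS"
    if agreement >= 0.50:
        return "MAJORITY"
    return "CONFLICT"


def keep_unique(evidence_out_of_10: float) -> bool:
    """UNIQUE results survive only with evidence of at least 7 out of 10."""
    return evidence_out_of_10 >= 7


print(round(consensus_score(0.9, 0.8, 0.7, 0.6, 0.2), 3))
print(consensus_tier(0.72), consensus_tier(0.55), consensus_tier(0.4), consensus_tier(None))
```

Note that with a conflict rate capped at 1.0 the penalty never exceeds 0.30, so the `min` acts as a safety bound rather than changing typical results.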
All tunables live in config.yml. Key settings:
consensus:
threshold_consensus: 0.70
threshold_majority: 0.50
depth_guard:
max_spawn_depth: 3
max_workers_per_squad_lead: 5
circuit_breaker:
timeout_cascade: [90, 60, 40, 30]
shadow_scoring:
enabled: true
spec_version: "1.0.0"
conformance_level: "L2"
sealed_criteria_count: 10 # max; per-scale: SS-50=6, SS-100=8, SS-250=10
hardening:
enabled: true # SS-50 overrides to disabled
    threshold: 15
See docs/scaling.md for full scaling configuration and cost estimates.
| Role | Models |
|---|---|
| Nexus | claude-opus-4.6 |
| Commanders (pool: 9) | claude-opus-4.6, claude-opus-4.5, claude-opus-4.6-1m, claude-sonnet-4.6, claude-sonnet-4.5, claude-sonnet-4, gpt-5.4, gpt-5.2, gpt-5.1 |
| Squad Leads (SS-250 only) | claude-haiku-4.5, gpt-5.4-mini |
| Workers (pool: 6) | claude-haiku-4.5, gpt-5.4-mini, gpt-5-mini, gpt-4.1, gpt-5.3-codex, gpt-5.2-codex |
| Reviewers (7 pairs) | claude-opus-4.6↔gpt-5.4, claude-opus-4.5↔gpt-5.2, claude-opus-4.6-1m↔gpt-5.1, claude-sonnet-4.6↔gpt-5.3-codex, claude-sonnet-4.5↔gpt-5.2-codex, claude-sonnet-4↔gpt-5.4-mini, claude-haiku-4.5↔gpt-5-mini |
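The reviewer pairing can be sanity-checked for the cross-family property. The pairs below are copied from the table; inferring a model's family from the prefix before the first hyphen is an assumption made for this sketch.

```python
# Reviewer pairs copied from the model table above.
REVIEWER_PAIRS = [
    ("claude-opus-4.6", "gpt-5.4"),
    ("claude-opus-4.5", "gpt-5.2"),
    ("claude-opus-4.6-1m", "gpt-5.1"),
    ("claude-sonnet-4.6", "gpt-5.3-codex"),
    ("claude-sonnet-4.5", "gpt-5.2-codex"),
    ("claude-sonnet-4", "gpt-5.4-mini"),
    ("claude-haiku-4.5", "gpt-5-mini"),
]


def family(model: str) -> str:
    """ASSUMPTION: the family is the text before the first hyphen."""
    return model.split("-", 1)[0]


def is_cross_family(pair: tuple) -> bool:
    left, right = pair
    return family(left) != family(right)


all_cross_family = all(is_cross_family(pair) for pair in REVIEWER_PAIRS)
print(f"{len(REVIEWER_PAIRS)} pairs, all cross-family: {all_cross_family}")
```

Every listed pair matches a Claude model with a GPT model, which is what makes agreement between reviewers more informative than one family agreeing with itself.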
swarm-command/
├── README.md # Overview, install, comparison, FAQ
├── AGENTS.md # Agent/skill descriptions
├── CONTRIBUTING.md # Contribution guidelines
├── catalog.yml # Skill metadata
├── config.yml # All tunables
├── LICENSE # MIT
├── SECURITY.md # Security policy
├── quickstart.sh # One-line installer
├── .github/
│ ├── copilot-instructions.md # AI agent instructions for this repo
│ ├── workflows/ci.yml # CI: YAML lint + SKILL.md sync check
│ └── skills/swarm-command/SKILL.md # Skill discovery path
├── agents/
│ └── swarm-command.agent.md # Standalone agent version
├── skills/swarm-command/
│ └── SKILL.md # Core skill
├── templates/
│ ├── commander.md # Commander prompt template
│ ├── worker.md # Worker prompt template
│ ├── reviewer.md # Cross-reviewer prompt template
│ └── squad-lead.md # Squad Lead prompt template
├── protocols/
│ ├── depth-guard.md # 5 Laws + 3-layer enforcement
│ ├── circuit-breaker.md # 3-state FSM + 5-level recovery
│ ├── context-capsule.md # JSON schemas for data structures
│ └── meta-reviewer.md # Reviewer quality gate protocol
└── docs/
├── architecture.md # Architecture overview
├── architecture-diagrams.md # Mermaid diagrams
├── consensus.md # Consensus algorithm deep dive
├── example-output.md # Sample completed swarm run output
├── learning-path.md # Recommended reading order
├── scaling.md # Scale chooser + cost estimates
├── shadow-scoring.md # Shadow scoring protocol
└── use-cases.md # Expanded prompt gallery
MIT — use it, fork it, build on it.
This project implements Shadow Score Spec L2 — sealed acceptance criteria generated before execution, validated after, hardened on failure.
🐙 Created with 💜 by @DUBSOpenHub with the GitHub Copilot CLI.