ADR-001: Multi-Agent Safety Verification Pipeline

Last updated 25 Jan 2026, 15:10

Status

Accepted

Context

ZenCursor needs to verify commands before execution to prevent disasters like the 2026-01-23 incident, in which a mistyped rsync command wiped production data. Any single-point verification system could miss edge cases or be bypassed.

We need a robust safety system that:

  1. Catches obviously dangerous patterns quickly (rm -rf /, etc.)
  2. Understands context and intent
  3. Provides alternative suggestions
  4. Has defense in depth through multiple verification stages

Decision

Implement a multi-agent pipeline for command verification:

Command Input
       │
       ▼
┌──────────────┐
│ Safety Agent │ ◄─── Local pattern matching (free, instant)
│   (Local)    │      Catches: rm -rf, format, dd, etc.
└──────┬───────┘
       │ If risky but not critical
       ▼
┌──────────────┐
│ Coder Agent  │ ◄─── Haiku (cheap, fast)
│   (Haiku)    │      Proposes safer alternatives
└──────┬───────┘
       │
       ▼
┌──────────────┐
│Reviewer Agent│ ◄─── Sonnet (thorough)
│   (Sonnet)   │      Final security review
└──────┬───────┘
       │
       ▼
    Decision
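
A minimal orchestration sketch of this flow in Python. The RiskLevel and Verdict types and the agent method names (classify, propose_alternative, review) are illustrative assumptions, not ZenCursor's actual API:

from dataclasses import dataclass
from enum import Enum, auto

# Illustrative types; the names are assumptions, not ZenCursor's real API.
class RiskLevel(Enum):
    SAFE = auto()
    RISKY = auto()     # risky but not critical: escalate to the LLM agents
    CRITICAL = auto()  # fail fast: block with no further processing

@dataclass
class Verdict:
    approved: bool
    reason: str
    alternative: str | None = None  # safer command from the Coder Agent

def verify(command: str, safety, coder, reviewer) -> Verdict:
    """Run the three-stage pipeline shown in the diagram above."""
    risk = safety.classify(command)  # local pattern matching: free, instant
    if risk is RiskLevel.SAFE:
        return Verdict(True, "no dangerous pattern matched")
    if risk is RiskLevel.CRITICAL:
        return Verdict(False, "critical pattern: blocked before any LLM call")
    # Risky but not critical: Haiku proposes a safer alternative ...
    alternative = coder.propose_alternative(command)
    # ... and Sonnet makes the final approve/deny decision in context.
    review = reviewer.review(command, alternative)
    return Verdict(review.approved, review.reason, alternative)

The fail-fast branch is the key design choice here: critical commands are blocked by the local stage and never reach an LLM.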

Agent Responsibilities

  1. Safety Agent (Local)

    • Pattern matching against known dangerous commands (sketched after this list)
    • Zero cost, instant response
    • Catches 90%+ of obvious threats
    • Critical threats fail fast (no further processing)
  2. Coder Agent (Haiku)

    • Only invoked for risky but non-critical commands
    • Proposes safer alternatives
    • Explains why original is risky
    • Cost: ~$0.001 per check
  3. Reviewer Agent (Sonnet)

    • Final security review
    • Considers context and intent
    • Makes approval/denial decision
    • Cost: ~$0.01 per check
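
A sketch of the Safety Agent's local check. The patterns below are examples built from this ADR's threat list (rm -rf, dd, format); the mkfs reading of "format", the regexes themselves, and the string risk labels are all assumptions:

import re

# Example patterns only; a real deny-list would be broader and maintained.
CRITICAL_PATTERNS = [
    r"\brm\s+-[a-z]*r[a-z]*f[a-z]*\s+/\s*$",  # rm -rf /
    r"\bdd\s+.*\bof=/dev/(sd|nvme)",          # dd writing to a raw disk device
    r"\bmkfs(\.\w+)?\b",                      # formatting a filesystem
]
RISKY_PATTERNS = [
    r"\brm\s+-[a-z]*r",                       # recursive delete anywhere
    r"\brsync\b.*--delete",                   # rsync --delete can wipe a destination tree
]

def classify(command: str) -> str:
    """Zero-cost local check: critical fails fast, risky escalates to LLMs."""
    for pattern in CRITICAL_PATTERNS:
        if re.search(pattern, command):
            return "critical"  # no further processing
    for pattern in RISKY_PATTERNS:
        if re.search(pattern, command):
            return "risky"     # hand off to the Coder and Reviewer Agents
    return "safe"

Keeping this stage to pure pattern matching is what keeps it free and instant; anything requiring context or intent is deliberately deferred to the LLM stages.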

Consensus Requirement

By default, all agents must agree for a command to be approved. This can be relaxed for specific use cases.
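
A minimal sketch of that consensus check, assuming each agent's decision reduces to a boolean verdict. The function name and the majority-vote fallback are illustrative; the ADR only specifies the unanimous default:

def is_approved(verdicts: list[bool], require_unanimous: bool = True) -> bool:
    """Unanimous by default; relaxed mode falls back to a simple majority."""
    if not verdicts:
        return False  # no agent ran: deny by default
    if require_unanimous:
        return all(verdicts)
    return sum(verdicts) > len(verdicts) / 2

# Example: Safety Agent approved, Coder Agent flagged, Reviewer approved.
is_approved([True, False, True])                           # False (unanimous)
is_approved([True, False, True], require_unanimous=False)  # True (2 of 3)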

Consequences

Positive

  • Defense in depth - multiple verification layers
  • Cost-effective - cheap/free agents filter most traffic
  • Flexible - can adjust thresholds per environment
  • Auditable - each agent's decision is logged
  • Extensible - can add more agents (e.g., domain-specific)

Negative

  • Latency - full pipeline takes 2-5 seconds
  • Complexity - more moving parts
  • Cost - Sonnet calls add up for heavy users
  • False positives - a conservative system may block legitimate commands

Neutral

  • Requires API keys for Haiku/Sonnet
  • Local-only mode degrades to pattern matching only (see the sketch below)
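
One way that degradation could be wired up. The environment variable name and the stage functions are hypothetical; the ADR only states that the pipeline falls back to pattern matching when no API keys are configured:

import os

def local_pattern_check(command: str) -> bool:
    """Stand-in for the Safety Agent's matcher; always available, zero cost."""
    return "rm -rf /" not in command

def llm_review(command: str) -> bool:
    """Placeholder for the Haiku/Sonnet stages; requires an API key."""
    raise NotImplementedError("needs Anthropic API access")

def build_pipeline() -> list:
    stages = [local_pattern_check]           # local-only baseline
    if os.environ.get("ANTHROPIC_API_KEY"):  # env var name is an assumption
        stages.append(llm_review)            # LLM stages only when keyed
    return stages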

Alternatives Considered

Alternative 1: Single LLM Verification

Use one powerful model (Opus) for all verification.

Rejected because:

  • Too expensive for every command
  • Single point of failure
  • Overkill for obvious patterns

Alternative 2: Rule-Based Only

Use only pattern matching without LLM involvement.

Rejected because:

  • Cannot understand context/intent
  • Cannot suggest alternatives
  • Misses novel attack patterns
  • Too many false positives/negatives

Alternative 3: User-Only Confirmation

Just ask the user to confirm dangerous commands.

Rejected because:

  • Users habitually click through confirmations
  • Doesn't prevent social engineering
  • No learning or adaptation
