Troubleshooting

23 failure modes with diagnostics + recovery.

source: plugins/endenza/TROUBLESHOOTING.md

Twenty-three sections: the first fifteen are ordered by how often you’re likely to hit each, and §§16–23 were added in the 2026-04-26 v2.0 contract sweep. Every section follows the same shape: Symptom → Diagnostic → Recovery. If a recovery step involves a command, it’s a play-button you can click.

Mobile access: see the PWA Terminal page (/terminal.html) — that is the supported mobile-control surface. Termius / Tailscale / mobile_dispatch.sh were removed in v1.0.


1. git clone fails during install with auth error

Symptom: installer step 3 halts with remote: Repository not found or Authentication failed for 'https://github.com/...'.

Diagnostic:

gh auth status
gh repo view KiwiMaddog2020/endenza

Recovery:
  • Not logged in → gh auth login and pick HTTPS / GitHub.com / paste token or browser.
  • Logged in as the wrong account → gh auth logout, then gh auth login again.
  • Repo is private and you’re not a collaborator → ping the user for an invite before re-running the installer.


2. Tools blocked with ORCHESTRATION LOCK: message

Symptom: MCP VM tool calls (computer-use, Claude_Preview, Claude_in_Chrome) fail with ORCHESTRATION LOCK: mode=direct or ORCHESTRATION LOCK: autopilot held by '<slug>'.

This is correct behavior. The hook is doing its job.

Diagnostic:

jq '.mode, .mode_transitioning, .active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • mode=direct → say ***ORCHESTRATOR ON*** verbatim in a chat to flip back to orchestrated.
  • mode_transitioning=true → shutdown cascade in progress; wait 30 s then re-check. If stuck, see §10.
  • active_autopilot_chat != null and != you → another Agent holds the lock. Wait for ***AUTOPILOT COMPLETE*** or read the holder’s status file to see what they’re doing.
  • Lock held by you but shouldn’t be → crash recovery. Force-release (see §4).
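
To see all three branch fields at a glance, a convenience triage sketch (not part of the shipped tooling; same paths as the diagnostic above):

STATE=${CLAUDE_PLUGIN_DATA}/state.json
mode=$(jq -r '.mode' "$STATE")
trans=$(jq -r '.mode_transitioning' "$STATE")
holder=$(jq -r '.active_autopilot_chat // "none"' "$STATE")
echo "mode=$mode transitioning=$trans lock-holder=$holder"   # match against the bullets above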


3. [ORCHESTRATOR] ... MISSING at SessionStart

Symptom: new chat’s context shows [ORCHESTRATOR] hard-enforcement hook MISSING at ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh or ... INSTALLED but not registered in ....

Diagnostic:

ls -la ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh
jq '.hooks.PreToolUse' ~/.claude/settings.json

Recovery:
  • Script missing → re-run the installer, or git pull in ${CLAUDE_PLUGIN_ROOT} and chmod +x bin/*.sh.
  • Script present, not in settings → copy templates/CLAUDE_CODE_SETTINGS.example.json contents into ~/.claude/settings.json, merging any existing hooks you want to keep (a jq merge sketch follows this list).
  • Script present, registered, but healthcheck still says MISSING → bug. Paste the healthcheck output into this chat for diagnosis.
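
For the merge case, one non-destructive shape (a sketch; the matcher value here is illustrative, so copy the real one from the template file):

jq --arg cmd "${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh" \
  '.hooks.PreToolUse = ((.hooks.PreToolUse // []) + [{matcher: "mcp__.*", hooks: [{type: "command", command: $cmd}]}])' \
  ~/.claude/settings.json > /tmp/settings.json && mv /tmp/settings.json ~/.claude/settings.json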


4. Autopilot lock stuck

Symptom: jq '.active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json shows a slug, but the chat that held it is closed or unresponsive for > 45 min.

Diagnostic:

cat ${CLAUDE_PLUGIN_DATA}/state.lock.d/holder.json 2>/dev/null
stat -f '%Sm' ${CLAUDE_PLUGIN_DATA}/state.lock.d 2>/dev/null

Recovery:
  • Wait 15 min → the stale_lock_sweeper.sh job (if loaded via launchd) auto-releases locks older than 45 min. Tail /tmp/orchestrator-sweeper.out to watch.
  • Force-release immediately:

rm -rf ${CLAUDE_PLUGIN_DATA}/state.lock.d && jq '.active_autopilot_chat=null | .active_vm=null | .lock_acquired_at=null' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json && echo "lock force-released"
  • Log the manual release so the audit trail is complete:
echo "$(date -u +%FT%TZ) MANUAL_STEAL by kevin — stuck chat recovery" >> ${CLAUDE_PLUGIN_DATA}/lock_steals.log

5. settings.json merge conflict during install

Symptom: installer step 5 halts with “existing PreToolUse matcher conflicts with ours” or jq parse error.

Diagnostic:

jq '.hooks.PreToolUse[].matcher' ~/.claude/settings.json

Recovery:
  • Existing identical matcher → your previous hook is already there. Safe to skip step 5.
  • Existing different command on the same matcher → rename ours to use a unique matcher, or merge the two commands into a single shell script (see the wrapper sketch below). Ping me with the diff and I’ll propose a merge.
  • Invalid JSON → back up + start fresh:

mv ~/.claude/settings.json ~/.claude/settings.json.broken-$(date +%s) && cp ${CLAUDE_PLUGIN_ROOT}/templates/CLAUDE_CODE_SETTINGS.example.json ~/.claude/settings.json
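
For the merge-into-one-script route, a wrapper sketch, assuming both hooks follow the same stdin-JSON / exit-code contract (the path to your existing hook is a placeholder):

#!/usr/bin/env bash
# merged_pretooluse.sh: point the shared matcher's command at this file.
input=$(cat)                                                # the hook payload arrives as JSON on stdin
echo "$input" | /path/to/your-existing-hook.sh || exit $?   # if either hook blocks, the block wins
echo "$input" | ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh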

6. Scheduled routine didn’t fire overnight

Symptom: expected briefing at ${CLAUDE_PLUGIN_DATA}/briefings/YYYY-MM-DD.md but file doesn’t exist.

Diagnostic:

# Did the Mac sleep past the fire time?
pmset -g log | grep -i "wake\|sleep" | tail -20
# Is the Desktop app running?
pgrep -f "Claude.app" > /dev/null && echo "app running" || echo "app NOT running"
# What's scheduled?
ls -la ~/.claude/scheduled-tasks/

Recovery:
  • Mac was asleep → Desktop scheduled tasks only fire when the Mac is awake and the app is running. Enable “Keep computer awake” in Desktop app settings, and don’t close the lid overnight (a caffeinate stopgap follows this list).
  • App wasn’t running → open Claude Desktop; the 7-day catch-up may trigger one replay. Otherwise the fire is lost; wait for tomorrow.
  • Mode was direct at the scheduled time → the routine no-oped silently by design. Flip mode back with ***ORCHESTRATOR ON***.
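
As a stopgap on nights you can’t change the lid habit, macOS’s caffeinate can hold the system awake for a fixed window (note: -s only holds off sleep while on AC power):

# keep the system awake for 8 hours, then release
caffeinate -s -t $((8 * 3600)) &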


7. state.json is corrupt or missing

Symptom: jq . ${CLAUDE_PLUGIN_DATA}/state.json fails; hook healthcheck reports state unreadable; tools fail-closed.

Diagnostic:

ls -la ${CLAUDE_PLUGIN_DATA}/state.json*
cat ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • Restore from the latest backup snapshot, if backup_snapshot.sh has been running (a sketch for picking the newest snapshot automatically follows this list):

ls -la ${CLAUDE_PLUGIN_ROOT}/backups/ | tail -5
cp ${CLAUDE_PLUGIN_ROOT}/backups/<most-recent>/state.json ${CLAUDE_PLUGIN_DATA}/state.json
jq . ${CLAUDE_PLUGIN_DATA}/state.json && echo "restored"
  • Or re-initialize from schema:
# Unquoted EOF so the $(date ...) in "notes" expands at write time.
cat > ${CLAUDE_PLUGIN_DATA}/state.json <<EOF
{"schema_version":1,"mode":"orchestrated","mode_transitioning":false,"last_mode_change":null,"last_mode_change_reason":null,"active_autopilot_chat":null,"active_vm":null,"lock_acquired_at":null,"current_task":null,"orchestrator_automations":[],"automation_registry":[],"cascade":null,"notes":"Re-initialized $(date -u +%FT%TZ) after corruption."}
EOF
  • Then re-register any automation_registry entries; run bin/status.sh to confirm the restored state looks sane.
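
To grab the newest snapshot without eyeballing the ls output (a convenience sketch, assuming the backups/<timestamp>/state.json layout above):

latest=$(ls -dt ${CLAUDE_PLUGIN_ROOT}/backups/*/ | head -1)
cp "${latest}state.json" ${CLAUDE_PLUGIN_DATA}/state.json && jq . ${CLAUDE_PLUGIN_DATA}/state.json && echo "restored from ${latest}"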

8. Launchd sweeper not firing

Symptom: locks older than 45 min sit un-stolen; lock_steals.log has no recent entries.

Diagnostic:

launchctl list | grep orchestrator
ls -la /tmp/orchestrator-sweeper.out /tmp/orchestrator-sweeper.err
tail -20 /tmp/orchestrator-sweeper.err 2>/dev/null

Recovery:
  • Not loaded → load it:

launchctl load ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist
  • Loaded but erroring → check stderr; common causes: python3 missing, script not executable (chmod +x ${CLAUDE_PLUGIN_ROOT}/bin/stale_lock_sweeper.sh), or the plist pointing at a stale path after a git pull moved files.
  • Reload after edits:
launchctl unload ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist 2>/dev/null
launchctl load ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist
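
To catch the stale-path case quickly, print what the plist actually points at and test it (plutil ships with macOS):

plutil -p ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist | grep -A3 ProgramArguments
# then confirm each path printed above still exists and is executable, e.g.:
test -x ${CLAUDE_PLUGIN_ROOT}/bin/stale_lock_sweeper.sh && echo ok || echo "stale path"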

9. Permission prompt blocking automated work

Symptom: a scheduled routine or Channels-session request stalls on a Claude Code permission dialog that nobody’s there to click.

Diagnostic: check the session in claude.ai/code scheduled tasks sidebar — running tasks that pause for permission show a pending-approval state.

Recovery:
  • Click “Run now” once in the Scheduled sidebar with you at the keyboard — approvals from that run are saved to the task and auto-applied to future fires.
  • Or add the command to the starter allow-list by editing ~/.claude/settings.json:

{"permissions": {"allow": ["Bash(your-command *)"]}}
  • For Channels sessions, start them with --dangerously-skip-permissions — the PreToolUse hook still blocks VM tools (hooks run before perm check), so bypassing prompts does not bypass safety.
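
If settings.json already has an allow-list, append rather than overwrite (a jq sketch; Bash(your-command *) is the placeholder from above):

jq '.permissions.allow = ((.permissions.allow // []) + ["Bash(your-command *)"])' ~/.claude/settings.json > /tmp/settings.json && mv /tmp/settings.json ~/.claude/settings.json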

10. Shutdown cascade stuck (mode_transitioning=true forever)

Symptom: state.json.mode_transitioning shows true for > 1 minute; chats print “⏸ ORCHESTRATOR IN TRANSITION — standing by.” on every turn.

Diagnostic:

jq '.cascade' ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • Check cascade.phase and cascade.executor_heartbeat ages.
  • If phase != "done" and the heartbeat is > 30 s old, any chat can resume the cascade (Track D §3 takeover). Say ***ORCHESTRATOR OFF*** verbatim in any chat.
  • Force-reset (last resort):

jq '.mode="direct" | .mode_transitioning=false | .cascade=null | .active_autopilot_chat=null | .active_vm=null | .lock_acquired_at=null' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json
rm -rf ${CLAUDE_PLUGIN_DATA}/state.lock.d
echo "$(date -u +%FT%TZ) MANUAL_CASCADE_RESET by kevin" >> ${CLAUDE_PLUGIN_DATA}/lock_steals.log
  • Then ***ORCHESTRATOR ON*** in a fresh chat to resume normal operation.

11. Chat slug mismatch — hook can’t identify the chat

Symptom: ORCHESTRATION LOCK: autopilot held by 'other-slug'. This chat ('unknown') must queue. when you expect the chat to BE the holder.

Diagnostic:

cat "$PWD/.claude/.chat_slug" 2>/dev/null
jq '.active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • .chat_slug missing → write it:

mkdir -p "$PWD/.claude" && echo "<your-slug>" > "$PWD/.claude/.chat_slug"
  • Slug written but cwd at tool-call time is a different directory (e.g. subagent working elsewhere) → move the slug up to the repo root, or use an absolute $CLAUDE_PROJECT_DIR-based path in your hook.
  • Slug mismatch vs state.json → just update state.json.active_autopilot_chat manually if you own the lock:
jq '.active_autopilot_chat="<your-slug>"' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json

12. Hook smoke test fails during install

Symptom: installer step 7 reports ✗ Hook did not block. Exit code: 0. when mode was flipped to direct.

Diagnostic:

echo '{"tool_name":"mcp__computer-use__screenshot","cwd":"/tmp"}' | ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh
echo "exit=$?"
which jq
test -x ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh && echo "executable" || echo "NOT executable"

Recovery:
  • jq missing → brew install jq, re-run the test.
  • Not executable → chmod +x ${CLAUDE_PLUGIN_ROOT}/bin/*.sh.
  • Script reads $HOME from somewhere unexpected → set an explicit ORCH=${CLAUDE_PLUGIN_ROOT} at the top (already done in the shipped script).
  • Still failing → set -x at the top of the hook, re-run, paste the trace (or use the no-edit one-liner below).
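
You can capture the same trace without editing the script by running it under bash -x (same payload as the diagnostic above):

echo '{"tool_name":"mcp__computer-use__screenshot","cwd":"/tmp"}' | bash -x ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh 2>&1 | tail -40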


13. gh auth surprises

Symptom: gh commands unexpectedly fail, the auto-GitHub extension can’t create repos, or git push prompts for a password.

Diagnostic:

gh auth status
git config --global credential.helper

Recovery:
  • Multiple accounts → gh auth switch to the right one.
  • Token expired → gh auth refresh.
  • HTTPS vs SSH mismatch → check git remote -v and gh auth setup-git.
  • Two-factor popping repeatedly → use a personal access token scoped to repo + workflow instead of browser auth.


14. Multi-machine sync surprises

Symptom: new Mac doesn’t have the same state as the old one; charters in the new clone are stale.

Recovery:
  • Re-run the installer on the new Mac.
  • Clone your personal <gh-user>/my-ensemble-config repo to ${CLAUDE_PLUGIN_ROOT}-config/ (if you opted into the auto-GitHub extension):

git clone https://github.com/<you>/my-ensemble-config.git ${CLAUDE_PLUGIN_ROOT}-config
  • Run bin/sync-config.sh to apply your work_hours.json + allow-list delta from the config repo into the canonical paths.
  • state.json, chats/, and state.lock.d/ are runtime-local and do not sync across machines by design.

15. Channels session dropped

Note: The exact Claude Code invocation for an iMessage Channels listener is plugin-documented, not a core claude CLI flag. The --channels references in earlier drafts were speculative. Before relying on this recovery path, verify the current command via /plugin marketplace and the installed iMessage plugin’s own documentation. Path B (Cloud Routine + iMessage Channels) is still v2-scope.

Symptom: iMessage commands to kill the Ensemble or trigger a routine don’t produce replies. The persistent tmux session that should be listening isn’t.

Diagnostic:

tmux ls 2>/dev/null           # list tmux sessions
pgrep -fl claude | head       # look for a running claude process (exact match string depends on install)

Recovery:
  • Not running → restart the session. Placeholder shape pending plugin-doc verification:

# Exact invocation TBD per installed iMessage plugin's docs.
cd ${CLAUDE_PLUGIN_ROOT} && tmux new-session -d -s orchestrator 'claude <channel-flags-per-plugin> --dangerously-skip-permissions'
  • Frequent drops → enable launchd KeepAlive (ship a com.kevin.orchestrator.channels.plist that respawns the tmux session on crash).
  • Full Disk Access revoked → System Settings → Privacy & Security → Full Disk Access → add Terminal/iTerm back.
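
A minimal shape for that respawn plist (a sketch: the 60 s interval and the tmux guard are assumptions, and CHANNEL_FLAGS_PER_PLUGIN is the same placeholder as above, still pending plugin-doc verification). Polling with StartInterval and recreating the session if it’s gone is more forgiving than a bare KeepAlive on a job that detaches immediately:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
  <key>Label</key><string>com.kevin.orchestrator.channels</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string><string>-c</string>
    <string>tmux has-session -t orchestrator 2&gt;/dev/null || tmux new-session -d -s orchestrator 'claude CHANNEL_FLAGS_PER_PLUGIN --dangerously-skip-permissions'</string>
  </array>
  <key>StartInterval</key><integer>60</integer>
  <key>RunAtLoad</key><true/>
</dict></plist>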

v2.0 failure modes (2026-04-26 contract sweep)

16. Per-project spawn lock conflict (refusal code 2)

Symptom: Autopilot fires fail with SPAWN-LOCK: per-project lock held for '<slug>' (pid=N). Refused.

Cause: Another autopilot is already targeting the same project. The per-project lock prevents two subprocesses from stomping on the same repo simultaneously.

Fix:

# See what's holding it
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh status

# If the holder PID is dead but lock dir lingers, sweep stale
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh sweep

# Or force release a specific slug (only if you're SURE no autopilot is running)
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh release <slug>
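
Before a force release, confirm the holder PID really is dead; kill -0 only probes, it never signals:

pid=12345   # the pid=N from the refusal message or status output
kill -0 "$pid" 2>/dev/null && echo "still alive; do NOT release" || echo "dead; safe to sweep/release"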

17. Concurrency cap hit + queue timeout (refusal code 3)

Symptom: SPAWN-LOCK: cap held >300s, no slot opened for '<slug>'. Refused.

Cause: state.json.max_concurrent_autopilots is at capacity, no slot opened within the 5-minute queue timeout.

Fix: Raise cap or kill an existing autopilot:

jq '.max_concurrent_autopilots = 3' state.json > s.tmp && mv s.tmp state.json
# or kill all running autopilots from the PWA Terminal page (kill switch)
# or manually:
pkill -f autopilot_session.sh
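
To see how close you are to the cap before raising it (field names per the state schema in §23):

jq '{cap: .max_concurrent_autopilots, running: (.active_autopilots | length)}' state.json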

18. Resource floor refused (refusal code 4)

Symptom: SPAWN-LOCK: resource floor — RAM N% > threshold M%. Refused. (or CPU variant)

Cause: state.json.resource_floor_enabled = true and the system is taxed at spawn time.

Fix: Wait or relax thresholds:

jq '.resource_floor_thresholds.ram_pct = 95' state.json > s.tmp && mv s.tmp state.json
# or disable
jq '.resource_floor_enabled = false' state.json > s.tmp && mv s.tmp state.json

19. Mid-run resource watchdog killed an autopilot

Symptom: Autopilot session aborts mid-run; runs/watchdog-<date>.log shows KILLING parent process group (sustained pressure).

Cause: state.json.resource_watchdog_enabled = true and the watchdog detected sustained pressure (default: 3 consecutive breaches at 95% RAM or load > 6.0).

Fix: Accept and reduce parallelism, or relax thresholds:

jq '.resource_watchdog_enabled = false' state.json > s.tmp && mv s.tmp state.json
# or env-tune
WATCHDOG_RAM_PCT_THRESHOLD=98 WATCHDOG_BREACHES_TO_KILL=5 bash bin/autopilot_session.sh 60
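
For intuition, the breach-counting amounts to something like this; an illustrative sketch only, since the shipped watchdog’s probes and kill path differ in detail:

breaches=0
threshold=${WATCHDOG_RAM_PCT_THRESHOLD:-95}
kill_after=${WATCHDOG_BREACHES_TO_KILL:-3}
while sleep 30; do
  free_pct=$(memory_pressure -Q | awk -F': ' '/percentage/ {print +$2}')   # macOS free-RAM probe
  used_pct=$(( 100 - free_pct ))
  if [ "$used_pct" -ge "$threshold" ]; then breaches=$(( breaches + 1 )); else breaches=0; fi
  if [ "$breaches" -ge "$kill_after" ]; then
    echo "KILLING parent process group (sustained pressure)" >&2
    kill -TERM -- -$$    # negative pid targets the whole process group
    break
  fi
done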

20. Network pre-flight failure

Symptom: [autopilot] ABORT: network pre-flight failed (cannot reach api.anthropic.com).

Fix: Check connectivity. If offline by design, disable:

jq '.require_network_check = false' state.json > s.tmp && mv s.tmp state.json
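
Reproducing the probe by hand (a sketch; the shipped check may differ):

curl -sS -m 10 -o /dev/null https://api.anthropic.com && echo "reachable" || echo "unreachable (curl exit $?)"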

21. Claude CLI version too old

Symptom: [autopilot] ABORT: claude CLI X.Y.Z < required A.B.C.

Fix:

npm i -g @anthropic-ai/claude-code
# or relax the pin
jq '.min_claude_cli_version = ""' state.json > s.tmp && mv s.tmp state.json
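
To compare your installed version against the pin by hand (a sketch using sort -V, available on recent macOS):

installed=$(claude --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
required=$(jq -r '.min_claude_cli_version // empty' state.json)
if [ -z "$required" ] || [ "$(printf '%s\n' "$required" "$installed" | sort -V | head -1)" = "$required" ]; then
  echo "ok ($installed)"
else
  echo "too old ($installed < required $required)"
fi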

22. Project state is Hibernating

Symptom: [autopilot] ABORT: project '<slug>' is Hibernating; promote first.

Fix: Move it out of Hibernating before firing:

# Verbal in any chat:
"set <slug> to Building"  # or Warmer / R&D / Updates / Launch Prep

# Or directly:
jq '.state = "Warmer"' chats/<slug>.json > s.tmp && mv s.tmp chats/<slug>.json

23. state.json corruption recovery

Symptom: state.json won’t parse; autopilot/hooks fail closed.

Recovery path (in order):

  1. Check for atomic-rename leftover — every state-write uses .tmp + mv. If a write was interrupted, look for state.json.tmp:

ls state.json* | head
jq . state.json.tmp && mv state.json.tmp state.json

  2. Restore from a backup snapshot (bin/backup_snapshot.sh runs nightly):

ls backups/state.json.*
cp backups/state.json.YYYYMMDD-HHMMSS state.json

  3. Hand-rebuild from schema — if no backup, create the minimum viable file:

cat > state.json <<'EOF'
{
  "schema_version": 1,
  "mode": "orchestrated",
  "mode_transitioning": false,
  "active_autopilot_chat": null,
  "active_vm": null,
  "lock_acquired_at": null,
  "lock_intent": null,
  "max_concurrent_autopilots": 2,
  "active_autopilots": {},
  "resource_floor_enabled": false,
  "resource_floor_thresholds": {"ram_pct": 90, "cpu_load": 4.0},
  "spawn_queue": [],
  "parallel_code_allowed": true,
  "rapid_fire_enabled": true,
  "caffeinate_during_autopilot": true,
  "require_ac_power": false,
  "min_free_disk_gb": 5,
  "require_network_check": true,
  "min_claude_cli_version": "",
  "resource_watchdog_enabled": false,
  "cascade": null
}
EOF

  4. Re-heartbeat all chats — after recovery, run a Maestro session that bumps each chats/*.json.last_heartbeat so the dashboard reflects current state.
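
Step 4 can be scripted if you’d rather not wait for a Maestro pass (a sketch, assuming each chat file carries a top-level last_heartbeat and using the same .tmp + mv pattern as step 1):

now=$(date -u +%FT%TZ)
for f in chats/*.json; do
  jq --arg t "$now" '.last_heartbeat = $t' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done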


If you hit something not in this list, grab bin/status.sh output and a one-line symptom and ping the Maestro. The failure catalog grows from real incidents.