Troubleshooting

23 failure modes with diagnostics + recovery.

source: plugins/endenza/TROUBLESHOOTING.md

Twenty-three sections: the first fifteen are ordered by how often you’re likely to hit each, and §§16–23 were added in the 2026-04-26 v2.0 contract sweep. Every section follows the same shape: Symptom → Diagnostic → Recovery. If a recovery step involves a command, it’s a play-button you can click.

Mobile access: see the PWA Terminal page (/terminal.html) — that is the supported mobile-control surface. Termius / Tailscale / mobile_dispatch.sh were removed in v1.0.


1. git clone fails during install with auth error

Symptom: installer step 3 halts with remote: Repository not found or Authentication failed for 'https://github.com/...'.

Diagnostic:

gh auth status
gh repo view KiwiMaddog2020/endenza

Recovery:
  • Not logged in → gh auth login and pick HTTPS / GitHub.com / paste token or browser.
  • Logged in as the wrong account → gh auth logout, then gh auth login again.
  • Repo is private and you’re not a collaborator → ping the user for an invite before re-running the installer.


2. Tools blocked with ORCHESTRATION LOCK: message

Symptom: MCP VM tool calls (computer-use, Claude_Preview, Claude_in_Chrome) fail with ORCHESTRATION LOCK: mode=direct or ORCHESTRATION LOCK: autopilot held by '<slug>'.

This is correct behavior. The hook is doing its job.

Diagnostic:

jq '.mode, .mode_transitioning, .active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • mode=direct → say ***ORCHESTRATOR ON*** verbatim in a chat to flip back to orchestrated.
  • mode_transitioning=true → shutdown cascade in progress; wait 30 s then re-check. If stuck, see §10.
  • active_autopilot_chat != null and != you → another Agent holds the lock. Wait for ***AUTOPILOT COMPLETE*** or read the holder’s status file to see what they’re doing.
  • Lock held by you but shouldn’t be → crash recovery. Force-release (see §4).
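
To see all three branch fields at a glance, a convenience triage sketch (not part of the shipped tooling; same paths as the diagnostic above):

STATE=${CLAUDE_PLUGIN_DATA}/state.json
mode=$(jq -r '.mode' "$STATE")
trans=$(jq -r '.mode_transitioning' "$STATE")
holder=$(jq -r '.active_autopilot_chat // "none"' "$STATE")
echo "mode=$mode transitioning=$trans lock-holder=$holder"   # match against the bullets above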


3. [ORCHESTRATOR] ... MISSING at SessionStart

Symptom: new chat’s context shows [ORCHESTRATOR] hard-enforcement hook MISSING at ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh or ... INSTALLED but not registered in ....

Diagnostic:

ls -la ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh
jq '.hooks.PreToolUse' ~/.claude/settings.json

Recovery:
  • Script missing → re-run the installer, or git pull in ${CLAUDE_PLUGIN_ROOT} and chmod +x bin/*.sh.
  • Script present, not in settings → copy templates/CLAUDE_CODE_SETTINGS.example.json contents into ~/.claude/settings.json, merging any existing hooks you want to keep (a jq merge sketch follows this list).
  • Script present, registered, but healthcheck still says MISSING → bug. Paste the healthcheck output into this chat for diagnosis.
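
For the merge case, one non-destructive shape (a sketch; the matcher value here is illustrative, so copy the real one from the template file):

jq --arg cmd "${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh" \
  '.hooks.PreToolUse = ((.hooks.PreToolUse // []) + [{matcher: "mcp__.*", hooks: [{type: "command", command: $cmd}]}])' \
  ~/.claude/settings.json > /tmp/settings.json && mv /tmp/settings.json ~/.claude/settings.json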


4. Autopilot lock stuck

Symptom: jq '.active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json shows a slug, but the chat that held it is closed or unresponsive for > 45 min.

Diagnostic:

cat ${CLAUDE_PLUGIN_DATA}/state.lock.d/holder.json 2>/dev/null
stat -f '%Sm' ${CLAUDE_PLUGIN_DATA}/state.lock.d 2>/dev/null

Recovery:
  • Wait 15 min → the stale_lock_sweeper.sh job (if loaded via launchd) auto-releases locks older than 45 min. Tail /tmp/orchestrator-sweeper.out to watch.
  • Force-release immediately:

rm -rf ${CLAUDE_PLUGIN_DATA}/state.lock.d && jq '.active_autopilot_chat=null | .active_vm=null | .lock_acquired_at=null' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json && echo "lock force-released"
  • Log the manual release so the audit trail is complete:
echo "$(date -u +%FT%TZ) MANUAL_STEAL by kevin — stuck chat recovery" >> ${CLAUDE_PLUGIN_DATA}/lock_steals.log

5. settings.json merge conflict during install

Symptom: installer step 5 halts with “existing PreToolUse matcher conflicts with ours” or jq parse error.

Diagnostic:

jq '.hooks.PreToolUse[].matcher' ~/.claude/settings.json

Recovery:
  • Existing identical matcher → your previous hook is already there. Safe to skip step 5.
  • Existing different command on the same matcher → rename ours to use a unique matcher, or merge the two commands into a single shell script (see the wrapper sketch below). Ping me with the diff and I’ll propose a merge.
  • Invalid JSON → back up + start fresh:

mv ~/.claude/settings.json ~/.claude/settings.json.broken-$(date +%s) && cp ${CLAUDE_PLUGIN_ROOT}/templates/CLAUDE_CODE_SETTINGS.example.json ~/.claude/settings.json
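
For the merge-into-one-script route, a wrapper sketch, assuming both hooks follow the same stdin-JSON / exit-code contract (the path to your existing hook is a placeholder):

#!/usr/bin/env bash
# merged_pretooluse.sh: point the shared matcher's command at this file.
input=$(cat)                                                # the hook payload arrives as JSON on stdin
echo "$input" | /path/to/your-existing-hook.sh || exit $?   # if either hook blocks, the block wins
echo "$input" | ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh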

6. Scheduled routine didn’t fire overnight

Symptom: expected briefing at ${CLAUDE_PLUGIN_DATA}/briefings/YYYY-MM-DD.md but file doesn’t exist.

Diagnostic:

# Did the Mac sleep past the fire time?
pmset -g log | grep -i "wake\|sleep" | tail -20
# Is the Desktop app running?
pgrep -f "Claude.app" > /dev/null && echo "app running" || echo "app NOT running"
# What's scheduled?
ls -la ~/.claude/scheduled-tasks/

Recovery:
  • Mac was asleep → Desktop scheduled tasks only fire when the Mac is awake and the app is running. Enable “Keep computer awake” in Desktop app settings, and don’t close the lid overnight (a caffeinate stopgap follows this list).
  • App wasn’t running → open Claude Desktop; the 7-day catch-up may trigger one replay. Otherwise the fire is lost; wait for tomorrow.
  • Mode was direct at the scheduled time → the routine no-oped silently by design. Flip mode back with ***ORCHESTRATOR ON***.
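
As a stopgap on nights you can’t change the lid habit, macOS’s caffeinate can hold the system awake for a fixed window (note: -s only holds off sleep while on AC power):

# keep the system awake for 8 hours, then release
caffeinate -s -t $((8 * 3600)) &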


7. state.json is corrupt or missing

Symptom: jq . ${CLAUDE_PLUGIN_DATA}/state.json fails; hook healthcheck reports state unreadable; tools fail-closed.

Diagnostic:

ls -la ${CLAUDE_PLUGIN_DATA}/state.json*
cat ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • Restore from the latest backup snapshot, if backup_snapshot.sh has been running (a sketch for picking the newest snapshot automatically follows this list):

ls -la ${CLAUDE_PLUGIN_ROOT}/backups/ | tail -5
cp ${CLAUDE_PLUGIN_ROOT}/backups/<most-recent>/state.json ${CLAUDE_PLUGIN_DATA}/state.json
jq . ${CLAUDE_PLUGIN_DATA}/state.json && echo "restored"
  • Or re-initialize from schema:
# Unquoted EOF so the $(date ...) in "notes" expands at write time.
cat > ${CLAUDE_PLUGIN_DATA}/state.json <<EOF
{"schema_version":1,"mode":"orchestrated","mode_transitioning":false,"last_mode_change":null,"last_mode_change_reason":null,"active_autopilot_chat":null,"active_vm":null,"lock_acquired_at":null,"current_task":null,"orchestrator_automations":[],"automation_registry":[],"cascade":null,"notes":"Re-initialized $(date -u +%FT%TZ) after corruption."}
EOF
  • Then re-register any automation_registry entries; run bin/status.sh to confirm the restored state looks sane.
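
To grab the newest snapshot without eyeballing the ls output (a convenience sketch, assuming the backups/<timestamp>/state.json layout above):

latest=$(ls -dt ${CLAUDE_PLUGIN_ROOT}/backups/*/ | head -1)
cp "${latest}state.json" ${CLAUDE_PLUGIN_DATA}/state.json && jq . ${CLAUDE_PLUGIN_DATA}/state.json && echo "restored from ${latest}"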

8. Launchd sweeper not firing

Symptom: locks older than 45 min sit un-stolen; lock_steals.log has no recent entries.

Diagnostic:

launchctl list | grep orchestrator
ls -la /tmp/orchestrator-sweeper.out /tmp/orchestrator-sweeper.err
tail -20 /tmp/orchestrator-sweeper.err 2>/dev/null

Recovery:
  • Not loaded → load it:

launchctl load ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist
  • Loaded but erroring → check stderr; common causes: python3 missing, script not executable (chmod +x ${CLAUDE_PLUGIN_ROOT}/bin/stale_lock_sweeper.sh), or the plist pointing at a stale path after a git pull moved files.
  • Reload after edits:
launchctl unload ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist 2>/dev/null
launchctl load ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist
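
To catch the stale-path case quickly, print what the plist actually points at and test it (plutil ships with macOS):

plutil -p ${CLAUDE_PLUGIN_ROOT}/launchd/com.kevin.orchestrator.stale-lock-sweeper.plist | grep -A3 ProgramArguments
# then confirm each path printed above still exists and is executable, e.g.:
test -x ${CLAUDE_PLUGIN_ROOT}/bin/stale_lock_sweeper.sh && echo ok || echo "stale path"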

9. Permission prompt blocking automated work

Symptom: a scheduled routine or Channels-session request stalls on a Claude Code permission dialog that nobody’s there to click.

Diagnostic: check the session in claude.ai/code scheduled tasks sidebar — running tasks that pause for permission show a pending-approval state.

Recovery:
  • Click “Run now” once in the Scheduled sidebar with you at the keyboard — approvals from that run are saved to the task and auto-applied to future fires.
  • Or add the command to the starter allow-list by editing ~/.claude/settings.json:

{"permissions": {"allow": ["Bash(your-command *)"]}}
  • For Channels sessions, start them with --dangerously-skip-permissions — the PreToolUse hook still blocks VM tools (hooks run before perm check), so bypassing prompts does not bypass safety.
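
If settings.json already has an allow-list, append rather than overwrite (a jq sketch; Bash(your-command *) is the placeholder from above):

jq '.permissions.allow = ((.permissions.allow // []) + ["Bash(your-command *)"])' ~/.claude/settings.json > /tmp/settings.json && mv /tmp/settings.json ~/.claude/settings.json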

10. Shutdown cascade stuck (mode_transitioning=true forever)

Symptom: state.json.mode_transitioning shows true for > 1 minute; chats print “⏸ ORCHESTRATOR IN TRANSITION — standing by.” on every turn.

Diagnostic:

jq '.cascade' ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • Check cascade.phase and cascade.executor_heartbeat ages.
  • If phase != "done" and the heartbeat is > 30 s old, any chat can resume the cascade (Track D §3 takeover). Say ***ORCHESTRATOR OFF*** verbatim in any chat.
  • Force-reset (last resort):

jq '.mode="direct" | .mode_transitioning=false | .cascade=null | .active_autopilot_chat=null | .active_vm=null | .lock_acquired_at=null' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json
rm -rf ${CLAUDE_PLUGIN_DATA}/state.lock.d
echo "$(date -u +%FT%TZ) MANUAL_CASCADE_RESET by kevin" >> ${CLAUDE_PLUGIN_DATA}/lock_steals.log
  • Then ***ORCHESTRATOR ON*** in a fresh chat to resume normal operation.

11. Chat slug mismatch — hook can’t identify the chat

Symptom: ORCHESTRATION LOCK: autopilot held by 'other-slug'. This chat ('unknown') must queue. when you expect the chat to BE the holder.

Diagnostic:

cat "$PWD/.claude/.chat_slug" 2>/dev/null
jq '.active_autopilot_chat' ${CLAUDE_PLUGIN_DATA}/state.json

Recovery:
  • .chat_slug missing → write it:

mkdir -p "$PWD/.claude" && echo "<your-slug>" > "$PWD/.claude/.chat_slug"
  • Slug written but cwd at tool-call time is a different directory (e.g. subagent working elsewhere) → move the slug up to the repo root, or use an absolute $CLAUDE_PROJECT_DIR-based path in your hook.
  • Slug mismatch vs state.json → just update state.json.active_autopilot_chat manually if you own the lock:
jq '.active_autopilot_chat="<your-slug>"' ${CLAUDE_PLUGIN_DATA}/state.json > /tmp/s.json && mv /tmp/s.json ${CLAUDE_PLUGIN_DATA}/state.json

12. Hook smoke test fails during install

Symptom: installer step 7 reports ✗ Hook did not block. Exit code: 0. when mode was flipped to direct.

Diagnostic:

echo '{"tool_name":"mcp__computer-use__screenshot","cwd":"/tmp"}' | ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh
echo "exit=$?"
which jq
test -x ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh && echo "executable" || echo "NOT executable"

Recovery:
  • jq missing → brew install jq, re-run the test.
  • Not executable → chmod +x ${CLAUDE_PLUGIN_ROOT}/bin/*.sh.
  • Script reads $HOME from somewhere unexpected → set an explicit ORCH=${CLAUDE_PLUGIN_ROOT} at the top (already done in the shipped script).
  • Still failing → set -x at the top of the hook, re-run, paste the trace (or use the no-edit one-liner below).
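
You can capture the same trace without editing the script by running it under bash -x (same payload as the diagnostic above):

echo '{"tool_name":"mcp__computer-use__screenshot","cwd":"/tmp"}' | bash -x ${CLAUDE_PLUGIN_ROOT}/bin/orchestrator_lock.sh 2>&1 | tail -40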


13. gh auth surprises

Symptom: gh commands unexpectedly fail, the auto-GitHub extension can’t create repos, or git push prompts for a password.

Diagnostic:

gh auth status
git config --global credential.helper

Recovery:
  • Multiple accounts → gh auth switch to the right one.
  • Token expired → gh auth refresh.
  • HTTPS vs SSH mismatch → check git remote -v and gh auth setup-git.
  • Two-factor popping repeatedly → use a personal access token scoped to repo + workflow instead of browser auth.


14. Multi-machine sync surprises

Symptom: new Mac doesn’t have the same state as the old one; charters in the new clone are stale.

Recovery:
  • Re-run the installer on the new Mac.
  • Clone your personal <gh-user>/my-ensemble-config repo to ${CLAUDE_PLUGIN_ROOT}-config/ (if you opted into the auto-GitHub extension):

git clone https://github.com/<you>/my-ensemble-config.git ${CLAUDE_PLUGIN_ROOT}-config
  • Run bin/sync-config.sh to apply your work_hours.json + allow-list delta from the config repo into the canonical paths.
  • state.json, chats/, and state.lock.d/ are runtime-local and do not sync across machines by design.

15. Channels session dropped

Note: The exact Claude Code invocation for an iMessage Channels listener is plugin-documented, not a core claude CLI flag. The --channels references in earlier drafts were speculative. Before relying on this recovery path, verify the current command via /plugin marketplace and the installed iMessage plugin’s own documentation. Path B (Cloud Routine + iMessage Channels) is still v2-scope.

Symptom: iMessage commands to kill the Ensemble or trigger a routine don’t produce replies. The persistent tmux session that should be listening isn’t.

Diagnostic:

tmux ls 2>/dev/null           # list tmux sessions
pgrep -fl claude | head       # look for a running claude process (exact match string depends on install)

Recovery:
  • Not running → restart the session. Placeholder shape pending plugin-doc verification:

# Exact invocation TBD per installed iMessage plugin's docs.
cd ${CLAUDE_PLUGIN_ROOT} && tmux new-session -d -s orchestrator 'claude <channel-flags-per-plugin> --dangerously-skip-permissions'
  • Frequent drops → enable launchd KeepAlive (ship a com.kevin.orchestrator.channels.plist that respawns the tmux session on crash).
  • Full Disk Access revoked → System Settings → Privacy & Security → Full Disk Access → add Terminal/iTerm back.
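
A minimal shape for that respawn plist (a sketch: the 60 s interval and the tmux guard are assumptions, and CHANNEL_FLAGS_PER_PLUGIN is the same placeholder as above, still pending plugin-doc verification). Polling with StartInterval and recreating the session if it’s gone is more forgiving than a bare KeepAlive on a job that detaches immediately:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
  <key>Label</key><string>com.kevin.orchestrator.channels</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string><string>-c</string>
    <string>tmux has-session -t orchestrator 2&gt;/dev/null || tmux new-session -d -s orchestrator 'claude CHANNEL_FLAGS_PER_PLUGIN --dangerously-skip-permissions'</string>
  </array>
  <key>StartInterval</key><integer>60</integer>
  <key>RunAtLoad</key><true/>
</dict></plist>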

v2.0 failure modes (2026-04-26 contract sweep)

16. Per-project spawn lock conflict (refusal code 2)

Symptom: Autopilot fires fail with SPAWN-LOCK: per-project lock held for '<slug>' (pid=N). Refused.

Cause: Another autopilot is already targeting the same project. The per-project lock prevents two subprocesses from stomping on the same repo simultaneously.

Fix:

# See what's holding it
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh status

# If the holder PID is dead but lock dir lingers, sweep stale
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh sweep

# Or force release a specific slug (only if you're SURE no autopilot is running)
bash plugins/ensemble/bin/orchestrator_spawn_lock.sh release <slug>
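
Before a force release, confirm the holder PID really is dead; kill -0 only probes, it never signals:

pid=12345   # the pid=N from the refusal message or status output
kill -0 "$pid" 2>/dev/null && echo "still alive; do NOT release" || echo "dead; safe to sweep/release"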

17. Concurrency cap hit + queue timeout (refusal code 3)

Symptom: SPAWN-LOCK: cap held >300s, no slot opened for '<slug>'. Refused.

Cause: state.json.max_concurrent_autopilots is at capacity, no slot opened within the 5-minute queue timeout.

Fix: Raise cap or kill an existing autopilot:

jq '.max_concurrent_autopilots = 3' state.json > s.tmp && mv s.tmp state.json
# or kill all running autopilots from the PWA Terminal page (kill switch)
# or manually:
pkill -f autopilot_session.sh
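
To see how close you are to the cap before raising it (field names per the state schema in §23):

jq '{cap: .max_concurrent_autopilots, running: (.active_autopilots | length)}' state.json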

18. Resource floor refused (refusal code 4)

Symptom: SPAWN-LOCK: resource floor — RAM N% > threshold M%. Refused. (or CPU variant)

Cause: state.json.resource_floor_enabled = true and the system is taxed at spawn time.

Fix: Wait or relax thresholds:

jq '.resource_floor_thresholds.ram_pct = 95' state.json > s.tmp && mv s.tmp state.json
# or disable
jq '.resource_floor_enabled = false' state.json > s.tmp && mv s.tmp state.json

19. Mid-run resource watchdog killed an autopilot

Symptom: Autopilot session aborts mid-run; runs/watchdog-<date>.log shows KILLING parent process group (sustained pressure).

Cause: state.json.resource_watchdog_enabled = true and the watchdog detected sustained pressure (default: 3 consecutive breaches at 95% RAM or load > 6.0).

Fix: Accept and reduce parallelism, or relax thresholds:

jq '.resource_watchdog_enabled = false' state.json > s.tmp && mv s.tmp state.json
# or env-tune
WATCHDOG_RAM_PCT_THRESHOLD=98 WATCHDOG_BREACHES_TO_KILL=5 bash bin/autopilot_session.sh 60
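
For intuition, the breach-counting amounts to something like this; an illustrative sketch only, since the shipped watchdog’s probes and kill path differ in detail:

breaches=0
threshold=${WATCHDOG_RAM_PCT_THRESHOLD:-95}
kill_after=${WATCHDOG_BREACHES_TO_KILL:-3}
while sleep 30; do
  free_pct=$(memory_pressure -Q | awk -F': ' '/percentage/ {print +$2}')   # macOS free-RAM probe
  used_pct=$(( 100 - free_pct ))
  if [ "$used_pct" -ge "$threshold" ]; then breaches=$(( breaches + 1 )); else breaches=0; fi
  if [ "$breaches" -ge "$kill_after" ]; then
    echo "KILLING parent process group (sustained pressure)" >&2
    kill -TERM -- -$$    # negative pid targets the whole process group
    break
  fi
done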

20. Network pre-flight failure

Symptom: [autopilot] ABORT: network pre-flight failed (cannot reach api.anthropic.com).

Fix: Check connectivity. If offline by design, disable:

jq '.require_network_check = false' state.json > s.tmp && mv s.tmp state.json
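
Reproducing the probe by hand (a sketch; the shipped check may differ):

curl -sS -m 10 -o /dev/null https://api.anthropic.com && echo "reachable" || echo "unreachable (curl exit $?)"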

21. Claude CLI version too old

Symptom: [autopilot] ABORT: claude CLI X.Y.Z < required A.B.C.

Fix:

npm i -g @anthropic-ai/claude-code
# or relax the pin
jq '.min_claude_cli_version = ""' state.json > s.tmp && mv s.tmp state.json
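
To compare your installed version against the pin by hand (a sketch using sort -V, available on recent macOS):

installed=$(claude --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
required=$(jq -r '.min_claude_cli_version // empty' state.json)
if [ -z "$required" ] || [ "$(printf '%s\n' "$required" "$installed" | sort -V | head -1)" = "$required" ]; then
  echo "ok ($installed)"
else
  echo "too old ($installed < required $required)"
fi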

22. Project state is Hibernating

Symptom: [autopilot] ABORT: project '<slug>' is Hibernating; promote first.

Fix: Move it out of Hibernating before firing:

# Verbal in any chat:
"set <slug> to Building"  # or Warmer / R&D / Updates / Launch Prep

# Or directly:
jq '.state = "Warmer"' chats/<slug>.json > s.tmp && mv s.tmp chats/<slug>.json

23. state.json corruption recovery

Symptom: state.json won’t parse; autopilot/hooks fail closed.

Recovery path (in order):

  1. Check for atomic-rename leftover — every state-write uses .tmp + mv. If a write was interrupted, look for state.json.tmp:

ls state.json* | head
jq . state.json.tmp && mv state.json.tmp state.json

  2. Restore from a backup snapshot (bin/backup_snapshot.sh runs nightly):

ls backups/state.json.*
cp backups/state.json.YYYYMMDD-HHMMSS state.json

  3. Hand-rebuild from schema — if no backup, create the minimum viable file:

cat > state.json <<'EOF'
{
  "schema_version": 1,
  "mode": "orchestrated",
  "mode_transitioning": false,
  "active_autopilot_chat": null,
  "active_vm": null,
  "lock_acquired_at": null,
  "lock_intent": null,
  "max_concurrent_autopilots": 2,
  "active_autopilots": {},
  "resource_floor_enabled": false,
  "resource_floor_thresholds": {"ram_pct": 90, "cpu_load": 4.0},
  "spawn_queue": [],
  "parallel_code_allowed": true,
  "rapid_fire_enabled": true,
  "caffeinate_during_autopilot": true,
  "require_ac_power": false,
  "min_free_disk_gb": 5,
  "require_network_check": true,
  "min_claude_cli_version": "",
  "resource_watchdog_enabled": false,
  "cascade": null
}
EOF

  4. Re-heartbeat all chats — after recovery, run a Maestro session that bumps each chats/*.json.last_heartbeat so the dashboard reflects current state.
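
Step 4 can be scripted if you’d rather not wait for a Maestro pass (a sketch, assuming each chat file carries a top-level last_heartbeat and using the same .tmp + mv pattern as step 1):

now=$(date -u +%FT%TZ)
for f in chats/*.json; do
  jq --arg t "$now" '.last_heartbeat = $t' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done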


If you hit something not in this list, grab bin/status.sh output and a one-line symptom and ping the Maestro. The failure catalog grows from real incidents.