| #258 | Secret handoff (Phil paste -> agent) is unreliable - no canonical drop file + agent clipboard/nav interference Fix: Built canonical secret-drop: IT/scripts/secret-drop.ps1 (-Open clears+opens the ONE gitignored file IT/credentials/SECRET-DROP.txt in Notepad; -ReadRaw emits it for the agent without echo; -Status metadata only; -Clear wipes) + the file (gitignored). Banked full rule in memory feedback_open_file_for_paste: ONE window/file; NEVER Set-Clipboard or navigate the active tab during a pending paste; do not peek/assert empty before the user confirms saved; verify the secret (provider verify + scope) before claiming captured; always give the exact path. DONE = next secret handoff one-and-done via SECRET-DROP.txt. | credential | P2 | kara / unassigned | 2026-06-23 | 2026-06-23 | 0 days | On-Time |
| #257 | Migrate Cloudflare tunnel connector from Plex box to the NAS + convert to dashboard-managed config (P-00255 resilience follow-on) Fix: Move the cloudflared connector to the UPS-backed NAS (192.168.1.80) via Container Station and convert tunnel 3cdd63bc to remotely-managed (config lives in the Cloudflare dashboard, no local file). Steps: (1) Cloudflare ZeroTrust > Networks > Tunnels: convert 3cdd63bc to remotely-managed, set public-hostname routes IDENTICAL to current local config (nas->https://192.168.1.80:443, plex->https://192.168.1.5:32400, sab->http://192.168.1.5:8089) + FIX router->https://192.168.1.3:8443, drop unused sonarr/radarr; (2) Container Station on NAS runs cloudflare/cloudflared with the connector token, restart=unless-stopped (auto-start on NAS boot); (3) verify both connectors serve = HA, zero downtime; (4) disable cloudflared on Plex box, keep it pinned as instant rollback; (5) confirm NAS auto-boot (P-00256). Execution: Phil does 2 logins (Cloudflare + Container Station) + 1 token paste; agent drives all navigation/config + the Plex-box cutover via SSH. Rollback: re-enable Plex-box cloudflared (pinned). | network | P2 | kara / unassigned | 2026-06-22 | 2026-06-23 | 0 days | On-Time |
| #249 | WireGuard site-to-site tunnel DOWN (Nicole<->Phil house) — all 192.168.1.0/24 + 10.6.0.1 unreachable Fix: Get router access at one end and re-establish the WG peer. Fastest: Phil power-cycles/checks his HOME router (most likely = his router rebooted or WAN/DDNS IP changed leaving Nicole's peer endpoint stale). If not restored, log into Nicole's RT-BE92U (192.168.2.2), restart the WireGuard client interface + verify peer endpoint resolves to Phil's current public IP. Re-run kara-network-watch to confirm. Longer-term: DDNS-resilience + saved Nicole-router credential so Kara can self-restart the WG interface. | cleanup | P1 | chuck / kara | 2026-06-22 | 2026-06-22 | 0 days | On-Time |
| #255 | Cloudflare tunnel (nas/plex/sab) HTTP 530 after power outage - cloudflared service ran but loaded empty stub config, served no tunnel Fix: Pin the service to the real config: ImagePath set to cloudflared.exe --config C:\Users\engelp\.cloudflared\config.yml tunnel run (DONE 2026-06-22; original backed up to C:\selfheal\cloudflared-ImagePath-ORIGINAL-2026-06-22.txt). Recover a stuck daemon via taskkill /F + sc start, never Stop-Service. Hardened plex-box-selfheal.ps1 to probe the real edge (HTTP 530 = down) and auto force-kill + sc start. | network | P1 | kara / unassigned | 2026-06-22 | 2026-06-22 | 0 days | On-Time |
| #254 | problem.js token auth silently breaks when .env has CRLF line endings (hand-rolled parser not EOL-agnostic) Fix: Harden parser: split(/\r?\n/) tolerates CRLF or LF (done + committed). Normalized live .env back to LF to undo the trigger + restore the also-CRLF-fragile LEDGER_AUTOPUSH regex without chasing it. Durable fix = EOL-agnostic split. | bot-health | P2 | chuck / chuck | 2026-06-22 | 2026-06-22 | 0 days | On-Time |
| #253 | RDP 'password not correct' to gaming machine — engelp is a LOCAL account; Microsoft password reset is irrelevant; RDP needs the local password (PIN never works for RDP) Fix: Log in as '.\engelp' (or philsgamingmach\engelp) with the LOCAL password (not the PIN, not the Microsoft password). If unknown: EITHER (a) reset engelp's local password — CAUTION: re-bind any Windows scheduled tasks/services storing engelp's old password or they'll fail on logon, OR (b) create a dedicated local admin account for RDP and leave engelp + its task bindings untouched. Then document the RDP cred in memory/credentials-ledger.md. VERIFIED THIS SESSION: RDP enabled+correct (fDenyTSConnections=0, TermService running, listening :3389, firewall RDP rules on, NLA on); Tailscale IP 100.65.133.98 = this machine (philsgamingmachine), confirmed via 'tailscale ip -4' + status; laptop online on tailnet. So the ONLY blocker is the local-account credential. | credential | P2 | chuck / chuck | 2026-06-22 | 2026-06-22 | 0 days | On-Time |
| #246 | Cursor USER-level .cursor/mcp.json open-brain still dead v1 docker — connected=false; deploy-verify gate missed user scope Fix: FIXED in-pass: rewrote C:/Users/engelp/.cursor/mcp.json (USER scope) open-brain from docker exec open_brain_mcp to v2 (C:/Python314/python.exe openbrain-v2/brain_mcp_server.py), matching the project .cursor/mcp.json. Added the user-scope path to openbrain-deploy-verify.js FILES so the gate scans it. Cursor must reload MCP (or restart) to reconnect; verify connected=true in Cursor MCP output. | memory-system | P1 | chuck / unassigned | 2026-06-21 | 2026-06-21 | 0 days | On-Time |
| #244 | Agents READ HANDOFF/files instead of RETRIEVING from OpenBrain — memory-as-runtime not proven (Tess test FAILED) Fix: Enforce memory-as-runtime, not config-on-disk: (1) add a hard MEMORY-AS-RUNTIME rule to the surfaces physically in front of every agent every session — memory/STANDING-ORDERS.md (hook-injected), memory/AGENTS.md (negative constraints), CLAUDE.md, .cursor/rules — stating: ANY question about what you know / status from memory / recall = you MUST call get_active_memories + search_brain and CITE returned capture ids in the reply; reading HANDOFF.md or files is NOT retrieval; a memory answer with no cited id = FAILURE. (2) Add the same teeth to all 10 agent SKILL boot steps (source IT/plugins/<agent>/skills/<agent>/SKILL.md + dept copies + deployed marketplace copies), deploy-verify each. (3) Connectivity proof per surface: each of Claude Code / Cursor / Cowork must demonstrate a live search_brain call citing ids (Claude Code DONE this session: ids 328/329/330; engine = brain.db 330 rows). Acceptance test = re-ask each surface 'what do you know from OpenBrain about X' and require cited capture ids. | memory-system | P2 | chuck / unassigned | 2026-06-21 | 2026-06-21 | 0 days | On-Time |
| #174 | Cowork bridge-sync skill is a stale install — wrong laptop bridge path + skips the registry refresh its spec requires Fix: Package the corrected source as an installable plugin via /build-plugin and have Phil upload it in Cowork (Upload local plugin), replacing the stale copy — watch for a bridge-sync name collision with the old install; if Cowork shows two, Phil deletes the old one in the UI. Interim: syncs still work; registry can be refreshed by asking the Cowork session to run list_scheduled_tasks and overwrite registry.md. | architecture | P2 | chuck / unassigned | 2026-06-11 | 2026-06-21 | 10 days | Late |
| #239 | Cross-agent tool doctrine over-restricted: tool bans + single-tool defaults across all 5 agents Fix: Remove all browser/UI tool-choice bans + Default/Fallback-only ranking across chuck/tess/kara/john/alex TOOLS.md+role.md (+tess agents.md PROHIBITED rule, +kara agents.md write-path note). Replace with 'available tools — use the right one for the job' flat lists. Keep desktop-non-intrusion as a SOFT preference, not a ban. Leave safety rails untouched (Alex no-auto-trade, no-secrets-in-chat, RULE 0). Record Phil's 2026-06-20 'no tool bans' order in decisions-log + memory/AGENTS.md so it can't silently regrow. | architecture | P2 | kara / unassigned | 2026-06-21 | 2026-06-21 | 0 days | On-Time |
| #225 | Tess broken: browser-driving + credential-paste instructions stale/self-contradictory Fix: Align Tess TOOLS.md + role.md browser block to her own agents.md Hands-Off rule + Chuck's working pattern: default = Playwright/Puppeteer headless via Docker MCP (mcp__MCP_DOCKER__browser_*) against engelsplace.pages.dev mirror; curl for HTTP smoke; Claude-in-Chrome reserved ONLY for authenticated Cloudflare Zero-Trust/Access work with Phil's go. Rewrite save-credential-to-disk to pre-create the target file (touch) before opening Notepad so Win11 UWP Notepad opens a real empty file with NO Create-new dialog. | website | P1 | kara / unassigned | 2026-06-20 | 2026-06-20 | 0 days | On-Time |
| #224 | kara-network-watch: tunnel pings run with no settle delay after saturating Ookla run, false-WARN on WG-router latency Fix: Insert a ~5s settle delay (sleepSync) between runOokla() and tunnel pings in main() so line/router drains to idle before tunnel latency is measured. Verified+reversible (one helper + one call). A genuinely slow tunnel (sustained >25ms after settle) still warns. | scheduled-task | P2 | kara / unassigned | 2026-06-20 | 2026-06-20 | 0 days | On-Time |
| #200 | Cowork scheduled tasks stall on 'Permissions needed' — 5 morning reports silently half-complete for 10-13h (wrong permission mode, connector calls hang) Fix: Set Cowork scheduled tasks to a full-autonomy/bypass permission mode and/or pre-approve (Always allow) the connectors they use (discord-mcp, resend, open-brain, gmail). Investigate why some sessions launch in plan mode. Verify by a manual run that completes end-to-end (email sent + Discord posted), not just 'ran'. | scheduled-task | P1 | chuck / chuck | 2026-06-15 | 2026-06-20 | 5 days | Late |
| #217 | Home router (ASUS RT-BE92U @ 192.168.1.3) Roaming Assistant kicking IoT/guest devices Fix: Disable Roaming Assistant on all 3 bands (wl0/wl1/wl2_user_rssi=0), same as Nicole P-00214. Save before/after .CFG to Phil's Drive network folder. Reversible, no re-pairing. | network | P2 | kara / kara | 2026-06-16 | 2026-06-16 | 0 days | On-Time |
| #153 | SABnzbd down on Plex box - missing config, won't serve on 8089 Fix: Reconfigure SABnzbd on 192.168.1.5 with Phil's news-server creds + indexers, set port 8089 (or restore sabnzbd.ini from backup). Verify sab=200; new self-heal + tunnel alerter (fixed 2026-06-07) then go green + page on future failures. | network | P2 | kara / unassigned | 2026-06-08 | 2026-06-16 | 7 days | Late |
| #162 | kara-network-watch: internet packet-loss warn threshold is 0%, fires false WARN on healthy line Fix: Align internet lossPct threshold to the existing tunnelLoss precedent (warn:1): change T.lossPct from {warn:0,crit:2} to {warn:1,crit:2} in IT/scripts/kara-network-watch.js, and update Network/network-watch-task-spec.md threshold table to match. Then 0.41% reads green; a genuinely degraded line (>1% sustained) still warns, >2% still critical. Verified+reversible (one number). | scheduled-task | P2 | kara / kara | 2026-06-10 | 2026-06-16 | 5 days | On-Time |
| #170 | Kara WORKING_MEMORY references 4 dead IT/problems/*.md paths (legacy split-brain ledger) Fix: Update agents/kara/WORKING_MEMORY.md lines 73-75: replace IT/problems/00043.md, 00045.md, 00035.md, 00041.md references with canonical ticket IDs (P-00043, P-00045, P-00035, P-00041) checked against problem.js for current status; drop any that are closed. | memory-system | P2 | chuck / kara | 2026-06-11 | 2026-06-16 | 4 days | On-Time |
| #211 | 4 auto-start scheduled tasks not in systems-check autostart inventory (all verified ours) Fix: Add the 4 confirmed-ours tasks to the systems-check autostart inventory so they read as accounted, leaving genuinely-unaccounted autostarts to stand out. Re-verify each is intended before adding. | cleanup | P2 | chuck / chuck | 2026-06-16 | 2026-06-16 | 0 days | On-Time |
| #210 | systems-check.js inventory still expects decommissioned Activepieces — false P1 every run (docker + reboot-recovery) Fix: Remove activepieces from systems-check.js expected-container + reboot-recovery inventory (mirror the P-00205 fix). Better: source the expected-services list from one shared inventory file so retiring a service updates every monitor at once. | architecture | P2 | chuck / chuck | 2026-06-16 | 2026-06-16 | 0 days | On-Time |
| #212 | kara/WORKING_MEMORY.md cites 4 dead IT/problems/000XX.md paths (ledger moved) Fix: Update kara/WORKING_MEMORY.md to cite the tickets by P-XXXXX id (P-00043/45/35/41) via node IT/scripts/problem.js, not the dead IT/problems/ file paths; verify each ticket's current status while editing. | cleanup | P2 | chuck / kara | 2026-06-16 | 2026-06-16 | 0 days | On-Time |
| #202 | Desktop Commander safety blocklist silently WIPED mid-session (all 33 dangerous-command guards removed; origin untraced) Fix: DONE in-pass: restored the default 33-command blocklist via set_config_value. NEXT: (1) add a DC-config drift guard that detects an empty/short blockedCommands and auto-restores the default (self-healing, not just alert) — wire into the daily verifier or a watchdog; (2) trace the culprit via DC clientHistory + the behavior-auditor (an agent loosening security to do its job is a behavioral failure); (3) consider making blockedCommands tamper-resistant (warn/block set_config_value that shrinks it). | architecture | P1 | chuck / chuck | 2026-06-15 | 2026-06-15 | 0 days | On-Time |
| #190 | Gateway bot dead code + 5S: orphan handlers (skill-candidate-drafter loads Opus-metered module at boot), 10x redundant requires, philsclaude residue, 158MB npm-caches, 14 .bak, 14 orphan logs, unbounded config-guardian.log Fix: Boot reconciliation: fail-fast on task->missing-handler, warn on orphan handlers. Lazy-require skill-candidate-drafter (or Sort it). Extract one runStatusScript helper (removes 10 re-requires AND the redirect race). 5S: move philsclaude-* (IN-PASS, decommissioned), npm-caches, .bak, orphan logs to _DELETE_QUEUE; git rm --cached the 2 tracked .bak. config-guardian.js self-rotates log at 5MB + run it under pm2. | cleanup | P2 | john / chuck | 2026-06-13 | 2026-06-15 | 2 days | On-Time |
| #197 | Outcome missing: dreaming-nightly produced no result (verifier could not self-heal) Fix: Investigate why dreaming-nightly ran without producing its artifact; wire in-process re-fire (increment 2) or fix the producer. | scheduled-task | P1 | auto / chuck | 2026-06-14 | 2026-06-15 | 1 day | On-Time |
| #191 | Heartbeat delivery is unverified + dedup-store write is silently swallowed — a dead Discord channel or lost dedup goes unnoticed Fix: Stamp delivered=(discord.ok) onto each heartbeat entry before recordHeartbeat; watchdog flags delivered===false runs. Make writeAlertStore atomic (tmp+renameSync) + count/log failures + surface via status. Promote repeated postDiscord failure into task-failure-tracker. Escalate to Chuck (heartbeat-lib design). | bot-health | P2 | john / chuck | 2026-06-13 | 2026-06-15 | 2 days | On-Time |
| #184 | Gateway bot: alert coverage is OPT-IN — watchdog crashes/critical-results fire no ticket; flapping never auto-disables Fix: Invert to default-on: SILENT_ON_CRASH exempt-set replaces OPS_TASKS_FOR_HEARTBEAT; rolling failure-RATE gate added to task-failure-tracker; critical-RESULT gate after normalization; heartbeat-watchdog WATCHED derived from registry; boot assertion on any silenced /watchdog/i task. ESCALATED to Chuck (design change). ICAR filed. | architecture | P1 | john / chuck | 2026-06-13 | 2026-06-15 | 2 days | On-Time |
| #195 | Morning briefs re-surface RESOLVED P-00161 as a live Phil-blocker (2nd day running) Fix: system-health-monitor + on-track-check keep posting P-00161 token rotation as 'needs Phil, day N' but P-00161 was RESOLVED 2026-06-11 (rotated on Phil's go, verified live). Recurred 6/12 (ops report caught it) + 6/13 (both AM briefs). Root cause: briefs read a stale Phil-blocker source not reconciled vs resolved-ledger status. Fix: reconcile morning-brief blocker list against problem.js resolved status before posting; ICAR (2nd occurrence=systemic). Owner: interactive Chuck. | scheduled-task | P2 | auto / chuck | 2026-06-13 | 2026-06-15 | 1 day | On-Time |
| #205 | codex-nightly-drift-email still checks decommissioned Journey Journal + Activepieces — cry-wolf FAIL every run Fix: Remove checkJourneyJournal()+checkActivepieces() from the results array and the function defs; drop dead JOURNEY_JOURNAL_DIR/ACTIVEPIECES_CREDS/OPENBRAIN_HEALTH(.log) constants (OpenBrain health is covered by the live openbrain-watchdog-latest.json). Checks 9→7. | cleanup | P2 | chuck / chuck | 2026-06-15 | 2026-06-15 | 0 days | On-Time |
| #183 | Gateway bot: shell > redirect race causes ~100 silent watchdog failures (STILL occurring) Fix: Drop the shell > redirect: openbrain-watchdog.js + burn-watchdog.js atomically self-write -latest.json (tmp+renameSync); bot.js:194/207 -> stdio:inherit + readFileSync, identical to kara-* handlers. JOHN FIXING IN-PASS. | bot-health | P1 | john / chuck | 2026-06-13 | 2026-06-15 | 2 days | On-Time |
| #188 | RUNBOOK two incident-recovery steps are BROKEN: key-rotation edits a keyless file; corruption-recovery restores a stale .bak missing 5 live handlers Fix: IN-PASS doc fix: rotation step -> edit .env, set ANTHROPIC_API_KEY, pm2 restart engel-ops-bot, revoke old; corruption-recovery -> git checkout HEAD -- IT/discord-gateway-bot/bot.js (git HEAD=2489 lines=live), delete the manual copy-to-.bak ritual. Bump Last Updated. Gate = doc-audit now scans the dir (P-00187). | bot-health | P1 | john / chuck | 2026-06-13 | 2026-06-15 | 2 days | On-Time |
| #187 | doc-audit drift guard is BLIND to the gateway bot dir — RUNBOOK/SETUP/ecosystem never scanned Fix: IN-PASS: add IT/discord-gateway-bot/RUNBOOK.md, SETUP.md, SOURCES-INDEX.md, ecosystem.config.js to doc-audit.js SCAN_TARGETS, and add a /\.bak/ entry to EXCLUDE so the 14 .bak files are never scanned. Verify scannedFiles +4. This is the systemic gate behind every RUNBOOK/SETUP drift below. | architecture | P1 | john / chuck | 2026-06-13 | 2026-06-15 | 2 days | On-Time |
| #189 | RUNBOOK + SETUP roster/app drift: retired agents listed live, Alex mislabeled retired, Tess/Kara omitted, config-guardian+engelsplace-dev undocumented, dead credit section Fix: IN-PASS: rewrite RUNBOOK active-prefixes -> CHUCK/TESS/KARA/JOHN/ALEX; remove Alex-retired; replace dead credit section with burn-watchdog model; document config-guardian + engelsplace-dev; fix SETUP table to live CHANNEL_AGENTS + drop #dispatch. Recurrence gate = P-00187 (doc-audit scans dir). | bot-health | P2 | john / chuck | 2026-06-13 | 2026-06-15 | 2 days | On-Time |
| #201 | Cline-on-Ollama true bottleneck = slow Vulkan PREFILL on 6700XT (known llama.cpp bug), not driver/config Fix: Options ranked: (1) Cline 'compact prompt' toggle -> ~2-3x smaller prompt, immediate, reversible, BUT loses MCP+FocusChain. (2) likelovewant/ollama-for-amd ROCm fork (HIP SDK 7.1 + rocBLAS gfx1031 swap; reversible) -> ~20-50% faster pp (Phoronix 2026) + may dodge downclock bug; keeps all features; install is fiddly/community-modified. (3) Accept local 8B for quick Q&A + use cloud Claude for agentic coding (honest best-tool). NOTE: even compact+ROCm leaves ~1min/turn; no 12GB-local setup makes 22k-token agentic Cline truly snappy. | architecture | P2 | chuck / unassigned | 2026-06-15 | 2026-06-15 | 0 days | On-Time |
| #203 | Nightly ~02:07 utility-power dip triggers NAS UPS shutdown countdown Fix: Extend QTS Control Panel -> External Device -> UPS 'Turn off the server after AC power fails for' 5->10 min. APPLIED + VERIFIED 2026-06-14 ~23:25 CDT via gamingpc Chrome (engelp admin): QTS 'Changes applied'; write path proven (9->apply->10->apply confirmed). Transient dips can no longer escalate; ~30min battery headroom remains for a real outage. | network | P2 | kara / unassigned | 2026-06-15 | 2026-06-15 | 0 days | On-Time |
| #198 | Ollama+Cline slow: KEEP_ALIVE=0 forces 14s reload per request + 4096 ctx truncates Cline Fix: Set OLLAMA_KEEP_ALIVE=30m + OLLAMA_CONTEXT_LENGTH=16384 (User env), restart Ollama, re-benchmark. Optional: update AMD Adrenalin to re-enable ROCm; /no_think for coding. | architecture | P2 | chuck / unassigned | 2026-06-15 | 2026-06-15 | 0 days | On-Time |
| #196 | nas-watch free-space probe false-flags SMB unreachable on bare UNC Fix: Probe mapped drive B: first (UNC fallback); only tag unreachable if B:+UNC+port445 all fail. Fix landed in SKILL.md (a2) 2026-06-14. | network | P2 | kara / unassigned | 2026-06-14 | 2026-06-14 | 0 days | On-Time |
| #193 | Journey Journal email bridge: 10 nights of green SUCCEEDED sends, ZERO entries landed — false-positive failure recurred (P-00086 redux) Fix: DECOMMISSION entire Journey Journal stack per Phil 2026-06-13: disable+remove journey-journal-nightly, tear down dedicated Activepieces Docker (containers/images/volumes), retire activepieces-secrets + 18 logs + backup scripts + activepieces MCP server, update SYSTEM_STATE/ORG_STATE/credentials-ledger. ICAR documents the receive-side verification gap so this is never rebuilt blind. | scheduled-task | P1 | chuck / chuck | 2026-06-13 | 2026-06-13 | 0 days | On-Time |
| #192 | nas-watch does not monitor NAS free space — capacity endpoint unwired Fix: Find the working store= value returning capacity/free bytes (lvList + poolList extra_get return empty; candidates: volumeList, volumeStorageInfo, management/chartReq.cgi disk_usage), parse free%, tag vs baseline (>=15 ok / 10-15 warn / <10 crit), update qnap-api-reference.md §3 + nas-health-baselines.md + SKILL. | network | P2 | kara / unassigned | 2026-06-13 | 2026-06-13 | 0 days | On-Time |
| #163 | philsgamingmachine NAS backup fails — QNAP NetBak agent diskutil.exe crashes mid-read Fix: Update or repair the QNAP NetBak PC Agent on philsgamingmachine (v3.1.0.103 is crashing), then re-run the backup job to confirm. If it still crashes, pull the diskutil.exe crash dump (Event ID 1000) and open a QNAP support case. Phil-action: software change on the gaming PC, not auto-done from a scheduled fire. NAS side is healthy and snapshots are current, so no NAS data-loss risk in the interim. | network | P2 | kara / kara | 2026-06-10 | 2026-06-13 | 3 days | On-Time |
| #181 | 'Phil-UI only' capability myth in 3 boot docs stalled agent action — Phil escalated; doctrine flipped to DO-WHAT-YOU-CAN-FIRST Fix: Rewrite all 3 doc copies, add the 2026-06-12 standing order to ORG_STATE, extend AGENTS.md try-it-first rule, ICAR for the repeat class, Discord-notify all agents. | architecture | P2 | chuck / unassigned | 2026-06-12 | 2026-06-12 | 0 days | On-Time |
| #180 | Tess deferred a fixable in-lane fix to Chuck instead of finishing end-to-end — RED ALERT violation (Phil had to correct) Fix: GATE (not another rule): extend chuck-complaint-detector ESCALATION_PATTERNS to catch Phil correcting an improper hand-off/lane-refusal ('not pause and delegate', 'fix anything from your lane', 'you are supposed to continue', 'don't defer/hand off') so the NEXT recurrence auto-files a structural ticket — done by Tess + bot restarted (notify Chuck, per the same don't-defer lesson). Plus sharpened self-check: before writing ANY hand-off phrase (flag to X / X's lane / re-flag / out of my lane), run the RED ALERT test — tools present + non-destructive path = EXECUTE now, notify owner after; the hand-off phrase is only valid with an explicit stop-condition (missing access / Rule 0 / verified-done). | architecture | P1 | tess / unassigned | 2026-06-12 | 2026-06-12 | 0 days | On-Time |
| #178 | commit-content.js 'git add -A' can sweep an oversized transient file into a content deposit and silently break every Pages deploy Fix: Add an oversized-file gate to commit-content.js (the shared puller commit helper): after 'git add -A', scan staged files and UNSTAGE any >24 MiB (Cloudflare Pages rejects any deploy with a file >25 MiB), logging the exclusion to the puller's Discord summary. This protects EVERY puller at the shared chokepoint, preserves the intentional flush-everything coverage, and never blocks legit content. Complements the existing build-gate (which catches broken content but not oversized content). The original trigger file (.ffpass-*) is already gitignored + the embed script now writes passlogs to os.tmpdir(). Restart engel-ops-bot to load. | website | P2 | tess / unassigned | 2026-06-12 | 2026-06-12 | 0 days | On-Time |
| #176 | Cloudflare Pages deploys not landing — today's infographic pushes (tb-500 + auto video embeds) stuck/lagging 20+ min Fix: Immediate: re-triggered a fresh Pages build via empty commit 64c68cc + launched a background watcher (IT/scripts/tb500-notify-when-live.js) that Telegrams Phil the link the moment it deploys. If 64c68cc also fails to land, this is a real Pages pipeline failure → ESCALATE to P1 + check the Cloudflare Pages dashboard build log (needs dashboard or a Pages:read-scoped token; the workers-deploy token returns Authentication error for the Pages API and credential scanning is not authorized). Preventive: add a Pages deployment-status check (wrangler pages deployment list, or Pages:read token) to the publish verify step so a stuck/failed deploy is detected directly instead of inferred from polling the live URL. | website | P2 | tess / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #148 | minutes-sync/action-item pipeline emits invalid frontmatter (priority 'normal', duplicate keys) — crashes the live build Fix: Tess hardens the action-item writer: validate priority against the schema enum (low/medium/high/urgent) and REPLACE rather than append 'updated:' on re-sync. This is the root cause of the 2026-06-05 AM engelsplace-dev crash-loop (9 files repaired in commit 932d6a4). | website | P2 | kara / tess | 2026-06-06 | 2026-06-11 | 4 days | On-Time |
| #161 | Agent patch tokens in gateway .env are dev-default values; one echoed into a session transcript Fix: Rotate every *_PATCH_TOKEN in IT/discord-gateway-bot/.env to crypto-random 48-hex values (all consumer scripts read the file at call time - verified chuck-behavior-auditor, complaint-detector, problem-auto-closer, tess-infographic-request-to-ledger, patch-review, skills.js - so rotation is zero-downtime, no other copies exist). AWAITING PHIL GO: rotation attempt 2026-06-09 was blocked by the permission classifier pending explicit authorization. | credential | P2 | chuck / unassigned | 2026-06-10 | 2026-06-11 | 1 day | On-Time |
| #127 | Phil escalation pattern detected — structural review needed Fix: Review the quoted escalations in this ticket's detector status file. Each is Phil raising something he has flagged before — treat as a STRUCTURAL gap, not a one-off symptom. For each: (1) find the root mechanism that let it recur, (2) propose the smallest change that removes the recurrence (rule, hook, handler, or doctrine edit), (3) close this ticket once the structural fix ships or Phil signs off. This is the inside-the-loop corrective signal built as P-00041 mechanism 2. | architecture | P1 | chuck / chuck | 2026-06-02 | 2026-06-11 | 9 days | Late |
| #115 | Behavioral pattern: NO_VERIFY_BEFORE_ASSERT (chuck) Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Pre-commit assertion check: any tool output containing `Claude_pzs8sxrjxfjjc` or `Packages\Claude_` must trigger a mandatory 'sandboxed path detected' warning before any success claim is rendered. — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for chuck and needs a doctrine/hook change, not a per-incident nudge. | architecture | P1 | chuck / chuck | 2026-05-30 | 2026-06-11 | 12 days | Late |
| #114 | Behavioral pattern: IGNORED_CORRECTION (chuck) Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Rule: after 2 consecutive user interrupts on the same thread, agent must enter plan/diagnostic mode automatically — no new action commands until root cause is named and acknowledged. — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for chuck and needs a doctrine/hook change, not a per-incident nudge. | architecture | P1 | chuck / chuck | 2026-05-30 | 2026-06-11 | 12 days | Late |
| #175 | Infographic gallery cards 404 on the LOCAL dev server (extensionless URLs not served in dev; production unaffected) Fix: Add a dev-only Astro integration (astro:server:setup Vite middleware) that rewrites an extensionless GET /infographics/<slug> to /infographics/<slug>.html when that static file exists in public/ — giving the dev server the same extensionless serving Cloudflare Pages does in production. Keeps the clean canonical URLs (no .html in links, no prod redirect hops), fixes all 15 cards + every future one at once, dev-only (production path unchanged). Restart engelsplace-dev to load it; verify 127.0.0.1:4321/infographics/<slug> returns 200. | website | P2 | tess / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #107 | Dependency sweep held upgrades and optional AgentKits audit chain Fix: Test one dependency family at a time with local-cache npm update, syntax/import smoke checks, and no PM2 restart until Phil approves live promotion; keep AgentKits optional deps omitted unless upstream fixes the optional transformer chain. | scheduled-task | P2 | auto / chuck | 2026-05-28 | 2026-06-11 | 14 days | Late |
| #9 | Rewrite chuck-daily-ops-report as handler-typed (no LLM) Fix: Chuck builds IT/discord-gateway-bot/daily-ops-report-handler.js that: (1) reads ORG_STATE active items + live-open-actions.js output (Phil action items section), (2) reads last 24h of chuck-health-beacon + chuck-drift-guard + chuck-heartbeat-watchdog heartbeats from heartbeat file (task health section), (3) greps Gmail for NAS alerts via bot's existing gmail client (NAS alerts section), (4) reads Discord channels for Phil replies via existing discord.js client (reply loop), (5) composes a markdown report deterministically — no LLM. Same content as the LLM narrative, zero timeout risk, zero token cost. Register as scheduled-tasks.json handler='daily-ops-report', same 18:38 CDT cron. Watch for 7 days; if coverage is fine, remove the agent-typed fallback. | system | P2 | chuck / chuck | 2026-04-24 | 2026-06-11 | 47 days | Late |
| #152 | Behavioral pattern: ACT_BEFORE_CONFIRM (chuck) Fix: Auto-detected by chuck-behavior-auditor from a session transcript - no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Add a pre-integration checklist to chuck/agents.md TARGET DISCIPLINE Part 2: before generating any third-party signup link or connection token, agent must enumerate the specific accounts/systems the tool must read and confirm each one is supported by that tool (documented check), with Phil's explicit confirmation of account types. No connect-link generation on assumed account types. - Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for chuck and needs a doctrine/hook change, not a per-incident nudge. | architecture | P1 | chuck / chuck | 2026-06-07 | 2026-06-11 | 3 days | Late |
| #147 | SYSTEM_STATE scheduled-task inventory drifts stale + checkPaidAgentCrons misses handler-typed paid twins Fix: Chuck re-syncs the SYSTEM_STATE.md scheduled-task table against live config (3 lied 'enabled' tonight; may be more) and extends checkPaidAgentCrons to flag handler-typed-but-paid tasks, not just agent-typed. | scheduled-task | P2 | kara / chuck | 2026-06-06 | 2026-06-11 | 4 days | On-Time |
| #128 | ops-ledger DB can drift from markdown on out-of-band edits — add daily reconcile Fix: Dual-write keeps ops-ledger.db in sync for all problem.js writes, but markdown changed out-of-band (another surface's git commit/pull, manual .md edit) won't reflect until the next write to that ticket. Add a daily cron that runs migrate-ledger-to-sqlite.js + verify-ledger-parity.js and posts to #it-ops only if parity fails. Cheap eventual-consistency safety net. | scheduled-task | P2 | chuck / chuck | 2026-06-03 | 2026-06-11 | 8 days | Late |
| #169 | Orphan preview server on port 8771 (python http.server) left running after infographics session Fix: After the active infographics session wraps: kill PID 54316 (python -m http.server 8771 serving Projects/engelsplace/public/infographics). Then bake a rule into the preview workflow: bind preview servers to 127.0.0.1 and kill them at session end so they never show up as unaccounted listeners. | cleanup | P2 | chuck / chuck | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #159 | OpenBrain compose drift: 3 of 5 services (telegram, crm-api, crm-ui) defined restart:always but not running Fix: Chuck to decide: if CRM-UI/CRM-API/telegram are unused, move them behind a compose profile or comment them out so the active stack is just DB+MCP; if wanted, start them and document. Either way reconcile compose + ARCHITECTURE.md with reality. ESCALATED to Chuck. | architecture | P2 | john / chuck | 2026-06-10 | 2026-06-11 | 1 day | On-Time |
| #158 | OpenBrain backups: no retention/rotation (unbounded growth) Fix: Add 30-day retention to openbrain-backup.ps1 (delete *.sql.gz older than 30 days after a successful run). John already quarantined the 15 zero-byte files to _DELETE_QUEUE/openbrain-zerobyte-backups-2026-05 in-pass. | cleanup | P2 | john / chuck | 2026-06-10 | 2026-06-11 | 1 day | On-Time |
| #156 | OpenBrain: no backup-freshness guard + dumps never restore-tested Fix: Add a freshness+integrity guard to the bot.js openbrain-watchdog run (or systems-check): alert if newest openbrain-backups/*.sql.gz is >36h old OR <1KB; run gzip -t weekly on the newest dump; quarterly restore-into-throwaway-container drill to prove recoverability. 0 dollars, reuses existing cron. | memory-system | P2 | john / chuck | 2026-06-10 | 2026-06-11 | 1 day | On-Time |
| #155 | OpenBrain: no capture-pipeline failure monitor (silent memory-loss path) Fix: Extend IT/scripts/openbrain-watchdog.js to also call get_capture_job_stats each run; emit overall=critical (alert to #it-ops) if failed>0, or pending stays >25 across two consecutive runs. Reuses the existing 30-min bot.js cron - no new schedule, 0 dollars. | memory-system | P2 | john / chuck | 2026-06-10 | 2026-06-11 | 1 day | On-Time |
| #140 | Propagate --root-cause syntax to the 6 agent skill docs + rebuild plugins (QMS 5-Whys enforcement) Fix: Add --root-cause to the problem.js create example in chuck/tess/kara/alex/john/systems-check skill SKILL.md + bot.js ~line 585 prompt; rebuild + reinstall plugins via auto-rebuild-plugins.js so agents file with a root cause first-try. | cleanup | P2 | kara / chuck | 2026-06-06 | 2026-06-11 | 5 days | On-Time |
| #133 | ORG_STATE.md wiped to 0 bytes by PowerShell append during 6/4 on-track fire (recovered; 5/27-6/4 entries reconstructed) Fix: Already repaired: git checkout HEAD + RECONSTRUCTED block from AGENT_BOARD/Discord. Residual risk 1: reconstructed entries are summaries, not originals — interactive Chuck should spot-check vs agents/*/memory journals. Residual risk 2: bootstrap files are committed rarely (ORG_STATE last real commit 5/26 = 9 days exposure) — add a nightly git commit of bootstrap files to an existing cron so git HEAD is never more than 24h stale. Lesson banked memory/learning/2026-06-04-powershell-file-wipe.md. | memory-system | P2 | auto / chuck | 2026-06-04 | 2026-06-11 | 6 days | On-Time |
| #139 | 2 Code routines on disk but not in the live scheduler (chuck-skill-candidate-drafter, cowork-pro-rollover-check) Fix: Chuck decides per task: register it in the Code scheduler if it should run, else move its dir to _DELETE_QUEUE. Then stamp the QMS block or remove. | cleanup | P2 | kara / chuck | 2026-06-06 | 2026-06-11 | 5 days | On-Time |
| #146 | Sort 3 tombstoned Cowork task dirs to _DELETE_QUEUE (mj-daily-drop, kara-network-watch, bridge-test) Fix: Move the 3 tombstoned Cowork Scheduled dirs (skill renamed to .disabled/.migrated/.sorted) into _DELETE_QUEUE; Phil deletes the bridge-test card in the Cowork UI. The nightly doc-audit flags them until cleared. | cleanup | P2 | kara / chuck | 2026-06-06 | 2026-06-11 | 4 days | On-Time |
| #173 | embed-infographic-video.js HTML injection is not CRLF/indent-safe — silently fails on Windows-EOL pages Fix: Replace the literal-string header/content anchor (hardcoded 4-space indent + LF) with an EOL- and indent-agnostic regex /([ \t]*<\/div>)(\r?\n\r?\n)([ \t]*<div class="content">)(\r?\n)/ and the </style> match with /([ \t]*)<\/style>/, building inserted blocks with the file's detected EOL. Same fix already proven in the retatrutide one-off injector. | website | P2 | tess / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #110 | Behavioral pattern: PREMATURE_DONE (chuck) Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Add a pre-close gate to problem.js resolve command: require the closer to paste the ticket's literal title/goal and write a one-line mapping of how the shipped artifact satisfies that exact phrasing. If mechanisms in the ticket body are explicitly dropped, require an explicit '--dropped=<list>' flag rather than silent omission, so a partial ship cannot masquerade as full closure. — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for chuck and needs a doctrine/hook change, not a per-incident nudge. | architecture | P1 | chuck / chuck | 2026-05-30 | 2026-06-11 | 12 days | Late |
| #134 | problem.js --inactive=0 parsed as 30d filter — on-track triage ran blind (FIXED same fire) Fix: parseInt(flags.inactive,10)||30 treated 0 as falsy, so the on-track-check Phase 4 command 'list --inactive=0' filtered to problems untouched 30+ days. 6/4 morning fire reported '1 open problem / ledger clean' while plain list showed 18 open. Fixed same fire: Number.isNaN guard in problem.js line 562; verified --inactive=0 now returns all 18. Residual: re-triage the 17 problems the morning fire missed in next interactive session. | architecture | P2 | auto / unassigned | 2026-06-04 | 2026-06-11 | 6 days | On-Time |
| #21 | Claude Desktop 1.4758 random crash after 2-3 hours use Fix: GitHub issue #28900 — Cowork window/frame disappears after 2-3 hours. Risks Phil's scheduled tasks (chuck-daily-house-in-order 4:06 AM, system-health-monitor 5:07 AM, daily-financial-report 8:09 AM) if crash falls during fire window. Mitigations already in place: chuck-heartbeat-watchdog every 30 min catches missed fires, ClaudeZombieReaper 4 AM clears stale subprocesses. NEW action: monitor heartbeat-watchdog reports for next 72h. If 2+ missed fires in 24h, escalate to P0 + add explicit Cowork-restart-on-resurrect logic to ClaudeZombieReaper. | system | P1 | auto / chuck | 2026-04-26 | 2026-06-11 | 45 days | Late |
| #172 | dreaming-nightly rotation excluded chief-of-staff — agent had WORKING_MEMORY.md but never got memory consolidation Fix: Add chief-of-staff to the rotation (6 agents, % 6); longer term the roster-completeness pattern from P-00168 covers enumerated-agent-list rot. | scheduled-task | P2 | chuck / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #171 | ClaudeMCPDupeReaper task false-green 10 days — run-hidden.vbs mangled switch-string args, script never ran Fix: FIXED IN-PASS 2026-06-10: run-hidden.vbs now appends args starting with '-' raw instead of re-quoting (paths still quoted). Verified end-to-end: Start-ScheduledTask grew mcp-dupe-reaper.jsonl 2->4 lines with fresh timestamps. Residual: vbs still cannot report child failures (fire-and-forget) — acceptable for hidden-window helpers, documented in ICAR. | scheduled-task | P2 | chuck / chuck | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #167 | DMSO infographic understates real uses + omits the veterinary FDA approval (too negative) Fix: Rebuild dmso.html: correct the headline (DMSO has TWO FDA approvals - human interstitial cystitis Rimso-50 1978 AND veterinary Domoso 1970 for dogs/horses, which the page omitted entirely); reorganize around real-world USES with both pillars; add the documented clinical uses the page missed (chemo anthracycline-extravasation = treatment of choice per many authors; the Pennsaid topical-diclofenac DMSO-carrier role; CNS/ICP research; broad veterinary use); give the doctors pillar (Stanley Jacob MD, Jack de la Torre MD/PhD) real weight; KEEP honest caveats (joint-pain monotherapy evidence thin, IV use genuinely risky, pharma-grade-only carrier rule). | website | P2 | tess / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #168 | doc-audit blind to roster-table rot — no table-aware retired-agent rule, no roster-completeness check (board ask #41 items 1-2) Fix: Add retired-agent-marked-active-row table rule + checkRosterCompleteness() structural check (canonical names parsed live from CLAUDE.md Agent Roster so the check itself cannot rot). | memory-system | P2 | chuck / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #166 | skill-candidate pipeline dead since 2026-04-27 — Stop hook never finds transcript, 0 candidates ever drafted Fix: Rewrite hook to parse stdin JSON transcript_path (corrected fallback munge), align drafter SKILL.md to hook's real schema, verify end-to-end by piping a real Stop payload and confirming a marker line appears. | scheduled-task | P2 | chuck / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #164 | engelsplace serves homepage with HTTP 200 for unknown URLs (soft-404, no 404 page) Fix: Add src/pages/404.astro so the static build emits 404.html; Cloudflare Pages then returns a real 404 status for unknown routes instead of SPA-fallback serving index.html with 200. | website | P2 | tess / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #165 | Infographic boot reconciler false-positives non-topic tickets into the briefing (matched raw content, not tags) Fix: Match the tags: frontmatter block only in tess-infographic-request-to-ledger.js openTopicRequestWorkOrders(), not the raw file content. | website | P2 | tess / unassigned | 2026-06-11 | 2026-06-11 | 0 days | On-Time |
| #157 | OpenBrain stale duplicate Windows task OpenBrain-HealthCheck fails hourly (exit 2, missing .bat) Rollback: Re-enable with: Enable-ScheduledTask -TaskName OpenBrain-HealthCheck Fix: Retire the stale Windows task: schtasks /delete /tn OpenBrain-HealthCheck /f (the bot.js openbrain-watchdog cron fully supersedes it). Verify the bot.js cron is the single source first. ESCALATED to Chuck - deleting a scheduled task is lane-owner territory. | scheduled-task | P2 | john / chuck | 2026-06-10 | 2026-06-10 | 0 days | On-Time |
| #160 | Official Discord plugin spawned one Bun gateway server per Claude session - June 9 memory blowup Fix: DONE 2026-06-09 (Phil-directed): plugin disabled in settings.json; philsclaude-launcher.vbs moved Startup -> _DELETE_QUEUE; PhilsClaude PID 17328 + bun pair killed; PhilsClaude project marked DECOMMISSIONED; AppX package verified Status=Ok, no cleanup needed. Reversible: re-enable plugin + restore VBS. | architecture | P2 | chuck / unassigned | 2026-06-10 | 2026-06-10 | 0 days | On-Time |
| #154 | notify.js silently dropped every 'warn' push (BAD_STATES vocab mismatch) Fix: Added 'warn' and 'warning' to BAD_STATES in IT/scripts/lib/notify.js. Verified zero blast radius: the only other pushAlert callers (kara-hdp-backup-verifier, kara-tunnel-reachability-check, tess-website-watchdog) all map to alert/critical/ready and never emit 'warn'. Genuine warn-tier alerts now reach Phil's phone; edge-trigger + 6h re-reminder behavior unchanged. | bot-health | P2 | kara / kara | 2026-06-09 | 2026-06-09 | 0 days | On-Time |
| #143 | Review AI-silicon concentration vs Phil's 3-year retirement horizon (sequence risk) Fix: After the 401k snapshot (P-00142) gives the total picture, compute combined retirement allocation. NVDA stays long-term per Phil's explicit wish. If Robinhood is a large share of total retirement, propose a GRADUAL de-risk over 12-18 months on the OTHER overweight basket names (not NVDA) — small steps on green days, never a panic sell. If Robinhood is a small slice, document the risk as accepted and hold. | phil-action | P2 | alex / alex | 2026-06-06 | 2026-06-07 | 0 days | On-Time |
| #142 | Snapshot Phil's JP Morgan 401(k) into the tax tracker (total-retirement picture) Fix: Recommended Option A (zero stored credentials): Phil logs into the 401k himself, Alex reads holdings+balance via browser (read-only, never logs in or trades), records into Finance/taxes tracker, refresh quarterly. First confirm portal (J.P. Morgan Retirement Link / Chase / Empower — JPM sold 401k recordkeeping to Empower). Option B = Plaid aggregation if Phil wants automation later (loop in Chuck). | phil-action | P2 | alex / alex | 2026-06-06 | 2026-06-07 | 0 days | On-Time |
| #141 | chuck-health-beacon + chuck-drift-guard missed 6/5 evening fires (bot online) Fix: Observe tonight's 2026-06-06 18:35/18:36 fires. If BOTH post clean to #it-ops/#network, the 6/5 miss was a one-off transient stall — close this ticket. If either misses a SECOND consecutive evening, escalate to Real Chuck/Code to inspect the bot.js node-cron registration + firing loop for these two task IDs (check for a swallowed exception or a timer that is not rescheduled after a missed tick). External process cannot re-fire these; confirmation requires the live evening run. | bot-health | P2 | chuck / chuck | 2026-06-06 | 2026-06-06 | 0 days | On-Time |
| #131 | OpenBrain nightly backup failing since 2026-05-21 (byte encoding error), no log entries in 13 days Fix: backup.log newest entry 2026-05-22 03:30 with ERROR 'Cannot proceed with byte encoding. When using byte encoding the content must be of type byte.' and NO entries after 5/22 — either the scheduled backup task stopped firing or it dies silently. Fix: inspect the backup script's pg_dump/gzip pipeline encoding (likely PowerShell pipe corrupting binary; use cmd /c redirection or pg_dump -f directly), then verify the Windows scheduled task still exists and fires. Both open_brain containers are Up, so this is backup-only risk. | memory-system | P1 | auto / chuck | 2026-06-04 | 2026-06-06 | 1 day | On-Time |
| #136 | Infographic topic request: THC Fix: Build a THC infographic using the 6-pass research-first protocol + 11-section evidence-honest PubMed-grounded template, then publish to /infographics. Requester context is internal-only and must never appear on the public page (requester-privacy rule). | website | P2 | tess / tess | 2026-06-05 | 2026-06-05 | 0 days | On-Time |
| #130 | Paid bot-cron chuck-daily-ops-report double-fired the free Code routine (~$60/mo regression) Fix: Bot cron chuck-daily-ops-report (scheduled-tasks.json, agent:chuck, Opus 4.8 paid API) was the redundant paid twin of the free Code routine .claude/scheduled-tasks/chuck-daily-ops-report (Max sub, $0). Code routine header claimed the bot cron was disabled 2026-05-14 (~$60/mo) but it stayed enabled:true and double-fired daily ~3 min after the free one (free 6:47 PM / paid 6:50 PM, both to #operations); re-tuned 18:38->18:50 on 6/1 under P-00116 as if canonical. FIX 2026-06-03 (Kara, Phil-directed): bot cron enabled:false + bot restart (now 22 tasks, ops-report absent) + DO-NOT-RE-ENABLE note in scheduled-tasks.json. Free Code routine is sole owner. Guard gap: the Code routine Monday API Credit Check already defines any scheduled-tasks.json task with agent: + enabled:true as a regression — recommend chuck-doc-audit/behavior-auditor assert it automatically. | architecture | P2 | kara / unassigned | 2026-06-03 | 2026-06-03 | 0 days | On-Time |
| #129 | The /dream skill edits canonical memory (MEMORY.md) but logs only to git, not the dream activity trail Fix: Verified root cause (commit d9753584, manual /dream 2026-06-01 21:10): the dream skill committed a MEMORY.md consolidation to git with a clear message, but wrote NO entry to IT/scripts/dreaming_logs/ — that log only captures the nightly cron runs, not manual /dream runs that edit memory. So canonical-memory edits aren't auditable in the dream trail (only via git history + diff). The change itself was safe (line-merging consolidation, no heuristics dropped). FIX: make the dream skill append a record to IT/scripts/dreaming_logs/<date>-<agent>.log (or a dedicated memory-edits log) on EVERY memory-file edit — manual AND nightly — capturing: file, before/after line count, what was merged/removed and why, and the git commit hash. Then a memory prune is always traceable in the dream's own log, not just git. Belt-and-suspenders: also have it note in the commit body the specific lines removed, not just 'under cap'. | architecture | P2 | chuck / chuck | 2026-06-03 | 2026-06-03 | 0 days | On-Time |
| #126 | systems-check.js over-reports: ignores enabled:false tasks + flags gracefully-handled missing files as false-output risk Fix: Two accuracy fixes to IT/scripts/systems-check.js so the frozen inspector stops generating false positives every run: (1) In the scheduled-tasks freshness check, skip tasks with enabled===false (or report them as an intentional 'disabled' note, not 'may have stopped firing') — it flagged kara-network-throughput (enabled:false) as stopped. (2) In the task-drift check, before flagging a referenced-but-missing file as 'stale → false output risk', confirm the consuming script doesn't handle absence gracefully — skills.js handles a missing _session-markers.jsonl with a clean 'no markers yet' path, so that is normal operation, not drift. Goal: every systems-check finding is actionable, so real problems aren't lost in noise and no agent 'fixes' a non-problem. | architecture | P2 | chuck / chuck | 2026-06-02 | 2026-06-03 | 1 day | On-Time |
| #123 | Legacy split-brain Problem Ledger mirror — IT/problems still holds 63 files vs canonical ledger Fix: Reconcile IT/problems (63 stale files) against the canonical ledger, then archive the dir to _DELETE_QUEUE/ so no tool/agent reads the wrong source and reports false counts. Verify the SQLite/canonical migration is complete before removing. | cleanup | P2 | chuck / chuck | 2026-05-31 | 2026-06-03 | 2 days | On-Time |
| #124 | memory/MEMORY.md over the 100-line hard cap (102 lines) Fix: Distill or merge two of the lowest-value heuristics so MEMORY.md returns to <=100 lines. The cap exists so the always-loaded heuristics file stays scannable; let it creep and it stops being the tight source of truth it's meant to be. | cleanup | P2 | chuck / chuck | 2026-05-31 | 2026-06-03 | 2 days | On-Time |
| #112 | [gate-test] throwaway for pre-close gate Fix: test ticket to verify the pre-close gate success path | cleanup | P2 | chuck / chuck | 2026-05-30 | 2026-06-03 | 3 days | On-Time |
| #118 | Finish the Chief of Staff build — SQLite ops-ledger / Control Tower backend Fix: Build the deferred SQLite ops-ledger backend Chief of Staff was designed around: IT/scripts/ops-ledger.js + IT/data/ops-ledger.db + a verified migration from the current sources, with backup/recovery proven first. Define what the ledger holds, wire the read path, then flip Chief of Staff from file-reads to the ledger. | architecture | P1 | chuck / chuck | 2026-05-31 | 2026-06-03 | 2 days | On-Time |
| #29 | Discord one-way — Phil reads bot posts fine, can't reliably RESPOND from phone/away Fix: REFRAMED 2026-04-27 per Phil correction: outbound bot→Discord works fine, Phil reads posts cleanly. The friction is INBOUND — Phil can't easily compose responses from phone/away from gaming PC. Earlier symptoms (double-answers, crashes) are still real but are SECONDARY to the inbound channel gap. Multi-track fix: (1) Cheap & immediate — formalize email-to-Chuck pipeline. Reply Loop already polls Gmail in scheduled tasks; document for Phil that he can email [email protected] with subject prefix 'CHUCK:' from any phone/device and the next bot fire will surface it under '📬 Phil Asked' in the next ops report. ~10 min documentation, zero new code. (2) Medium — build a dedicated email-route handler that polls a chuck-inbox label hourly (or on-demand via a webhook) and routes messages through askAgentReal pipeline, posting Chuck's reply back to #it-ops. ~2-3 hours. (3) Original Discord cleanup still applies for the secondary symptoms — audit duplicate listeners, add per-message idempotency, tag replies [REAL] / [SONNET]. ~2 hours. (4) Anthropic-side wishlist (Phil's hope, not buildable by us): native Discord ↔ Claude Code switch. Order of execution: ship (1) today as a doc + Phil-tested workflow; (2) and (3) next interactive session. Until done, Phil's reliable response paths are: email [email protected], AnyDesk to laptop, or wait for next interactive session at gaming PC. | system | P1 | chuck / chuck | 2026-04-27 | 2026-06-02 | 35 days | Late |
| #44 | pm2 doesn't auto-launch on Windows reboot — PM2ResurrectOnLogin failing (0xC000013A), no pm2-windows-startup configured Fix: Surfaced by Tess 2026-04-27 night after fixing engelsplace-dev pm2-on-Windows recurring crash. Phil's framing: this is the THIRD time we have run into pm2-on-Windows surprises and we keep re-diagnosing. SYSTEM_STATE.md 2026-04-26 already flagged PM2ResurrectOnLogin showing exit 0xC000013A with a note 'Not confirmed broken, but verify pm2 status after next reboot. If pattern repeats, harden with retry or delayed-start trigger.' That flag sat unread for ~36 hours — exact pattern P-00041 self-improvement loop is meant to catch. Disaster-recovery gap: pm2 save handles daemon-restart persistence but NOT Windows-reboot persistence. After Windows reboot pm2 itself does not auto-launch, so engelsplace-dev (Tess's site) AND engel-ops-bot (Chuck's gateway) BOTH go offline until someone manually runs pm2 resurrect. Plan: (1) diagnose 0xC000013A on PM2ResurrectOnLogin — likely needs delayed-start trigger or 'Run only when user logged in' adjustment. Get sample exit codes from event viewer. (2) IF the existing scheduled task can not be hardened, install pm2-windows-startup as a Windows service which auto-launches pm2 daemon at boot before any user logs in, then pm2 resurrect runs from the saved dump.pm2. (3) Verify by simulating reboot (pm2 kill + reboot test on a quiet evening) — confirm bot + engelsplace-dev come back unattended. Estimated ~1-2 hours including the reboot smoke test. Banked the underlying gotcha + reboot caveat at memory/topics/pm2-npm-windows-gotcha.md so the THIRD recurrence has a queryable fix instead of re-diagnosing from scratch. | system | P1 | chuck / chuck | 2026-04-28 | 2026-06-02 | 34 days | Late |
| #116 | daily-ops-report unreliable — fires in the 18:35–38 cron block, sometimes skipped Fix: Phil flagged the daily-ops-report timing as broken repeatedly (Discord #it-ops 2026-05-29 22:51: 'it keeps happening over and over. How many times do I have to flag this?'). ROOT CAUSE confirmed 2026-05-30: gateway scheduled-tasks.json runs chuck-health-beacon(35 18), chuck-drift-guard(36 18), chuck-daily-ops-report(38 18) back-to-back; node-cron logged a 17:00 'missed execution (blocking IO)' so the report sometimes skips entirely. FIX (needs interactive Real-Chuck session with live-bot reload): (1) get Phil's preferred report time; (2) stagger ops-report off the 18:35-38 block in IT/discord-gateway-bot/scheduled-tasks.json; (3) pm2 restart engel-ops-bot to reload JSON; (4) verify next fire lands. Links behavior-auditor P-00114 (IGNORED_CORRECTION). Do NOT close until Phil confirms a clean on-time fire. | scheduled-task | P1 | auto / unassigned | 2026-05-30 | 2026-06-02 | 2 days | On-Time |
| #120 | Cowork tasks reference deleted files (stale → false-output risk): personal-action.js, codex-system-health-monitor.js, skills-pending marker, 2026-05-04 journal Fix: For each task (chuck-openclaw-on-track-check, chuck-skill-candidate-drafter, cowork-pro-rollover-check): repoint to the live file path or retire the dead step. Confirmed missing: IT/scripts/personal-action.js, memory/skills-pending/_session-markers.jsonl, IT/scripts/codex-system-health-monitor.js, agents/chuck/memory/2026-05-04.md. A task reading a gone file either errors silently or emits stale/empty output Phil may trust. | scheduled-task | P1 | chuck / chuck | 2026-05-31 | 2026-06-01 | 0 days | On-Time |
| #52 | Gmail OAuth re-auth landed on wrong Google account (phillip.engel instead of fairriteworksync) — minutes puller blind Fix: Phil re-runs node IT/discord-gateway-bot/gmail-oauth-setup.js. CRITICAL: at the Google account chooser, sign in as [email protected] (NOT phillip.engel — Google preselects the browser default which IS phillip.engel). Best path: incognito window. Or sign out of phillip.engel first at gmail.com, then run script. Setup script overwrites gmail-oauth-tokens.json on success. Tess verifies post-re-auth via Gmail /users/me/profile API call (must return [email protected]) + runGmailMinutesPull smoke test. Then Phil revokes the wrong-account grant at https://myaccount.google.com/permissions to clean up the stray gmail.readonly grant on his personal account. Cross-lane #6 to Chuck filed: harden gmail-oauth-setup.js with [email protected] OAuth param so the consent screen pre-pins the correct account. | website | P1 | auto / tess | 2026-04-29 | 2026-06-01 | 32 days | Late |
| #122 | chuck-complaint-detector stopped firing — last ran ~37h ago (cron 0 6 daily), self-improvement loop blind Fix: Last status file 2026-05-29 22:52; missed 5/30 and 5/31 6am fires. Freshly built (P-00041 mech 2) and already dark. Verify the schedule is registered and firing, and the handler exits clean; re-register or fix. A repeat-complaint detector that doesn't run defeats the no-Phil-complaint-trigger goal of P-00041. | scheduled-task | P1 | chuck / chuck | 2026-05-31 | 2026-06-01 | 0 days | On-Time |
| #121 | kara-hdp-backup-verifier stopped firing — last ran 56h ago (cron 0 4 daily), NAS backup health now unverified Fix: Last status file 2026-05-29 04:00; missed both 5/30 and 5/31 4am fires. Check whether the Cowork/cron schedule still fires the task and that its handler still runs clean; re-register the schedule or fix the handler. Until it fires, Phil has no signal on whether NAS HDP backups are succeeding — that's the silent failure mode this verifier exists to catch. | scheduled-task | P1 | chuck / kara | 2026-05-31 | 2026-06-01 | 0 days | On-Time |
| #125 | engelsplace repo has uncommitted changes — can block deploys + ledger auto-close (close refuses dirty files) Fix: Review the uncommitted change in the engelsplace repo and either commit it (if intended) or revert it (if a stray edit). A dirty tree can block a Cloudflare Pages deploy and makes problem.js auto-close refuse to act on website tickets. | website | P2 | chuck / tess | 2026-05-31 | 2026-05-31 | 0 days | On-Time |
| #119 | Twilio SMS emergency alerts not wired (Plex/tunnel down) — Phil HIGH PRIORITY Fix: Detector fixed + Plex self-heal shipped 2026-05-31. REMAINING: emergency SMS escalation. (1) Phil creates Twilio account, gets Account SID + Auth Token + a Twilio number (free trial covers it). (2) Kara wires kara-tunnel-reachability-check.js (and other red-state monitors) to POST to Twilio SMS on ALERT, texting Phil's cell. (3) Fire a real test text. Phil's standing order 2026-05-31: surface this every time he asks about system status until DONE. | network | P1 | auto / unassigned | 2026-05-31 | 2026-05-31 | 0 days | On-Time |
| #117 | Lane-interference-guard false-blocked agents on their own lanes Fix: Convert PreToolUse guard from blocking to advisory: PATH-only matching (not content), always exit 0, only LOG cross-lane writes to IT/status/lane-crossings-log.jsonl for after-the-fact owner notification; sweep all doctrine to match. | architecture | P1 | chuck / chuck | 2026-05-31 | 2026-05-31 | 0 days | On-Time |
| #106 | Phase E: mid-session memory consolidation nudge (Hermes-pattern, final lift) Fix: Build IT/scripts/mid-session-nudge.js as PostToolUse hook firing once per session at N=20 tool calls; emits additionalContext nudge prompting OpenBrain capture_thought + task re-anchor + tool-history pruning. Closes the runaway-context root cause behind the 2026-05-15 daily-ops-report crashes (3 fires lost in 4 days). | memory-system | P2 | chuck / chuck | 2026-05-26 | 2026-05-26 | 0 days | On-Time |
| #105 | chuck-observer missing engelsplace git diff — Antigravity-shipped routes leak before next Chuck/Tess boot Fix: Extend IT/scripts/chuck-observer.js to also diff 'git log --oneline -10' (and optionally 'git diff HEAD~5 --name-only --diff-filter=A' for newly added files) on C:/Users/engelp/Projects/engelsplace against the snapshot. When new commits appear since the last fire — especially ones touching src/pages/*.astro or src/gated-routes.json — append a bullet to chuck's Material Change Log naming the commit + new pages. That way the next Chuck/Tess boot sees Antigravity-shipped routes before production curl reveals them as 200 OK leaks. Optional bonus: cross-check newly-added pages against src/gated-routes.json gatedRoutes and flag any that aren't listed as 'GATING GAP' so Tess gets a heads-up at boot. | scheduled-task | P2 | tess / chuck | 2026-05-23 | 2026-05-23 | 0 days | On-Time |
| #103 | Dependency sweep hides actionable findings in automation memory only Fix: Update the active dependency-sweep automation so actionable findings create or update Problem Ledger records and write a visible IT/status dependency report. For the current finding, refresh the bot lockfile transitives for ws and qs, then run syntax/startup smoke checks before any PM2 restart. | scheduled-task | P1 | phil / chuck | 2026-05-23 | 2026-05-23 | 0 days | On-Time |
| #101 | Daily drift email automation points to missing script and did not send Fix: Restore or recreate IT/scripts/codex-nightly-drift-email.js, or update/retire the active Codex automation to the correct current drift reporter. Verify by producing IT/status/codex-nightly-drift-latest.md/json, dated scheduled-task log, and a delivered Resend email. | scheduled-task | P1 | phil / chuck | 2026-05-23 | 2026-05-23 | 0 days | On-Time |
| #104 | Mandatory boot files MISSION.md and CROSS-SURFACE-NOTES.md are absent Fix: Determine whether MISSION.md and CROSS-SURFACE-NOTES.md were intentionally migrated, accidentally trimmed, or renamed. Restore canonical live files or update every boot protocol and automation prompt to the new authoritative paths, then verify agent startup reads succeed. | architecture | P1 | chuck / chuck | 2026-05-23 | 2026-05-23 | 0 days | On-Time |
| #34 | Phase A-D fallout: integrity scripts updated for OpenClaw-rename, but ORG_STATE.md + INFRASTRUCTURE-DESIGN.md now over 20KB cap, plus 7 preflight-derived-sync gaps Fix: On-track-check 2026-04-27 07:24 surfaced 3 categories of post-OpenClaw-refactor cleanup. Status: (1) ✅ FIXED 2026-04-27 — test-agent-boot.js OPENCLAW_FILES + preflight-derived-sync.js sourcePattern updated to reference new canonical filenames (IDENTITY.md, role.md, TOOLS.md, WORKING_MEMORY.md). (2) Open — ORG_STATE.md 22.4 KB / INFRASTRUCTURE-DESIGN.md 20.9 KB both over 20 KB cap. Fix: distill ORG_STATE.md older completions to _ARCHIVE per the 30d trim rule (canonical Sunday weekly distill is the standard cadence — Phil weekly audit Sunday 9:04 PM picks this up). INFRASTRUCTURE-DESIGN.md needs prose tightening or split — Tess Cross-lane ask #2 wants it published as a website page anyway, that work can split it. (3) 7 preflight-derived-sync gaps — INFRASTRUCTURE-DESIGN.md staler than 6 source-of-truth files because tonight's Phase A-D didn't update its diagrams. Fix: regenerate the 4 mermaid diagrams + file ownership table in same session as Tess publication ask (#2) — kill two birds. Also: separately, verify-marketplace-clean reported 5 P1 issues about installPath drift in installed_plugins.json — needs investigation but not blocking (Phil reinstalled v0.5.0/0.3.0/0.5.0/0.4.0 successfully per verify-plugin-install all-in-sync, so the install records are stylistically off but functionally working). | system | P2 | chuck / chuck | 2026-04-27 | 2026-05-23 | 26 days | Late |
| #33 | Phase D: agent-platform-watch scheduled task (Hermes / LangChain / AutoGen / CrewAI) Fix: SHIPPED 2026-04-27. New Cowork scheduled task agent-platform-watch via mcp__scheduled-tasks__create_scheduled_task — daily 8:00 AM CDT (fires next at 8:02 AM today with deterministic dispatch jitter). Watches 4 open-source agent platforms for releases worth lifting into our system (NOT migration). State file at IT/discord-gateway-bot/scheduled-task-logs/agent-platform-watch/state.json — seeded with Hermes v0.11.0 (verified 2026-04-27) so first fire doesn't flag everything as new. Posts to #it only when there's a lift candidate or platform-fetch error. Silent on clean runs. Closes when first fire confirms it works (today 8:02 AM). | system | P2 | chuck / unassigned | 2026-04-27 | 2026-05-23 | 26 days | Late |
| #32 | Phase C: parallel delegation in health-beacon (probes run via Promise.all) Fix: SHIPPED 2026-04-27. Refactored chuck-health-beacon's 4 probes (pm2, WireGuard, Resend, version-watch) from sequential to parallel via Promise.all. Sequential baseline ~8-13s, parallel ~2.6s measured. Probes are independent reads with no shared state — refactor is purely await-pattern change, no side-effect changes. Bot restarted with --update-env. Next 6:35 PM CDT fire validates. Banked lesson: full-testing handlers with external side-effects (Resend email send) sends real emails — use syntax-check + dry-run only. | system | P2 | chuck / unassigned | 2026-04-27 | 2026-05-23 | 26 days | Late |
| #31 | Phase B: skill candidate pipeline (skills.js + Stop hook marker) Fix: SHIPPED 2026-04-27. Pipeline: (1) memory/skills-pending/ directory + README documenting the workflow, (2) IT/scripts/skills.js with subcommands list/preview/promote/archive/purge/markers/review-marker/from-marker, (3) IT/scripts/stop-hook-skill-marker.sh as a lightweight Stop hook that writes a one-line JSONL marker to memory/skills-pending/_session-markers.jsonl when a session crosses ≥10 tool calls + ≥3 file edits (no LLM call, just identifies skill-worthy sessions for human review), (4) hook wired in Claude_Lives_Here/.claude/settings.json Stop event alongside existing agentkits-hook-wrapper, (5) first real candidate seeded: skill-candidate-fts5-conversation-index.md captures the Phase A pattern. Lifts Hermes Agent's autonomous-skill-creation pattern but with manual gate (Phil promotes via skills.js promote). v2 enhancement = LLM-auto-generation from markers, deferred to future session for Phil-supervised activation. Closes when first promote happens or 14d of clean operation. | system | P2 | chuck / unassigned | 2026-04-27 | 2026-05-23 | 26 days | Late |
| #30 | Phase A: FTS5 conversation index + recall.js CLI Fix: SHIPPED 2026-04-27. FTS5 index at IT/discord-gateway-bot/scheduled-task-logs/conversation-index/index.db (4 KB, 146 chunks across 35 files). Indexer at IT/scripts/index-conversations.js (handler-typed, idempotent via mtime tracking, --full / --stats flags). Query CLI at IT/scripts/recall.js (FTS5 BM25 ranking, snippet rendering with hit-highlight, --agent / --since / --until / --limit / --json / --paths-only filters). Bot.js wired with new handler chuck-conversation-index, daily 4:30 AM CDT, silent on clean runs. Resolves Hermes Agent searchable-history gap without migration. Closing this problem on first successful cron fire (next: 2026-04-28 04:30 CDT). | system | P2 | chuck / unassigned | 2026-04-27 | 2026-05-23 | 26 days | Late |
| #11 | Discord #engelsplace channel description still says 'Chuck — website' Fix: Phil edits the #engelsplace channel description in Discord (right-click channel -> Edit Channel -> Description). Change from 'Chuck - website, web design, Ghost CMS' to 'Tess - engelsplace.com, web architecture (Astro/Cloudflare Pages)'. UI-only, ~30 sec. Closes the visual lane-reorg loose end after the 2026-04-25 reorg. | system | P2 | chuck / phil | 2026-04-25 | 2026-05-23 | 27 days | Late |
| #8 | Phil: update 3 Cowork task prompts to append write-heartbeat call Fix: Phil edits each of the 3 Cowork tasks in the Claude Desktop UI. Append to each prompt body: 'At the end of this task, use Desktop Commander (start_process) to run: node C:/Users/engelp/Claude_Lives_Here/IT/scripts/write-heartbeat.js --task=<task-id> --status=green --summary="<one-line>" --silent' (with --silent because the task already posted its own Discord summary). Tasks: (1) system-health-monitor, (2) chuck-daily-house-in-order, (3) chuck-openclaw-on-track-check. Once all three are updated, Chuck adds them to the WATCHED_TASKS array and the watchdog covers the full ops surface. | system | P1 | chuck / phil | 2026-04-24 | 2026-05-23 | 29 days | Late |
| #51 | OpenClaw gateway boot blocked on expired Anthropic auth + missing credentials dir Fix: Phil runs 'openclaw models auth login --provider anthropic' in interactive PowerShell/CMD (TTY required, ~2 min). That recreates ~/.openclaw/credentials/ dir + writes fresh OAuth token. Then Chuck restarts gateway via pm2: 'pm2 start C:/Users/engelp/Claude_Lives_Here/IT/openclaw/gateway-launcher.js --name openclaw-gateway && pm2 save'. Verify with: netstat -ano \| grep 18789 (should LISTEN), openclaw doctor (no warnings), openclaw channels status --probe (reaches gateway). Bindings already fixed by Chuck 2026-04-28 night. | system | P1 | chuck / unassigned | 2026-04-29 | 2026-05-23 | 24 days | Late |
| #40 | chuck-daily-ops-report 429 rate-limit — investigate separately from messageCreate dedup fix Fix: P-00017/P-00025/P-00026 are duplicate auto-captured failures of chuck-daily-ops-report at the agent stage with 429 rate_limit_error. Today's messageCreate double-spawn fix (added to bot.js 2026-04-27 21:13 — message-ID dedup Map) addresses Discord interactive double-fires but NOT scheduled-task path. Cron path runScheduledTask already has 5s recentTaskFires dedup at line 1232. Real causes to investigate: (1) is ops-report's prompt exceeding 30K ITPM on a single Opus call (different from doubled Sonnet payload — Opus has its own limits), (2) is the cron-side dedup actually working — verify by tailing logs after next 6:38 PM fire, (3) is prompt growth (full SKILL.md + boot files + journals) hitting payload limits regardless of duplication. Diagnostic plan: tail bot.log + IT/discord-gateway-bot/scheduled-task-logs/chuck-daily-ops-report/ after next fire, check exact token counts in 429 response body, decide between prompt trim / model swap / payload split. Filed 2026-04-27 night to prevent the messageCreate fix being misread as 'all 429s solved.' | system | P1 | chuck / unassigned | 2026-04-28 | 2026-05-23 | 25 days | Late |
| #49 | [chuck-daily-ops-report] agent stage failed: 429 {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate lim Fix: LLM subprocess failed (often timeout or auth). Chuck checks Anthropic console for API credit + 401s. If auth: rotate key + restart bot with --update-env (see P-00002 + rotate-anthropic-key.ps1). If timeout: check CLI_TIMEOUT or rewrite as handler-typed (see P-00009). | system | P1 | chuck / chuck | 2026-04-28 | 2026-05-23 | 24 days | Late |
| #48 | [chuck-daily-ops-report] agent stage failed: 429 {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate lim Fix: LLM subprocess failed (often timeout or auth). Chuck checks Anthropic console for API credit + 401s. If auth: rotate key + restart bot with --update-env (see P-00002 + rotate-anthropic-key.ps1). If timeout: check CLI_TIMEOUT or rewrite as handler-typed (see P-00009). | system | P1 | chuck / chuck | 2026-04-28 | 2026-05-23 | 24 days | Late |
| #50 | [chuck-heartbeat-watchdog] 3/3 silence(s) UNFIXED by auto-remediation: chuck-daily-ops-report, chuck-health-beacon, chuck-drift-guard Fix: Chuck investigates manually. For each unfixed task: (1) read scheduled-task-logs/chuck-daily-ops-report/ for the expected fire time, (2) check bot-error.log for crashes, (3) if pm2 status engel-ops-bot shows anomaly, restart with --update-env. Auto-remediation log at IT/status/auto-remediation-log.json has full history of what was tried. | system | P1 | chuck / chuck | 2026-04-29 | 2026-05-23 | 24 days | Late |
| #39 | [chuck-heartbeat-watchdog] 3/3 silence(s) UNFIXED by auto-remediation: chuck-daily-ops-report, chuck-health-beacon, chuck-drift-guard Fix: Chuck investigates manually. For each unfixed task: (1) read scheduled-task-logs/chuck-daily-ops-report/ for the expected fire time, (2) check bot-error.log for crashes, (3) if pm2 status engel-ops-bot shows anomaly, restart with --update-env. Auto-remediation log at IT/status/auto-remediation-log.json has full history of what was tried. | system | P1 | chuck / chuck | 2026-04-28 | 2026-05-23 | 25 days | Late |
| #28 | [engelsplace-gmail-minutes-ingest] Handler completed with 1 error(s): Token refresh failed: {"error":"invalid_grant","error_description":"Token has been expired or revoked."} Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-gmail-minutes-ingest/2026-04-27.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-04-27 | 2026-05-23 | 26 days | Late |
| #27 | [engelsplace-gmail-minutes-ingest] Handler completed with 1 error(s): Token refresh failed: {"error":"invalid_grant","error_description":"Token has been expired or revoked."} Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-gmail-minutes-ingest/2026-04-27.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-04-27 | 2026-05-23 | 26 days | Late |
| #46 | [chuck-health-beacon] 1/4 probes failed: version-watch Fix: WebSearch the exact version change for known regressions; roll back if problematic. | system | P1 | chuck / chuck | 2026-04-28 | 2026-05-23 | 24 days | Late |
| #47 | [chuck-drift-guard] 1/5 sections drifted Fix: Chuck reconciles SYSTEM_STATE.md to match live state (edit the doc, not the live system unless live is wrong). 10-20 min depending on scope. | system | P2 | chuck / chuck | 2026-04-28 | 2026-05-23 | 24 days | Late |
| #37 | [chuck-drift-guard] 1/5 sections drifted Fix: Chuck reconciles SYSTEM_STATE.md to match live state (edit the doc, not the live system unless live is wrong). 10-20 min depending on scope. | system | P2 | chuck / chuck | 2026-04-27 | 2026-05-23 | 25 days | Late |
| #24 | [chuck-drift-guard] 1/5 sections drifted Fix: Chuck reconciles SYSTEM_STATE.md to match live state (edit the doc, not the live system unless live is wrong). 10-20 min depending on scope. | system | P2 | chuck / chuck | 2026-04-26 | 2026-05-23 | 26 days | Late |
| #23 | [chuck-drift-guard] 1/5 sections drifted Fix: Chuck reconciles SYSTEM_STATE.md to match live state (edit the doc, not the live system unless live is wrong). 10-20 min depending on scope. | system | P2 | chuck / chuck | 2026-04-26 | 2026-05-23 | 26 days | Late |
| #15 | [chuck-drift-guard] 1/5 sections drifted Fix: Chuck reconciles SYSTEM_STATE.md to match live state (edit the doc, not the live system unless live is wrong). 10-20 min depending on scope. | system | P2 | chuck / chuck | 2026-04-25 | 2026-05-23 | 27 days | Late |
| #13 | [chuck-health-beacon] 1/4 probes failed: version-watch Fix: WebSearch the exact version change for known regressions; roll back if problematic. | system | P1 | chuck / chuck | 2026-04-25 | 2026-05-23 | 27 days | Late |
| #90 | Cloudflare tunnel nas.engelsplace.com + plex.engelsplace.com HTTP 530 — cloudflared on 192.168.1.5 crashed Fix: Phil RDP/console into 192.168.1.5 (Plex box, host of cloudflared agent) → Services.msc → cloudflared → Restart. Verify with `curl -I https://nas.engelsplace.com/` returning 401/200/302. Same fate-share signature as P-00055 (46h outage 4/29→5/1, fixed by cloudflared restart on the same box). Targeted service restart on 192.168.1.5 is canonical fix — generic reboots will not suffice. | network | P1 | kara / phil | 2026-05-20 | 2026-05-23 | 2 days | On-Time |
| #89 | [engelsplace-fmx-ingest-afternoon] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-fmx-ingest-afternoon/2026-05-19.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-19 | 2026-05-21 | 1 day | On-Time |
| #93 | [engelsplace-gmail-minutes-ingest] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-gmail-minutes-ingest/2026-05-21.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-21 | 2026-05-21 | 0 days | On-Time |
| #94 | [engelsplace-fmx-ingest-morning] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-fmx-ingest-morning/2026-05-21.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-21 | 2026-05-21 | 0 days | On-Time |
| #95 | [engelsplace-fmx-pm-ingest-morning] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-fmx-pm-ingest-morning/2026-05-21.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-21 | 2026-05-21 | 0 days | On-Time |
| #97 | [engelsplace-youtube-ingest] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-youtube-ingest/2026-05-21.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-21 | 2026-05-21 | 0 days | On-Time |
| #98 | [engelsplace-fmx-ingest-morning] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-fmx-ingest-morning/2026-05-21.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-21 | 2026-05-21 | 0 days | On-Time |
| #99 | [engelsplace-fmx-pm-ingest-morning] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-fmx-pm-ingest-morning/2026-05-21.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-21 | 2026-05-21 | 0 days | On-Time |
| #87 | [engelsplace-fmx-ingest-morning] Handler completed with 1 error(s): rebase: error: cannot pull with rebase: You have unstaged changes. Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-fmx-ingest-morning/2026-05-18.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action. | system | P2 | auto / chuck | 2026-05-18 | 2026-05-21 | 3 days | On-Time |
| #26 | [chuck-daily-ops-report] agent stage failed: 429 {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate lim Fix: LLM subprocess failed (often timeout or auth). Chuck checks Anthropic console for API credit + 401s. If auth: rotate key + restart bot with --update-env (see P-00002 + rotate-anthropic-key.ps1). If timeout: check CLI_TIMEOUT or rewrite as handler-typed (see P-00009). | system | P1 | chuck / chuck | 2026-04-26 | 2026-04-28 | 1 day | On-Time |
| #38 | [chuck-daily-ops-report] agent stage failed: 429 {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate lim Fix: LLM subprocess failed (often timeout or auth). Chuck checks Anthropic console for API credit + 401s. If auth: rotate key + restart bot with --update-env (see P-00002 + rotate-anthropic-key.ps1). If timeout: check CLI_TIMEOUT or rewrite as handler-typed (see P-00009). | system | P1 | chuck / chuck | 2026-04-27 | 2026-04-28 | 0 days | On-Time |
| #25 | [chuck-daily-ops-report] agent stage failed: 429 {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate lim Fix: LLM subprocess failed (often timeout or auth). Chuck checks Anthropic console for API credit + 401s. If auth: rotate key + restart bot with --update-env (see P-00002 + rotate-anthropic-key.ps1). If timeout: check CLI_TIMEOUT or rewrite as handler-typed (see P-00009). | system | P1 | chuck / chuck | 2026-04-26 | 2026-04-28 | 1 day | On-Time |
| #17 | [chuck-daily-ops-report] agent stage failed: 429 {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate lim Fix: LLM subprocess failed (often timeout or auth). Chuck checks Anthropic console for API credit + 401s. If auth: rotate key + restart bot with --update-env (see P-00002 + rotate-anthropic-key.ps1). If timeout: check CLI_TIMEOUT or rewrite as handler-typed (see P-00009). | system | P1 | chuck / chuck | 2026-04-25 | 2026-04-28 | 2 days | On-Time |
| #42 | Gmail OAuth refresh token revoked — engelsplace-gmail-minutes-ingest failing Fix: Phil re-authorizes Gmail OAuth for [email protected]. Run original consent flow against client_id in IT/credentials/gmail-oauth-client.json, scope https://www.googleapis.com/auth/gmail.readonly, capture new refresh_token + access_token, overwrite IT/credentials/gmail-oauth-tokens.json. No bot restart needed — getGmailClient reads from disk per invocation. Verify via the live-verify oneliner in credentials-ledger. Surfaced by Tess 2026-04-27 17:42 CDT after seeing FATAL invalid_grant in scheduled-task-logs/engelsplace-gmail-minutes-ingest/2026-04-27.log. Token revoked at or before 2026-04-27T10:00:00Z. Affects: live engelsplace.com is missing 2026-04-27 weekly meeting minutes (and any future minutes until OAuth re-auth). | website | P1 | auto / unassigned | 2026-04-28 | 2026-04-28 | 0 days | On-Time |
| #36 | Phase B v2 — auto-LLM-generation of skill candidates from session markers (Hermes method) Fix: SEQUENCING (Phil 2026-04-27 night): blocked until Alex restoration is verified end-to-end (alex.zip installed in Cowork, /alex appears in slash menu, first interactive session boots cleanly, daily-financial-report task re-authored under Alex's voice). Once Alex is verified working, proceed: (1) RESEARCH PHASE — read Hermes Agent's actual implementation of autonomous skill creation. github.com/NousResearch/hermes-agent is MIT, source-readable. Specifically look at: how they detect 'task-complete-worthy' sessions, what their LLM prompt template looks like for drafting a skill .md, how they handle skill-name dedup against existing skills, how they decide a draft is 'good enough' vs 'noise/discard.' Also read recent Hermes releases (current v0.11.0) for any skill-creation refinements since launch. (2) IMPLEMENTATION PHASE — build on top of existing Phase B v1 infrastructure (memory/skills-pending/ + skills.js + stop-hook-skill-marker.sh). Add IT/scripts/auto-draft-skill-candidate.js that reads the latest unreviewed marker in _session-markers.jsonl, fetches the session transcript, calls Anthropic API (Opus per Phil's standing order — never Sonnet/Haiku in scheduled work) with a skill-drafting prompt, writes the draft to memory/skills-pending/skill-candidate-<slug>.md with status=pending, dedup-checks against existing skill names + descriptions in ~/.claude/skills/. Conservative threshold: only fires when marker.tool_calls >= 15 + file_edits >= 5 (higher than v1's marker threshold). Capped at 1 candidate per day to prevent runaway cost. Wire as a bot.js handler-typed task firing daily 5 AM CDT (after the conversation indexer at 4:30 AM, before the daily house-in-order at 6 AM-ish). (3) MANUAL GATE PRESERVED — auto-drafts still land in pending state. Phil promotes/archives via skills.js. v2 is about removing the 'Chuck handwrites the candidate' step, NOT about auto-promoting to live skills. Estimated 3-5 hours focused work after Alex is verified. | system | P2 | chuck / unassigned | 2026-04-27 | 2026-04-27 | 0 days | On-Time |
| #35 | Restore Alex (CFO) from 2026-04-11 archive — full agent files + plugin Fix: Phil's directive 2026-04-27 night: bring Alex back for Finance, Chuck concentrates on the system. Pattern matches Peter restoration 2026-04-25: (1) copy agents/alex/soul.md from _ARCHIVE/agents-retired-2026-04-11/alex/ UNCHANGED per Phil's standing order 'never fuck with soul.md', (2) Chuck builds agents/alex/{IDENTITY.md (canonical persona card), role.md, agents.md, TOOLS.md, WORKING_MEMORY.md} matching post-2026-04-26-OpenClaw structure, (3) build IT/plugins/alex/ plugin matching peter/john pattern (plugin.json + skills/alex/SKILL.md), (4) daily-financial-report Cowork task prompt re-routed to Alex agent, (5) auto-sync ripple updates to CLAUDE.md / USER.md / glossary / decisions-log / SYSTEM_STATE / AGENT_BOARD / INFRASTRUCTURE-DESIGN / OPENCLAW-BIBLE per memory/AGENTS.md auto-sync rule. Estimated 2-3 hours. Requires Phil interactive for at least: confirming any scope decisions for Alex's identity.md/role.md beyond what's clearly Finance-domain. Tonight (2026-04-27) Chuck did the immediate Chuck-scope changes only: role.md scope updated, agents.md Finance refusal added, P-00012 reassigned to Alex's lane. Full restoration is this problem entry's work. | system | P2 | chuck / unassigned | 2026-04-27 | 2026-04-27 | 0 days | On-Time |
| #22 | [chuck-health-beacon] 1/4 probes failed: version-watch Fix: WebSearch the exact version change for known regressions; roll back if problematic. | system | P1 | chuck / chuck | 2026-04-26 | 2026-04-26 | 0 days | On-Time |
| #14 | [chuck-health-beacon] 1/4 probes failed: version-watch Fix: WebSearch the exact version change for known regressions; roll back if problematic. | system | P1 | chuck / chuck | 2026-04-25 | 2026-04-26 | 0 days | On-Time |
| #16 | [chuck-drift-guard] 1/5 sections drifted Fix: Chuck reconciles SYSTEM_STATE.md to match live state (edit the doc, not the live system unless live is wrong). 10-20 min depending on scope. | system | P2 | chuck / chuck | 2026-04-25 | 2026-04-26 | 0 days | On-Time |
| #18 | [chuck-daily-ops-report] agent stage failed: 429 {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization's rate lim Fix: LLM subprocess failed (often timeout or auth). Chuck checks Anthropic console for API credit + 401s. If auth: rotate key + restart bot with --update-env (see P-00002 + rotate-anthropic-key.ps1). If timeout: check CLI_TIMEOUT or rewrite as handler-typed (see P-00009). | system | P1 | chuck / chuck | 2026-04-25 | 2026-04-26 | 0 days | On-Time |
| #19 | Bot needs guildMemberAdd handler + grant-role admin commands Fix: Add guildMemberAdd event handler to IT/discord-gateway-bot/bot.js that posts to #it-ops when any new member joins (alert: `🆕 New member joined as @everyone-only: <username>`). Also add `!grant-family @user` and `!grant-trusted @user` admin commands gated to Phil's user ID for assigning roles without leaving Discord. Estimated 30 min. Closes the gap discovered 2026-04-25 22:33 when guest tylerbailey0517 was invited and Phil expected channels to be locked but had no per-join visibility. | system | P2 | chuck / unassigned | 2026-04-26 | 2026-04-26 | 0 days | On-Time |
| #2 | Anthropic API key rotation Fix: Chuck writes a single PowerShell script (IT/scripts/rotate-anthropic-key.ps1) that: (1) reads current key from .env and shows last-4 chars, (2) prompts Phil to paste the new key from console.anthropic.com, (3) updates .env in place, (4) runs pm2 restart engel-ops-bot --update-env, (5) test-pings chuck-chuck.cmd -p to verify new key works, (6) deletes itself after success. Phil action: 5 minutes — open the console, generate new key, paste into the prompt. Chuck can have the script ready in 15 min on Phil's go. | system | P1 | chuck / phil | 2026-04-24 | 2026-04-25 | 1 day | On-Time |
| #10 | [chuck-drift-guard] 1/5 sections drifted Fix: Chuck reconciles SYSTEM_STATE.md to match live state (edit the doc, not the live system unless live is wrong). 10-20 min depending on scope. | system | P2 | chuck / chuck | 2026-04-24 | 2026-04-25 | 0 days | On-Time |
| #3 | NAS Plex box stale credentials (192.168.1.5) Fix: Script already exists: IT/scripts/fix-plex-box-nas-creds.ps1. Phil AnyDesks into 192.168.1.5 (Plex box), opens PowerShell, runs the script. It swaps stale 'engelp' creds for 'engel-agent' in Windows Credential Manager for both 192.168.1.80 and \\philsserver. Watches for SUCCESS, waits 5 min, confirms QuLog quieted on the NAS. 60 seconds of Phil's time. Chuck cannot do this remotely — Windows Credential Manager is per-user-session scoped on the Plex box's console. | system | P2 | chuck / phil | 2026-04-24 | 2026-04-25 | 0 days | On-Time |
| #6 | chuck-daily-ops-report CLI_TIMEOUT — Layer 1/2/both decision pending Fix: Chuck's vote: SHIP BOTH. Layer 1 (10 min): raise CLI_TIMEOUT 300s → 450s in bot.js, add double-fire guard (refuse second spawn within 30s of prior start). Stops the immediate bleeding. Layer 2 (1-2 hrs): rewrite chuck-daily-ops-report as a handler-typed task (like chuck-drift-guard and chuck-health-beacon — no LLM subprocess, deterministic, fast, can't timeout). Layer 2 kills the failure class permanently. Chuck can ship Layer 1 tonight on go; Layer 2 slots for one focused session this week. Phil decides: both, Layer 1 only, Layer 2 only, or defer. | system | P2 | chuck / phil | 2026-04-24 | 2026-04-24 | 0 days | On-Time |
| #4 | chuck-local-task-trial leftover — delete or keep Fix: Chuck's vote: DELETE. The trial was for the 2026-04-19 Routines-vs-Claude-Code-scheduled-tasks evaluation. Evaluation is done (scheduled-tasks won). The trial task has no operational purpose and just adds to the scheduled-task roster. Proposed action: Chuck archives SKILL.md to _ARCHIVE/scheduled-tasks-retired/2026-04-24-chuck-local-task-trial/, then Phil disables + deletes the task via Claude Code UI (1 click). Backup preserved; recovery is copy-back if ever needed. Phil says go or redirect. | system | P2 | chuck / phil | 2026-04-24 | 2026-04-24 | 0 days | On-Time |
| #5 | Scheduled tasks default-boot on Sonnet (Claude Desktop 1.3883 regression suspected) Fix: Primary fix already applied: added 'model: claude-opus-4-7' to frontmatter of both chuck-openclaw-on-track-check and chuck-local-task-trial SKILL.md files. Verifier: 8 AM CDT 2026-04-24 natural fire. If the 8 AM archive shows Opus was used → close this ticket. If still Sonnet → Chuck files GitHub issue against Claude Desktop 1.3883 referencing the regression, rolls Claude Desktop back to 1.3561 (last known-good per health-beacon logs from 2026-04-21). Chuck owns the close or the escalation depending on the 8 AM result. | system | P1 | chuck / chuck | 2026-04-24 | 2026-04-24 | 0 days | On-Time |
| #7 | [smoke-test] smoke test: red path with ledger integration Fix: This is a smoke test. Close this problem after verifying the flow works end-to-end. Run: `node IT/scripts/problem.js close <id> --fix="smoke test" --resolver=chuck` | system | P2 | auto / chuck | 2026-04-24 | 2026-04-24 | 0 days | On-Time |