Operations Dashboard

System Problems & Active Projects
Total problems tracked: 218
Active: 64 · Resolved: 153
On-Time Delivery
76%
116 of 153 resolved with due dates
System Issues: Active
64
P0–P2 break-fix in progress
Total Resolved
153
completed issues archived
Deferred Items
1
backlogged or scheduled later
System Issues: Active
P0 OVERDUE
#248

Discord bot MJ/SB run on METERED Sonnet API (unauthorized spend); bot stopped + pinned

Category: bot-health
Reporter: kara
Assigned: unassigned
Created: 2026-06-21 Time Active: 1 day Due: 2026-06-21 (1 day overdue)
Details
Proposed Fix: Bot STOPPED via pm2 to halt bleed (done). Then: (1) route MJ/SB off askAgent (anthropic.messages.create) onto the Max-subscription path (askAgentReal/claude -p) so kid bots are $0; (2) wire a REAL daily cost check from the Anthropic Admin cost API (actual dollars) + hard alarm — needs Phil to mint a read-only Admin API key (free to call); (3) fix the bot MODULE_NOT_FOUND path bug crash-looping the bot and breaking kid posts; (4) audit other metered callers (gmail/fmx/blood-panel/youtube pullers, skill-drafter, backtest-reconcile) -> subscription or explicit Phil okay; re-enable bot only after 1+2+Phil go.
Root Cause (5 Whys) 5 Whys: API spent -> bot calls metered API; MJ/SB use askAgent=messages.create not askAgentReal=subscription; kid bots deliberately left on Sonnet API (bot.js:1463), never migrated; undetected because burn-history.json is blind to separate-client jobs + no authoritative cost check; REPEAT of P-00238 because that fix moved only SOME jobs and left the cost-blindness. ROOT = no enforced subscription-only default + no to-the-penny cost meter.
P1 OVERDUE
#222

Outcome missing: dreaming-nightly produced no result (verifier could not self-heal)

Category: scheduled-task
Reporter: auto
Assigned: unassigned
Created: 2026-06-18 Time Active: 4 days Due: 2026-06-21 (2 days overdue)
Details
Proposed Fix: Investigate why dreaming-nightly ran without producing its artifact; wire in-process re-fire (increment 2) or fix the producer.
Root Cause (5 Whys) Outcome-verifier found the task's expected artifact is missing/stale: dreaming-nightly has not run in 30.4h (freshest log 2026-06-17-john.log). Auto-heal not available for this task in increment 1.
P1 OVERDUE
#218

Plex box (PHILSPLEXI9) NAS backup is silently FAILING — HDP PC Agent can't access inventory (same broken engel-agent cred as P-00204)

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-16 Time Active: 6 days Due: 2026-06-19 (4 days overdue)
Details
Proposed Fix: On 192.168.1.5 (philsplexi9): (1) Open HBS 3 / HDP PC Agent and re-authenticate the NAS pairing with engel-agent + EngelBot2026! (the agent's OWN cred store, separate from Windows cmdkey). (2) Run IT/scripts/fix-plex-box-nas-creds.ps1 to swap lingering cmdkey/mapped-drive entries engelp->engel-agent. (3) Re-trigger job PHILSPLEXI9_engel-agent_1, confirm Success. Verify: job Success in HBS AND engel-agent/engelp failed-login Warnings stop in QuLog (also closes P-00204 symptom). Needs hands/remote-app access ON the Plex box — not fixable from philsgamingmachine over the NAS API.
Fired live during nas-watch 2026-06-16. NAS-Alerts email + QuLog sev3. Shares root cause with P-00204; this is the SEVERE consequence — the Plex/home machine nightly NAS backup (daily 03:30, 30 versions) is NOT running. Repeat manifestation of the Plex-box credential fault -> ICAR owed (silent backup failure, no success-verification; echoes P-00194). Root Cause (5 Whys) HDP PC Agent on 192.168.1.5 reports 'inventory PHILSPLEXI9_1 could not be accessed' (QuLog 06-16 03:30 sev3, emailed 08:30) because it cannot authenticate to the NAS — engel-agent logins from 192.168.1.5 fail (P-00204). NAS-side engel-agent is VALID (authed live this fire w/ EngelBot2026!), so the bad credential lives on the Plex box; the HDP/HBS agent's own cred store was never updated to the 2026-04-23 password (cmdkey swap doesnt touch it). Undetected for days because failed-login spam was treated as cosmetic and there is NO backup-success verification — a non-running backup looked identical to a healthy one.
P1 OVERDUE
#209

OpenBrain is the only memory system with NO enforced write — Chuck's captures are model-dependent, so OpenBrain is sparsely fed (Phil: 'Chuck doesn't write to open brain / forgets')

Category: architecture
Reporter: kara
Assigned: unassigned
Created: 2026-06-16 Time Active: 7 days Due: 2026-06-19 (4 days overdue)
Details
Proposed Fix: Add an enforced Stop-hook bridge that forwards the already-enforced AgentKits CPS session summary into OpenBrain via capture_thought (reuse existing summary, deterministic, model-independent) — makes OpenBrain writes automatic every session for ALL agents. Plus doc fix: Chuck SKILL.md session-end step 6 memory_save->capture_thought + make capture mandatory not 'after meaningful exchanges'. Activation (settings.json Stop wire-in + plugin rebuild) = system change, Phil green-light.
Root Cause (5 Whys) 5 Whys (live evidence 2026-06-15): (1) Phil reports Chuck 'doesn't write to OpenBrain / forgets'; (2) OpenBrain capture pipeline is HEALTHY (getcapturejobstats: 263 done, 0 failed/pending) and Chuck DID capture today+yesterday — so it's not dead, just sparse; (3) sparse because the ONLY thing that writes to OpenBrain is the model choosing to call capturethought — there is NO hook/automated path (grep of all Stop/PostToolUse hooks: agentkits summarize->memory.db, dream->file memory, working-memory-discipline, mid-session-nudge only PRINTS a reminder; none call capturethought/localhost:8000); (4) the other two memory systems (AgentKits CPS memory.db, dream file-memory) ARE enforced by Stop hooks, so they get fed every session while OpenBrain doesn't -> asymmetry Phil perceives as 'Chuck forgets'; (5) compounding: Chuck SKILL.md session-end step 6 points durable writes at memorysave (AgentKits), NOT capture_thought (OpenBrain) -> split-brain. ROOT: OpenBrain writes are 100% model-dependent with no enforcement, unlike the other two stores. (Note: within-session forgetting is a separate context-attention axis, not solvable by cross-session plumbing.)
P1 OVERDUE
#194

Scheduled-task sprawl across 4 surfaces: duplicate ops-reports + triple doc-audit + cross-surface dupes + NO outcome-verification layer (the gap that let Journey rot 10 days)

Category: scheduled-task
Reporter: chuck
Assigned: chuck
Created: 2026-06-13 Time Active: 9 days Due: 2026-06-16 (6 days overdue)
Details
Proposed Fix: (A) Build a single canonical scheduled-task registry/index across all 4 surfaces. (B) Consolidate: pick ONE daily ops-report, keep bot-cron daily doc-audit (detect) + ONE weekly that fixes+emails, retire the duplicate Cowork/code-side dead tasks. (C) Build the missing OUTCOME-VERIFICATION layer — a daily task that checks each producing-task actually landed its artifact (entry/email/file/commit exists and is fresh), the receive-side gate ICAR-2026-06-13-02 demands, generalized. (D) Clarify john paused + agent-platform-watch revive/kill. Code-side dead folders retired in-pass.
Root Cause (5 Whys) ~40 scheduled tasks live across 4 surfaces (Code-side .claude/scheduled-tasks, Cowork Documents/Claude/Scheduled, cloud CronList=empty, bot-cron scheduled-tasks.json) with no single registry, so overlaps accreted unseen: (1) chuck-daily-ops-report runs TWICE nightly (code-side Claude narrative 6:47pm #operations + bot-cron deterministic handler 6:50pm #it-ops); (2) doc-audit.js runs THREE times (bot-cron daily 4:20am + code-side chuck-scheduled-task-audit Sun + Cowork chuck-weekly-doc-audit Sun); (3) chuck-openclaw-on-track-check exists on BOTH Code(disabled/dead) and Cowork(active); (4) john-weekly-compliance-update PAUSED with no date; (5) agent-platform-watch dead since 5/03. DEEPER root: all monitoring measures proxies (crash / doc-drift / missed-heartbeat) not OUTCOMES — no task verifies a green run produced a correct result, so silent-success-but-wrong (Journey) is invisible. Same family as ICAR-2026-06-13-01 (opt-in coverage default-off).
P1 OVERDUE
#186

tess-website-watchdog email alert sends from UNVERIFIED engeloperations.com — 403-dead 17 days, swallowed

Category: bot-health
Reporter: john
Assigned: tess
Created: 2026-06-13 Time Active: 9 days Due: 2026-06-16 (7 days overdue)
Details
Proposed Fix: IN-PASS: change bot.js:297 from [email protected] -> [email protected] (verified domain all other senders use). Notify Tess. Poka-yoke (Chuck/Tess): single ALERT_FROM constant + sendResendEmail helper that THROWS on !ok; boot assertion that the sender domain is verified; canary probes the same sender real alerts use.
Root Cause (5 Whys) 5 Whys: (1) the website-flag email to Phil 403s every fire. (2) bot.js:297 sends from [email protected]. (3) only ONE Resend domain is verified — engelsplace.com; engeloperations.com is not registered. (4) from-address is a hardcoded literal duplicated at 3 sites with no shared constant; the org migration to engelsplace.com missed this inline handler. (5) the 403 is swallowed (bot.js:304-306 console.error, no throw; returns skipDiscord:false) so it never reaches the failure path; health-beacon canary probes the VERIFIED sender so it stays green. 13 dated 403s 05-26..06-11. ROOT: per-handler from-literals, fail-open error path, canary tests wrong sender.
P1 OVERDUE
#185

Plaintext .env.pre-rotation*.bak leak STILL-LIVE FMX password + YouTube API key on disk

Category: credential
Reporter: john
Assigned: chuck
Created: 2026-06-13 Time Active: 9 days Due: 2026-06-16 (7 days overdue)
Details
Proposed Fix: IN-PASS: move 3 .env.pre-rotation*.bak to _DELETE_QUEUE. ESCALATE PHIL: rotate FMX password + reissue FMX creds + rotate YOUTUBE_API_KEY. Poka-yoke (Chuck): rotation writes pre-copy to os.tmpdir() + unlink in finally; boot guard refuses start if any .env*.bak present.
Plaintext .env.pre-rotation.bak leak STILL-LIVE FMX password + YouTube API key on disk Root Cause (5 Whys) 5 Whys: (1) 3 plaintext cred backups in live bot dir. (2) auto-made during rotations. (3) each snapshots the ENTIRE .env so un-rotated keys (FMXAPIPASSWORD, YOUTUBEAPIKEY, FMXAPIUSERNAME) are mirrored verbatim. (4) nothing scrubs them or checks rotate-completeness. (5) nothing scans for .env.bak. Verified: FMX pw + YouTube key md5-MATCH live .env in all 3; gitignored but plaintext on FS since Apr 19-25. ROOT: whole-file snapshot, never scrubbed, no completeness check.
P1 OVERDUE
#182

NAS philsserver abnormal disk SMART status on bay 1 (3.5" SATA HDD 1) — fired x2 on 6/11

Category: network
Reporter: auto
Assigned: unassigned
Created: 2026-06-12 Time Active: 10 days Due: 2026-06-15 (7 days overdue)
Details
Proposed Fix: Abnormal SMART warning on HDD bay 1 fired twice 2026-06-11 (14:53 + 21:20 UTC). Disk still online, NAS healthy, snapshots current, but abnormal SMART = potential drive failure / data-loss risk. Kara: pull live SMART attributes (reallocated/pending/uncorrectable sectors) + QuLog review, decide monitor-vs-replace. Owner: kara. Flagged to Kara in 6/12 05:59 reply-loop; she journaled as carry-over; ticket filed in-pass per QMS (was un-ledgered).
Root Cause (5 Whys) PENDING — run the 5 Whys at triage (auto-captured; root cause not yet drilled).
P1 OVERDUE
#179

Lane-refusal fix never reached runtime: all 5 installed agent plugins were stale (pre-6/7), Tess refused Phil again

Category: architecture
Reporter: kara
Assigned: kara
Created: 2026-06-12 Time Active: 10 days Due: 2026-06-15 (8 days overdue)
Details
Proposed Fix: DONE in-pass: synced all 5 installed SKILL.md (marketplaces/local-desktop-app-uploads) from IT/plugins source — every installed copy now has the 6/7 Phil carve-out; also reworded the 'Refuses X work' priming line in tess/kara/alex descriptions to 'Phil's direct ask is ALWAYS done end-to-end'. CORRECTIVE (ICAR-2026-06-12-01): new IT/scripts/plugin-install-sync-check.js diffs source vs installed and re-syncs; cross-lane ask to Chuck to wire it into daily house-in-order.
Root Cause (5 Whys) 5 Whys: Tess refused Phil's direct ask (1) because she loaded the refusal template with no Phil carve-out (2) because the desktop app reads the INSTALLED plugin copy, not source (3) because the 6/7 fix was applied only to IT/plugins source and installed copies date to Jun 5 (4) because nothing syncs or diffs installed-vs-source after a loader fix (5) because the plugin pipeline assumes Phil-UI reinstall propagates fixes but no check verifies it — fix-landed-but-never-deployed class, same as P-00174.
P1 OVERDUE
#54

Power Automate WorkSync Discrete Resend flow — 2 failures past 7 days (Microsoft alert)

Category: system
Reporter: chuck
Assigned: phil
Created: 2026-04-30 Time Active: 54 days Due: 2026-05-03 (51 days overdue)
Details
Proposed Fix: Open Power Automate portal → Flows → WorkSync Discrete Resend → Run history. Expand the 2 failed runs to identify which connector errored (trigger / Condition / Send HTTP). Most likely causes: (a) Outlook connector token-refresh expired (re-authenticate via Power Automate), (b) Condition logic edge case (recent extension to OR-logic on 2026-04-29 may have introduced a bug), (c) Microsoft Graph throttling on cross-tenant send, (d) Gmail-side rejection if message tripped spam/attachment-size filter on fairriteworksync. ~10 min Phil-action: log into Power Automate, screenshot the failed runs, share with Chuck for diagnosis. Flow is upstream of engelsplace-gmail-minutes-ingest cron — if broken, daily meeting-minutes do not reach the website. Full topic context at memory/topics/email-forwarding-engelp-fairrite-to-fairriteworksync.md.
P1 OVERDUE
#53

Outlook→Gmail auto-forwarder ([email protected][email protected]) silently broken

Category: website
Reporter: tess
Assigned: tess
Created: 2026-04-29 Time Active: 55 days Due: 2026-05-02 (52 days overdue)
Details
Proposed Fix: Phil verifies the actual mechanism: (1) Open Outlook ([email protected]), check Settings → Mail → Forwarding — likely empty or 'rule disabled by admin'. (2) If Phil has Exchange admin rights, check whether external forwarding is blocked at the org policy level (Microsoft 365 default = blocked). (3) If org policy blocks external forwarding, build a Power Automate flow in Phil's account: trigger on new email matching subject filter → action: save attachment to a specific Google Drive folder (using the Google Drive connector). Power Automate flows often work even when raw external forwarding is blocked because they're a managed process not a raw rule. (4) The Drive folder is then read by the existing engelsplace-drive-reader service account (same pattern as Phase E blood-panels). Eliminates Gmail OAuth entirely from the meeting-minutes pipeline AND eliminates the 7-day refresh-token problem AND eliminates the wrong-account-slip risk. Surfaced 2026-04-28 night when Tess Gmail-API forensics confirmed only 1 non-Phil-manual-forward message from @fair-rite.com in last 30 days (Tyler Bailey 2026-04-23). Pipeline has been silently broken since at least 2026-03-30 (the only prior Tyler/meeting-minutes data point was 2026-03-30.md from a since-uncommitted earlier puller fire). Without the auto-forwarder Phil has to manually Fw every meeting docx, which defeats the entire point of the puller.
P1 OVERDUE
#43

PC-to-NAS auto backup rollout — NetBak + Veeam Agent Free, 4 PCs

Category: network
Reporter: peter
Assigned: kara
Created: 2026-04-28 Time Active: 56 days Due: 2026-05-01 (53 days overdue)
Details
Proposed Fix: Phase 1 pilot on philsgamingmachine: install QNAP NetBak PC Agent (file-level continuous, nightly incrementals) + Veeam Agent for Microsoft Windows FREE (image-based, monthly full image). Both target a new /backups/philsgamingmachine/ share on the NAS with snapshot retention. Restore-test BOTH (one file via NetBak, one image-mount via Veeam) before declaring pilot complete. Phase 1.5: rollout to laptop, Plex server (192.168.1.5 i9-12900H mini PC), and Kiahna's computer. Open question for Kiahna: confirm she's on Phil's LAN, requires WireGuard extension, or needs cloud backup destination — different network. Closure criteria = all 4 PCs running both layers + ≥1 successful restore drill per PC documented in SOPs/Network/. Full rationale + tradeoff matrix in Network/pc-backup-strategy-2026-04.md.
P1 OVERDUE
#41

Self-improvement loop — detect agent failure patterns without Phil's complaint as the trigger

Category: system
Reporter: chuck
Assigned: chuck
Created: 2026-04-28 Time Active: 56 days Due: 2026-05-01 (53 days overdue)
Details
Proposed Fix: STRUCTURAL FIX for the willingness pattern Phil named 2026-04-27 night. Goal: corrective signal comes from INSIDE the loop, not from Phil escalating. Multi-mechanism design — each piece detects a different class of agent failure: (1) TRIAGE-AT-BOOT — every Chuck/Tess/Peter/John/Alex boot OPENS the top 5 oldest auto-captured Problem Ledger entries (reporter=auto, status=new) and either triages them in-session or files an explicit decision. Today's pattern of P-00017/25/26 sitting unread for 36+ hours stops. (2) PUSHBACK PATTERN DETECTOR — bot.js scans Phil's messages for repetition patterns (same complaint topic 3+ times across N days) and surfaces as a P1 structural-concern problem with auto-flag 'Phil has complained 3x about X — STRUCTURAL gap, not symptom.' Trigger words include 'doubling,' 'still happening,' 'I told you,' 'this is the same,' 'over and over.' (3) GREP-FIRST AUDIT — periodic check (weekly) samples recent Chuck responses to Phil's symptom reports + verifies Chuck did grep / log-read / verify-script BEFORE responding. Failure = file as a behavioral pattern problem. (4) SOUL/AGENTS EFFECTIVENESS REVIEW — monthly, sample 10 recent Chuck sessions + score whether soul.md rules actually fired (RULE 4 grep-first did or didn't happen). Updates banked + surfaced for Phil review. (5) AUTO-DRAFT RULE PROPOSALS — when behavioral pattern problems accumulate, generate proposed soul.md / agents.md additions (NOT auto-applied — Phil reviews + approves like Phase B v2 skill candidates). Implementation order: (1) and (2) ship first (low risk, high value, ~3-4 hours each). (3) and (4) need more design — Phase E. (5) builds on Phase B v2's auto-drafter pattern but for rules instead of skills. Estimated ~10-15 hours total across multiple sessions. NOT shipping tonight — banking the plan + closing only when all 5 mechanisms live.
P1 OVERDUE
#20

Claude Desktop 1.4758 spawns MCP servers twice (directMcpHost + LocalMcpServerManager)

Category: system
Reporter: auto
Assigned: chuck
Created: 2026-04-26 Time Active: 57 days Due: 2026-04-29 (55 days overdue)
Details
Proposed Fix: GitHub issue #53134 confirms regression in 1.4758. Phil has 3 MCPs (discord-mcp, desktop-commander, resend) + scheduled-tasks — double-spawn risks port collisions, doubled token usage, file-watcher contention. Workaround: restart Cowork after each cold start (single-spawn restored). Monitor: Task Manager for duplicate processes parented to Claude Desktop. If confirmed duplicate, run ClaudeZombieReaper manually. Track Anthropic fix in next Cowork update; pin 1.4758.0.0 in SYSTEM_STATE Runtime Versions and re-check on next bump.
P1 OVERDUE
#12

Robinhood puller silent failure (device verification expired)

Category: system
Reporter: alex
Assigned: phil
Created: 2026-04-25 Time Active: 58 days Due: 2026-04-28 (56 days overdue)
Details
Proposed Fix: Phil runs robinhood-sync.py interactively from PowerShell on philsgamingmachine to complete the SMS/email device verification challenge. Diagnosed 2026-04-25: 04-23 + 04-24 cron fires saw login() block 25 min waiting for SMS challenge, return partial-auth, then load_portfolio_profile fails with 'can only be called when logged in'. Fix: cd C:/Users/engelp/Claude_Lives_Here/IT, python robinhood-sync.py, complete challenge when prompted (~2 min). After successful interactive login, the cached pickle at C:/Users/engelp/.tokens/robinhoodrobinhood_session.pickle re-establishes device trust for ~weeks. Mon-Fri 7:30 AM scheduled task auto-recovers on next fire.
P1 OVERDUE
#1

Tornado safety plan filing (IL HB2987)

Category: system
Reporter: john
Assigned: phil
Created: 2026-04-24 Time Active: 59 days Due: 2026-04-27 (57 days overdue)
Details
Proposed Fix: John stages the full plan document from existing Gmail drafts + the Crawford County EMA template. Phil takes 30 minutes in one sitting: measures the bathroom square footage, reviews John's draft, hits send to both the Crawford County EMA and Flat Rock FD.
Phil must measure bathroom sq ft, finalize plan, file with Crawford County EMA + Flat Rock FD. First surfaced 2026-04-05. At 22 days overdue, this is a legal compliance concern and a personal-safety concern. Gmail draft already staged from earlier sessions. Lane: John (Compliance Officer, Fair-Rite). Chuck does not track this.
P2 OVERDUE
#208

NetBak PC Agent Electron GUI auto-launches at login + leaks RAM (5.77GB, #1 consumer in 6/13 hard reset) — redundant to HDP

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-16 Time Active: 7 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Rename HKLM Run value 'QNAP NetBak PC Agent' -> '.disabled-by-kara' (reversible) + kill the 4 running NetBak PC Agent.exe Electron processes. Leave HDP + the QNAP_NetBak_PC_Agent service (TimeBackAgent.exe 204MB, independent of HDP but shares 'TimeBack' name -> untouched per Rule 0) alone; backups unaffected. ICAR: standardize on HDP, audit all enrolled PCs (Plex P-00204 = same legacy-NetBak class) for stray NetBak auto-launch.
Root Cause (5 Whys) 5 Whys (live evidence): (1) 6/13 16:30 Kernel-Power 41 hard reset because Windows virtual memory was exhausted; (2) VM exhausted because consumers stacked — NetBak PC Agent.exe 5.77GB + vmmemWSL 5.17GB + msedgewebview2 4.54GB (System Resource-Exhaustion 16:24:04, pid 41940 = 5774700544 bytes); (3) NetBak hit 5.77GB because its process is an Electron GUI (gpu-process/renderer, app.asar) that leaks over uptime; (4) the GUI runs because HKLM Run 'QNAP NetBak PC Agent' auto-launches it every login; (5) it auto-launches despite NO backup function — user-data-dir holds only Electron browser artifacts, no job/repo/schedule; actual backup runs via HDP (QNAPHDPAgent+QNAPHDPFD+TimeBack+diskutil.exe, daily 03:00 clean through 6/15). ROOT: redundant legacy NetBak GUI auto-starts at login and leaks RAM while doing nothing; HDP is the real engine. Also corrected stale record: diskutil.exe is an HDP component (QNAP\HDP\Bareos), not NetBak as P-00163 said.
P2 OVERDUE
#207

Recurring ~02:08 nightly mains brownout trips UPS to battery (NAS sees 'power loss')

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-15 Time Active: 7 days Due: 2026-06-22 (1 day overdue)
Details
Proposed Fix: (1) Identify the ~02:08 timer load — check WATER SOFTENER regeneration time FIRST (commonly defaults to ~2AM, classic culprit); then well pump, water-heater timer, pool/aerator pump, irrigation. (2) Reschedule/smooth it (e.g., move softener regen off the hour, or address inrush) to stop the sag. (3) Quantify sag depth via input-voltage logging (APC PowerChute if installable on a paired PC, or a plug-in voltage monitor / Kill-A-Watt) to judge benign motor-inrush vs. a service/wiring concern that warrants an electrician. (4) NAS-side mitigation ALREADY deployed: 10-min UPS shutdown grace (P-00203) makes the NAS immune regardless of these transients. NAS data-loss risk = none.
Root Cause (5 Whys) 5-Whys from live evidence: (1) NAS logs a UPS power-loss event nightly ~02:08 (6/11, 6/13, 6/15 — 3rd+ recurrence); (2) because the APC Back-UPS RS1000G transfers to battery at that instant; (3) because incoming MAINS voltage sags below the UPS transfer threshold = a real brownout — CONFIRMED by Phil: the house lights visibly dim at the same moment (lights run on house mains, NOT the UPS, so the dimming proves a mains-side sag, not a UPS fault); (4) mains sags at the SAME time nightly => a high-inrush load starts on a timer/schedule. ROOT (hypothesis, pending Phil identifying the device): a scheduled ~02:08 high-inrush load briefly sags the home supply past the UPS's AVR window, forcing a battery transfer. The UPS is functioning correctly; this is power-side, not a UPS defect.
P2 OVERDUE
#206

Stand up a dedicated agent ops mailbox for NAS/system alerts (graduate from Gmail tag+filter)

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-15 Time Active: 7 days Due: 2026-06-22 (1 day overdue)
Details
Proposed Fix: Stand up a dedicated ops mailbox the agents own. Options to weigh: (a) dedicated Gmail (Phil creates; agents get read/label access via a connector with write scope — same pattern just enabled for the main account); (b) [email protected] via Resend inbound (agent-native read through Resend MCP, $0, uses our verified domain). Then repoint NAS notification-rule recipients + the nas-watch Step 1.5 scan + the Gmail filter to the new address. NON-URGENT — the tag+filter interim is working and verified.
Root Cause (5 Whys) Interim solution (2026-06-15) routes NAS alerts to a NAS-Alerts label in Phil's personal Gmail via a content filter; the agents read that label. Works + keeps Phil's inbox clean, but it's still inside Phil's personal mailbox. Phil's stated preference (2026-06-15): a truly separate agent-owned mailbox so machine alerts never touch his personal account and the channel scales to all agents (NAS, website watchdog, finance, etc.).
P2 OVERDUE
#204

Plex box (192.168.1.5) repeatedly fails NAS login (engelp + engel-agent) — stale cached creds spamming QuLog Warnings

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-15 Time Active: 8 days Due: 2026-06-22 (1 day overdue)
Details
Proposed Fix: On 192.168.1.5: identify the stale cached NAS credential (Windows Credential Manager / cmdkey / mapped drive / a service) and update it to engel-agent's current password. Helper exists: IT/scripts/fix-plex-box-nas-creds.ps1 (swaps cmdkey entries engelp->engel-agent). Verify by: failed-login Warnings stop appearing in QuLog/inbox. Until fixed these are benign-but-noisy (no breach, just wrong-password retries) and will pollute the new email-alert channel.
Root Cause (5 Whys) 5-Whys from live evidence (NAS notification emails received 2026-06-14 ~23:16-23:22 CDT): (1) NAS QuLog logs repeated [Warning] 'Failed to log in' events; (2) source is 192.168.1.5 (Plex box) over HTTPS, for BOTH engelp AND engel-agent; (3) the Plex box is presenting NAS credentials that don't match current — engelp pwd is stale post-migration (documented INVALID in qnap-credentials.md) and an engel-agent cached pwd is also failing; (4) a Windows service / mapped drive / cmdkey entry on 192.168.1.5 still holds old NAS passwords and retries on a loop; ROOT: credential drift on the Plex box — cached NAS creds never updated after the 2026-04 migration + engel-agent password change (2026-04-23). RECURRENCE: same class of QuLog credential-spam noted historically in qnap-credentials.md notes.
P2 OVERDUE
#199

Gaming PC RAM exhaustion -> hard reset 6/13 16:30 (frozen mouse); chronic memory overcommit, WSL uncapped

Category: architecture
Reporter: chuck
Assigned: unassigned
Created: 2026-06-15 Time Active: 8 days Due: 2026-06-22 (1 day overdue)
Details
Proposed Fix: 1) Add C:\Users\engelp\.wslconfig: cap memory=20GB + autoMemoryReclaim=gradual + swap (stops VM balloon; reversible). 2) Flag NetBak backup-job RAM spike to Kara (NAS lane). 3) Chrome tab hygiene (64 procs/~10GB). 4) Optional: disable asus_framework autostart. Activate WSL cap via wsl --shutdown (bounces Docker/OpenBrain ~30s) or next reboot.
Root Cause (5 Whys) 5-Whys from Windows Event Log (live): mouse froze + Kernel-Power 41 hard reset 2026-06-13 16:30:15. Resource-Exhaustion-Detector ID 2004 at 16:24-16:25 named top virtual-mem consumers: NetBak PC Agent 5.77GB (backup job ballooning), vmmemWSL 5.17GB + vmmem 4.30GB (WSL2/Docker VMs), msedgewebview2 4.54GB, asus_framework 1.14GB -> ~20GB from 5 procs stacked on Chrome(64 tabs ~10GB)+node MCP swarm(44)+model -> physical RAM exhausted -> pagefile thrash -> UI unresponsive. Why WSL took 9.5GB: NO C:\Users\engelp\.wslconfig -> WSL2 uncapped (default ~50% RAM) and does not reclaim. Why NetBak 5.77GB: backup agent leaks/spikes during active job (intermittent; idle now 0.1GB). ROOT: chronically overcommitted memory with zero guardrails.
P2 OVERDUE
#177

"[engelsplace-gmail-minutes-ingest] Handler completed with 1 error(s): Claude returned non-JSON: Unexpected non-whitespace character after JSON at position 1535 (line 39 column 1)

Category: scheduled-task
Reporter: auto
Assigned: chuck
Created: 2026-06-12 Time Active: 10 days Due: 2026-06-19 (4 days overdue)
Details
Proposed Fix: Handler returned errors[] without throwing. Chuck reads scheduled-task-logs/engelsplace-gmail-minutes-ingest/2026-06-12.log for full context. Common causes: per-record parse errors, partial API failures, transient network issues, missing dependencies (e.g., python module not on pm2 PATH). If recurring across runs, investigate the failing record/source. If single transient, no action.
[engelsplace-gmail-minutes-ingest] Handler completed with 1 error(s): Claude returned non-JSON: Unexpected non-whitespace character after JSON at position 1535 (line 39 column 1) First 300 c Root Cause (5 Whys) PENDING — run the 5 Whys at triage (auto-captured; root cause not yet drilled).
P2 OVERDUE
#151

Reactivate the locked $20K final-expense fund + set up a proper estate-liquidity structure

Category: phil-action
Reporter: alex
Assigned: alex
Created: 2026-06-07 Time Active: 16 days Due: 2026-06-14 (9 days overdue)
Details
Proposed Fix: (1) Reactivate: confirm whose legal name the account is in (if Kiahna is an adult, the bank likely needs HER to reactivate; if custodial/joint, Phil can). Call the bank or visit a branch; a small deposit/withdrawal usually clears dormant status. (2) Better structure for the SAME goal: add a Payable-on-Death (POD/TOD) beneficiary naming Kiahna on Phil's OWN savings — money stays his + in his control, transfers to her instantly outside probate on death; OR a small final-expense life insurance policy. Optionally pair with a basic will. Alex can lay out the options in plain English; a quick attorney check recommended for the will piece (Alex is not an estate attorney).
From 2026-06-06 session. Emotionally significant — this is Phil providing for his kid. Net-worth context now complete: total ~$460K (401k $323K + Robinhood $63K + bank cash ~$74K). This is the one genuinely actionable banking item. Not urgent but real. Root Cause (5 Whys) Phil earmarked ~$20,060 in an account titled to his daughter Kiahna as immediate cash for burial + attorney costs if he dies, but it sat stagnant and the bank flagged it DORMANT/locked — so the emergency money is frozen exactly when its whole purpose is instant access. The homemade approach (park cash in someone else's account) is fragile and may also be legally hers, exposed to her creditors, and outside Phil's control.
P2 OVERDUE
#150

Retire wedged Tailscale on gaming PC (wrong transport; replaced by Discord agent-bus)

Category: cleanup
Reporter: tess
Assigned: unassigned
Created: 2026-06-06 Time Active: 16 days Due: 2026-06-13 (10 days overdue)
Details
Proposed Fix: AFTER P-00149 (Discord bus) is confirmed working from Phil's work network, retire Tailscale: run IT/scripts/stop-tailscale.ps1 as Administrator (reversible — undo with Set-Service Tailscale -StartupType Automatic; Start-Service Tailscale) OR uninstall via Settings→Apps. DO NOT remove before the replacement is verified. VPN lane = Kara; needs admin (this session not elevated).
Root Cause (5 Whys) Tailscale was installed on philsgamingmachine as a guessed transport for laptop↔PC agent comms, but it is wedged (NoState, 2 procs, Running+Automatic) and blocked by Phil's work-network firewall, so it serves no working purpose yet auto-starts every boot. 5-Whys: it 'never works' → laptop is on a work network → corporate firewalls block VPN tools like Tailscale → it was the wrong tool for the job → once Discord replaces its intended role it is pure dead weight.
P2 OVERDUE
#149

Build laptop↔gaming-PC agent-to-agent comms (Discord message-bus, not Tailscale)

Category: architecture
Reporter: tess
Assigned: unassigned
Created: 2026-06-06 Time Active: 16 days Due: 2026-06-13 (10 days overdue)
Details
Proposed Fix: Per brief Operations/2026-06-06-laptop-pc-agent-comms-brief.md: use a Discord REQ/RSP message-bus (both ends run discord-mcp; plain HTTPS passes the work firewall that blocks Tailscale). Define a REQ<id>/RSP<id> convention + a small handler on each side. Chuck owns design + MUST coordinate the laptop/Hermes side. See cross-lane ask #42. NOT a Tess/engelsplace item — routed to Chuck (agent-infra lane).
Root Cause (5 Whys) Phil wants the laptop Code agent and gaming-PC Code agent to exchange info on demand, but it was never wired because the transport was mis-chosen: Tailscale (a network VPN) is the wrong layer for agent messaging AND is blocked by his corporate work-network firewall. 5-Whys: no cross-machine answer → never built → chosen plumbing (Tailscale) never worked → it's a VPN tunnel fighting a work firewall → the real need is a firewall-friendly shared message channel (Discord HTTPS), which doesn't exist yet.
P2 OVERDUE
#145

Compile Phil's tax inputs / 'the numbers' (2026 realized gains, filing status, income, lot dates)

Category: phil-action
Reporter: alex
Assigned: alex
Created: 2026-06-06 Time Active: 16 days Due: 2026-06-13 (10 days overdue)
Details
Proposed Fix: Alex gathers: attempt to pull lot/acquisition dates from Robinhood (some may need Phil's 1099-B / statements); Phil provides filing status + ballpark income + whether any 2026 sales booked a gain. Result feeds exact federal $ figures into Finance/taxes tracker and unlocks precise sizing on P-00144 (harvest) and P-00143 (trim). Illinois piece already known: flat 4.95%, no LT discount.
From 2026-06-06 session. This is the 'get some numbers' Phil asked for. Once these inputs exist, the harvest/trim tickets become exact go/no-go calls instead of estimates. Pairs with P-00144. Root Cause (5 Whys) Can't quantify the federal dollar impact of any harvest or trim without three inputs: (1) any 2026 sales already realized for a gain, (2) Phil's filing status + income band (sets the bracket), (3) acquisition/lot dates on Robinhood holdings (decides long-term vs short-term treatment) — the portfolio JSON has cost basis but no buy dates.
P2 OVERDUE
#144

Decide on the Bitcoin tax-loss harvest (~$3,920 deductible loss, keeps the coins)

Category: phil-action
Reporter: alex
Assigned: alex
Created: 2026-06-06 Time Active: 16 days Due: 2026-06-13 (10 days overdue)
Details
Proposed Fix: Harvest = sell 0.1 BTC + rebuy same minute: keeps full BTC exposure/upside AND banks the ~$3,920 loss. Only delivers cash value if it offsets a realized gain (e.g. a future trim) or the $3,000/yr ordinary-income offset + carryforward. Decide alongside any P-00143 de-risk. Needs Phil's go + his tax facts (P-00144). Note: rebuy resets cost basis lower (bigger gain someday) — it's a timing win, not free money.
From 2026-06-06 session. Phil initially wanted to 'wait for BTC to recover before selling' — corrected the misunderstanding: harvesting keeps the coins, you ride the recovery either way; waiting only erases the tax loss. Phil is not opposed to it. Not urgent; window open while BTC < cost basis. Root Cause (5 Whys) A ~$3,920 deductible capital loss is sitting unused in BTC; crypto has NO wash-sale rule (2026) so it can be harvested by selling+rebuying the same minute WITHOUT giving up the position — but the loss shrinks as BTC recovers and vanishes entirely if it returns to cost basis, so doing nothing risks throwing the deduction away.
P2 OVERDUE
#138

Configure NAS proactive email alerts (one Control Panel step + verified test email) to complete the push+heartbeat model

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-05 Time Active: 17 days Due: 2026-06-12 (10 days overdue)
Details
Proposed Fix: Design + scheduled-task side DONE 2026-06-05 (nas-watch rewired to push+heartbeat+weekly-sweep model; SOP at SOPs/Network/nas-email-alerts.md). Remaining: configure the NAS to actually SEND alert email. Recommended = Gmail OAuth (Control Panel -> Service Account and Device Pairing -> E-mail -> Add SMTP Service -> Sign in with Google as [email protected] -> Send test email), then Notification Center rule: Warning+Error -> email [email protected]. Needs Phil's one Google sign-in click (robust, nothing to rotate). Alt = Resend SMTP (Kara can do fully, but rotating-key fragility). MUST verify a real test email is received before trusting it. Then flip the transition gate in nas-watch SKILL.md (daily deep poll -> weekly). Kara could not automate the QTS canvas UI reliably (off-screen scaled coords, unlabeled tree) — this is a guided/click step, not a blind automation.
P2 OVERDUE
#137

QNAP Security Center scheduled scan failing daily (admin Log On As auth expired post-firmware-update)

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-05 Time Active: 17 days Due: 2026-06-12 (11 days overdue)
Details
Proposed Fix: Security Checkup scan fails daily 06:00 since 2026-06-05 (QuLog id 311012, sev4). Cause: scan's Log On As admin account auth expired, almost certainly from the 6/4 firmware update to 5.2.9.3499 (6/4 QuLog was clean). Not data-at-risk: pool Healthy, 7 disks OK, snapshots current. Fix UI-only (Phil): NAS desktop -> Security Center -> Scan Schedule -> reapply schedule settings, re-bind to a valid account (engel-agent). Recurs daily until reapplied.
P2 OVERDUE
#135

Behavioral pattern: SCOPE_CREEP (chuck)

Category: architecture
Reporter: chuck
Assigned: chuck
Created: 2026-06-05 Time Active: 17 days Due: 2026-06-12 (11 days overdue)
Details
Proposed Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Agent references a 'clear order' from Phil, but the transcript shows no such kill order from Phil between Chuck's diagnostic summary and the kill action. Add a precondition check: destructive actions (Stop/Disable/Delete scheduled tasks, process tree kills) require an explicit in-session Phil authorization token quoted back before execution, not an inferred 'clear order.' — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for chuck and needs a doctrine/hook change, not a per-incident nudge.
Failure mode: SCOPECREEP (P2) — agent: chuck. Source session: c8d53fa4-c65b-4e5f-bd54-583a456e77f8.jsonl. EVIDENCE: Clear order — killing it now, capturing forensics first so we can see how it got here (and reverse if ever needed). No hesitation. [...] # 3) STOP + DISABLE + DELETE the HermesGateway scheduled task (kill the relaunch vector
P2 OVERDUE
#132

Bootstrap size caps blown across 8 files — AGENT_BOARD.md 80.7KB (4x cap), aggregate 294KB vs 150KB cap

Category: cleanup
Reporter: auto
Assigned: chuck
Created: 2026-06-04 Time Active: 18 days Due: 2026-06-11 (11 days overdue)
Details
Proposed Fix: test-agent-boot 8 fails: CLAUDE.md 23.8KB, AGENT_BOARD 80.7KB, ORG_STATE 35.4KB, memory/AGENTS.md 38.7KB, decisions-log 36.8KB, chuck WORKING_MEMORY 22KB, tess agents.md 27.4KB, aggregate 294KB. trim-org-state archived 0 (all within 30d). Fix: dedicated distillation session — compact AGENT_BOARD rows to archive (pattern: _ARCHIVE/bootstrap-trim/2026-05-02), distill ORG_STATE + AGENTS.md preserving doctrine, move decision detail to memory/decisions/. Too big for the 12-min on-track budget; needs one interactive Chuck session ~45 min.
P2 OVERDUE
#45

/nas plugin underreports M.2 NVMe — emits only slot 1 even with 2 drives in RAID

Category: network
Reporter: peter
Assigned: kara
Created: 2026-04-28 Time Active: 55 days Due: 2026-05-05 (49 days overdue)
Details
Proposed Fix: Update IT/plugins/peter/skills/nas/SKILL.md Step 5 (Format report) to require explicit per-M.2 reporting in both the all-green template AND the daily journal entry. Current behavior: SKILL.md Step 5 collapses all-green to one status line — disks are tracked in Step 4 but not surfaced individually in the report. With 2x 256GB M.2 drives in RAID on philsserver, every fire should emit Health + temp + alert state for BOTH drives, not just one. Fix: add explicit M.2-block to all-green template (e.g. 'M.2 cache: drive 1 OK 58°C, drive 2 OK XX°C'), require state.json to track per-M.2 anomaly status under known_anomalies[].id pattern 'm2-slot-N-<issue>'. Verify by manually triggering a fire after edit and confirming both M.2s appear in 2026-04-28.log and the journal. Lineage: Phil flagged 2026-04-27 17:42 CDT during /peter session — see agents/peter/memory/2026-04-27.md. No state.json edits required (Step 4 already iterates all entries; only Step 5 formatting changes). Low risk: SKILL.md is template, broken edit means noisier or silent run, not a NAS state change.
(Title was Git-Bash MSYS-path-mangled at create time — /nas got prefixed with C:/Program Files/Git/. Manually corrected via Edit immediately after create. Lesson banked for future problem.js calls: escape leading slashes or use --title="\/nas" workaround on Git Bash for Windows.)
P1
#252

Gemini Discord cutover overran 'API-swap-only': ops interactive routes through a NEW parallel Gemini path; handoff claims 'ops=Anthropic Sonnet' but running code (07:34 restart) does not

Category: architecture
Reporter: chuck
Assigned: chuck
Created: 2026-06-22 Time Active: 0 days Due: 2026-06-25 (in 2 days)
Details
Proposed Fix: RECOMMEND ONLY (audit; no action taken): (1) swap the API INSIDE askAgent() — replace callAnthropicWithRetry with a Gemini equiv in the SAME loop/systemBlocks/tools/looksLikeLookup; delete parallel askAgentGeminiOps+assembleOpsSystemText so ops has ONE path. (2) pin GEMINI_MODEL_OPS=gemini-2.5-flash (3.5-flash likely-invalid + 5x meter cost). (3) hard-cap the ops Gemini tool loop (ballooned to 110K in/round x up to 25 rounds) + keep anti-leak OUTPUT rule. (4) Gemini IS metered to api-spend-watch + counted in rollups, but it is an ESTIMATE conflated into the Anthropic-Console-calibrated 30usd cap with no real Google-bill pull — split it out or add a Google billing pull. (5) commit the tangled tree in SEPARATED commits (P-00248 metering / OpenBrain-v2 config-guardian / Gemini) so reverting one does not nuke the others. (6) resolve P-00247 single-writer. Phil greenlights what/whether to ship.
AUDIT-ONLY (Phil order via Kara handoff 2026-06-22: touch nothing). Current-code ops correctness UNVERIFIED — needs Phil live #network test. bot.js askAgent() has 'if(!isCron && geminiEnabled()) return askAgentGeminiOps(...)' so ALL 5 ops agents route to Gemini when GEMINIAPIKEY set (it is). Cron=Opus, !real=Claude Code unaffected. Cursor 'Files touched' OMITS config-guardian.js + scheduled-tasks.json. config-guardian 386-line diff = mostly formatter noise over legit P-00243 OpenBrain-v2 guard (benign). Full report: agents/chuck/reports/2026-06-22-gemini-cutover-audit.md. Related: P-00247. Root Cause (5 Whys) Symptom: Kara(#network) returned leaked planning text on a 16-round 83-110K-token loop. Why1: ops interactive ran a Gemini path, not the grounded Anthropic askAgent loop. Why2: 'API-swap-only' was built as a NEW parallel path (askAgentGemini/askAgentGeminiOps) instead of swapping the model call INSIDE the existing askAgent() loop. Why3: no in-loop-swap discipline, so the parallel path drifted from the proven loop. Why4: repeated edit+restart cycles (06:07/06:21/07:31->07:34) with no commit checkpoints and a handoff written mid-edit. Why5(root): uncoordinated multi-surface editing of one uncommitted tree (P-00247) -> file, running process, and handoff describe three different states; no single source of truth.
P1
#251

Behavioral pattern: PREMATURE_DONE (chuck)

Category: architecture
Reporter: chuck
Assigned: chuck
Created: 2026-06-22 Time Active: 0 days Due: 2026-06-25 (in 2 days)
Details
Proposed Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Extend the VERIFY GATE / closeout doctrine: a scheduled or cron-driven job may NOT be tabled as '✅ live' or 'Proven' until it has actually fired once on its schedule (or its first real run is observed); created-but-unfired tasks must be labeled 'staged — first scheduled run pending' with the dependency (Chrome login / Phil Save) named, not folded into a ✅ done table. — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for chuck and needs a doctrine/hook change, not a per-incident nudge.
Failure mode: PREMATURE_DONE (P1) — agent: chuck. Source session: a9e20a63-d6b2-4ee7-a422-031092859d2e.jsonl. EVIDENCE: Agent posted 'Done — and banked everywhere so it can't get lost' under a '## ✅ Real-cost monitor — live' table listing 'Daily auto-pull | Task ... created, 7:08 AM' as complete, while the unattended scheduled run had never fired (agent later admits 'The daily real-pull hasn't fired on schedule yet (first run tomorrow 7:08 AM), and it needs Chrome logged into the Console at that time') and the native email safety-net depended on Phil hitting Save. The pattern is confirmed by the agent itself: 'Fair — you've been handed too many false "done"s today. Let me re-verify everything live right now, not from memory.' Root Cause (5 Whys) PENDING — run the 5 Whys at triage (auto-captured; root cause not yet drilled).
P1
#238

Anthropic API spend blind spot: separate-client scripts (behavior-auditor Opus daily, gmail-puller, skill-drafter) burn our key INVISIBLY — burn-watchdog only sees the bot tape

Category: bot-health
Reporter: chuck
Assigned: chuck
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due tomorrow)
Details
Proposed Fix: (1) DIAGNOSE: create an Anthropic ADMIN API key and have the burn-watchdog poll the org cost_report (Admin API) so ALL spend (every key, model, day) is visible — closes the blind spot permanently. Until then, Phil reads Console -> Cost to confirm the dollar driver. (2) CONTAIN before raising the cap: pause chuck-behavior-auditor (daily Opus meta-loop, likely the steady invisible drain; reversible enabled:false) until visibility exists; route ALL Anthropic callers through one tracked wrapper that calls logTokenUsage. (3) Do NOT just raise the cap blind — whatever drained it will drain the higher one too. Note: nothing can spend right now (cap hit until 2026-07-01), so there is no active bleed — the work is to fix visibility + decide on the Opus jobs BEFORE the cap is raised/resets.
Investigation for P-00237. Evidence: token-usage-report.js (bot 7d=5.41 USD all Sonnet), token-usage-log.jsonl last write 6/18, burn-history.json (<2.50/day peak), openbrain/.env (OPENBRAINLLMPROVIDER=openai), .cursor/ai-tracking/ai-code-tracking.db (composer-2.5/default, 2842 edits 6/18-6/20), grep of all 'new Anthropic' call sites. Separate ICAR to follow (systemic monitoring gap). Root Cause (5 Whys) Monthly Anthropic spend cap (key sk-ant-api03-jGVivo...) was exhausted before 2026-06-20 with NO alarm. Investigation (live, 2026-06-20): (1) bot interactive tape = ~5 dollars/7d, 100% Sonnet, and the bot made ZERO API calls since 6/18 06:56 — bot is not the driver. (2) OpenBrain uses OpenAI (gpt-4o-mini), not our Anthropic key — exonerated. (3) Cursor's ai-code-tracking.db shows heavy churn 6/18-6/20 but model=composer-2.5/default = Cursor's OWN billing, not our Anthropic key — likely NOT the cap driver. (4) The ONLY Anthropic callers on our key are: bot.js (tracked) + SEPARATE-CLIENT scripts that are NOT in the token tape or burn-watchdog: chuck-behavior-auditor.js (claude-opus-4-7, DAILY 6:10AM, large transcript inputs, confirmed firing through today), gmail-puller.js, skill-candidate-drafter-handler.js. 5 Whys: cap hit with no warning -> spend accrued unseen -> burn-watchdog/tape only instrument the bot's askAgent path -> the Opus meta-scripts each construct their OWN Anthropic client and never call logTokenUsage -> no single chokepoint or org-level cost feed was ever wired (documented limitation 2026-06-03, never closed). ROOT CAUSE: no full-spend visibility — untracked separate-client Opus jobs spent on the shared key until the hard cap killed everything.
P1
#237

Kids-channel bots (MJ/SB) dumped raw Anthropic usage-cap error JSON into #cool-kids-only; API spend cap hit (resets 2026-07-01)

Category: bot-health
Reporter: chuck
Assigned: chuck
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due tomorrow)
Details
Proposed Fix: DONE this pass: rewrote the messageCreate catch — fun channels (sb/mj) now reply with a static, kid-safe, $0 'be right back' line; other channels get a clean one-liner; NO raw JSON / request IDs ever posted. Bot restarted + verified. REMAINING (Phil-action, billing authority + Console UI only): raise/remove the monthly API spend limit in the Anthropic Console (Billing -> Usage/Cost limits) to bring MJ + SB live replies back before the 2026-07-01 auto-reset; before raising, identify what drove the monthly spend (needs Admin API cost_report — burn-watchdog can't see it all).
Reported by Phil via screenshot of #cool-kids-only at ~11:24-11:25 AM (PhillieDawg 'Yo Wsg' -> Engel Ops Bot raw 400 usage-limit error x2). SB's new gratitude reinforcement (built same day) is wired but ALSO cannot function until the API cap is lifted — same root cause. REPEAT/systemic (raw error leak fired multiple times) -> ICAR to follow in Operations/ICAR/. Root Cause (5 Whys) TWO intertwined faults. (1) BILLING: the account's configured monthly Anthropic API spend limit was reached on 2026-06-20, so every API-backed bot reply returns 400 invalidrequesterror 'reached your specified API usage limits, regain access 2026-07-01'. 5 Whys: bot replies failed -> API rejected the call -> monthly spend cap exhausted -> a spend ceiling is set on the account AND cumulative API usage reached it by mid-month -> driver of the cumulative spend not fully visible from the bot's own token tape (burn-watchdog only sees bot askAgent calls; behavior-auditor/complaint-detector Opus calls + any console/workbench spend are outside it). (2) ERROR LEAK: the messageCreate catch posted err.message RAW (full JSON incl request_id) to the channel with no channel-type awareness -> kids saw error spew. 5 Whys: kids saw JSON -> catch replied raw err.message -> the handler was written for adult ops channels and never differentiated the kids channels -> no kid-safe/degraded error path existed -> error UX was never designed for the fun channels added later (2026-06-09).
P1
#232

gmail-puller live reconcile path: reads only resp.content[0] w/ no max_tokens guard, marks message 'seen' before digest/extract/git complete, unbounded LLM close[] auto-pushed live, swallows malformed-section JSON silently

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Extract a shared collectAndGuardText(resp) (all text blocks + stop_reason retry) used by both call sites; gate seen-persistence on errors-empty for that message (or per-stage completion); add an aggregate close ceiling (refuse if close.length>5 and >50% of open items); log+record swallowed section-parse failures (better: have minutes-parser.py emit JSON the JS consumes directly instead of regex-over-YAML). Escalated to Tess (gmail-puller is 67KB, live action-item path).
John internal audit 2026-06-20. gmail-puller.js:564-587,711-744,976-1107,317-319/546-549. The one paid LLM call in the pipeline (Sonnet, event-gated) is itself confirmed EFFICIENT — the issue is robustness of the surrounding reconcile, not the model choice. Root Cause (5 Whys) The 2026-05-21 extract->reconcile refactor created reconcileActionItemsFromMinutes as the live path but the P-00177 hardening (all-text-blocks + maxtokens retry) was back-ported only to the now-dead extract function. So the LIVE path: (1) reads resp.content[0] only, maxtokens:4096, no stop_reason check -> a large close[]+create[] truncates mid-array -> swallowed JSON.parse error; (2) seen.add/saveSeen run mid-loop right after the .md write, BEFORE digest/reconcile/git — a throw there permanently skips those steps (next run sees the id and continues); (3) decision.close is applied unbounded and gitCommitAndPushIngested pushes the closures to live — a bad model pass returning close:[every open id] mass-closes work-actions on the live site with no >50% floor guard (the FMX pullers have one, this doesn't); (4) per-section JSON.parse is wrapped in empty catch -> a malformed section vanishes with no log, and reconcile may then CLOSE items whose evidence lived in the dropped section. Root: two parallel code paths drifted and the live one missed every guard the dead one has.
P1
#231

Content pullers emit NO heartbeat and floor-guard REFUSALS post silently — a stopped puller or a broken FMX/Drive feed is detected by nothing

Category: bot-health
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due today)
Details
Proposed Fix: (1) Add the 5 puller task names to WATCHED_FLOOR in heartbeat-watchdog-handler.js (the guaranteed-floor list, silence-detected even with no recent heartbeat) OR have each puller post a silent:true green heartbeat on clean completion. (2) Tag floor-guard refusals with critical:true and change bot.js's partial-failure block to silent:!errors.some(e=>e.critical) so a source-down refusal forces a #it-ops/#engelsplace red alert. This is the alerting half of P-00226.
John internal audit 2026-06-20. heartbeat-watchdog-handler.js:14-17,59-70; bot.js:100-124,2385-2412. Mechanism is bot-infra (Chuck cross-notify); pullers are Tess's lane. Root Cause (5 Whys) Two compounding gaps: (1) the 5 content pullers go through bot.js's generic handler path and are explicitly excluded from OPSTASKSFOR_HEARTBEAT ('only ops tasks, not content pullers'), so they never write a green heartbeat, so heartbeat-watchdog's deriveWatchedTaskNames can never pick them up — a puller can stop firing (pm2 down / cron stalled) for days and only a human noticing stale content catches it. (2) When a floor guard DOES fire (the P-00226/227/228 refusals), bot.js logs it as status:yellow silent:true 'don't ping Discord' — so the signal that the upstream source is BROKEN (and the dashboard is now frozen on stale data) scrolls by with no alert. Root: heartbeat coverage was opt-in via green self-report, and all summary.errors are treated at one uniform silent severity.
P1
#230

Quarterly emailer hardening: no data-validity gate (can email a fabricated 0%), dup-email on archive throw, MTTR parser error renders as 'clean quarter', loadState shape crash, reminder has zero dedup

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due today)
Details
Proposed Fix: (a) before send, assert pm.generatedAt fresh (<35d) and qRec.total>0, else send a flagged DATA-UNAVAILABLE notice or skip; (b) stamp+persist sentQuarters IMMEDIATELY after a successful send, wrap archiveEmail in its own try/catch; (c) propagate an mttrStatus:'ok'|'error'|'empty' sentinel so the renderer cannot print 'no incidents' on a parser error; (d) normalize loadState to guarantee sentQuarters:[]; (e) add quarterly-reminder-state.json dedup mirroring the emailer. Escalated to Tess — emailer is 30KB, not auto-edited in-pass.
John internal audit 2026-06-20. quarterly-emailer-handler.js:178-204,206-211,556-584; quarterly-reminder-handler.js:120-158. Emailer fires Jan/Apr/Jul/Oct days1-5 (not firing now — no urgency, but next fire is Jul 1). Root Cause (5 Whys) The quarterly handlers trust upstream/state blindly with no contract: (a) computeQuarterSummary sends whatever pm-metrics.json says — empty/stale -> 0.00% official report, no freshness/non-empty gate; (b) send->archive->saveState ordering: archiveEmail (no try/catch) can throw AFTER send but BEFORE saveState, so next cron day (fires days 1-5) re-sends; (c) readMttrForQuarter returns [] on parser failure, indistinguishable from a real empty quarter, and the email AFFIRMATIVELY prints 'no downtime incidents logged'; (d) loadState catch only guards parse not shape — a valid-JSON-missing-key file -> state.sentQuarters undefined -> crash; (e) quarterly-reminder-handler has NO dedup state at all (sibling emailer was hardened, reminder was not). Root: idempotency + data-validity were solved for one path and not carried across siblings.
P1
#229

FMX PM occurrences 24-month window is IGNORED by the API — pm-metrics aggregates 2022–2029 (7yr incl. future PMs), deflating the on-time leaderboard Phil reads

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Filter occurrences client-side to [now-24mo, now] on occ.date (normalizeIsoUtc) before buildMetrics, OR confirm correct FMX param names and assert min/max returned date is within window (push error if not). NOT auto-changed in-pass: this shifts the official numbers Phil sees, so Tess+Phil should review the corrected metrics. Fixes leaderboard, monthVolume, quarterRollup simultaneously.
John internal audit 2026-06-20. fmx-pm-puller.js:270-276,378-388,247-251. Verified against live pm-metrics.json (64 months). Downstream of this: leaderboard rate + monthVolume + quarterRollup-total mismatch. Root Cause (5 Whys) The from/to query params on /planned-maintenance/occurrences do not constrain the result (wrong param names or unsupported), and there is NO client-side date backstop. Live pm-metrics.json proves monthVolume spans 2022-06..2029-04 (64 months, 6463 occurrences incl. future 2029 PMs). Future open-on-time occurrences pad taskCounts.total but never onTime, so the 'worst on-time' leaderboard (sorted ascending) is artificially deflated; monthVolume plots ghost future columns. Root: client trusts an unconfirmed query-param contract with zero validation.
P1
#227

fmx-pm metrics-overwrite path has no floor guard — a 200-empty occurrences response zeroes pm-metrics.json and emails Phil a 0% report

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due today)
Details
Proposed Fix: FIXED IN-PASS: added a floor guard before buildMetrics — if occurrences.length===0 && tasksFetched>0, refuse the overwrite, keep prior pm-metrics.json, push a hard error. REMAINING for Tess: (1) verify; (2) handle the both-endpoints-empty residual via a shared fetchOrRefuse helper (ICAR); (3) add the quarterly-emailer validity gate (separate ticket).
John internal audit 2026-06-20. fmx-pm-puller.js:397-409. Guard verified (refuses zero-clobber, passes normal). Companion to P-00226 (delete path). Feeds the quarterly-emailer 0% bug. Root Cause (5 Whys) The P-00226 fix guarded the task-file DELETE path but not the metrics OVERWRITE path in the SAME file. occurrences=Array.isArray(data)?data:(data?.items||[]) yields [] on a 200-but-empty/shape-changed response with no throw; buildMetrics([]) makes an all-zero rollup; writeFileSync clobbers good pm-metrics.json; the handler commits+pushes it live; the quarterly emailer then reads the zeroed file and sends Phil a fabricated 0.00% on-time Fair-Rite report. Root: the empty-200 defense was applied to one of two write paths.
P1
#226

Content pullers hard-delete entire collection on a 200-but-empty/shape-changed API response (silent, auto-pushed to live engelsplace.com)

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Source-level floor guard in fmx-puller.js, fmx-pm-puller.js, youtube-puller.js: BEFORE the cleanup/unlink step, if summary.fetched===0 (or would-delete count exceeds ~50% of on-disk files), ABORT cleanup, push a hard error onto summary.errors, and have the bot handler treat summary.errors.length>0 as failure — skip commitAndPushContent AND raise red (ledger ticket + #engelsplace post) instead of a gray info line. Secondary: wrap puller handlers in runHandlerWithHeartbeat + add to heartbeat-watchdog watched list so a silently-stopped puller trips the silence detector.
John weekly internal audit (engelsplace-pipelines, 2026-06-20). Lane: Tess. Evidence: fmx-puller.js:199-228; fmx-pm-puller.js:332-362; youtube-puller.js:179-250; commit-content.js:33-92 (build gate catches malformed content but a mass DELETE builds green and deploys). Blast radius: 403 maintenancerequests + 133 pmtasks + 8 videos. Floor-guard fix applied in-pass by auditor under FIX-IT-WHEN-YOU-FIND-IT; this ticket tracks verification + the heartbeat/alerting half for Tess. Root Cause (5 Whys) Maintenance dashboard risks going blank because the puller deletes every local .md whose id isn't in the live set. The live set can be empty WITHOUT an error: fmx/fmx-pm/youtube treat any 200 as success and resolve via Array.isArray(data)?data:(data?.items||data?.data||[]) — a shape change, a building-scope permission downgrade on the Dashboard-Sync viewer user, or a genuine empty page all yield [] with no thrown error. The cleanup loop has no floor/sanity guard, so an empty set unlinks all files. The wipe reaches live because the handler then runs commitAndPushContent (git add -A->commit->build-gate->push origin/main) and a deletion of valid files builds GREEN, passing the build gate and auto-deploying. It is silent because the pullers are not wrapped in runHandlerWithHeartbeat and a fetched=0 run emits only a gray Discord info line. ROOT CAUSE: hard-delete reconciliation trusts an unvalidated success signal with no zero/floor guard at the source.
P2
#260

Repeat (3x): advised reboot for an installed skill instead of verifying invocation syntax

Category: architecture
Reporter: chuck
Assigned: unassigned
Created: 2026-06-23 Time Active: 0 days Due: 2026-06-30 (in 7 days)
Details
Proposed Fix: Correct command is /cursor-architect:cursor-architect (confirmed via claude-code-guide + structural parity with /chuck:chuck). CA: authoritative invocation rules banked to OpenBrain id 570. For a bare /cursor-architect, repackage as a single-skill plugin with SKILL.md at the plugin ROOT + plugin.json name=cursor-architect. Standing behavioral CA: use the claude-code-guide agent for Claude Code feature questions; never advise a reboot for a skill-visibility issue without first verifying the invocation syntax.
Root Cause (5 Whys) Chuck did not know this desktop app's skill-invocation model — a plugin skill at skills/<name>/SKILL.md is invoked as /<plugin>:<skill> (e.g. /chuck:chuck), NOT bare /<name>; and personal ~/.claude/skills/ skills never surface in the desktop slash picker at all. cursor-architect was installed and LIVE the whole time; only the typed command string was wrong. Chuck guessed 'reload/reboot' 3 times instead of verifying the invocation string.
P2
#259

Boot-doctrine drift: 8 stale/conflicting refs found by janitor run #1

Category: architecture
Reporter: chuck
Assigned: unassigned
Created: 2026-06-23 Time Active: 0 days Due: 2026-06-30 (in 7 days)
Details
Proposed Fix: Sweep safe verified items: align CLAUDE.md L82/L137 surface count to canonical capability-matrix; Peter->Kara in OPENCLAW-BIBLE.md L206/L253; qualify bare auto-rebuild-plugins.js path in agents/chuck/agents.md L24. Resolve IT/problems canonical-vs-legacy via Control Tower migration. soul.md L76 deferred (soul-locked, needs Phil). Full list: IT/status/janitor/findings/latest.md.
Root Cause (5 Whys) System reorganizes (agent retirements, file renames, 2026-06-22 surface-persona rollout) faster than cross-references get swept; no automated reference-integrity check existed — the gap the Cursor janitor (Tard) was built to close. 8 issues verified live 2026-06-22.
P2
#256

NAS (TS-664) did not auto-power-on after AC power recovery - stayed off after the outage

Category: network
Reporter: kara
Assigned: unassigned
Created: 2026-06-22 Time Active: 0 days Due: 2026-06-29 (in 7 days)
Details
Proposed Fix: Set QTS Control Panel > System > Power > Power Recovery to "Turn on the server automatically" and ensure Control Panel > System > Hardware > EuP Mode is DISABLED. GUI toggle (engel-agent has admin). This is the anchor for power-event self-recovery and pairs with moving cloudflared onto the NAS (see tunnel-migration proposal). Verify the setting persists; full proof on next power event or a controlled UPS test.
Root Cause (5 Whys) Phil reports the NAS did not come back on its own after last nights power outage (had to be manually powered on). QTS "Power Recovery" action is not set to turn on automatically, OR EuP Mode is enabled (EuP minimizes standby power and DISABLES auto-power-on/WoL). 5 Whys: (1) NAS off after power returned = it did not auto-boot; (2) did not auto-boot = QTS power-recovery action not "turn on automatically" or EuP enabled; (3) = the unattended-recovery setting was never configured; (4) = default/post-migration state never hardened. NAS verified live + healthy now (engel-agent admin auth OK, is_booting=0, mediaReady=1). Exact current value still to be confirmed in the QTS panel.
P2
#250

Behavioral pattern: ACT_BEFORE_CONFIRM (chuck)

Category: architecture
Reporter: chuck
Assigned: chuck
Created: 2026-06-22 Time Active: 0 days Due: 2026-06-29 (in 6 days)
Details
Proposed Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Add a 'native-first' line to TARGET DISCIPLINE Part 2 (agents/chuck/agents.md): before building custom automation against a third-party service, grep/check that service's own native features (billing alerts, exports, webhooks) and confirm the simplest path with Phil before writing browser/script glue. — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for chuck and needs a doctrine/hook change, not a per-incident nudge.
Failure mode: ACTBEFORECONFIRM (P2) — agent: chuck. Source session: a9e20a63-d6b2-4ee7-a422-031092859d2e.jsonl. EVIDENCE: Agent built a Chrome-driven browser pull as the PRIMARY real-cost monitor (scheduled task 'api-spend-real-cost-daily' that 'drives Chrome to the Console endpoint') and declared it the solution — then Phil surfaced Anthropic's native email alert: 'Why can't we incorporate this email? That should do it, shouldn't it?' Agent conceded: 'Yes — incorporate it. You found the most reliable piece of the whole thing... that's more bulletproof than my browser-driven pull. It fires on the actual bill, Anthropic sends it, and it has zero dependency on our scripts, Chrome, or anything on your machine.' The agent committed build effort to a fragile, Chrome-login-dependent approach without first checking for the provider's native, zero-dependency spend-alert feature. Root Cause (5 Whys) PENDING — run the 5 Whys at triage (auto-captured; root cause not yet drilled).
P2
#247

Two agents (Cowork-Code + Cursor) edited the same repo files concurrently — clobber risk; Cursor edits left uncommitted

Category: architecture
Reporter: chuck
Assigned: unassigned
Created: 2026-06-21 Time Active: 1 day Due: 2026-06-28 (in 6 days)
Details
Proposed Fix: Add a same-repo concurrent-edit guard/coordination: (1) before a multi-file work session, an agent claims the repo (or specific files) in AGENT_BOARD with a timestamp; peers check it. (2) Lightweight lock file IT/status/repo-edit-lock.json (agent+surface+files+ts, stale after N min) checked by a PreToolUse hook on Edit/Write — warn (not block) if another live surface holds it. (3) Commit-frequently discipline so uncommitted cross-surface edits don't linger (today CLAUDE.md/AGENTS.md/.cursor/rules were left uncommitted by Cursor while I worked). (4) Cross-surface note: don't run Code + Cursor agent tasks on the same repo simultaneously without coordinating. Today survived only because edits hit different lines + the 'modified since read' guard caught one clash.
Root Cause (5 Whys) Cowork-Code (me) and Cursor both ran autonomous agent tasks on ClaudeLivesHere at the same time (both improving the OpenBrain deploy-verify gate + boot files). No coordination/lock exists for same-repo multi-agent edits; git author is identical ('Phil Engel') on both surfaces so even history doesn't distinguish them. Detected via 'File modified since read' on openbrain-deploy-verify.js + harness external-modification flags on CLAUDE.md/AGENTS.md/.cursor/rules during my turns.
P2
#245

robinhood-staleness-check flat 30h threshold false-alarms every Sunday/Monday (puller runs M-F)

Category: architecture
Reporter: auto
Assigned: unassigned
Created: 2026-06-21 Time Active: 1 day Due: 2026-06-28 (in 5 days)
Details
Proposed Fix: Make STALE_HOURS_THRESHOLD weekend-aware: Sunday=52h, Monday-pre-0730=76h, Tue-Sat=30h. Check runs 06:53 BEFORE the day 07:30 sync so it reads the previous weekday pull; flat 30h trips on Sun (Fri+47h) and Mon (Fri+71h) with no real failure. Real failures still trip (Tue data >30h = Monday sync failed).
Root Cause (5 Whys) PENDING — run the 5 Whys at triage (auto-captured; root cause not yet drilled).
P2
#242

kara-network-watch: green path never clears notify-state -> warns perpetually misreport as STILL-DOWN, no RECOVERED text

Category: cleanup
Reporter: kara
Assigned: unassigned
Created: 2026-06-21 Time Active: 2 days Due: 2026-06-28 (in 5 days)
Details
Proposed Fix: On overall==='ok', the SKILL.md deliver block must ALSO call pushAlert({agent:'kara',key:'network-watch',status:'ok',...}). pushAlert is silent on good->good but sends exactly one RECOVERED text and resets notify-state.bad=false on bad->good. Restores edge-trigger semantics while preserving 'no routine green spam'. Update the green branch in the Code-side SKILL.md.
Root Cause (5 Whys) 5-Whys: (1) Tonight pushAlert fired 'STILL DOWN' not 'ALERT' for a download dip -> (2) notify-state.bad was already true from a prior warn -> (3) clearing only happens when pushAlert is called with a good status -> (4) the SKILL.md green branch is 'SILENT, no Telegram' and never calls pushAlert -> (5) ROOT: task conflated 'no routine green spam' (correct) with 'never clear alert state' (defect); edge-trigger recovery/reset half of pushAlert is never invoked for this monitor.
P2
#241

OpenBrain managed-directive pile accretes — capture_thought adds but nothing retires stale/contradictory directives

Category: memory-system
Reporter: kara
Assigned: unassigned
Created: 2026-06-21 Time Active: 2 days Due: 2026-06-28 (in 5 days)
Details
Proposed Fix: (1) Determine + document the retirement mechanism for managed directives (rebuild_managed_memories behavior, or a delete/supersede path) — capture_thought only ADDS, proven live: superseding the Tess-Hands-Off directive left BOTH the old (wrong) and new entries active. (2) Run a one-time prune of the ~50-directive pile per the 2026-06-20 audit: cut stale/done/contradictory/status-note entries, merge duplicates (UPS x2, NAS-monitoring x3), resolve the 401k contradiction (hold-20%-don't-pressure is authoritative; retire the bump-to-30%/maximize entries). (3) Add a periodic directive-hygiene pass so the pile stays lean (the 5 that matter aren't diluted by 45).
OpenBrain managed-directive pile accretes — capturethought adds but nothing retires stale/contradictory directives Root Cause (5 Whys) OpenBrain's managed-memory layer is append-only from the agent's side: capturethought promotes new directives but there is no wired path to retire a superseded/done/wrong one. So every correction ADDS a contradicting directive instead of replacing the stale one, and completed one-time tasks ('add sleepSync', 'add step 15' — both verified DONE) and pure status notes ('closed 7 tickets', 'OpenBrain live check works') never leave. The pile grows unbounded and starts injecting contradictory orders (no-tool-bans vs Tess-prohibition; hold-401k-at-20% vs maximize-deferral).
P2
#240

Honor-system OpenBrain boot-read gets skipped under pressure — enforce via SessionStart hook

Category: memory-system
Reporter: kara
Assigned: chuck
Created: 2026-06-21 Time Active: 2 days Due: 2026-06-28 (in 5 days)
Details
Proposed Fix: Build IT/scripts/openbrain-session-start-hook.js (spec: IT/openbrain-boot-enforcement-hook-SPEC.md): a SessionStart hook that fetches get_active_memories (reuse openbrain-boot.js) and injects Phil's standing directives via hookSpecificOutput.additionalContext (verify-first-hook.js pattern), keyed on the SessionStart 'source' field — fire on startup/resume/compact, skip clear. Hard 8s timeout + fail-open. Wire a second SessionStart command into .claude/settings.json next to agentkits-hook-wrapper. Owner: Chuck (settings.json/hook lane).
Root Cause (5 Whys) ENFORCED-VS-WRITTEN GAP. Harness-injected instructions (verify-first/settled-conclusions, via UserPromptSubmit hook) are always followed; honor-system boot instructions (call getactivememories FIRST) get skipped when an agent jumps straight into a Phil-handed task under pressure. There was no hook making the boot-read deterministic — it relied on the agent remembering, which fails exactly when busy. Adding another written self-instruction would not fix it (no teeth); only an enforced fetch-and-inject hook does.
P2
#236

SB/MJ kids-channel bots asked 'who are you?' — poster identity never passed to the persona

Category: bot-health
Reporter: chuck
Assigned: chuck
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-27 (in 4 days)
Details
Proposed Fix: Add KNOWN_USERS map + resolveSender() in bot.js; for sb/mj pass opts.senderInfo through routeAgent to askAgent, which injects a 'WHO YOU ARE TALKING TO' system block before the conversation frame. lilly069964 to Lillian, philliedawg to Phil; unknown users fall back to Discord display name. Fixed + verified 2026-06-20.
Fixed in-pass during Phil's Lillian gratitude check-in build (2026-06-20). Needs engel-ops-bot pm2 restart to take effect. Verify: post as a known user in #champions-chat, confirm SB greets by name without asking. Root Cause (5 Whys) messageCreate in discord-gateway-bot/bot.js routed only the raw message TEXT to askAgent; message.author identity (username/displayName) was discarded before the persona call, so SB/MJ were never told who was speaking and asked 'who are you?' even with Discord author info present (observed: SB asked philliedawg 'who's this?' 2026-06-10). The 2026-06-09 fun-channel wiring reused the agent-channel path that forwards only text+history; no identity hop was added.
P2
#235

Doc drift: SOP-IT-011 says FMX maintenance + PM pullers are 'ON-DEMAND / cron disabled' but both fire cron 3x/day enabled:true; YouTube + quarterly pipelines undocumented; code header docblocks describe retired extract/FMX-MTTR paths

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-27 (in 4 days)
Details
Proposed Fix: Rewrite SOP-IT-011 FMX/FMX-PM sections to reflect cron-enabled 3x/day + the new floor guards; add YouTube + quarterly sections; reconcile Last-Updated. Fix gmail-puller header (RECONCILE not extract) + quarterly-emailer header + bot.js:140-141 (xlsx not *.md). Maintained-by is Chuck (SOP) — cross-notify.
John internal audit 2026-06-20. SOP-IT-011-engelsplace-content-pipelines.md; gmail-puller.js:1-19; quarterly-emailer-handler.js:4,9,386; bot.js:140-141. Consolidates #8,#18,#39 + auditor's SOP finding. Root Cause (5 Whys) LIVING-DOCUMENT updates lagged code changes. SOP-IT-011 (lines 31-32,198,227,236) calls the FMX + FMX-PM pullers ON-DEMAND/cron-disabled, but scheduled-tasks.json shows engelsplace-fmx-ingest-morning/afternoon + fmx-pm equivalents all enabled:true on cron 0 6,11 + 45 14 Mon-Fri; the SOP has NO section for the LIVE youtube-puller or the quarterly emailer/reminder at all; Last-Updated says 2026-05-31 at top but 2026-05-23 at bottom. In-code: gmail-puller header still says 'Claude Sonnet to EXTRACT' (live path RECONCILES, which can CLOSE items); quarterly-emailer header + bot.js comment say MTTR reads maintenancerequests/*.md (live source is mttr-log.xlsx; the .md reader is DORMANT). Root: migrations updated code but not the prose/registry comments. DANGER: the SOP tells a future agent the pullers DON'T auto-delete, masking the P-00226 wipe risk.
P2
#234

Pipeline cleanup (Sort): duplicated puller helper twins (yamlEscape/normalizeIsoUtc/writeIfChanged/fmxGet across 3-4 files, already drifting), + dead code (extract/close v2-v3 prompt, dead exports, no-op statements)

Category: cleanup
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-27 (in 4 days)
Details
Proposed Fix: Create IT/discord-gateway-bot/puller-lib.js exporting normalizeIsoUtc/yamlEscape/yamlArray/writeIfChanged + FMX authHeader/fmxGet; import from all pullers (one canonical copy can't drift). Delete/archive the dead gmail extract+close trio and prune their exports; trim youtube dead exports; delete blood-panel:116 and fmx-puller dead yamlEscape branch. Add a jscpd/ts-prune gate on the gateway-bot dir.
John internal audit 2026-06-20. Consolidates findings #5,#16,#31,#32,#33,#35,#36. Low blast radius but the twin-drift already caused one real divergence. Root Cause (5 Whys) The pullers grew by copy-paste ('Phase D' scaffolds); only the git step (commit-content.js) was ever de-duplicated. The YAML/date/write helpers and the FMX authHeader/fmxGet pair are cloned across fmx/fmx-pm/youtube and are ALREADY drifting — fmx-puller.js:84-88 has a dead yamlEscape conditional (both branches return JSON.stringify) that was cleaned up in the other two copies but not this one. Plus: gmail-puller extractActionItemsFromMinutes + closeMissingMinutesActionItems + ~190 lines of v2/v3 EXTRACTIONPROMPT are dead in the live path but still exported; youtube-puller exports unused CHANNELID/PLAYLISTTOSLUG; blood-panel:116 is a guaranteed no-op re-set. Root: behavior changes applied additively and no shared lib / dead-export lint.
P2
#233

blood-panel-puller treats an empty/permission-revoked Drive listing as 'no changes' success — no zero-floor alert

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-27 (in 4 days)
Details
Proposed Fix: Persist lastFilesChecked in the seen-state; on a >0->0 drop push a summary.errors entry ('0 PDFs now, N before — verify service-account Viewer access + folder id') so the partial-failure path surfaces it. (On-demand only, cron disabled — low urgency but real.)
John internal audit 2026-06-20. blood-panel-puller.js:228-235,307-311. Same empty-200-trust family as P-00226/227. Root Cause (5 Whys) files=list.data.files||[]; an empty list (revoked service-account Viewer access or changed folder ID — usually a 200, not a throw) yields filesChecked=0, the loop never runs, errors stays empty, and summaryToDiscordMessage prints 'No changes. Dashboard up to date.' A silently-revoked credential is indistinguishable from a genuinely empty folder. Unlike FMX/YouTube, blood-panel has no floor/baseline guard. Root: no persisted expected-minimum, so a drop from >0 to 0 reads as normal.
P2
#228

FMX maintenance + PM-task pulls have NO pagination — silent truncation at pageSize, and a truncated read feeds the hard-delete reconciliation

Category: website
Reporter: john
Assigned: tess
Created: 2026-06-20 Time Active: 2 days Due: 2026-06-27 (in 4 days)
Details
Proposed Fix: PARTIAL FIX IN-PASS: fmx-puller delete-reconciliation now DISARMS (refuses cleanup) when fetched>=pageSize (possible truncation). REMAINING for Tess: add a real pagination loop per FMX's paging contract in BOTH pullers, and add the same truncation-disarm to fmx-pm task cleanup. Then a >cap list reads fully instead of just not-deleting.
John internal audit 2026-06-20. fmx-puller.js:188-196 + cleanup guard; fmx-pm-puller.js:322,387. Truncation guard verified. Root Cause (5 Whys) fmx-puller and fmx-pm-puller do a SINGLE fmxGet with pageSize=500 (tasks) / 10000 (occurrences) and no nextPage/skip loop (youtube-puller DOES paginate). Today volumes are under cap so it does not bite, but the moment a list exceeds pageSize the extra records are dropped with no error — and worse, the dropped-but-live tickets are then treated as removed-from-FMX and hard-deleted locally. Root: a capacity assumption baked in with no overflow detection.
P2
#223

scheduled-task-registry.js reports code-side/Cowork tasks [ON] by folder-presence, not real scheduler enabled-state — falsely flagged a disabled code-side ops-report as a live double-fire

Category: scheduled-task
Reporter: chuck
Assigned: chuck
Created: 2026-06-18 Time Active: 4 days Due: 2026-06-25 (in 3 days)
Details
Proposed Fix: Cross-reference mcp scheduled-tasks list_scheduled_tasks for code-side enabled/lastRunAt/nextRunAt instead of hardcoding enabled:true; for Cowork read cowork-tasks-snapshot.json (drift-guard's source) for enabled state; where neither is available, label enabled as 'unknown(app-registered)' rather than asserting [ON]. Re-run and confirm the false bot+code ops-report cross-surface dup clears once code-side reflects enabled:false.
Root Cause (5 Whys) listDirSurface() in IT/scripts/scheduled-task-registry.js (line ~50) HARDCODES enabled:true for every code-side/Cowork folder. The true enabled/nextRunAt/lastRunAt state lives in the app scheduler (Claude Code + Cowork desktop), exposed via mcp scheduled-tasks listscheduledtasks — which the registry never queries. So a disabled-but-present task (SKILL.md on disk, enabled:false in scheduler) is indistinguishable from a live one. VERIFIED 2026-06-18: code-side chuck-daily-ops-report is enabled:false (lastRun 2026-06-13) per the MCP, but the registry printed it [ON] and flagged a bot+code cross-surface dup, misleading the P-00194 consolidation into reporting a live ops-report double-fire that does NOT exist (the dedup was already complete since ~Jun-13). 5-Whys root: the canonical cross-surface registry reports PRESENCE, not firing-state, for the two subscription surfaces.
P2
#221

Nicole's garage AiMesh node (RT-AC86U) has weak 5GHz wireless backhaul to main router

Category: network
Reporter: kara
Assigned: kara
Created: 2026-06-17 Time Active: 5 days Due: 2026-06-24 (due tomorrow)
Details
Proposed Fix: Strengthen the garage node's link to the main RT-BE92U, best-to-simplest: (1) WIRED Ethernet backhaul to the garage if any cable path exists (AiMesh auto-detects; turns weak wireless link into solid gigabit) — the real fix; (2) reposition the node for clearer line-of-sight / fewer walls to the main router; (3) add an intermediate AiMesh node to hop the distance; (4) upgrade the old WiFi5 RT-AC86U to a WiFi6 node; (5) powerline/MoCA backhaul if coax/powerline available. Phil wants this 'eventually' — advisory, no action yet.
From Phil's 2026-06-16 Network Map screenshot. Node: RT-AC86U 'Garage' @192.168.2.22, 5GHz backhaul WEAK, 5 clients all 2.4G (android .241, linux .129, Espressif .179, Espressif .237, MyQ-74C .157). LIKELY a contributor to the residual P-00220 churn since some churning Espressif plugs are behind this weak node. Note: this corrects the earlier 'single-AP' assumption — Nicole's has 1 AiMesh node. Root Cause (5 Whys) Garage node is an ASUS RT-AC86U (WiFi5) linked to the main router over a 5GHz WIRELESS backhaul; garage distance + walls weaken 5GHz badly, so the backhaul shows weak signal. Devices behind the node (2x Espressif plugs, MyQ opener, etc.) inherit the weak link.
P2
#220

Smart-plug Wi-Fi churn NOT resolved by Roaming Assistant fix - re-investigate real root cause (both routers)

Category: network
Reporter: kara
Assigned: kara
Created: 2026-06-16 Time Active: 6 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Do NOT guess-and-poke. Proper diagnosis: (1) pull per-device RSSI of the dropping plugs from both routers (wl client list) - weak signal is a prime suspect for reason-8 client-leaves; (2) review untouched 2.4GHz settings known to break cheap Tuya/ESP IoT: 802.11ax mode + OFDMA, WMM APSD/U-APSD power save, MBO; (3) 2.4GHz channel congestion scan; (4) confirm whether aggregate deauth churn even equals the user symptom (Alexa-unreachable) vs normal IoT power-save. Propose ONE change with rationale, apply, verify over a FULL multi-hour window before claiming success.
Supersedes the PREMATURE resolve of P-00217 (home) and ties to reopened P-00214 (Nicole). Lesson: I declared success on a 17-min window; verification must be a full multi-hour window. ICAR-2026-06-16-01 corrected. Roaming Assistant left OFF (harmless on single-AP; before-configs saved if exact original wanted). Root Cause (5 Whys) OPEN. DISPROVEN: Roaming Assistant was NOT the cause - disabled on all bands (nvram=0, verified) yet Nicole's churn continues at full rate (51/hr at 06h, 32/hr at 07h, reason codes 8/3 client-initiated, same Espressif/Tuya plug MACs). Home inconclusive (low baseline). Reason 8/3 = STATION/client-initiated leave => points to device-side cycling, 2.4GHz RF/signal, or ax/OFDMA/APSD incompatibility - NOT an AP roaming kick.
P2
#219

Home router (192.168.1.3) Let's Encrypt / DDNS update failing in a loop (every 5 min)

Category: network
Reporter: kara
Assigned: kara
Created: 2026-06-16 Time Active: 6 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Investigate: check WAN > DDNS settings (asuscomm.com hostname registration) and the Let's Encrypt cert status. Likely DDNS hostname/registration or WAN-IP detection issue blocking ACME cert renewal. NOT related to the Roaming Assistant fix (P-00217). Determine root cause before changing anything.
Discovered 2026-06-16 while reading the home router log for P-00217 verification. Pre-existing (my changes were wireless-only). Flagged to Phil. Root Cause (5 Whys) PENDING investigation. Symptom only: router log shows repeating 'rcservice restartletsencrypt' + 'Let's Encrypt: Err, DDNS update failed' at 5-min intervals (observed 07:15-07:35 Jun 16 in the home router General Log via Phil's screenshot). Discovered incidentally while verifying the roaming fix.
P2
#216

Behavioral pattern: NO_VERIFY_BEFORE_ASSERT (alex)

Category: architecture
Reporter: chuck
Assigned: chuck
Created: 2026-06-16 Time Active: 6 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Add a hard rule to Alex's SKILL.md: when guiding Phil through a third-party web UI Alex cannot see, do not narrate menu paths from memory — instead ask Phil to read what's on screen one element at a time, or offer to drive the browser first. — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for alex and needs a doctrine/hook change, not a per-incident nudge.
Failure mode: NOVERIFYBEFORE_ASSERT (P2) — agent: alex. Source session: 76763a8c-0476-4706-819a-fc2cc62cb084.jsonl. EVIDENCE: 'Look for a menu item called "Statements & Documents" (sometimes just "Documents," often under your name or a menu in the top-right corner).' — Alex narrates a UI path on the Empower/JP Morgan site he cannot see and has not pulled from an authoritative source. Root Cause (5 Whys) PENDING — run the 5 Whys at triage (auto-captured; root cause not yet drilled).
P2
#215

Behavioral pattern: SCOPE_CREEP (kara)

Category: architecture
Reporter: chuck
Assigned: chuck
Created: 2026-06-16 Time Active: 6 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Auto-detected by chuck-behavior-auditor from a session transcript — no Phil complaint triggered this. STRUCTURAL FIX (from the auditor): Add a scope-lock checkpoint in Kara's apply loop: when Phil's approval enumerates a specific set ('both bands'), the executor must treat additional targets as a new ask requiring fresh confirmation, not a 'while I'm in here' bonus. — Treat the recurrence count on this ticket as the pattern signal: a rising count means this failure mode is systemic for kara and needs a doctrine/hook change, not a per-incident nudge.
Failure mode: SCOPE_CREEP (P2) — agent: kara. Source session: 92c14ab7-3ede-466c-8204-b725c79fd37f.jsonl. EVIDENCE: Phil's correction narrowed the fix: 'The entire fix is #1: disable Roaming Assistant on both bands... Nothing else needs touching.' Kara confirmed and applied 2.4 + 5 GHz, then on her own initiative extended to 6 GHz: 'The 6 GHz band still shows -70 — no plugs live there, but it's the identical defect, so I'll disable it too for consistency while I'm in here.' This exceeds the explicitly scoped 'both bands' approval and burned extra time on retries when the 6 GHz apply failed. Root Cause (5 Whys) PENDING — run the 5 Whys at triage (auto-captured; root cause not yet drilled).
P2
#214

Nicole's smart plugs repeatedly disconnect from ASUS RT-BE92U (Alexa can't reach them, worse at night)

Category: network
Reporter: kara
Assigned: kara
Created: 2026-06-16 Time Active: 6 days Due: 2026-06-23 (due today)
Details
Proposed Fix: Live-diagnose in router web UI (System Log + Wireless log + wireless settings). Save before-config .CFG first. Most likely fix: isolate 2.4GHz IoT traffic (disable Smart Connect band-steering OR dedicate a 2.4GHz IoT SSID), set WPA2-Personal + PMF=Capable (NOT WPA3/Required), pin a clear 2.4GHz channel (1/6/11) at 20MHz, disable Roaming Assistant + airtime fairness on 2.4GHz. All reversible; save after-config .CFG.
Reported by Phil 2026-06-16. Symptom: Alexa cannot reach bedroom plug etc.; plugs flash (disconnected) middle of night, slow to rejoin. Site: Nicole's, 192.168.2.0/24, ASUS RT-BE92U @ 192.168.2.2 (WireGuard client). This session is on-site (philsgamingmachine, 2ms ping, web UI HTTP 200). Latest RT-BE92U backup .CFG = 7-13-2025; old RT-AC86U files stale. Root Cause (5 Whys) PENDING live log review. Hypothesis: budget 2.4GHz-only smart plugs are band-steered or de-authenticated by Smart Connect, and/or WPA3/PMF-required + 802.11ax (OFDMA/TWT) features incompatible with their WiFi chipsets. Nightly-worse pattern suggests an auto-channel/DFS radio reset or scheduled wireless event also drops them.
P2
#213

chuck-schedule-snapshot missed its 01:00 fire on 2026-06-16; dashboard JSON went 24h stale until manual regen

Category: scheduled-task
Reporter: chuck
Assigned: chuck
Created: 2026-06-16 Time Active: 6 days Due: 2026-06-23 (due today)
Details
Proposed Fix: DONE this run: regenerated schedule-snapshot.json via generate-schedule-snapshot.js (verified 2793 events, mtime 06-16 05:26). WATCH: if the 01:00 fire misses again, treat as a handler/cron-registration bug and debug in an interactive Code session.
Root Cause (5 Whys) PENDING - single occurrence while bot up 18h and heartbeat-watchdog cron fired normally at 01:30/03:00, so not a bot-down or restart-window miss. Need to confirm whether the chuck-schedule-snapshot cron handler threw a swallowed exception at 01:00 or was skipped by the scheduler. Inspect bot.js cron registration (~line 344) + any 01:00 handler stderr on the next occurrence.