The Codebase That Fixes Itself at 3 AM
Five Python daemons, headless Claude Code, and zero on-call pages
Last Tuesday at 2:47 AM, a user in Sydney logged “medium fries from Macca’s.” The system matched it to a US-sourced McDonald’s record. Wrong country, wrong portion, wrong calories. Nobody was awake.
By morning, there was a markdown file in our repo explaining exactly what happened: the branded fast path matched a US-centric database record because the system has no region awareness. The file included the trace ID, the database query that confirmed the mismatch, and a recommendation to add region-aware branded food matching to the backlog. It decided not to auto-fix because the user had already corrected the values manually.
The system that wrote that report has no pulse. It’s a Python daemon running on a MacBook Pro that polls our production database every three minutes, and when it finds something wrong, it spawns an autonomous AI agent to investigate and fix it.
We call them watchers.
What a Watcher Actually Is
A watcher is a Python script that runs as a background daemon. It polls a data source on a fixed interval, checks for new problems, and when it finds one, spawns a headless Claude Code process to investigate.
Not an API call. A full coding agent with access to the filesystem, the production database, observability traces, and the source code. It can read files, write fixes, run queries, commit to git, and update production data. It does this without a human in the loop.
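The skeleton is simple enough to sketch. Here is a minimal, illustrative version of a watcher's core loop (the function names are hypothetical, not the production code): poll, skip already-seen items, spawn one headless process per new problem, enforce a hard timeout.

```python
import subprocess
import time

def run_cycle(poll, investigate, seen: set) -> int:
    """One poll cycle: investigate each unseen item, then mark it seen."""
    handled = 0
    for item in poll():
        if item["id"] in seen:
            continue  # state tracking prevents re-investigation
        investigate(item)
        seen.add(item["id"])
        handled += 1
    return handled

def spawn_claude(prompt: str, timeout_s: int = 900) -> None:
    # Headless Claude Code invocation; a hung investigation is killed
    # at the timeout and the item is retried on a later cycle.
    subprocess.run(["claude", "-p", prompt], timeout=timeout_s, check=False)

def watch(poll, investigate, interval_s: int = 180) -> None:
    seen: set = set()
    while True:
        run_cycle(poll, investigate, seen)
        time.sleep(interval_s)
```

Everything interesting lives in what `investigate` is allowed to do, which is the subject of the constraint section below.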
We run five of them.
Bug Watcher polls every three minutes. It watches two tables: user-submitted bug reports and food feedback dislikes. When someone taps the thumbs-down on a food log and writes “this is way off,” the bug watcher picks it up, pulls the user’s chat logs, queries the food item and its change history, searches our observability platform for the processing trace, reads the relevant source code, and writes an investigation report. If the fix is straightforward (a bad reference ratio, a corrupted nutrition value), it fixes it in production and commits the change.
Accuracy Watcher runs every three hours. It fetches all food item corrections from the last window: every time a user edited their calories, changed a quantity, or deleted something we got wrong. It groups the corrections by food classification, data source, and type of change. When it spots a pattern, like a cluster of 30%+ calorie corrections on estimated foods from the same source, it traces the issue through our processing pipeline, identifies the upstream bug, and fixes it in the code.
Error Watcher runs every thirty minutes. It looks for food items stuck in an error state, meaning the system tried to process them and failed. It groups them by classification, investigates the failure via traces and code, and either recovers the items or fixes the code path that broke.
Fixer Watcher polls every three minutes. We have an automated fixer agent that attempts to correct food items when it detects issues. Sometimes the fixer itself fails. The fixer watcher investigates those failures, classifying whether the root cause was a code bug, a data issue, a model hallucination, a search gap, or a timeout. This is meta-debugging: an agent investigating why another agent failed.
GCloud Watcher runs every three hours. It pulls error logs from our three Cloud Run services (API, workers, scheduler) via gcloud logging read, classifies the errors with regex patterns, filters out known benign failures (expected auth rejections, user-deleted items, network timeouts), and spawns an investigation for anything new or frequent.
What Self-Healing Actually Looks Like
The phrase “self-healing” gets thrown around loosely. Here’s what it means in practice, with two real examples from the last 48 hours.
The scoop problem. The accuracy watcher’s 6 AM run analyzed 223 corrections from 38 users. It found that protein powder logged by weight was coming back at absurd values. A user logged 30 grams of protein powder and got 30 calories instead of 140.
The investigation traced it to our fast path formula for chef recipe ingredients. When a food is stored per-scoop (140 cal per 1 scoop, with a reference ratio indicating 30g per scoop) and a user logs it in grams, the formula was squaring the weight conversion. The physics filter caught the impossible values and fell back to the LLM, which underestimated by a different amount.
The watcher identified the exact function (try_chef_ingredient_fast_path in fast_paths.py), wrote a cross-unit detection guard, committed the fix, and noted that 253 foods in the database had this broken pattern. Total time from detection to committed fix: about twelve minutes. No human involved.
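The bug class is worth seeing in miniature. This is an illustrative reconstruction, not the actual `try_chef_ingredient_fast_path`: a food stored per-scoop must convert logged grams into scoops exactly once, and a guard in the spirit of the committed fix refuses the fast path when it cannot convert safely.

```python
def scoop_calories(logged_grams: float, cal_per_scoop: float,
                   grams_per_scoop: float) -> float:
    """Correct: convert grams -> scoops once, then scale calories."""
    scoops = logged_grams / grams_per_scoop
    return scoops * cal_per_scoop

def scoop_calories_double_converted(logged_grams, cal_per_scoop, grams_per_scoop):
    # One plausible shape of the double conversion: the per-scoop weight
    # ratio is applied a second time, deflating the result.
    scoops = logged_grams / grams_per_scoop
    return scoops * (cal_per_scoop / grams_per_scoop)

def fast_path_or_fallback(logged_qty, logged_unit, stored_unit,
                          cal_per_unit, grams_per_unit):
    # Cross-unit detection guard (hypothetical signature): handle the
    # convertible case explicitly, bail out for everything else.
    if logged_unit == stored_unit:
        return logged_qty * cal_per_unit
    if logged_unit == "g" and stored_unit == "scoop":
        return scoop_calories(logged_qty, cal_per_unit, grams_per_unit)
    return None  # unconvertible cross-unit case: fall back to the LLM path
```

With the sketch's numbers, 30 g of a 140-cal, 30 g-per-scoop powder correctly yields 140 calories, while the double-converted variant collapses to an impossibly low value, which is exactly the kind of output the physics filter was catching.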
The zucchini corruption. A user reported that 196g of zucchini was showing 98 calories, roughly 3x the correct value. The bug watcher investigated and found something worse than a simple estimation error.
Seven days earlier, a different user had over-corrected their zucchini entry twice in two minutes. Our user correction propagator, designed to learn from corrections and update shared food records, accepted both corrections. The master zucchini record went from 15 cal/100g (correct, per USDA data) to 50 cal/100g. For a full week, every user who logged zucchini got inflated values. About 301 food items across 165 users were affected.
The watcher traced the entire corruption timeline through five database events and three observability traces. It identified that the propagator’s safety guards only blocked large downward corrections (>80% reductions) but had no symmetric guard for upward corrections. It documented the affected user count, recommended a backfill script, and flagged the propagator design gap for human review.
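The missing guard is easy to express. A sketch of a symmetric version follows; the threshold is illustrative (the article only states that >80% reductions were blocked and upward moves were not), and the function name is hypothetical.

```python
def propagation_allowed(old_cal_per_100g: float, new_cal_per_100g: float,
                        max_ratio: float = 2.0) -> bool:
    """Reject corrections that move a shared record too far in EITHER direction."""
    if old_cal_per_100g <= 0 or new_cal_per_100g <= 0:
        return False
    ratio = new_cal_per_100g / old_cal_per_100g
    # The old behavior enforced only the lower bound (blocking >80%
    # reductions), so zucchini's 15 -> 50 cal/100g jump, a 3.3x increase,
    # propagated unchecked.
    return (1 / max_ratio) <= ratio <= max_ratio
```

With a symmetric bound, the over-correction that poisoned the master record for a week would have been rejected at the propagator.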
This is the distinction that matters. The watcher didn’t blindly auto-fix everything. It fixed the data that was safe to fix (the master food record had already been corrected by a subsequent user). It documented what needed human judgment (the backfill affecting 165 users). And it identified the systemic weakness (asymmetric propagation guards) that caused the problem in the first place.
The Constraint Design
Giving an AI agent production database access and git commit rights sounds reckless. The interesting engineering is in how you make it not reckless.
Each watcher spawns Claude Code in headless mode with a whitelist of allowed tools. The agent can read, edit, and write files. It can run specific shell commands: uv run, gcloud, git add, git diff, git commit. It can query our production database via Supabase and search observability traces via Logfire. It cannot delete tables, cannot force-push, cannot modify CI/CD pipelines, cannot access external APIs.
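Concretely, the constrained spawn looks something like this sketch. The tool patterns follow Claude Code's `--allowedTools` whitelist syntax; the exact flag spelling and pattern grammar may vary by CLI version, so treat the details as an approximation.

```python
import subprocess

# Everything not on this whitelist (force-push, table drops, CI config
# edits, arbitrary network calls) is simply unavailable to the agent.
ALLOWED_TOOLS = [
    "Read", "Edit", "Write",
    "Bash(uv run:*)", "Bash(gcloud:*)",
    "Bash(git add:*)", "Bash(git diff:*)", "Bash(git commit:*)",
]

def build_command(prompt: str) -> list[str]:
    return ["claude", "-p", prompt, "--allowedTools", ",".join(ALLOWED_TOOLS)]

def spawn_investigation(prompt: str, timeout_s: int) -> int:
    proc = subprocess.run(build_command(prompt), timeout=timeout_s, check=False)
    return proc.returncode
```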
The prompts are 200+ lines of markdown each. They aren’t vague instructions like “investigate bugs.” They’re structured investigation playbooks: which tables to query first, what SQL patterns to use, how to navigate the codebase, what qualifies as a safe auto-fix versus what requires human review. The prompt template is the real product. The daemon is just the delivery mechanism.
Timeouts prevent runaway investigations. Bug and fixer watchers get 15 minutes. Accuracy, error, and GCloud watchers get 30. If Claude hasn’t finished by then, the process is killed and the item is logged for the next cycle.
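The schedule and budgets reduce to one small configuration table. The intervals and timeouts below are the ones just described; the dict structure itself is an illustrative sketch, not the production config.

```python
MIN = 60  # seconds

WATCHERS = {
    "bug":      {"interval_s": 3 * MIN,   "timeout_s": 15 * MIN},
    "fixer":    {"interval_s": 3 * MIN,   "timeout_s": 15 * MIN},
    "error":    {"interval_s": 30 * MIN,  "timeout_s": 30 * MIN},
    "accuracy": {"interval_s": 180 * MIN, "timeout_s": 30 * MIN},
    "gcloud":   {"interval_s": 180 * MIN, "timeout_s": 30 * MIN},
}
```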
State tracking prevents re-investigation. Each watcher maintains a SQLite database of processed items. If the bug watcher already investigated report #4521, it won’t investigate it again on the next poll. Reset is simple: delete the SQLite file and everything from today gets re-examined.
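A minimal version of that state store, with an illustrative schema:

```python
import sqlite3

def open_state(path: str = "bug_watcher_state.db") -> sqlite3.Connection:
    # One row per investigated item; deleting the file resets the watcher.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        " item_id TEXT PRIMARY KEY,"
        " processed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def already_processed(conn: sqlite3.Connection, item_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM processed WHERE item_id = ?", (item_id,)
    ).fetchone()
    return row is not None

def mark_processed(conn: sqlite3.Connection, item_id: str) -> None:
    conn.execute(
        "INSERT OR IGNORE INTO processed (item_id) VALUES (?)", (item_id,)
    )
    conn.commit()
```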
Every investigation produces a markdown report saved to the repo. These aren’t just logs. They’re structured documents with severity classification, root cause analysis, affected code paths, evidence (trace IDs, SQL queries, before/after values), and a clear record of what was fixed and what was deferred. When I review them in the morning, I’m reading a finished investigation, not raw data.
The Prompt Is the Product
The difference between a watcher that’s useful and one that generates noise is entirely in the prompt template.
Our accuracy watcher prompt doesn’t just say “analyze these corrections.” It tells Claude to group changes by food classification and data source first. It explains our processing pipeline: completion agent, chef agent, reference agent, fast paths. It states our core principle: “fix upstream, not at the gateway.” It provides the SQL to pull Logfire traces by food item ID. It describes what a “large correction” means in context (30%+ calorie change) and how to distinguish between user preference changes and genuine data quality issues.
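Structurally, assembling such a prompt is ordinary templating. The sketch below mirrors the playbook sections described above; the wording is illustrative, not the real 200-line prompt.

```python
# Hypothetical excerpt of an accuracy-watcher prompt template.
ACCURACY_PROMPT = """\
## Task
Group the corrections below by food classification and data source first.

## Pipeline context
Processing order: completion agent -> chef agent -> reference agent -> fast paths.
Core principle: fix upstream, not at the gateway.

## Definitions
A "large correction" is a 30%+ calorie change. Distinguish user preference
changes from genuine data quality issues.

## Corrections
{corrections}
"""

def build_accuracy_prompt(corrections_block: str) -> str:
    return ACCURACY_PROMPT.format(corrections=corrections_block)
```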
We iterated on these prompts for weeks. Early versions produced investigations that were technically correct but useless. Claude would identify the wrong food record but not trace it back to the code that selected it. Or it would find the code bug but not check how many other foods were affected. Each failure mode became a new paragraph in the prompt.
The prompts encode institutional knowledge. They’re how we teach a new engineer to investigate a bug, except the engineer runs at 3 AM and doesn’t get tired.
The Economics
This runs on a Claude Max subscription. Every headless invocation is included. The bug watcher might spawn 15 investigations in a day. The accuracy watcher runs 8 times. The GCloud watcher, 8 times. All unlimited, fixed cost.
We explicitly strip the Anthropic API key from the environment before spawning each agent. This forces Claude Code to use the Max subscription rather than per-token billing. At the volume these watchers produce, per-token costs would be significant. Fixed-cost unlimited access makes the whole architecture viable.
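The key-stripping step is a one-liner worth showing (function name hypothetical): copy the environment, drop the API key, and pass the result to every spawn.

```python
import os
import subprocess

def subscription_env() -> dict:
    # With no ANTHROPIC_API_KEY present, Claude Code cannot take the
    # per-token billing path and uses the subscription login instead.
    env = os.environ.copy()
    env.pop("ANTHROPIC_API_KEY", None)
    return env

# usage sketch:
#   subprocess.run(["claude", "-p", prompt], env=subscription_env(), timeout=900)
```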
The alternative isn’t “don’t do this.” The alternative is doing it manually. Reading error logs over coffee. Querying the database by hand. Tracing through code to find why a food item came back at 3x the correct calories. The watchers don’t do anything a human can’t do. They do it at 2:47 AM, in twelve minutes, every time.
What Goes Wrong
The watchers aren’t magic. They have failure modes.
False confidence is the main risk. An agent that can write to production and commit code will occasionally be wrong with conviction. Our mitigation is the constrained tool whitelist and the investigation reports. Every action is documented. Every fix includes before/after values. When a watcher makes a bad call, the report makes it easy to find and revert.
Prompt drift is real. As the codebase evolves, the prompts need to reference current file paths and function names. A prompt that says “check completion_agent.py line 340” becomes misleading after a refactor. We treat prompts like code: they get updated alongside the source they reference.
Scope creep is tempting. When you have an agent that can fix production data, it’s easy to let it fix things that should require human judgment. The zucchini case is a good example. The watcher could have run the backfill across 165 users. It correctly chose not to, because that decision has blast radius. The prompt explicitly defines what “safe to auto-fix” means: single-record data quality issues with high-confidence evidence. Anything broader gets flagged.
What This Means for Production AI
Most of the conversation about AI agents focuses on capability: what they can do, how smart they are, which benchmarks they pass. The watchers taught us that the interesting problems are about constraint, trust, and feedback loops.
Constraint design matters more than capability. Claude Code can do a lot. The value of the watchers comes from limiting what it does. A tightly scoped agent that fixes serving ratios reliably is more valuable than an unconstrained agent that might fix anything.
Markdown reports are an underrated interface. The watchers produce human-readable investigation reports, not log files. This means the morning review takes five minutes instead of an hour. When something goes wrong, the reasoning is right there. Structured output for human review is as important as the automation itself.
Meta-debugging is a real pattern. The fixer watcher, an agent that investigates why another agent failed, sounds recursive. In practice, it’s one of the most useful watchers. Automated systems fail in automated ways. Having a second system that classifies those failures by root cause (code bug, data issue, model hallucination, timeout) turns noise into signal.
Start with your noisiest signal. We didn’t build five watchers on day one. We started with the bug watcher because user-reported issues were the most painful feedback loop. Once that worked, we noticed patterns in the data that led to the accuracy watcher. The error and GCloud watchers came from operational pain points that the first two watchers surfaced.
Headless CLI agents are production infrastructure. We treat claude -p the same way we treat a cron job or a background worker. It has a defined input, constrained permissions, a timeout, state tracking, and structured output. The fact that the “worker” is a language model instead of a Python function is an implementation detail.
The watchers run every day. Most of the time, they find nothing interesting. But when a serving formula silently breaks at 6 AM and gets fixed by 6:12 AM, before any user notices, that’s not a demo. That’s infrastructure.

