Shipping a Multi-Agent System to Production Without Breaking Everything
How we rebuilt our core food logging system and rolled it out to 100% of users in two months
At Alma, we’re on a mission to build the best AI nutrition coach. But a coach that doesn’t know what you’re eating is just a chatbot with opinions. Tracking is the foundation, and last year, we hit a wall.
Users loved the natural language experience. “Two eggs and a banana for breakfast” beats tapping through databases. But as we scaled, the cracks showed. The same user might say “my usual smoothie” on Monday and “the green one I had yesterday” on Tuesday. They’d track “chicken stir fry” and expect Alma to remember the exact recipe they described three weeks ago. They wanted a system that felt like a personal nutritionist who’d been working with them for years.
Our original system couldn’t do this. It was good at parsing text into food items. It wasn’t good at understanding context, history, or the messy reality of how people actually talk about food.
So we rebuilt it from scratch. This post isn’t about the architecture of the new system. It’s about how we shipped it without breaking everything.
The Problem with Shipping AI
We were changing the core flow in our app: food logging. We couldn’t just flip a switch.
The challenge: how do you know if a fuzzy, subjective system is actually better? Users don’t file bug reports when their chicken breast shows 165 calories instead of 185. They just quietly lose trust.
Our north star was correction-free logs: the percentage of tracked foods where users made no changes. If someone logs “chicken breast” and doesn’t edit the result, we assume we got it right.
Obviously 100% is impossible. People change their minds. They realize they had 6 ounces, not 4. But the rate tells us something, and the types of corrections tell us more.
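For the curious, the metric itself is trivial to compute once each log records whether it was later edited. Here’s a rough sketch; the FoodLog shape and field names are illustrative, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class FoodLog:
    user_id: str
    item: str
    was_edited: bool  # did the user change anything after logging?

def correction_free_rate(logs: list[FoodLog]) -> float:
    """Share of logs the user never touched after tracking."""
    if not logs:
        return 0.0
    untouched = sum(1 for log in logs if not log.was_edited)
    return untouched / len(logs)

# Example: 3 of 4 logs left untouched -> 75% correction-free
logs = [
    FoodLog("u1", "chicken breast", was_edited=False),
    FoodLog("u1", "banana", was_edited=False),
    FoodLog("u2", "green smoothie", was_edited=True),
    FoodLog("u2", "oatmeal", was_edited=False),
]
print(f"{correction_free_rate(logs):.0%}")  # 75%
```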
Phase 1: The Test Kitchen
Before any real users saw the new system, we built a “Test Kitchen” feature flag. Internal team members and a handful of early supporters could opt into the new system while everyone else stayed on the old one.
We used it obsessively. Every meal, every day. When something broke (and things broke horribly in the early days), we caught it before it touched real users.
The bugs we found in Test Kitchen would have been disasters in production:
Cloning the wrong meal when users had similar-named items
Creating recipes with absurd ingredient quantities (2000g of salt)
Unit conversion failures that turned a cup of rice into a kilogram
Each bug became a test case. By the time we started the real rollout, we had hundreds of edge cases covered.
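To give a flavor of those regression tests, here’s a simplified sketch (runnable with pytest); the conversion helper, thresholds, and function names are stand-ins for the real pipeline, not a copy of it:

```python
# Hypothetical helpers standing in for the real parsing pipeline.
CUP_TO_GRAMS = {"rice, uncooked": 185.0}
SANE_MAX_GRAMS = {"salt": 100.0}  # per-recipe ceiling before we flag

def to_grams(food: str, quantity: float, unit: str) -> float:
    """Convert a household measure to grams."""
    if unit == "g":
        return quantity
    if unit == "cup":
        return quantity * CUP_TO_GRAMS[food]
    raise ValueError(f"unsupported unit: {unit}")

def is_suspicious_quantity(food: str, grams: float) -> bool:
    """True when an ingredient amount is outside any plausible range."""
    return grams > SANE_MAX_GRAMS.get(food, 2000.0)

def test_cup_of_rice_is_not_a_kilogram():
    # Regression for the unit-conversion bug: 1 cup of dry rice is ~185 g.
    assert 150 <= to_grams("rice, uncooked", 1, "cup") <= 220

def test_two_kilograms_of_salt_is_flagged():
    # Regression for the absurd-recipe bug (2000 g of salt).
    assert is_suspicious_quantity("salt", 2000.0)
    assert not is_suspicious_quantity("salt", 5.0)
```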
Phase 2: The Gradual Rollout
Once Test Kitchen felt stable, we started routing production traffic:
Week 1: 10% of requests to the new system
Week 2: 25%
Week 3: 50%
Week 4: 75%
Week 5: 100%
At each stage, we monitored obsessively. We built an in-app feedback prompt whose frequency we could dial up or down: after any food log, users might see a simple thumbs up/down toast. We kept the frequency low enough to avoid annoyance but high enough for statistical significance.
The key was having kill switches at every level. If something went wrong, we could dial back to 0% in seconds.
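The routing itself doesn’t need to be clever. Here’s roughly the shape of a deterministic percentage router with a kill switch; the names and the idea of reading the percentage from remote config are placeholders, not a dump of our infrastructure:

```python
import hashlib

# In production this value comes from a remotely configurable flag,
# so it can be dialed from 0 to 100 (or back to 0) without a deploy.
ROLLOUT_PERCENT = 25

def use_new_logging(user_id: str, rollout_percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically bucket a user into the new system via a stable hash."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99, stable per user across requests
    return bucket < rollout_percent

# Kill switch: routing everything back to the old system is one config change.
assert not use_new_logging("any-user", rollout_percent=0)
```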
Phase 3: The Automated Auditor
This is where it gets interesting.
We built a cron job that runs every three hours. It uses Anthropic’s Claude Opus 4.5 to audit everything that happened since the last run:
New food records: Identifies impossible nutrition values (>900 cal/100g for non-fats), all-zero entries, macro-to-calorie mismatches, and data entry typos
Serving data: Catches decimal placement errors where 118g gets entered as 1.18g (a 100x error that would destroy calorie accuracy)
User corrections: Analyzes large edits to understand if users are fixing legitimate data issues that should propagate back to source data
A/B test comparison: Tracks performance differences between the old and new systems during the rollout
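A simplified sketch of what those checks look like. The >900 cal/100g cutoff and the 100x decimal heuristic come from the list above; the record shape and exact tolerances are illustrative:

```python
from dataclasses import dataclass

@dataclass
class FoodRecord:
    name: str
    calories_per_100g: float
    protein_g: float   # per 100g
    carbs_g: float     # per 100g
    fat_g: float       # per 100g

def check_impossible_calories(record: FoodRecord, is_pure_fat: bool = False) -> bool:
    """Flag non-fat foods claiming more than ~900 kcal per 100g."""
    return not is_pure_fat and record.calories_per_100g > 900

def check_macro_calorie_mismatch(record: FoodRecord, tolerance: float = 0.25) -> bool:
    """Flag records whose macros (4/4/9 kcal per gram) don't roughly add up to the stated calories."""
    implied = record.protein_g * 4 + record.carbs_g * 4 + record.fat_g * 9
    if implied == 0:
        return record.calories_per_100g > 0  # all-zero macros but nonzero calories
    return abs(implied - record.calories_per_100g) / implied > tolerance

def check_decimal_placement(entered_grams: float, typical_grams: float) -> bool:
    """Flag servings that look ~100x too small, e.g. 118g entered as 1.18g."""
    if entered_grams <= 0:
        return True
    ratio = typical_grams / entered_grams
    return 50 <= ratio <= 200  # roughly a two-decimal-place slip

suspicious = FoodRecord("mystery bar", calories_per_100g=950, protein_g=5, carbs_g=20, fat_g=3)
print(check_impossible_calories(suspicious))      # True
print(check_macro_calorie_mismatch(suspicious))   # True: macros imply ~127 kcal, not 950
print(check_decimal_placement(1.18, 118.0))       # True
```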
The auditor doesn’t just report. It acts. High-confidence issues (>90% confidence) get auto-fixed. Questionable ones get flagged for human review. Every Monday, I get a summary email with the week’s findings.
Here’s what the decision model looks like:
keep: Food data is correct
soft_delete: Food data is clearly wrong and should be removed
fix: Food data can be corrected automatically
review: Uncertain, needs human verification
The conservative thresholds matter. Auto-delete only happens at 95% confidence. We’d rather flag something for review than silently corrupt user data.
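Put together, the decision logic is roughly this. The 0.90 and 0.95 thresholds mirror the ones above; everything else is a simplified stand-in for the real auditor, whose confidence scores come from its own assessment of each issue:

```python
from enum import Enum

class Action(str, Enum):
    KEEP = "keep"
    SOFT_DELETE = "soft_delete"
    FIX = "fix"
    REVIEW = "review"

def decide(issue_found: bool, deletable: bool, fixable: bool, confidence: float) -> Action:
    """Map an audited record to an action, erring toward human review."""
    if not issue_found:
        return Action.KEEP
    if deletable and confidence >= 0.95:   # auto-delete only at very high confidence
        return Action.SOFT_DELETE
    if fixable and confidence >= 0.90:     # auto-fix threshold
        return Action.FIX
    return Action.REVIEW                   # anything uncertain goes to a human

print(decide(issue_found=True, deletable=False, fixable=True, confidence=0.93))  # Action.FIX
print(decide(issue_found=True, deletable=True, fixable=False, confidence=0.80))  # Action.REVIEW
```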
What the Auditor Caught
During the rollout, the auditor surfaced issues we never would have found manually:
Dry vs. cooked values: The system was sometimes returning nutrition for dry rice instead of cooked rice. 350 calories vs 130 calories for the same portion. The auditor detected the pattern and we fixed the underlying logic.
Decimal drift in servings: A batch of food servings had reference ratios off by 10x due to a unit conversion bug. The auditor caught it within 3 hours and auto-fixed 47 records before users noticed.
Systematic overcounting: One edge case in recipe handling was consistently adding 15-20% extra calories. The auditor’s calorie distribution analysis flagged the anomaly.
The automation wasn’t about removing humans from the loop. It was about making sure humans saw the right things at the right time. I didn’t need to manually review 10,000 food logs. I needed to see the 12 that looked suspicious.
The Feedback Loop
Thumbs-down feedback with text comments went straight to my inbox. Not to a queue. Not to a dashboard I’d check weekly. My inbox.
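There’s no clever infrastructure behind that. Conceptually it’s just this, with the addresses and SMTP relay as placeholders:

```python
import smtplib
from email.message import EmailMessage

def forward_negative_feedback(user_id: str, comment: str,
                              to_addr: str = "founder@example.com") -> None:
    """Send a thumbs-down comment straight to a human inbox, not a dashboard."""
    msg = EmailMessage()
    msg["Subject"] = f"Thumbs-down from {user_id}"
    msg["From"] = "feedback@example.com"
    msg["To"] = to_addr
    msg.set_content(comment)
    with smtplib.SMTP("localhost") as smtp:  # placeholder SMTP relay
        smtp.send_message(msg)

# Called from the feedback handler only when the rating is negative and a comment exists.
```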
This created urgency. When someone wrote “this is way off, I had a small salad not 800 calories,” I saw it within minutes. Often I could trace the issue, fix it, and deploy before the user’s next meal.
The tight loop changed how we thought about bugs. They weren’t tickets to be prioritized. They were problems affecting real people right now.
The Numbers
The whole process, from first commit to 100% rollout, took about two months. In my previous life at larger companies, a change to a core flow like this would have been a multi-quarter effort.
What made it fast:
Dogfooding: Using it ourselves daily meant we felt problems before we measured them
Automated auditing: The three-hour cycle meant issues surfaced in hours, not days
Gradual rollout with kill switches: We could dial back instantly if something went wrong
Tight feedback loops: From a thumbs down landing in my inbox to a deployed fix, often within hours
The new system now handles 100% of food logging. Correction-free rates are up. More importantly, the types of queries that used to fail (temporal references, complex recipes, modifications to previous meals) now work.
Lessons for Shipping AI
If you’re building AI systems that touch user data:
Instrument everything from day one. We couldn’t have done the gradual rollout without feature flags that let us route traffic precisely. Build this infrastructure before you need it.
Automate the audit, not the judgment. The LLM auditor surfaces issues. Humans decide what to do about edge cases. The 90% confidence threshold for auto-fix exists because we’d rather be slow and right than fast and wrong.
Make feedback painful. Routing complaints to my inbox instead of a dashboard created accountability. When you feel the friction of user problems, you fix them faster.
Correction-free rate is a lagging indicator. By the time you see it drop, users have already had bad experiences. Watch leading indicators: error rates, unusual values, processing times.
Ship incrementally, monitor obsessively. 10% traffic for a week tells you more than a month of internal testing. Real users do things you never imagined.
The auditor cron job is still running. Every three hours, it checks our work. Most runs find nothing. But when they find something, we know within hours instead of weeks.
That’s the difference between shipping AI and shipping AI responsibly.

