Simulating Months of Coaching in Minutes
Testing AI behavior before it reaches real users
I built something fun this week: a simulation that stress-tests our AI nutrition coach across different user types over weeks of interactions. Compressed into minutes.
At Alma, we’ve learned there’s no universal approach to coaching. Some users want data and optimization. Others need encouragement and low-pressure nudges. The promise of AI is adapting to each one. But how do you know if it actually does?
The Setup
Six synthetic users, each representing a pattern from our real data. The Optimizer who logs religiously and asks technical questions. The Struggler who engages sporadically and needs celebration over metrics. The Weekend Warrior whose tracking falls apart every Saturday.
Each persona interacts with our coach daily. Logs meals. Receives weekly check-ins. Responds based on their personality. Four simulated weeks, about 25 minutes of runtime for all six personas.
What Broke
The coach was excellent at pattern recognition. It spotted that one user was eating 500 calories under target. Mentioned it week one. And week two. And week three. Same observation, no escalation.
It could see the problem. It never pushed to solve it.
Another user had a weight loss goal but muscle gain in their profile. The coach noticed the tension. Mentioned it repeatedly. Never resolved it.
Classic middle-manager behavior. Flag issues, don’t own solutions.
The Fix
The changes weren’t smarter models or better prompts. They were operational:
If the same issue appears two weeks in a row, change your approach. Don’t observe—propose action.
If goal and profile contradict for more than two weeks, force the conversation.
Escalation logic. The kind of thing you’d put in a manager’s expectations, not a software spec.
Why This Matters
When building products involving psychology and behavior, you used to ship and wait. Run A/B tests. Iterate slowly.
Now you can build synthetic users from real patterns and watch months unfold before anything goes live. The simulation won’t catch everything. But it catches obvious issues that would take weeks of production data to surface.
If you’re building something similar, I’d be curious to compare notes.
Technical Details
For those interested in building something similar, here’s how the system works under the hood.
The Stack
Coach Agent: Built with PydanticAI, a framework for building production AI agents with type safety and dependency injection. The agent has ~40 tools for reading user data, adjusting goals, scheduling check-ins, and generating personalized content. A stripped-down sketch of the agent setup follows this stack rundown.
Coach Model: Claude Opus 4.5 via AWS Bedrock. We chose Opus for the weekly reviews because they require nuanced pattern recognition across days of data and personalized communication that matches user preferences. The system prompt is around 1,800 lines covering tone, escalation rules, tool usage, and examples.
User Simulator: Claude Sonnet 4.5 via the Anthropic API. Generates realistic food logs and check-in responses based on persona configurations. Sonnet is fast enough for the volume of interactions and good enough at roleplaying consistent personalities.
Database: PostgreSQL via Supabase. The simulation writes real records (users, meals, food items, goals, streaks) so the coach agent queries actual data, not mocks.
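Here’s that sketch. The dependency type, tool, database handle, and model string below are illustrative stand-ins rather than our actual code; the real agent wires ~40 tools into the same pattern.

from dataclasses import dataclass
from typing import Any
from uuid import UUID

from pydantic_ai import Agent, RunContext

# Illustrative only: CoachDeps, the db handle, and get_recent_meals are stand-ins.
@dataclass
class CoachDeps:
    user_id: UUID
    db: Any  # whatever handle the tools use to reach Postgres

coach_agent = Agent(
    "bedrock:anthropic.claude-opus-4-5",  # placeholder; use your Bedrock model id
    deps_type=CoachDeps,
    system_prompt="You are a nutrition coach...",  # the real prompt is ~1,800 lines
)

@coach_agent.tool
async def get_recent_meals(ctx: RunContext[CoachDeps], days: int = 7) -> list[dict]:
    """Read the user's recent meal logs so the coach can spot patterns."""
    return await ctx.deps.db.fetch_meals(ctx.deps.user_id, days=days)

The dependency injection is also what makes simulation cheap: point the deps at a synthetic test user and the same agent runs unchanged against the simulated data.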
Persona Architecture
Each persona is a dataclass with ~20 parameters:
from dataclasses import dataclass

@dataclass
class PersonaConfig:
    name: str
    nutrition_experience: NutritionExperience  # just_starting, some_experience, knowledgeable
    feedback_style: FeedbackStyle  # celebrate_wins, straight_to_point, deep_dives
    engagement_level: float  # 0.0-1.0, probability of logging any given day
    weekend_drop: float  # how much engagement drops on weekends
    target_adherence: float  # how close they stick to calorie goals
    response_rate: float  # probability of responding to coach check-ins
    question_style: str  # technical, emotional, practical
    # ... etc
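For illustration, a Weekend Warrior might be configured something like this. The values and enum member names are made up for the example, not the tuned production settings.

# Hypothetical values for illustration only.
WEEKEND_WARRIOR = PersonaConfig(
    name="Weekend Warrior",
    nutrition_experience=NutritionExperience.SOME_EXPERIENCE,
    feedback_style=FeedbackStyle.STRAIGHT_TO_POINT,
    engagement_level=0.8,  # logs most weekdays
    weekend_drop=0.6,  # engagement craters on Saturday and Sunday
    target_adherence=0.7,
    response_rate=0.75,
    question_style="practical",
)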
The personas were derived from analyzing coaching preferences and engagement patterns in our production database. We queried users by feedback style, looked at their logging consistency, and built archetypes from the clusters.
Simulation Flow
1. Create a test user in the database with the persona’s demographics and preferences
2. For each simulated week:
   - Generate 7 days of food logs based on the persona’s engagement and eating patterns
   - Insert meals and food items into the database
   - Run the coach’s run_weekly_review() with a reference_date parameter (this was key: without it, the coach uses date.today() and all the date math breaks)
   - Fetch the coach’s outputs: review notes, scheduled check-ins, focus messages
   - Use Sonnet to generate user responses to each check-in based on the persona (sketched after this list)
3. Generate a markdown report with all interactions
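The check-in responses are where the personas come to life. Here’s a rough sketch of what that Sonnet call might look like with the Anthropic SDK; the helper name, model id, and prompt wording are placeholders, not our simulator code.

import random

import anthropic

client = anthropic.Anthropic()  # the simulator talks to the Anthropic API directly

def simulate_checkin_response(persona: PersonaConfig, checkin_text: str) -> str | None:
    """Have Sonnet answer a coach check-in in character, or skip it entirely."""
    if random.random() > persona.response_rate:
        return None  # some personas ignore some check-ins
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=300,
        system=(
            f"You are roleplaying {persona.name}, a user of a nutrition app. "
            f"Experience level: {persona.nutrition_experience.value}. "
            f"Preferred feedback style: {persona.feedback_style.value}. "
            f"You ask {persona.question_style} questions. "
            "Stay in character and reply in one short, casual paragraph."
        ),
        messages=[{"role": "user", "content": checkin_text}],
    )
    return message.content[0].text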
The Date Problem
The trickiest part was time simulation. Our coach agent calls date.today() in about 15 places—for calculating week boundaries, scheduling check-ins, querying recent meals.
We added a reference_date parameter that propagates through the system:
from datetime import date
from typing import Optional
from uuid import UUID

async def run_weekly_review(
    self,
    user_id: UUID,
    reference_date: Optional[date] = None,  # For simulations
) -> CoachResult:
    today = reference_date or date.today()
    # ... all date calculations use `today`
This lets you run a “weekly review” for any arbitrary date, and the coach sees the world as if that date were today.
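In the simulation driver, that just means advancing the reference date once per loop. A minimal sketch, with the coach object, user id, and start date as placeholders:

from datetime import date, timedelta
from uuid import UUID

# Hypothetical driver; the real loop also generates food logs and user responses.
async def simulate_month(coach, user_id: UUID, start: date = date(2025, 1, 6)):
    results = []
    for week in range(4):
        ref = start + timedelta(weeks=week)
        # The coach treats `ref` as today, so week boundaries and "recent meals"
        # queries line up with the simulated calendar instead of the real one.
        results.append(await coach.run_weekly_review(user_id=user_id, reference_date=ref))
    return results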
Runtime
4 weeks × 1 persona ≈ 4-5 minutes (mostly waiting on Opus)
4 weeks × 6 personas ≈ 25 minutes
Cost: roughly $3-5 per full simulation run
Output
Each simulation produces:
A JSON file with raw data (every meal, every check-in, every response)
A markdown report summarizing each week’s interactions
Aggregate metrics: engagement rate, streak length, check-in response rate
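Those aggregate numbers fall out of a simple pass over the simulated days. A sketch with made-up record shapes (SimDay is not the real report schema):

from dataclasses import dataclass

@dataclass
class SimDay:
    meals_logged: int
    checkins_received: int
    checkins_answered: int

def summarize(days: list[SimDay]) -> dict:
    """Aggregate metrics reported at the end of a simulated month."""
    logged = [d.meals_logged > 0 for d in days]
    streak = best = 0
    for hit in logged:
        streak = streak + 1 if hit else 0
        best = max(best, streak)
    received = sum(d.checkins_received for d in days)
    answered = sum(d.checkins_answered for d in days)
    return {
        "engagement_rate": sum(logged) / len(days),
        "longest_streak": best,
        "checkin_response_rate": answered / received if received else 0.0,
    }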
The reports made it easy to spot patterns across personas. The Struggler got gentler messaging (good). The Optimizer got the same undereating observation four weeks in a row (bad).
What’s Next
The current system is manual; you run it when you want to test something. The interesting evolution is continuous simulation: a background process that periodically generates synthetic interactions based on recent user patterns and flags drift in coach behavior before real users experience it.
If you’re building agent simulations and want to compare approaches, reach out.

