Simulating Months of Coaching in Minutes
Testing AI behavior before it reaches real users
I built something fun this week: a simulation that stress-tests our AI nutrition coach across different user types over weeks of interactions. Compressed into minutes.
At Alma, we’ve learned there’s no universal approach to coaching. Some users want data and optimization. Others need encouragement and low-pressure nudges. The promise of AI is adapting to each one. But how do you know if it actually does?
The Setup
Six synthetic users, each representing a pattern from our real data. The Optimizer who logs religiously and asks technical questions. The Struggler who engages sporadically and needs celebration over metrics. The Weekend Warrior whose tracking falls apart every Saturday.
Each persona interacts with our coach daily. Logs meals. Receives weekly check-ins. Responds based on their personality. Four simulated weeks, about 25 minutes of runtime for all six personas.
What Broke
The coach was excellent at pattern recognition. It spotted that one user was eating 500 calories under target. Mentioned it week one. And week two. And week three. Same observation, no escalation.
It could see the problem. It never pushed to solve it.
Another user had a weight loss goal but muscle gain in their profile. The coach noticed the tension. Mentioned it repeatedly. Never resolved it.
Classic middle-manager behavior. Flag issues, don’t own solutions.
The Fix
The changes weren’t smarter models or better prompts. They were operational:
If the same issue appears two weeks in a row, change your approach. Don’t observe—propose action.
If goal and profile contradict for more than two weeks, force the conversation.
Escalation logic. The kind of thing you’d put in a manager’s expectations, not a software spec.
Why This Matters
When building products involving psychology and behavior, you used to ship and wait. Run A/B tests. Iterate slowly.
Now you can build synthetic users from real patterns and watch months unfold before anything goes live. The simulation won’t catch everything. But it catches obvious issues that would take weeks of production data to surface.
If you’re building something similar, I’d be curious to compare notes.
Technical Details
For those interested in building something similar, here’s how the system works under the hood.
The Stack
Coach Agent: Built with PydanticAI, a framework for building production AI agents with type safety and dependency injection. The agent has ~40 tools for reading user data, adjusting goals, scheduling check-ins, and generating personalized content. A stripped-down sketch of the agent setup follows this stack rundown.
Coach Model: Claude Opus 4.5 via AWS Bedrock. We chose Opus for the weekly reviews because they require nuanced pattern recognition across days of data and personalized communication that matches user preferences. The system prompt is around 1,800 lines covering tone, escalation rules, tool usage, and examples.
User Simulator: Claude Sonnet 4.5 via the Anthropic API. Generates realistic food logs and check-in responses based on persona configurations. Sonnet is fast enough for the volume of interactions and good enough at roleplaying consistent personalities.
Database: PostgreSQL via Supabase. The simulation writes real records (users, meals, food items, goals, streaks) so the coach agent queries actual data, not mocks.
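Here’s that sketch. The dependency type, tool, database handle, and model string below are illustrative stand-ins rather than our actual code; the real agent wires ~40 tools into the same pattern.

from dataclasses import dataclass
from typing import Any
from uuid import UUID

from pydantic_ai import Agent, RunContext

# Illustrative only: CoachDeps, the db handle, and get_recent_meals are stand-ins.
@dataclass
class CoachDeps:
    user_id: UUID
    db: Any  # whatever handle the tools use to reach Postgres

coach_agent = Agent(
    "bedrock:anthropic.claude-opus-4-5",  # placeholder; use your Bedrock model id
    deps_type=CoachDeps,
    system_prompt="You are a nutrition coach...",  # the real prompt is ~1,800 lines
)

@coach_agent.tool
async def get_recent_meals(ctx: RunContext[CoachDeps], days: int = 7) -> list[dict]:
    """Read the user's recent meal logs so the coach can spot patterns."""
    return await ctx.deps.db.fetch_meals(ctx.deps.user_id, days=days)

The dependency injection is also what makes simulation cheap: point the deps at a synthetic test user and the same agent runs unchanged against the simulated data.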
Persona Architecture
Each persona is a dataclass with ~20 parameters:
from dataclasses import dataclass

@dataclass
class PersonaConfig:
    name: str
    nutrition_experience: NutritionExperience  # just_starting, some_experience, knowledgeable
    feedback_style: FeedbackStyle  # celebrate_wins, straight_to_point, deep_dives
    engagement_level: float  # 0.0-1.0, probability of logging any given day
    weekend_drop: float  # how much engagement drops on weekends
    target_adherence: float  # how close they stick to calorie goals
    response_rate: float  # probability of responding to coach check-ins
    question_style: str  # technical, emotional, practical
    # ... etc
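For illustration, a Weekend Warrior might be configured something like this. The values and enum member names are made up for the example, not the tuned production settings.

# Hypothetical values for illustration only.
WEEKEND_WARRIOR = PersonaConfig(
    name="Weekend Warrior",
    nutrition_experience=NutritionExperience.SOME_EXPERIENCE,
    feedback_style=FeedbackStyle.STRAIGHT_TO_POINT,
    engagement_level=0.8,  # logs most weekdays
    weekend_drop=0.6,  # engagement craters on Saturday and Sunday
    target_adherence=0.7,
    response_rate=0.75,
    question_style="practical",
)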
The personas were derived from analyzing coaching preferences and engagement patterns in our production database. We queried users by feedback style, looked at their logging consistency, and built archetypes from the clusters.
Simulation Flow
1. Create a test user in the database with the persona’s demographics and preferences
2. For each simulated week:
   - Generate 7 days of food logs based on the persona’s engagement and eating patterns
   - Insert meals and food items into the database
   - Run the coach’s run_weekly_review() with a reference_date parameter (this was key: without it, the coach uses date.today() and all the date math breaks)
   - Fetch the coach’s outputs: review notes, scheduled check-ins, focus messages
   - Use Sonnet to generate user responses to each check-in based on the persona (sketched after this list)
3. Generate a markdown report with all interactions
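The check-in responses are where the personas come to life. Here’s a rough sketch of what that Sonnet call might look like with the Anthropic SDK; the helper name, model id, and prompt wording are placeholders, not our simulator code.

import random

import anthropic

client = anthropic.Anthropic()  # the simulator talks to the Anthropic API directly

def simulate_checkin_response(persona: PersonaConfig, checkin_text: str) -> str | None:
    """Have Sonnet answer a coach check-in in character, or skip it entirely."""
    if random.random() > persona.response_rate:
        return None  # some personas ignore some check-ins
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=300,
        system=(
            f"You are roleplaying {persona.name}, a user of a nutrition app. "
            f"Experience level: {persona.nutrition_experience.value}. "
            f"Preferred feedback style: {persona.feedback_style.value}. "
            f"You ask {persona.question_style} questions. "
            "Stay in character and reply in one short, casual paragraph."
        ),
        messages=[{"role": "user", "content": checkin_text}],
    )
    return message.content[0].text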
The Date Problem
The trickiest part was time simulation. Our coach agent calls date.today() in about 15 places—for calculating week boundaries, scheduling check-ins, querying recent meals.
We added a reference_date parameter that propagates through the system:
from datetime import date
from typing import Optional
from uuid import UUID

async def run_weekly_review(
    self,
    user_id: UUID,
    reference_date: Optional[date] = None,  # For simulations
) -> CoachResult:
    today = reference_date or date.today()
    # ... all date calculations use `today`
This lets you run a “weekly review” for any arbitrary date, and the coach sees the world as if that date were today.
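In the simulation driver, that just means advancing the reference date once per loop. A minimal sketch, with the coach object, user id, and start date as placeholders:

from datetime import date, timedelta
from uuid import UUID

# Hypothetical driver; the real loop also generates food logs and user responses.
async def simulate_month(coach, user_id: UUID, start: date = date(2025, 1, 6)):
    results = []
    for week in range(4):
        ref = start + timedelta(weeks=week)
        # The coach treats `ref` as today, so week boundaries and "recent meals"
        # queries line up with the simulated calendar instead of the real one.
        results.append(await coach.run_weekly_review(user_id=user_id, reference_date=ref))
    return results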
Runtime
4 weeks × 1 persona ≈ 4-5 minutes (mostly waiting on Opus)
4 weeks × 6 personas ≈ 25 minutes
Cost: roughly $3-5 per full simulation run
Output
Each simulation produces:
A JSON file with raw data (every meal, every check-in, every response)
A markdown report summarizing each week’s interactions
Aggregate metrics: engagement rate, streak length, check-in response rate
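Those aggregate numbers fall out of a simple pass over the simulated days. A sketch with made-up record shapes (SimDay is not the real report schema):

from dataclasses import dataclass

@dataclass
class SimDay:
    meals_logged: int
    checkins_received: int
    checkins_answered: int

def summarize(days: list[SimDay]) -> dict:
    """Aggregate metrics reported at the end of a simulated month."""
    logged = [d.meals_logged > 0 for d in days]
    streak = best = 0
    for hit in logged:
        streak = streak + 1 if hit else 0
        best = max(best, streak)
    received = sum(d.checkins_received for d in days)
    answered = sum(d.checkins_answered for d in days)
    return {
        "engagement_rate": sum(logged) / len(days),
        "longest_streak": best,
        "checkin_response_rate": answered / received if received else 0.0,
    }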
The reports made it easy to spot patterns across personas. The Struggler got gentler messaging (good). The Optimizer got the same undereating observation four weeks in a row (bad).
What’s Next
The current system is manual; you run it when you want to test something. The interesting evolution is continuous simulation: a background process that periodically generates synthetic interactions based on recent user patterns and flags drift in coach behavior before real users experience it.
If you’re building agent simulations and want to compare approaches, reach out.

