Anne-bot
Scaling Expert Mentorship Through AI—
Without Replacing Human Judgment
A 2-week proof of concept exploring whether AI could reduce PRD review cycles by codifying director-level coaching patterns—and discovering that the bottleneck wasn't tooling but foundational alignment on what a strong problem statement actually is.
Overview
Anne-bot is a conversational AI prototype designed to scale director-level PRD coaching, helping Product Managers strengthen problem statements and arrive at reviews better prepared—without replacing human judgment, strategic oversight, or organizational context.
Instead of auto-generating answers, Anne-bot prompts PMs with Socratic questions that clarify assumptions, reveal gaps, and improve reasoning before formal review sessions—mirroring the coaching style used by Digital Cabin leadership.
The Stakes
PRD reviews were thoughtful but slow. Every document required director approval, creating a quality-vs-velocity tension: leadership wanted to move faster, but not at the expense of product rigor or strategic clarity.
The opportunity wasn't automation—it was scaling mentorship to prepare PMs before reviews, raising the floor so director time could focus on strategic alignment rather than foundational fixes.
QUICK FACTS
Role: Product Manager & AI Practitioner (Solo)
Duration: 2 weeks (Proof of Concept)
Validation: Tested with 6 PMs across 10+ PRDs
Platform: Internal Ford LLM + Gemini
Key Finding: Teaching > Automating for sustainable impact
Status: POC validated feasibility; surfaced need for foundational scaffolding before production investment
My Role
Product Manager & AI Practitioner | Solo Project
Led discovery, strategy, prototyping, evaluation, and iteration across a 2-week exploratory proof of concept.
What I Did
Discovery & Strategy
Analyzed transcripts, meeting notes, Slack discussions, and PRDs to extract the coaching patterns of Anne, the director who led PRD reviews.
Conducted 6 PM interviews and observed live review sessions.
Framed opportunity as scaling mentorship (not automating reviews) and defined riskiest assumptions to test.
AI Prototyping & Evaluation
Built iterative GPT prototypes mirroring Anne's reasoning, tone, and Socratic questioning style.
Tested against real PRDs, measuring usefulness, clarity, tone, and trust.
Developed evaluation methodology comparing AI feedback quality to human coaching.
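To make the comparison between AI feedback and human coaching concrete, below is a minimal sketch of the rubric-style scoring involved. The dimensions (usefulness, clarity, tone, trust) come from the testing above; the data structures, helper, and numbers are illustrative rather than the actual evaluation tooling.

```python
from dataclasses import dataclass
from statistics import mean

# Dimensions taken from the evaluation above: usefulness, clarity, tone, trust.
DIMENSIONS = ("usefulness", "clarity", "tone", "trust")

@dataclass
class FeedbackRating:
    """One PM's 1-5 ratings of a single piece of feedback on one PRD."""
    source: str      # "anne_bot" or "human_coach"
    prd_id: str
    scores: dict     # dimension name -> rating (1-5)

def summarize(ratings, source):
    """Average each rubric dimension across all ratings from one source."""
    rows = [r for r in ratings if r.source == source]
    return {d: round(mean(r.scores[d] for r in rows), 2) for d in DIMENSIONS}

# Illustrative usage with made-up scores:
ratings = [
    FeedbackRating("anne_bot", "prd-01", {"usefulness": 4, "clarity": 5, "tone": 4, "trust": 3}),
    FeedbackRating("human_coach", "prd-01", {"usefulness": 5, "clarity": 4, "tone": 5, "trust": 5}),
]
print(summarize(ratings, "anne_bot"))     # compare side by side with...
print(summarize(ratings, "human_coach"))  # ...the human-coaching baseline
```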
Strategic Pivot
Identified structural bottleneck: PRD quality issues stemmed from weak problem statement skills, not review velocity.
Recommended investing in teaching a problem statement framework before scaling AI—proving that knowing when not to build is as strategic as shipping.
Why This Project Mattered to Me
I wanted to explore whether AI could scale expert thinking without replacing human judgment. This was my chance to practice applied prompt engineering grounded in real organizational dynamics, develop rigorous evaluation methodology, and learn when foundational work (teaching frameworks) creates more value than technological solutions.
The Challenge
PRD reviews were slow and centralized. Every document required director-level approval before teams could move forward, creating downstream delivery delays. Anne's reviews were thoughtful, Socratic, and rigorous—but because she led all PRD review sessions, PMs often waited days for feedback, even if they just needed a quick gut-check.
Leadership recognized the slowdown—but didn't want to compromise quality, standards, or product thinking.
The Tension
Business Need: Move faster without sacrificing quality
PM Need: Get feedback without long wait times
Director Need: Maintain rigor and strategic alignment
Traditional solutions had clear limitations:
Hire more directors → Expensive, slow to scale, dilutes expertise
Lower review standards → Unacceptable quality risk
Skip reviews → Defeats the purpose of having them
The Opportunity
What if AI could codify Anne's coaching patterns to prepare PMs before formal reviews—raising the floor so director time focused on strategic alignment rather than foundational fixes?
Product Hypothesis
Riskiest Assumption
AI can play a productive role in the PRD review process without replacing human judgment, nuance, or organizational context.
If True, Then:
PMs could access early coaching before formal reviews
At least one review cycle could be eliminated
Directors could spend less time on foundational fixes and more time on strategic alignment
Feedback quality would remain high—or improve
PM confidence and preparedness would increase
This hypothesis shaped how Anne-bot was designed, evaluated, and iterated—and ultimately, what it revealed about the real bottleneck.
Discovery
To understand why reviews were slow, I used mixed methods:
RESEARCH METHODS
Observation: Attended live PRD review sessions
6 PM Interviews: Surfaced perceptions, frustration points, and expectations around reviews
Transcript Analysis: Captured Anne's real language, questioning techniques, reasoning patterns
Artifact Comparison: Reviewed PRDs before and after review sessions to identify what changed
Prototype Feedback: PM evaluation of Anne-bot responses for relevance and usefulness
Mixed methods revealed both behavioral patterns and structural bottlenecks—essential for designing the right intervention.
What I Found
A clear pattern emerged across all data sources:
Reviews weren't slow because PMs were unprepared—they were slow because aligning on the problem statement took multiple rounds.
Key Patterns:
Most review time centered on clarifying the true user problem
PMs and leadership defined "problem statement" differently
Once the problem section was strong, the rest of the PRD moved quickly
Common issues: jumping to solutions, describing features instead of user needs, conflating business goals with user problems
The Core Insight
Everything downstream—scope, metrics, prioritization—depends on a strong problem statement.
If the problem foundation wasn't clear, reviewing anything else was performative.
Prototyping Strategy
I used Anne-bot itself as a research tool—building to learn, not to ship.
Iterative Learning Framework
Each iteration tested a different hypothesis:
Iteration 1: Can AI replicate coaching voice and logic?
Iteration 2: Can human-in-the-loop reduce irrelevant feedback?
Iteration 3: Should the AI focus on the entire PRD or on specific sections?
Testing drove continuous refinement based on PM feedback.
Building the Solution
Iteration 1 — Capturing the Coaching Voice
Goal: Replicate Anne's tone, logic, and conversational structure
Approach:
Mapped Anne's Socratic questioning patterns from transcripts
Had the prototype review entire PRDs at once
Focused on rhythm, clarity, and thoughtful guidance
Result:
✅ Tone alignment worked—PMs recognized Anne's voice
❌ Context gaps surfaced immediately: "Anne would never say this—she already knows this."
Learning: Accurate tone isn't enough if feedback feels uninformed.
Iteration 2 — Human-in-the-Loop Clarification
Problem: Some feedback felt irrelevant because the AI lacked organizational context PMs assumed Anne would have.
Solution: Redesigned interaction flow to make context gaps explicit:
Review one PRD section at a time
Pause to confirm relevance and ask clarifying questions
PMs add missing context dynamically
Feedback updates based on new information
Result:
✅ PMs felt heard and understood
✅ Feedback quality improved dramatically
✅ Shifted prototype from evaluator → collaborator
Learning: Transparency about limitations builds trust more than false confidence.
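As a rough illustration of the redesigned Iteration 2 flow, here is a minimal sketch of the section-by-section, human-in-the-loop loop. The ask_model helper is a placeholder for whichever chat API backs the prototype (internal LLM or Gemini), and the prompt wording, section names, and function signatures are assumptions for illustration, not the production implementation.

```python
# Sketch of the Iteration 2 interaction loop: review one section at a time,
# surface context gaps, and let the PM fill them in before feedback is final.

def ask_model(messages: list) -> str:
    """Placeholder for the chat-completion call to the underlying LLM."""
    raise NotImplementedError("Wire this to your LLM provider of choice.")

SECTIONS = ["Problem Statement", "Users & Evidence", "Goals & Metrics", "Scope"]

COACH_SYSTEM_PROMPT = (
    "You are a director-level PRD coach. Review one section at a time. "
    "If you might be missing organizational context, say so and ask a "
    "clarifying question instead of guessing."
)

def review_prd(prd_sections: dict) -> None:
    history = [{"role": "system", "content": COACH_SYSTEM_PROMPT}]
    for name in SECTIONS:
        history.append({"role": "user", "content": f"Section: {name}\n\n{prd_sections.get(name, '')}"})
        reply = ask_model(history)
        history.append({"role": "assistant", "content": reply})
        print(f"\n--- {name} ---\n{reply}")

        # Human in the loop: the PM adds missing context, and the feedback
        # is regenerated with that context before moving to the next section.
        added = input("Add missing context (or press Enter to continue): ").strip()
        if added:
            history.append({"role": "user", "content": f"Additional context: {added}"})
            revised = ask_model(history)
            history.append({"role": "assistant", "content": revised})
            print(revised)
```

The key design choice is that feedback is regenerated after the PM supplies missing context, which is what shifted the prototype from evaluator to collaborator.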
Iteration 3 — Prioritizing the Problem Statement
Observation: Testing made one truth unavoidable:
If the problem statement was unclear, everything else stalled.
Decision: Anne-bot began focusing primarily on diagnosing and coaching the problem section before addressing the rest of the document.
Approach:
Start with problem statement review
Use Socratic questions to strengthen reasoning
Only move to other sections once problem foundation is solid
Result:
✅ Reviews became more focused and productive
✅ PMs described it as "thinking partner, not grader"
✅ Turned feedback into structured thinking practice
Learning: The highest-leverage intervention point is the foundation, not the finish line.
Strategic Design Principles
Three validated principles guided the prototype design:
1. Conversational Guidance Over Prescription
Socratic questioning strengthened PM reasoning more than directive edits.
Example:
❌ "Your problem statement is too vague."
✅ "Who specifically experiences this problem? What happens when they encounter it? What evidence do we have that this matters to them?"
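One way to encode this principle is directly in the system prompt. The fragment below is a hedged sketch; the wording is illustrative and not the prototype's actual prompt.

```python
# Illustrative system-prompt fragment encoding "guidance over prescription".
SOCRATIC_COACH_PROMPT = """
You are coaching a product manager on their PRD.
Do not rewrite their document or issue verdicts like "too vague."
Instead, ask two to four Socratic questions that strengthen their reasoning, e.g.:
- Who specifically experiences this problem, and in what situation?
- What happens today when they encounter it?
- What evidence shows this matters to them?
- How would we know the problem is solved?
Start with the problem statement, and only move to other sections once it is
specific, user-centered, and supported by evidence.
"""
```

Combined with the clarification loop from Iteration 2, this framing keeps the prototype in the thinking-partner role rather than the grader role.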
2. Human-in-the-Loop Transparency
Admitting uncertainty ("I may be missing context—can you clarify?") built trust and improved relevance.
Why It Worked:
Acknowledged AI limitations openly
Gave PMs agency to correct misunderstandings
Prevented irrelevant feedback from undermining credibility
3. Flexible Interaction Model
PMs needed different modes depending on whether they were exploring, refining, or validating—not a one-size-fits-all review.
Result: Anne-bot became a thinking partner, not a reviewer.
The Critical Insight
Problem Definition Drives Everything
Review cycles weren't slow because of workflow inefficiency—they were slow because PMs and leadership weren't aligned on what a problem statement is.
If the problem foundation wasn't strong, reviewing anything else was performative.
Product Implications
This insight reframed the entire opportunity:
Wrong Solution: Build AI to review PRDs faster
Right Solution: Teach what a strong problem statement is, then use AI to help PMs practice applying that framework
Before productionizing Anne-bot, the org needed to:
Define and document problem statement criteria (What makes a problem statement strong?)
Create teaching resources (templates, examples, guidelines)
Build shared understanding across PMs and leadership
Core Principle: AI cannot compensate for foundational ambiguity—it can only scale what's already clear.
Recommendation: Invest in scaffolding before scaling automation.
Key Outcomes
Prototype Validation
Tested with: 6 PMs across 10+ PRDs
Results:
Designed to eliminate at least one review cycle through better preparation
PMs arrived better prepared and more confident at formal reviews
Shifted review discussions toward strategic alignment rather than foundational fixes
Revealed problem-statement clarity as the root bottleneck—not workflow efficiency
Demonstrated that AI can scale mentorship, not just efficiency
The prototype didn't replace review sessions—it raised the floor, ensuring time together was more strategic, focused, and high-value.
If I Were to Productionize This
Phased Approach
Phase 1: Foundation (Months 1-2)
Formalize problem statement framework with leadership
Create reference templates and examples
Pilot teaching workshops with 2-3 teams
Document shared criteria for "strong problem statement"
Phase 2: Pilot (Months 3-4)
Deploy Anne-bot to pilot teams who've completed training
Measure: review cycle reduction, PM confidence, feedback quality
Iterate based on usage patterns and PM feedback
Refine prompts based on real-world performance
Phase 3: Scale (Months 5-6)
Roll out org-wide once scaffolding is proven
Build feedback loop: Anne-bot surfaces common gaps → informs training updates (see the sketch after this list)
Establish continuous improvement cycle
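A rough sketch of that Phase 3 feedback loop, assuming each coaching session logs which gaps Anne-bot flagged (the tag names and data shapes are hypothetical):

```python
from collections import Counter

# Hypothetical session logs: each entry lists the gaps Anne-bot flagged
# during one coaching session.
session_gap_tags = [
    ["solution-first", "no-user-evidence"],
    ["business-goal-conflation"],
    ["solution-first", "vague-user"],
]

# The most common gaps feed back into training content and workshop examples.
gap_counts = Counter(tag for session in session_gap_tags for tag in session)
print(gap_counts.most_common(3))
```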
Success Metrics
Primary Metrics:
Reduce review cycles from 3 → 1-2 on average
Maintain or improve feedback quality (measured via PM satisfaction surveys)
Secondary Metrics:
Increase PM confidence entering reviews (pre/post surveys)
Reduce director time spent on foundational fixes (time tracking)
Improve problem statement quality (evaluated against documented criteria)
Long-term Goal:
Shift director time toward strategic alignment, not foundational coaching.
Why This Phased Approach
This sequencing puts the foundation in place before scaling—avoiding the trap of automating ambiguity and instead codifying clarity that AI can amplify.
Key Learnings
1. Problem Definition Is the Leverage Point
Alignment upstream prevents chaos downstream. The highest-value intervention isn't speeding up reviews—it's strengthening the foundation those reviews depend on.
2. Teaching Scales Better Than Automating
The best AI products cultivate capability rather than dependency. Sustainable impact comes from strengthening thinking, not delivering answers.
3. Transparency Builds Trust
Human-in-the-loop acknowledgment of uncertainty—not false certainty—increased adoption and credibility. Admitting "I may be missing context" worked better than pretending to know everything.
4. Evaluation Is as Critical as Prompting
Testing output quality systematically is essential for scaling AI experiences. Without rigorous evaluation, you risk scaling noise instead of value.
5. Strategic Restraint Matters
Sometimes the right product decision is to not build—or to build foundations first. Knowing when to pause and strengthen the system is as important as knowing when to ship.
What This Project Taught Me About AI Product Work
1. Start with the System, Not the Tool
AI interventions fail if the underlying process has structural ambiguity. Before building AI to scale something, ensure there's clarity worth scaling.
2. Context Matters More Than Capability
AI performance in isolation ≠ AI usefulness in real workflows. The best prompts fail if they don't account for organizational context, user mental models, and existing processes.
3. Human-in-the-Loop Design Enables Trust
Giving users agency to correct AI misunderstandings prevents frustration and builds confidence in the system. Transparency > perfection.
4. Focus on Leverage Points, Not Surface Problems
The real bottleneck often isn't what users complain about. Deep discovery reveals where intervention creates the most value.
5. Know When Not to Build
The strategic move isn't always feature expansion—sometimes it's tightening foundations. Anne-bot's greatest value was revealing what needed to happen before scaling AI.
Final Reflection
Project Status
Proof-of-concept validated feasibility and surfaced the need for strong problem-statement scaffolding before production investment.
The Bigger Picture
Anne-bot reinforced a belief that now shapes my approach to AI product development:
The most valuable AI products don't deliver answers—they strengthen thinking.
The goal isn't to replace expert judgment with AI—it's to scale the conditions under which people develop better judgment themselves.
This project taught me that the best AI products are often the ones that help users become better versions of themselves, not the ones that do the work for them.
What I'd Do Differently
If I were starting this project again with the insights I have now:
1. Start with Foundational Alignment First
Before building the prototype, I'd facilitate a working session with leadership and PMs to:
Document what "strong problem statement" means
Create shared examples of good vs. weak problem statements
Align on evaluation criteria
Why: This would have revealed the root bottleneck faster and clarified whether AI was even the right intervention.
2. Build Decision Gates Into Testing
Define clear thresholds upfront:
What results would make me double down?
What findings would make me pivot?
What evidence would suggest stopping entirely?
Why: Explicit decision criteria prevent attachment to solutions and enable faster, more objective pivots.
3. Test with More Diverse PRD Types
The prototype focused on one PRD format. Testing across different project types (0→1 products, feature enhancements, technical improvements) would reveal whether coaching patterns generalize or need customization.
4. Measure Baseline Review Quality
Track problem statement quality before Anne-bot to quantify improvement afterward. Without baseline data, "better" is subjective.
5. Involve Directors in Prototype Testing Earlier
Getting Anne's feedback on prototype responses would have:
Validated coaching pattern accuracy faster
Surfaced organizational context gaps sooner
Built stakeholder buy-in earlier in the process
Why: The people whose expertise you're codifying should validate the codification.