Anne-bot
Exploratory AI prototype for scaling internal PRD coaching
A 2-week proof of concept revealed that PRD review delays stemmed from misaligned problem-statement criteria—not review capacity. Production was deliberately paused to establish shared foundations before automation.
Problem
PRD Reviews Looked Rigorous—Alignment Quietly Lagged
PRD reviews were thoughtful, Socratic, and widely respected. Standards were high, and documents were reviewed carefully by experienced leaders.
Despite this rigor, review cycles frequently stretched across multiple rounds—not because PMs were unprepared, but because foundational alignment was unresolved when reviews began. Much of the discussion centered on clarifying the problem itself. Definitions drifted. Assumptions surfaced late.
Because reviews ran on a fixed cadence, even a single additional revision round translated into real delay. The process appeared slow, but the underlying issue was that alignment entered the room too late.
Insight
Review Speed Wasn’t the Constraint—Shared Criteria Were
The bottleneck was not review capacity, tooling, or director availability. It was the absence of shared, explicit criteria for what constituted a strong problem statement—causing PMs and leadership to arrive at reviews solving different epistemic tasks.
Until the problem definition stabilized, reviewing anything else was effectively performative. Scope, metrics, and prioritization could not be meaningfully evaluated without first aligning on what problem was actually being addressed.
Structural Solution
Coaching Was Constrained to Problem Statements Before Broader Review
Anne-bot was designed as a constrained mentorship system focused narrowly on strengthening problem statements through Socratic questioning.
Rather than issuing evaluations or edits, it surfaced assumptions, gaps, and reasoning errors that typically emerged late in reviews.
Just as important were the deliberate non-solutions:
- No auto-generated PRDs
- No scoring or approval signals
- No replacement of director judgment
Human-in-the-loop transparency made context gaps explicit instead of masking them. Where alignment was missing, automation was intentionally deferred.
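The "question, don't rewrite" constraint can be sketched as a small rule-based checker. This is a simplified stand-in for the actual LLM prompting, and the three criteria (affected user, evidence, business impact) are illustrative assumptions, not the team's real rubric:

```python
# Sketch of the Socratic constraint: emit questions about missing elements,
# never rewrite the PM's text. Criteria and keywords below are hypothetical.

SOCRATIC_CHECKS = {
    "user": ("who", "Who specifically experiences this problem?"),
    "evidence": ("data", "What evidence shows this problem actually occurs?"),
    "impact": ("cost", "What does it cost the business if left unsolved?"),
}

def coach(problem_statement: str) -> list[str]:
    """Return Socratic questions for missing elements; never edits text."""
    text = problem_statement.lower()
    return [
        question
        for keyword, question in SOCRATIC_CHECKS.values()
        if keyword not in text  # naive keyword probe, stands in for the LLM
    ]
```

The design choice the sketch preserves is the output type: a list of questions, never a revised document, so director judgment stays in the loop.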
WHY THIS MATTERED BEFORE SCALE: AI coaching tools fail when they erode trust faster than they save time. If Anne-bot had confidently delivered feedback without acknowledging what it didn't know, PMs would have dismissed it after the first irrelevant suggestion—and the tool would have been abandoned before surfacing anything useful.
The real adoption gate wasn't AI capability. It was organizational readiness: whether shared criteria existed for the AI to codify. Without that foundation, automation would scale confusion, not clarity.
By treating the prototype as a diagnostic instrument rather than a production investment, the work avoided the most common failure mode in AI tooling: building confidently on ambiguous ground.
Synthesis: The decision to pause—not ship—was the output that mattered most.
Key Strategic Decisions
The focus was on doing things in the right order and on using AI to sharpen thinking rather than replace judgment.
- Observed: Early outputs referenced context the AI lacked—PMs reported feedback felt "off" or "uninformed."
  - Decision: Added checkpoints where the AI surfaced gaps and invited PMs to clarify before generating feedback.
  - Tradeoff: Increased interaction friction; feedback required back-and-forth rather than one-shot delivery.
  - Trust Implication: Shifted AI from evaluator to collaborator—PMs corrected misunderstandings rather than dismissing the tool.
- Observed: Full-PRD feedback produced noise; revision cycles consistently originated in the problem statement.
  - Decision: Constrained prototype to coaching problem statements before reviewing other sections.
  - Tradeoff: Could not demonstrate value on complete document review; reduced perceived comprehensiveness.
  - Adoption Implication: Feedback became actionable at the highest-leverage point; PMs used the tool earlier in their process.
- Observed: PMs valued reviews as learning forums—peer ideas and Socratic coaching were reasons to attend.
  - Decision: Positioned AI as pre-review preparation, helping PMs anticipate questions rather than bypassing reviews.
  - Tradeoff: Kept reviews intact; couldn't claim time savings from eliminating them.
  - Adoption Implication: Preserved learning value while reducing rework; directors focused on strategy, not foundational fixes.
- Observed: Testing revealed "strong problem statement" meant different things to PMs and leadership—no shared criteria existed.
  - Decision: Recommended establishing shared criteria and examples before investing in production AI tooling.
  - Tradeoff: Delayed efficiency gains; required organizational alignment work before technical investment.
  - Adoption Implication: Prevented scaling ambiguity; future AI would codify shared understanding, not inconsistent expectations.
Impact at a Glance
The POC revealed that context, not model capability, was the constraint on AI-assisted coaching. Two weeks of discovery and prototyping across 10+ PRDs with 6 PMs validated an architectural approach: when organizational knowledge is unavailable, human-in-the-loop interaction becomes the interface for context-gathering, not a limitation to overcome.
Quantitative Signals
- 10+ PRDs analyzed during discovery phase
- 6 PMs engaged through prototype iterations
- Scope narrowed from full-PRD feedback → problem-statement focus
- Revision rounds tracked problem-statement quality (1 round when strong, 2-3 when weak)
These signals indicate the bottleneck was structural misalignment, not review capacity or tooling speed. The shift to problem-statement focus with explicit context-gathering checkpoints proved feasible and relevant—validating the architectural approach.
What Did Not Happen:
- No approval cycles accelerated solely due to AI speed (the constraint was alignment, not velocity)
- Comprehensive PRD review feedback did not improve adoption (confirmed scope-reduction was necessary)
- Directorial judgment was not replaced (preparation was sufficient)
Why This Impact Was Durable
The validation was structural, not circumstantial. Three independent methods—observation, artifact analysis, PM interviews—converged on the same findings. The human-in-the-loop approach is replicable because it treats context-gathering as a design requirement, not a workaround.
How This Sharpened My Judgment
Lessons from Real-World Use
- Context is the primary constraint on AI relevance in judgment-heavy domains. I designed around missing organizational knowledge by making context-gathering interactive—human-in-the-loop became the architecture.
- Scope reduction can be more impactful than comprehensive solutions. Full PRD feedback created noise; problem-statement focus created signal with fewer context dependencies.
- Acknowledging uncertainty produces more trusted guidance. Iteration 2 succeeded because the assistant admitted what it didn't know instead of guessing.
- The leverage point is often upstream of where you're building. The real constraint was shared criteria, not feedback speed—a foundational problem, not a tooling problem.
- Testing can reveal what not to build, not just what to build. The prototype's failures taught me that comprehensive coverage was strategically wrong.
A Transferable Pattern
Design for the context you actually have.
In judgment-sensitive workflows, context—not model capability—is the primary constraint on AI relevance. When organizational knowledge isn't available in the system, make context-gathering part of the interface: explicit checkpoints where AI surfaces gaps and humans provide missing information. Simultaneously, scope your intervention to the constraint with the fewest context dependencies—and verify whether foundational alignment would solve the problem more efficiently than automation.
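One way to read this pattern is as a gate: feedback generation stays blocked until the human has resolved every surfaced context gap. The sketch below models that checkpoint; the context field names are hypothetical, not the prototype's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical organizational context the assistant cannot know on its own.
REQUIRED_CONTEXT = ("target_user", "prior_attempts", "success_metric")

@dataclass
class CoachingSession:
    """Human-in-the-loop gate: surface gaps first, give feedback second."""
    context: dict = field(default_factory=dict)

    def surface_gaps(self) -> list[str]:
        # Checkpoint: name what the AI doesn't know instead of guessing.
        return [k for k in REQUIRED_CONTEXT if k not in self.context]

    def provide(self, key: str, value: str) -> None:
        # The human, not the model, supplies missing organizational knowledge.
        self.context[key] = value

    def feedback(self) -> str:
        gaps = self.surface_gaps()
        if gaps:
            # Refuse to guess; ask instead.
            return "Before I comment, tell me about: " + ", ".join(gaps)
        return "Context complete; generating problem-statement questions."
```

The point of the gate is that context-gathering is the interface itself: the extra back-and-forth is the mechanism that keeps feedback relevant, not friction to be optimized away.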