Anne-bot

Exploratory AI prototype for scaling internal PRD coaching

A 2-week proof of concept revealed that PRD review delays stemmed from misaligned problem-statement criteria—not review capacity. Production was deliberately paused to establish shared foundations before automation.

Tags: AI · Exploratory · Enterprise · Internal · Mentorship Systems · Human-in-the-Loop · POC Scope
PROJECT CONTEXT
  • Role: Product Manager & AI Practitioner with end-to-end ownership across discovery, prototyping, evaluation, and scope decisions—including authority to pause investment based on evidence
  • Team: Solo practitioner, with feedback from participating PMs and review stakeholders during testing
  • Context: Internal PRD review process in a high-rigor enterprise environment, where director approval was required before teams could proceed
  • Duration: 2 weeks (exploratory proof of concept)
  • Status: POC completed; production paused after discovery surfaced a foundational alignment gap
Innovation

Treated the AI prototype as a diagnostic tool, not a production investment—surfacing structural gaps before committing to scale.

Technology Lens

Used AI narrowly and deliberately for Socratic questioning, not PRD generation. Human-in-the-loop transparency surfaced context gaps; automation was deferred where shared criteria did not exist.

Problem

PRD Reviews Looked Rigorous—Alignment Quietly Lagged

PRD reviews were thoughtful, Socratic, and widely respected. Standards were high, and documents were reviewed carefully by experienced leaders.

Despite this rigor, review cycles frequently stretched across multiple rounds—not because PMs were unprepared, but because foundational alignment was unresolved when reviews began. Much of the discussion centered on clarifying the problem itself. Definitions drifted. Assumptions surfaced late.

Because reviews ran on a fixed cadence, even a single additional revision round translated into real delay. The process appeared slow, but the underlying issue was that alignment entered the room too late.

Insight

Review Speed Wasn’t the Constraint—Shared Criteria Were

The bottleneck was not review capacity, tooling, or director availability. It was the absence of shared, explicit criteria for what constituted a strong problem statement—causing PMs and leadership to arrive at reviews solving different epistemic tasks.

Until the problem definition stabilized, reviewing anything else was effectively performative. Scope, metrics, and prioritization could not be meaningfully evaluated without first aligning on what problem was actually being addressed.

Structural Solution

Coaching Was Constrained to Problem Statements Before Broader Review

Anne-bot was designed as a constrained mentorship system focused narrowly on strengthening problem statements through Socratic questioning.

Rather than issuing evaluations or edits, it surfaced assumptions, gaps, and reasoning errors that typically emerged late in reviews.

Just as important were the deliberate non-solutions:

  • No auto-generated PRDs
  • No scoring or approval signals
  • No replacement of director judgment

Human-in-the-loop transparency made context gaps explicit instead of masking them. Where alignment was missing, automation was intentionally deferred.
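As a minimal sketch of this checkpoint behavior (all names here are hypothetical, not the prototype's actual implementation): the assistant first declares what context it lacks and only moves to Socratic questioning once a human fills those gaps. It never edits, scores, or approves.

```python
# Hypothetical sketch of the human-in-the-loop checkpoint: surface
# context gaps before coaching, and coach only with questions.

REQUIRED_CONTEXT = ["target_user", "evidence", "business_goal"]

SOCRATIC_PROMPTS = [
    "What would have to be true for this problem to matter to {target_user}?",
    "Which part of {evidence} would most change your framing if it were wrong?",
    "How does solving this advance {business_goal}, and how would you know?",
]

def review_problem_statement(statement: str, context: dict) -> dict:
    """Return either context-gap questions or Socratic coaching prompts.

    The assistant asks; it never rewrites or rates the statement.
    """
    missing = [k for k in REQUIRED_CONTEXT if not context.get(k)]
    if missing:
        # Checkpoint: admit what is unknown instead of guessing.
        return {
            "mode": "context_gathering",
            "questions": [f"I don't have context on '{k}'. Can you clarify?"
                          for k in missing],
        }
    return {
        "mode": "socratic_coaching",
        "questions": [p.format(**context) for p in SOCRATIC_PROMPTS],
    }
```

With empty context, the sketch returns clarifying questions first; once the PM supplies the missing pieces, it switches to coaching prompts grounded in that context.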

WHY THIS MATTERED BEFORE SCALE: AI coaching tools fail when they erode trust faster than they save time. If Anne-bot had confidently delivered feedback without acknowledging what it didn't know, PMs would have dismissed it after the first irrelevant suggestion—and the tool would have been abandoned before surfacing anything useful.

The real adoption gate wasn't AI capability. It was organizational readiness: whether shared criteria existed for the AI to codify. Without that foundation, automation would scale confusion, not clarity.

By treating the prototype as a diagnostic instrument rather than a production investment, the work avoided the most common failure mode in AI tooling: building confidently on ambiguous ground.

Synthesis: The decision to pause—not ship—was the output that mattered most.

Key Strategic Decisions

The focus was on sequencing the work correctly and on using AI to sharpen thinking rather than replace judgment.

Surface uncertainty instead of masking it
  • Observed: Early outputs referenced context the AI lacked—PMs reported feedback felt "off" or "uninformed."
  • Decision: Added checkpoints where the AI surfaced gaps and invited PMs to clarify before generating feedback.
  • Tradeoff: Increased interaction friction; feedback required back-and-forth rather than one-shot delivery.
  • Trust Implication: Shifted AI from evaluator to collaborator—PMs corrected misunderstandings rather than dismissing the tool.
Narrow scope to problem statements only
  • Observed: Full-PRD feedback produced noise; revision cycles consistently originated in the problem statement.
  • Decision: Constrained prototype to coaching problem statements before reviewing other sections.
  • Tradeoff: Could not demonstrate value on complete document review; reduced perceived comprehensiveness.
  • Adoption Implication: Feedback became actionable at the highest-leverage point; PMs used the tool earlier in their process.
Prepare PMs for review, not replace review
  • Observed: PMs valued reviews as learning forums—peer ideas and Socratic coaching were reasons to attend.
  • Decision: Positioned AI as pre-review preparation, helping PMs anticipate questions rather than bypassing reviews.
  • Tradeoff: Kept reviews intact; couldn't claim time savings from eliminating them.
  • Adoption Implication: Preserved learning value while reducing rework; directors focused on strategy, not foundational fixes.
Pause production until criteria exist
  • Observed: Testing revealed "strong problem statement" meant different things to PMs and leadership—no shared criteria existed.
  • Decision: Recommended establishing shared criteria and examples before investing in production AI tooling.
  • Tradeoff: Delayed efficiency gains; required organizational alignment work before technical investment.
  • Adoption Implication: Prevented scaling ambiguity; future AI would codify shared understanding, not inconsistent expectations.

Impact at a Glance

The POC revealed that context, not model capability, was the constraint on AI-assisted coaching. Two weeks of discovery and prototyping across 10+ PRDs with 6 PMs validated an architectural approach: when organizational knowledge is unavailable, human-in-the-loop interaction becomes the interface for context-gathering, not a limitation to overcome.

Quantitative Signals

  • 6 PMs interviewed during the discovery phase
  • 10+ PRDs tested through prototype iterations
  • 2 major pivots (e.g., full-PRD feedback → problem-statement focus)
  • Cycle time varied by problem quality (1 round when strong, 2-3 when weak)

These signals indicate the bottleneck was structural misalignment, not review capacity or tooling speed. The shift to problem-statement focus with explicit context-gathering checkpoints proved feasible and relevant—validating the architectural approach.

What Did Not Happen:

  • No approval cycles accelerated solely due to AI speed (the constraint was alignment, not velocity)
  • Comprehensive PRD review feedback did not improve adoption (confirmed scope-reduction was necessary)
  • Directorial judgment was not replaced (preparation was sufficient)

Why This Impact Was Durable

The validation was structural, not circumstantial. Three independent methods—observation, artifact analysis, PM interviews—converged on the same findings. The human-in-the-loop approach is replicable because it treats context-gathering as a design requirement, not a workaround.

How This Sharpened My Judgment

Lessons from Real-World Use

  • Context is the primary constraint on AI relevance in judgment-heavy domains. I designed around missing organizational knowledge by making context-gathering interactive—human-in-the-loop became the architecture.
  • Scope reduction can be more impactful than comprehensive solutions. Full PRD feedback created noise; problem-statement focus created signal with fewer context dependencies.
  • Acknowledging uncertainty produces more trusted guidance. Iteration 2 succeeded because the assistant admitted what it didn't know instead of guessing.
  • The leverage point is often upstream of where you're building. The real constraint was shared criteria, not feedback speed—a foundational problem, not a tooling problem.
  • Testing can reveal what not to build, not just what to build. The prototype's failures taught me that comprehensive coverage was strategically wrong.

A Transferable Pattern

Design for the context you actually have.

In judgment-sensitive workflows, context—not model capability—is the primary constraint on AI relevance. When organizational knowledge isn't available in the system, make context-gathering part of the interface: explicit checkpoints where AI surfaces gaps and humans provide missing information. Simultaneously, scope your intervention to the constraint with the fewest context dependencies—and verify whether foundational alignment would solve the problem more efficiently than automation.
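The pattern above can be condensed into a small decision sketch (hypothetical names; this illustrates the logic, not a shipped system): gate automation on whether shared criteria exist, route missing context to a human checkpoint, and only then automate.

```python
# Hypothetical sketch of the readiness gate described above.
from dataclasses import dataclass

@dataclass
class Readiness:
    shared_criteria_exist: bool   # do PMs and leadership agree on "strong"?
    context_available: bool       # does the system hold the needed org context?

def next_step(r: Readiness) -> str:
    """Pick the intervention the organization is actually ready for."""
    if not r.shared_criteria_exist:
        # Automation here would codify inconsistent expectations.
        return "align: establish shared criteria and examples first"
    if not r.context_available:
        # Make context-gathering part of the interface.
        return "checkpoint: have the AI surface gaps for humans to fill"
    return "automate: scope to the constraint with the fewest dependencies"
```

In Anne-bot's case, the first branch fired: no shared criteria existed, so the recommendation was alignment work before production tooling.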
