Anne-bot

AI prototype that surfaced missing alignment before scaling automation

A two-week proof of concept revealed that PRD review delays stemmed from misaligned evaluation criteria—not review capacity. Rather than accelerating automation on unstable ground, the recommendation was to pause production and reinforce shared foundations first.

AI Prototyping · Enterprise Internal · Human-in-the-Loop · Diagnostic Discovery · Mentorship Systems
Project Context
  • Purpose: Determine whether AI could meaningfully improve PRD reviews — or whether delays signaled a deeper alignment problem that automation would only amplify.

  • Role: Product Manager & AI Practitioner with end-to-end responsibility: discovery, prototyping, evaluation, and scope decisions — including the recommendation to pause investment based on evidence
  • Team: Solo practitioner, collaborating with participating PMs and review stakeholders during testing and synthesis
  • Context: Internal PRD review process in a high-rigor enterprise environment, where director approval was required before teams could proceed
  • Duration: 2 weeks (exploratory proof of concept)
  • Status: POC completed; production paused after discovery surfaced a foundational alignment gap
Innovation
Treated the prototype as a diagnostic instrument rather than a delivery mechanism — using AI to surface structural misalignment before committing to production investment.
Technology Lens
Chose the right tool for the actual constraint: AI was used narrowly for Socratic questioning, not PRD generation. Human-in-the-loop transparency exposed missing shared criteria; automation was intentionally deferred where alignment did not yet exist.
The Problem

PRD reviews looked rigorous but alignment quietly lagged

Ford's Digital Cabin PRD reviews were thoughtful, Socratic, and widely respected. Standards were high. Documents were reviewed carefully by experienced leaders.

The group had recently gone through an organizational change. New members joined from different departments, each bringing different ways of framing and evaluating problems. The director was also new to the team.

Rigor without shared grounding. Despite this rigor, review cycles frequently stretched across multiple rounds. Not because PMs were unprepared, but because foundational alignment on the problem statement was unresolved when reviews began.

Alignment work happened too late. Much of the discussion centered on clarifying the problem itself. Definitions drifted. Assumptions surfaced late. The director would ask Socratic questions and give examples to help PMs sharpen their thinking.

Delay became structural. Because reviews ran on a fixed cadence, even a single additional revision round translated into real delay. Product decisions stalled. Launch timelines shifted. The process appeared slow, but the underlying issue was that alignment entered the room too late.

The Insight

Review cycles dragged until leadership aligned on the problem statement

What looked like the bottleneck. Review capacity and director availability looked like the bottlenecks, but they weren't what extended approval cycles the most.

The real constraint. Shared criteria for evaluating problem statements didn't exist. PMs and the director arrived at reviews with different standards for what made a problem statement strong.

Why alignment only emerged in review. Without explicit criteria upfront, alignment happened during reviews instead of before them. PMs valued these reviews as coaching moments: they didn't want to be told what to do; they wanted to strengthen their product sense. But this learning happened reactively in review cycles instead of proactively during drafting.

Structural Principle
Until the problem was aligned, nothing else could move.
The Intervention

Coaching stayed where shared criteria could actually exist

Anne-bot walked users through their PRD section by section, gathering context before providing feedback.

Anne-bot prototype

Iteration 1: Full-PRD review. The first version tried to review complete PRDs. Too much noise. Feedback was scattered across sections and hard to act on.

Iteration 2: Section-level review. The second version reviewed one section at a time and asked PMs for missing context. Better focus, but still too broad.

Iteration 3: Problem-statement focus. The final direction coached problem statements only. This is where review cycles consistently broke down, and it had the fewest dependencies on organizational context the AI couldn't access.

How the assistant behaved. The tool asked Socratic questions instead of giving answers. When it didn't have enough context, it said so and asked PMs to fill in gaps.
No auto-generated PRDs. No approval scores. No replacing director judgment.
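The checkpoint behavior described above can be sketched in miniature. This is a hypothetical illustration, not the actual Anne-bot implementation: the field names (`target_user`, `evidence`, `business_goal`) and the `Session`/`next_step` structure are assumptions chosen to show the pattern of surfacing gaps before coaching.

```python
# Hypothetical sketch of the context-checkpoint behavior; names and
# required fields are illustrative, not the real Anne-bot internals.
from dataclasses import dataclass, field

# Context a PM must supply before any coaching happens (assumed fields)
REQUIRED_CONTEXT = ["target_user", "evidence", "business_goal"]

@dataclass
class Session:
    draft: str
    context: dict = field(default_factory=dict)

def next_step(session: Session) -> dict:
    """Return either a request for missing context or Socratic prompts.

    The assistant never generates PRD text and never scores the draft;
    it only asks questions, and it names its own gaps before coaching.
    """
    missing = [k for k in REQUIRED_CONTEXT if not session.context.get(k)]
    if missing:
        # Surface the gap instead of guessing: ask the PM to fill it in.
        return {
            "mode": "gather_context",
            "questions": [
                f"I don't have enough context on '{m}'. Can you describe it?"
                for m in missing
            ],
        }
    # With context in hand, coach via Socratic questions, not answers.
    return {
        "mode": "coach",
        "questions": [
            f"How would {session.context['target_user']} describe this "
            "problem in their own words?",
            f"What in '{session.context['evidence']}' would change your "
            "mind about the problem's severity?",
            "If this went unsolved for a year, what would happen to "
            f"{session.context['business_goal']}?",
        ],
    }
```

The key design point the sketch captures: the branch on missing context runs before any feedback is produced, so the tool's first move on an under-specified draft is always a question, never a confident judgment.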

Where testing stopped. I was laid off before building the final version, but testing validated the direction: focus where alignment could actually be codified, and surface gaps instead of pretending the AI knew things it didn't.

Structural Principle
The constraint was structural: AI could only codify criteria that already existed. Where shared understanding was absent, the tool surfaced the gap rather than masking it with confident feedback.

Why This Mattered Before Scale

Extended review cycles weren't just friction. They delayed product decisions and slowed the team's ability to ship. When approval took multiple rounds, launch timelines shifted and momentum stalled.

But scaling AI coaching before establishing shared criteria would have made the problem worse, not better.

Trust as a hard constraint. AI coaching tools fail when they break trust faster than they save time. If Anne-bot had acted confident about things it didn't know, PMs would've ignored it after one bad suggestion.

Readiness as the gating factor. The real constraint wasn't AI capability. It was whether shared criteria existed to automate in the first place. Without that foundation, building the tool would just scale confusion instead of clarity.

Why pausing was the right decision. Treating the prototype as a diagnostic tool (not a production bet) let me learn what not to build. The decision to pause was the real output.

Key Strategic Decisions

These decisions shaped what I built and what I chose not to build — prioritizing learning over premature production.

Surface Uncertainty Instead of Masking It
  • Observed: Early outputs referenced context the AI lacked. PMs reported feedback felt "off" or "uninformed."
  • Decision: Added checkpoints where the AI surfaced gaps and invited PMs to clarify before generating feedback.
  • Tradeoff: Increased interaction friction. Feedback required back-and-forth rather than one-shot delivery.
  • Trust Implication: Shifted AI from evaluator to collaborator — PMs corrected misunderstandings rather than dismissing the tool.
Narrow Scope to Problem Statements Only
  • Observed: Full-PRD feedback produced noise. Revision cycles consistently originated in the problem statement.
  • Decision: Constrained prototype to coaching problem statements before reviewing other sections.
  • Tradeoff: Could not demonstrate value on complete document review; reduced perceived comprehensiveness.
  • Adoption Implication: Feedback became actionable at the highest-leverage point; PMs used the tool earlier in their process.
Prepare PMs for Review, Not Replace Review
  • Observed: PMs valued reviews as learning forums. Peer ideas and Socratic coaching were reasons to attend.
  • Decision: Positioned AI as pre-review preparation, helping PMs anticipate questions rather than bypassing reviews.
  • Tradeoff: Kept reviews intact; couldn't claim time savings from eliminating them.
  • Adoption Implication: Preserved learning value while reducing rework; directors focused on strategy, not foundational fixes.
Pause Production Until Criteria Exist
  • Observed: Testing revealed PMs and leadership defined "strong problem statement" differently. No shared criteria existed.
  • Decision: Recommended establishing shared criteria and examples before investing in production AI tooling.
  • Tradeoff: Delayed efficiency gains; required organizational alignment work before technical investment.
  • Adoption Implication: Prevented scaling ambiguity; future AI would codify shared understanding, not inconsistent expectations.

Impact At a Glance

The POC confirmed the real constraint was shared criteria, not review capacity or AI capability.

Testing validated the approach: making context-gathering interactive proved feasible, and cycle time tracked alignment — problem statements that were strong going in cleared review in a single round.

Quantitative Signals

  • 6 PMs interviewed during discovery
  • 10+ PRDs tested across iterations
  • 2 major scope changes: Full-PRD feedback → Section-by-section → Problem-statement coaching only
  • Cycle time varied by alignment: 1 round when problem statement was strong, 2–3 when weak

Metric Interpretation: The first three metrics show iteration scope. The fourth is what mattered.

When problem statements were well-aligned, reviews took one round. When they weren't, reviews took 2–3 rounds. This confirmed the bottleneck was alignment, not capacity.

What Did Not Happen:
  • Approval cycles didn't speed up from AI alone (alignment was still the gate)
  • Comprehensive feedback didn't improve adoption (narrower scope was necessary)
  • Director judgment wasn't replaced (better preparation was enough)

How This Sharpened My Judgment

These lessons generalized beyond Anne-bot. They've shaped how I approach AI tooling, organizational readiness, and scope decisions in subsequent work.

Lessons from Real-World Use

  • Context is the primary constraint on AI relevance in judgment-heavy domains. I designed around missing organizational knowledge by making context-gathering interactive. Human-in-the-loop became the architecture.
  • Scope reduction can be more impactful than comprehensive solutions. Full PRD feedback created noise. Problem-statement focus created signal with fewer context dependencies.
  • Acknowledging uncertainty produces more trusted guidance. The prototype earned trust once the assistant admitted what it didn't know instead of guessing.
  • The leverage point is often upstream of where you're building. The real constraint was shared criteria, not feedback speed. A foundational problem, not a tooling problem.
  • Testing can reveal what not to build, not just what to build. The prototype's failures taught me that comprehensive coverage was strategically wrong.

Transferable Pattern
Design for the context you actually have.

In judgment-sensitive workflows, context (not model capability) is the primary constraint on AI relevance. When organizational knowledge isn't available in the system, make context-gathering part of the interface: explicit checkpoints where AI surfaces gaps and humans provide missing information.

Simultaneously, scope your intervention to the constraint with the fewest context dependencies and verify whether foundational alignment would solve the problem more efficiently than automation.