What My LLM Research Taught Me About Building AI Products That Actually Work

Most product managers build AI features without knowing how their models fail. Here's what a year of LLM research taught me about making better product decisions.

Katherine Xu

March 24, 2026

Most product managers build AI features without knowing how their models fail. LLMs still hallucinate, go off-topic, and give weird answers. But how do you actually prevent that from happening in the first place?

I've been a heavy user of LLMs since ChatGPT launched: idea generation, research synthesis, ethical debates, critiques. They're embedded in how I think and work. I also wrote my undergraduate thesis on whether LLMs can reason about nested beliefs: does the model know that you know that they know? What I found has shaped every AI product decision I've made since.

When Models Fail, They Fail Predictably

For my undergraduate thesis, I spent a year testing whether LLMs can reason about what other people know and believe. The short version: they're inconsistent in ways that are predictable once you know what to look for.

The classic test is something like: Sally hides a marble, leaves the room, and someone moves it while she's gone. Where does Sally think the marble is? Easy. Now make it four layers deep: what does Sally think Anne thinks Bob thinks Carol knows? The models fall apart fast. GPT-3.5-turbo was hitting around 21% accuracy at that depth. The breakdown isn't random: it shows up at a specific point of complexity.
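
To make "layers deep" concrete, here is a minimal sketch of how questions at increasing nesting depth could be generated from one list of characters. It is not the thesis test harness: the function names and the exact question wording are assumptions made for illustration.

```python
def nested_belief_question(agents, obj="marble"):
    """Return a belief question nested len(agents) levels deep."""
    if not agents:
        raise ValueError("need at least one agent")
    first, *rest = agents
    # Each extra agent adds one more layer of "X thinks" to the question.
    chain = "".join(f" {name} thinks" for name in rest)
    return f"Where does {first} think{chain} the {obj} is?"


def questions_up_to_depth(agents):
    """Build one question per nesting depth, from 1 up to len(agents)."""
    return {
        depth: nested_belief_question(agents[:depth])
        for depth in range(1, len(agents) + 1)
    }


# Hypothetical usage: the same four characters, probed at depths 1 to 4.
probes = questions_up_to_depth(["Sally", "Anne", "Bob", "Carol"])
for depth, question in probes.items():
    print(depth, question)
```

Asking the same story at every depth is what lets you see the point where answers start to go wrong, rather than only an overall score.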

Newer models have gotten better at these tests. But here's what the research community has started pointing out: passing a benchmark about a made-up story isn't the same as actually reasoning about a real person in a real conversation. The gap between "performs well on paper" and "works for your users" is where most AI product failures live.

That's the thing I kept coming back to when I started building. Not "is this model smart enough" — but "where exactly does it break, and does my product depend on the thing it's bad at?"

Good Prompts Aren't Enough — Good Structure Is

At Harvard, I worked with Imagine Learning, a K-12 ed-tech company, to build AI prototypes that could generate and assess communication skills tasks for high school students. The first version didn't work well. We'd give the AI a vague instruction and get back something generic: technically correct, but practically useless.

The fix was to give it a better structure. Once we fed the model a detailed rubric — four proficiency levels, three types of observable evidence, twelve different conditions — the output quality jumped significantly. The AI had more to work with.
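
As a rough sketch of what feeding the model a rubric can look like in practice, the snippet below turns a rubric into an explicit, numbered prompt. Everything specific here is a placeholder: the level names, the evidence types, and the function names are invented, and it assumes purely for illustration that the twelve conditions are the level-by-evidence combinations.

```python
from itertools import product

# Placeholder rubric values, invented for this sketch (not the real rubric).
LEVELS = ["Beginning", "Developing", "Proficient", "Advanced"]
EVIDENCE_TYPES = ["spoken", "written", "nonverbal"]


def rubric_cells(levels, evidence_types):
    """Return every (level, evidence type) pair in a fixed order."""
    return list(product(levels, evidence_types))


def build_rubric_prompt(task, levels, evidence_types):
    """Render a task plus one numbered instruction per rubric cell."""
    lines = [f"Task: {task}", "Check the response against each rubric cell:"]
    cells = rubric_cells(levels, evidence_types)
    for number, (level, evidence) in enumerate(cells, start=1):
        lines.append(
            f"{number}. Level '{level}', {evidence} evidence: "
            "state what you observed, or write 'none'."
        )
    return "\n".join(lines)


# Hypothetical usage with an invented task description.
prompt = build_rubric_prompt(
    "Assess a student's explanation of a group decision.",
    LEVELS,
    EVIDENCE_TYPES,
)
print(prompt)
```

The point of the sketch is that the structure lives in data the product team can review and change, not in a sentence buried inside a prompt string.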

This connected directly back to my thesis findings. Program-of-thought prompting worked better than chain-of-thought for complex reasoning tasks because it gave the model an explicit structure to work through, rather than expecting it to hold everything in its head. When you externalize the cognitive scaffolding, the model performs better.
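
Program-of-thought prompting asks the model to write a short program and lets an interpreter compute the answer, instead of reasoning entirely in prose. The sketch below shows the kind of program such a prompt might elicit for the marble story. It is not code from the thesis: the event format, the `run_story` name, and the basket and box locations are placeholders, and it only tracks first-order beliefs.

```python
def run_story(events):
    """Replay events and track where each agent last saw the object."""
    present = set()   # agents currently in the room
    location = None   # where the object really is
    beliefs = {}      # agent -> location they last saw the object in
    for kind, *args in events:
        if kind == "enter":
            present.add(args[0])
        elif kind == "leave":
            present.discard(args[0])
        elif kind == "move":
            # args is (mover, destination); only agents in the room see it.
            location = args[1]
            for agent in present:
                beliefs[agent] = location
    return location, beliefs


story = [
    ("enter", "Sally"),
    ("enter", "Anne"),
    ("move", "Sally", "basket"),
    ("leave", "Sally"),
    ("move", "Anne", "box"),
]
location, beliefs = run_story(story)
print(location, beliefs["Sally"])  # the real location vs. Sally's belief
```

The state the model would otherwise have to hold in its head (who was in the room, who saw what) becomes explicit variables that an interpreter updates step by step.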

For PMs, the takeaway is uncomfortable: the quality of your AI feature isn't just a function of which model you pick. It's a function of how much structure you give it. And that's a product decision, not an engineering one.

Structure Isn't Just About Quality — It's About Raising the Ceiling

At the Bot Club Initiative at Harvard, I built a persona-building bot for a course on pilot-ready products. The bot guided student entrepreneurs through building realistic personas. The model could already produce decent personas without much help — coherent, structured, useful.

But decent isn't the goal when you're a student trying to recruit real pilot participants. The personas needed to be specific enough that a student could look at one and immediately know how to reach that person, what to say, and what would make them say no.

That level of specificity didn't come from the model being smarter. It came from a five-round structure, explicit guidance on what to elicit in each round, and pre-written exemplars showing what a high-quality persona actually looked like. Structure didn't fix a broken model — it raised the ceiling on what a capable one could produce.
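
One way to picture that kind of structure is as plain data: each round has a goal, guidance on what to elicit, and an exemplar. The sketch below is an assumption-heavy illustration, not the actual bot. The round topics and exemplar texts are invented placeholders, and `Round` and `round_prompt` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Round:
    goal: str      # what this round is for
    elicit: str    # what the bot should draw out of the student
    exemplar: str  # a pre-written example of a strong answer


# Invented placeholder rounds; the real bot's rounds are not in the article.
ROUNDS = [
    Round("Who they are", "role, context, daily routine", "Exemplar A"),
    Round("Their problem", "the pain point in their own words", "Exemplar B"),
    Round("How to reach them", "channels they actually use", "Exemplar C"),
    Round("What to say", "the first message you would send", "Exemplar D"),
    Round("Why they'd say no", "objections and deal-breakers", "Exemplar E"),
]


def round_prompt(rounds, index, answers_so_far):
    """Render the instruction for one round, including earlier answers."""
    current = rounds[index]
    history = "\n".join(f"- {a}" for a in answers_so_far) or "- (nothing yet)"
    return (
        f"Round {index + 1} of {len(rounds)}: {current.goal}\n"
        f"Ask the student about: {current.elicit}\n"
        f"Example of a strong answer: {current.exemplar}\n"
        f"What the student has said so far:\n{history}"
    )


print(round_prompt(ROUNDS, 0, []))
```

Keeping rounds as data makes each one reviewable on its own, so a weak persona can be traced to a specific round's guidance or exemplar.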

What This Means If You're Building AI Products

Three things I'd tell any PM building an AI feature right now:

1. Know where your model breaks before you write a single spec. Every model has a complexity threshold where performance degrades, not randomly but predictably. You need to find that threshold for your specific use case before you build around it, not after your users do.
2. Prompting strategy is a product decision. Whether you use structured rounds, exemplars, or a detailed rubric isn't an implementation detail your engineers figure out later. It directly affects what your users experience. These decisions belong in your PRD.
3. "The model is smart enough" is not a spec. Benchmark numbers are averages. Your failure mode lives in the specific conditions of your use case: the emotional context, the complexity level, the interaction length. A model that scores well on a standardized test can still completely miss what your user needs in the moment.

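Finding a complexity threshold can start very simply: group evaluation results by complexity level and look for the first level that drops below the accuracy your product needs. The sketch below assumes each result is a `(level, correct)` pair. The function names, the 0.8 cutoff, and the sample records are all made up for illustration and are not results from the thesis.

```python
from collections import defaultdict


def accuracy_by_level(results):
    """Map each complexity level to its share of correct answers."""
    totals = defaultdict(lambda: [0, 0])  # level -> [correct, attempted]
    for level, correct in results:
        totals[level][0] += int(correct)
        totals[level][1] += 1
    return {level: c / n for level, (c, n) in sorted(totals.items())}


def first_failing_level(results, min_accuracy=0.8):
    """Return the lowest level whose accuracy is below min_accuracy."""
    for level, accuracy in accuracy_by_level(results).items():
        if accuracy < min_accuracy:
            return level
    return None  # no tested level fell below the bar


# Made-up records for illustration only.
sample = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(accuracy_by_level(sample))
print(first_failing_level(sample))
```

An average over all of these records would hide the fact that every miss sits at the higher levels, which is the reason to slice by complexity instead of reporting one number.
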
Start With the Failure

I didn't set out to become the person who stress-tests AI models before building with them. It just turned out that understanding failure made me a better builder.

You don't need a thesis to do this. But you do need to ask the question most PMs skip: not "will this model work?" but "where will it break, and does my product depend on the thing it's bad at?"

That question changes what you build, how you prompt, and what you ship. And in a space moving as fast as AI products, it might be the most valuable habit you can develop.