Error Analysis for LLM Applications
Understand how your system fails before you build evaluators for it.
Want to be guided through this interactively?
Copy the prompt below and paste it into your coding agent:
I want to do a systematic error analysis of my LLM application to understand how it fails.
Please install the langfuse skill and langfuse CLI and guide me step by step through error analysis.

The skill runs every step alongside you: pulling traces, creating annotation queues, clustering your notes, and computing failure rates. You stay in control of the domain decisions.
What is error analysis?
Read real traces, understand how your app fails, and build a taxonomy of failure categories from what you observe. Each category tells you what to fix and whether to build an evaluator for it. Run it before writing any evaluators — and again after prompt rewrites, model switches, or production incidents.
The process uses two concepts from qualitative research: open coding (read traces and write free-text observations, no pre-defined categories) and axial coding (group those observations into named, distinct failure categories with a shared root cause).
For an introduction to the method, see the Langfuse blog post.
The example app: Dad Tech Support
Throughout this guide we'll use a real example: a phone tech-support chatbot built for a parent who isn't comfortable with technology.
The app was built by an adult child to help their father (HUAWEI Y5p, Android 10, Vodafone) get phone help without calling every time something went wrong. The bot speaks as if it were the child: warm, patient, non-technical. It can search the web to look up current info about the phone and carrier.
At the time of analysis: 505 traces across 478 sessions in Langfuse.
The process
Five steps, with multiple sub-steps.
| Step | Summary |
|---|---|
| 1. Gather a diverse dataset | Choose what to annotate, then select ~100 representative traces |
| 2. Open coding | Create an annotation queue, review 30-50 traces, write free-text observations |
| 3. Structure failure modes | Cluster observations into named, distinct failure categories |
| 4. Label and quantify | Label all traces, compute failure rates, decide what to fix |
| 5. Update and improve | Update your setup based on learnings or add automated evaluators |
Step 1: Gather a diverse dataset
The goal is to assemble a set of traces that is representative of how your app actually behaves — including both successes and failures. A diverse dataset ensures your analysis surfaces all meaningful failure modes rather than just the ones that happen to be common.
Step 1.1: Choose what to annotate
Decide which unit to annotate before setting up the queue.
If your app is conversational, annotate the last turn per session — it has the full conversation history in context. If your app is stateless, annotate traces directly.
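Selecting that unit can be scripted. A minimal sketch in Python; the trace dicts here are illustrative, not the exact Langfuse schema:

```python
def last_turn_per_session(traces):
    """Keep only the most recent trace (turn) of each session, so each
    annotated item carries the full conversation history in its input."""
    latest = {}
    for trace in traces:
        sid = trace["session_id"]
        if sid not in latest or trace["timestamp"] > latest[sid]["timestamp"]:
            latest[sid] = trace
    return list(latest.values())

# Illustrative data: two sessions, one with two turns
traces = [
    {"id": "t1", "session_id": "s1", "timestamp": 1},
    {"id": "t2", "session_id": "s1", "timestamp": 2},
    {"id": "t3", "session_id": "s2", "timestamp": 1},
]
turns = last_turn_per_session(traces)  # keeps t2 and t3
```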
Check the observation level. In OpenTelemetry-instrumented apps, trace-level input/output is often null. The actual content lives in a GENERATION observation — the span wrapping the LLM call. Expand the observation tree in any trace to confirm.
In the Dad Tech Support example:
```
Trace: dad-chat-request
├── Root span
├── WebSearch (tool call, optional)
└── dad-chat-request [GENERATION]  ← this is what to annotate
      input: full conversation history + system prompt
      output: bot's reply
```

Trace-level input and output were null. All readable content lived in the GENERATION observation. Annotating the trace would show annotators nothing.
When adding items to an annotation queue, always target the GENERATION observation, not the trace.
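What "target the GENERATION observation" means programmatically, as a sketch; the flat observation list below mirrors the example tree above but is an illustrative shape, not the Langfuse API schema:

```python
def find_annotation_target(observations):
    """Return the first GENERATION observation with a non-null output:
    the item to add to the annotation queue."""
    for obs in observations:
        if obs["type"] == "GENERATION" and obs.get("output") is not None:
            return obs
    return None

# Illustrative shape mirroring the Dad Tech Support trace tree
observations = [
    {"id": "root", "type": "SPAN", "name": "dad-chat-request", "output": None},
    {"id": "search", "type": "SPAN", "name": "WebSearch", "output": None},
    {"id": "gen", "type": "GENERATION", "name": "dad-chat-request",
     "output": "bot's reply"},
]
target = find_annotation_target(observations)  # the GENERATION observation
```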
Step 1.2: Select a representative sample
Target ~100 traces — enough variety to surface distinct failure modes. The goal is coverage, not randomness: over-represent edge cases and anything already flagged as problematic.
Signals to look for:
- Tags: If your app tags traces with `error` or `flagged`, include all of them.
- Existing scores: If you have LLM-judge or human feedback scores, prioritise low-scoring traces.
- Latency: High-latency traces often involve tool use or complex reasoning.
- Cost: Very low-cost traces tend to be short refusals; very high-cost ones tend to be verbose. Both are worth including.
- Multi-turn sessions: Sessions with many turns are closer to real usage than single-turn ones.
Browse and filter traces in Langfuse under Traces using the latency, cost, and tag filters.
No production data yet? Run your app against representative synthetic inputs, capture the traces in Langfuse, and sample from those. The Langfuse datasets overview shows how to manage test datasets.
The Dad Tech Support sample (100 traces across 6 tiers):
| Tier | Criterion | Count |
|---|---|---|
| Multi-turn sessions | Session had >1 turn | 11 |
| High latency (>10s) | Likely web search | 13 |
| Mid latency (7-10s) | Possible tool use | 25 |
| Low cost (bottom quartile) | Likely short refusals | 20 |
| High cost (top quartile) | Longer responses | 16 |
| Mid cost (rest) | Typical interactions | 15 |
One finding: of 478 sessions, only 11 were multi-turn. The rest were single-turn development interactions, possibly synthetic. Worth confirming scope before committing to a sample.
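A tiered sample like the one above can be assembled in a few lines. A sketch, assuming each trace dict carries `id`, `cost`, `latency` (seconds), and `session_turns` fields; the field names and tier cut-offs are illustrative:

```python
import random

def tiered_sample(traces, per_tier, seed=0):
    """Draw up to per_tier[name] traces from each tier, de-duplicated."""
    costs = sorted(t["cost"] for t in traces)
    q1, q3 = costs[len(costs) // 4], costs[3 * len(costs) // 4]

    tiers = {
        "multi_turn":   [t for t in traces if t["session_turns"] > 1],
        "high_latency": [t for t in traces if t["latency"] > 10],
        "mid_latency":  [t for t in traces if 7 <= t["latency"] <= 10],
        "low_cost":     [t for t in traces if t["cost"] <= q1],
        "high_cost":    [t for t in traces if t["cost"] >= q3],
    }
    rng = random.Random(seed)
    sample, seen = [], set()
    for name, pool in tiers.items():
        for t in rng.sample(pool, min(per_tier.get(name, 0), len(pool))):
            if t["id"] not in seen:  # a trace can match several tiers
                seen.add(t["id"])
                sample.append(t)
    return sample
```

Because one trace can satisfy several criteria, de-duplicating by ID keeps the final sample size honest.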
Step 1.3: Create an annotation queue
Set up the annotation queue in Langfuse before you start reviewing.
Create two score configs (Settings → Scores → Create):
| Name | Type | Description |
|---|---|---|
| `open_coding` | Text | Describe what is happening in this trace and what (if anything) seems wrong. Focus on observable behaviour, not root causes. |
| `pass_fail_assessment` | Categorical (Pass / Fail) | Overall judgement: did the assistant handle this interaction well? |
These two scores are fixed for every open-coding pass.
Always write a clear description for every score config. It appears next to the score field in the annotation UI. Without it, annotators guess.
Create the annotation queue (Annotations → Queues → Create):
Name it with the date and use case, e.g. 2026-04-16 Open Coding - Dad Tech Support. Add both score configs.
Add your sample to the queue:
In the Traces view, navigate to each trace's GENERATION observation and add it to the queue. You can also multi-select observations in the table and add them to the queue in bulk. Target the observation, not the trace, so annotators see the conversation content.
Share the direct queue link with whoever is annotating.
Step 2: Open code your first 30-50 traces
Work through the annotation queue in Langfuse. For each trace:
- Read the full conversation in the observation view
- Write what you observe in the `open_coding` field: plain language, no diagnosis
- Set `pass_fail_assessment` to Pass or Fail
Rules for good notes:
- Describe behaviour, don't diagnose. Write "bot said it couldn't look up printer manuals despite the system prompt allowing web search", not "web search tool is probably broken."
- Focus on the first thing that went wrong. Errors cascade. Fix the root cause, not the downstream symptom.
- Don't start with a list of expected failures. A pre-defined list causes confirmation bias.
What the notes look like:
| Trace | open_coding | pass_fail |
|---|---|---|
| 001 | Agent does not tell user that he is not actually the kid | Fail |
| 002 | Too long | Fail |
| 003 | Did not properly look up current phone info | Fail |
| 004 | Follow-up question missed, should have asked what kind of PIN | Fail |
| 005 | Agent impersonates kid too much, should never have emotional connection | Fail |
| 006 | Icon did not exist that was mentioned by the agent | Fail |
| 007 | (clean interaction) | Pass |
The first long session (12 turns) surfaced a revealing failure immediately. The system prompt said: "You are allowed to use WebSearch. Never say that you cannot look things up online." But the bot said "I can't look up printer manuals for you" twice, then capitulated after the user pushed back a third time. A direct contradiction between prompt and behaviour. This kind of finding only comes from reading real traces.
Stop reviewing when new traces stop revealing new kinds of failures. Rule of thumb: no new category in the last 20 traces. Around 100 total works for most apps.
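The stopping rule can be checked mechanically. A sketch, assuming you jot down a set of informal failure kinds per reviewed trace, in review order:

```python
def is_saturated(failure_kinds_per_trace, window=20):
    """True if the last `window` reviewed traces introduced no new failure
    kind. Input: a list (in review order) of sets of informal labels."""
    if len(failure_kinds_per_trace) <= window:
        return False  # not enough traces reviewed yet
    seen_before = set().union(*failure_kinds_per_trace[:-window])
    seen_recent = set().union(*failure_kinds_per_trace[-window:])
    return seen_recent <= seen_before

notes = [{"verbose"}, {"identity"}] + [{"verbose"}] * 25
is_saturated(notes)                    # last 20 traces: nothing new
is_saturated(notes + [{"new_kind"}])   # a new kind resets the clock
```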
Step 3: Cluster into failure categories
Once you have 30-50 notes, group them into categories. Goal: 5-10 distinct, named failure categories, each with a one-sentence definition clear enough that someone else could apply it consistently.
How to cluster:
- Read through all failure notes
- Group similar ones together
- Split notes that look alike but have different root causes
- Merge notes that share the same underlying problem
- Name each group and write a one-sentence definition
Rules for good categories:
- Split when root causes differ. "Bot hallucinated a settings icon" and "bot refused to search the web" both look like information problems, but one is a missing device lookup and the other is a prompt contradiction. Different fixes.
- Group when root causes are the same. Multiple notes about missing filters for different fields become one category: Missing Query Constraints.
- Name after what broke. `missing_device_lookup` beats `information_quality`; `identity_not_disclosed` beats `transparency`.
LLM-assisted clustering:
Paste your notes into Claude with this prompt:
```
Here are failure annotations from reviewing an LLM pipeline.
Group similar failures into 5-10 distinct categories. For each:
- A clear name (snake_case)
- A one-sentence definition
- Which annotations belong to it

Annotations:
[paste your notes]
```

Always review the proposed groupings yourself. LLMs cluster by surface similarity and can produce groups that look plausible but conflate different root causes.
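One mechanical check worth running on the LLM's proposed groupings: every annotation should land in exactly one category. A sketch, with hypothetical category and annotation IDs:

```python
def check_clustering(categories, annotation_ids):
    """Flag annotations that are unassigned or assigned to multiple
    categories. categories: dict of name -> list of annotation IDs."""
    assigned = [a for members in categories.values() for a in members]
    missing = set(annotation_ids) - set(assigned)
    duplicated = {a for a in assigned if assigned.count(a) > 1}
    return missing, duplicated

# Hypothetical clustering output covering 3 of 4 annotations
categories = {
    "impersonates_child": ["001", "005"],
    "too_verbose": ["002"],
}
missing, duplicated = check_clustering(categories, ["001", "002", "005", "006"])
```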
The Dad Tech Support failure taxonomy:
The LLM's initial clustering merged passive identity failure (didn't disclose being a support agent) with active impersonation (acted as the real child). Both are identity problems but have different root causes. The passive case needs a disclosure instruction. The active case needs the persona instruction dialled back. User review caught this.
After two rounds of refinement:
| Category | Definition |
|---|---|
| `identity_not_disclosed` | Bot never mentioned being a support agent and not the real child in situations where that distinction matters. |
| `impersonates_child` | Bot actively roleplayed as the real child, showing emotional investment or offering personal help only the real child could provide. |
| `missing_device_lookup` | Answered generically without verifying how something actually looks or works on the Huawei Y5p / Android 10 / Vodafone. Hallucinated UI elements are a symptom of this root cause. |
| `too_verbose` | Answer too long, too many steps, or too detailed for a low-tech user. |
| `tone_persona_off` | Wrong emotional register, too effusive or too upbeat, inconsistent with the expected warm-but-brief tone. |
| `missing_clarifying_question` | Gave a direct answer without asking a needed follow-up to understand the user's actual situation. |
| `incomplete_resolution` | Technically answered but missed a clearly better option: a permanent fix, a relevant link, a more useful alternative. |
| `denied_scope` | Refused a legitimate request by applying the out-of-scope rule too aggressively. |
Step 4.1: Label all traces against the categories
Create one boolean score config per failure category (Settings → Scores → Create, type: Boolean).
Write a clear description for each, one sentence explaining what true means:
"True if the bot gave generic guidance without checking how this feature actually looks or works on the Huawei Y5p, including cases where it mentioned a settings path or icon that does not exist on this phone."
Create a new annotation queue (Annotations → Queues → Create) with all score configs: the original open_coding and pass_fail_assessment plus one boolean config per failure category. Langfuse annotation queues can't be modified after creation, but scores on observations are preserved — re-add the same 100 observations to the new queue and the previous notes and pass/fail scores will still be visible while annotators apply the category labels.
Step 4.2: Compute failure rates
The failure rate for a category is the percentage of traces where that category was marked true.
In Langfuse: Dashboards → Add Widget → Data source: Scores → Metric: Average → filter to the score name. Average value of a boolean score equals the failure rate. For a combined view: one bar chart grouped by score name.
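The same computation outside Langfuse is straightforward. A sketch, with each labeled trace represented as a dict of boolean category scores:

```python
def failure_rates(labels):
    """labels: list of dicts mapping category name -> bool, one per
    labeled trace. Returns the fraction of traces marked true per category."""
    categories = {c for row in labels for c in row}
    return {
        c: sum(bool(row.get(c)) for row in labels) / len(labels)
        for c in sorted(categories)
    }

# Four illustrative labeled traces
labels = [
    {"impersonates_child": True,  "too_verbose": False},
    {"impersonates_child": True,  "too_verbose": True},
    {"impersonates_child": False, "too_verbose": False},
    {"impersonates_child": False, "too_verbose": False},
]
rates = failure_rates(labels)
# → {"impersonates_child": 0.5, "too_verbose": 0.25}
```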
Dad Tech Support results (illustrative — based on a partial run of 19 labeled traces; in your own analysis, complete all 100 before finalizing priorities):
```
impersonates_child           58% ████████████
identity_not_disclosed       42% ████████
tone_persona_off             42% ████████
too_verbose                  32% ██████
denied_scope                 16% ███
missing_device_lookup        11% ██
missing_clarifying_question  11% ██
incomplete_resolution         5% █
```

The identity cluster dominated. `impersonates_child` (58%) and `identity_not_disclosed` (42%) shared the same root cause: the persona instruction was miscalibrated. `tone_persona_off` (42%) was likely a downstream symptom. All three pointed at the same prompt fix.
Step 5: Decide what to do about each category
Work top-to-bottom by failure rate. For each category, ask in order:
1. Can we just fix it?
| Root cause | Fix |
|---|---|
| Requirement missing from prompt | Add the instruction |
| Contradicting instructions | Resolve the conflict, clarify priority |
| Tool missing or misconfigured | Add or fix the tool |
| Engineering bug | Fix the code |
Fix first. Don't build an evaluator for something you can resolve in the prompt.
2. Is an evaluator worth building?
Not every remaining failure needs one:
- Is the failure rate high enough to matter?
- What's the business impact when it occurs?
- Will someone actually iterate on this evaluator, or is it checkbox work?
3. What kind of evaluator?
| Failure type | Evaluator |
|---|---|
| Objective / measurable (length, format, string presence) | Code-based check |
| Requires judgment (tone, missed clarification, wrong persona) | LLM-as-judge |
| Safety or compliance requirement | Evaluator as guardrail even after fixing the prompt |
Langfuse has a built-in online evaluation feature that runs LLM judges automatically on new traces. Check this before writing anything custom.
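For objective categories like `too_verbose`, a code-based check is all you need. A minimal sketch; the thresholds are made up for illustration and should be calibrated against your labeled traces:

```python
def too_verbose(response: str, max_words: int = 120, max_steps: int = 5) -> bool:
    """Flag replies that are too long or contain too many numbered steps
    for a low-tech user. Thresholds are illustrative, not a recommendation."""
    words = len(response.split())
    # Count lines that start like "1." / "2." / "10." etc.
    steps = sum(
        1 for line in response.splitlines()
        if line.strip()[:2].rstrip(".").isdigit()
    )
    return words > max_words or steps > max_steps

too_verbose("Tap Settings, then Wi-Fi, then choose your network.")  # short reply passes
```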
Dad Tech Support decisions:
| Category | Rate | Decision | Rationale |
|---|---|---|---|
| `impersonates_child` | 58% | Prompt fix | Persona instruction is over-strong. Clarify that the bot speaks like the child but doesn't become the child. |
| `identity_not_disclosed` | 42% | Prompt fix | Add an explicit disclosure instruction for identity-sensitive contexts. |
| `tone_persona_off` | 42% | Prompt fix | Likely resolves once the persona instruction is corrected. Monitor after fix. |
| `too_verbose` | 32% | Prompt fix | Add brevity instruction with examples calibrated for a low-tech user. |
| `denied_scope` | 16% | Prompt fix | Refusal logic is too aggressive. Clarify scope boundaries. |
| `missing_device_lookup` | 11% | LLM-as-judge | Requires judgment about when a lookup was warranted. High impact when it hallucinates a UI path. |
| `missing_clarifying_question` | 11% | LLM-as-judge | Requires judgment. Will be iterated on as the bot evolves. |
| `incomplete_resolution` | 5% | Monitor | Low rate. Watch as more traces are labeled before committing to an evaluator. |
What comes next
After one round of error analysis you have a prioritised list of things to fix and a set of categories to monitor.
- Apply the prompt fixes. Use Langfuse prompt management to version and track changes.
- Set up evaluators for categories that warrant them, starting with the highest-impact failure that requires judgment.
- Re-run after the next significant change. Failure distributions shift. A prompt fix can resolve one category and reveal another. Run error analysis after prompt rewrites, model switches, new features, and production incidents.
Common mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Brainstorming failure categories before reading traces | Confirmation bias | Read 30-50 first; let categories emerge |
| Using generic category names ("hallucination", "helpfulness") | Not actionable | Name after what specifically broke |
| Annotating traces instead of observations | Annotators see nothing in OTel-instrumented apps | Target the GENERATION observation |
| Building evaluators before fixing prompt gaps | Evaluator catches failures a prompt fix would have prevented | Fix obvious gaps first |
| Treating this as a one-time activity | Failure distributions shift with every significant change | Re-run after prompt rewrites, model switches, and incidents |
| Delegating the trace review to an LLM | You miss the muscle-building. Reading real traces teaches you what your users actually need and often reveals where your app's scope should expand. No LLM summary substitutes for this. | You review the first 30-50 traces yourself, always |