Can precision and recall metrics work without a ground truth?

This explores whether you can judge retrieval quality — precision (are the returned answers right?) and recall (did you find all the right ones?) — when you have no answer key to check against.

Precision and recall are normally computed against ground truth: the labeled "correct answers" a system's output is compared with. The corpus doesn't tackle the metric definition head-on, but it's full of systems facing exactly this bind: they need to know whether an output is good in a setting where nobody has told them what "good" is. The recurring move is to replace the missing answer key with internal signals the system can generate for itself.

The most direct substitute is the model's own uncertainty. Calibrated token-probability uncertainty beats multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop, using a fraction of the LM and retriever calls (see: Can simple uncertainty estimates beat complex adaptive retrieval?). Confidence can be read even more finely: scoring each reasoning step locally catches breakdowns that a single averaged-confidence number hides entirely (see: Does step-level confidence outperform global averaging for trace filtering?). In both cases the system is approximating "is this likely correct?" without ever seeing the correct answer. It is a stand-in for precision built from self-knowledge rather than labels.
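The gap between a global average and a local score can be sketched in a few lines. This is a minimal illustration, not the method from the cited notes: it assumes per-token log-probabilities are available, uses a sliding window of tokens as a stand-in for a "reasoning step", and runs on made-up probabilities. The function names and the window size are hypothetical, and raw probabilities are used as-is rather than calibrated.

```python
import math
from typing import List


def mean_confidence(token_logprobs: List[float]) -> float:
    """Global score: geometric-mean token probability over the whole trace."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def min_window_confidence(token_logprobs: List[float], window: int = 3) -> float:
    """Local score: the lowest geometric-mean probability over any sliding window.

    The window is an assumed proxy for one reasoning step.
    """
    if len(token_logprobs) <= window:
        return mean_confidence(token_logprobs)
    scores = [
        math.exp(sum(token_logprobs[i:i + window]) / window)
        for i in range(len(token_logprobs) - window + 1)
    ]
    return min(scores)


# Hypothetical trace: mostly confident tokens with one shaky stretch in the middle.
probs = [0.95, 0.96, 0.94, 0.30, 0.25, 0.35, 0.95, 0.97, 0.96, 0.95, 0.96, 0.94]
logprobs = [math.log(p) for p in probs]

global_score = mean_confidence(logprobs)                  # stays fairly high
local_score = min_window_confidence(logprobs, window=3)   # drops sharply

print(f"global={global_score:.2f} local={local_score:.2f}")
```

On this toy trace the averaged score stays well above the local minimum, which is the masking effect described above: nine confident tokens dilute three unconfident ones. Neither number says anything about correctness; both only summarize what the model believed while generating.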

A second family swaps the answer key for a verification gate. Bidirectional RAG only lets a generated answer back into its corpus if it survives entailment checks, source attribution, and novelty detection. That is effectively a precision filter with no ground truth, just consistency tests (see: Can RAG systems safely learn from their own generated answers?). A learned verifier operating on token-token similarity maps can reject "structural near-misses" that look like matches but aren't, separating real hits from lookalikes without a gold label (see: Can verification separate structural near-misses from topical matches?). And agent-based evaluation that actively collects evidence cut judge shift to 0.27%, against 31% for a plain LLM judge, on complex tasks. Its memory module cascaded errors, though, a reminder that proxy evaluators have their own failure modes (see: Can agents evaluate AI outputs more reliably than language models?). The cleanest fallback of all is refusal: when sources are too noisy to trust, grounded generation simply declines to answer, trading recall for precision you can actually defend (see: Can RAG systems refuse to answer without reliable evidence?).
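The gate-then-refuse pattern can be sketched as a list of checks that must all pass before an answer is admitted. This is an assumed shape, not the design of any system cited here: `Candidate`, `gate`, and the three checks are hypothetical names, and the "entailment" check is a crude word-overlap stand-in for what would in practice be an entailment model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    answer: str
    sources: List[str]


Check = Callable[[Candidate], bool]


def has_attribution(c: Candidate) -> bool:
    """Attribution check: the answer must cite at least one source."""
    return len(c.sources) > 0


def is_supported(c: Candidate) -> bool:
    """Toy stand-in for entailment: every answer word must appear in the sources."""
    source_words = set(" ".join(c.sources).lower().split())
    return all(word in source_words for word in c.answer.lower().split())


def make_novelty_check(corpus: List[str]) -> Check:
    """Novelty check: reject answers that already exist verbatim in the corpus."""
    existing = {doc.lower() for doc in corpus}
    return lambda c: c.answer.lower() not in existing


def gate(candidate: Candidate, checks: List[Check]) -> Optional[Candidate]:
    """Admit the candidate only if every check passes; otherwise refuse (None)."""
    return candidate if all(check(candidate) for check in checks) else None


corpus = ["the mill opened in 1887"]
checks = [has_attribution, is_supported, make_novelty_check(corpus)]
evidence = ["records show the mill closed in 1902"]

accepted = gate(Candidate("the mill closed in 1902", evidence), checks)
refused = gate(Candidate("the mill closed in 1910", evidence), checks)
skipped = gate(Candidate("the mill opened in 1887", corpus), checks)
```

Every check here tests agreement between the answer and material the system already holds. If the source itself is wrong, a supported answer still passes, which is why a gate like this filters for consistency rather than truth.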

But the corpus also delivers the warning that makes this question worth asking. Proxy metrics built without ground truth can be confidently, invisibly wrong. Aggregate accuracy masks fluent confident errors that concentrate exactly in the rare, high-harm cases: the metric looks healthy while the dangerous failures hide inside it (see: Why do confident wrong answers hide in standard accuracy metrics?). Worse, the self-signals can be hollow: a model run at zero temperature returns the same answer every time, but that consistency is just one fixed draw from its distribution, not evidence the answer is right (see: Does setting temperature to zero actually make LLM outputs reliable?).
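A small worked example shows how an aggregate can hide a failing slice. The counts below are invented for illustration and do not come from the cited notes. Computing the per-slice number also requires labels for at least an audited sample of the rare cases, which is exactly what a label-free proxy lacks.

```python
from typing import Dict, Tuple

# Hypothetical evaluation counts: (number of cases, number answered correctly).
slices: Dict[str, Tuple[int, int]] = {
    "common": (980, 970),
    "rare_high_harm": (20, 6),
}


def accuracy(n: int, correct: int) -> float:
    """Fraction of cases answered correctly."""
    return correct / n


total_n = sum(n for n, _ in slices.values())
total_correct = sum(c for _, c in slices.values())

overall = accuracy(total_n, total_correct)  # 976 / 1000
per_slice = {name: accuracy(n, c) for name, (n, c) in slices.items()}

print(f"overall={overall:.3f}")
for name, acc in per_slice.items():
    print(f"{name}={acc:.3f}")
```

With these assumed counts the overall figure is 0.976 while the rare, high-harm slice sits at 0.3. Because the rare slice is only 2% of the cases, it barely moves the headline number.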

So the honest answer the corpus points to: yes, you can build precision-and-recall-like signals without ground truth, from uncertainty, step-level confidence, entailment gates, and verifiers. But every one of them measures internal consistency, not truth. They're proxies for the answer key, and a proxy can agree with itself while being wrong. The less obvious lesson is that the danger isn't the missing ground truth; it's how convincingly a confidence-based metric can fake having one.

Sources (8 notes)

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Why do confident wrong answers hide in standard accuracy metrics?

Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
