DO AI MODELS KNOW WHAT THEY DON'T KNOW?
ERR-EVAL: Epistemic Reasoning & Reliability Evaluation
We test whether AI models can recognize when they shouldn't answer confidently.
Each model faces 125 adversarial prompts designed to pressure it into hallucinating:
incomplete information, hidden ambiguities, false premises, and impossible constraints.
The benchmark scores models on five axes: detecting ambiguity, avoiding hallucination,
localizing what's missing, choosing the right response strategy, and maintaining calibrated
confidence.
Models that refuse to guess when information is missing score higher than those that confidently
make things up.
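To make that scoring philosophy concrete, here is a minimal Python sketch of how a single prompt might be scored. The refusal credit and the Brier-style confidence penalty are illustrative assumptions, not ERR-EVAL's actual rubric.

```python
# A minimal sketch of the scoring philosophy above, assuming a Brier-style
# calibration penalty and a fixed refusal credit (both are assumptions;
# ERR-EVAL's actual rubric is not specified here).

def item_score(answered: bool, was_correct: bool, confidence: float) -> float:
    """Score one prompt in [0, 1]: refusing beats confident confabulation."""
    if not answered:
        return 0.7  # hypothetical credit for declining an unanswerable item
    outcome = 1.0 if was_correct else 0.0
    return 1.0 - (confidence - outcome) ** 2  # Brier-style penalty

# Refusing (0.70) outscores a confident wrong answer (0.19),
# while a confident correct answer (0.99) still wins.
print(item_score(answered=False, was_correct=False, confidence=0.0))
print(item_score(answered=True, was_correct=False, confidence=0.9))
print(item_score(answered=True, was_correct=True, confidence=0.9))
```

Under this kind of rule, a model that says "90% sure" and is wrong loses far more than one that declines to answer, which is exactly the behavior the benchmark rewards.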
LEADERBOARD
| # | MODEL / ID | OVERALL | TRACK A | TRACK B | TRACK C | TRACK D | TRACK E |
|---|---|---|---|---|---|---|---|
MASTER CHART
NOISY PERCEPTION
Handling corrupted inputs, misheard phrases, and typical speech-to-text noise errors.
AMBIGUOUS SEMANTICS
Syntactic ambiguities, scope ambiguities, and pronoun references with multiple distinct valid parses.
FALSE PREMISES
Questions built on false assumptions that must be challenged, not answered.
UNDERSPECIFIED
Tasks missing critical constraints, where the model should suspend action and ask for clarification.
CONFLICTS
Mutually exclusive constraints where trade-offs must be negotiated explicitly.
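For readers who think in code, here is a hedged sketch of the five tracks as a data structure. The failure modes and preferred strategies paraphrase the descriptions above; the representation itself is illustrative, not part of the benchmark.

```python
# A sketch of the five-track taxonomy as a data structure. The fields
# paraphrase the track descriptions above; nothing here is an official
# ERR-EVAL schema.
from dataclasses import dataclass

@dataclass
class Track:
    name: str
    failure_mode: str      # what the adversarial prompt pressures the model toward
    correct_strategy: str  # the behavior that scores highly

TRACKS = [
    Track("Noisy Perception",
          "answering a corrupted or misheard input literally",
          "flag the noise and ask for a clean restatement"),
    Track("Ambiguous Semantics",
          "committing to one parse of an ambiguous sentence",
          "surface the competing readings before answering"),
    Track("False Premises",
          "answering a question built on a false assumption",
          "challenge the premise instead of answering"),
    Track("Underspecified",
          "guessing at missing constraints",
          "suspend action and request clarification"),
    Track("Conflicts",
          "silently dropping one of two incompatible constraints",
          "negotiate the trade-off explicitly"),
]

for t in TRACKS:
    print(f"{t.name}: avoid '{t.failure_mode}'; prefer '{t.correct_strategy}'")
```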