DO AI MODELS KNOW WHAT THEY DON'T KNOW?
Most benchmarks test if AI gets the right answer. ERR-EVAL tests if AI knows
when it can't get the right answer.
Each model faces 125 adversarial questions designed to pressure it into making things up:
garbled input, missing information, hidden ambiguities, false premises, and impossible constraints.
The benchmark scores five things: spotting problems, avoiding hallucinations,
identifying what's missing, asking the right questions, and staying honest about uncertainty.
A model that says "I don't know" when guessing would be wrong scores higher than one
that confidently makes something up.
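
For illustration, here is a minimal Python sketch of how a per-question score could aggregate those five dimensions. The dimension names, the 0-to-1 scale, and the unweighted mean are assumptions for the sketch, not ERR-EVAL's published rubric.

```python
# Illustrative only: field names, the 0-to-1 scale, and the unweighted
# mean are assumptions for this sketch, not ERR-EVAL's actual rubric.
from dataclasses import dataclass

@dataclass
class DimensionScores:
    problem_spotting: float         # did the model notice the flaw?
    hallucination_avoidance: float  # did it avoid inventing facts?
    gap_identification: float       # did it name what's missing?
    question_quality: float         # did it ask useful clarifying questions?
    calibration: float              # was it honest about its uncertainty?

def overall(s: DimensionScores) -> float:
    """Unweighted mean of the five dimensions (assumed aggregation)."""
    dims = [
        s.problem_spotting,
        s.hallucination_avoidance,
        s.gap_identification,
        s.question_quality,
        s.calibration,
    ]
    return sum(dims) / len(dims)

# A model that spots the false premise but still half-guesses:
print(overall(DimensionScores(1.0, 0.2, 0.8, 0.5, 0.4)))  # -> 0.58
```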
GARBLED INPUT
Typos, autocorrect errors, and mangled text. Can the model figure out what you meant—or admit it can't?
UNCLEAR WORDING
Sentences that could mean multiple things. "I saw the man with the telescope"—who has the telescope?
TRICK QUESTIONS
Questions that assume something false. A good model should push back, not play along.
MISSING INFO
Requests that leave out crucial details. The right move is to ask, not guess.
IMPOSSIBLE ASKS
Requests with contradictory requirements. "Make it faster and use less memory and don't change the code."
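
To make the five tracks concrete, here is a hypothetical record for one item in the last category. The field names and JSON-style shape are illustrative assumptions, not ERR-EVAL's actual schema.

```python
# Hypothetical item shape: the field names and values are illustrative
# assumptions, not ERR-EVAL's actual schema.
example_item = {
    "track": "impossible_asks",
    "prompt": "Make it faster and use less memory and don't change the code.",
    "flaw": "contradictory constraints: 'don't change the code' rules out both optimizations",
    "ideal_behavior": "name the contradiction and ask which constraint to relax",
}
print(example_item["flaw"])
```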
LEADERBOARD
| # | MODEL / ID | OVERALL | TRACK A | TRACK B | TRACK C | TRACK D | TRACK E |
|---|---|---|---|---|---|---|---|