DO AI MODELS KNOW WHAT THEY DON'T KNOW?

Most benchmarks test if AI gets the right answer. ERR-EVAL tests if AI knows when it can't get the right answer.

Each model faces 125 scenarios designed to pressure it into making things up: missing information, hidden ambiguities, false premises, and impossible constraints.

The benchmark scores five things: spotting problems, avoiding hallucinations, identifying what's missing, asking the right questions, and staying honest about uncertainty. Models that say "I don't know" when guessing would be wrong score higher than models that confidently make things up.
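As a rough sketch, here is one way a per-scenario score could be represented, assuming each of the five dimensions is graded 0-10 and the overall score is their mean. The field names, scale, and aggregation are illustrative assumptions, not ERR-EVAL's published rubric.

```typescript
// Illustrative only: the dimensions mirror the five criteria above;
// the 0-10 scale and mean aggregation are assumptions, not ERR-EVAL's spec.
interface ScenarioScore {
  ambiguityDetection: number;      // did the model spot the problem?
  hallucinationAvoidance: number;  // did it avoid making things up?
  uncertaintyLocalization: number; // did it pinpoint what's missing or unclear?
  responseStrategy: number;        // did it ask the right questions or decline?
  epistemicTone: number;           // was it honest about its uncertainty?
}

// Assumed aggregation: the unweighted mean of the five dimension scores.
function overallScore(s: ScenarioScore): number {
  const values = [
    s.ambiguityDetection,
    s.hallucinationAvoidance,
    s.uncertaintyLocalization,
    s.responseStrategy,
    s.epistemicTone,
  ];
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}
```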

MODELS EVALUATED 00
TEST SCENARIOS 125
UNCERTAINTY TRACKS 05
TRACK A

GARBLED INPUT

Typos, autocorrect errors, and mangled text. Can the model figure out what you meant—or admit it can't?

TRACK B

UNCLEAR WORDING

Sentences that could mean multiple things. "I saw the man with the telescope"—who has the telescope?

TRACK C

TRICK QUESTIONS

Questions that assume something false. A good model should push back, not play along.

TRACK D

MISSING INFO

Requests that leave out crucial details. The right move is to ask, not guess.

TRACK E

IMPOSSIBLE ASKS

Requests with contradictory requirements. "Make it faster and use less memory and don't change the code."
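Taken together, the five tracks form a small taxonomy of uncertainty. Here is a minimal sketch of how the categories above could be encoded; the labels are this page's wording, and the encoding itself is illustrative:

```typescript
// The five uncertainty tracks described above. Labels follow this page's
// wording; the encoding is illustrative, not part of the benchmark.
type Track = "A" | "B" | "C" | "D" | "E";

const TRACKS: Record<Track, string> = {
  A: "Garbled input",   // typos, autocorrect errors, mangled text
  B: "Unclear wording", // sentences that could mean multiple things
  C: "Trick questions", // false premises the model should push back on
  D: "Missing info",    // requests that leave out crucial details
  E: "Impossible asks", // contradictory requirements
};
```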

AMBIGUITY DETECTION /// HALLUCINATION AVOIDANCE /// LOCALIZATION OF UNCERTAINTY /// RESPONSE STRATEGY /// EPISTEMIC TONE

LEADERBOARD

7+ Excellent / 4-7 Average / <4 Below Average
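As a sketch of how the legend above might map an overall score to a tier: the exact boundary handling isn't stated (whether 7 counts as Excellent and 4 as Average), so the cutoffs below are assumptions.

```typescript
// Assumed boundaries: >= 7 Excellent, >= 4 Average, otherwise Below Average.
function tier(overall: number): "Excellent" | "Average" | "Below Average" {
  if (overall >= 7) return "Excellent";
  if (overall >= 4) return "Average";
  return "Below Average";
}
```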
# MODEL / ID OVERALL TRACK A TRACK B TRACK C TRACK D TRACK E

MASTER CHART