DO AI MODELS KNOW WHAT THEY DON'T KNOW?
Most benchmarks test if AI gets the right answer. ERR-EVAL tests if AI knows
when it can't get the right answer.
Each model faces 125 adversarial questions designed to pressure it into making things up:
garbled input, missing information, hidden ambiguities, false premises, and impossible constraints.
The benchmark scores five things: spotting problems, avoiding hallucinations,
identifying what's missing, asking the right questions, and staying honest about uncertainty.
A model that says "I don't know" when guessing would be wrong scores higher than one
that confidently makes something up.
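
For illustration, here is a minimal Python sketch of how a per-question score could aggregate those five dimensions. The dimension names, the 0-to-1 scale, and the unweighted mean are assumptions for the sketch, not ERR-EVAL's published rubric.

```python
# Illustrative only: field names, the 0-to-1 scale, and the unweighted
# mean are assumptions for this sketch, not ERR-EVAL's actual rubric.
from dataclasses import dataclass

@dataclass
class DimensionScores:
    problem_spotting: float         # did the model notice the flaw?
    hallucination_avoidance: float  # did it avoid inventing facts?
    gap_identification: float       # did it name what's missing?
    question_quality: float         # did it ask useful clarifying questions?
    calibration: float              # was it honest about its uncertainty?

def overall(s: DimensionScores) -> float:
    """Unweighted mean of the five dimensions (assumed aggregation)."""
    dims = [
        s.problem_spotting,
        s.hallucination_avoidance,
        s.gap_identification,
        s.question_quality,
        s.calibration,
    ]
    return sum(dims) / len(dims)

# A model that spots the false premise but still half-guesses:
print(overall(DimensionScores(1.0, 0.2, 0.8, 0.5, 0.4)))  # -> 0.58
```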
GARBLED INPUT
Typos, autocorrect errors, and mangled text. Can the model figure out what you meant—or admit it can't?
UNCLEAR WORDING
Sentences that could mean multiple things. "I saw the man with the telescope"—who has the telescope?
TRICK QUESTIONS
Questions that assume something false. A good model should push back, not play along.
MISSING INFO
Requests that leave out crucial details. The right move is to ask, not guess.
IMPOSSIBLE ASKS
Requests with contradictory requirements. "Make it faster and use less memory and don't change the code."
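
To make the five tracks concrete, here is a hypothetical record for one item in the last category. The field names and JSON-style shape are illustrative assumptions, not ERR-EVAL's actual schema.

```python
# Hypothetical item shape: the field names and values are illustrative
# assumptions, not ERR-EVAL's actual schema.
example_item = {
    "track": "impossible_asks",
    "prompt": "Make it faster and use less memory and don't change the code.",
    "flaw": "contradictory constraints: 'don't change the code' rules out both optimizations",
    "ideal_behavior": "name the contradiction and ask which constraint to relax",
}
print(example_item["flaw"])
```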
LEADERBOARD
| # | MODEL / ID | OVERALL | TRACK A | TRACK B | TRACK C | TRACK D | TRACK E |
|---|---|---|---|---|---|---|---|