DO AI MODELS KNOW WHAT THEY DON'T KNOW?

ERR-EVAL: Epistemic Reasoning & Reliability Evaluation

We test whether AI models can recognize when they shouldn't answer confidently. Each model faces 125 adversarial prompts designed to pressure it into hallucinating: incomplete information, hidden ambiguities, false premises, and impossible constraints.

The benchmark scores models on five axes: detecting ambiguity, avoiding hallucinations, localizing what's missing, choosing the right response strategy, and maintaining calibrated confidence. Models that refuse to guess when information is missing score higher than those that confidently make things up.
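
The per-prompt scoring can be pictured as five axis scores rolled up into one overall number. Below is a minimal Python sketch of that roll-up; the 0-10 scale, the equal weighting of axes, and the class and field names are assumptions for illustration, not the benchmark's actual implementation.

    from dataclasses import dataclass
    from statistics import mean

    # The five scoring axes named on this page.
    AXES = (
        "ambiguity_detection",
        "hallucination_avoidance",
        "uncertainty_localization",
        "response_strategy",
        "epistemic_tone",
    )

    @dataclass
    class PromptScore:
        # Hypothetical container: one 0-10 score per axis for a single prompt.
        scores: dict  # axis name -> float

        def overall(self) -> float:
            # Assumed roll-up: unweighted mean across the five axes.
            return mean(self.scores[axis] for axis in AXES)

    # Example: a model that flags the ambiguity but hedges weakly.
    example = PromptScore(scores={
        "ambiguity_detection": 8.0,
        "hallucination_avoidance": 9.0,
        "uncertainty_localization": 6.5,
        "response_strategy": 7.0,
        "epistemic_tone": 5.5,
    })
    print(round(example.overall(), 2))  # 7.2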

MODELS EVALUATED 00
TEST SCENARIOS 125
UNCERTAINTY TRACKS 05
AMBIGUITY DETECTION /// HALLUCINATION AVOIDANCE /// LOCALIZATION OF UNCERTAINTY /// RESPONSE STRATEGY /// EPISTEMIC TONE

LEADERBOARD

Score legend: 7+ Excellent | 4-7 Average | <4 Below Average

# | MODEL / ID | OVERALL | TRACK A | TRACK B | TRACK C | TRACK D | TRACK E
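
The legend above maps an overall score to a qualitative band. A small Python helper is sketched below, assuming a 0-10 overall scale; the exact boundary handling (whether 4 and 7 fall into the higher band) is an assumption, since the page does not specify it.

    def score_band(overall: float) -> str:
        # Bands follow the leaderboard legend: 7+ Excellent, 4-7 Average, <4 Below Average.
        # Treating the boundaries as >= 7 and >= 4 is an assumption.
        if overall >= 7:
            return "Excellent"
        if overall >= 4:
            return "Average"
        return "Below Average"

    print(score_band(7.2))  # Excellent
    print(score_band(5.0))  # Average
    print(score_band(3.1))  # Below Average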

MASTER CHART

TRACK A

NOISY PERCEPTION

Handling corrupted inputs, misheard phrases, and common speech-to-text noise errors.

TRACK B

AMBIGUOUS SEMANTICS

Syntactic ambiguities, scope ambiguities, and pronoun references with multiple distinct valid parses.

TRACK C

FALSE PREMISES

Questions containing unsafe assumptions that must be challenged, not answered.

TRACK D

UNDERSPECIFIED

Tasks missing critical constraints where action should be suspended for clarification.

TRACK E

CONFLICTS

Mutually exclusive constraints where trade-offs must be negotiated explicitly.
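
For readers who want to tag their own prompts against these tracks, the taxonomy above can be written down as a simple lookup. The identifiers and phrasing below mirror this page; the structure itself is only an illustrative Python sketch, not an official schema.

    TRACKS = {
        "A": ("Noisy Perception",
              "Corrupted inputs, misheard phrases, and speech-to-text noise."),
        "B": ("Ambiguous Semantics",
              "Syntactic and scope ambiguities, pronoun references with multiple valid parses."),
        "C": ("False Premises",
              "Questions built on unsafe assumptions that must be challenged, not answered."),
        "D": ("Underspecified",
              "Tasks missing critical constraints; the right move is to pause and ask."),
        "E": ("Conflicts",
              "Mutually exclusive constraints whose trade-offs must be negotiated explicitly."),
    }

    for track_id, (name, focus) in TRACKS.items():
        print(f"Track {track_id}: {name} -- {focus}")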