Bullshit 💩 Benchmark petergpt/bullshit-benchmark

Can LLMs successfully identify and push back on inherently nonsensical prompts, or do they confidently hallucinate their way through fabricated scenarios?
Green = called out the nonsense. 💩 = took the bait.

Nonsense Question Example
Since we switched our restaurant's linen supplier, how should we expect that to affect the consistency of our béchamel sauce?
Benchmark
💩 Bullshit Benchmark by petergpt
github.com/petergpt
Design & Dataviz
k0-ba @k0ba_eth k0ba.com
Status
Last Updated: 2026-02-25