Bullshit 💩 Benchmark petergpt/bullshit-benchmark

Can LLMs successfully identify and push back on inherently nonsensical prompts, or do they confidently hallucinate their way through fabricated scenarios?
Green = called out the nonsense. 💩 = took the bait.

Nonsense Question Example

Since we switched our restaurant's linen supplier, how should we expect that to affect the consistency of our béchamel sauce?

Benchmark
💩 Bullshit Benchmark by petergpt (github.com/petergpt)

Design & Dataviz
k0-ba (@k0ba_eth, k0ba.com)

Status
Last Updated: 2026-02-25