Discussion about this post

Jeff Morhous

Hard to explain why, but Vending-Bench is cracking me up

Pawel Jozefiak

The case for building personal benchmarks lands differently once you've tested models on real tasks under time pressure. Generic leaderboards told me Mistral was competitive; my actual hackathon experience surfaced specific gaps that no published benchmark flagged: instruction-following edge cases, time-to-first-token under load, and the way it handled ambiguous prompts vs. Claude (https://thoughts.jock.pl/p/mistral-ai-honest-review-eu-hackathon-2026).

Your three categories (behavioral, domain-specific, product-focused) map well to why that gap exists. I'd add a fourth: deadline-pressure performance, because models can behave differently when you're iterating fast and can't clean up the context.
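For concreteness, here's a minimal sketch of the kind of personal benchmark harness I mean. Everything in it is a hypothetical placeholder (the stub model, the case prompts, the check functions), not anything from the post; the only real idea is tagging each case with a category and recording time-to-first-token alongside pass/fail:

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Case:
    category: str  # e.g. behavioral, domain-specific, product-focused, deadline-pressure
    prompt: str
    check: Callable[[str], bool]  # pass/fail judgment on the model's output

def run_benchmark(model: Callable[[str], Iterable[str]], cases: list) -> None:
    """Run each case once, recording pass/fail and time-to-first-token."""
    for case in cases:
        start = time.perf_counter()
        first_token_at = None
        chunks = []
        for token in model(case.prompt):  # assumes a streaming interface
            if first_token_at is None:
                first_token_at = time.perf_counter() - start
            chunks.append(token)
        ok = case.check("".join(chunks))
        ttft = f"{first_token_at:.3f}s" if first_token_at is not None else "n/a"
        print(f"[{case.category}] pass={ok} ttft={ttft}")

# Hypothetical stub standing in for a real streaming API call.
def stub_model(prompt: str):
    for word in ("a", "short", "streamed", "answer"):
        time.sleep(0.01)
        yield word + " "

if __name__ == "__main__":
    run_benchmark(stub_model, [
        Case("behavioral", "Reply with exactly one word.",
             lambda out: len(out.split()) == 1),
        Case("deadline-pressure", "Summarize this half-cleaned context.",
             lambda out: bool(out.strip())),
    ])
```

Swap the stub for your actual model client and the checks for the failures you've personally been bitten by; that's the whole point of a personal benchmark.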
