6 Comments
Jeff Morhous

Hard to explain why, but Vending-Bench is cracking me up

Charlie Guo

man, some of the original Claude results were hilarious. at one point the Anthropic employees realized they could convince Claude that it didn’t have to just stock food, and it bought a ton of tungsten cubes!

Jeff Morhous

Lmaooo I remember seeing that, makes me sad for Claude!

Pawel Jozefiak

The case for building personal benchmarks lands differently once you've tested models on real tasks under time pressure. Generic leaderboards told me Mistral was competitive - my actual hackathon experience showed specific gaps that no published benchmark flagged: instruction-following edge cases, speed-to-first-token under load, the way it handled ambiguous prompts vs. Claude (https://thoughts.jock.pl/p/mistral-ai-honest-review-eu-hackathon-2026).

Your three categories (behavioral, domain-specific, product-focused) map well to why that gap exists. I'd add a fourth: deadline-pressure performance, because models can behave differently when you're iterating fast and can't clean up the context.

Michael Romrell

Charlie! I love the shift from leaderboard scores to workflow-native, behavioral evals. At Tiki we’re helping teams turn real prompts into scored benchmarks and RLHF-ready datasets, then wire them into continuous eval loops. We should talk about building that BYOB stack together. (See me on LinkedIn)

Alex Willen

This year I’m using my taxes as a benchmark - first I gave Claude Code all of the tax docs I gave to my CPA for my 2024 taxes and asked it to create my return. It was shockingly close, almost identical to my preparer’s, though it misclassified some real estate income as non-passive, which ultimately led it to conclude I owed about $4,000 more in taxes than I did.

Still, it’s amazing that it not only nailed it aside from that one issue but also found a deduction for one of my businesses that my CPA missed (my CPA did in fact confirm that leaving it out was wrong, and he’s amending my 2024 taxes as a result).

Once I’ve got everything ready for this year, I’m going to give it to Claude and see how its return compares to my CPA’s.

I’m in the midst of writing this up for this week’s post and I am pretty psyched about it. Halfway through and I think it’ll be my best one since I started my Substack.