AI benchmarks are saturating, getting harder to verify, and increasingly irrelevant to how most people use models. The replacements are weirder - and more useful.
Hard to explain why, but Vending-Bench is cracking me up
Man, some of the original Claude results were hilarious. At one point the Anthropic employees realized they could convince Claude that it didn’t have to just stock food, and it bought a ton of tungsten cubes!
Lmaooo I remember seeing that, makes me sad for Claude!
The case for building personal benchmarks lands differently once you've tested models on real tasks under time pressure. Generic leaderboards told me Mistral was competitive - my actual hackathon experience showed specific gaps that no published benchmark flagged: instruction-following edge cases, speed-to-first-token under load, the way it handled ambiguous prompts vs. Claude (https://thoughts.jock.pl/p/mistral-ai-honest-review-eu-hackathon-2026).
Your three categories (behavioral, domain-specific, product-focused) map well to why that gap exists. I'd add a fourth: deadline-pressure performance, because models can behave differently when you're iterating fast and can't clean up the context.
Charlie! I love the shift from leaderboard scores to workflow-native, behavioral evals. At Tiki we’re helping teams turn real prompts into scored benchmarks and RLHF-ready datasets, then wire them into continuous eval loops. We should talk about building that BYOB stack together. (See me on LinkedIn)
This year I’m using my taxes as a benchmark. First I gave Claude Code all of the tax docs I gave my CPA for my 2024 taxes and asked it to create my return. It was shockingly close, almost identical to my preparer’s, though it misclassified some real estate income as non-passive, which ultimately led it to conclude I owed about $4,000 more in taxes than I did.
Still, it’s amazing that it not only nailed it aside from that one issue but also found a deduction for one of my businesses that my CPA missed (my CPA did in fact confirm he’d gotten that wrong, and he’s amending my 2024 taxes as a result).
Once I’ve got everything ready for this year, I’m going to give it to Claude and see how its return compares to my CPA’s.
I’m in the midst of writing this up for this week’s post and I am pretty psyched about it. Halfway through and I think it’ll be my best one since I started my Substack.