While benchmarks (and leaderboards) are useful tools, they are only one small facet of evaluating large language models. Often, they're not the best indicators of real-world utility, and I want to dig into why (and what other approaches exist).
My LinkedIn feed is now full of some arcane and obscure open source model I'll likely never hear about again. Since when did leaderboards become a thing engineers actually cared about?
Great clarification regarding the Chatbot Arena. It pops up in my feed regularly (the last time with the Gemini Pro news), but I never tried digging further, so I never realised it measures interactions with real humans in a "non-lab" scenario. Looking more closely, I can see that Bard "only" has 3K votes compared to GPT-4's tens of thousands (for every measured version). So the confidence interval is quite different, and we may well see Bard slip as more votes roll in.
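[Editor's note: the confidence-interval point above can be made concrete with a quick back-of-the-envelope calculation. This is a simple binomial normal approximation with hypothetical vote counts, not Chatbot Arena's actual Elo bootstrap method.]

```python
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for a win rate via the normal approximation."""
    p = wins / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p - margin, p + margin

# Hypothetical: a 50% win rate measured with 3,000 votes vs 30,000 votes.
lo_3k, hi_3k = win_rate_ci(1_500, 3_000)
lo_30k, hi_30k = win_rate_ci(15_000, 30_000)

# With 10x the votes, the interval shrinks by a factor of sqrt(10) ≈ 3.2.
print(f"3K votes:  ±{(hi_3k - lo_3k) / 2:.4f}")   # roughly ±0.018
print(f"30K votes: ±{(hi_30k - lo_30k) / 2:.4f}")  # roughly ±0.006
```

So a model with only a few thousand votes can sit above a heavily voted model purely within the noise, and its ranking may shift as the interval tightens.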
I've now heard from a few people about Bard being much improved lately. So I went and tried it yesterday and didn't quite have an "Aha!" moment of seeing major leaps forward. Then again, I rarely use Bard, so my point of reference isn't that great. (ChatGPT Plus with GPT-4 Turbo did give me better Mermaid code when I compared its output with Bard.)
Great content as usual, Charlie. I will say that when I caught whiff of this Bard news through a Reddit post, my instant thought was how easily these kinds of public eval systems could be manipulated by a dedicated user (or group of users). I have used Bard very recently and it's much improved, but it still makes a ton of errors and often refuses outputs on pretty ridiculous pretenses, then backs down when *very simply* challenged. And yet the model is outstanding at some other things.
I appreciate the mention, by the way!
I love the appropriation of Twain here. Very appropriate.