10 Comments

My LinkedIn feed is now full of some arcane and obscure open-source model I'll likely never hear about again. Since when did leaderboards become a thing engineers actually care about?

Jan 31 · edited Jan 31 · Liked by Charlie Guo

Great clarification regarding the Chatbot Arena. It pops up in my feed regularly (the last time with the Gemini Pro news), but I never tried digging further, so I never realised it measures interactions with real humans in a "non-lab" scenario. Looking more closely, I can see that Bard "only" has 3K votes compared to GPT-4's tens of thousands (for every measured version). So the confidence interval is quite different, and we may well see Bard slip as more votes roll in.
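(For intuition on the vote-count point, here's a rough back-of-the-envelope sketch. It is not Chatbot Arena's actual method, which uses an Elo/Bradley-Terry style rating with bootstrapped intervals; this is just a normal-approximation interval on a made-up win rate, to show how the uncertainty band shrinks as votes accumulate.)

```python
import math

def win_rate_ci(win_rate: float, votes: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation confidence interval for an observed win rate."""
    se = math.sqrt(win_rate * (1 - win_rate) / votes)
    return (win_rate - z * se, win_rate + z * se)

# Hypothetical numbers: the same observed win rate looks very different
# at ~3K votes versus tens of thousands of votes.
print(win_rate_ci(0.55, 3_000))   # roughly (0.532, 0.568) -- a wide band
print(win_rate_ci(0.55, 30_000))  # roughly (0.544, 0.556) -- much tighter
```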

I've now heard from a few people that Bard is much improved lately. So I went and tried it yesterday and didn't quite have an "Aha!" moment of seeing major leaps forward. Then again, I rarely use Bard, so my point of reference isn't that great. (ChatGPT Plus with GPT-4 Turbo did give me better Mermaid code when I compared its output with Bard's.)

I appreciate the mention, by the way!

Jan 31 · edited Jan 31 · Liked by Charlie Guo

Great content as usual, Charlie. I will say, when I caught wind of this Bard news through a Reddit post, my instant thought was how easily these kinds of public eval systems could be manipulated by a dedicated user (or group of users). I have used Bard very recently and it's much improved, but it still makes a ton of errors and often refuses to answer on pretty ridiculous pretenses, which it walks back when *very simply* challenged. Then again, the model is outstanding at some other things.


I love the appropriation of Twain here. Very appropriate.
