While benchmarks (and leaderboards) are useful tools, they are only one small facet of evaluating large language models. Often, they're not the best indicators of real-world utility, and I want to dig into why (and what other approaches exist).
My LinkedIn feed is now full of some arcane and obscure open source model I'll likely never hear about again. Since when did leaderboards become a thing engineers actually cared about?
Great clarification regarding the Chatbot Arena. It pops up in my feed regularly (the last time with the Gemini Pro news), but I never tried digging further, so I never realised it measures interactions with real humans in a "non-lab" scenario. Looking more closely, I can see that Bard "only" has 3K votes compared to GPT-4's tens of thousands (for every measured version). So the confidence interval is quite different, and we may well see Bard slip as more votes roll in.
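[Editor's note: the confidence-interval point above can be made concrete with a quick back-of-the-envelope calculation. This is a simple binomial normal approximation with hypothetical vote counts, not Chatbot Arena's actual Elo bootstrap method.]

```python
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for a win rate via the normal approximation."""
    p = wins / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p - margin, p + margin

# Hypothetical: a 50% win rate measured with 3,000 votes vs 30,000 votes.
lo_3k, hi_3k = win_rate_ci(1_500, 3_000)
lo_30k, hi_30k = win_rate_ci(15_000, 30_000)

# With 10x the votes, the interval shrinks by a factor of sqrt(10) ≈ 3.2.
print(f"3K votes:  ±{(hi_3k - lo_3k) / 2:.4f}")   # roughly ±0.018
print(f"30K votes: ±{(hi_30k - lo_30k) / 2:.4f}")  # roughly ±0.006
```

So a model with only a few thousand votes can sit above a heavily voted model purely within the noise, and its ranking may shift as the interval tightens.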
I've now heard from a few people about Bard being much improved lately. So I went and tried it yesterday and didn't quite have an "Aha!" moment of seeing major leaps forward. Then again, I rarely use Bard, so my point of reference isn't that great. (ChatGPT Plus with GPT-4 Turbo did give me better Mermaid code when I compared its output with Bard.)
Great content as usual, Charlie. I will say that when I caught whiff of this Bard news through a Reddit post, my instant thought was how easily these kinds of public eval systems could be manipulated by a dedicated user (or group of users). I have used Bard very recently and it's much improved, but it still makes a ton of errors and often refuses outputs on pretty ridiculous pretenses, then backs down when *very simply* challenged. And yet the model is outstanding at some other things.
I appreciate the mention, by the way!
I love the appropriation of Twain here. Very appropriate.