While benchmarks (and leaderboards) are useful tools, they are but a small facet when it comes to evaluating large language models. Often, they're not the best indicators of real-world utility - and I want to dig into why (and what other approaches exist).
My LinkedIn feed is now full of some arcane and obscure open source model I'll likely never hear about again. Since when did leaderboards become a thing engineers actually care about?
For the ML engineers who actually train the models, I get the sense that it's mostly about bragging rights. But for everyone else, it mostly seems to be about creating new things to hype up on Twitter and LinkedIn.
Great clarification regarding the Chatbot Arena. It pops up in my feed regularly (the last time with the Gemini Pro news), but I never tried digging further, so I never realised it measures interactions with real humans in a "non-lab" scenario. Looking more closely, I can see that Bard "only" has 3K votes compared to GPT-4's tens of thousands (for every measured version). So the confidence interval is quite different, and we may well see Bard slip as more votes roll in.
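For a rough sense of how much that vote gap matters, here's a back-of-the-envelope sketch (not the arena's actual bootstrap-on-Elo method, and the 55% win rate is just a made-up placeholder):

```python
# Rough illustration: how a 95% confidence interval on a head-to-head win
# rate narrows as votes accumulate. The 3K vs. 30K counts mirror the Bard /
# GPT-4 vote totals mentioned above; the 55% win rate is invented for the example.
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a binomial win rate."""
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

for votes in (3_000, 30_000):
    lo, hi = win_rate_ci(wins=int(0.55 * votes), total=votes)
    print(f"{votes:>6} votes: 95% CI on win rate = [{lo:.3f}, {hi:.3f}]")
```

With ten times the votes, the interval comes out roughly three times narrower, which is why the GPT-4 variants' rankings feel a lot more settled than Bard's.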
I've now heard from a few people about Bard being much improved lately. So I went and tried it yesterday and didn't quite have an "Aha!" moment of seeing major leaps forward. Then again, I rarely use Bard, so my point of reference isn't that great. (ChatGPT Plus with GPT-4 Turbo did give me better Mermaid code when I compared its output with Bard.)
I appreciate the mention, by the way!
That's a great point about the confidence interval - I should have mentioned that in the post. My guess is the new Bard may slip compared to other GPT-4 variants as time goes on, but like I said, it's a great PR win for Google in the short term.
Yeah, big time. Now we stay tuned for Gemini Ultra.
Great content as usual, Charlie. I will say, when I caught wind of this Bard news through a Reddit post, my instant thought was how easily these types of public eval systems could be manipulated by a dedicated user (or group of users). I have used Bard very recently and it's much improved, although it still makes a ton of errors and often refuses outputs on pretty ridiculous pretenses, which it walks back when even *very simply* challenged. But then the model is outstanding at some other things.
I'm not sure if this already exists, but I'd love to see an analysis of mostly harmless prompts that get rejected by content filters. What areas do you think Bard really shines in?
I love the appropriation of Twain here. Very appropriate.
You know, sometimes I really struggle with headlines and other times they pretty much write themselves.
Same! I'm also super glad to see you and Daniel Nest passing the ball back and forth a bit. You two are good at this kind of summary, and you're both usually how I keep up (beyond the inundation in my news feed). Just wanted to encourage you a little here, beyond the good headline!