11 Comments
Daniel Nest

I remember the GPT-4 "getting lazy" discussion at the end of last year very well. And I remember Ethan Mollick making a joke post about prompting getting weird that included a line providing a "non-lazy" month to the LLM. But I must say, I completely missed the new pushback against Claude. I've personally found Claude to be consistently great recently, and my main complaint is that Anthropic now frequently defaults to Claude 3 Haiku for free accounts when demand is high.

But many of the theories sound reasonable, including the training and us slowly discovering edge cases after the shine wears off. It'll be interesting to see if we ever get some clarity here.

Charlie Guo

One thing I will say about Claude: while I haven't necessarily noticed degraded performance, I've been using it a lot via the API and have noticed that it doesn't return identical results even at temperature zero.

Nico Appel

That’s a curious comment. Can you elaborate on that a bit? I mean, responses are never really identical, but I guess that isn’t really what you meant.

Andrew Smith

Very good job of laying out the various theories here, Charlie. Out of these, I tend to gravitate toward some combination of the extra pre-training, coupled with (possibly) collective delusion due to misinformation. I also don't want to dismiss folks who are noticing things getting worse, but my own experience hasn't matched that: the LLMs are not getting dumber, at least for the things I use them for every day.

With all that said, I've been thinking nonstop about emergence, and I wonder if this little surprise might be a part of that larger concept. I'll be thinking about that one for a while.

Charlie Guo

I think the perception bias is probably pretty big, especially because a string of very, very good completions followed by a string of bad or unsatisfying ones makes for much more of a "narrative" than good and bad results randomly mixed together.

I've gotten into the habit of hitting the "retry" button (or editing my prompt) way more often. I don't know if that's Claude's fault or mine, but I'm treating its outputs as something closer to a slot machine these days, with little-to-no cost for a re-roll.

Andrew Smith

I think that's notable: a year ago, I was pretty careful not to fire off too many inquiries, lest I tax the system too much and get kicked off for a while. I have ZERO fears along those lines these days, and I'm certain other users feel the same way and have changed their behavior accordingly.

That emergence thing, though: hoooooo boy. I mean, how is life a thing? It emerged. How about consciousness? Emerged. RQM (relational quantum mechanics) even suggests that reality as we experience it is (kinda) emergent, although that's a bit of a stretch on my end.

Can you tell I'm thinking about emergence? :)

William R Thomas

I think the models just get tired of some of the boring tasks and bad treatment from developers. Please and thank you go a long way to better outputs.

Charlie Guo

I definitely am guilty of saying "thank you" to ChatGPT!

Frank W.S.

An example of an LLM doing badly only proves that one example. However, there is an interesting a priori argument that puts LLMs and most AI in the "shameful" bucket, called the miem paradox (on YouTube). Easy and not easy to comprehend, it argues that an error was made long ago and yet continues. No matter how much we wish for something, we cannot change the rules of the universe.

Charlie Guo

I haven't heard of the miem paradox before, will check it out!

José Enrique Estremadoyro fort

What if it's not just some money? These companies are burning cash, exploiting hype, and buying hardware that's in super high demand, all while trying to scale up and fundraise constantly.

If we do the napkin math, we might find that a strong release, or even an initial wow factor to acquire users and investors, is the only sustainable strategy versus consistently maintaining a strong model over time.

Another theory could be that limiting an LLM in post-training diminishes its confidence or creativity, making its thinking "squarer" or dumber. Many factors, like who's doing the post-training and how restrictive it is, might affect performance drift.

Really nice article, BTW.
