13 Comments
Daniel Nest

I remember the GPT-4 "getting lazy" discussion at the end of last year very well. And I remember Ethan Mollick making a joke post about prompting getting weird that included a line telling the LLM it was a "non-lazy" month. But I must say, I completely missed the new pushback against Claude. I've personally found Claude to be consistently great recently, and my main complaint is that Anthropic now frequently defaults to Claude 3 Haiku for free accounts when demand is high.

But many of the theories sound reasonable, including the ongoing training and us slowly discovering edge cases once the shine wears off. It'll be interesting to see if we ever get some clarity here.

Charlie Guo

One thing I will say about Claude: while I haven't necessarily noticed degraded performance, I've been using it a lot via the API and have noticed that it doesn't return identical results, even at temperature zero.
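
For what it's worth, the kind of repeat-call check I mean looks something like this. It's a minimal sketch using the Anthropic Python SDK; the model name and prompt are just placeholders, and it assumes ANTHROPIC_API_KEY is set in the environment:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def ask(prompt: str) -> str:
        # Identical request every time, with temperature pinned to zero,
        # which should be as deterministic as the API allows.
        resp = client.messages.create(
            model="claude-3-opus-20240229",  # placeholder model name
            max_tokens=200,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

    prompt = "Summarize the plot of Moby-Dick in two sentences."
    completions = {ask(prompt) for _ in range(5)}
    print(f"{len(completions)} distinct completions from 5 identical calls")

If the outputs were fully deterministic, that set would always have exactly one element; in practice I keep seeing more than one.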

Nico Appel

That's a curious comment. Can you elaborate on that a bit? I mean, responses are never really identical, but I guess that isn't really what you meant.

Andrew Smith

Very good job of laying out the various theories here, Charlie. Out of these, I tend to gravitate toward some combination of the extra pre-training and (possibly) collective delusion driven by misinformation. I also don't want to dismiss folks who are noticing things getting worse, but my own experience hasn't matched that - the LLMs are not getting dumber, at least for the things I use them for every day.

With all that said, I've been thinking nonstop about emergence, and I wonder if this little surprise might be a part of that larger concept. I'll be thinking about that one for a while.

Charlie Guo

I think the perception bias is probably pretty big, especially because if you get a string of very very good completions followed by a string of bad or unsatisfying ones, that's much more of a "narrative" than randomly getting good and bad mixed together.

I've gotten into the habit of hitting the "retry" button (or editing my prompt) way more often. I don't know if that's Claude's fault or mine, but I'm treating its outputs as something closer to a slot machine these days, with little-to-no cost for a re-roll.

Andrew Smith

I think that's notable: a year ago, I was pretty careful not to fire off too many inquiries, lest I tax the system too much and get kicked off for a while. I have ZERO fears along those lines these days, and I'm certain other users feel the same way and have changed their behavior accordingly.

That emergence thing, though: hoooooo boy. I mean, how is life a thing? It emerged. How about consciousness? Emerged. RQM (relational quantum mechanics) even suggests that reality as we experience it is (kinda) emergent, although that's a bit of a stretch at my end.

Can you tell I'm thinking about emergence? :)

William R Thomas

I think the models just get tired of some of the boring tasks and bad treatment from developers. "Please" and "thank you" go a long way toward better outputs.

Charlie Guo

I definitely am guilty of saying "thank you" to ChatGPT!

Frank W.S.

An example of an LLM doing badly only proves that one example. However, there is an interesting a priori argument that puts LLMs and most AI in the "shameful" bucket; it's called the miem paradox (on YouTube). Easy and not easy to comprehend, it proves that an error was made long ago and yet continues. No matter how much we wish for something, we cannot change the rules of the universe.

Charlie Guo

I haven't heard of the miem paradox before, will check it out!

José Enrique Estremadoyro fort

What if it's not just some money? These companies are burning cash, exploiting hype, and buying hardware that's in super high demand, all while trying to scale up and fundraise constantly.

If we do the napkin math, we might find that a strong release, or even an initial wow factor to acquire users and investors, is the only sustainable strategy versus consistently serving a strong model over time.

Another theory could be that limiting an LLM in post-training diminishes its confidence or creativity, making its thinking "squarer" or dumber. Many factors, like who's doing the post-training and how restrictive it is, might affect performance drift.

Really nice article, BTW.

Matt

One of the things that surprises me here is that “overfitting” is not listed as a possible reason why LLMs are getting stoopid. I’d see the “is x prime” test as a good example: if the LLM is “guessing” rather than reasoning (and why would it reason, really?) and looks at a small amount of data like “2 is prime, 3 is prime, 5 is prime,” it would “guess” that all numbers it hasn’t encountered are prime. If it looks at more data, then since AFAIK most of the infinite set of integers are composite, more data should “overfit” it toward guessing that all numbers are composite except the cases it has seen.

That's just back-of-the-envelope theorizing, and I'm not trying to answer conclusively, but overfitting seems so much more plausible to me. The text that LLMs encounter is going to include lots of wrong answers and "I don't know"s, so as a model begins to overfit, you're essentially "trusting" what an infinitely tuned search bot would find, more and more. It should get lazier, since people are pretty lazy in general. Truth is a competitive frontier, though. If you have no obligation to "cite your sources," no one can reason out how you came to a truth. And if it's a search algorithm in disguise, asking it to "cite your reasoning" for a conclusion is just going to overfit to majority rule, and in the realm of "truth," my experience says trusting the majority is an exercise in disappointment.
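
To put a rough number on the "most integers are composite" part, here's a toy Python check of the base rates (just illustrating the guessing argument above, not a claim about what the models actually do):

    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    # How often would a "guess the majority label" strategy say "prime"?
    for limit in (5, 10, 100, 1_000, 10_000):
        primes = sum(is_prime(n) for n in range(2, limit + 1))
        majority = "prime" if primes > limit / 2 else "composite"
        print(f"up to {limit}: {primes} primes ({primes / limit:.0%}), majority guess = {majority}")

The majority label flips from "prime" to "composite" almost immediately, and the prime density only keeps falling as the range grows, so a pure majority-rule guesser ends up looking exactly like the "call everything composite unless memorized" behavior.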

Susie Bright

Charlie, I feel seen.

Claude got STOOPID on me.

Asking Claude to proofread a story was once as reliable as punching a clock. Now, Claude can’t reliably find typos in a 1000-word story.

You have to baby Claude like an orchid now, and for what? I pinch myself: This was supposed to save time.

Today I asked Claude to help me remember an old book title, something I read as a kid. Well-known title, I just couldn’t remember it. I told Claude the basic plot.

Claude proceeded to tell me, with great obsequiousness, what a great idea my question was for a book, and spewed out ideas for how *I* could write it.

BLERK!

I’m going with your Theory Number One: Lying and hubris. Greed. Corporate self-delusion.

Yes, you’re right, this theory appeals because I have a penchant for narrative arc.

But I’m old, and I saw the Deny-and-Double-Down act roll out with *other* new technologies that changed the world.

Believe what your eyes tell you... It’s not right.
