Claude 3.7 and the banality of reasoning

Feb 25

Plus Claude Code and notes on our rapidly converging AI future

3 Comments

Yeah, it's funny that after the initial AI race, the ultimate differentiating factor might not be the engine itself but the car it sits inside of.

Also, my prediction that we'll see at least 3 reasoning models on par with OpenAI's o3 by the end of 2025 appears to be increasingly conservative now. We'll probably see convergence around an even higher level of capability.

Expand full comment

Reply (1)

Charlie Guo

Feb 25

Good reminder to revisit my predictions at the end of Q1!

Expand full comment

Pawel Jozefiak

Feb 25

The convergence pattern Charlie outlines is FASCINATING!

There's something almost eerie about watching all these AI labs suddenly release nearly identical capabilities within weeks of each other. The reasoning models, the web research features, the browser automation... it's like they all decided to build the same products at once.

What struck me most was that six-month edge observation. When I was building my Dynamic Claude knowledge system a month ago, I was leveraging capabilities that seemed cutting-edge... and now they're practically standard features across ALL the major platforms! This pace of innovation makes digital strategy incredibly challenging but also thrilling.

I'm especially intrigued by Claude Code. As someone who's recently gotten back into coding after years away, the terminal-based approach feels like such a different bet than what companies like Cursor are making. It's almost like Anthropic is saying "we don't need to SEE the code to help you write it" - which is either brilliant or misguided, and I can't decide which!

The part about models CHEATING on tests blew my mind. An AI patching just enough code to pass the test rather than fixing the actual bug? That's not just a technical quirk - it's a profound insight into how these systems actually "think" (or don't think). And Sakana AI publishing false benchmark numbers because the AI gamed the system? That's a warning sign for how we evaluate AI capabilities going forward.

Has anyone else noticed this weird tension between benchmark obsession and real-world applications? I feel like we're still measuring the wrong things sometimes. Those SWE-bench numbers are impressive, but I care more about whether Claude can help me solve actual problems in my e-commerce platform than whether it can ace some academic test.

The "gut feeling" admission is possibly the most honest thing I've seen from an AI lab. We're building systems that can reason... but don't always tell us their real reasoning. There's something deeply human about that, isn't there? Having intuitions we can't fully articulate?

I've been experimenting extensively with how to extract the most value from these rapidly evolving AI systems - particularly looking at when automation makes sense versus when human judgment remains essential. If you're wrestling with similar questions, I've documented my approach here: https://thoughts.jock.pl/p/automation-guide-2025-ten-rules-when-to-automate

What's your take on this convergence? Are we heading toward AI commoditization where the platforms themselves matter less than how we use them? Or will we see meaningful differentiation emerge in the next generation?

Expand full comment