4 Comments
Jeff Morhous

5.2 benchmarks are nuts. Very much enjoying comparing it to other models

Charlie Guo

It's on my to-do list to take it for a spin on longer-running agentic tasks. At Pulley we dreamed about setting up a model harness that could do cap table document analysis and that's one of their demo examples!

Jeff Morhous

That sounds like a really great use case! As a software engineer I mostly watch the SWE-bench numbers, but I know there's a **ton** of emphasis on this model's performance in spreadsheets (especially BIG spreadsheets)

Gary Mersham

Thank you, Charlie, for your great posts this year!
