It's on my to-do list to take it for a spin on longer-running agentic tasks. At Pulley we dreamed about setting up a model harness that could do cap table document analysis, and that's one of their demo examples!
That sounds like a really great use case! As a software engineer I mostly watch the SWE-bench numbers, but I know there's a **ton** of emphasis on this model's performance in spreadsheets (especially BIG spreadsheets).
5.2 benchmarks are nuts. Very much enjoying comparing it to other models
Thank you, Charlie, for your great posts this year!