Comparing GPT-4.1 against other models in "did a code change cause this incident"

by lawrjone on 4/15/2025, 8:49 PM with 0 comments

We've been testing GPT-4.1 in our investigation system, which is used to triage and debug production incidents.

I thought it would be useful to share, as we have evaluation metrics and scorecards for investigations, so you can see how real-world performance compares between models.

I've written the post on LinkedIn so I could share a picture of the scorecards and how they compare:

https://www.linkedin.com/posts/lawrence2jones_like-many-others-we-were-excited-about-openai-activity-7317907307634323457-FdL7

Our takeaways were:

- 4.1 is much fussier than Sonnet 3.7 about claiming a code change caused an incident, leading to a 38% drop in recall (rough scoring sketch after the list)

- When 4.1 does suggest a PR caused an incident, it's right 33% more often than Sonnet 3.7

- 4.1 blows 4o out of the water, with 4o finding just 3/31 of the code changes in our dataset, showing how much of an upgrade 4.1 is on this task
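
For context on those numbers: recall here is the share of incidents genuinely caused by a PR where the model blames the right one, and precision is the share of the PRs it blames that are correct. Here's a rough sketch (simplified, with hypothetical names, not our actual harness) of how those metrics fall out of a labelled incident dataset:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    predicted_pr: Optional[str]  # PR the model blamed, or None if it declined to blame one
    actual_pr: Optional[str]     # PR labelled as the true cause, or None if no code change caused it

def precision_recall(cases: list[EvalCase]) -> tuple[float, float]:
    # True positive: the model blamed a PR and it matches the labelled cause.
    true_positives = sum(
        1 for c in cases
        if c.predicted_pr is not None and c.predicted_pr == c.actual_pr
    )
    blamed = sum(1 for c in cases if c.predicted_pr is not None)      # model made a claim
    pr_caused = sum(1 for c in cases if c.actual_pr is not None)      # incident really had a code cause

    precision = true_positives / blamed if blamed else 0.0
    recall = true_positives / pr_caused if pr_caused else 0.0
    return precision, recall

# Example: with 31 PR-caused incidents in the dataset, a model that only ever
# finds 3 of them (like 4o here) has recall of 3/31 ≈ 0.10, however precise it is.
```

A fussier model blames fewer PRs, which is why precision can rise while recall drops, exactly the trade we see between 4.1 and Sonnet 3.7.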

In short, 4.1 is a totally different beast to 4o when it comes to software tasks, and at a much lower price point than Sonnet 3.7, we'll be considering it carefully across our agents.

We are also yet to find a metric where 4.1 is worse than 4o, so at minimum this release means >20% cost savings for us.

Hopefully useful to people!
