Claude 3.7 Sonnet: the first model I trust to write code unsupervised

Anthropic released Claude 3.7 Sonnet on 24 February, and I’ve spent the last few days testing it against the type of work we do. The benchmarks are good, which is fine. What matters more is the practical experience of using it to write code that ends up in production.

The test was straightforward. We had a WordPress plugin that needed three new features. I gave Claude the requirements and ran what it produced. On the first iteration, the code had a subtle bug in the error handling that would have caused a crash in production. That’s what code review is for. On the second iteration, the bug was fixed. By the third, the code was good enough that I would ship it.

With earlier versions, getting from “first pass” to “production ready” required significant rework. The code was overengineered or underengineered, it missed edge cases, and it had security holes. You ended up rewriting half of it. With 3.7, the code is closer to the shape it needs to be. It’s not perfect, but it’s the kind of “not perfect” you’d get from a junior developer, not the kind you’d get from a chatbot that doesn’t understand context.

Where it still breaks down is when the problem requires you to hold multiple constraints in mind at once. A task that involves “use this library, but not that function, and make sure it also integrates with this other system, and the output needs to match that format, and watch out for this edge case because it’s a known issue in the codebase” is still hard for the model. It doesn’t fail catastrophically, but it doesn’t produce the most elegant solution either.

Code review still happens, but the question is no longer “did this work”; it’s “does this fit the architecture”. Security review still happens for anything that touches user data or authentication. Testing happens exactly the same way. And anything that’s going to run on client sites still goes through quality gates.

This doesn’t mean we’re going to lay off developers, which people keep asking about. It means developers will spend less time writing boilerplate and more time on design and reasoning. That’s better work, not less work. And the codebases I’m seeing come out of this workflow are probably more consistent, because the model is less idiosyncratic than different people are.

The interesting question is whether this stays stable when we scale it. Right now I’m working on small problems with Claude. Bigger codebases, longer context windows, more complex requirements: that’s where it might break down.

But for the first time, I can say that I trust this model to write code that ends up in production without someone checking every line.