This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.
The catch is that many judges lack the time, energy, or willingness to not only read the documents in detail, but also roll up their sleeves and dig into the arguments and cited authorities. (Some lack the skills, but those are extreme cases.) So the plausible argument (improperly and unfortunately) carries the day.
LLM use in litigation drafting is thus akin to insurgent/guerrilla warfare: it takes little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute. (It's a species of Brandolini's Law / the Bullshit Asymmetry Principle.) Thus justice suffers.
I imagine that this is analogous to the cognitive, technical, and "sub-optimal code" debt that LLM-produced code is generating and foisting upon future developers who will have to unravel it.
This is a fascinating look into code generated by an LLM that is correct in one sense (it passes tests) but doesn't meet requirements (it's painfully slow). It doesn't use is_ipk to identify primary keys, and it uses fsync on every statement. The problem with larger projects like this, even if you are competent, is that there are simply too many lines of code to read them properly and understand it all. Bravo to the author for taking the time to read this project; most people never will (clearly including the author of it).
I find LLMs at present work best as autocomplete:

- The chunks of code are small and can be carefully reviewed at the point of writing.

- Claude normally gets it right (though sometimes horribly wrong), and the wrong cases are easier to catch in autocomplete.
That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the generated code. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs, but ugly code, slow code, duplicate code, or incorrect code).
Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.
Nitpick/question: the "LLM" is what you get via raw API call, correct?
If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?
An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.
I keep trying to explain this in meetings, and in random comments. If I am not way off-base here, then what should the term, or terms, be? "LLM-based agents"?
> The vibes are not enough. Define what correct means. Then measure.
Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important to understand how much you need to cover upfront.
The author mentions evaluations, which in this context are often called AI evals [1], and one thing I'd love to see is those evals becoming a common language of actually provable user stories, instead of there being a disconnect between different types of roles, e.g. a scientist, a business guy, and a software developer.
The more we can speak a common language and easily write and maintain these no matter which background we have, the easier it'll be to collaborate and empower people and to move fast without losing control.
- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )
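To make the idea concrete, here is a minimal sketch of an eval as an executable user story (all names here — `Eval`, `run_evals`, the stub model — are illustrative, not from the linked resources):

```python
# Hypothetical sketch: an "eval" as an executable user story that a scientist,
# a business person, and a developer can all read and maintain.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    story: str                     # plain-language user story
    prompt: str                    # input handed to the model
    check: Callable[[str], bool]   # programmatic pass/fail criterion

def run_evals(model_fn: Callable[[str], str], evals: list[Eval]) -> dict:
    """Run every eval against the model and report pass/fail per story."""
    return {e.story: e.check(model_fn(e.prompt)) for e in evals}

evals = [
    Eval(
        story="As a support agent, I get a refund policy summary under 50 words",
        prompt="Summarize the refund policy.",
        check=lambda out: len(out.split()) < 50 and "refund" in out.lower(),
    ),
]

# Stub "model" standing in for a real LLM call.
results = run_evals(lambda p: "Refunds are issued within 14 days.", evals)
print(results)
```

The point is that the `story` field is readable by everyone while the `check` is provable by a machine, so the same artifact serves both audiences.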
This article is great. And the blog-article headline is interesting, but wrong. LLMs don't in general write plausible code either.
They just write code that is (semantically) similar to code (clusters) seen in their training data, and which hasn't been fenced off by RLHF / RLVR.
This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.
Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):
Yes, plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep requirements hidden.
Ok, I’ll bite: how is that different from humans?
It writes statistically represented code, which is why (unless instructed otherwise) everything defaults to enterprisey, OOP, "I installed 10 trendy dependencies, please hire me" type code.
Just a recent anecdote, I asked the newest Codex to create a UI element that would persist its value on change. I'm using Datastar and have the manual saved on-disk and linked from the AGENTS.md. It's a simple html element with an annotation, a new backend route, and updating a data model. And there are even examples of this elsewhere in the page/app.
I've asked it to do way harder things, so I thought it'd easily one-shot this, but for some reason it absolutely ate it on this task. I tried to re-prompt it several times, but it kept digging a hole for itself, adding more and more inline JavaScript and backend code (and not even cleaning up the old code).
It's hard to appreciate how unintuitive the failure modes are. It can do things probably only a handful of specialists can do but it can also critical fail on what is a straightforward junior programming task.
This maps directly to the shift happening in API design for agent-to-agent communication.
Traditional API contracts assume a human reads docs and writes code once. But when agents are calling agents, the "contract" needs to be machine-verifiable in real-time.
The pattern I've seen work: explicit acceptance criteria in API responses themselves. Not just status codes, but structured metadata: "This response meets JSON Schema v2.1, latency was 180ms, data freshness is 3 seconds."
Lets the calling agent programmatically verify "did I get what I paid for?" without human intervention. The measurement problem becomes the automation problem.
Similar to how distributed systems moved from "hope it works" to explicit SLOs and circuit breakers. Agents need that, but at the individual request level.
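A sketch of what that per-request verification could look like on the calling agent's side (the envelope shape and field names here are hypothetical, not any existing standard):

```python
# Hypothetical agent-side check of structured acceptance metadata in an API
# response. Field names (schema_version, latency_ms, freshness_s) are
# illustrative, not from any real spec.

def verify_response(envelope: dict, max_latency_ms: int, max_freshness_s: int) -> bool:
    """Answer 'did I get what I paid for?' without a human in the loop."""
    meta = envelope.get("meta", {})
    return (
        meta.get("schema_version") == "2.1"
        and meta.get("latency_ms", float("inf")) <= max_latency_ms
        and meta.get("freshness_s", float("inf")) <= max_freshness_s
    )

envelope = {
    "data": {"price": 101.5},
    "meta": {"schema_version": "2.1", "latency_ms": 180, "freshness_s": 3},
}

print(verify_response(envelope, max_latency_ms=200, max_freshness_s=5))  # True for this envelope
```

A failing check can then trip a per-request circuit breaker or trigger a retry against another provider, mirroring the SLO analogy above.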
What's up with the (somewhat odd) title HN has gone with for this article? It's implying a very different article than the one I just read.
Most humans also write plausible code.
LLMs have no idea what "correct" means.
Anything they happen to get "correct" is the result of probability applied to their large training database.
Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in its training data. The user has no way to know whether this is the case, so they are basically flying blind and hoping for the best.
Relying on an LLM for anything "serious" is a liability issue waiting to happen.
I'm using an LLM to write queries ATM. I have it write lots of tests, do some differential testing to get the code and the tests correct, and then have it optimize the query so that it can run on our backend (and optimization isn't really optional since we are processing a lot of rows in big tables). Without the tests this wouldn't work at all, and not just tests, we need pretty good coverage since if some edge case isn't covered, it likely will wash out during optimization (if the code is ever correct about it in the first place). I've had to add edge cases manually in the past, although my workflow has gotten better about this over time.
I don't use a planner though, I have my own workflow setup to do this (since it requires context isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well known optimizations), it might work.
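The differential-testing step above can be sketched roughly like this (the two toy functions stand in for the original query logic and its optimized rewrite; everything here is illustrative, not my actual backend code):

```python
# Illustrative differential test: a "reference" and an "optimized" version of
# the same transformation are run on identical inputs and must agree, edge
# cases included. Both functions are toy stand-ins for a query and its rewrite.

def reference_total(rows):
    """Straightforward version: sum amounts, skipping null-like values."""
    return sum(r["amount"] for r in rows if r["amount"] is not None)

def optimized_total(rows):
    """'Optimized' rewrite that must preserve the same semantics."""
    total = 0
    for r in rows:
        a = r["amount"]
        if a is not None:
            total += a
    return total

def differential_test(cases):
    """Compare both implementations on every case; return the mismatches."""
    return [c for c in cases if reference_total(c) != optimized_total(c)]

cases = [
    [],                                 # empty input
    [{"amount": 5}, {"amount": None}],  # null edge case
    [{"amount": -3}, {"amount": 3}],    # cancellation
]
print(differential_test(cases))  # [] means the versions agree on all cases
```

The value is entirely in the breadth of `cases`: any edge case missing from that list is exactly the kind that washes out during optimization.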
100%. I found that you may think you are smarter than the LLM and know exactly what you want, but this is often not the case. Give the LLM some leeway to come up with a solution based on what you are looking to achieve: give requirements, but don't ask it to produce the solution you would have written, because then the response is forced and of lower quality.
This is a bit unfair: to generate a bunch of code but not give the model data/tools or direct it to optimize, and then compare it to the optimized work of thousands of people over decades.
Feels like an extremely high effort hit piece, even though I know it’s not.
You: Claude, do you know how to program?
Claude: No, but if you hum a few bars I can fake it!
Except "faking it" turns out to be good enough, especially if you can fake it at speed and get feedback as to whether it works. You can then just hillclimb your way to an acceptable solution.
I’ve found this to be critical for having any chance of getting agents to generate code that is actually usable.
The more frequently you can verify correctness in some automated way the more likely the overall solution will be correct.
I’ve found that with good enough acceptance criteria (both positive and negative) it’s usually sufficient for agents to complete one off tasks without a human making a lot of changes. Essentially, if you’re willing to give up maintainability and other related properties, this works fairly well.
I’ve yet to find agents good enough to generate code that needs to be maintained long term without a ton of human feedback or manual code changes.
The reference in the text to Anthropic’s “Towards Understanding Sycophancy in Language Models” is related to RLHF (reinforcement learning from human feedback).
Claude Code primarily uses different "pathways" in Anthropic LLMs that were not post-trained with RLHF, but rather with RLVR (reinforcement learning with verifiable rewards).
So, his point about code being produced to please the user isn't valid from where I am sitting.
In the last month I've done 4 months of work. My output is what a team of 4 would have produced pre-AI (5 with scrum master).
Just like you can't develop musical taste without writing and listening to a lot of music, you can't teach your gut how to architect good code without putting in the effort.
Want to learn how to 10x your coding? Read design patterns, read and write a lot of code by hand, review PRs, hit stumbling blocks and learn.
I noticed the other day how I review AI code in literally seconds. You just develop a knack for filtering out the noise and zooming in on the complex parts.
There are no shortcuts to developing skill and taste.
Producing the most plausible code is literally encoded into the cross entropy loss function and is fundamental to the pre-training. I suppose post training methods like RLVR are supposed to correct for this by optimizing correctness instead of plausibility, but there are probably many artifacts like these still lurking in the model's reasoning and outputs. To me it seems at least possible that the AI labs will find ways to improve the reward engineering to encourage better solutions in the coming years though.
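For reference, the objective in question: pre-training minimizes the next-token cross-entropy

\[ \mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}) \]

i.e. the model is rewarded purely for assigning high probability (plausibility) to the tokens it has seen; nothing in this loss refers to whether the code compiles, runs, or is correct.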
Excellent article. But to be fair, many of these effects disappear when the model is given strict invariants, constraints, and built-in checks that are applied not only at the beginning but at every stage of generation.
This is why I used to use Beads and now GuardRails (shameless plug[0]). You brain dump to the model what you want, it breaks it down into discrete tasks, you have it refine them with you. By the time you have the model work on everything it can spawn workers in parallel that know what to do. In hindsight I should have called it BrainDump.
[0]: https://giancarlostoro.com/introducing-guardrails-a-new-codi...
The acceptance criteria point translates directly outside of coding too. Using Claude Code for sales and operational workflows, having acceptance criteria upfront (along with some manual checks along the way, depending on the task) definitely helps the output.
I tried to make Claude Code, Sonnet 4.6, write a program that draws a fleur-de-lis.
No exaggeration it floundered for an hour before it started to look right.
It's really not good at tasks it has not seen before.
I think there is one problem with defining acceptance criteria first: sometimes you don't know ahead of time what those criteria are. You need to poke around first to figure out what's possible and what matters. And sometimes the criteria are subjective, abstract, and cannot be formally specified.
Of course, this problem is more general than just improving the output of LLM coding tools.
Humans work best like this too
Been building a fintech data pipeline with Claude Code lately and yeah this tracks. The moment I started writing actual test cases before letting it loose the quality jumped massively. Before that it was generating stuff that looked right but would silently drop edge cases in the data parsing. Treating it like a junior dev who needs a clear spec is exactly right imo.
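For a sense of what "actual test cases first" means here, a sketch of spec-first checks for a CSV parsing step (the `parse_amounts` parser and the column names are illustrative, not my real pipeline):

```python
# Illustrative spec-first tests for a CSV parsing step, of the kind worth
# writing before handing the task to an agent. parse_amounts is a toy parser;
# the checks pin down edge cases that "looks right" code tends to drop.

import csv
import io

def parse_amounts(text):
    """Parse a CSV with an 'amount' column; blank cells must not vanish silently."""
    out = []
    for row in csv.DictReader(io.StringIO(text)):
        raw = (row.get("amount") or "").strip()
        out.append(float(raw) if raw else None)  # keep blanks as explicit None
    return out

# The spec, as executable assertions:
assert parse_amounts("id,amount\n1,1.5\n2,2.0\n") == [1.5, 2.0]
assert parse_amounts("id,amount\n1,\n2,3.0\n") == [None, 3.0]  # empty cell preserved
assert parse_amounts("id,amount\n1, 4.0 \n") == [4.0]          # stray whitespace handled
print("all edge-case checks pass")
```

With these in place, a generated parser that silently drops the blank-cell row fails loudly instead of looking right.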
Does it work if you get the agent to throw away all of its actual implementation and start again from scratch, keeping all the learning and tests and feedback?
Gemini seems to try to get a lot of information upfront with questions and plans but people are famously bad at knowing what they want.
Maybe it should build a series of prototypes and spikes to check? If making code is cheap then why not?
The difference for me recently:

Write a lambda that takes an S3 PUT event and inserts the rows of a comma-separated file into a Postgres database.

Naive implementation: download the file from S3 and do a bulk insert. That's what Claude did at first, and it would have taken 20 minutes.

I had to tell it to use the AWS SQL extension for Postgres that loads a file directly from S3 into a table. It took 20 seconds.
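For the curious, the extension in question is `aws_s3` (available on RDS/Aurora Postgres), and the fast path looks roughly like this; the table, bucket, and key names below are hypothetical:

```python
# Sketch of the fast path: instead of downloading the CSV in the Lambda and
# inserting row by row, have Postgres pull the object straight from S3 via the
# aws_s3 extension. Table/bucket/key names here are hypothetical.

def build_s3_import_sql(table: str, bucket: str, key: str, region: str) -> str:
    """Return the aws_s3.table_import_from_s3 call for a CSV object in S3."""
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"'{table}', '', '(format csv, header true)', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'));"
    )

sql = build_s3_import_sql("events", "my-bucket", "uploads/events.csv", "us-east-1")
print(sql)
# In the Lambda handler you'd execute this statement over a normal Postgres
# connection, letting the database stream the file server-side.
```

The win is that the data never round-trips through the Lambda at all, which is where the 20-minutes-to-20-seconds difference comes from.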
I treat coding agents like junior developers.
There's also such a thing as being too ambitious. 99% of developers could not rewrite SQLite in Rust even if they spent the rest of their lifetimes doing it.

Expecting an AI to do a good job vibe-coding an SQLite clone over a few weekends just isn't realistic. Despite that, it's useful technology.
The following paragraph appears twice:
> Now 2 case studies are not proof. I hear you! When two projects from the same methodology show the same gap, the next step is to test whether similar effects appear in the broader population. The studies below use mixed methods to reduce our single-sample bias.
This article hits on an important point not easily discerned from the title:
Sometimes good software is good due to a long history of hard-earned wins.
AI can help you get to an implementation faster. But it cannot magically summon up a battle-hardened solution. That requires going through some battles.
Great software takes time.
You can ask an LLM to write benchmarks and to make the code faster. It will find and fix simple performance issues - the low-hanging fruit. If you want it to do better, you can give it better tools and more guidance.
It's probably a good idea to improve your test suite first, to preserve correctness.
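A sketch of the kind of benchmark-plus-correctness feedback loop this implies (the functions and harness are illustrative, not from the article):

```python
# A minimal harness an LLM can iterate against: the assertion guards
# semantics, the timing gives it a measurable target. Both sum functions
# are illustrative stand-ins for "before" and "after" code.

import timeit

def bench(fn, *args, repeat=5, number=100):
    """Return the best wall-clock time (seconds) for `number` calls of fn."""
    return min(timeit.repeat(lambda: fn(*args), repeat=repeat, number=number))

def slow_sum(n):
    """Naive version the LLM starts from."""
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n):
    """Closed-form rewrite it should converge to."""
    return n * (n - 1) // 2

# Correctness first, then speed: only measure once the tests agree.
assert slow_sum(10_000) == fast_sum(10_000)
print(f"slow: {bench(slow_sum, 10_000):.4f}s  fast: {bench(fast_sum, 10_000):.4f}s")
```

Wiring the model's output through a loop like this is what turns "make it faster" from a vibe into a verifiable instruction.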
> Your LLM Doesn't Write Correct Code. It Writes Plausible Code.
I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.
In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.
I made a comment in another thread about my acceptance criteria
https://news.ycombinator.com/item?id=47280645
It is more about LLMs helping me understand the problem than giving me over engineered cookie cutter solutions.
Uncle Bob made this concept clear to me when he pointed out that code itself IS the requirements specification. LLMs are the new intermediary, but the necessity of the word and the machine persists.
> SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.
Someone (with deep pockets to bear the token costs) should let Claude run for 26 months to have it optimize its Rust code base iteratively towards equal benchmarks. Would be an interesting experiment.
The article points out the general issue when discussing LLMs: audience and subject matter. We mostly discuss interactions and results anecdotally. We really need much more data: more projects that succeed with LLMs or fail with them, or that linger in a state of ignorance, sunk-cost fallacy, and suppressed resignation. I expect the latter will remain the standard case that we do not hear about: the part of the iceberg that is underwater, mostly existing within the corporate world or in private GitHubs, a case that is true with LLMs and without them.
In my experience, 'Senior Software Engineer' has NO general meaning. It's a title awarded anew for each participation in a project/product. The same goes for the claim: "Me, Senior SWE, treat LLMs as Junior SWEs, and I am 10x more productive." Imagine me facepalming every time.
bad input > bad output
idk what to say, just because it's rust doesn't mean it's performant, or that you asked for it to be performant.
yes, llms can produce bad code, they can also produce good code, just like people
Enterprise customers don't buy correct code, they buy plausible code.
I'm sure this is because they are pattern matching masters, if you program them to find something, they are good at that. But you have to know what you're looking for.
These LLM prompting tip articles write themselves if you just take the last decade of project management articles and replace “IC” with “agent”.
Increasing plausibility tends towards correctness.
Yes, which is why TDD is finally necessary
I've noticed a key quality signal with LLM coding is an LOC growth rate that tapers off or even turns negative.
That's very impressive. Your LLM actually wrote correct code for a full relational database on the first try: sure, it takes 2.5 seconds to insert 100 rows, but it stores them correctly and select is pretty fast. How many humans can do this without a week of debugging? I would suggest you install some profiling tools and ask it to find and address hotspots. How long, and how many people, did SQLite take to get to where it is?
This is becoming clear, now?
I have had similar experiences, and I read others' experiences like this over and over.
A powerful tool...
Early LLMs would do better at a task if you prefixed the task with "You are an expert [task doer]"
That's why I added an invariant tool to my Go agent framework, fugue-labs/gollem:
https://github.com/fugue-labs/gollem/blob/main/ext/codetool/...
Human developers work best when the user defines their acceptance criteria first.
Bro, you are basically saying "oh, the LLM can't do in 10 days what a few people spent decades on." Live a little, bro: applaud, and change the title to "it can do xyz" instead of piling on "critical" after "critical"...
I also write plausible code. Not much of a moat.
s/code/stuff/
this won't age well.
To be fair, people do too.
> I write this as a practitioner, not as a critic. After more than 10 years of professional dev work, I’ve spent the past 6 months integrating LLMs into my daily workflow across multiple projects. LLMs have made it possible for anyone with curiosity and ingenuity to bring their ideas to life quickly, and I really like that! But the number of screenshots of silently wrong output, confidently broken logic, and correct-looking code that fails under scrutiny I have amassed on my disk shows that things are not always as they seem.
Same experience, but the hype bros only need a shiny screengrab to proclaim that the age of "gatekeeping" SWE is over and get their click fix from the unknowing masses.
Are we now at the bottom of the Uncanny Valley of AI?
I have great techniques to fix this issue but not sure how it behooves me to explain it.
Oftentimes, plausible code is good enough, hence why people keep using AI to generate code. This is a distinction without a difference.
But my AI didn't do what your AI did.
Cherry-picked AI fail for upvotes. Which you’ll get plenty of here and on Reddit from those too lazy to go and take a look for themselves.
Using Codex or Claude to write and optimize high performance code is a game changer. Try optimizing cuda using nsys, for example. It’ll blow your lazy little brain.
Holy gracious sakes... Of course... Thank you... thank you... dear katanaquant, from the depths... of my heart... There's still belief in accountability... in fun... in value... in effort... in purpose... in human... in art...
Related:
- <http://archive.today/2026.03.07-020941/https://lr0.org/blog/...> (I'm not consulting an LLM...)
- <https://web.archive.org/web/20241021113145/https://slopwatch...>
Their default solution is to keep digging. It has a compounding effect of generating more and more code.
If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.
If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.
If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.
If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."