We Can Just Measure Things

by tosh on 6/17/2025, 11:15 AM with 60 comments

by yujzgzc on 6/19/2025, 8:33 PM

Another related benefit of LLMs in this situation is that we can observe their hallucinations and use them for design. I've run into a couple of situations where I saw Copilot hallucinate a method and agreed that the method should've been there. It helps confirm whether the naming of things makes sense, too.

What's ironic about this is that the very things that TFA points out are needed for success (test coverage, debuggability, a way to run locally, etc.) are exactly the things that typical LLMs themselves lack.

by layer8 on 6/19/2025, 5:49 PM

We can just measure things, but then there’s Goodhart's law: once a measure becomes a target, it ceases to be a good measure.

With the proposed way of measuring code quality, it’s also unclear how comparable the resulting numbers would be between different projects. If one project has more essential complexity than another project, it’s bound to yield a worse score, even if the code quality is on par.

by GardenLetter27 on 6/19/2025, 8:11 PM

I'm really skeptical of using current LLMs for judging codebases like this. Just today I got Gemini to solve a tricky bug, but it only succeeded after I had solved part of it myself and fed it deeper debug output.

The first time I tried, without the deeper output, it "solved" the bug by writing a load of code that failed in loads of other ways and ended up not even being related to the actual issue.

Like, you can be certain it'll give you some nice-looking metrics and measurements - but how do you know if they're accurate?

by stephc_int13 on 6/20/2025, 2:03 AM

I think we can very rarely measure things once they have more than one dimension and unit. Aggregated measurements have to be weighted, and are thus arbitrary and/or incomplete.

This is a common and irritating intellectual trap. We want to measure things as this gives us a handle to apply algorithms or logical processes on them.

But we can only measure very simple and well-defined dimensions such as mass, length, speed, etc.

Being measurable is the exception, not the rule.

by ToucanLoucan on 6/19/2025, 5:05 PM

Still RTFA, but this made me rage:

> In fact, we as engineers are quite willing to subject each others to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this, nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong

I recently implemented Microsoft's MSAL authentication on iOS, which includes, as you might expect, a function that retrieves the authenticated accounts. Oh sorry, I said function, but there are actually two: one that retrieves a single account, and one that retrieves multiple accounts. Which is odd, but harmless enough, right?

Wrong, because whoever designed this had an absolutely galaxy-brained moment and decided that if you try to retrieve one account when multiple accounts are signed in, instead of, oh I dunno, just returning an error message, or perhaps returning the most recently used account, no no no, what we should do in that case is throw an exception and crash the fucking app.

I just. Why. Why would you design anything this way!? I can't fathom any situation where you'd use the one-account function when the multi-account one does the exact same fucking thing, notably WITHOUT the potential to cause a CRASH, and just returns a set of one. And if you were REALLY INTENT ON offering a function that only returns one account, why wouldn't it just call the other one and return Accounts.first?

</rant>
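
For what it's worth, the fix is trivial. Here's a minimal sketch of the design I'd have expected, in Swift; all names here are hypothetical, not MSAL's actual API:

    // Hypothetical account store -- illustrative names, not MSAL's real interface.
    struct Account {
        let username: String
    }

    enum AccountStore {
        static var signedInAccounts: [Account] = []

        // Multi-account query: always safe, returns zero or more accounts.
        static func allAccounts() -> [Account] {
            return signedInAccounts
        }

        // Single-account convenience: delegates to allAccounts() and returns
        // an Optional instead of throwing when several accounts are signed in.
        static func currentAccount() -> Account? {
            return allAccounts().first
        }
    }

Returning an Optional hands the "which account?" question back to the caller instead of crashing the app.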

by timhigins on 6/20/2025, 1:56 AM

The title of this post really doesn’t match the core message/thesis, which is a disappointing trend in many recent articles.

by lostdog on 6/19/2025, 5:08 PM

A lot of the "science" we do is experimenting on bunches of humans, giving them surveys, and treating the result as objective. In how many places could we do much better by surveying a specific AI instead?

It may not be objective, but at least it's consistent, and it reflects something about the default human position.

For example, there are no good ways of measuring the amount of technical debt in a codebase. It's such a fuzzy question that only subjective measures work. But what if we show the AI one file at a time, ask it to "Rate, 1-10, the comprehensibility, complexity, and malleability of this code," and then average across the codebase? Then we get a measure of tech debt, which we can compare over time to see whether it's rising or falling. The AI makes subjective measurements consistent.
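
Roughly something like this, in Swift say; rateFile here is a stand-in for whatever LLM call you'd actually make, stubbed out so the sketch is self-contained:

    import Foundation

    // Stand-in for the model call: would send the file's source with the prompt
    // "Rate, 1-10, the comprehensibility, complexity, and malleability of this code."
    func rateFile(_ source: String) -> Double {
        return 5.0 // placeholder score
    }

    // Average per-file ratings across the codebase into a single tech-debt
    // number that can be tracked over time.
    func techDebtScore(files: [URL]) -> Double {
        let ratings = files.compactMap { url -> Double? in
            guard let source = try? String(contentsOf: url, encoding: .utf8) else { return nil }
            return rateFile(source)
        }
        guard !ratings.isEmpty else { return 0 }
        return ratings.reduce(0, +) / Double(ratings.count)
    }

The point isn't the exact prompt or scale; it's that the same model asked the same question gives you numbers you can compare over time.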

This essay gives such a cool new idea, while only scratching the surface.

by elktown on 6/19/2025, 8:02 PM

I think this is an advertisement for an upcoming product. Sure, join the AI gold rush, but at least be transparent about it.