Hacker News

by ahamilton454 on 1/30/2025, 4:29 AM with 1 comment

There are so many benchmarks out there for evaluating models: ARC-AGI, FrontierMath, MMLU, the Berkeley Function-Calling Leaderboard, and many more. I guess the general idea behind all of them, taken together, is to “approximate” all possible types of problems that can be tokenized and solved by an LLM.

That said, I can’t seem to do better than just “vibes”. Basically: oh, this model gave me a good response to this question, so it must be better.

Now, I have tried keeping track of a couple of benchmarks like the ones I mentioned above. But I generally can’t translate their results into utility outside the small scope each benchmark tests for. There are also so many benchmarks to keep track of, and each one takes some learning to understand.

So perhaps my scope isn’t defined well enough. But as a programmer, everything above GPT-4o feels pretty damn similar.

Would love to hear how others evaluate LLMs beyond “just vibes”, both for general programming use and when trying to create new AI projects.

by harshalachavan7 on 1/30/2025, 5:08 AM

Sharing this as a non-technical user:

Yesterday I saw a video of someone using the 'Lovable' platform, prompting it into building an app. For a no-coder like me, this was crazy stuff. The latest 'DeepSeek' progress hints that things will get cheaper. We live in the best time to build, as costs keep decreasing.

As an end user who doesn't understand the model evaluation methods you mentioned, I do observe that my life is improving because it is getting easier to automate manual work. With AI agents, I think it will only get easier. This 'easy' feeling gets better with every update.

For example, I used to write the tool descriptions for my AI tools directory (https://appliedai.tools) using Perplexity. I had to edit a lot of it because it messed up the facts.

Now I hardly make any changes. It can scout user reviews, YouTube reviews, etc. on its own and write very accurate descriptions.

I only shorten sentences, and even that I do using WordPress AI.

As an end user, if my output is good enough, I do not care much about how good the upcoming models are. I do care about costs, though. Maybe people should now focus more on cost reduction and environmental impact when looking for 'improvement'.

Ask HN: Are LLMs getting better, how can you tell?