Can I run AI locally?

by ricardbejarano on 3/13/2026, 12:46 PM with 338 comments

by mark_l_watson on 3/13/2026, 6:20 PM

I have spent a HUGE amount of time the last two years experimenting with local models.

A few lessons learned:

1. Small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.

2. For coding tools, just use Google Antigravity and gemini-cli, or Anthropic Claude, or...

Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lots of effort trying to maximize local-only results. I don't recommend it for others.

I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.

by meatmanek on 3/13/2026, 5:25 PM

This seems to be estimating based on memory bandwidth / size of model, which is a really good estimate for dense models, but MoE models like GPT-OSS-20b don't involve the entire model for every token, so they can produce more tokens/second on the same hardware. GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)
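The back-of-envelope math in this comment can be sketched in a few lines. This is a rough illustration, not a measurement: the bandwidth figure and bits-per-weight are assumed values.

```python
import math

def est_tokens_per_sec(active_params_b, bits_per_weight, mem_bw_gb_s):
    """Rough upper bound: each generated token streams all *active*
    weights through memory once, so tok/s <= bandwidth / bytes moved."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

def est_effective_dense_b(total_params_b, active_params_b):
    """Folk heuristic: MoE quality ~ geometric mean of total and active."""
    return math.sqrt(total_params_b * active_params_b)

# GPT-OSS-20B: 20B total, 3.6B active, ~4.25-bit MXFP4 weights,
# on an assumed ~400 GB/s memory system.
print(round(est_tokens_per_sec(3.6, 4.25, 400)))   # ~209 t/s upper bound
print(round(est_effective_dense_b(20, 3.6), 1))    # ≈ 8.5 "dense-equivalent" B
```

The geometric-mean rule is only a community rule of thumb, but it reproduces the ≈8.5B figure quoted above.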

by mopierotti on 3/13/2026, 6:44 PM

This (+ llmfit) are great attempts, but I've been generally frustrated by how it feels so hard to find any sort of guidance about what I would expect to be the most straightforward/common question:

"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"

(My personal approach has just devolved into guess-and-check, which is time consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.

I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.
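The question above is essentially a constrained maximization, which such a tool could answer directly. A minimal sketch, with an entirely made-up catalog (the names, quality scores, and speeds below are placeholders, not benchmark data):

```python
# Hypothetical catalog rows: (name, quality_score, tok_s, max_context).
CATALOG = [
    ("Qwen 3.5 27B Q6", 86, 55, 131072),
    ("Qwen 2.5 14B Q4", 71, 120, 32768),
    ("Llama 3.1 8B Q4", 64, 160, 131072),
]

def best_model(catalog, min_tok_s, min_context):
    """Highest-quality model meeting a tok/s floor and a context floor."""
    ok = [m for m in catalog if m[2] >= min_tok_s and m[3] >= min_context]
    return max(ok, key=lambda m: m[1], default=None)

print(best_model(CATALOG, min_tok_s=50, min_context=100_000)[0])
# → Qwen 3.5 27B Q6  (quality-first, not speed-first)
```

The point is the sort key: rank by quality among everything that clears the speed/context bar, rather than ranking by tok/s.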

by twampss on 3/13/2026, 4:24 PM

Is this just llmfit but a web version of it?

https://github.com/AlexsJones/llmfit

by LeifCarrotson on 3/13/2026, 4:47 PM

This lacks a whole lot of mobile GPUs. It also does not understand that you can share CPU memory with the GPU, or perform various KV cache offloading strategies to work around memory limits.

It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser...but I actually have an RTX1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.

It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models, and I've tried some bigger ones (up to 24GB, it produces tokens about half as fast as I can type...which is disappointingly slow) that are slower but smarter.

But I don't run out of tokens.

by rahimnathwani on 3/14/2026, 2:02 AM

This site presents models in an incomplete and misleading way.

When I visit the site with an Apple M1 Max with 32GB RAM, the first model that's listed is Llama 3.1 8B, which is listed as needing 4.1GB RAM.

But the weights for Llama 3.1 8B are over 16GB. You can see that here in the official HF repo: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main

The model this site calls 'Llama 3.1 8B' is actually a 4-bit quantized version ( Q4_K_M) available on ollama.com/library: https://ollama.com/library/llama3.1:8b

If you're going to recommend a model to someone based on their hardware, you have to recommend not only a specific model, but a specific version of that model (either the original, or some specific quantized version).

This matters because different quantized versions of the model will have different RAM requirements and different performance characteristics.

Another thing I don't like is that the model names are sometimes misleading. For example, there's a model with the name 'DeepSeek R1 1.5B'. There's only one architecture for DeepSeek R1, and it has 671B parameters. The model they call 'DeepSeek R1 1.5B' does not use that architecture. It's a qwen2 1.5B model that's been finetuned on DeepSeek R1's outputs. (And it's a Q4_K_M quantized version.)
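The 16GB-vs-4.1GB gap is just the bits-per-weight difference. A sketch of the estimate; the bits-per-weight table below is approximate (real GGUF files add metadata and mix precisions across layers):

```python
# Approximate bits per weight for common quant schemes (assumed averages).
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.5,
       "Q4_K_M": 4.8, "MXFP4": 4.25}

def est_file_gb(params_b, quant):
    """Weight-file size in GB: parameters x bits-per-weight / 8."""
    return params_b * 1e9 * BPW[quant] / 8 / 1e9

for q in ("F16", "Q8_0", "Q4_K_M"):
    print(q, round(est_file_gb(8, q), 1))  # a Llama-3.1-8B-class model
```

F16 gives the ~16GB of the official repo; Q4_K_M lands near the ~4-5GB the site quotes, which is why naming the exact quant matters.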

by sxates on 3/13/2026, 4:05 PM

Cool thing!

A couple suggestions:

1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB.

2. It'd be great if I could flip this around and choose a model, and then see the performance for all the different processors. Would help with buying decisions!

by dxxvi on 3/14/2026, 12:55 AM

Not sure if there's anybody like me. I use AI for only 2 purposes: to replace Google Search to learn something, and to generate images. I wonder why there are not many models that do only 1 thing and do it well. For example, there's this one https://huggingface.co/Fortytwo-Network/Strand-Rust-Coder-14... for Rust coding. I haven't used it yet, so I don't know how it compares to the free models that Kilo Code provides.

by torginus on 3/13/2026, 9:00 PM

Huh, I never knew my browser just volunteers my exact hardware specs to any website without so much as even notifying me about it.

by mmaunder on 3/13/2026, 7:17 PM

OP, can you please make it less dark and slightly larger? Super useful otherwise. Qwen 3.5 9B is going to get a lot of love out of this.

by StefanoC on 3/14/2026, 2:40 PM

Can anybody share their setup using 64GB macs? I have an M2 Ultra studio and I'm trying Qwen 3.5 MLX models hosting them from the CLI, but I'm a bit stuck picking bigger models, more context, 4/8 bits, Opus-Reasoning-Distilled, coder... There are a bit too many permutations between mlx CLI flags, env variables, and models.

At the moment I'm exploring:

- nightmedia/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-qx64-hi-mlx

- BeastCode/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit

- mlx-community/Qwen3-Coder-Next-4bit

by olivercoleai on 3/14/2026, 2:02 PM

Useful tool, but the "can I run it" question obscures the more important question: "should I run it locally for my use case?"

For interactive chat and simple Q&A, local models are great — latency is predictable, privacy is absolute, and the quality gap with frontier models is narrowing for straightforward tasks. A quantized Llama running on an M-series Mac is genuinely useful.

But for agentic workflows — where the model needs to plan multi-step tasks, use tools, recover from errors, and maintain coherence across long interactions — the gap between local and frontier models is still enormous. I have seen local models confidently execute plans that make no sense, fail to recover from tool errors, and lose track of what they are doing after a few steps. Frontier models do this too sometimes, but at a much lower rate.

The practical middle ground I see working well: local models for fast, cheap tasks like commit message generation, code completion, and simple classification. Frontier API models for anything requiring planning, reasoning over large contexts, or reliability. The economics favor this split — running a local model costs electricity and GPU memory, while API calls cost per token. For high-volume low-complexity tasks, local wins. For low-volume high-complexity tasks, APIs win.

by andy_ppp on 3/13/2026, 6:20 PM

Is it correct that there's zero performance improvement between M4 (+Pro/Max) and M5 (+Pro/Max)? The data looks identical. Also, more memory does not seem to improve performance on larger models, when I thought it would have.

Love the idea though!

EDIT: Okay the whole thing is nonsense and just some rough guesswork or asking an LLM to estimate the values. You should have real data (I'm sure people here can help) and put ESTIMATE next to any of the combinations you are guessing.

by carra on 3/13/2026, 5:08 PM

Having the rating of how well the model will run for you is cool. I'd also like to have some rating of the model capabilities (even if this is tricky). There are way too many to choose from. And just looking at the parameter count or the used memory is not always a good indication of actual performance.

by phelm on 3/13/2026, 4:13 PM

This is awesome. It would be great to cross-reference some intelligence benchmarks so that I can understand the trade-off between RAM consumption, token rate, and how good the model is.

by rando1234 on 3/14/2026, 1:46 PM

On a related question, I'm in the market to buy a new laptop for development and want to get something with good support for local models. What is a good recommendation in terms of GPU support etc? I currently have a Dell XPS 13. Should I just get a MacBook? Or are there good non-Mac options?

by azmenak on 3/13/2026, 7:21 PM

From my personal testing, running various agentic tasks with a bunch of tool calls on an M4 Max 128GB, I've found that running quantized versions of larger models produces the best results, which this site completely ignores.

Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b)

by cafed00d on 3/13/2026, 6:09 PM

Open with multiple browsers (safari vs chrome) to get more "accurate + glanceable" rankings.

It's using WebGPU as a proxy to estimate system resources. Chrome tends to leverage as many resources (compute + memory) as the OS makes available. Safari tends to be more efficient.

Maybe this was obvious to everyone else. But it's worth reiterating for those of us who skim HN :)

by amdivia on 3/13/2026, 8:45 PM

I found this to be inaccurate: I can run GPT-OSS 120B (4-bit quant) on my 5090 and 64GB RAM system at around 40 t/s. Yet here the site claims it won't work.

by freediddy on 3/13/2026, 5:24 PM

I think perplexity is more important than tokens per second. Tokens per second is relatively useless in my opinion. There is nothing worse than getting bad results returned to you very quickly and confidently.

I've been working with quite a few open-weight models for the last year, and especially for things like images, models from 6 months ago would return garbage data quickly. But these days Qwen 3.5 is incredible, even the 9B model.

by zahirbmirza on 3/13/2026, 10:56 PM

This was depressing. But, also, I can't figure out why AI companies are valued so highly. The models will reach a limit (i.e. for what most people want to use a model for), and compute will increase over time.

by orthoxerox on 3/13/2026, 5:19 PM

For some reason it doesn't react to changing the RAM amount in the combo box at the top. If I open this on my Ryzen AI Max 395+ with 32 GB of unified memory, it thinks nothing will fit because I've set it up to reserve 512MB of RAM for the GPU.

by John23832 on 3/13/2026, 3:41 PM

RTX Pro 6000 is a glaring omission.

by modernerd on 3/14/2026, 12:51 PM

Would love it more if it could help me to answer:

- Which models in the list are the best for my selected task? (If you don't track these things regularly, the list is a little overwhelming.) Sorting by various benchmark scores might be useful?

- How much more system resources do I need to run the models currently listed at F, D or C at B, A, or S-tier levels? (Perhaps if you hover over the score, it could tell you?)

by am17an on 3/13/2026, 5:44 PM

You can still run larger MoE models by off-loading expert weights to the CPU for token generation. They are by and large usable; I get ~50 tok/s on a Kimi Linear 48B (3B active) model on a potato PC + a 3090.
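This works because token generation only needs to stream the *active* expert weights over CPU memory bandwidth, not the whole model. A rough bound, with assumed figures for quant width and dual-channel DDR5 bandwidth:

```python
def offload_tg_bound(active_params_b, bits_per_weight, cpu_bw_gb_s):
    """Token-generation bound when expert weights live in system RAM:
    each token streams only the active expert bytes over CPU bandwidth."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return cpu_bw_gb_s * 1e9 / bytes_per_token

# ~3B active params at ~4.5 bits/weight over ~80 GB/s DDR5 (assumed figures)
print(round(offload_tg_bound(3, 4.5, 80)))  # ≈ 47 t/s, near the ~50 quoted
```

The 20B total parameters only determine how much RAM you need, not how fast each token comes out.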

by GrayShade on 3/13/2026, 4:08 PM

This feels a bit pessimistic. Qwen 3.5 35B-A3B runs at 38 t/s tg with llama.cpp (mmap enabled) on my Radeon 6800 XT.

by 0xbadcafebee on 3/13/2026, 9:53 PM

Couple thoughts:

- The t/s estimation per machine is off. Some of these models run generation at twice the speed listed (I just checked on a couple macs & an AMD laptop). I guess there's no way around that, but some sort of sliding scale might be better.

- Ollama vs Llama.cpp vs others produce different results. I can run gpt-oss 20b with Ollama on a 16GB Mac, but it fails with "out of memory" with the latest llama.cpp (regardless of param tuning, using their mxfp4). Otoh, when llama.cpp does work, you can usually tweak it to be faster, if you learn the secret arts (like offloading only specific MoE tensors). So the t/s rating is even more subjective than just the hardware.

- It's great that they list speed and size per-quant, but that needs to be a filter for the main list. It might be "16 t/s" at Q4, but if it's a small model you need higher quant (Q5/6/8) to not lose quality, so the advertised t/s should be one of those

- Why is there an initial section which is all "performs poorly", and then "all models" below it shows a ton of models that perform well?

by adamhsn on 3/13/2026, 10:47 PM

Cool project!!

It would be useful to filter which model to use based on the objective or usage (i.e., for data extraction vs. coding).

Also, just looking at VRAM kind of misses that a lot of CPU memory can be shared with the GPU via layer offloading. I think there is ultimately a need for a native client, like a CPU/GPU benchmark, to figure out how the model will actually perform more precisely.

by RagnarD on 3/14/2026, 12:08 AM

I have an RTX 6000 Pro Max-Q, which has 96GB VRAM. It identified the hardware correctly but incorrectly thought it had 4GB, at least if I interpret the RAM dropdown correctly.

Then it shows the full resolution models, which are completely unnecessary to run quality inference. Quantized models are routine for local inference and it should realize that.

Needs work.

by suheilaaita on 3/14/2026, 9:25 AM

The simplest way to really start: use anything like Claude Code, VS Code, Cursor, Antigravity (or any other IDE), and ask it to install Ollama and pull the latest solid local model that you can run based on your computer specs.

Wait 5-10 minutes, and you should be done.

It genuinely is that simple.

You can even use local models through the Claude Code or Codex infrastructure (MASSIVE UNLOCK), but you need solid GPU(s) to run decent models. So that's the downside.

by gopalv on 3/13/2026, 8:10 PM

Chrome runs Gemini Nano if you flip a few feature flags on [1].

The model is not great, but it was the "least amount of setup" LLM I could run on someone else's machine.

It even supports structured output, but only has a tiny context window I could use.

[1] - https://notmysock.org/code/voice-gemini-prompt.html

by GTP on 3/14/2026, 7:21 PM

If I'm sorting models by score (the default), which kind of score is it?

by kpw94 on 3/13/2026, 8:04 PM

People complaining about how hard it is to get a simple answer don't appreciate the complexity of figuring out optimal models...

There are so many knobs to tweak; it's a non-trivial problem:

- Average/median length of your Prompts

- prompt eval speed (tok/s)

- token generation speed (tok/s)

- Image/media encoding speed for vision tasks

- Total amount of RAM

- Max bandwidth of RAM (DDR4, DDR5, etc.)

- Total amount of VRAM

- "-ngl" (amount of layers offloaded to GPU)

- Context size needed (you may need sub 16k for OCR tasks for instance)

- Total parameter count (billions)

- Active parameter count for MoE (billions)

- Acceptable level of Perplexity for your use case(s)

- How aggressive a quantization you're willing to accept (to maintain low enough perplexity)

- even finer grain knobs: temperature, penalties etc.

Also, tok/s as a metric isn't enough, because there's:

- thinking vs non-thinking: which mode do you need?

- models that are much more "chatty" than others in the same area (I remember testing a few models that max out my modest desktop specs; Qwen 2.5 non-thinking was so much faster than the equivalent Ministral non-thinking even though they had equivalent tok/s... Qwen would respond to the point quickly)

At the end, final questions are: are you satisfied with how long getting an answer took? and was the answer good enough?

The same exercise exists with paid APIs too. Obviously fewer knobs, but depending on your use case there are still differences between providers and models. You can abstract away a lot of the knobs; just add "are you satisfied with how much it cost?" on top of the other 2 questions.
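The "chatty vs. terse" point above is easy to quantify: two models with identical tok/s can differ hugely in wall-clock time per answer once hidden thinking tokens are counted. A sketch with illustrative numbers (all figures below are assumptions, not measurements):

```python
def time_to_answer(prompt_toks, answer_toks, thinking_toks,
                   prefill_tok_s, gen_tok_s):
    """Wall-clock seconds for one turn: prefill the prompt, then
    generate hidden thinking tokens plus the visible answer."""
    return prompt_toks / prefill_tok_s + (thinking_toks + answer_toks) / gen_tok_s

# Same 40 tok/s generation speed, very different experienced latency:
terse  = time_to_answer(2000, 300, 0,    400, 40)   # no thinking tokens
chatty = time_to_answer(2000, 300, 1200, 400, 40)   # 1200 thinking tokens
print(round(terse, 1), round(chatty, 1))  # → 12.5 42.5
```

So the final question really is "how long did the answer take?", not "what was the tok/s?".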

by sdingi on 3/13/2026, 5:41 PM

When running models on my phone - either through the web browser or via an app - is there any chance it uses the phone's NPU, or will these be GPU only?

I don't really understand how the interface to the NPU chip looks from the perspective of a non-system caller, if it exists at all. This is a Samsung device but I am wondering about the general principle.

by dale_glass on 3/14/2026, 11:54 AM

It's missing Ryzen AI MAX+, which is sort of the Apple Silicon equivalent.

by Stronz on 3/14/2026, 9:31 AM

One thing I noticed with local models is that conversation behavior tends to drift over time as well.

Even when running locally, the model often starts structured but gradually becomes more verbose or explanatory in longer threads.

Curious if others have seen similar behavior when using local setups.

by scorpioxy on 3/14/2026, 1:52 AM

Besides trying to run on your own hardware, does anybody have recommendations for running some decent models on one of the many "AI cloud" providers? This is for sporadic use, so maybe one of the "serverless" providers that bill by the hour or minute, as opposed to renting GPUs monthly.

There are quite a few of them but their marketing is just confusing and full of buzz words. I've been tinkering with OpenRouter that acts as a middleman.

by paxys on 3/13/2026, 8:07 PM

I wish creators of local model inference tools (LM Studio, Ollama etc.) would release these numbers publicly, because you can be sure they are sitting on a large dataset of real-world performance.

by pants2 on 3/13/2026, 8:43 PM

This really highlights the impracticality of local models:

My $3k Macbook can run `GPT-OSS 20B` at ~16 tok/s according to this guide.

Or I can run `GPT-OSS 120B` (a 6X larger model) at 360 tok/s (30X faster) on Groq at $0.60/Mtok output tokens.

To generate $3k worth of output tokens on my local Mac at that pricing it would have to run 10 years continuously without stopping.

There's virtually no economic break-even to running local models, and no advantage in intelligence or speed. The only thing you really get is privacy and offline access.
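The break-even arithmetic above can be reproduced directly (using the comment's own figures; electricity and quality differences are ignored):

```python
def breakeven_years(hw_cost_usd, local_tok_s, api_usd_per_mtok):
    """Years of continuous generation before local output 'pays for'
    the hardware at the quoted API output-token price."""
    usd_per_sec = local_tok_s / 1e6 * api_usd_per_mtok
    return hw_cost_usd / usd_per_sec / (3600 * 24 * 365)

# $3k Mac at 16 tok/s vs Groq at $0.60 per million output tokens
print(round(breakeven_years(3000, 16, 0.60)))  # → 10 years
```

Of course this only values output tokens at API prices; the privacy and offline arguments sit outside the equation entirely.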

by rcarmo on 3/13/2026, 6:33 PM

This is kind of bogus since some of the S and A tier models are pretty useless for reasoning or tool calls and can’t run with any sizable system prompt… it seems to be solely based on tokens per second?

by johneth on 3/13/2026, 10:17 PM

Re: the design of the site. Please use higher contrast colours, especially the barely visible grey text on black background. It's annoying to try to read.

by stared on 3/14/2026, 9:36 AM

I would love to see on this list some (any) benchmark.

"I can run a model" is mildly interesting. I can run OSS-20B on my M1 Pro. It works, I tried it; I just haven't found any application for it.

by AstroBen on 3/13/2026, 5:10 PM

This doesn't look accurate to me. I have an RX9070 and I've been messing around with Qwen 3.5 35B-A3B. According to this site I can't even run it, yet I'm getting 32tok/s ^.-

by lrpe on 3/14/2026, 6:46 PM

This site desperately needs a light mode.

by mkagenius on 3/13/2026, 6:26 PM

Literally made the same app, 2 weeks back - https://news.ycombinator.com/item?id=47171499

by Decabytes on 3/14/2026, 3:19 AM

Does anyone use the super tiny models for anything ? Like in the 2billion or lower parameter level?

by dzink on 3/13/2026, 8:46 PM

This would be wonderful if it is accurate - instead of guesstimating, let people report their actual findings. I can confirm GLM 4.7 is possible on an M1 Max, and it can do nice comprehensive answers (albeit at 12 min an answer) locally. You can also easily do Mistral 7B and OSS 20B and others. Structure it as a way to report actuals, similarly to Levels.xyz for salaries, instead of guesstimating.

by TheCapn on 3/13/2026, 9:53 PM

@OP are you the creator? Could you add my GPU to the list?

Radeon VII

https://www.amd.com/en/support/downloads/drivers.html/graphi...

by amelius on 3/13/2026, 5:43 PM

It would be great if something like this was built into ollama, so you could easily list available models based on your current hardware setup, from the CLI.

by starkeeper on 3/13/2026, 8:41 PM

This is awesome!!!

Could you please add title="explanation" over each selected item at the top? For example, when I choose my video card the RAM changes... I'm not sure if the RAM selection is GPU RAM? The VRAM was already listed with the graphics card. So I chose 96GB, which is my main memory? And the GB/s, I am assuming, is GPU -> CPU speed?

by manlymuppet on 3/14/2026, 4:13 AM

Would be useful if comparable scores for performance are added, perhaps from arena.ai or ARC. I know scores can be imperfect, but it would be nice to be able to easily see what the best model your machine can handle is.

by eichin on 3/14/2026, 4:14 AM

I'm surprised that this shows anything running usefully on my 2021-era ThinkPad (with "Iris Xe" Tiger Lake graphics), which inspires me to ask: are external GPUs useful for this sort of thing?

by sidchilling on 3/13/2026, 8:34 PM

I have been trying to run Qwen Coder models (8B at 4bit) on my M3 Pro 18GB behind Ollama and connecting codex CLI to it. The tool usage seems practically zero, like it returns the tool call in text JSON and codex CLI doesn’t run the tool (just displays the tool call in text). Has anyone succeeded in doing something like this? What am I missing?

by sshagent on 3/13/2026, 4:58 PM

I don't see my beloved 5060ti. looks great though

by dirk94018 on 3/13/2026, 9:41 PM

We wrote the linuxtoaster inference engine, toasted, and are getting 400 prefill, 100 gen on a M4 Max w 128GB RAM on Qwen3-next-coder 6bit, 8bit runs too. KV caching means it feels snappy in chat mode. Local can work. For pro work, programming, I'd still prefer SOTA models, or GLM 4.7 via Cerebras.

by winterismute on 3/13/2026, 11:08 PM

Oddly, the website lists "M4 Ultra" which however does not exist... Also, it does not account for Apple Silicon chips to have up to 512GB of memory in some cases, but that might be only a limitation of the gathered data.

by hotsalad on 3/13/2026, 11:04 PM

This says I can't run anything, because it's missing some of the smallest models. I know that I can run Qwen3.5 up to 4B, Ministral 3B, Qwen3VL up to 4B, and I know there are some Gemmas and Llamas in my size range.

by SXX on 3/13/2026, 6:56 PM

Sorry if already been answered, but will there be a metric for latency aka time to first token?

I've been considering buying an M3 Ultra, and it feels like the most often discussed Apple hardware for running local LLMs. Generation speed might be okay, but prompt processing can take ages.
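Time to first token is dominated by prompt processing, so it is roughly prompt length divided by prefill speed. A tiny sketch; the 300 tok/s prefill figure is just a plausible ballpark for a large model on Apple Silicon, not a measurement:

```python
def ttft_seconds(prompt_tokens, prefill_tok_s):
    """Time to first token ≈ prompt length / prompt-processing speed."""
    return prompt_tokens / prefill_tok_s

# A 60k-token agent context at an assumed ~300 tok/s prefill:
print(round(ttft_seconds(60_000, 300)))  # → 200
```

That ~200 seconds before any output appears is why prefill speed deserves its own column next to generation tok/s.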

by arjie on 3/13/2026, 4:44 PM

Cool website. The one that I'd really like to see there is the RTX 6000 Pro Blackwell 96 GB, though.

by vova_hn2 on 3/13/2026, 4:31 PM

It says "RAM - unknown", but doesn't give me an option to specify how much RAM I have. Why?

by comrade1234 on 3/13/2026, 9:03 PM

I can't tell at a glance what this page is showing, but I am curious about the licenses on the various models that let me run them locally and make money off them. A while ago only DeepSeek let you do that - not sure now.

by mrdependable on 3/13/2026, 4:25 PM

This is great, I've been trying to figure this stuff out recently.

One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.

by ge96 on 3/13/2026, 4:46 PM

Raspberry Pi? Say a 4B with 4GB of RAM.

I also want to run vision like Yocto and basic LLM with TTS/STT

by kuon on 3/13/2026, 7:38 PM

I have an AMD 9700 and it is not listed, even though it is great LLM hardware because it has 32GB for a reasonable price. I tried doing "custom" but it didn't seem to work.

The tool is very nice though.

by starkparker on 3/14/2026, 2:01 AM

Every time I refresh the page, I get a higher tokens/second value, presumably because of how it keys off memory bandwidth.

by zitterbewegung on 3/13/2026, 6:29 PM

The M4 Ultra doesn't exist, and there are more credible rumors of an M5 Ultra. I wouldn't put up a projection like that without highlighting that this processor doesn't exist yet.

by adithyassekhar on 3/13/2026, 4:21 PM

This just reminded me of this https://www.systemrequirementslab.com/cyri.

Not sure if it still works.

by spidrahedron on 3/14/2026, 7:15 AM

This is super useful for people without access to GPUs and servers.

by havaloc on 3/13/2026, 4:34 PM

Missing the A18 Neo! :)

by debatem1 on 3/13/2026, 4:32 PM

For me the "can run" filter says "S/A/B" but lists S, A, B, and C and the "tight fit" filter says "C/D" but lists F.

Just FYI.

by urba_ on 3/13/2026, 9:25 PM

Man, I wonder when there will be AI server farms made from iCloud locked jailbroken iPhone 16s with backported MacOS

by anigbrowl on 3/13/2026, 6:58 PM

Useful tool, although some of the dark grey text is so dark that I had to squint to make it out against the background.

by storus on 3/13/2026, 11:16 PM

Missing latest Nvidia cards like RTX Pro 6000; M3 Ultra can have at most 192GB selected etc.

by golem14 on 3/13/2026, 6:02 PM

Has anyone actually built anything with this tool?

The website says that code export is not working yet.

That’s a very strange way to advertise yourself.

by 3Sophons on 3/14/2026, 12:23 AM

A lighter-weight alternative to Docker and Python is the Rust+Wasm stack: https://github.com/LlamaEdge/LlamaEdge

by fraywing on 3/13/2026, 7:58 PM

This is amazing. Still waiting for the "Medusa" class AMD chips to build my own AI machine.

by raiph_ai on 3/14/2026, 1:55 AM

Great site. I have an M2 and an M3 Pro and was thinking about getting an M4 Ultra, and wanted to know if it was going to be worth it. Now I can see exactly what models I can run locally.

by ementally on 3/13/2026, 10:12 PM

In mobile section it is missing Tensor chips (used by Google Pixel devices).

by bearjaws on 3/13/2026, 7:31 PM

So many people have vibe coded these websites, they are posted to Reddit near daily.

by vednig on 3/13/2026, 8:29 PM

Our work at DoShare covers a lot of this stuff; we've been on it for 2 years.

by amelius on 3/13/2026, 5:32 PM

What is this S/A/B/C/etc. ranking? Is anyone else using it?

by jrmg on 3/13/2026, 5:09 PM

Is there a reliable guide somewhere to setting up local AI for coding (please don’t say ‘just Google it’ - that just results in a morass of AI slop/SEO pages with out of date, non-self-consistent, incorrect or impossible instructions).

I’d like to be able to use a local model (which one?) to power Copilot in vscode, and run coding agent(s) (not general purpose OpenClaw-like agents) on my M2 MacBook. I know it’ll be slow.

I suspect this is actually fairly easy to set up - if you know how.

by charcircuit on 3/13/2026, 4:32 PM

On mobile it does not show the name of the model in favor of the other stats.

by amelius on 3/13/2026, 5:31 PM

Why isn't there some kind of benchmark score in the list?

by sand500 on 3/13/2026, 9:57 PM

How does it have details for M4 ultra?

by tencentshill on 3/13/2026, 6:44 PM

Missing laptop versions of all these chips.

by tcbrah on 3/13/2026, 5:40 PM

tbh i stopped caring about "can i run X locally" a while ago. for anything where quality matters (scripting, code, complex reasoning) the local models are just not there yet compared to API. where local shines is specific narrow tasks - TTS, embeddings, whisper for STT, stuff like that. trying to run a 70b model at 3 tok/s on your gaming GPU when you could just hit an API for like $0.002/req feels like a weird flex IMO

by ryandrake on 3/13/2026, 6:19 PM

Missing RTX A4000 20GB from the GPU list.

by d0100 on 3/14/2026, 1:40 AM

Why is there no RTX 5060ti?

by nicklo on 3/13/2026, 8:53 PM

the animation of the model name text when opening the detail view is so smooth and delightful

by Readerium on 3/13/2026, 8:45 PM

Qwen 3.5 4B is the goat then

by reactordev on 3/13/2026, 7:21 PM

This shows no models work with my hardware but that’s furthest from the truth as I’m running Qwen3.5…

This isn’t nearly complete.

by g_br_l on 3/13/2026, 4:31 PM

could you add raspi to the list to see which ridiculously small models it can run?

by varispeed on 3/13/2026, 5:16 PM

Does it make any sense? I tried a few models at 128GB and it's all pretty much rubbish. Yes, they do give coherent answers, and sometimes they are even correct, but most of the time it is just plain wrong. I find it a massive waste of time.

by metalliqaz on 3/13/2026, 4:33 PM

Hugging Face can already do this for you (with a much more up-to-date list of available models). Also LM Studio. However, they don't attempt to estimate tok/sec, so that's a cool feature. That said, I don't really trust those numbers that much because it is not incorporating information about the CPU, etc. Full GPU offload often isn't possible on consumer PC hardware. Also, there are different quants available that make a big difference.

by markdown on 3/14/2026, 6:20 AM

Protip: the website requires Cmd and + a few times to increase the font size to 200%.

by bheadmaster on 3/13/2026, 6:42 PM

Missing 5060 Ti 16GB

by lagrange77 on 3/13/2026, 7:05 PM

Finally! I've been waiting for something like this.

by casey2 on 3/14/2026, 1:29 AM

Something notable is that Qwen3.5:0.8B does better on benchmarks than GPT-3.5, and runs much faster on local hardware than GPT-3.5 did at release. However, Qwen3.5:0.8B is dumber and slower than GPT-3.5. It's dumber: it can do 3*3, but if asked to explain it in terms of the definition (i.e. 3+3+3=9) it fails. It's slower: it's a thinking model, so your 900 t/s are mainly spent "thinking"; most of the time it will just repeat until it hangs.

It's pretty obvious that this reasoning scaling is a mirage; parameters are all you need. Everything else is mostly just wasting time while hardware gets better.

by butILoveLife on 3/13/2026, 11:32 PM

This is borderline irresponsible. Conflating first tokens with all tokens is terrible. Apple looks far better than it actually is.

Just ask any Apple user, they don't actually use local models.

by S4phyre on 3/13/2026, 4:16 PM

Oh how cool. Always wanted to have a tool like this.

by ThrowawayTestr on 3/13/2026, 8:46 PM

For image generation or even video generation, local models are totally feasible. I can generate a 5-second clip with Wan 2.2 in about 30 minutes on my 3060 12G. Plus, I have full control over the LoRAs used.

by ipunchghosts on 3/13/2026, 7:41 PM

What is S? Also, NVIDIA RTX 4500 Ada is missing.

by tristor on 3/13/2026, 6:39 PM

This does not seem accurate based on my recently received M5 Max 128GB MBP. I think there's some estimates/guesswork involved, and it's also discounting that you can move the memory divider on Unified Memory devices like Apple Silicon and AMD AI Max 395+.

by brcmthrowaway on 3/13/2026, 5:50 PM

If anyone hasn't tried Qwen3.5 on Apple Silicon, I highly suggest you to! Claude level performance on local hardware. If the Qwen team didn't get fired, I would be bullish on Local LLM.

by kylehotchkiss on 3/13/2026, 5:23 PM

My Mac mini rocks Qwen 2.5 14B at a lightning-fast 11 tokens a second. Which is actually good enough for the long-term data processing I make it spend all day doing. It doesn't lock up the machine or prevent its primary purpose as a webserver from being fulfilled.

by nilslindemann on 3/13/2026, 5:29 PM

1. More title attributes please ("S 16 A 7 B 7 C 0 D 4 F 34", huh?)

2. Add a 150% size bonus to your site.

Otherwise, cool site, bookmarked.

by nazbasho on 3/13/2026, 10:32 PM

It's perfect.

by Akuehne on 3/13/2026, 9:32 PM

Can we get some of the ancient Nvidia Teslas, like the p40 added?

by tkfoss on 3/13/2026, 6:57 PM

Nice UI, but crap data, probably llm generated.

by polyterative on 3/13/2026, 6:36 PM

awesome, needed this

by remote3body on 3/13/2026, 11:14 PM

The 'spent 100 hours configuring' part hits home. That fragmentation is exactly why we started building Olares (https://github.com/beclab/Olares).

It’s basically an open-source OS layer that standardizes the local AI stack—Kubernetes (K3s) for orchestration, standardized model serving, and GPU scheduling. The goal is to stop fiddling with Python environments/drivers and just treat local agents like standardized containers. It runs on Mac Minis or dedicated hardware.