Running a 180B parameter LLM on a single Apple M2 Ultra

by tbruckner on 9/7/2023, 2:36 PM with 141 comments

by adam_arthur on 9/7/2023, 3:32 PM

Even a linear growth rate of average RAM capacity would obviate the need to run current SOTA LLMs remotely in short order.

Historically average RAM has grown far faster than linear, and there really hasn't been anything pressing manufacturers to push the envelope here in the past few years... until now.

It could be that LLM model sizes keep increasing such that we continue to require cloud consumption, but I suspect the sizes will not increase as quickly as hardware for inference.

Given how useful GPT-4 is already, maybe one more iteration would unlock the vast majority of practical use cases.

I think people will be surprised that consumers ultimately end up benefiting far more from LLMs than the providers do. There's not going to be much moat or differentiation to defend margins... more of a race to the bottom on pricing.

by logicchains on 9/7/2023, 3:17 PM

Pretty amazing that in such a short span of time we went from people being amazed at how powerful GPT-3.5 was upon its release to people being able to run something equivalently powerful locally.

by regularfry on 9/7/2023, 3:10 PM

4-bit quantised model, to be precise.

When does this guy sleep?
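
For readers curious what 4-bit quantization means in practice, here is a rough sketch of symmetric block quantization in the spirit of llama.cpp's Q4 formats. It is a simplified illustration, not the actual ggml/GGUF encoding, and every name in it is made up for the example.

  # Simplified illustration of 4-bit block quantization (not the actual ggml Q4 format).
  import numpy as np

  BLOCK = 32  # llama.cpp's Q4 formats also group weights into small blocks

  def quantize_q4(weights: np.ndarray):
      """Quantize a 1-D float array to 4-bit ints plus one fp16 scale per block."""
      blocks = weights.reshape(-1, BLOCK)
      scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # map values into [-7, 7]
      scales[scales == 0] = 1.0
      q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
      return q, scales.astype(np.float16)

  def dequantize_q4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
      """Recover approximate float weights from the quantized representation."""
      return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

  w = np.random.randn(1024).astype(np.float32)
  q, s = quantize_q4(w)
  print("mean abs error:", np.abs(w - dequantize_q4(q, s)).mean())
  # Storage: 4 bits per weight plus 16 bits per 32-weight block, about 4.5 bits/weight,
  # versus 16 bits/weight for fp16, so roughly a 3.5x reduction.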

by sbierwagen on 9/7/2023, 3:18 PM

The screenshot shows a working set size of 147,456 MB, so he's using the Mac Studio with 192 GB of RAM?
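
A back-of-envelope estimate (rough numbers, not measurements from the run) suggests the answer is almost certainly the 192 GB configuration: at any reasonable quantization level the weights alone won't fit in 64 GB.

  # Rough memory estimate for 180B parameters at different precisions.
  # Weights only; the KV cache and runtime overhead make the real working set larger.
  params = 180e9

  for name, bits_per_weight in [("fp16", 16), ("6-bit", 6), ("4-bit + scales", 4.5)]:
      gb = params * bits_per_weight / 8 / 1e9
      print(f"{name:>15}: ~{gb:,.0f} GB")

  # fp16          : ~360 GB  (far beyond any single consumer machine)
  # 6-bit         : ~135 GB
  # 4-bit + scales: ~101 GB  (fits in 192 GB of unified memory; nowhere near 64 GB)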

by m3kw9 on 9/7/2023, 4:43 PM

OpenAI's moat will soon largely be UX. Anyone can do plugins, code, etc., but when these tools are operated by everyday users, the best UX wins once LLMs become commoditized. Just look at standalone digital cameras vs. mobile phone cameras from Apple.

by homarp on 9/7/2023, 3:16 PM

https://www.reddit.com/r/LocalLLaMA/comments/16bynin/falcon_... has some more data, like sample answers at various levels of quantization

and https://huggingface.co/TheBloke/Falcon-180B-Chat-GGUF if you want to try
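
If you're wondering what "trying it" looks like once a GGUF file is on disk, here is a minimal sketch using the llama-cpp-python bindings. The filename is a placeholder, and it assumes a recent llama.cpp build with Falcon/GGUF support plus a machine with enough unified memory to hold the model.

  # Minimal local-inference sketch using the llama-cpp-python bindings
  # (pip install llama-cpp-python). Assumes a Falcon-180B GGUF quantization
  # has already been downloaded from the repo above.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./falcon-180b-chat.Q4_K_M.gguf",  # placeholder path; use whichever file you fetched
      n_ctx=2048,    # context window
      n_threads=4,   # matches the 4 threads visible in the demo
  )

  out = llm(
      "User: Explain unified memory in one paragraph.\nAssistant:",
      max_tokens=128,
      temperature=0.7,
  )
  print(out["choices"][0]["text"])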

by doctoboggan on 9/7/2023, 5:17 PM

Georgi is doing so much to democratize LLM access; I am very thankful he is doing it all on Apple silicon!

by pella on 9/7/2023, 3:20 PM

Is this an M2 Ultra with 192 GB of unified memory, or the standard version with 64 GB of unified memory?

by Havoc on 9/7/2023, 10:42 PM

Great progress, but I also can't help but feel a sense of apprehension on the access front.

An M2 Ultra, while consumer tech, is affordable to only a fairly small % of the world population.

by ViktorBash on 9/7/2023, 6:07 PM

It's refreshing to see how fast open LLMs are advancing in terms of the models available. A year ago I thought that, aside from the novelty of it, running LLMs locally would be nowhere close to stuff like OpenAI's closed models in terms of utility.

As more and more models become open and can be run locally, the precedent gets stronger (which is good for the end consumer, in my opinion).

by randomopining on 9/7/2023, 3:40 PM

Are there any actual use cases for running this stuff on a local computer? Or are most of these models better suited to running on remote clusters?

by two_in_one on 9/8/2023, 4:58 AM

Just wondering what local LLMs are used for today? So far they look more like a promise.

by tiffanyh on 9/7/2023, 4:11 PM

  system_info: n_threads = 4 / 24
Am I seeing correctly in the video that this ran on only 4 threads?

by growt on 9/7/2023, 5:48 PM

So how much RAM did the machine have?

by rvz on 9/7/2023, 3:25 PM

Totally makes sense to use C++- or Rust-based AI models for inference instead of over-bloated networks run on Python, with their sub-optimal inference and fine-tuning costs.

Minimal-overhead or zero-cost abstractions around deep learning libraries implemented in those languages give some hope that people like ggerganov are not afraid of the "don't roll your own deep learning library" dogma, and now we can see the results: DL on the edge and local AI are the future of efficiency in deep learning.

We'll see, but Python just can't compete on speed at all, which is why Modular's Mojo compiler is another one that tries to solve the problem properly, with near 1:1 familiarity with Python.