Small offline large language model – TinyChatEngine from MIT

by physicsgraph on 12/18/2023, 2:57 AM, with 24 comments

by antirez on 12/18/2023, 8:18 AM

Use llama.cpp for quantized model inference. It is simpler (neither Docker nor Python is required), faster (it works well on CPUs), and supports many models.
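For illustration, here is a minimal sketch of quantized inference with llama.cpp through the llama-cpp-python bindings. Note that antirez is recommending the plain C++ CLI, which needs no Python at all; the bindings and the model filename below are assumptions, not part of the comment.

```python
# Minimal sketch: quantized GGUF inference via llama-cpp-python.
# The model path is a placeholder; any 4-bit GGUF file would do.
from llama_cpp import Llama

# Load a 4-bit quantized model; n_ctx sets the context window.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

# Run a single completion on the CPU.
out = llm("Q: Name the planets in the solar system. A:",
          max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```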

Also, there are better models than the one suggested: Mistral at 7B parameters; Yi if you want to go larger and happen to have 32 GB of memory; Mixtral MoE is the best, but right now it requires too much memory for most users.
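As a rough back-of-the-envelope check on those memory figures, weight memory scales with parameter count times bits per weight. The sketch below assumes 4-bit quantization and ignores the KV cache and runtime overhead, so real usage is somewhat higher.

```python
# Rough weight-memory estimate for 4-bit quantized models
# (ignores KV cache, activations, and runtime overhead).
def weights_gb(n_params_billion, bits_per_weight=4):
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

models = [
    ("Mistral 7B", 7),
    ("Yi 34B", 34),
    ("Mixtral 8x7B (~47B total params)", 47),
]
for name, params in models:
    print(f"{name}: ~{weights_gb(params):.1f} GB of weights at 4-bit")
# Mistral 7B comes out around 3.5 GB, Yi 34B around 17 GB,
# and Mixtral well above 20 GB before any runtime overhead.
```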

by upon_drumhead on 12/18/2023, 6:17 AM

I’m a tad confused

> TinyChatEngine provides an off-line open-source large language model (LLM) that has been reduced in size.

But then they download the models from huggingface, so I don't understand how these are smaller. Or do they modify them locally?

by rodnim on 12/18/2023, 10:26 AM

"Small large" ..... so, medium? :)

by aravindgp on 12/18/2023, 8:00 AM

I have used them and can say they're pretty decent overall. I personally plan to use TinyEngine, which targets even smaller IoT microcontroller devices.

by collyw on 12/18/2023, 12:08 PM

Where is a good place to learn the high-level topics in AI, like how an offline language model compares to a presumably online model?

by dkjaudyeqooe on 12/18/2023, 11:04 AM

I tried this and installation was easy on macOS 10.14.6 (once I updated Clang correctly).

On my relatively old i5-8600 CPU (6 cores at 3.10 GHz) with 32 GB of memory, I get about 150-250 ms per token on the default model, which is perfectly usable.
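For context, per-token latency converts directly into throughput; the quick conversion below just restates the figures above, it is not an additional benchmark.

```python
# Convert the reported per-token latency into tokens per second.
for ms_per_token in (150, 250):
    print(f"{ms_per_token} ms/token ≈ {1000 / ms_per_token:.1f} tokens/s")
# 150 ms/token ≈ 6.7 tokens/s; 250 ms/token ≈ 4.0 tokens/s
```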