Show HN: Llama 3.1 8B CPU Inference in a Browser via WebAssembly

by om8on 12/25/2024, 5:39 PM with 4 comments

by om8on 12/25/2024, 6:00 PM

This is a demo of what's possible to run on edge devices using SOTA quantization. Other similar projects that try to run 8B models in the browser either use WebGPU or a 2-bit quantization scheme that breaks the model. I implemented inference over the AQLM quantized representation, yielding a model that is quantized to 2 bits per weight and does not blow up.
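For context on how this kind of scheme works: AQLM reaches ~2 bits per weight by encoding each group of weights as an index into a learned codebook (e.g. one 16-bit code per group of 8 weights), so at inference time the weights are reconstructed by codebook lookups. Below is a minimal TypeScript sketch of that decode step; the function name, array layout, and sizes are my own illustrative assumptions, not the project's actual code.

```typescript
// Hypothetical sketch of AQLM-style weight decoding.
// Assumed layout: one 16-bit code per group of `groupSize` weights,
// and a flattened codebook of shape [numEntries * groupSize].
// With a 2^16-entry codebook and groups of 8, this is 16 bits per
// 8 weights, i.e. 2 bits per weight.
function decodeGroups(
  codes: Uint16Array,      // one code per weight group
  codebook: Float32Array,  // flattened [entry][element] table
  groupSize: number = 8
): Float32Array {
  const out = new Float32Array(codes.length * groupSize);
  for (let g = 0; g < codes.length; g++) {
    // Each code selects one codebook vector for this group.
    const base = codes[g] * groupSize;
    for (let i = 0; i < groupSize; i++) {
      out[g * groupSize + i] = codebook[base + i];
    }
  }
  return out;
}
```

In practice the decode is fused into the matmul kernels (here, compiled to WebAssembly) rather than materializing full weight tensors, which is what makes running an 8B model in a memory-constrained browser tab feasible.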