> BitNet b1.58 can match the performance of the full precision baseline starting from a 3B size. ... This demonstrates that BitNet b1.58 is a Pareto improvement over the state-of-the-art LLM models.
> BitNet b1.58 is enabling a new scaling law with respect to model performance and inference cost. As a reference, we can have the following equivalence between different model sizes in 1.58-bit and 16-bit based on the results in Figure 2 and 3.
> ⢠13B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 3B FP16 LLM.
> ⢠30B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 7B FP16 LLM.
> ⢠70B BitNet b1.58 is more efficient, in terms of latency, memory usage and energy consumption, than 13B FP16 LLM.
This paper seems to represent a monumental breakthrough in LLM efficiency, as the efficiency gains come with zero (or negative) performance penalty.
Does it seem at all likely that existing models could be converted?
I have often mused that, in some ways, it seems like the transistor is really being wasted in AI applications. We use binary states in normal computing to reduce entropy. In AI this is less of a concern, so why not use more of the available voltage range? Basically, re-think the role of the transistor and re-design from the ground up - maybe NAND gates are not the ideal fundamental building block here?
I was reading Exposing Floating Point today (as Airfoil is on the HN front page and I was perusing the archive of the author). It's a blog explaining the inner workings of floating point representations. About zero values it says [0]:
> Yes, the floating point standard specifies both +0.0 and -0.0. This concept is actually useful because it tells us from which "direction" the 0 was approached as a result of storing a value too small to be represented in a float. For instance -10e-30f / 10e30f won't fit in a float, however, it will produce the value of -0.0.
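A quick way to see that behaviour (my own check, using NumPy's float32 to mimic the blog's single-precision example):

```python
import numpy as np

z = np.float32(-10e-30) / np.float32(10e30)   # -1e-60 underflows past float32's smallest subnormal
print(z)                                       # -0.0
print(np.signbit(z))                           # True: the sign ("direction") survives
print(z == np.float32(0.0))                    # True: -0.0 still compares equal to +0.0
```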
The authors of the LLM paper use the values {-1, 0, 1}. Connecting the two ideas, I'm now wondering whether having a 2-bit {-1, -0, 0, 1} representation might have any benefit over the proposed 1.58 bits. Could the additional -0 carry some pseudo-gradient information ("the 0 leaning towards the negative side")?
Also, I've seen 2-bit quantizations being proposed in other LLM quantization papers. What values are they using?
After reading the results I skipped back to the comment section to ask if this was real because it looks a little too good to be true, but figured I should check the authors, and it's Microsoft Research and UCAS, so yeah, real. This is going to change a lot of things: obviously the edge computing applications they point out, but it's also going to bottom out the cost of providing high-performance LLMs in the cloud. I don't know what that means for the economics long term; naively, much lower costs might mean new entrants without an entire cloud available can compete more easily? I do wonder if something like this has already been found and implemented by either OpenAI or Google.
That's not a 'bit' ("Binary digIT"). It's closer to a 'trit' ("TeRnary-digIT"). Specifically, ternary digits spanning {-1, 0, 1} (rather than the usual {0, 1, 2} in a base-3 numbering system) are 'balanced ternary'.
A great intro to the theoretical reasons ternary might have some promise in computing is this 2001 article from 'American Scientist', "Third Base", which quotes Knuth calling balanced-ternary "perhaps the prettiest numbering system of all" and also discusses an abortive Soviet effort in the direction of ternary computing:
http://web.archive.org/web/20011205185830/http://americansci...
In an aside, the article hints that e-nary digits (base 2.718…), if somehow made practical/meaningful, might actually be better than ternary (or perhaps even optimal?).
So maybe this paper's observation that ~"1.58 bits" (log2(3) binary digits) is a sweet spot could be further refined into some method for representing the state of an e-nary-modeled algorithm in log2(e) binary digits (~"1.44 bits") per underlying e-it.
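(Quick sanity check on those two figures:)

```python
import math

print(math.log2(3))        # 1.5849... bits of information per balanced-ternary trit
print(math.log2(math.e))   # 1.4426... bits per hypothetical "e-it"
```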
(As it may be of renewed interest, I've also put this 2001 "American Scientist" base-3 intro as a new HN submission for discussion: https://news.ycombinator.com/item?id=39541756)
Take this with a grain of salt until someone reproduces it. Improvements such as these require extraordinary evidence. Not to mention extreme quantization has been tried before.
Major breakthrough in the LLM scene. It achieves performance and perplexity equivalent to full FP16 models of the same parameter size.
And you can fit a 120B model on a single card with 24GB VRAM. This is mind blowing.
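Weights-only back-of-envelope behind that kind of claim (my own arithmetic; it ignores activations and the KV cache):

```python
params = 120e9                          # 120B parameters
print(params * 1.58 / 8 / 1e9)          # ~23.7 GB if packed near log2(3) bits per weight
print(params * 2.00 / 8 / 1e9)          # 30.0 GB with a naive 2-bits-per-weight packing
print(params * 16.0 / 8 / 1e9)          # 240.0 GB for the FP16 baseline
```

So it squeezes onto a 24GB card only if the trits are packed close to their information-theoretic density (e.g. 5 trits per byte) and you ignore runtime overhead.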
The theoretical capacity of a binary network is 69% of the capacity of a full-weight network, so it makes sense that LLMs would converge to 1-bit networks in the long term.
It's nice to finally see practical networks reach the theoretical limits found in the statistical mechanics of Ising models. A good pointer to efficient 1-bit training, from the statistical mechanics point of view, is here:
These models will be compatible with llama.cpp out of the box. We (GigaML - https://gigaml.com) are planning to train a small model (3-4B, 1-bit, open source) with the latest stack-v2 dataset released today. Let me know if anyone is interested in collaborating with us.
It's funny how discoveries in NLP and computer vision complement each other. The replacement of multiplications by additions made me think of the AdderNet paper (https://arxiv.org/abs/1912.13200), which concluded that you suffer almost no performance drop.
Perhaps the accumulators in current hardware cannot leverage this to its full potential, but combined with such strict quantization, this would open LLMs to the wider ML community much earlier than expected (when consumer hardware allows you to train near-SOTA LLMs from scratch on your own machine).
Prior art:
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
https://arxiv.org/abs/1602.02830
Ternary Neural Networks for Resource-Efficient AI Applications
Also from Microsoft in 2021: Make Every feature Binary: A 135B parameter sparse neural network for massively improved search relevance [1]
[1] https://www.microsoft.com/en-us/research/blog/make-every-fea...
Too bad there seem to be no pretrained models to download. This is not a quantization method to apply to existing models, so the pretrained weights are needed if one wants to test it.
The mathematics of BNNs is sound. The Shannon entropy of a word is really small (I vaguely remember ~2 bits). Also, all neural networks are ridiculously over-provisioned.
I worked on this 7 years ago, trying to efficiently binarize CNNs from existing models. The difficulty was getting training running without the losses going too high. I think that vision models will be much more difficult to binarize, but you might not need to with CLIP if the vision encoder stays in regular math {fp16, int8}.
Powers of 3 don't pack well into binary memory...
A 1 bit multiplier in silicon is a single logic gate, but a ternary decoder to decode a packed tri-state 'weight' is bigger.
I therefore suspect that this method will be extended to make all weights simply 1 or 0 (i.e. binary). Perhaps that will be done by having half the weights take 1 or 0 values, while the other half are -1 or 0.
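For reference, the standard software workaround is to pack 5 trits into one byte (3^5 = 243 ≤ 256), i.e. 1.6 bits per weight. A minimal sketch (my own, not from the paper); the repeated divmods hint at why the decode logic is heavier than a plain bit shift:

```python
import numpy as np

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack groups of 5 weights in {-1, 0, 1} into one byte each (3^5 = 243 <= 256)."""
    digits = (trits.reshape(-1, 5) + 1).astype(np.uint8)    # map {-1, 0, 1} -> {0, 1, 2}
    powers = np.array([81, 27, 9, 3, 1], dtype=np.uint8)    # base-3 place values
    return (digits * powers).sum(axis=1).astype(np.uint8)   # ~1.6 bits per weight

def unpack_trits(packed: np.ndarray) -> np.ndarray:
    """Decode bytes back into weights in {-1, 0, 1}."""
    out = np.empty((packed.size, 5), dtype=np.int8)
    rest = packed.astype(np.int16)
    for i in range(4, -1, -1):                              # repeated divmod: costlier than a bit shift
        rest, out[:, i] = np.divmod(rest, 3)
    return out.reshape(-1) - 1

w = np.array([-1, 0, 1, 1, -1, 0, 0, 1, -1, 1], dtype=np.int8)
assert np.array_equal(unpack_trits(pack_trits(w)), w)
```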
People were already doing this 6 years ago.
https://github.com/yashkant/quantized-nets
https://github.com/TropComplique/trained-ternary-quantization
https://github.com/buaabai/Ternary-Weights-Network
I too find it very interesting. But why this sudden, renewed fuss?
A refreshing paper by machine-learning standards: simple explanation, easy to replicate, no alchemy-tier interpretations. Can't wait to see this paper replicated or disproved when it comes to real-life production tasks.
How does backprop work here? I can't imagine flipping bits of everything upstream of an error is effective.
This really just sounds absurd. How can ternary possibly encode enough information?
Anyone willing to explain it like I'm a Django developer who watched half a Karpathy video?
Interesting return to ternary. Effectively, each weight says only whether it's correlated (+1), uncorrelated (0), or anti-correlated (-1) with the input, and the structure of the network is the actual computation over that information.
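A toy version of that reading (plain NumPy, just to illustrate, not the paper's kernel): the weight only decides whether each input is added, ignored, or subtracted, so the dot product needs no multiplications at all.

```python
import numpy as np

x = np.array([0.7, -1.2, 3.1, 0.4])          # activations (8-bit in the paper; floats here for clarity)
w = np.array([1, 0, -1, 1], dtype=np.int8)   # ternary weights: keep, drop, negate, keep

# Multiplication-free dot product: add the +1 inputs, subtract the -1 inputs, skip the zeros.
acc = x[w == 1].sum() - x[w == -1].sum()

assert np.isclose(acc, x @ w)                # matches the ordinary multiply-accumulate
```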
Is it really so surprising that something like this works given how human brain neurons work? My admittedly basic understanding is that these operate through an all-or-nothing principle for their action potentials (firing): they either fire or they don't, based on whether the input signals reach a certain threshold. So the output is already sort of binary in biological neurons. The inputs are more like continuous values, since they are the sum of many different neurons sending signals into each neuron, but in this paper the activations are 8-bit, not binary/ternary. Can any neuroscientists here comment?
Assuming this is confirmed, what's the impact on training?
Inference is definitely an issue for LLMs right now. But if training were suddenly possible for lone hackers (or maybe smaller companies), it would open up a lot of new possibilities as well.
Triggered by the use of 1-bit to describe a trit.
1-bit LLMs remind me of a random forum post I read about SACD and limitations of the 1-bit DSD audio format. https://www.audiosciencereview.com/forum/index.php?threads/d... Accumulating approximate values in one bit leads to being "constantly overloaded", with any error correction overwriting all of your real signal from the next step. I think this trinary system might leave enough room to avoid this problem.
Damn. Well, I guess I better hurry up and write and publish a paper on the Ternary Neural Network research that I've been doing (part-time) for the last several months, before it all gets scooped.
Sooo, short Nvidia?
What does it mean for future hardware if it's not using floating point matrix multiplication units?
So for the uninitiated (me), does this mean the input is not a float (i.e. is quantized on input), such that all the math can be done with int operations?
This seems almost too good to be true.
Edit: Answering my own question, yes. The details are in the original bitnet paper: https://arxiv.org/abs/2310.11453
How is it a 1-bit LLM if 2 bits are required for each weight (and one of the 4 possible states is wasted in order to represent 0)?
Well, that's 2 bits, but still...
LLMs have gone from 32-bit floating point numbers down to 16 and 8 bit values. Now 2 bits. It's a hint as to how evolution did it. The basic component is simple and has very wide tolerances. There are just a lot of them. That's something biology can evolve.
Looks like we have finally rediscovered a biological neuron.
Would there be value in distinguishing -0 and +0? If a 0 was quantized from a small negative or a small positive, it seems like retaining the sign is better than forgetting it.
The question remains whether the benefit and the simpler design are worth the loss of density.
Shouldn't that be "1-trit"?
Low-bit parameters are always talked about in terms of performance benefits, but I wonder whether allowing the LLM to combine parameters to represent values means it can select the resolution of each value, that is, use a kind of internal scientific notation to track the uncertainty of values. More low-bit parameters combined together means more precision and resolution; fewer can mean more uncertainty. This might allow the LLM to better calibrate the uncertainty of its knowledge in a Bayesian way, preventing the hallucinations that come from the overconfidence of overfitting on too many bits.
How would you use this in something like PyTorch? There's no ternary data type.
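Correct, there isn't. The usual workaround today (a sketch under my own assumptions, not anything shipped with the paper) is to hold the ternary values in an int8 tensor, pack them separately for storage, and up-cast at matmul time:

```python
import torch

# Ternary weights stored as int8 values in {-1, 0, 1}, plus a per-tensor scale.
w_ternary = torch.tensor([[1, 0, -1],
                          [-1, 1, 0]], dtype=torch.int8)
scale = torch.tensor(0.37)        # hypothetical scale produced at quantization time

x = torch.randn(4, 3)             # activations (the paper quantizes these to 8 bits)

# Stock GPU kernels still want a float matmul, so we up-cast on the fly; the memory
# saving comes from how w_ternary is stored (1-2 bits/weight when packed), not from this op.
y = x @ (w_ternary.to(x.dtype).t() * scale)
print(y.shape)                    # torch.Size([4, 2])
```

Custom kernels (or hardware designed around 1.58-bit weights) would skip the up-cast and do the additions directly.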
Maybe a silly question but nonlinearity is important for neural nets. Wouldn't it make more sense for the three values to be e.g. (2, 0, -1) so they are not colinear?
Also, what are the prospects for FPGA implementations of this?
Balanced ternary, my beloved.
Does quantization need to be all or nothing? With the kind of low-bit models we have seen, my assumption would be that only certain weights benefit from the extra precision. A mixture of precisions, with 2-bit, 3-bit, up to 8-bit weights, might perform well, but I am unsure whether any training process could identify the weights that need the extra precision.
This is something that's been tried many times before. 1-bit to 2-bit models and binary NNs have a long history.
How does gradient descent work with these discrete ternary parameters? If you compute the partial derivative for a parameter, how do you determine how to nudge the parameter when updating on backpropagation? Do you only update if the "nudging amount" meets a threshold?
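The earlier BitNet paper (arXiv:2310.11453) keeps full-precision "latent" weights during training, quantizes them on the fly in the forward pass, and passes gradients through the quantizer with a straight-through estimator. A minimal PyTorch sketch of that trick (my own illustration, not the authors' code):

```python
import torch

def ternarize(w: torch.Tensor) -> torch.Tensor:
    """Round latent float weights to {-1, 0, 1}, scaled by their mean magnitude (absmean rounding in the spirit of b1.58)."""
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

w_latent = torch.randn(8, 8, requires_grad=True)     # full-precision master weights kept during training
x = torch.randn(4, 8)

# Straight-through estimator: the forward pass sees ternary weights, the backward pass
# pretends the quantizer was the identity, so gradients flow into w_latent unchanged.
w_q = w_latent + (ternarize(w_latent) - w_latent).detach()

loss = (x @ w_q.t()).pow(2).mean()                   # dummy loss just to drive a backward pass
loss.backward()
print(w_latent.grad.abs().mean())                    # nonzero: updates accumulate in the latent weights
```

So nothing is "nudged" in the discrete weights directly; a weight only flips when its underlying float drifts across a rounding boundary.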
Strictly speaking it should say "1-trit LLM" or, as they later mention, 1.58-bit.
This is exciting news. If the 8B numbers are true, we could already use a model like Mixtral 8x7B, even with a single GPU?
But further into development, we need comparisons at larger model sizes. 70B might be too much to ask, but 13B should be there at least.
There's an interesting mental model I've been toying with. At what point do LLMs just become circuit-shaped NNs with stochastic gradient descent backing them?
E.G. are we just determining the best program by rearranging 1s and 0s?
"Integer arithmetic is all you need"? NVIDIA stock arrow up or down?
What's the benefit of using ternary encoding over just a binary representation? And if we have come so far is there potential for a more efficient algorithm than gradient descent?
How do you train these? Or is it only for already trained models?
The paper talks about LLMs a lot, but would this result hold for all Transformers? Are Ternary Transformers going to make things like Whisper faster/better?
Could there be some value in recognizing areas where the model needs finer grained weights and somehow using a different data type just in certain areas?
Is there any rigorous way to answer the question of how much information (be it entropy or some other measurement) is contained in a model's weights?
OK, can someone catch me up to speed on LLM hardware requirements? Last I looked, I needed a 20GB VRAM card to run a good one. Is that not true anymore?
So are there any details on the algorithms they used for backprop? I'm not seeing any in the paper other than "we used a lot of tokens".
Is there anything about this specific to LLMs, or could you use it for any transformer based model? It seems like they made a modified transformer.
I hope somebody gives this team access to the good data and a lot of crunch, I'd love to see what happens when you train the big fella.
If this turns out to be true, it could indeed be a game changer... Given the advanced AI chip shortage... Also, for the chip ban on China...
I predict Daniel Lemire will build the most efficient training and inferencing systems, close to theoretical performance limits.
What does "perform slightly better than Llama" mean exactly? A model like this needs to be trained from scratch, right?
Wondering if this might have any impact on the use of quantum computers in LLM training/distillation…
Do the implications at a practical level mean that the size of gguf files will become smaller?
If true then I'm guessing this would make ASICs for this far more simple too, right?
When can we expect the first ~100+ million parameter models to run on a Raspberry Pi Pico?
If this paper (especially the results on Table 4) is true, then this is a game changer!
If all the weights are either 1, 0 or -1, isn't this what biological neurons do?
This is great. My employer just gave me an M1 laptop with only 16GB RAM and I had to downgrade my 7B parameter local LLMs to 3-bit quantization; they've been surprisingly okay!
On my personal machine with 64GB RAM, I usually use 8x7B at Q5 or 70B at Q4.
It's Mistral all the way down! Imagining a Q1.58 that's doing well makes me happy.
Any models published as well?
A ternary is all you need.
So we've almost come full circle back to human (animal) brain binary spikes?
Does this mean we can compile LLMs to run on FPGAs directly?
How much of a waste is using NVidia hardware for this?
Can someone versed in the ways of math explain how this is different from previous quantization methods?
And specifically, seeing how going from FP16 to 8-bit mostly gives the same perplexity while anything further seems to lose quality / dumb down the model, how is this even less precise method able to achieve this?
I wonder how the training process works...
Okay wait, can I train my own llm yet?
There are two findings I find shocking in this work:
* In existing LLMs, we can replace all parameter floating-point values representing real numbers with ternary values representing (-1, 0, 1).
* In matrix multiplications (e.g., weights by vectors), we can replace the elementwise products in each dot product (a₁b₁ + a₂b₂ + ...) with elementwise additions (a₁+b₁ + a₂+b₂ + ...), in which the signs depend on each value. See the paper for exact details.
On existing hardware, the gains in compute and memory efficiency are significant, without performance degradation (as tested by the authors).
If the proposed methods are implemented in hardware, we will see even greater gains in compute and memory efficiency.
Wow.