Transformers Can Do Arithmetic with the Right Embeddings

by byt3h3ad on 5/28/2024, 4:39 AM with 211 comments

by vessenes on 5/28/2024, 11:07 AM

Wow, a lot of grumpiness in here. If it's true that adding like 20 or so tokens to encode column location / decimal position triples math performance on out-of-distribution tasks, that's a big deal. It's a simple fix, it improves performance A LOT, and they even indicate it's not just a party trick, in that the LLM can use the information to do better on related tasks like sorting and list making.

This is basically free to add, and there's no reason it shouldn't be made part of standard tokenization.

I'm more interested in the question of how we can find other useful concepts for data -> embedding space like this; can we incept our tokenization inception so it has more inception?
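
To make it concrete, here's a toy sketch of what I understand the "column location" trick to be: every digit gets an extra learned embedding keyed to its position within its number, added on top of the usual token embedding. The names and the left-to-right counting are mine (the paper counts from the least significant digit), so treat this as a cartoon of the idea, not their implementation:

    import re
    import torch.nn as nn

    def digit_position_ids(text):
        # 0 for non-digits; 1 + offset within its run of digits otherwise.
        ids = [0] * len(text)
        for m in re.finditer(r"\d+", text):
            for offset, i in enumerate(range(m.start(), m.end())):
                ids[i] = offset + 1
        return ids

    class EmbeddingWithColumns(nn.Module):
        def __init__(self, vocab_size, d_model, max_digits=100):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.col = nn.Embedding(max_digits + 1, d_model)  # the cheap extra table

        def forward(self, token_ids, column_ids):
            return self.tok(token_ids) + self.col(column_ids)

    print(digit_position_ids("123+456="))  # [1, 2, 3, 0, 1, 2, 3, 0]

The striking part is how little machinery that is: one small extra embedding table plus a trivial indexing rule at tokenization time.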

by zacksiri on 5/28/2024, 2:46 PM

I think the problem here is that 'understanding' is not the same as curve fitting.

If all one is doing is giving a model lots of data and fitting curves, that's not really 'understanding': the model is brute-forcing its way (with gradient descent), storing the weights, and finally approximating the solution when a query is passed in.

This is not the same as understanding. Human intelligence can operate deterministically as well as non-deterministically. We can listen to language, which is by its nature non-deterministic, and convert that into deterministic operations and vice versa, i.e. we can operate on some logic and explain it in multiple ways to other people.

Understanding requires much less data than brute-forcing your way into pattern recognition.

When you see a simple expression like 2 * 4, you are able to understand that it's equivalent to 2 + 2 + 2 + 2, which in turn means 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 <- count that and you've got your answer.

Because you 'understand' this basic concept and all the operations in between, you are able to compute more examples. But you only need to understand it once. Once you understand multiplication and addition and all the tricks in between, you are able to compute 23 * 10 without being fed 23 * 10 as prior data. Understanding is very different from fitting a curve. You can reach conclusions and understanding through pattern recognition, but it's important to differentiate 'approximation' from 'calculation'. If you understand something in its entirety, you should be able to calculate an outcome deterministically.

Right now LLMs lack 'understanding' and seem to only 'approximate', which may look like 'understanding' but is actually not.

by msoad on 5/28/2024, 7:22 AM

It seems like a hack, to be honest. The problem at hand is not to make transformers add 100-digit numbers; the problem is that current systems can't reason about things, math included.

Optimizing for a certain use case is not gonna take us where we wanna be. We want to have a system that can learn to reason.

by Havoc on 5/28/2024, 10:28 AM

For things like this, where we have computationally cheap, well-understood, reliable tools available (a.k.a. a calculator), it seems better to train the model in tool use.

I guess perhaps the techniques could be generalized though?

by teleforce on 5/28/2024, 9:38 AM

I think understanding mathematics is what LLMs really need at the moment, far more important than video generation, which is just another form of CGI [1]. After deep learning and the transformer, understanding mathematics and its proofs, not just arithmetic, will be the next game changer for LLMs and a turning point for humanity.

[1] Why LLMs like ChatGPT and Google Bard are bad at math:

https://www.xda-developers.com/why-llms-are-bad-at-math/

by jiggawatts on 5/28/2024, 8:18 AM

Something I've been thinking about is how the Minds -- the super-human AI hyper-computers that fly the ships in the Culture series of novels -- are described. The image built up in my head[1] is that they're hybrids blending neural networks and regular compute substrates. They can calculate, simulate, and reason in combination.

There have been crude attempts at this already, hooking Mathematica and Python into ChatGPT. I say crude, because these add-ons are controlled via output tokens.

What I would like to see is a GPT-style AI that also has compute blocks, not just transformer blocks. I don't mean compute in the sense of "matrix multiply for weights and biases", but literally an ALU-style block of basic maths operations available for use by the neurons.

One thought I had was that this could work via activations that carry both a floating-point activation value and "baggage" such as a numerical value from the input -- like a token in a traditional parser, which can represent a constant string or an integer along with its decoded value.

The newer, truly multi-modal models gave me a related idea: Just like how they can have "image" tokens and "audio" tokens, I wonder if they could be given "numeric data" tokens or "math symbol" tokens. Not in the same way that they're given mixed-language text tokens, but dedicated tokens that are fed into both the transformer blocks and also into ALU blocks.
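
To make the idle thought slightly less idle, here's a toy sketch of what an ALU block could look like: each token carries a scalar "register" alongside its hidden state, attention soft-selects an operand from another token, and a learned gate soft-selects among a few hard-wired operations. This is entirely my own construction, not anything from the paper:

    import torch
    import torch.nn as nn

    class ALUBlock(nn.Module):
        def __init__(self, d_model):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.op_gate = nn.Linear(d_model, 4)  # add, sub, mul, keep

        def forward(self, h, reg):
            # h: (batch, seq, d_model) hidden states; reg: (batch, seq) registers
            scores = self.q(h) @ self.k(h).transpose(1, 2) / h.shape[-1] ** 0.5
            attn = torch.softmax(scores, dim=-1)
            operand = (attn @ reg.unsqueeze(-1)).squeeze(-1)  # soft-picked value
            ops = torch.stack([reg + operand, reg - operand, reg * operand, reg], dim=-1)
            gate = torch.softmax(self.op_gate(h), dim=-1)     # which op to apply
            return (gate * ops).sum(-1)                       # updated registers

    block = ALUBlock(d_model=16)
    h, reg = torch.randn(1, 5, 16), torch.tensor([[2.0, 3.0, 0.0, 0.0, 0.0]])
    print(block(h, reg).shape)  # torch.Size([1, 5])

The soft selection keeps everything differentiable; getting exact, discrete arithmetic out of it would presumably need hard routing or a straight-through trick at inference time.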

Just an idle thought...

[1] Every reader reads into a story something unique, which may or may not align with what the author intended. This is my understanding, coloured by my own knowledge, etc, etc...

by torginus on 5/28/2024, 10:15 AM

I just wonder whether, if numbers were written right to left, LLMs would be much better at arithmetic. You can 'predict' the least significant digit by reusing the digits already written in the computation, but to generate the most significant ones you generally need to do the entire computation in one go.
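
A toy illustration of why the direction matters: emitted least-significant-first, each output digit needs only the current digit pair and a carry, which is exactly the kind of local rule next-token prediction can express:

    def add_lsb_first(a: str, b: str) -> str:
        """Add two decimal strings, emitting digits least significant first."""
        a, b = a[::-1], b[::-1]  # inputs arrive in normal (MSB-first) notation
        out, carry = [], 0
        for i in range(max(len(a), len(b))):
            da = int(a[i]) if i < len(a) else 0
            db = int(b[i]) if i < len(b) else 0
            carry, digit = divmod(da + db + carry, 10)
            out.append(str(digit))  # each digit comes from local information only
        if carry:
            out.append(str(carry))
        return "".join(out)

    print(add_lsb_first("739", "486"))  # '5221', i.e. 1225 written LSB-first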

by pmayrgundter on 5/28/2024, 11:16 AM

I'm curious about the framing of research like this.. "The poor performance of transformers on arithmetic tasks" (relative to what?) and how that informs the adjacent conversation on progress towards AGI.

Some say AGI has already been achieved, others that it's years or decades away. When I dig into the disagreement, it often partially depends on one's perspective on how competent humans are at the tasks in question, with the optimists being, I think, more realistic about the variance in human intelligence, and the pessimists seeming to reserve the term "general intelligence" for a nearly perfect suite of capabilities that many otherwise intelligent people practically don't have.

For example with arithmetic, this study cites another [Dziri et al. 2023], that says:

"For instance, humans can solve 3-digit by 3-digit multiplication arithmetic after learning basic calculation rules. Yet, off-the-shelf ChatGPT and GPT4 achieve only 55% and 59% accuracies on this task, respectively."

But this isn't the case.. 5-6% of the population have https://en.wikipedia.org/wiki/Dyscalculia, but can be otherwise normal.

I still see value in normative statements about human capability in AI & AGI research, but I think we'll need to move towards explicit statistical framing.

DeepMind's Position paper "Levels of AGI for Operationalizing Progress on the Path to AGI" has a schema like this, where AGI capabilities are defined across 2 axes of Performance level X Generality (narrow vs general), and the Performance levels are measured by comparison with "Percentile of skilled adults" able to perform the task.. https://arxiv.org/pdf/2311.02462#page=3.40

Within that framing, this paper's title or result might be "Achieving AGI Competency in Arithmetic", or "Expertise", or "Virtuosity", i.e. on par respectively with 50th, 90th or 99th percentile of skilled adults.

by infogulch on 5/28/2024, 7:04 AM

The other day I was wondering if LLMs are bad at maths because they don't have readily apparent access to the concept of "columns". Apparently the answer is yes.

Vertical alignment across lines is pretty important for humans to learn operations on digits, but the way we encode lines with a \n separator doesn't really help. In a recent Code Bullet video, GPT really struggled with any kind of vertical alignment task. I wonder if it would do better with a fixed 80-column width...
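
A quick toy check of that intuition: with fixed-width lines, "the digit one row up" is always the same offset back in the flat character stream, so it's at least learnable; with ragged lines the offset changes from row to row:

    rows = ["  739", "+ 486", "-----"]
    flat = "\n".join(rows)
    width = len(rows[0]) + 1  # +1 for the newline separator
    col = 4                   # the units column
    print([flat[r * width + col] for r in range(len(rows))])  # ['9', '6', '-']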

by topherjaynes on 5/28/2024, 6:44 PM

I went through the paper and immediately wondered how they implemented it; I missed that they published their code as well. Here is the link for everyone who skimmed past it: https://github.com/mcleish7/arithmetic/tree/main

by nerdponx on 5/28/2024, 8:06 PM

I'd like to see more focus on the input embeddings.

It's basically the same as feature engineering in pre-deep-learning machine learning: constructing features with high information content can significantly reduce the amount of data and computation needed to fit a useful model. And sometimes it's impossible to fit a useful model without careful feature engineering, either because the model itself is constrained in some way, or because there isn't enough data, or both.

It's analogous to making a choice of inductive bias within the model itself. We literally could not do LLMs without the carefully-constructed transformer architecture. Why should we expect to make further progress without paying more attention to the embeddings?

by Shrezzing on 5/28/2024, 9:44 AM

Since models are very good at writing very short computer programs, and computer programs are very good at mathematical calculations, would it not be considerably more efficient to train them to recognise a "what is x + y" type problem, and respond with the answer to "write and execute a small JavaScript program to calculate x + y, then share the result"?
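
Something like this toy dispatch, say, with Python standing in for the JavaScript step and a regex standing in for the model's recognition of the problem shape (a real system would have the model make that call):

    import ast
    import operator
    import re

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def calc(node):
        # Evaluate a parsed expression, allowing only plain arithmetic.
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](calc(node.left), calc(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not plain arithmetic")

    def answer(question: str):
        m = re.search(r"what is (.+?)\?*$", question, re.IGNORECASE)
        if m:
            return calc(ast.parse(m.group(1), mode="eval").body)
        return None  # fall back to the plain language model

    print(answer("What is 123456789 * 987654321?"))  # exact, every time

The appeal is that the failure mode changes: either the system declines to route the question, or the answer is exact.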

by kjhcvkek77 on 5/28/2024, 9:40 AM

Very cool that it was able to generalise from small numbers to larger ones with such high accuracy.

by skyde on 5/28/2024, 4:38 PM

Why not apply the same concept every time a word is split into more than one token?

Basically, if a word contains a prefix, suffix, or root word, we could encode each token's position relative to the start of the word in the embedding.
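
As a toy version of that suggestion (the "##" continuation convention and the splits are made up for illustration):

    def within_word_positions(subwords):
        # Index each subword token relative to the start of its word.
        positions, pos = [], 0
        for piece in subwords:
            if not piece.startswith("##"):  # BERT-style continuation marker
                pos = 0                     # a new word starts here
            positions.append(pos)
            pos += 1
        return positions

    tokens = ["un", "##break", "##able", "and", "re", "##think"]
    print(within_word_positions(tokens))  # [0, 1, 2, 0, 0, 1]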

by michaelnny on 5/28/2024, 7:20 AM

I think the main problem is the way we turn raw mathematical symbols or equations into tokens, and this suboptimal tokenization may decrease performance.

by wantsanagent on 5/28/2024, 8:09 PM

I like these kinds of fixes. It's like realizing your child has vision problems and getting them glasses.

by winddude on 5/28/2024, 8:38 PM

But a calculator wouldn't be very good if it's only correct 99% of the time for arithmetic...

by CyberDildonics on 5/29/2024, 12:24 AM

I'm pretty sure getting computers to do arithmetic is not a giant hurdle.

by YeGoblynQueenne on 5/28/2024, 7:55 AM

What is the point of this work? 99% on 100-digit arithmetic means there's a 0% chance anyone will ever use a Transformer as an ALU or anything of the kind. We already know how to hard-code a (literally) infinitely more accurate addition machine.

And not only addition: all four arithmetic operations. The technique proposed in the article - imposing a strong inductive bias for addition - kind of works for multiplication, but not for subtraction or division (clearly; I can't even find the words in the paper). As a practical way to build a machine to do arithmetic, this is out of the question.

We've known how to mechanise arithmetic since the 1640s, with Blaise Pascal and his Pascaline. What is the point in demonstrating that it's possible to reinvent a broken, partial, buggy version of an arithmetic machine if one tries really hard and shoehorns the necessary patterns into a neural net? We've known that for a long time, too (every proof that a neural net can simulate this or that Turing machine if you design the network diagram and set the weights by hand, ever).

So what is the point of this? Transformers are supposed to be the "sparks of AGI" and they can almost do arithmetic if we try very hard to shove it down their throats? Who cares?

by r2_pilot on 5/28/2024, 11:48 AM

Meanwhile I'm over here using Claude 3 Opus to do trig and calculus problems as well as generate the LaTeX representation of the equations. It doesn't need to be 100% accurate in my case (purely for fun), but I follow its reasoning and it's pretty consistent, at least enough for orders-of-magnitude and first-order effects. I was gonna post some of the chats about physics, but probably nobody cares.

by gmerc on 5/28/2024, 9:50 AM

That's great, 99% math is absolutely good enough for real world problems /s