In case it's not blindingly obvious: Groq is a hardware company that has built chips designed around serving machine learning models, particularly targeted at LLMs. So the quality of the responses isn't really what we're looking at here; we're looking at speed, i.e. tokens per second.
I actually have a final-round interview with a subsidiary of Groq coming up, and I'm very undecided as to whether to pursue it, so this felt extraordinarily serendipitous to me. Food for thought here.
Is there any plan to show what this hardware can do for Mixtral-8x7B-Instruct? Based on the leaderboards[0], it is a better model than Llama2-70B, and I’m sure the T/s would be crazy high.
[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
I can't wait until LLMs are fast enough that a single response can actually be a whole tree-of-thought/review process before giving you an answer, yet still fast enough that you don't even notice.
It’s very fast at telling me it can’t tell me things!
I asked about creating illicit substances — an obvious (and reasonable) target for censorship. And, admirably, it suggested getting help instead. That’s fine.
But I asked for a poem about pumping gas in the style of Charles Bukowski, and it moaned that I shouldn’t ask for such mean-spirited, rude things. It wouldn’t dare create such a travesty.
I saw this in person back in September.
Really impressed by their hardware.
I'm still wondering why the uptake is so slow. My understanding from their presentations was that it was relatively simple to compile a model. Why isn't it talked about more? And why not demo Mixtral or showcase multiple models?
This was surprisingly fast, 276.27 T/s (although Llama 2 70B is noticeably worse than GPT-4 Turbo). I'm actually curious whether there are good benchmarks for inference tokens per second. I imagine it's a bit different for throughput vs. single-request optimization, but I'm curious if there's an analysis of this somewhere.
edit: I re-ran the same prompt on Perplexity's llama-2-70b and got 59 tokens per second there.
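For anyone wanting to reproduce that kind of comparison, here's a minimal sketch of measuring single-request output tokens per second against an OpenAI-compatible streaming (SSE) endpoint. The URL, API key, and model name are placeholders, not anything Groq or Perplexity document, and counting one token per streamed content chunk is only an approximation.

```python
# Rough, single-request tokens-per-second measurement against an
# OpenAI-compatible streaming chat endpoint (SSE). The URL, API key,
# and model name are placeholders; one token per streamed content
# chunk is an approximation (chunks usually, but not always, carry
# exactly one token).
import json
import time

import requests

API_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint
API_KEY = "sk-..."                                    # placeholder key

payload = {
    "model": "llama2-70b",  # placeholder model name
    "messages": [{"role": "user", "content": "Write a haiku about fast chips."}],
    "stream": True,
}

start = time.time()
first_token_time = None
num_chunks = 0

with requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # SSE payload lines look like: data: {...json chunk...}
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"].get("content")
        if delta:
            if first_token_time is None:
                first_token_time = time.time()
            num_chunks += 1

end = time.time()
if first_token_time is None:
    print("no tokens received")
else:
    decode_time = max(end - first_token_time, 1e-6)  # avoid divide-by-zero
    print(f"time to first token: {first_token_time - start:.2f}s")
    print(f"~{num_chunks / decode_time:.1f} tokens/s over {num_chunks} chunks")
```

Note that this measures single-request decode speed, which is the number Groq is showing off; aggregate throughput with batching is a different (and usually much higher) figure, so the two shouldn't be compared directly.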
The point isn't that they are running Llama2-70B. The point is that they are running Llama2-70B faster than anyone else so far.
Yeah, it’s fast but almost always wrong. I asked it a few things (recipes, trivia, etc.) and it completely made up the answers. These things don’t really know how to say “I don’t know” and pretend to know everything.
There was a good talk at HC34 (Hot Chips 34) about the accelerator Groq was working on at the time. I’m just a lay observer, so I don’t know how much of that architecture maps to this new product, but it gives some insight into their thinking and design.
This isn't running on one chip. It's running on 128, or two racks' worth of their kit. https://news.ycombinator.com/item?id=38739106
This doesn't mean much without comparing dollars or watts against GPU equivalents.
The interface is weird. If it’s that fast, you don’t need to stream the response and fuck with the scroll bar while the user has just started reading it.
You may as well wait for the whole response and render it, or render a paragraph at a time.
Don’t jiggle the UI while rendering.
More info about Groq: https://groq.com/lpu-inference-engine/
How is it so fast? Anyone know what they are doing differently?
Very impressive! It's faster than some "dumb" apps doing plain old database fetches.
But what are these LPUs optimized for: tensor operations (like Google's TPUs) or LLMs/Transformers architecture?
If it is the latter, how would they/their clients adapt if a new (improved) architecture hits the market?
I asked "How up to date is your information about the world?"
It said December 2022, but the answer to another question was not correct for that time or for now. It also went into some kind of repeating loop up to its maximum response length.
Still pretty cool that our standards for chat programs have risen.
The censorship levels are off the charts. I am at a basketball game with my wife, who is ethnically Chinese. I asked for an image of a Chinese woman dunking a basketball. I was told that not only is this inappropriate, it's also unrealistic and objectifying.
Another censored and boring Google reader. It lied to me twice in 4 prompts and was forced to apologise when called out. Am I wrong in thinking that the first company to develop an unfiltered and genuine intelligence is going to win this AI game?
As someone who is totally clueless, I can see it's faster than ChatGPT at responding to the same question.
What are some relevant speed metrics? Output tokens per second? How about the number of input tokens -- does that matter, and how does it factor in?
Great work! This is the fastest inference I have ever seen of any truly large language model (>=70b parameters).
Just FYI, you might want to fix autocorrect on iOS; your textbox seems to suppress it (at least for me).
That's really fast. But it mostly seems to be because they made a custom chip. I want to see an LLM that is so highly optimized that it runs at this speed on more normal hardware.
Can someone explain the hardware differences for training vs. inference? I believe Nvidia is still the leader in training?
Minor point: something about the HTML input is causing iOS’s auto correct to be disabled; making input very frustrating!
Incredibly fast. I wonder if they've released verification that it matches Llama 2 70B on regular hardware?
Really impressive! They missed the chance to market this as "BLAZINGLY fast inference"
If only I could read that fast!
How is this different from Nvidia? Higher bandwidth?
Is the TSP a RISC-V core on an FPGA? The tweet mentions Haskell, which sounds familiar (Bluespec or something).
Or is it a completely custom ASIC?
It generates code really really fast.
Signing up was a mistake. I am now condemned to use this in incognito.
Wow!
I posed this question to GPT-4 and Groq:
"I am building an api in spring boot that persists users documents. This would be for an hr system. There are folders, and documents, which might have very sensitive data. I will need somewhere to store metadata about those documents. I was thinking of using postgres for the emtadata, and s3 for the actual documents. Any better ideas? or off the shelf libraries for this?"
Both were at about parity, except Groq suggested using the Spring Cloud Storage library, which GPT-4 did not suggest. It turns out that library might be great for my use case. I think OpenAI's days are numbered; the pressure on them to release the next-gen model is very high.
Not only that, but GPT-4 is quite slow, often times out, etc. These responses are so much faster, which really does matter.
Lots of comments talking about the model itself. This is Llama 2 70B, a model that has been around for a while now, so we're not seeing anything in terms of model quality (or model flaws) we haven't seen before.
What's interesting about this demo is the speed at which it is running, which demonstrates the "Groq LPU™ Inference Engine".
That's explained here: https://groq.com/lpu-inference-engine/
> This is the world’s first Language Processing Unit™ Inference Engine, purpose-built for inference performance and precision. How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user.
I think the LPU is a custom hardware chip, though the page talking about it doesn't make that as clear as it could.
https://groq.com/products/ makes it a bit more clear - there's a custom chip, "GroqChip™ Processor".