Everybody has a theory about why Nvidia dropped $20B on Groq - they're mostly wrong
This summer, AI chip startup Groq raised $750 million at a valuation of $6.9 billion. Just three months later, Nvidia celebrated the holidays by dropping nearly three times that valuation to license its technology and squirrel away its talent.
In the days since, the armchair AI gurus of the web have been speculating wildly about how Nvidia can justify spending $20 billion to get Groq's tech and people.
Pundits believe Nvidia knows something we don't. Theories run the gamut from Nvidia ditching HBM for SRAM, to a play to secure additional foundry capacity from Samsung, to an attempt to quash a potential competitor. Some hold water better than others, and we certainly have a few of our own.
What we know so far
Nvidia paid $20 billion to non-exclusively license Groq's intellectual property, which includes its language processing units (LPUs) and accompanying software libraries.
Groq's LPUs form the foundation of its high-performance inference-as-a-service offering, which it will keep and continue to operate without interruption after the deal closes.
The arrangement is clearly engineered to avoid regulatory scrutiny. Nvidia isn't buying Groq, it's licensing its tech. Except… it's totally buying Groq.
How else to describe a deal that sees Groq’s CEO Jonathan Ross and president Sunny Madra move to Nvidia, along with most of its engineering talent?
Sure, Groq is technically sticking around as an independent company with Simon Edwards at the helm as its new CEO, but with much of its talent gone, it's hard to see how the chip startup survives long-term.
The argument that Nvidia just wiped a competitor off the board therefore works. Whether that move was worth $20 billion is another matter, given it could provoke an antitrust lawsuit.
It must be for the SRAM, right?
One prominent theory about Nvidia’s motives is that Groq’s LPUs use static random access memory (SRAM), which is orders of magnitude faster than the high-bandwidth memory (HBM) found in GPUs today.
A single HBM3e stack can deliver about 1 TB/s of memory bandwidth, which works out to roughly 8 TB/s across the eight stacks on a modern GPU. The SRAM in Groq's LPUs can be 10 to 80 times faster.
Since large language model (LLM) inference is predominantly bound by memory bandwidth, Groq can achieve stupendously fast token generation rates. Running Llama 3.3 70B, the benchmarkers at Artificial Analysis report that Groq's chips can churn out 350 tok/s. Performance is even better on mixture-of-experts models like gpt-oss 120B, where the chips managed 465 tok/s.
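To see why bandwidth is the limiter, here's a rough sketch of the napkin math (our illustrative numbers and function, not Groq's or Nvidia's): during decode, every generated token requires streaming the model's weights through the chip once, so the ceiling on single-stream tokens per second is roughly bandwidth divided by weight bytes.

```python
# Back-of-the-envelope decode ceiling: each new token requires streaming every
# weight once, so tokens/sec is capped at bandwidth / bytes-per-token.
# Illustrative only - ignores batching, KV-cache traffic, and MoE sparsity,
# all of which change real-world results considerably.

def decode_ceiling_tok_per_s(params_billion: float, bytes_per_param: float,
                             bandwidth_tb_per_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return (bandwidth_tb_per_s * 1e12) / weight_bytes

# A 70B model at 8-bit on a single ~8 TB/s HBM3e GPU:
print(decode_ceiling_tok_per_s(70, 1, 8))    # ~114 tok/s, single stream
# The same model sharded across SRAM with, say, 80x the aggregate bandwidth:
print(decode_ceiling_tok_per_s(70, 1, 640))  # ~9,140 tok/s upper bound
```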
We're also in the middle of a global memory shortage and demand for HBM has never been higher. So, we understand why some might look at this deal and think Groq could help Nvidia cope with the looming memory crunch.
The simplest answer is often the right one – just not this time.
Sorry to have to tell you this, but there's nothing special about SRAM. It's in basically every modern processor, including Nvidia's chips.
SRAM also has a pretty glaring downside: it's not exactly what you'd call space-efficient. We're talking, at most, a few hundred megabytes per chip, compared with 36 GB for a single 12-high HBM3e stack, or 288 GB across a modern GPU.
Groq's LPUs have just 230 MB of SRAM each, which means you need hundreds or even thousands of them just to run a modest LLM. At 16-bit precision, a 70-billion-parameter model like Llama 3.3 70B requires 140 GB of memory for the weights alone, plus roughly another 40 GB of KV cache for every 128,000-token sequence.
Groq needed 574 LPUs stitched together using a high-speed interconnect fabric to run Llama 70B.
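For a sense of scale, here's the chip-count arithmetic, assuming Groq's published 230 MB per LPU and counting only the space needed to hold the weights:

```python
import math

SRAM_PER_LPU_GB = 0.230  # Groq's stated on-chip SRAM per LPU

def lpus_to_hold_weights(params_billion: float, bytes_per_param: float) -> int:
    # One billion parameters at one byte apiece is roughly 1 GB of weights.
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb / SRAM_PER_LPU_GB)

print(lpus_to_hold_weights(70, 2))  # 16-bit weights: ~609 LPUs for storage alone
print(lpus_to_hold_weights(70, 1))  # 8-bit weights: ~305 LPUs
# Real deployments also need room for activations and KV cache, which is why
# Groq's 574-LPU Llama 70B setup doesn't map neatly onto either figure.
```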
You can get around this by building a bigger chip (each of Cerebras' WSE-3 wafers packs more than 40 GB of SRAM), but those chips are the size of a dinner plate and draw 23 kilowatts. In any case, Groq hasn't gone this route.
Suffice it to say, if Nvidia wanted to make a chip that uses SRAM instead of HBM, it didn't need to buy Groq to do it.
Going with the data flow
So, what did Nvidia throw money at Groq for?
Our best guess is that it was really for Groq's "assembly line architecture." This is essentially a programmable data flow design built with the express purpose of accelerating the linear algebra that dominates inference.
Most processors today use a von Neumann architecture: instructions are fetched from memory, decoded, and executed, with the results written back to a register or to memory. Modern implementations add tricks like branch prediction, but the principles are largely the same.
Data flow works on a different principle. Rather than a bunch of load-store operations, data flow architectures essentially process data as it's streamed through the chip.
As Groq explains it, these data conveyor belts "move instructions and data between the chip's SIMD (single instruction/multiple data) function units."
"At each step of the assembly process, the function unit receives instructions via the conveyor belt. The instructions inform the function unit where it should go to get the input data (which conveyor belt), which function it should perform with that data, and where it should place the output data."
According to Groq, this architecture effectively eliminates bottlenecks that bog down GPUs, as it means the LPU is never waiting for memory or compute to catch up.
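A toy analogy in code (ours, not Groq's ISA or compiler) may help: in a data flow pipeline, each stage consumes inputs as they stream past and hands results straight to the next stage, rather than parking intermediate values in main memory between load-store steps.

```python
# Toy illustration of the data flow idea - not Groq's hardware or toolchain.
# Each stage plays the role of a "function unit" fed by a conveyor belt (here, a
# Python generator); results stream straight into the next stage instead of
# round-tripping through main memory the way a load-store architecture would.

def matmul_stage(rows, weights):
    cols = list(zip(*weights))                # pre-transpose the weight matrix
    for row in rows:                          # operands arrive on the belt
        yield [sum(a * b for a, b in zip(row, col)) for col in cols]

def relu_stage(rows):
    for row in rows:                          # consumes the previous stage's output
        yield [max(0.0, x) for x in row]

inputs = [[1.0, -2.0], [3.0, 4.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
for out in relu_stage(matmul_stage(inputs, weights)):
    print(out)                                # [1.0, 0.0] then [3.0, 4.0]
```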
Groq can make this happen both within a single LPU and between them, which is good news as Groq's LPUs aren't that potent on their own. On paper, their BF16 performance is roughly on par with an RTX 3090, and their INT8 throughput with an L40S. But remember, that's peak FLOPS under ideal circumstances. In theory, data flow architectures should be able to achieve better real-world performance for the same amount of power.
It's worth pointing out that data flow architectures aren't restricted to SRAM-centric designs. For example, NextSilicon's data flow architecture uses HBM. Groq opted for an SRAM-only design because it kept things simple, but there's no reason Nvidia couldn't build a data flow accelerator based on Groq's IP using SRAM, HBM, or GDDR.
So, if data flow is so much better, why isn't it more common? Because it's a royal pain to get right. But, Groq has managed to make it work, at least for inference.
And, as Ai2's Tim Dettmers recently put it, chipmakers like Nvidia are quickly running out of levers they can pull to juice chip performance. Data flow gives Nvidia new techniques to apply as it seeks extra speed, and the deal with Groq means Jensen Huang’s company is in a better position to commercialize it.
An inference-optimized compute stack?
Groq also provides Nvidia with an inference-optimized compute architecture, something Nvidia has been sorely lacking. Where it fits, though, is a bit of a mystery.
Most of Nvidia's "inference-optimized" chips, like the H200 or B300, aren't fundamentally different from their "mainstream" siblings. In fact, the only real difference between the H100 and H200 was that the latter used faster, higher-capacity HBM3e, which just happens to benefit inference-heavy workloads.
As a reminder, LLM inference can be broken into two stages: the compute-heavy prefill stage, during which the prompt is processed, and the memory-bandwidth-intensive decode phase, during which the model generates output tokens.
That's changing with Nvidia's Rubin generation of chips in 2026. Announced back in September, the Rubin CPX is designed specifically to accelerate the compute-intensive prefill phase of the inference pipeline, freeing up its HBM-packed Vera Rubin superchips to handle decode.
This disaggregated architecture minimizes resource contention and helps to improve utilization and throughput.
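In rough terms, disaggregation looks something like the sketch below (a conceptual outline of the idea, not Nvidia's actual serving stack or APIs): prompts hit compute-dense prefill hardware, and the resulting state is handed off to bandwidth-rich decode hardware.

```python
# Conceptual sketch of disaggregated serving - not Nvidia's scheduler or APIs.
# Prefill runs on compute-dense accelerators (think Rubin CPX); the KV cache it
# produces is shipped to bandwidth-rich accelerators (Vera Rubin) for decode.

from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache: bytes      # stand-in for the attention state handed between pools
    first_token: str

def prefill(prompt: str) -> PrefillResult:
    # Compute-bound: the whole prompt is processed in parallel.
    return PrefillResult(kv_cache=b"\x00" * len(prompt), first_token="The")

def decode(state: PrefillResult, max_new_tokens: int = 4) -> list[str]:
    # Bandwidth-bound: tokens are generated one at a time from the cached state.
    return [state.first_token] + [f"<tok{i}>" for i in range(1, max_new_tokens)]

print(decode(prefill("Why did Nvidia license Groq's IP?")))
```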
Groq's LPUs are optimized for inference by design, but they don't have enough SRAM to make for a very good decode accelerator. They could, however, be interesting as a speculative decoding part.
If you're not familiar, speculative decoding is a technique in which a small "draft" model guesses the next several tokens and the larger model then verifies those guesses in a single pass. When the guesses are accepted, system performance can double or triple, driving down cost per token.
These draft models are generally quite small, often a few billion parameters at most, which makes Groq's existing chip designs a plausible fit for the job.
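For the curious, the core loop looks roughly like this (a simplified, greedy-acceptance sketch with stand-in models, not any production implementation):

```python
# Simplified speculative decoding with greedy acceptance - stand-in models only.
# The cheap draft model proposes k tokens; the expensive target model checks
# them (in a real system, in one batched forward pass) and keeps the longest
# prefix it agrees with, so each target step can yield several tokens.

def speculative_step(draft_next, target_next, context, k=4):
    proposed, ctx = [], list(context)
    for _ in range(k):                       # draft proposes k tokens cheaply
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposed:                     # target verifies the proposals
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    if len(accepted) < k:                    # always emit at least one real token
        accepted.append(target_next(ctx))
    return accepted

# Toy models that agree most of the time:
target = lambda ctx: str(len(ctx) % 5)
draft = lambda ctx: str(len(ctx) % 5) if len(ctx) % 7 else "?"
print(speculative_step(draft, target, ["<s>"]))  # several tokens per target step
```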
Do we need a dedicated accelerator for speculative decoding? Sure, why not. Is it worth $20 billion? Depends on how you measure it. Compared with publicly traded companies whose total valuation is around $20 billion, like HP, Inc., or Figma, it may seem steep. But for Nvidia, $20 billion is a relatively affordable amount – it recorded $23 billion in cash flow from operations last quarter alone. In the end, it means more chips and accessories for Nvidia to sell.
What about foundry diversification?
Perhaps the least likely take we've seen is the suggestion that Groq somehow opens up additional foundry capacity for Nvidia.
Groq currently uses GlobalFoundries to make its chips, and plans to build its next-gen parts on Samsung's 4 nm process tech. Nvidia, by comparison, does nearly all of its fabrication at TSMC and is heavily reliant on the Taiwanese giant’s advanced packaging tech.
The problem with this theory is that it doesn't actually make any sense. It's not like Nvidia can't go to Samsung to fab its chips. In fact, Nvidia has fabbed chips at Samsung before: the Korean giant made most of Nvidia's Ampere-generation products. Nvidia needed TSMC's advanced packaging tech for some parts like the A100, but it doesn't need the Taiwanese company to make Rubin CPX. Samsung or Intel could probably do the job.
Porting designs to a new foundry takes time, though, and licensing Groq's IP and hiring its team doesn't change that.
The reality is Nvidia may not do anything with Groq's current generation of LPUs. Jensen might just be playing the long game, as he's been known to do. ®