
#4 - Mamba, the Model Everyone's Talking About

Could State Space Models (SSMs) be the "Transformer Killer"?

Welcome to MAJOR.MINOR.PATCH. 

Each edition, I cover one update in the world of computing and what it means for engineers, entrepreneurs and investors.

This week: State Space Models (SSMs), the hot new family of AI models…

When a new model architecture picks up steam, investors and entrepreneurs should take notice.

In Image Generation, Generative Adversarial Networks (GANs) used to be the state-of-the-art. Then Diffusion models came along which were more stable, versatile, and performant.

Diffusion models made it possible for products like Midjourney and Adobe Firefly to emerge.

In Natural Language Processing, Recurrent Neural Networks (RNNs) used to be the state-of-the-art. Then Google created “Transformers”, which can be trained much faster than RNNs and generate much better outputs for long sequences due to their ground-breaking “self-attention” mechanism.

Transformers made it possible for products like ChatGPT and Github Copilot to emerge.

Now we have State Space Models (SSMs). They have the potential to supplant Transformers, push the field of AI forward once more, and create space for even more jaw-dropping AI products.

Their performance is comparable to Transformers, even at extreme sequence lengths like one million tokens. They’re also impressively fast: reportedly around 5x higher inference throughput than comparable Transformer models.

So I hope this is you right now:

[Meme: "Leaning Forward In Chair", via Know Your Meme]

Software Engineers (like me) like to grade algorithms based on how they scale — how well they handle bigger and bigger inputs.

We measure Time Complexity and Space Complexity. Time Complexity describes how an algorithm’s running time grows as the input grows. Space Complexity describes how the memory it needs grows as the input grows.

[Image from "Understanding Time and Space Complexity of Common JavaScript Built-in Methods" by Kyle Le, JavaScript in Plain English]
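If Big-O notation is new to you, here’s a toy example of my own (not from any of the sources): two functions that solve the same problem with opposite time/space trade-offs.

```python
# Toy example: checking a list of n items for duplicates, two ways.

def has_duplicates_quadratic(items):
    # Time: O(n^2) -- compares every pair. Space: O(1) -- no extra storage.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items):
    # Time: O(n) -- one pass over the list. Space: O(n) -- the set may grow to n entries.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```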

The main problem with Transformers is that they don’t scale particularly well.

When you prompt a Transformer model, it looks at every token in your input, computes that token’s relevance to every other token, and stores all of those scores in memory.

This means Transformers have a quadratic, O(n^2) time and space complexity. It’s hard to increase the context window for ChatGPT with that kind of scaling.

We call this the “Quadratic Bottleneck”.
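To make that concrete, here’s a minimal sketch of single-head self-attention in NumPy (my own simplification; real Transformers add multiple heads, masking, and more). The part to notice is the n × n score matrix:

```python
import numpy as np

def self_attention(Q, K, V):
    """Naive single-head attention. Q, K, V each have shape (n, d) for n tokens."""
    d = Q.shape[-1]
    # Every token scores every other token: this matrix is n x n,
    # so both compute and memory grow quadratically with sequence length.
    scores = Q @ K.T / np.sqrt(d)                       # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V                                  # (n, d)

# Doubling n quadruples the size of `scores` -- the "Quadratic Bottleneck".
n, d = 1024, 64
Q = K = V = np.random.randn(n, d).astype(np.float32)
out = self_attention(Q, K, V)   # out.shape == (1024, 64)
```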

State Space Models, in comparison, scale way better. They have linear, O(n), time complexity and constant, O(1), space complexity.

They do this by using a fixed-size memory (a “state”). As they process the input sequence, they update that state, compressing the information into it.

It’s similar to how humans read from left to right, as opposed to looking at the entire page all at once and holding every word in memory.

This fixed memory means they can have an essentially unbounded context window: the memory requirements are always the same. And because compute requirements scale linearly, they’re also much faster and (in theory) cheaper than Transformers.
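Here’s a rough sketch of the recurrent view of a state space model (again my own toy code, not Mamba’s actual implementation, which makes the A, B, and C matrices input-dependent and uses a hardware-aware parallel scan). Notice that the state h never grows, no matter how long the sequence gets:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x has shape (n, d_in); the state h stays a fixed (d_state,) vector,
    so memory is O(1) in sequence length and compute is O(n).
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    outputs = []
    for x_t in x:                 # one pass, left to right -- like reading
        h = A @ h + B @ x_t       # fold the new token into the fixed-size state
        outputs.append(C @ h)     # read the output from the compressed state
    return np.stack(outputs)

n, d_in, d_state, d_out = 1024, 16, 64, 16
x = np.random.randn(n, d_in)
A = 0.9 * np.eye(d_state)                  # decay of old information
B = np.random.randn(d_state, d_in) * 0.1
C = np.random.randn(d_out, d_state) * 0.1
y = ssm_scan(x, A, B, C)                   # y.shape == (1024, 16)
```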

The SSM model everyone is talking about right now is called “Mamba”, and it’s not without its drawbacks.

The original Mamba model processes sequences from left to right in a unidirectional manner. This means that when processing any given token, the model only has access to information from previous tokens, not future ones. Transformers look at everything everywhere all at once, and can therefore handle non-sequential data.

The Mamba model compresses information as it processes a sequence, whereas Transformers kind of memorize everything. So recalling exact subsequences within the full sequence may be challenging for Mamba.

As per usual, I focused almost entirely on the ‘Bull case’ for SSMs and shoehorned in a couple of Google searches for the ‘Bear case’ towards the end.

Even so, my gut tells me that Mamba’s shortcomings are not nearly enough to disqualify the SSM family of models. I think we’re going to be hearing a lot more about them in the coming months…

Sources: