Let’s dance the Mamba

Why you should read this publication

Mamba introduces a new, efficient and versatile architecture family that is making its mark on the artificial intelligence landscape. As a bonus: a better understanding of image embeddings from DinoV2, and a new way to bypass the safeguards of Large Language Models.

What business processes are likely to change as a result of this research?

About Mamba: if the architecture lives up to its promise, we can expect an impact in many fields, such as image and natural language processing. Regarding the evolution of DinoV2, unsupervised approaches to image analysis are gaining in quality. Finally, with regard to LLMs, a new serious risk is identified, which must be handled robustly when deploying these tools.

The use cases we have developed for customers that touch on the subject of this research review

Unsupervised image analysis for feature or match detection. Robustness work to keep a Large Language Model within bounds.

If you only have a minute to devote to reading now, here's the essential content in a nutshell

  1. A new architecture, Mamba, is causing a stir in the research world.
  2. This architecture is based on a selection mechanism that enables it to manage complex input data by ignoring unnecessary elements.
  3. It is an interesting candidate to replace Transformers, which struggle with overly long sequences (typically in text).
  4. It also seems to be very versatile in terms of applications: natural language, audio, DNA modeling, image, video…
  5. Its rejection from the ICLR 2024 conference caused quite a stir, and once again highlighted the limitations of the current validation system.
  6. And now for something completely different: MetaAI has proposed an evolution of DinoV2 that offers a better understanding of Vision Transformers and provides a superior tool.
  7. An atrocious new flaw has been discovered in Large Language Models (yet another one): the use of ASCII Art.

A new architecture for Deep Learning?

Mamba! For several months now, this term has been attracting the attention of every data scientist and Deep Learning researcher who is even a little up to date. Behind it lies a new, very interesting and versatile neural network architecture, but also one of those little psychodramas characteristic of academic AI research. We (Datalchemy) don’t usually like to jump on radically new tools hot out of the oven: the more recent the work, the greater the risk of being blindsided. Dinosaurs in the field will, for example, remember Capsule Networks, proposed by Lord Hinton himself, on which the community threw itself before abandoning them six months later.

Here, as we shall see, the proposed architecture presents very strong arguments that can hardly be ignored. What’s more, it has already been successfully adopted in many other projects. But we also have the opportunity to observe, in a very concrete case, the limits of today’s Deep Learning conference system, with the rejection of this publication by ICLR 2024. This rejection has caused a certain amount of uproar in the community and deserves a closer look, as it perfectly illustrates some strong limits of current research and imposes ever greater precautions on us.

Before getting into Mamba specifically, a little background is in order. This work follows in the footsteps of other publications on Structured State Space Models (SSMs). These SSMs introduce a fundamental mechanism for modeling a continuous problem (in the mathematical sense of the term), together with a discretization scheme for application in Deep Learning. They can be seen as an extension of algorithms such as the Kalman filter. Here, this concept is used as a new kind of Deep Learning block that can be directly integrated into a neural network. This approach had already been showcased in How to Train Your HIPPO: State Space Models with Generalized Basis Projections by Gu et al, where this type of model made it possible to address very long-range prediction problems (the Long Range Arena), beating, for example, the famous Transformer and the few thousand optimizations of the attention mechanism attempted by various researchers in recent years.
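
For reference, one common choice for that discretization step (the zero-order hold used in the Mamba line of work) turns the continuous parameters (Δ, A, B) into their discrete counterparts; roughly:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$$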

This approach remained relatively obscure until the arrival of our beloved Mamba.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces, Gu et al.

Published in December 2023, this approach has caused quite a stir. Indeed, the authors address two very sensitive points in the world of Deep Learning:

  • A highly versatile architecture, which (as we shall see) can address everything from natural language and DNA modeling to very long time series management and audio generation.
  • This architecture is highly efficient by nature, making it possible to manage very long contexts. When we recall the total energy spent by researchers trying to improve the attention mechanism of the famous Transformer, which suffers from quadratic complexity in sequence length (a sentence twice as long requires four times as much computation), the argument takes on its full meaning (the small sketch just after this list makes the scaling concrete).
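
To make the quadratic-versus-linear argument concrete, here is a small illustrative sketch (ours, not a benchmark): self-attention has to consider every pair of positions in the sequence, while a recurrent or SSM-style scan visits each position once, so doubling the length quadruples the former and only doubles the latter.

```python
# Illustrative cost count (not a benchmark): self-attention touches every
# pair of positions (L * L); a recurrent/SSM scan touches each position once (L).
def attention_pairs(seq_len: int) -> int:
    return seq_len * seq_len

def scan_steps(seq_len: int) -> int:
    return seq_len

for L in (1_000, 2_000, 4_000):
    print(f"L={L}: attention ~{attention_pairs(L):,} pair interactions, scan ~{scan_steps(L):,} steps")
```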

Mamba positions itself relative to previous work on Structured State Space Models in how it handles a sequence of information (series, text…): those works learned parameters that were invariant across the time steps of the sequence, in the manner of an old LSTM-type recurrent network whose matrices were fixed once training was complete. Here, the authors propose a fundamental selection mechanism which, as the model consumes an input sequence, lets it “decide” whether or not to update the network’s internal state. The model thus learns to select, as it receives the sequence of input information, which information it will use and which it will ignore. This modification is of considerable importance, as it (theoretically) makes the model robust to extremely long sequences. A properly trained model will be able to ignore a large amount of useless information, which was impossible for a classical architecture such as the Transformer. A more theoretical point of interest: recurrent models (RNN/GRU/LSTM) can then be seen as a special case of SSMs, which become the more general approach.

This selection mechanism alone would have been enough to make the work interesting, but the authors didn’t stop there. In particular, they developed an optimized CUDA kernel (CUDA being, roughly, the assembler of our all-too-precious GPUs) to accelerate Mamba training. This point is not to be underestimated in a field where available computing power remains a constant brake on any project. Through this type of development, the authors democratize access to their approach and turn it into a directly usable building block. While it is difficult to compete with all the optimizations available today for the Transformer, this work (as we shall see) was then reused to apply Mamba to images and video.

We’re only going to touch on the fundamental technique here, but below you’ll find, in order:

  • The equations modeling the fundamental mechanism (reconstructed just after this list). From left to right:
    • (1a) & (1b): the continuous, theoretical modelling of the transformation, with a signal x(t) as input, a result y(t) as output, an internal state h(t) (corresponding to the latent vector of any other Deep Learning operation) and transition matrices A, B and C which evolve as a function of the input data. This theoretical model is a differential equation on the derivative of h(t).
    • (2a) & (2b): a discrete, recurrent version, with an update of the internal state h as a function of its previous state and of x.
    • (3a) & (3b): another discrete implementation, this time in the form of a convolution.
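
For readers without the figure at hand, here is a reconstruction of those three formulations in the notation of the Mamba paper (the discretized matrices come from the discretization step recalled earlier):

$$
\begin{aligned}
&\text{(1a)}\quad h'(t) = A\,h(t) + B\,x(t) \qquad &&\text{(1b)}\quad y(t) = C\,h(t)\\
&\text{(2a)}\quad h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t \qquad &&\text{(2b)}\quad y_t = C\,h_t\\
&\text{(3a)}\quad \bar{K} = \bigl(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{k}\bar{B},\; \dots\bigr) \qquad &&\text{(3b)}\quad y = x * \bar{K}
\end{aligned}
$$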

A diagram shows the update of a “Mamba cell”: given the old internal state h(t-1) and a new input x(t), update mechanisms generate the new internal state h(t) and a result y(t). The important point is the selection mechanism, which modifies the matrices A, B and C as a function of the new input x(t).
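
To make the selection idea concrete, here is a minimal, deliberately naive sketch of one such recurrent update in plain NumPy. All names and shapes are ours, not the authors’: B, C and the step size Δ are computed from the current token x(t) (the underlying A stays fixed, but its discretized version varies through Δ), and the paper’s hardware-aware parallel scan is ignored entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 8, 16                              # token width and state size (our choices)

A_diag = -np.arange(1.0, d_state + 1.0)               # fixed, stable diagonal state matrix
W_B = 0.1 * rng.standard_normal((d_state, d_model))   # x_t -> B_t
W_C = 0.1 * rng.standard_normal((d_state, d_model))   # x_t -> C_t
w_dt = 0.1 * rng.standard_normal(d_model)             # x_t -> step size delta
w_u = 0.1 * rng.standard_normal(d_model)              # x_t -> scalar value fed to this channel

def selective_step(h, x_t):
    """One recurrence step where B, C and the step size all depend on the token x_t."""
    dt = np.log1p(np.exp(w_dt @ x_t))                 # softplus keeps the step size positive
    A_bar = np.exp(dt * A_diag)                       # zero-order hold on the diagonal A
    B_t = W_B @ x_t                                   # input-dependent input matrix
    C_t = W_C @ x_t                                   # input-dependent read-out
    u_t = w_u @ x_t                                   # the scalar this channel actually ingests
    h = A_bar * h + dt * B_t * u_t                    # a tiny dt means "mostly ignore this token"
    y_t = C_t @ h                                     # output for this step
    return h, y_t

h = np.zeros(d_state)
for x_t in rng.standard_normal((5, d_model)):         # toy sequence of 5 tokens
    h, y_t = selective_step(h, x_t)
```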

Those familiar with the field won’t fail to notice a number of similarities with recurrent networks 😊

The fundamental importance of this selection mechanism

This is probably the most fundamental point of the whole Mamba approach. While it’s always risky to draw parallels between the atomic functioning of a Deep Learning operator and the more global behavior of a model, we can observe here that Mamba, when faced with an input sequence, can completely ignore a large part of it to concentrate on the most important information. This approach is radically different from more conventional ones:

  • Recurrent (and convolutional) models will always take the entire sequence as input. The length of this sequence will therefore have a negative impact on the model’s results, whatever the complexity of the problem addressed.
  • The Transformers have a similar approach, but are also particularly vulnerable to sequence length. Numerous attempts have been made to better model the attention mechanism, not least because Transformers are the canonical architecture of Large Language Models (GPT, Llama, Mistral), and this vulnerability to sequence length has a major impact on the way these models are used. In particular, a model that is supposed to hold a conversation will soon be limited in the conversation history it can still use…

The authors illustrate this with two simplified problems that highlight how Mamba works (figure below): Selective Copying, where the model must learn to retain only certain input tokens (those in color) and ignore the other tokens (in white), and the Induction Head, where the model must output the token that follows a token it has already seen in the input (below: the black token was directly followed by the blue token, which must therefore be regenerated by the model).

These problems may strike some as “too simple” for artificial intelligence, yet with sequences of length 4096 for Selective Copying, for example, our beloved Transformers are quite incapable of solving the task… Mamba, for the Induction Heads, handles sequences of 100,000 or even a million tokens. Here we see the real value of a selection mechanism which, being robust to useless information, remains effective over very long ranges.
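
For intuition, a toy generator for this kind of selective-copying sample is easy to write; the setup below is our own simplification, not the paper’s code: the input is a long sequence of mostly noise tokens with a handful of content tokens scattered inside, and the expected output is just those content tokens, in order.

```python
import random

# Toy selective-copying sample (our simplification of the benchmark):
# vocabulary: 0 = noise token, 1..9 = content tokens the model must copy out.
def make_selective_copy_sample(seq_len=4096, n_content=16, seed=0):
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(seq_len), n_content))
    sequence = [0] * seq_len
    targets = []
    for pos in positions:
        token = rng.randint(1, 9)
        sequence[pos] = token
        targets.append(token)
    return sequence, targets        # model input, expected output (content only, in order)

seq, tgt = make_selective_copy_sample()
```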

 

Different fields of application

This selection mechanism can intuitively be applied to many other fields, which the authors were able to verify. Several results can be highlighted:

On natural language, the authors observed that Mamba was competitive with the most interesting “recipes” (a term with the virtue of scientific honesty) for training Transformers, on a perplexity metric (keep this in mind, we’ll come back to it shortly), for a lower training cost. Even more interesting, Mamba seems to exploit an increase in the number of parameters more effectively than conventional approaches. Considering that in Deep Learning we’ve never stopped chasing the biggest possible models, these scaling laws are particularly interesting.
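
Since perplexity comes back later (it is at the heart of the ICLR dispute), a quick reminder of what it measures: the exponential of the average negative log-likelihood the model assigns to the tokens of a held-out text, lower being better:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\bigl(x_t \mid x_{<t}\bigr)\right)$$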

On DNA modeling, Mamba also seems much more efficient than the state of the art to date. DNA modeling is interesting because it involves much longer sequences than natural language, up to a length of 1,048,576 tokens. Point of interest: where older models degraded as sequence length grew, Mamba actually improves slightly.

Finally, in audio modeling and generation, a subject where the underlying information is continuous and the sequences are particularly long and complex to handle, Mamba also comes up trumps, even on particularly long sequences.

Each time, beyond the sequence length, Mamba seems to be more effective, in the sense that an increase in model size translates into a larger gain in quality than previously observed. What’s more, the fact that this approach can be applied to such a wide range of problems makes this work particularly worthy of note, despite its extreme youth. Below, left: accuracy after fine-tuning of a DNA model as a function of sequence length; right: modeling quality of audio content as a function of input sound length:

 

Tragic ball at ICLR 2024: is du8a really mean?

Begging the reader’s indulgence for this inglorious title: Mamba has been the subject of considerable controversy in the world of Deep Learning. More than just reporting on an epiphenomenon, we believe this is a good way of illustrating some of the major limitations of Deep Learning. Let’s start by stating the obvious: in this field of science, we suffer from a theoretical deficit that makes any “sensible” approach to comparing architectures very complex (impossible?). In this festival of empiricism, it’s not uncommon to observe “herd” movements in the scientific community, with temporary fashion effects or real blindness. This is one of the reasons why, although we follow research closely, we are often wary of work that is too recent to be properly assessed… Last but not least: there are far too many publications in this field today, and the classic mechanisms for filtering research (notably, and we’re getting there, peer review at prestigious conferences) no longer work.

With that introduction in mind, let’s move on to the drama. Mamba was submitted to ICLR 2024, a conference that is meant to be a cutting-edge meeting place in certain areas of research. The majority of the scientific community expected this work to be accepted, considering the other works that have already built on Mamba (more on this later). So it came as a great surprise to learn that the paper had been rejected outright…

Remember that submissions and exchanges between authors and reviewers for this type of conference are public on the excellent OpenReview. This is excellent news for anyone wishing to judge for themselves. And here, only one reviewer (out of a total of four) felt that this work did not deserve to be accepted: reviewer du8a.

While many criticisms were raised by this dear du8a, the authors of Mamba were able to respond to most of them. Only two points remained blocked, and led to the rejection of the publication:

  • The absence of an evaluation on the Long Range Arena, a conventional benchmark. This point is honestly unconvincing, considering the other experiments carried out by the authors.
  • The use of perplexity as a comparison metric for text or DNA modeling. This point is much more relevant, as this metric requires a Deep Learning model to be computed, and while it is particularly widely used, it remains very tricky to interpret and can, potentially, mean very little.

But then, is du8a a brilliant individual who, faced with the wave of a new fashion, stands firm and refuses to bend? Or are we dealing with a jealous reviewer who doesn’t recognize the quality of the work? We’d love to have a single answer, but it’s impossible here. Since laziness is (sometimes) a virtue, we’ll just have to wait and see over the next few months whether the wave of Mambas peters out or, on the contrary, picks up speed and effectively replaces Transformers on specific topics. Nevertheless, even if this work turns out to be less revolutionary than expected, it’s a shame that it didn’t have its place at a cutting-edge conference, especially considering that a number of publications accepted at this same event seem to fall well short in terms of the effort invested and the questions posed in the field of artificial intelligence.

Vision Mamba, Video Mamba...

Things would be relatively straightforward if this were the only publication. In the world of Deep Learning, where replication of research work is a key factor in validating an innovation, it’s often best to be wary of something new, however attractive. But since the emergence of this architecture, a great deal of work has extended it to new applications. The list is long; two works give an idea of this movement:

Vision Mamba, from Zhu et al, applies the architecture to image processing. It shows a slight performance gain on various tasks (classification, segmentation, etc.) but above all a particularly attractive efficiency saving (86% GPU memory saved, inference twice as fast). In this architecture, the image is divided into patches, and the selection mechanism retains or discards the information from each patch, which lets us project towards much higher image resolutions, where the good old Vision Transformer quickly runs out of steam (see diagram below: comparison of the state-of-the-art Transformer DeiT and Vision Mamba).
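
As a rough picture of the “image as a sequence of patches” step (our sketch, not the Vision Mamba code): the image is cut into fixed-size patches, each patch is flattened, and the resulting token sequence is what the selective blocks then consume.

```python
import numpy as np

def patchify(image, patch=16):
    """Cut an (H, W, C) image into non-overlapping flattened patches.

    Returns an (n_patches, patch*patch*C) array, i.e. a token sequence
    ready for a sequence model. Sketch only, not the Vision Mamba code.
    """
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    tokens = []
    for i in range(rows):
        for j in range(cols):
            block = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch, :]
            tokens.append(block.reshape(-1))
    return np.stack(tokens)

tokens = patchify(np.zeros((224, 224, 3)))   # -> (196, 768): 14 x 14 patch tokens
```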

And that’s just the tip of the iceberg: Mamba is also being used in video processing, graph modeling and more. At this point, all we can do is keep an eye on the next few months, to either jump on the bandwagon at the right moment, or forget about this work and place it in the pantheon of incredible architectures that almost revolutionized artificial intelligence, somewhere between Capsule Networks and Consistency Models.

Back to DinoV2 and image embeddings

Let’s step away from Mamba for two particularly interesting side notes that caught our eye last month. First, a look back at our old friend DinoV2, from the MetaAI laboratories. As a reminder (we’ve talked about it extensively, notably in a webinar), DinoV2 is a generic model whose fundamental purpose is not to address a specific problem, but to generate a high-level representation of an image, which can then be used for many different problems. The researchers began by training the DinoV2 model; it could then be adapted to a wide range of tasks (segmentation, depth map generation, etc.) via an additional layer that learns to exploit the high-level representations of the large model. During this specialization, the “big model” remains frozen, enabling extremely efficient training. Beyond this versatility, DinoV2 is part of the broader Deep Learning effort to learn high-level representations that are efficient and complete. For us, this model quickly became part of the toolbox, as a particularly effective Swiss Army knife for addressing various image-based problems…
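
In practice, this “frozen Swiss Army knife” usage looks roughly like the sketch below (PyTorch; the torch.hub entry point is the one MetaAI’s DinoV2 repository advertises, but treat the exact model name and embedding size as assumptions to check against the repo):

```python
import torch
import torch.nn as nn

# Frozen generic backbone + small task-specific head (sketch, assumptions flagged above).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                # the big model stays untouched

head = nn.Linear(384, 10)                  # 384 = ViT-S/14 embedding size, 10 = our classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)       # placeholder batch
labels = torch.randint(0, 10, (4,))

with torch.no_grad():
    features = backbone(images)            # (4, 384) high-level image embeddings
loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()
optimizer.step()                           # only the small head learns
```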

It was only recently that we realized we had missed some complementary work by MetaAI released in September 2023: Vision Transformers Need Registers by Darcet et al. This work is particularly interesting, because it starts from an observation that has already been made many times: when you look at an internal representation produced by a DinoV2 model for a given input image, you very quickly see pixel “artifacts” appear in the embeddings (the representations generated by the model), which impair the representativeness of this embedding as a local representation. And this problem is not specific to DinoV2, but affects almost the whole Vision Transformer family of architectures, which has (alas?) become the standard for image processing in Deep Learning.

The diagram below shows three photographs, each with the internal representation from three different models, including DinoV2 in the center. On the right are the so-called “with registers” models improved by the authors. In the classic DinoV2 (without registers), note these very high-intensity pixels, visibly located in places where no relevant information seems to exist in the image…

But why is this important, the empiricist who has learned not to question these tools too much might object? Firstly, because the purpose of DinoV2 is to generate representations usable for a variety of tasks, the existence of these artifacts is a real hindrance, particularly here for locating an element in the image. Beyond that, these embeddings are often a very important toolbox when we want to make progress on a subject in an unsupervised way (for example in clustering, similarity or anomaly detection). The spatial representativeness of the embedding is a fundamental piece of information we want to be able to rely on. However, Vision Transformers generate these representations by allowing any output pixel to depend on any input pixel (unlike convolutional networks, which wisely maintain a spatial correspondence between input and output). Today, we still lack experience with these tools, and this type of work helps us understand them better, so that we can work more efficiently in the future.

Here, the authors solve the problem by giving the model a way to aggregate information without interfering with the generated representations. The solution is directly available, and we can confirm that for local image analysis the results are much more interesting. But in a way, this publication matters most because it opens a small window onto the behavior of a huge Vision Transformer-type model during training. Artifact pixels can be distinguished in several ways: they have a much higher absolute value than the others, they appear when training models of considerable size, and they carry little specific information at the local level, instead playing an aggregation role to represent the entire image; in other words, they carry very global but not local information.
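
The fix itself is conceptually tiny. A hedged sketch of the idea (ours, not the authors’ code): a few extra learnable “register” tokens are appended to the patch tokens at the input of the Vision Transformer, take part in attention like any other token, and are simply discarded at the output, so they can absorb the global aggregation role without polluting the patch embeddings.

```python
import torch
import torch.nn as nn

class WithRegisters(nn.Module):
    """Wraps a transformer encoder and appends learnable register tokens (sketch only)."""
    def __init__(self, encoder: nn.Module, dim: int, n_registers: int = 4):
        super().__init__()
        self.encoder = encoder
        self.registers = nn.Parameter(torch.zeros(1, n_registers, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b, n, _ = patch_tokens.shape
        regs = self.registers.expand(b, -1, -1)
        x = torch.cat([patch_tokens, regs], dim=1)   # registers join the token sequence
        x = self.encoder(x)
        return x[:, :n]                              # registers are thrown away at the output

# Toy usage: a generic TransformerEncoder stands in for the real ViT blocks.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True), num_layers=2
)
model = WithRegisters(encoder, dim=384)
out = model(torch.randn(2, 196, 384))                # (2, 196, 384): patch embeddings only
```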

Large Language Models are too sensitive to art

Regular readers of our research reviews already know that the Large Language Models which have been making headlines over the past year with GPT-4 are certainly very powerful tools, but particularly complex to manage in a robust way. A particularly sensitive issue is securing the model to prevent it from generating unwanted content, especially for models trained according to a certain, very American, vision of “political correctness”. Any user of these tools has seen the LLM decline to answer a question considered too toxic or dangerous…

The publication we are interested in here has a title that speaks for itself: ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs, by Jiang et al. This work starts by taking stock of protection methods for LLMs against the malicious attacks found so far. Unfortunately, these methods remain extremely limited and fall into two broad categories. The first, so-called “Detection-based Defense”, uses an external classifier or parameters internal to the LLM (the famous perplexity) to protect the model from malicious use. The second, “Mitigation-based Defense”, reformulates the input request (via paraphrases, or by acting on the tokens representing the sentence) to minimize the ability to attack the model. First of all, it should be pointed out that these defenses, while useful, are by no means complete protection. But above all, this research shows a new, highly effective attack vector based on ASCII art, which bypasses these defenses without any hassle. A diagram is much clearer than an explanation at this point:

The authors test this attack on the majority of existing models (including GPT-4 and Llama), and in almost all cases succeed in extracting unwanted behavior from the model. This work isn’t revolutionary, and we’ve already noted other workarounds of this type in previous research reviews. Let’s bet a good bottle that this is just the beginning, considering that the immense size of these models (essential to their quality) limits, if not prevents, any complete analysis of their behavior. This remains one of the golden rules of artificial intelligence projects: it’s much harder to constrain a model to a specific behavior than to create the model in the first place, and we regularly observe this sad reality in the implementation of our projects.
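
Coming back to the defenses mentioned above, the “detection-based” family can be pictured with a very small sketch: score the incoming prompt with a small reference language model and flag anything whose perplexity is anomalously high (ASCII-art blocks tend to look like token soup). HuggingFace transformers and GPT-2 are our illustrative choices here, and the threshold is arbitrary; this is not the paper’s code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy detection-based defense: flag prompts whose perplexity under a small
# reference LM is suspiciously high. Model choice and threshold are ours.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def prompt_perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss            # mean negative log-likelihood
    return float(torch.exp(loss))

def looks_suspicious(text: str, threshold: float = 200.0) -> bool:
    # Real systems would calibrate the threshold on benign traffic.
    return prompt_perplexity(text) > threshold
```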