
Dancing the Mamba, between revolution and AI psychodrama


Mamba. Sequence. Efficiency. DinoV2. LLM

Mamba announces a new family of efficient and versatile architectures that is making its mark on the artificial intelligence landscape. Bonus: a better understanding of image embeddings from DinoV2, and a new way of bypassing the safeguards of Large Language Models.

As far as Mamba is concerned, if the architecture lives up to its promise, we can expect an impact in many areas, such as image processing and natural language processing. As for the evolution of DinoV2, unsupervised approaches to image analysis are gaining in quality. Finally, with regard to LLMs, a strong new risk has been identified, and it must be addressed robustly before these tools can be deployed.

Unsupervised image analysis to detect features or matches. Robustness work to constrain a tool built on a Large Language Model.

  1. A new architecture, Mamba, is making more and more noise in the research world.
  2. This architecture is based on a selection mechanism that enables it to manage complex input data by ignoring unnecessary elements.
  3. It is an interesting candidate for replacing Transformers, which suffer when faced with sequences that are too long (typically in text).
  4. It also seems to be very versatile in terms of applications: natural language, audio, DNA modelling, image, video, etc.
  5. Its rejection at the ICLR2024 conference caused quite a stir, highlighting once again the limitations of the current validation system.
  6. And now something completely different: MetaAI has proposed an evolution of DinoV2 that offers a better understanding of Vision Transformers and provides a superior tool.
  7. A nasty new flaw has been discovered in Large Language Models (yet another one): the use of ASCII art.

AI revolution and psychodrama, let's talk Mamba

Mamba! For the past few months, this term has caught the attention of every data scientist or Deep Learning researcher who is somewhat up-to-date with current affairs. Behind this term lies a new and very interesting architecture of neural networks, but also one of those little psychodramas characteristic of academic research in AI. Normally, we (Datalchemy) don't really like to jump on radically different new tools fresh out of the oven... Indeed, the more recent a work is, the stronger the risks of blindness. The dinosaurs of the field remember, for example, the Capsule Networks proposed by Lord Hinton himself, which the community embraced before abandoning them six months later.

Here, as we shall see, the proposed architecture presents very strong arguments that are difficult to ignore. What's more, it has been used successfully in many other projects. But we also have an opportunity to observe the limits of the Deep Learning conference system in a very concrete case, with the rejection of this publication by ICLR 2024. This rejection has caused quite a stir in the community, and deserves to be looked at more closely, as it perfectly illustrates some of the major limitations of current research, and requires us to be ever more cautious.

Before looking specifically at Mamba, a little background is in order. This work follows on from other publications on Structured State Space Models (SSMs). SSMs present a new fundamental mechanism for modelling a continuous problem (in the mathematical sense of the term), incorporating a discretisation system for application in Deep Learning. They can be seen as an extension of algorithms such as Kalman filters. Here, this concept is used as a new form of Deep Learning block that can be directly integrated into a neural network. This approach had already been highlighted in How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections by Gu et al., where this type of model made it possible to address very long-range prediction problems (the Long Range Arena benchmark), surpassing, for example, the famous Transformer and the few thousand optimisations of the attention mechanism attempted by various researchers in recent years.

This approach remained relatively obscure until the arrival of our beloved Mamba.

Published in December 2023, this approach has caused quite a stir. The authors address two very sensitive points in the world of Deep Learning:

  • A highly versatile architecture, which (as we shall see) can address natural language, DNA modelling, the management of very long time series and audio generation. 
  • This architecture is highly efficient by nature, making it possible to manage very long contexts. When we recall the total energy expended by researchers trying to improve the attention mechanism of the famous Transformer, which suffers from quadratic complexity with length (a sentence twice as long will require four times as much computation), the argument takes on its full meaning and becomes particularly relevant (see the rough cost comparison just after this list).
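As a rough order of magnitude (our own back-of-the-envelope notation, not the authors'), for a sequence of length L, model width d and SSM state size N:

```latex
\underbrace{O(L^{2}\,d)}_{\text{self-attention}}
\qquad \text{vs.} \qquad
\underbrace{O(L\,d\,N)}_{\text{selective SSM scan (linear in } L\text{)}}
```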

Mamba positions itself relative to previous work on Structured State Space Models when faced with a sequence of information (series, text, etc.): that earlier work learned parameters that were invariant throughout the sequence, in the manner of an old LSTM-type recurrent network where the matrices applied within the operators were fixed once learning was complete. Here, the authors propose a fundamental selection mechanism which, as the model consumes the input sequence, enables it to 'decide' whether or not to update the network's internal variables. The model can therefore learn to select, as it receives the sequence of information, which pieces of information it will use and which it will ignore. This modification is of considerable importance, because (theoretically) it allows the model to become robust to extremely long sequences. A properly trained model will be able to ignore a large amount of useless information, which was impossible for a classic architecture such as the Transformer. As a more theoretical point of interest, recurrent models (RNN/GRU/LSTM) can then be considered a special case of SSMs, which become a more general approach.
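To illustrate the idea, here is a minimal sketch of ours (not the authors' optimised implementation; all names and sizes are assumptions): the 'selective' part simply means that the step size delta and the matrices B and C are computed from each input token, so the state update can all but ignore a token.

```python
# A minimal, illustrative sketch of a selective SSM recurrence (not the paper's
# optimised CUDA kernel): B, C and the step size delta are computed from the
# input, so the state update can "choose" to ignore a token.
import torch
import torch.nn as nn

class SelectiveSSMSketch(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Diagonal state matrix A (kept input-independent, as in Mamba)
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()))
        # Input-dependent projections: these make the SSM "selective"
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_delta = nn.Linear(d_model, 1)

    def forward(self, x):                       # x: (batch, length, d_model)
        batch, length, d_model = x.shape
        A = -torch.exp(self.A_log)              # (d_state,), negative for stability
        h = x.new_zeros(batch, d_model, A.shape[0])   # one hidden state per channel
        outputs = []
        for t in range(length):
            xt = x[:, t]                                               # (batch, d_model)
            delta = torch.nn.functional.softplus(self.to_delta(xt))   # step size > 0
            B = self.to_B(xt)                                          # (batch, d_state)
            C = self.to_C(xt)                                          # (batch, d_state)
            # Discretisation (simplified): A_bar = exp(delta * A), B_bar ~ delta * B
            A_bar = torch.exp(delta.unsqueeze(-1) * A)                 # (batch, 1, d_state)
            B_bar = (delta * B).unsqueeze(1)                           # (batch, 1, d_state)
            # State update: a tiny delta lets the model effectively skip this token
            h = A_bar * h + B_bar * xt.unsqueeze(-1)
            y = (h * C.unsqueeze(1)).sum(-1)                           # (batch, d_model)
            outputs.append(y)
        return torch.stack(outputs, dim=1)                             # (batch, length, d_model)
```

A quick smoke test: SelectiveSSMSketch(d_model=8)(torch.randn(2, 32, 8)) returns a (2, 32, 8) tensor. In the real Mamba block this recurrence is computed with a parallel scan and a dedicated CUDA kernel; the Python loop above only conveys the logic.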

This simple mechanism would have been enough to propose some interesting work, but the authors haven't stopped there. In particular, they have developed a CUDA kernel (the assembler for our all-too-precious GPUs) optimised for accelerating the training of Mamba models. This is not to be underestimated, in a field where available computing power remains a constant brake on any project. Through this type of development, the authors are democratising access to their approach and transforming it into a directly usable base. While it is difficult to compete with all the optimisations carried out for Transformer today, this simple work (as we shall see) has enabled Mamba to be applied to images and video.

We will only touch on the fundamental technique here, but you will find below, in order (the equations themselves are reconstructed just after this list):

  • The equations modelling the fundamental mechanism. From left to right:
    • (1a) & (1b): the continuous, theoretical modelling of the transformation, with a signal x(t) as input, a result y(t) as output, an internal state h(t) (corresponding to the latent vector of any other Deep Learning operation) and specific transition matrices A, B and C which will evolve as a function of the input data. This theoretical model is a differential equation on the derivative of h(t).
    • (2a) & (2b): a discrete formulation, this time as a recurrence, with an update of the internal state h as a function of its previous state and of x.
    • (3a) & (3b): another discrete implementation, this time in the form of a convolution.
  • A diagram depicting the update of a 'Mamba cell', where we have the old internal state of the cell h(t-1), a new input x(t), and update mechanisms to generate the new internal state of the cell h(t) and an output y(t). The important point here is the selection mechanism that will modify the matrices A, B and C based on the new input x(t).
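For reference, the equations in question, as they appear in the S4/Mamba line of work, are roughly the following (Ā and B̄ denote the discretised versions of A and B, obtained from a step size Δ):

```latex
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t) && \text{(1a)} \\
y(t)  &= C\,h(t)           && \text{(1b)} \\[6pt]
h_t   &= \bar{A}\,h_{t-1} + \bar{B}\,x_t && \text{(2a)} \\
y_t   &= C\,h_t            && \text{(2b)} \\[6pt]
\bar{K} &= \big(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{k}\bar{B},\; \dots\big) && \text{(3a)} \\
y &= x * \bar{K}           && \text{(3b)}
\end{aligned}
```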

Those familiar with the field will not be able to help detecting a large number of similarities with recurrent networks 😊...

This point is probably the most fundamental in the whole Mamba approach. Although it is always risky to draw parallels between the atomic operation of a Deep Learning operator and the more global behaviour of a model, we can observe here that Mamba, faced with an input sequence, can completely ignore a large part of this sequence in order to concentrate on the most important information. This approach is radically different from more traditional approaches:

  • Recurrent (and convolutional) models will always take the entire sequence as input. The length of this sequence will therefore have a negative impact on the results of the model, whatever the complexity of the problem addressed.
  • Transformers take a similar approach, but are also particularly vulnerable to this sequence length. Numerous attempts have been made to better model the attention mechanism, notably because Transformers are the canonical architecture of Large Language Models (GPT, Llamas, Mistral), and this vulnerability to sequence length has very strong impacts on the exploitation of these models. In particular, a model that is supposed to hold a conversation will quickly be limited in the conversation history it can still use...

The authors have illustrated this subject with two simplified problems that highlight how Mamba works (see figure below): Selective Copying where the model must learn to keep only certain tokens as input (those in colour) and ignore other tokens (in white), and induction head where the model must be able to give back the token that follows a specific token already seen as input (below: the black token was directly followed by the blue token, which must therefore be regenerated by the model).

These problems may strike some as 'too simple' for artificial intelligence, yet with sequences of length 4096 for Selective Copying, for example, our beloved Transformers are quite incapable of addressing the issue at equivalent model complexity... For Induction Heads, Mamba manages to handle sequences of length 100,000, or even a million, without fail. This shows the great value of a selection mechanism which, while robust to useless information, can remain effective over the long term.
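To make the Selective Copying task concrete, here is a tiny toy data generator in the spirit of the one described by the authors (the vocabulary, lengths and token conventions are our own choices, not theirs):

```python
import random

def selective_copying_example(seq_len=4096, n_targets=16, vocab_size=16, noise_token=0):
    """Build one (input, target) pair for a toy Selective Copying task.

    The input is mostly 'white' noise tokens; a handful of 'coloured' tokens
    (values >= 1) are scattered at random positions. The model must output
    those coloured tokens, in order, while ignoring everything else.
    """
    sequence = [noise_token] * seq_len
    positions = sorted(random.sample(range(seq_len), n_targets))
    targets = [random.randint(1, vocab_size - 1) for _ in range(n_targets)]
    for pos, tok in zip(positions, targets):
        sequence[pos] = tok
    return sequence, targets

if __name__ == "__main__":
    x, y = selective_copying_example(seq_len=64, n_targets=4)
    print("input :", x)
    print("target:", y)
```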

This selection mechanism can intuitively be applied to many other fields, as the authors were able to verify. Several results can be highlighted:

In natural language, the authors observed that Mamba was competitive with the most interesting 'recipes' (this term has the advantage of scientific honesty) for training Transformers on a perplexity metric (remember this point, we'll talk about it soon) at a lower training cost. Even more interestingly, Mamba seems to be able to exploit an increase in the number of parameters more effectively than traditional approaches. Considering that in Deep Learning we have never stopped chasing the largest possible models, these 'scaling laws' are particularly interesting.

In DNA modelling, Mamba also appears to be much more efficient than the current state of the art. DNA modelling is interesting because it offers much longer sequences than natural language, up to a length of 1,048,576 tokens. What's interesting is that, whereas the older models degraded as a function of sequence length, Mamba actually improves slightly.

Finally, in audio modelling and generation, a subject where the information is continuous at its core and the sequences are particularly long and complex to address, Mamba also comes out well, even on particularly long sequences.

Every time, beyond the question of sequence length, the Mamba model seems more effective in that an increase in model complexity translates into a quality gain greater than what we were used to observing. In addition, the fact that this approach can be applied to such different problems is why we consider this work particularly worthy of interest, despite its very early stage. Below: left, accuracy after fine-tuning of a DNA model as a function of sequence length; right, modelling quality of audio content as a function of input sound length:

Begging the reader's indulgence for this inglorious title, Mamba has been the subject of considerable controversy in the world of Deep Learning research. As well as reporting on an epiphenomenon, we believe this is a good way of illustrating some of the major limitations of Deep Learning research. Let's start by pointing out a few obvious facts: in this scientific field we suffer from a theoretical deficit that makes any 'sensible' approach to comparing architectures very complex (impossible?). In this festival of empiricism, it is not uncommon to observe 'herd' movements in the scientific community, with temporary fashion effects or genuine blindness. Last but not least, there are now far too many publications in this field, and the classic mechanisms for filtering research (in particular, and we're getting there, peer review at prestigious conferences) no longer work.

With that introduction out of the way, let's move on to the drama. Mamba was submitted to the ICLR 2024 conference, which is intended to be a cutting-edge event in certain areas of research. The majority of the scientific community expected this work to be accepted, considering the other work that followed on from Mamba (we'll talk about this shortly). So it came as a great surprise to learn that the work had been rejected outright...

It should be remembered that submissions and exchanges between authors and reviewers for this type of conference are public on the excellent OpenReview platform. This is excellent news for anyone wishing to see for themselves what is going on. And here, only one reviewer (out of a total of four) felt that this work did not deserve to be accepted: the now-famous reviewer du8a.

Although many criticisms were raised by this dear du8a, the authors of Mamba were on the whole able to respond. Only two points remained sticking points, and led to the rejection of the publication:

  • The absence of Long Range Arena tests using conventional benchmarks. This point is honestly unconvincing, given the other experiments carried out by the authors.
  • The use of perplexity as a comparison metric in text or DNA modelling. This point is much more relevant: the metric is computed by the Deep Learning model being evaluated and, while it is very widely used, it remains tricky to interpret and can potentially mean very little (the metric itself is recalled just after this list).
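For reference, perplexity is simply the exponential of the average negative log-likelihood that the model assigns to a held-out sequence:

```latex
\mathrm{PPL}(x_{1:N}) \;=\; \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\big(x_i \mid x_{<i}\big)\Big)
```

which is why the number obtained depends entirely on the model and tokenisation used to compute it, and is hard to compare across setups.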

But then, is du8a a brilliant individual who, faced with the wave of a new fashion, stands firm and refuses to bend? Or are we dealing with a jealous reviewer who refuses to recognise the quality of the work in front of them? We would love to have a single answer, but it is impossible here. As laziness is (sometimes) a virtue, we will simply wait and see over the next few months whether the Mamba wave dies down or, on the contrary, picks up speed and effectively replaces Transformers on specific subjects. Nevertheless, even if this work turns out to be less revolutionary than expected, it is a shame that it did not find its place at a leading conference, especially considering that a number of publications accepted at this same event seem to fall well short in terms of the investment and the questions they raise in the field of artificial intelligence.

Things would be relatively straightforward if this were the only publication. In the world of Deep Learning, reproducing research work is a fundamental part of validating an innovation, and it is often preferable to be wary of something new, even if it is attractive. But since the emergence of this architecture, a great deal of work has been done to extend it to new applications. The list is long; the following work gives an idea of the movement:

Vision Mamba by Zhu et al. applies the architecture to image processing. It shows a slight performance gain on various tasks (classification, segmentation, etc.), but above all a particularly attractive efficiency saving (86% less GPU memory, inference twice as fast). In this architecture, the image is divided into patches, and the selection mechanism decides whether or not to retain the information from each patch, making it possible to scale towards much higher image resolutions, where the good old Vision Transformer quickly suffers (diagram below: comparison of the state-of-the-art Transformer DeiT and Vision Mamba).
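As a reminder of what 'dividing the image into patches' means in practice, here is a minimal, generic sketch (the sizes are illustrative, not Vision Mamba's actual configuration):

```python
import torch

def image_to_patches(images, patch_size=16):
    """Split a batch of images (B, C, H, W) into flattened patches (B, N, C*patch*patch)."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image size must be divisible by patch size"
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> (B, N, C*p*p): one flattened vector per patch
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    return patches

x = torch.randn(2, 3, 224, 224)
print(image_to_patches(x).shape)   # torch.Size([2, 196, 768])
```

Each of these patch vectors becomes one token of the sequence the selection mechanism then chooses to keep or ignore.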

And that's just the tip of the iceberg. Mamba is also used in video processing, graph modelling, etc. At this stage, all we can do is keep an eye on the next few months and either jump on the bandwagon at the right moment, or forget about this work and place it in the pantheon of incredible architectures that almost revolutionised artificial intelligence, between Capsule Networks and Consistency Models...

Back to DinoV2 and image embedding

Let's step away from Mamba for two particularly interesting asides that came into our sights last month. First, let's take the opportunity to revisit our old friend DinoV2, from the labs of MetaAI. As a reminder (we have talked about it extensively, notably in webinars), DinoV2 is a generic model whose fundamental purpose is not to address a specific problem, but to generate a high-level representation of an image, which can then be used for many different problems. The researchers began by training the DinoV2 model globally, then adapted it to numerous tasks (segmentation, depth map generation, etc.) via an additional layer that learned to manipulate the high-level representations of the large model. During this specialisation, the 'big model' itself was not modified, which made training extremely efficient. Beyond this versatility, DinoV2 is part of the Deep Learning approach of learning high-level representations that are efficient and complete. For us, this model quickly found its way into the toolbox as a particularly effective Swiss Army knife for tackling different image-based problems...
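In practice, this 'Swiss Army knife' usage typically looks like the sketch below: the DinoV2 backbone stays frozen and only a small head is trained on top of its embeddings (the model variant, embedding size and 10-class head are our illustrative choices; DinoV2 checkpoints are distributed by MetaAI via torch.hub):

```python
import torch
import torch.nn as nn

# Frozen DinoV2 backbone (small variant); only the linear head below is trained.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(384, 10)   # 384 = embedding size of the ViT-S/14 variant; 10 classes (ours)

images = torch.randn(4, 3, 224, 224)      # any batch of 224x224 images
with torch.no_grad():
    embeddings = backbone(images)         # (4, 384) image-level embeddings
logits = head(embeddings)                 # the only trainable part
```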

It was only recently that we realised we had missed a complementary work from MetaAI published in September 2023: Vision Transformers Need Registers by Darcet et al. This work is particularly interesting, because it is based on a phenomenon that has already been observed many times: when you look at an internal representation produced by a DinoV2 model for an input image, you will very quickly see 'artefact' pixels appear in the embeddings (the representations generated by the model) which impair the representativeness of this embedding, particularly its local representativeness. And this problem is not specific to DinoV2, but common to almost all Vision Transformers, the family of architectures that has (unfortunately?) become the norm for working with images in Deep Learning.

The diagram below shows three photographs, each with the internal representation for three different models, with the DinoV2 in the centre. On the right, the 'With registers' representations are an improvement proposed by the authors. Particularly noticeable in the classic DinoV2 version (without registers) are these very high-intensity pixels, visibly located in places where no relevant information seems to exist in the image...

But why is this important, the empiricist who has learned not to question these tools too much might object? Firstly, because the aim of DinoV2 is to generate representations that can be used for a variety of tasks, so the existence of these artefacts is a real nuisance, especially when it comes to locating an element in the image. But beyond that, these embeddings are often a very important tool when we want to make progress on a subject in an unsupervised way (for example in clustering, similarity or anomaly detection). The spatial representativeness of the embedding is a fundamental piece of information that we want to be able to rely on. Vision Transformers generate these representations by allowing any output pixel to depend on any input pixel (unlike convolutional networks, which wisely maintain a spatial correspondence between input and output). We still lack experience with these tools, and this type of work will enable us to understand them better so that we can work more effectively in the future.

Here, the authors solve the problem by providing a way for the model to aggregate information without interfering with the representations generated. The solution is directly available and we can confirm that in local image analysis, the results are much more interesting. But in a way, this publication is more interesting in that it opens a small window onto the behaviour of a huge Vision Transformer-type model during its training. The 'artefact' pixels stand out in a number of ways: they have a much higher absolute value than the others, they appear when training models of considerable size, they carry little specific information at the local level and play more of an aggregation role to represent the whole image, in other words, they carry very global but not local information.
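The fix itself is conceptually tiny: a few extra learnable 'register' tokens are appended to the patch tokens before the Transformer blocks, giving the model somewhere to store global information, and are simply discarded at the output. A schematic sketch of the idea (ours, not the authors' code; sizes and layer choices are illustrative):

```python
import torch
import torch.nn as nn

class ViTWithRegistersSketch(nn.Module):
    def __init__(self, d_model=768, n_registers=4, n_layers=2):
        super().__init__()
        # Learnable register tokens, shared across all images
        self.registers = nn.Parameter(torch.zeros(1, n_registers, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.n_registers = n_registers

    def forward(self, patch_tokens):           # patch_tokens: (B, N_patches, d_model)
        b = patch_tokens.shape[0]
        regs = self.registers.expand(b, -1, -1)
        tokens = torch.cat([regs, patch_tokens], dim=1)   # registers + patches
        tokens = self.blocks(tokens)
        # Registers absorb global/aggregate information; we drop them so the
        # remaining patch tokens keep a cleaner local (spatial) meaning.
        return tokens[:, self.n_registers:]
```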

Large Language Models are too sensitive to art

As regular readers of our research journals will already know, the Large Language Models that have been making the headlines over the last year with GPT4 are certainly very powerful tools, but they are particularly complex to manage in a robust way. A particularly sensitive issue is securing the model to prevent it from generating undesirable content, especially for models trained according to a certain vision of very American 'political correctness'. Any user of these tools has seen an LLM decline to answer a question considered too toxic or dangerous...

The publication we are interested in here has a sufficiently eloquent title: ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs, by Jiang et al. This work starts with an overview of the methods found to date for protecting LLMs against malicious attacks. Unfortunately, these methods are still extremely limited, and fall into two main categories. The first is called 'Detection-based Defense', and exploits an external classifier or internal LLM signals (the famous perplexity) to protect the model against malicious use. The second, 'Mitigation-based Defense', reformulates the input request (via paraphrases, or by acting on the tokens representing the sentence) to minimise the ability to attack the model. It should be pointed out that these defences, while useful, are by no means complete, satisfactory protection. But above all, this research demonstrates a new and very effective attack vector, based on ASCII art, which bypasses these defences without any problem. And a diagram is much clearer than an explanation at this stage:
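To give an idea of the mechanism (purely for illustration, with a harmless word; this is our example, not the authors' code), the attack relies on the fact that a word rendered as ASCII art no longer contains the literal character sequence a keyword filter would look for. The pyfiglet library, for instance, does this kind of rendering:

```python
# pip install pyfiglet
import pyfiglet

word = "HELLO"                       # a harmless placeholder word
art = pyfiglet.figlet_format(word)   # render the word as multi-line ASCII art
print(art)
# The string 'HELLO' never appears literally in `art`, which is why naive
# keyword-based filters on the prompt text fail to spot the masked word.
print("HELLO" in art)                # -> False
```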

The authors test this attack on most existing models (including GPT4 and the Llamas), and in almost all cases succeed in extracting unwanted behaviour from the model. This work is not revolutionary, and we have already noted other bypasses of this type in previous reviews of the research. Let's bet a good bottle that this is just the beginning, considering that the immense size of these models (essential to their quality) limits, or even prevents, any complete analysis of their behaviour. This remains one of the golden rules of artificial intelligence projects: it is much more difficult to constrain a model to a specific behaviour than it is to create the model in the first place, and we regularly observe this sad reality in the implementation of our projects.