Why you should read this publication
If you work in robotic control, you cannot afford to ignore the ongoing revolution in imitation learning. And beyond robotic control, any optimization problem that models an agent making decisions can draw inspiration from these approaches.
What can you say to a colleague or your boss?
Datalchemy, in partnership with Kickmaker, reviews the latest training strategies in imitation learning. And even if it still seems a little magical, it works with unexpected robustness and, above all, finally makes these training techniques affordable and almost simple to implement.
What business processes are likely to change as a result of this research?
Virtually the entire chain is affected by this paradigm shift. More specifically, the simulation work needed to recreate training conditions could be greatly reduced, since it is now enough to record the actions of a human operator, which then serve as the training dataset.
The use cases we have developed for customers that touch on the subject of this research review
Training a robot arm to pick up a part and place it on a target, wherever the part is located within a given space.
If you only have a minute to devote to reading now, here's the essential content in a nutshell
- The very recent emergence of new robotic training methods (diffusion policies).
- Quite miraculously, we are moving from deep reinforcement learning (which is extremely time-consuming and costly, as well as sometimes producing random results) to diffusion policies and ALOHA (much simpler and more affordable, with very robust results).
- Miraculously, because the theoretical underpinnings of this small revolution have yet to be established.
- These two approaches make it possible to train the robot to perform potentially complex tasks (complex in the sense that recreating a complete simulation of these tasks is virtually impossible) with a ridiculously small dataset (a few hundred examples of tasks correctly performed by a human operator).
- Diffusion policies (part of whose architecture is based on the same diffusion mechanism as generative AI) are incredibly robust to perturbations while performing a task (a hand passing in front of the video sensor, the object to be manipulated being maliciously moved, etc.).
- An extremely recent development of ALOHA makes the whole setup portable, making it possible to train a robot at home to help elderly or disabled people with everyday tasks.
AI & robotics?
A bit of a change! Datalchemy has had the honor of working with Kickmaker for almost two years on artificial intelligence topics, particularly as applied to robotics, and so we have had a front-row seat to observe the impressive evolution of this field. While AI applications to robotics have been around since the work of Mnih et al. and the advent of Deep Reinforcement Learning, it has to be admitted that this work long remained inaccessible, for two main reasons:
- DRL (Deep Reinforcement Learning) is naturally much more expensive in terms of GPU power than classical Deep Learning. The projects we were eyeing in 2018, such as the training of a robotic hand by OpenAI (see figure below), required a minimum of 8 high-end GPUs to obtain a working agent. Many of our customers would have loved to test this kind of approach, but weren't prepared to spend that kind of money just to find out.
- This field, DRL, while extremely pleasant in its formulation (we train a software agent to maximize an arbitrary reward), was very resistant to real-world application. While a model could quite easily learn to juggle expertly in a classical simulator, the transfer to reality was doomed to failure, as reality would create conditions too disconnected from the simulation. As for learning directly in the real world, given the cost of a robot, this was impossible.

Figure from Learning Dexterous In-Hand Manipulation, Andrychowicz et al, https://arxiv.org/abs/1808.00177
The field has thus remained the prerogative of the "big" players of Deep Learning, such as OpenAI or Google. Google which, moreover, has scored a major coup in recent years with its RT-1, then RT-2 [https://robotics-transformer2.github.io/assets/rt2.pdf]. That's right: Google has presented the world with a robust robotic system that exploits the latest advances in Large Language Models (them again…) and cross embeddings (enabling images and text to be handled in a common mathematical space) to control a robotic arm across a large number of tasks. While this training must have been particularly onerous, the promise it holds is nothing short of dreamlike. Indeed, controlling a robotic arm requires extremely fine control of every position in space: to grasp an object, for example, it becomes necessary to locate that object exactly. Once this (not so trivial) localization has been achieved, the robotic gesture itself must be computed exactly, respecting the arm's mechanical constraints…
Here, the idea of needing only a text instruction to perform an action is a very strong disruption of this domain, with impressive versatility, as the text can describe the gesture as well as the target object (below, among other examples: "pick up the bag about to fall off").

But has Google, then, revolutionized the field of robotics? Have the players in the field seized upon this work to apply it to their own cases and revolutionize the industrial sector? Sadly, no. Indeed (and Google had accustomed us to better), while the publication is freely available, the code needed to train or use these models is, to date, nowhere to be found. And given that these training runs involve training both the LLM and the actor, their cumulative cost makes reproduction impossible.
However, there is another potentially complex aspect to this approach: the RT-2 model is end-to-end. In other words, we feed text in as input and observe actions controlling the robot as output. Yet anyone who has worked a little with Large Language Models knows that controlling a model through a prompt is an obscure, illogical and anxiety-provoking art, in which errors can easily appear without warning. It's not impossible that RT-2 suffers from the same problem…
At this point, we might sadly conclude that cutting-edge AI is reserved for the very big players who, sooner or later, will commercialize their work. But in the last six months, the scientific landscape has changed radically, and has suddenly become much more accessible…
Learning through imitation
The surprise came from another stream of DRL research, one which until now had remained more of a scientific curiosity than a real solution: Imitation Learning.
In a “classic” approach, we’re going to train an AI model to find the right actions to perform in order to obtain the greatest possible reward. The model will be confronted with a simulator reproducing (more or less well) the target reality, make an astronomical number of mistakes, and finally (with a bit of luck, let’s be honest) become “good” enough to succeed in its task.
In Imitation Learning, we do not use a simulator directly. A human expert generates a number of demonstrations, each corresponding to a sequence of successful actions. Once these demonstrations have been accumulated, a model is trained to generalize from them, so as to handle the greatest possible number of cases. Considering that we are generally dealing with a few hundred demonstrations, the challenge of generalization here is particularly arduous: the model must learn to handle cases it has never seen in a demonstration, without having access to the physical constraints of the world in which it evolves. For this reason in particular, Imitation Learning had remained a rather peripheral topic. And yet it held out the promise of training a model on a task in an agnostic way, using examples that fully correspond to reality, without the need for a highly advanced simulator.
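To make the paradigm concrete, here is a minimal behaviour-cloning sketch in PyTorch. The dataset tensors and network sizes below are purely illustrative placeholders (not data from the papers discussed here); the point is that training boils down to supervised regression of the expert's action from the observation, with no reward and no simulator in the loop.

```python
import torch
import torch.nn as nn

# Hypothetical demonstration dataset: a few hundred expert (observation, action) pairs.
# In practice observations would be images plus joint states; flat vectors keep it short.
demo_observations = torch.randn(500, 32)   # 500 recorded steps, 32-dim observation
demo_actions = torch.randn(500, 7)         # 7-DoF arm command recorded for each step

# A small policy network that imitates the expert.
policy = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 7),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Plain supervised learning: no reward, no simulator, just "do what the expert did".
for epoch in range(100):
    predicted_actions = policy(demo_observations)
    loss = nn.functional.mse_loss(predicted_actions, demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The whole difficulty, as noted above, is that those few hundred recorded steps are all the model will ever see of the world's physics.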
A number of works have recently revolutionized this field, works which are now applicable to industry and which require a moderate economic investment. It’s time to present these small revolutions 😊
ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, Zhao et al.
A joint work of Meta, Stanford and Berkeley, ALOHA is a breath of practicality and efficiency that has revolutionized the field, much to our delight. The strong point of this approach is that the researchers didn't just run a more or less exotic algorithm on a GPU cluster to extract pretty videos. On the contrary, the authors propose an integral approach: the hardware setup used to record demonstrations can be rebuilt, in addition to releasing the training and inference code. And this robotic manipulation setup is "low cost", with an estimated reproduction cost of less than $6,000 (provided you have a decent 3D printer).
ALOHA proposes the creation of interfaces enabling direct control of both robot arms by a human operator. With this approach, an expert can record a number of demonstrations on specific tasks:

This “low-cost” approach is already a small revolution in a world where even the smallest piece of equipment costs a small fortune. Here, the entire system costs the equivalent of an entry-level robotic arm. More generally, the advantages are numerous: versatility, with a very wide range of possible applications, ease of use, ease of repair and construction (at least relative to the world of robotics). It’s rare enough for researchers to question the performance and reproducibility of their work to this extent, but here, everything is available to reproduce this system quite easily. As soon as a task can be performed via the control interface, we can hope to train a model that generalizes this task in a satisfactory way.
What's special about the tasks performed is that they go far beyond the applications commonly seen in reinforcement learning. We're talking here about extremely precise actions, such as opening or closing a ziplock bag, grabbing a credit card from a wallet (Skynet will need pocket money, don't argue), inserting an electronic component, turning the pages of a book, even bouncing a ping-pong ball on a racket!

These tasks are interesting because they would typically be very hard (if not impossible) to learn in a classical approach, if only because they are very complex to reproduce in simulation (imagine a simulator reproducing the physical deformation of plastic). They are also high-precision tasks. They are possible here, precisely because we rely on demonstrations carried out by human operators, who will be able to react to the evolution of the problem to find the right action policy. We could postulate (ambitiously) that if a task can be demonstrated, it can be learned…
As for the model's internal architecture, we'll leave aside a few technical originalities here and simply note that we once again find the eternal Transformer (note: for Deep Learning, eternity began in 2017). A CNN encoder generates high-level representations of the images from each camera (four cameras in this case), which are then fed to a Transformer encoder that also takes the robot's command history as input. At the output, a Transformer decoder generates the next sequence of actions. Point of interest: a conditional VAE is used here, driving another Transformer encoder to learn a relevant compression of the input information.
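To fix ideas, here is a very rough sketch of that kind of pipeline (a shared CNN encoder per camera, a Transformer encoder over visual and state tokens, a Transformer decoder producing a chunk of future actions). It is our own toy reconstruction, not the official ACT code: the conditional VAE branch is omitted, and every dimension, module name and hyperparameter below is illustrative.

```python
import torch
import torch.nn as nn

class TinyACT(nn.Module):
    """Toy encoder-decoder in the spirit of ACT (not the official implementation)."""
    def __init__(self, n_cameras=4, joint_dim=14, act_dim=14, chunk=50, d_model=256):
        super().__init__()
        # CNN encoder shared across cameras: each image becomes one feature token.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        self.joint_proj = nn.Linear(joint_dim, d_model)   # current robot state as a token
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        # Learned queries: one per future action in the predicted chunk.
        self.action_queries = nn.Parameter(torch.randn(chunk, d_model))
        self.action_head = nn.Linear(d_model, act_dim)

    def forward(self, images, joints):
        # images: (B, n_cameras, 3, H, W), joints: (B, joint_dim)
        b, n, c, h, w = images.shape
        cam_tokens = self.backbone(images.view(b * n, c, h, w)).view(b, n, -1)
        tokens = torch.cat([cam_tokens, self.joint_proj(joints)[:, None]], dim=1)
        queries = self.action_queries[None].expand(b, -1, -1)
        decoded = self.transformer(src=tokens, tgt=queries)
        return self.action_head(decoded)    # (B, chunk, act_dim): the next action sequence

model = TinyACT()
actions = model(torch.randn(2, 4, 3, 96, 96), torch.randn(2, 14))
print(actions.shape)    # torch.Size([2, 50, 14])
```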
ALOHA was the first breakthrough, with results far superior to anything seen before. Below, ACT (the model behind ALOHA) is compared with previous approaches on two simulated and two real-world tasks. The results are much more convincing.

Diffusion Policies: Visuomotor Policy Learning via Action Diffusion, Chi et al.
Another work appeared in June 2023, establishing itself as a new reference in the field of Imitation Learning. Supported by MIT, Toyota and Columbia University, Diffusion Policies have established themselves as a highly effective and versatile approach, and have become a near-default baseline to test when faced with a new problem.
We remain in the same paradigm as before, i.e. training a model to solve a problem using a finite number of demonstrations. This approach is distinguished by its architectural approach, which makes it extremely effective. In particular:
- A formulation that enables closed-loop operation, constantly taking into account the history of the robot's recent actions to readjust the future movements produced by the model (see the sketch just after this list). The idea is that the model can continually adapt its actions, particularly in the event of a sudden disturbance such as the camera being blocked or an object being moved by a third party. This robustness, which we were able to observe in our own tests, is a very strong advantage of Diffusion Policies.
- Efficient visual conditioning. A sub-model is specialized in extracting representations from a sequence of images (the history of what the robot sees through its camera), which are then fed into the diffusion process that generates the robot's actions (we'll come back to this shortly). As this extraction is carried out only once per image, independently of the diffusion iterations, the resulting model can run in real time.
- A new, well-suited ad-hoc architecture for the diffusion itself (I promised we'd come back to this, and we will!), based on a Transformer.
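As announced in the first point of this list, here is what closed-loop, receding-horizon execution looks like in schematic form. This is our own sketch, not the authors' code: `policy`, `get_observation`, `execute` and the horizon constants are hypothetical placeholders. The principle is that the model predicts a whole sequence of actions, but only the first few are executed before re-planning from fresh observations.

```python
from collections import deque

OBS_HISTORY = 2      # past observations fed to the model (illustrative value)
PRED_HORIZON = 16    # actions predicted per call
EXEC_HORIZON = 8     # actions actually executed before re-planning

def run_episode(policy, get_observation, execute, n_steps=200):
    """Receding-horizon control loop: disturbances (occluded camera, moved object)
    are picked up through the refreshed observation history at the next re-plan."""
    history = deque(maxlen=OBS_HISTORY)
    for _ in range(OBS_HISTORY):
        history.append(get_observation())
    step = 0
    while step < n_steps:
        plan = policy(list(history))           # predicted sequence: (PRED_HORIZON, act_dim)
        for action in plan[:EXEC_HORIZON]:     # commit only to the near future
            execute(action)
            history.append(get_observation())  # fresh observation at every step
            step += 1
```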
And since we've kept our promise, let's take this opportunity to talk about diffusion. 😊
If you follow Deep Learning even a little, you won't have missed this new approach, which has been gaining ground over the past two years, particularly in generative AIs such as Stable Diffusion. The idea behind diffusion is to train a model to produce the information we're interested in, but with a constraint on the generation: start from totally noisy information and, progressively, by denoising, arrive at the desired result. We've already talked about these approaches in our research reviews, not least because, while they produce impressive results, they're too recent for us to fully understand why they work so well.
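For readers meeting diffusion for the first time, the training recipe is surprisingly simple. The sketch below is a generic DDPM-style objective applied to action sequences, written under our own assumptions (the `Denoiser` network, tensor shapes and schedule values are illustrative; the actual Diffusion Policy uses a more elaborate noise-prediction network): corrupt a demonstrated action chunk with a known amount of noise, and train the model to predict that noise given the noisy chunk, the timestep and an observation embedding.

```python
import torch
import torch.nn as nn

T = 100                                      # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)        # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, 0)   # cumulative signal retention

class Denoiser(nn.Module):
    """Stand-in noise predictor: noisy action chunk + timestep + observation -> noise."""
    def __init__(self, act_dim=7, horizon=16, obs_dim=64):
        super().__init__()
        self.act_dim, self.horizon = act_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + obs_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, act_dim * horizon),
        )

    def forward(self, noisy_actions, t, obs):
        x = torch.cat([noisy_actions.flatten(1), obs, t[:, None].float() / T], dim=1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

model = Denoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a (dummy) batch of demonstrated action chunks.
actions = torch.randn(32, 16, 7)             # demonstrated action sequences
obs = torch.randn(32, 64)                    # matching observation embeddings
t = torch.randint(0, T, (32,))               # random diffusion step per sample
noise = torch.randn_like(actions)
a_bar = alphas_bar[t].view(-1, 1, 1)
noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t, obs), noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```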
As usual in Deep Learning, an approach that works very well in one area will be tested in all the others. This is not always a very good idea (how many architectures aiming to "optimize" the Transformer's attention have ended up in limbo?), but as far as diffusion is concerned, we have to admit that this global approach does apply to subjects other than image generation.
Here, we discover that these diffusion models are an excellent tool for learning how to generate control information for a robot arm, conditioned on a history of actions and images. We won’t even attempt to justify this success. But it does seem that this approach really succeeds in generalizing new actions from demonstrations, and in a very effective way.
The diagram below shows the overall architecture. At the top left we have a history of actions and images. These observations are used to generate, by diffusion, the sequence of actions (diagram a, Diffusion Policy General Formulation). This sequence is generated by modelling the observations with either a good old-fashioned CNN (diagram b) or a Transformer (diagram c). The bottom left shows the diffusion process: on the far left, a series of totally noisy positions which, through successive denoising steps, is eventually (third image) refined into a series of valid positions to push the "T" to the right place.

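To complete the training sketch above, inference is the mirror image: start from a fully noisy action sequence and denoise it step by step, conditioned on the current observation embedding. Again, this is a simplified DDPM sampler of our own, reusing the `Denoiser` and noise schedule defined earlier, rather than the authors' actual sampler.

```python
import torch

@torch.no_grad()
def sample_actions(model, obs, betas, horizon=16, act_dim=7):
    """Reverse diffusion: noise -> valid action sequence, conditioned on observations."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, 0)
    x = torch.randn(obs.shape[0], horizon, act_dim)      # fully noisy action sequence
    for t in reversed(range(len(betas))):
        t_batch = torch.full((obs.shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, obs)                      # predicted noise at this step
        # Standard DDPM posterior mean update.
        x = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                              # action sequence to execute

# Hypothetical usage with the Denoiser trained above:
# plan = sample_actions(model, obs_embedding, betas)
```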
There are several points of interest here:
Diffusion models are multi-modal. They learn real variance in action strategies and remain very stable, unlike previous approaches, as shown in the diagram below, where IBC learns only one direction (as does LSTM-GMM in almost every case), and where BET and LSTM-GMM are very, even too, noisy:

Diffusion Policy works in positions. Most approaches to imitation-based robotic control model the robot arm's movement as velocity control, not position control. Here, surprisingly, position control works much better.

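The difference between the two action spaces can be summarized in a couple of lines (a toy, single-joint example with arbitrary numbers): with velocity commands the controller integrates the policy's output, so small errors accumulate along the trajectory, whereas with position commands each step directly re-anchors the target.

```python
dt = 0.02              # control period in seconds (illustrative 50 Hz loop)
current = 0.10         # current joint position (rad)

# Velocity control: the policy outputs a velocity, integrated by the controller.
velocity_cmd = 0.5     # rad/s predicted by the policy
next_from_velocity = current + velocity_cmd * dt

# Position control (what Diffusion Policy predicts): the policy outputs the
# target position for the next timestep directly.
position_cmd = 0.11    # rad predicted by the policy
next_from_position = position_cmd

print(next_from_velocity, next_from_position)
```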
Diffusion Policy is robust to reception latency. One of the strengths of this approach is that it offers an efficient method that is robust to certain disturbances. The authors have introduced a latency of up to 10 timesteps for a negligible loss of results. What we see here is an appreciable form of generalization, enabling us to project ourselves correctly towards practical use of this tool.
Diffusion Policy is stable during training. We have personally observed this in our tests: where the choice of hyperparameters is often a desperate gamble that can totally invalidate a training run, Diffusion Policy is extremely stable. And that's great news, because usually you have to iterate over these hyperparameters in the hope of finding the right combination, which wastes time and money.
Diffusion Policy is stable in the face of disturbances. This is an impressive point. While the robot performs a task in a real environment, the authors regularly hide the camera or move an object the robot needs to manipulate. The model adapts its strategy in real time without breaking down. Here again, this is a strong argument if we're interested in concrete, effective approaches.

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation, Fu et al.
[https://mobile-aloha.github.io/resources/mobile-aloha.pdf]
This latter work is much more recent, but well worth a look. There is no real mathematical or Deep Learning innovation here; rather, it's a window on the future of robotics built on these approaches. The authors have taken the ALOHA setup and evolved it into a demonstration environment that is portable and can be deployed anywhere, including in someone's home. This makes it possible to collect demonstrations in an open setting, and then train a model to interact with everyday objects via a complete mobile robot:

What’s most impressive is not the versatility of the actions performed, but the fact that the system remains affordable (around $30K) and opens the door to many interesting applications. What we have here is a system that is fast to execute (comparable to a human in normal movement), stable even when handling heavy objects, and self-sufficient in terms of energy supply (integrated battery).

Another point of interest: the authors reused the old ALOHA datasets to train the model jointly with these new applications, and observed a real improvement in results. This implies that each problem is not totally compartmentalized, and that a model can learn to transfer from one situation to another. It becomes conceivable to have an all-encompassing dataset acting as a foundation, enabling new tasks to be learned very efficiently. It should also be noted that Mobile ALOHA, in this case, makes do with fifty or so demonstrations to learn a task, somewhat fewer than diffusion policies, which require several hundred… Below are a few examples of tasks learned and reproduced by the model:

Conclusion
If we have to conclude, we can already observe that we are at a key moment in the field of AI-enhanced robotics. A field that used to be extremely costly to enter is now suddenly much more accessible (training a diffusion policy, for example, can be done in two or three days on a conventional GPU). The fields of application are also exploding, sweeping away a ton of problems we used to face, summarized in a caricatured way by: "if we can control the robot and collect demonstrations, we can hope to train an agent".
That said, be careful! These works are still very recent, and as usual in Deep Learning, training a model is one thing, but framing and controlling it to avoid dangerous situations is a job in its own right, one that should not be underestimated and is sometimes more complex than the training itself. It is therefore necessary to monitor the actions sent to the robot to avoid collisions (when possible!), or at the very least to prevent unwanted modes of operation. Testing such a system is also a challenge, especially if the real robot was trained on directly…
Nevertheless, robotics looks set to enter a new golden age. Where programming such a system used to be unbelievably cumbersome, accumulating demonstrations is much quicker, and addresses the specifics of the target problem directly through the operator. To be continued, as usual 😊