What to do when data is lacking?
Applying artificial intelligence research to industry is something everyone wants to do, but most who try run into serious obstacles. Every player in the field has experienced the frustration of seeing impressive results from the scientific world degrade significantly when applied to a "real-world" problem. There are many reasons for this, including the theoretical youth of deep learning and the difficulty of identifying the axes along which a model must be robust before industrialization. But if there is one constant among these obstacles, it is the lack of data, which eventually brings projects to a halt. We have been working on these application limits for a long time and can now offer a sound, effective methodology for iterating on a deep learning solution, both for training and for testing models. We propose to present approaches based on synthetic data and, beyond that, a methodology for working with confidence.
1 There is never enough data
Any enthusiast familiar with the subject knows that AI needs a tremendous amount of data. The axiom of "always more" quickly becomes tiresome. In the academic world, major players such as Google, with its JFT dataset, have accumulated hundreds of millions of images to train their networks. Such a quantity is often out of reach for an industrial player. It is also worth remembering that data must be annotated by a human to be usable. Even if this annotation can be accelerated significantly (another one of our offerings), the annotations must be rigorously controlled by experts, because they are used both to train (create) our deep learning models and to qualify their prediction quality. In the latter case, data annotated too quickly leads to a model that gives the illusion of good performance until the fateful day of industrialization, when the model collapses.

But beware! We don't just want "a lot of data." We want this data to be varied enough to reproduce the distribution of the problem we want to address. The data must therefore cover sufficiently varied scenarios, in reasonable balance. Yet regardless of the industrial subject, the cases where data is lacking are typically those where acquisition is most complex or costly: rare events, cases tied to the operation of an industrial process that is expensive to interrupt, and so on.
The task quickly seems impossible and can push teams toward other ways of improving an AI model, such as working on a new deep learning architecture (a new detection backbone, for example) or tuning hyperparameters (for instance via a Bayes+HyperBand approach, or Tensor Programs V). These avenues for improvement are real, but in practice they generally yield much smaller gains than richer, better-controlled data.
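To make the hyperparameter route concrete, here is a minimal sketch of such a search loop. It uses Optuna's TPE sampler together with its Hyperband pruner as a stand-in for the Bayes+HyperBand combination mentioned above; the search space and the objective function are purely illustrative placeholders.

```python
# Illustrative only: Optuna's TPE sampler (Bayesian) plus its Hyperband pruner
# stand in for the "Bayes + HyperBand" combination mentioned above.
import optuna


def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space for a small vision model.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    width = trial.suggest_categorical("width", [64, 128, 256])

    score = 0.0
    for epoch in range(20):
        # Placeholder for one real epoch of training followed by validation.
        score = 1.0 - (lr - 1e-3) ** 2 - 10 * weight_decay + 0.001 * width / 256 + 0.005 * epoch

        # Report intermediate scores so Hyperband can stop weak trials early.
        trial.report(score, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score


study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=20),
)
study.optimize(objective, n_trials=50)
print(study.best_params)
```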
2 An Obvious Solution: Data Generation
An approach that has emerged in recent years in the scientific world and has been attempted with varying degrees of success in the industrial world is to generate data to augment datasets.
This approach has become a staple in research, particularly since OpenAI's work on domain randomization, in which researchers trained a vision model for a robotic task entirely in simulation and then applied it to the real world (Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World, Tobin et al., https://arxiv.org/abs/1703.06907).
More recently, researchers from Berkeley and Carnegie Mellon used these approaches to train a quadruped robot to navigate open environments (Coupling Vision and Proprioception for Navigation of Legged Robots, Fu et al., https://arxiv.org/abs/2112.02094). A minimal sketch of the randomization idea is given below.
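As an illustration of domain randomization, here is a small, self-contained sketch: each training sample is rendered under scene parameters drawn from deliberately wide ranges, so the real world ends up looking like just another variation. The parameter names and the render() hook are hypothetical placeholders for whatever simulator is actually used.

```python
# Minimal sketch of domain randomization: each sample is rendered under scene
# parameters drawn from deliberately wide ranges, so the real world ends up
# looking like "just another variation" to the model. Parameter names and the
# render() hook are hypothetical placeholders for the simulator actually used.
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    light_intensity: float     # arbitrary units
    light_azimuth_deg: float
    camera_distance_m: float
    texture_id: int
    sensor_noise_std: float    # noise added after rendering


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one random scene configuration from wide, hand-chosen ranges."""
    return SceneParams(
        light_intensity=rng.uniform(0.2, 3.0),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
        camera_distance_m=rng.uniform(0.5, 2.5),
        texture_id=rng.randrange(0, 500),
        sensor_noise_std=rng.uniform(0.0, 0.05),
    )


rng = random.Random(42)
for _ in range(3):
    params = sample_scene(rng)
    # image, labels = render(params)   # hypothetical call to the simulator
    print(params)
```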
While robotics has embraced this approach, problems such as object detection and localization, or anomaly detection, can also benefit from these methodologies. One could even argue that we are approaching the paradigm of Self-Supervised Learning, which aims to train a model on more general data to learn low-level representations, and then specialize it on a specific task.
This approach also enables more extensive testing of a model: first, by generating controlled variations in the data and observing how the model's results change; second, by freeing up more real data for testing rather than training, which statistically strengthens the evaluation metrics.
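As a sketch of what "controlled variations" can look like in practice, the snippet below sweeps a single known perturbation (brightness) over a test set and records how the metric degrades; the images, labels and predict() function are dummy placeholders for the real model and data.

```python
# Sketch of "controlled variance" testing: sweep a single, known perturbation
# (here, brightness) over a test set and watch how the metric degrades.
# The images, labels and predict() function are dummy placeholders.
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((32, 64, 64, 3))       # dummy test images in [0, 1]
labels = rng.integers(0, 2, size=32)       # dummy binary labels


def predict(batch: np.ndarray) -> np.ndarray:
    """Placeholder model: replace with the real inference call."""
    return (batch.mean(axis=(1, 2, 3)) > 0.5).astype(int)


for brightness in (0.6, 0.8, 1.0, 1.2, 1.4):
    perturbed = np.clip(images * brightness, 0.0, 1.0)
    accuracy = float((predict(perturbed) == labels).mean())
    print(f"brightness x{brightness:.1f} -> accuracy {accuracy:.2f}")
```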
So, does generating synthetic data suffice to overcome problems in industrial settings? Unfortunately, and unsurprisingly, no.
3 Naive approaches are doomed to fail
We've seen it many times. For instance, synthetic data, such as photorealistic images, is generated along with its annotations, and an AI model trained on this synthetic data seems to perform well, yielding good scores.
However, when the model is applied to real data, the results collapse dramatically. Here we encounter the demon of distribution drift that haunts deep learning...
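One way to put a number on that gap, before any training, is to compare feature statistics of synthetic and real samples. The sketch below computes a Fréchet-style distance between the two sets of embeddings; the embeddings are random placeholders standing in for features extracted by a fixed pretrained encoder.

```python
# Rough sketch of quantifying the synthetic-to-real gap: compare the first two
# moments of feature embeddings with a Frechet-style distance. The embeddings
# here are random placeholders for features from a fixed pretrained encoder.
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
feat_real = rng.normal(0.0, 1.0, size=(500, 128))
feat_synth = rng.normal(0.3, 1.2, size=(500, 128))   # deliberately shifted


def frechet_distance(a: np.ndarray, b: np.ndarray) -> float:
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a, cov_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real   # drop tiny imaginary parts
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean))


print(f"Frechet-style gap: {frechet_distance(feat_real, feat_synth):.2f}")
```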
A significant lead comes from substantial work published in 2021 by Baradad et al. (Learning to See by Looking at Noise, https://arxiv.org/abs/2106.05963). In this publication, the authors build on an intuition from representation learning: neural networks learn hierarchical representations of information. They demonstrate that it is possible to pre-train a model on images barely related to the target problem (e.g., noise images or irrelevant generations), provided these images help the network build fundamental, useful low-level representations. The founding idea is as follows: we don't want synthetic data that mimics the target problem; we want a hierarchy of synthetic data in which we can embed the target problem, which then becomes a special case of our generation. With a solid methodology, we can then iterate on concrete improvements to the model.
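In the spirit of Baradad et al. (though with a toy generator, not the ones from the paper), here is a sketch of producing cheap procedural images, smoothed random fields, that still carry useful low-level structure for pre-training an encoder.

```python
# In the spirit of Baradad et al., but with a toy generator rather than theirs:
# cheap procedural images (smoothed random fields) that still carry useful
# low-level structure for pre-training an encoder.
import numpy as np
from scipy.ndimage import gaussian_filter


def random_field_image(rng: np.random.Generator, size: int = 64) -> np.ndarray:
    """One RGB image built from band-limited noise, values in [0, 1]."""
    channels = []
    for _ in range(3):
        noise = rng.normal(size=(size, size))
        smooth = gaussian_filter(noise, sigma=rng.uniform(1.0, 6.0))
        smooth = (smooth - smooth.min()) / (smooth.max() - smooth.min() + 1e-8)
        channels.append(smooth)
    return np.stack(channels, axis=-1)


rng = np.random.default_rng(0)
pretrain_batch = np.stack([random_field_image(rng) for _ in range(16)])
print(pretrain_batch.shape)   # (16, 64, 64, 3)
```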
4 An indispensable methodology
How do we implement such a project? Methodology is, unsurprisingly, the key to success if we want to guarantee a genuine improvement of the model for our clients.
1. Understanding the distribution
A dataset is fundamentally a sample representing a broader distribution: that of the problem we want to address. A distribution is analyzed, measured, and mapped out through the available data. An initial analysis of this distribution, based on the data and combined with domain expertise on the target problem, is obviously essential before anything else.
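As a first, very modest audit of that distribution, the sketch below counts how the annotated attributes of a dataset combine, making under-represented cases explicit; the attribute names and values are hypothetical.

```python
# Sketch of a first distribution audit: count how annotated attributes combine
# across the dataset so under-represented cases become explicit. Attribute
# names and values are hypothetical.
from collections import Counter
from itertools import product

# Hypothetical per-image metadata, e.g. parsed from annotation files.
samples = [
    {"lighting": "day", "defect": "scratch"},
    {"lighting": "day", "defect": "none"},
    {"lighting": "day", "defect": "none"},
    {"lighting": "night", "defect": "none"},
]

counts = Counter((s["lighting"], s["defect"]) for s in samples)

# Enumerate the full grid of combinations to surface the gaps (zero counts).
for combo in product(["day", "night"], ["none", "scratch", "dent"]):
    print(f"{combo}: {counts.get(combo, 0)} samples")
```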
2. Embedding the distribution
Let's start by acknowledging a limitation: we cannot create data that is completely indistinguishable from the missing data, so there is no point in spending months chasing a perfection that is out of reach.
Even if you cannot visually perceive any difference between your synthetic data and your real data, a deep learning model can perfectly well exploit "invisible" differences related to a sensor model, a form of optical noise, etc. It is better, therefore, to embed the distribution within a properly framed hierarchical approach, following in the footsteps of Baradad et al.
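One common, partial mitigation (not prescribed by Baradad et al., simply a complementary practice) is to apply an explicit camera model to synthetic renders so the network cannot use the absence of sensor artifacts as a shortcut. The sketch below adds vignetting, shot noise and read noise; all parameter values are illustrative.

```python
# Sketch of closing "invisible" gaps: apply a simple camera model (vignetting,
# shot noise, read noise) to synthetic renders so the network cannot use the
# absence of sensor artifacts as a shortcut. Parameter values are illustrative.
import numpy as np


def apply_sensor_model(img: np.ndarray, rng: np.random.Generator,
                       photons: float = 500.0, read_noise: float = 0.01,
                       vignette_strength: float = 0.3) -> np.ndarray:
    """img: float array in [0, 1], shape (H, W, 3)."""
    h, w, _ = img.shape

    # Radial vignetting: darker towards the corners.
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt((yy / h - 0.5) ** 2 + (xx / w - 0.5) ** 2)
    vignette = 1.0 - vignette_strength * (r / r.max()) ** 2
    img = img * vignette[..., None]

    # Poisson shot noise followed by Gaussian read noise.
    img = rng.poisson(img * photons) / photons
    img = img + rng.normal(0.0, read_noise, size=img.shape)
    return np.clip(img, 0.0, 1.0)


rng = np.random.default_rng(0)
synthetic = np.full((64, 64, 3), 0.7)
print(apply_sensor_model(synthetic, rng).mean())
```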
3. Controlling the generation
Generation that cannot be finely controlled is of very limited interest. Sooner or later, we will want to generate specific cases corresponding to the "gaps" present in the data.
However, we will only discover these specific cases by testing the model against real or synthetic data and identifying the variations where the model fails. Let's set aside Generative Adversarial Networks and VQ-Variational Autoencoders, at least initially, in favor of a deterministic, perfectly controlled system. It will always be possible, at a later stage, to use these tools sparingly, for example for domain transfer, or for conditioned generation akin to recent diffusion-model approaches.
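Here is a minimal sketch of what "deterministic and controlled" can mean: every sample is fully described by explicit parameters plus a seed, so any failure case found during testing can be regenerated and densified at will. The GenerationSpec fields and the scene description are hypothetical.

```python
# Sketch of a deterministic, finely controlled generator: every sample is
# fully described by explicit parameters plus a seed, so a failure case found
# in testing can be regenerated and densified at will. The GenerationSpec
# fields and the scene description are hypothetical.
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenerationSpec:
    seed: int
    object_count: int
    occlusion_level: float   # 0 = no occlusion, 1 = heavy occlusion
    blur_sigma: float


def generate_scene(spec: GenerationSpec) -> dict:
    """Return a reproducible scene description (stand-in for a real renderer)."""
    rng = random.Random(spec.seed)
    objects = [
        {"x": rng.uniform(0.0, 1.0), "y": rng.uniform(0.0, 1.0),
         "occluded": rng.random() < spec.occlusion_level}
        for _ in range(spec.object_count)
    ]
    return {"params": asdict(spec), "objects": objects}


# A failure mode identified during testing (say, heavy occlusion) can be
# targeted directly by sweeping the relevant parameter.
for occlusion in (0.2, 0.5, 0.8):
    spec = GenerationSpec(seed=7, object_count=3,
                          occlusion_level=occlusion, blur_sigma=1.0)
    print(generate_scene(spec)["params"])
```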
4. Having a compass
This is probably the most important aspect. No one wants to wait months for synthetic data only to realize that it does not improve the model at all. With our clients, we aim to establish a "compass": a simple model related to the target data on which we can observe improvement. This lets us iterate and verify that the synthesis work is heading in the right direction.
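A compass might look like the sketch below: a small, fixed proxy model is re-trained on each new version of the synthetic data and always scored on the same frozen set of real samples. The feature vectors and the two generator versions are random placeholders.

```python
# Sketch of a "compass": a small, fixed proxy model is re-trained on each new
# version of the synthetic data and always scored on the same frozen real set.
# Feature vectors here are random placeholders for embeddings of real images.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Frozen real validation set: it never changes between iterations.
real_val_x = rng.normal(size=(200, 32))
real_val_y = (real_val_x[:, 0] > 0).astype(int)


def compass_score(synth_x: np.ndarray, synth_y: np.ndarray) -> float:
    """Train the cheap proxy on synthetic data, score it on real data."""
    proxy = LogisticRegression(max_iter=1000)
    proxy.fit(synth_x, synth_y)
    return proxy.score(real_val_x, real_val_y)


# Two hypothetical generator versions: v1 labels its samples with a biased
# rule, v2 follows the same rule as the real data, so v2 should score higher.
synth_v1_x = rng.normal(size=(500, 32))
synth_v1_y = (synth_v1_x[:, 0] > 0.8).astype(int)
synth_v2_x = rng.normal(size=(500, 32))
synth_v2_y = (synth_v2_x[:, 0] > 0.0).astype(int)

for name, x, y in [("v1", synth_v1_x, synth_v1_y), ("v2", synth_v2_x, synth_v2_y)]:
    print(name, round(compass_score(x, y), 3))
```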
5. Adapting training
Synthetic data serves a dual purpose: pre-training a model and supplementing training with specific cases. Model training must therefore be weighted between the different datasets, according to the neural network architecture and the objective, in order to optimize the final quality of the model.
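As one possible way to implement that weighting (a sketch assuming a PyTorch pipeline), the snippet below mixes a synthetic and a real dataset with a tunable sampling ratio; the tensors and the 70/30 split are purely illustrative, and the second tensor simply marks the data source so the mix can be checked.

```python
# Sketch of weighting synthetic vs. real data during training, assuming a
# PyTorch pipeline: a WeightedRandomSampler draws roughly 70% synthetic and
# 30% real samples per epoch. Tensors and the 70/30 ratio are illustrative;
# the second tensor only marks the data source (0 = synthetic, 1 = real).
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

synthetic = TensorDataset(torch.randn(1000, 3, 64, 64),
                          torch.zeros(1000, dtype=torch.long))
real = TensorDataset(torch.randn(200, 3, 64, 64),
                     torch.ones(200, dtype=torch.long))
combined = ConcatDataset([synthetic, real])

# Per-sample weights chosen so each source contributes the desired proportion.
weights = torch.cat([
    torch.full((len(synthetic),), 0.7 / len(synthetic)),
    torch.full((len(real),), 0.3 / len(real)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=32, sampler=sampler)

images, sources = next(iter(loader))
print(images.shape, sources.float().mean())   # roughly 0.3 of each batch is real data
```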