Context

  • Data audits carried out upstream by Datalchemy confirmed the feasibility of using École 42 data for analysis, reporting and, ultimately, machine learning.
  • The first step is to set up a data architecture that lets us extract a clean, controlled subset of production data without impacting online systems, and to build up a usable history.
  • The priority identified is the overhaul of the campus dashboards (replacing the existing Google reports), with an ambitious deadline of February 2024.

Issues

  • Very large data volumes (collected across all campuses worldwide).
  • Limited history: backups overwrite previous data.
  • Lack of validation and control over archived data.

Project objectives

  • Design an automated architecture to extract data from multiple sources.
  • Propose a simple system for defining and launching new extractions.
  • Track the evolution of data over time (versioning, timestamping).
  • Implement control and validation mechanisms for extracted elements.
  • Facilitate the reporting and handling of errors detected during pipeline runs.

Completed work

  • Set up dedicated S3 buckets:
    • one for dumps, with automatic file rotation (a lifecycle-rule sketch follows this list),
    • one for workflow data.
  • Created DVC repositories in production (with management of alternative branches) for dataset versioning; a retrieval sketch appears below.
  • Deployed a PostgreSQL database dedicated to tracking extraction workflows (a possible schema is sketched below).
  • Provisioned a Kubernetes cluster with:
    • an isolated namespace,
    • pipeline orchestration via Argo Workflows and scheduled CronWorkflows (a submission sketch closes this section).
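
The "automatic file rotation" on the dump bucket maps naturally onto an S3 lifecycle rule. A minimal sketch with boto3; the bucket name, prefix and 30-day retention are illustrative assumptions, not the production values:

    import boto3

    s3 = boto3.client("s3")

    # Expire dump files 30 days after creation so the bucket rotates itself;
    # bucket name, prefix and retention are placeholders.
    s3.put_bucket_lifecycle_configuration(
        Bucket="ft-dumps",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "rotate-dumps",
                    "Filter": {"Prefix": "dumps/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                }
            ]
        },
    )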
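
With datasets versioned in DVC, any past state can be pinned to a Git revision and read back later. A sketch using the dvc.api Python API; the repository URL, file path and revision name are hypothetical:

    import dvc.api

    # Read one specific, versioned state of a dataset straight from the DVC
    # repository; repo URL, path and rev below are illustrative placeholders.
    data = dvc.api.read(
        "datasets/campus_stats.csv",
        repo="https://example.com/42/data-pipeline.git",
        rev="dashboards-2024-02",  # Git tag or branch pinning the version
    )
    print(data[:200])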
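
The tracking database records the state of each extraction run. A possible shape for such a table, sketched with psycopg2; the connection string, table and column names are assumptions rather than the actual schema:

    import psycopg2

    conn = psycopg2.connect("dbname=workflow_tracking")  # DSN is illustrative
    with conn, conn.cursor() as cur:
        # One row per extraction run: what ran, when, and how it ended.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS extraction_runs (
                id          BIGSERIAL PRIMARY KEY,
                source      TEXT        NOT NULL,  -- upstream system extracted
                started_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
                finished_at TIMESTAMPTZ,
                status      TEXT NOT NULL DEFAULT 'running',  -- running / succeeded / failed
                error       TEXT  -- populated when a run fails
            )
        """)
    conn.close()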
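
Scheduled runs in the isolated namespace can be expressed as Argo CronWorkflows submitted through the Kubernetes custom-objects API. A sketch with the official kubernetes Python client; namespace, names, schedule and image are all illustrative:

    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() inside the cluster

    # A nightly extraction pipeline as an Argo CronWorkflow manifest;
    # namespace, names, schedule and image are placeholders.
    cron_workflow = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": "nightly-extraction", "namespace": "data-pipeline"},
        "spec": {
            "schedule": "0 2 * * *",  # every night at 02:00
            "workflowSpec": {
                "entrypoint": "extract",
                "templates": [
                    {
                        "name": "extract",
                        "container": {
                            "image": "registry.example.com/extractor:latest",
                            "command": ["python", "extract.py"],
                        },
                    }
                ],
            },
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="argoproj.io",
        version="v1alpha1",
        namespace="data-pipeline",
        plural="cronworkflows",
        body=cron_workflow,
    )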

Results

  • Operational infrastructure ready to feed new campus dashboards.
  • Reproducible and extensible pipeline for other use cases (reporting, ML).
  • Guaranteed data traceability and control thanks to DVC versioning and workflow logs.
  • A solid foundation for later deployment of advanced analysis and predictive models.