Context

  • Data audits carried out upstream by Datalchemy confirmed the feasibility of using École 42 data for analysis, reporting and, ultimately, machine learning.
  • The first step is to set up a data architecture that lets us extract a clean, controlled subset of production data without impacting online systems, and to build up a usable history.
  • The priority identified is the overhaul of the campus dashboards (replacing the existing Google reports), with an ambitious deadline of February 2024.

Issues

  • Very large data volumes (collected across all campuses worldwide).
  • Limited history: backups overwrite previous data.
  • Lack of validation and control over archived data.

Project objectives

  • Design an automated architecture to extract data from multiple sources.
  • Propose a simple system for defining and launching new extractions.
  • Track the evolution of data over time (versioning, timestamping).
  • Implement control and validation mechanisms for extracted elements.
  • Facilitate the reporting and handling of errors detected during pipeline runs.

Completed work

  • Set up dedicated S3 buckets:
    • one for dumps, with automatic file rotation (a lifecycle-rule sketch follows this list),
    • one for workflow data.
  • Created DVC repositories in production (with management of alternative branches) for dataset versioning; a retrieval sketch appears below.
  • Deployed a PostgreSQL database dedicated to tracking extraction workflows (a possible schema is sketched below).
  • Provisioned a Kubernetes cluster with:
    • an isolated namespace,
    • pipeline orchestration via Argo Workflows and scheduled CronWorkflows (a submission sketch closes this section).
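
The "automatic file rotation" on the dump bucket maps naturally onto an S3 lifecycle rule. A minimal sketch with boto3; the bucket name, prefix and 30-day retention are illustrative assumptions, not the production values:

    import boto3

    s3 = boto3.client("s3")

    # Expire dump files 30 days after creation so the bucket rotates itself;
    # bucket name, prefix and retention are placeholders.
    s3.put_bucket_lifecycle_configuration(
        Bucket="ft-dumps",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "rotate-dumps",
                    "Filter": {"Prefix": "dumps/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                }
            ]
        },
    )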
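
With datasets versioned in DVC, any past state can be pinned to a Git revision and read back later. A sketch using the dvc.api Python API; the repository URL, file path and revision name are hypothetical:

    import dvc.api

    # Read one specific, versioned state of a dataset straight from the DVC
    # repository; repo URL, path and rev below are illustrative placeholders.
    data = dvc.api.read(
        "datasets/campus_stats.csv",
        repo="https://example.com/42/data-pipeline.git",
        rev="dashboards-2024-02",  # Git tag or branch pinning the version
    )
    print(data[:200])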
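
The tracking database records the state of each extraction run. A possible shape for such a table, sketched with psycopg2; the connection string, table and column names are assumptions rather than the actual schema:

    import psycopg2

    conn = psycopg2.connect("dbname=workflow_tracking")  # DSN is illustrative
    with conn, conn.cursor() as cur:
        # One row per extraction run: what ran, when, and how it ended.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS extraction_runs (
                id          BIGSERIAL PRIMARY KEY,
                source      TEXT        NOT NULL,  -- upstream system extracted
                started_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
                finished_at TIMESTAMPTZ,
                status      TEXT NOT NULL DEFAULT 'running',  -- running / succeeded / failed
                error       TEXT  -- populated when a run fails
            )
        """)
    conn.close()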
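
Scheduled runs in the isolated namespace can be expressed as Argo CronWorkflows submitted through the Kubernetes custom-objects API. A sketch with the official kubernetes Python client; namespace, names, schedule and image are all illustrative:

    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() inside the cluster

    # A nightly extraction pipeline as an Argo CronWorkflow manifest;
    # namespace, names, schedule and image are placeholders.
    cron_workflow = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": "nightly-extraction", "namespace": "data-pipeline"},
        "spec": {
            "schedule": "0 2 * * *",  # every night at 02:00
            "workflowSpec": {
                "entrypoint": "extract",
                "templates": [
                    {
                        "name": "extract",
                        "container": {
                            "image": "registry.example.com/extractor:latest",
                            "command": ["python", "extract.py"],
                        },
                    }
                ],
            },
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="argoproj.io",
        version="v1alpha1",
        namespace="data-pipeline",
        plural="cronworkflows",
        body=cron_workflow,
    )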

Results

  • Operational infrastructure ready to feed new campus dashboards.
  • Reproducible and extensible pipeline for other use cases (reporting, ML).
  • Guaranteed data traceability and control thanks to DVC versioning and workflow logs.
  • A solid foundation for later deployment of advanced analysis and predictive models.