Background and challenges
With the democratization of digital platforms (social networks, messaging services), content moderation and toxicity detection have become crucial issues for both private players (X/Twitter, TikTok, YouTube, Instagram…) and public authorities (notably the European Commission, whose Digital Services Act has applied to the largest platforms since August 2023 and aims to establish a “safer digital space”). Manual moderation, defined as the review and removal of problematic user-generated content, faces three major obstacles:
- Volume and speed: millions of posts must be analyzed in real time, which makes manual review unviable both financially and organizationally.
- Variety of formats: text, images, videos and emojis each call for different review skills, which limits the efficiency of manual processes.
- Stress for moderators: repeated exposure to shocking content can lead to post-traumatic stress and other psychological disorders.
Faced with these challenges, automated moderation and toxicity detection is the only solution that is both scalable and ethical. We carried out an automated moderation project for one of our clients.
Work performed
- Prompt engineering: design of dedicated prompts for our in-house LLM to classify messages according to their degree of toxicity (see the prompt sketch after this list).
- Test architecture and datasets: setup of an architecture that automatically runs thousands of annotated messages and makes it easy to compare the performance of different prompt and template versions (see the evaluation sketch below).
- Integration into a scalable pipeline: implementation of a clustered architecture capable of handling massive message flows in parallel (see the parallel-processing sketch below).
- Connection to production: continuous deployment of the module within the operational environment, so that new content is handled immediately.
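As an illustration of the prompt-engineering step, here is a minimal sketch of a classification prompt. The category names, the prompt wording and the `query_llm` client are assumptions for illustration; the actual prompts and the in-house model are not reproduced here.

```python
# Hypothetical sketch of the classification prompt; categories, wording and the
# query_llm client are illustrative, not the actual system.

TOXICITY_PROMPT = """You are a content-moderation assistant.
Classify the message below into exactly one category:
- NON_TOXIC: no problematic content
- CONTEXTUAL: problematic language quoted or reported in a non-toxic context
  (call for help, alert, complaint quoting abuse)
- TOXIC: insults, harassment or hate speech
- SEVERE: threats, incitement to violence or other illegal content

Answer with the category name only.

Message: \"\"\"{message}\"\"\"
"""

def classify_toxicity(message: str, query_llm) -> str:
    """Fill in the prompt, call the (assumed) in-house LLM client, return the category."""
    answer = query_llm(TOXICITY_PROMPT.format(message=message))
    return answer.strip().upper()
```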
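The test architecture can be pictured as a simple evaluation loop over the annotated dataset: run one prompt version over every labeled message and report accuracy and the most frequent confusions. The CSV layout with `message` and `label` columns is a hypothetical format, and the loop reuses the `classify_toxicity` helper sketched above.

```python
# Minimal sketch of the evaluation loop for one prompt version.
import csv
from collections import Counter

def evaluate_prompt(dataset_path: str, query_llm) -> dict:
    """Assumes a CSV with 'message' and 'label' columns (hypothetical format)."""
    total, correct = 0, 0
    confusions = Counter()
    with open(dataset_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            predicted = classify_toxicity(row["message"], query_llm)
            total += 1
            if predicted == row["label"]:
                correct += 1
            else:
                confusions[(row["label"], predicted)] += 1
    return {
        "accuracy": correct / total if total else 0.0,
        "most_common_confusions": confusions.most_common(5),
    }
```

Running this loop for each prompt or template version gives a directly comparable score, which is what makes the iteration on prompts fast.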
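For the scalable pipeline, one way to picture the parallel processing is a pool of workers consuming the message flow concurrently. The sketch below uses a Python thread pool as a stand-in for the actual clustered architecture, and the `BLOCKED` set of categories is an assumption.

```python
# Sketch of parallel moderation of a message batch; the real deployment runs on a
# clustered architecture, a thread pool stands in here since LLM calls are I/O-bound.
from concurrent.futures import ThreadPoolExecutor

BLOCKED = {"TOXIC", "SEVERE"}   # assumed categories that should trigger blocking

def moderate(message: str) -> tuple[str, str]:
    """Classify one message and decide whether to block or allow it."""
    category = classify_toxicity(message, query_llm)  # helper and client sketched above
    return category, ("block" if category in BLOCKED else "allow")

def moderate_batch(messages: list[str], max_workers: int = 16) -> list[tuple[str, str]]:
    """Fan the batch out across a pool of workers and collect the decisions."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(moderate, messages))
```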
Benefits
- Contextual understanding: fine-grained detection of toxicity and correct classification of messages that contain problematic language in a non-toxic context (calls for help, alerts, quotation of abusive language as part of a complaint, etc.).
- Moderator protection: elimination of human intervention in content review, preserving user anonymity and the mental health of moderators.
- Scalability: mass processing without costs or headcount growing linearly with volume.
- Explainability of moderation: because each decision carries a toxicity category, moderation teams can see why a message was blocked or allowed (see the example after this list).
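As an illustration, a decision record handed to moderation teams might look like the following; the field names and wording are purely hypothetical.

```python
# Hypothetical example of the record attached to each moderation decision.
moderation_record = {
    "message_id": "msg-001",             # illustrative identifier
    "category": "CONTEXTUAL",            # one of the assumed categories above
    "action": "allow",
    "explanation": "Abusive language is quoted as part of a user complaint, "
                   "not directed at another user.",
}
```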
Results
The pipeline blocked 100% of toxic messages during a coordinated attack, whereas competing solutions had to suspend their service for lack of scalability or suitable tooling.
Problematic messages continue to be blocked at 100% with very rare false positives, and internal moderation teams have seen their workload drastically reduced.