LAION Removes CSAM Links from Re-released Dataset Used in Stable Diffusion Training

August 31, 2024

LAION, the German research organization behind the dataset used to train models like Stable Diffusion, has announced the release of a new dataset, ReLAION-5B, which it claims has been “thoroughly cleaned” of links to suspected child sexual abuse material (CSAM). ReLAION-5B is a re-release of the original LAION-5B dataset, issued in response to earlier criticism and updated with “fixes” based on recommendations from organizations such as the Internet Watch Foundation, Human Rights Watch, and the Canadian Centre for Child Protection.

ReLAION-5B is available in two versions: ReLAION-5B Research and ReLAION-5B Research-Safe. The latter not only removes suspected CSAM links but also filters out additional not-safe-for-work (NSFW) content. LAION emphasizes that its datasets do not contain actual images; they are indexes of links to images and the associated alt text, all derived from the Common Crawl dataset of scraped web pages.

LAION’s decision to take the original LAION-5B dataset offline followed a December 2023 investigation by the Stanford Internet Observatory. The investigation found that LAION-5B, particularly the LAION-5B 400M subset, included at least 1,679 links to illegal images, along with other inappropriate content, including pornographic images and harmful social stereotypes.

Implications for AI Models and Future Use

The release of ReLAION-5B is intended to let researchers clean existing copies of LAION-5B by using the updated metadata to remove matching illegal content. LAION is urging all research institutions and organizations that still use the old LAION-5B dataset to move to the new version as soon as possible.
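To make the idea of cleaning a local copy concrete, here is a minimal Python sketch. It assumes the old copy is a list of link-and-alt-text records and that the links in the updated metadata have already been loaded into a set. The field names ("url", "caption"), the function name, and the example links are hypothetical; the article does not describe LAION's actual file format or matching procedure, so this should be read as an illustration rather than LAION's method.

```python
from typing import Iterable, Iterator, Set


def keep_rows_in_rerelease(
    old_rows: Iterable[dict],
    rereleased_urls: Set[str],
    url_field: str = "url",  # hypothetical field name, not LAION's schema
) -> Iterator[dict]:
    """Yield only the old rows whose link also appears in the updated metadata."""
    for row in old_rows:
        # Rows without a link, or with a link missing from the re-release, are dropped.
        if row.get(url_field) in rereleased_urls:
            yield row


# Hypothetical local copy of the old index: link plus alt text, no image data.
old_copy = [
    {"url": "https://example.com/cat.jpg", "caption": "a cat on a sofa"},
    {"url": "https://example.com/removed.jpg", "caption": "entry absent from the re-release"},
    {"url": "https://example.com/dog.jpg", "caption": "a dog in a park"},
]

# Hypothetical set of links read from the updated metadata.
rereleased = {"https://example.com/cat.jpg", "https://example.com/dog.jpg"}

cleaned = list(keep_rows_in_rerelease(old_copy, rereleased))
print([row["url"] for row in cleaned])
```

Keeping only the entries that survive in the re-release, rather than working from a list of removed links, means nobody has to distribute or handle a list of the illegal links themselves. Whether LAION's tooling works this way is not stated in the announcement as summarized here.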

This update may have significant implications for models previously trained on LAION-5B, such as those by Stability AI and Google. Notably, Runway, an AI startup that collaborated with Stability AI, recently removed its Stable Diffusion 1.5 model from the Hugging Face platform, a move that may be related to these recent developments.

A Call for Responsible Use

LAION reiterates that its datasets are designed strictly for research purposes and not for commercial use. Even so, some organizations may still use these datasets for broader applications, as has happened before. LAION’s steps to remove harmful content underline the importance of responsible data curation in the rapidly evolving field of AI.

In summary, ReLAION-5B represents LAION’s commitment to ensuring that its datasets are safe and ethically sound for researchers, and it marks a critical update for those relying on these resources for AI development.
