AI Image Dataset is Pulled After Child Sex Abuse Pictures Discovered

Dec 20, 2023


The best-known dataset for training AI image generators, LAION-5B, has been taken offline after a Stanford study found thousands of child sex abuse images in its library.

The disturbing revelations come from a study by the Stanford Internet Observatory, which found more than 3,200 images of suspected child sex abuse.

LAION is an open-source index of online images and captions that has been used to train various AI image models but is most associated with Stable Diffusion.

LAION used Common Crawl for its scrape of the internet, which includes billions of copyrighted images taken by photographers.

Its most popular dataset, LAION-5B, is named for the more than five billion image-text pairs it contains.

Reckless

Hoovering up every image on the internet is controversial. And it unfortunately means that disturbing and illegal images are also included.

“Taking an entire internet-wide scrape and making that dataset to train models is something that should have been confined to a research operation, if anything, and is not something that should have been open-sourced without a lot more rigorous attention,” David Thiel, Stanford Internet Observatory’s chief technologist, tells the Associated Press, adding that many generative AI projects were “effectively rushed to market.”

Thiel and the team at Stanford had to be careful when conducting their study: viewing child sexual abuse material (CSAM) is illegal, whether it’s for research purposes or not.

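That meant they had to employ something called perceptual hashing, which extracts a compact digital signature from an image or video so that it can be compared against the signatures of known abuse material without anyone viewing the image itself. Some of the results were sent to Canada to be verified by colleagues.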
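To give a rough sense of how perceptual hashing works, here is a minimal Python sketch of one simple variant, often called an "average hash." It is an illustration only and not the tooling the Stanford team used, which this article does not describe. The function names and the representation of an image as a grid of grayscale values are assumptions made for the example.

```python
from typing import List

# Assumption for this sketch: an image is a grid of grayscale values (0-255).
Image = List[List[int]]


def shrink(image: Image, size: int = 8) -> Image:
    """Downscale to size x size by averaging equal blocks of pixels.

    Assumes the image height and width are multiples of `size`.
    """
    block_h = len(image) // size
    block_w = len(image[0]) // size
    small = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [
                image[y][x]
                for y in range(r * block_h, (r + 1) * block_h)
                for x in range(c * block_w, (c + 1) * block_w)
            ]
            row.append(sum(block) // len(block))
        small.append(row)
    return small


def average_hash(image: Image, size: int = 8) -> int:
    """Return a (size * size)-bit signature: 1 where a cell is brighter than the mean."""
    pixels = [p for row in shrink(image, size) for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two signatures."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # A 16x16 left-to-right gradient, and a slightly brightened copy of it.
    original = [[x * 16 for x in range(16)] for _ in range(16)]
    brighter = [[p + 10 for p in row] for row in original]
    print(hamming_distance(average_hash(original), average_hash(brighter)))
```

Two signatures that differ in only a few bits suggest visually similar images, so a match can be flagged by comparing numbers rather than by looking at pictures. Production systems use far more robust hashing schemes than this one.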

LAION, a nonprofit whose name stands for Large-scale Artificial Intelligence Open Network, released a statement saying it “has a zero-tolerance policy for illegal content, and in an abundance of caution, we have taken down the LAION datasets to ensure they are safe before republishing them.”

It also told 404 Media that it sources its data “from the freely available Common Crawl web index and offers only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them.”

There have been growing concerns over AI image generators and nude images involving minors. One platform linked to CSAM, Civitai, was dropped by its computing provider earlier this month.

