Data Owners Are Increasingly Blocking AI Companies From Using Their IP

Jul 22, 2024

Matt Growcoot


Training data for generative AI models like Midjourney and ChatGPT is beginning to dry up, according to a new study.

The world of artificial intelligence moves fast. While court cases attempt to decide whether using copyrighted text, images, and video to train AI models is “fair use”, as tech companies argue, those same firms are already running out of new data to harvest.

As generative AI has proliferated and entered the mainstream, it has met a well-documented backlash, and many data owners, including photographers, have responded by denying access to their online data.

An MIT research group led the study, which looked at 14,000 web domains included in three major AI training data sets.

The study, published by the Data Provenance Initiative, discovered an “emerging crisis in consent” as online publishers pull up the drawbridge by not giving permission to AI crawlers.

The researchers looked at the C4, RefinedWeb, and Dolma data sets and found that five percent of all the data is now restricted. But that number jumps to 25 percent when looking at the highest-quality sources. Generative AI needs high-caliber data to produce good models.

Robots.txt, a decades-old method for website owners to stop automated bots from crawling their pages, is increasingly being deployed to block tech companies from collecting data.
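To illustrate how this works in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the `example.com` URL, the `SomeSearchBot` name, and the `can_crawl` helper are all assumptions made up for this example; `GPTBot` and `CCBot` are the user-agent names published by OpenAI and Common Crawl respectively. Note that the check only has an effect if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a photographer's site might serve:
# two AI-related crawlers are blocked, everything else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URL only; no network request is made.
    url = "https://example.com/portfolio/photo-1.jpg"
    for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
        verdict = "allowed" if can_crawl(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

A well-behaved crawler would run a check like this before every request and skip any URL that comes back blocked.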

According to The New York Times, some AI executives worry about hitting the “data wall”. Essentially, data owners, such as photographers, have become distrustful toward the AI industry and are making things difficult.

The AI industry has long been accused of profiteering from the work of artists, a theme that is the subject of a number of ongoing lawsuits, including those brought by photographers against the likes of Google, Midjourney, and Stable Diffusion.

However, robots.txt files are not legally binding. The Times likens them to a “no trespassing” sign for data, but there is no way of actually enforcing them.

OpenAI, which operates DALL-E and ChatGPT, says it respects robots.txt. So do major search engines and Anthropic. However, other players have been accused of ignoring them.

“Unsurprisingly, we’re seeing blowback from data creators after the text, images, and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods,” says Yacine Jernite, a machine learning researcher at Hugging Face.

However, there is a concern that if all AI training data must be obtained via licensing deals, then some players, such as researchers and civil society groups, will be excluded from participating in the technology.