OpenAI Claims it is Impossible to Train AI Without Using Copyrighted Content

OpenAI fires back, saying that it is impossible to train an AI model without using copyrighted content.

OpenAI, the creator of ChatGPT and DALL-E, is fighting a major legal battle against The New York Times over what The Times alleges was “unlawful use” of its content during OpenAI’s extensive AI training process.

OpenAI and one of its leading investors, Microsoft, were named co-defendants in the lawsuit, and OpenAI has been firing back.

As The Guardian reports, in a submission to the House of Lords Communications and Digital Select Committee, OpenAI claims that it is impossible to train an AI model without using copyrighted content. A relevant excerpt is below:

We believe that AI tools are at their best when they incorporate and represent the full diversity and breadth of human intelligence and experience. In order to do this, today’s AI technologies require a large amount of training data and computation, as models review, analyze, and learn patterns and concepts that emerge from trillions of words and images. OpenAI’s large language models, including the models that power ChatGPT, are developed using three primary sources of training data: (1) information that is publicly available on the internet, (2) information that we license from third parties, and (3) information that our users or our human trainers provide. Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

OpenAI has also published a blog post in response to the New York Times lawsuit, claiming that the lawsuit is “without merit.”

“While we disagree with the claims in The New York Times lawsuit, we view it as an opportunity to clarify our business, our intent, and how we build our technology,” the AI tech company explains.

OpenAI goes on to explain that its position is summed up in four points: it collaborates with news organizations and creates new opportunities; training an AI model is fair use, but OpenAI provides an opt-out “because it’s the right thing to do”; the “regurgitation” outlined by The Times is a “rare bug”; and “The New York Times is not telling the full story.”

It is unusual for an ongoing legal battle between such large organizations to result in such a public spat, but alas. Drilling down on OpenAI’s second point, that training is fair use, the company explains, “Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”

“That being said, legal right is less important to us than being good citizens. We have led the AI industry in providing a simple opt-out process for publishers (which The New York Times adopted in August 2023) to prevent our tools from accessing their sites,” OpenAI continues.

The company also notes that “other regions and countries” have laws allowing AI companies to train models using copyrighted content, which it cites as a driving force for innovation and progress in AI.

An article by The Telegraph explains OpenAI’s controversial position that preventing the use of books and news for training AI models would “doom” AI at large.

Using the vast treasure trove of copyrighted content available online to train an AI model is obviously faster, cheaper, and easier than paying for content or using only material available in the public domain. Even so, Adobe has shown that training an AI model without relying on copyrighted content is possible.

In its submission to the House of Lords Communications and Digital Select Committee, OpenAI explains, “In training our models, OpenAI complies with the requirements of all applicable laws, including copyright laws.” The company goes on to describe its relatively new opt-out process for websites. While the opt-out is a compelling option for websites now, the horse has already left the barn.

Although OpenAI is working on “mutually beneficial” arrangements and partnerships with major content providers now, the company has already performed considerable scrapes and done a ton of heavy lifting on training its models.

This touches on a major issue for many, including a cohort of authors that includes John Grisham and George R.R. Martin, who, alongside 15 other authors, sued OpenAI last September, alleging “systematic theft on a mass scale.”

As has often been the case in other “Wild West” scenarios of technological advancements, the AI space may end up like so many others before it, with the companies first through the door being happy enough to close it behind them, curtailing competition. Potential new players are likely to be utterly handcuffed by new regulations and legislation that, at this juncture, feel inevitable.

What will prove instrumental in the ongoing war between publishers and AI companies is the precise value of publicly accessible copyrighted content and whether training an AI model using copyrighted content actually constitutes a copyright violation. Companies like OpenAI argue that because copyrighted content is not recreated in its entirety, at least not often, nothing illegal has occurred. The New York Times argues that OpenAI and others have stolen and illegally used content worth “billions of dollars in statutory and actual damages…”

“As an actual and proximate result of the unauthorized use of The Times’s trademarks, The Times has suffered and continues to suffer harm by, among other things, damaging its reputation for accuracy, originality, and quality, which has and will continue to cause it economic loss,” The Times claims in its nearly 70-page lawsuit.

“We regard The New York Times’ lawsuit to be without merit. Still, we are hopeful for a constructive partnership with The New York Times and respect its long history, which includes reporting the first working neural network over 60 years ago and championing First Amendment freedoms,” OpenAI responds in its new blog post.


Image credits: Header photo licensed via Depositphotos.
