Former OpenAI Employee Condemns the Company’s Data Scraping Practices

A 3D-rendered image of the OpenAI logo, which consists of interlocking circles and letters, mounted on the exterior of a modern glass building under a clear blue sky.

An artificial intelligence researcher who worked at OpenAI as recently as August says the company violates copyright law.

Part of Suchir Balaji’s job was to gather enormous amounts of data for OpenAI’s GPT-4 multimodal AI. At the time, he treated it as a research project and didn’t think the product he was working on would ultimately become a chatbot with an integrated AI image generator.

“With a research project, you can, generally speaking, train on any data,” Balaji tells The New York Times. “That was the mindset at the time.”

Balaji says he was drawn to AI research because he thought the technology could do some good for the world. However, he now thinks it is causing more harm to society than good. The Berkeley graduate thinks that OpenAI is a threat to the very entities whose data it took to build its products, including individuals, businesses, and internet services.

“If you believe what I believe, you have to just leave the company,” Balaji tells The Times.

OpenAI builds products like ChatGPT and DALL-E by taking data from the open web and feeding it into a machine-learning program which learns from it. Balaji says it’s not a “sustainable model for the internet ecosystem.”

In a statement to The Times, OpenAI says: “We build our AI models using publicly available data, in a manner protected by fair use and related principles, and supported by longstanding and widely accepted legal precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”

However, the fair use argument for AI training has yet to be tested in court, and OpenAI is facing numerous lawsuits, predominantly from writers and publishers, including The New York Times.

Balaji argues that OpenAI’s practices don’t meet the fair use criteria, saying the company is making copies of copyrighted data and amalgamating them.

“The outputs aren’t exact copies of the inputs, but they are also not fundamentally novel,” he says. Balaji has published a mathematical analysis on his personal website to prove his theory that OpenAI violates copyright law.


Image credits: Header photo licensed via Depositphotos.
