OpenAI’s CTO Won’t Discuss Training Data for AI Video Generator Sora


The companies developing generative AI technologies are, typically, skirting around the rules regarding copyright: very few actually provide the public with concrete information on how they train their models.

That trend has not changed with OpenAI’s Sora, the company’s forthcoming text-to-video generative AI that has shown the ability to make lifelike video.

In an interview with the Wall Street Journal, OpenAI’s former CEO (she was CEO for two days when Sam Altman was momentarily ousted) and current CTO Mira Murati sat down to talk about the company’s new technology. Murati’s goal with the interview was likely to discuss the benefits of Sora and hype the coming technology. That certainly happens, but the Wall Street Journal’s Joanna Stern didn’t just throw softballs: she asked some difficult questions as well.

In a segment that’s about three minutes long, Stern questions Sora’s training set. Ahead of the interview, Stern provided OpenAI with some new text descriptions that would be used to generate videos for their interview.

“Every time I watch a Sora clip, I wonder what videos did this AI model learn from,” Stern says. “Did the model see any clips of Ferdinand to know what a bull in a china shop should look like? Was it a fan of Spongebob?”

While she asks these questions, clips from the animated movie Ferdinand and the children’s television show Spongebob play side by side with Sora’s output, and it’s difficult to not see the similarity. The next line of questioning was, naturally, what data was used to train Sora.

“We used publicly available data and licensed data,” Murati responds.

“So videos on YouTube?” Stern asks. “Videos from Facebook, Instagram? What about Shutterstock? I know you guys have a deal with them.”

“I’m actually not sure about that. If they were publicly available, publicly available to use, there might be that data, but I’m not sure. I’m not confident about it,” Murati says.

“I’m just not going to go into the details of the data that was used, but it was publicly available or licensed data.”

After the interview, off camera, Murati confirmed to Stern that the licensed data does include content from Shutterstock, but her unwillingness to discuss the topic on-camera is telling.

As impressive as generative AI is and as often as PetaPixel reports on new developments, the discussion around how these companies create visual content and the likelihood that it violates artists’ copyrights is just as constant. There have been cases that show the people behind AI image generators are specifically targeting certain artists in their training data under the guise that it is “publicly available.” Even when that’s not the case, the ease with which photographers can recreate their own photos, and with which iconic images can be recreated, tells the story on its own.

It could be speculated with a high level of confidence that these AI systems have seen and been trained on those copyrighted images, which is why they so easily recreate their own versions of them. However, speculation is hardly necessary. Midjourney’s founder admitted that its AI used a “hundred million” images as a training set without permission, and OpenAI has also admitted that it is “impossible” to train AI without leaning on copyrighted content.

All this said, Murati likely knows that the use of stolen content to train its AI is not something OpenAI wants to admit, which is perhaps why she refuses to answer Stern’s question. That refusal, however, makes it easy to argue that these companies care very little about the rights of human artists, and it shows the lengths to which they will go to further their own ends despite that cost.

Image credits: Header photo licensed via Depositphotos.