Former OpenAI Researcher Reveals Copyright Law Violations in ChatGPT's Training Data

By Marcus Delano Thompson

November 16, 2024 at 09:42 PM

A former OpenAI researcher, Suchir Balaji, has alleged that the company may have violated copyright law while collecting training data for ChatGPT and other AI models.

Balaji, who worked at OpenAI from 2020 to 2024, was directly involved in data collection for the company's large language models (LLMs). He admitted that the team scraped data indiscriminately from various sources, including copyrighted materials, paywalled content, and user-generated content, without considering legal implications.

Key revelations from Balaji's statements:

  • OpenAI assumed any freely available internet data was usable, regardless of copyright status
  • The training process involves making unauthorized copies of copyrighted material
  • AI companies' current 'fair use' claims may not be legally valid
  • The technology could harm the commercial viability of original content creators

Balaji resigned in August 2024, citing concerns about the negative impact of AI on the internet ecosystem. He specifically pointed to declining traffic on platforms like Stack Overflow as evidence of AI systems potentially damaging their original data sources.

While OpenAI has secured licensing agreements with some publishers, it still faces legal challenges from authors whose works were used without consent. Balaji advocates for increased AI regulation to address these copyright concerns and protect content creators.
