Former OpenAI Researcher Reveals Copyright Law Violations in ChatGPT's Training Data
A former OpenAI researcher, Suchir Balaji, has alleged that the company may have violated copyright law while collecting training data for ChatGPT and other AI models.
Balaji, who worked at OpenAI from 2020 to 2024, was directly involved in data collection for the company's large language models (LLMs). He admitted that the team scraped data indiscriminately from various sources, including copyrighted materials, paywalled content, and user-generated content, without considering legal implications.
Key revelations from Balaji's statements:
- OpenAI assumed any freely available internet data was usable, regardless of copyright status
- The training process involves making unauthorized copies of copyrighted material
- AI companies' current 'fair use' claims may not be legally valid
- The technology could harm the commercial viability of original content creators
Balaji resigned in August 2024, citing concerns about the negative impact of AI on the internet ecosystem. He specifically pointed to declining traffic on platforms like Stack Overflow as evidence that AI systems may be damaging the very sources their training data came from.
While OpenAI has secured licensing agreements with some publishers, it still faces legal challenges from authors whose works were used without consent. Balaji advocates for increased AI regulation to address these copyright concerns and protect content creators.