Former OpenAI Researcher Alleges Copyright Law Violations in ChatGPT's Training Data

November 16, 2024 at 09:42 PM

A former OpenAI researcher, Suchir Balaji, has alleged that the company may have violated copyright law while collecting training data for ChatGPT and other AI models.

Balaji, who worked at OpenAI from 2020 to 2024, was directly involved in data collection for the company's large language models (LLMs). He admitted that the team scraped data indiscriminately from various sources, including copyrighted materials, paywalled content, and user-generated content, without considering legal implications.

Key revelations from Balaji's statements:

OpenAI assumed any freely available internet data was usable, regardless of copyright status
The training process involves making unauthorized copies of copyrighted material
AI companies' current 'fair use' claims may not be legally valid
The technology could harm the commercial viability of original content creators

Balaji resigned in August 2024, citing concerns about the negative impact of AI on the internet ecosystem. He specifically pointed to declining traffic on platforms like Stack Overflow as evidence of AI systems potentially damaging their original data sources.

While OpenAI has secured licensing agreements with some publishers, it still faces legal challenges from authors whose works were used without consent. Balaji advocates for increased AI regulation to address these copyright concerns and protect content creators.
