Ex-OpenAI Researcher Exposes Copyright Law Violations in ChatGPT Training Process
A former OpenAI researcher, Suchir Balaji, has alleged that the company may have violated copyright law while training ChatGPT and other AI models. After spending four years at OpenAI, Balaji left in August 2024 over concerns about the company's data collection practices and their impact on the internet.
During his tenure, Balaji helped collect vast amounts of data to train OpenAI's large language models (LLMs). The company scraped data from various sources, including pirate sites, paywalled news content, and social media platforms, without considering copyright implications.
Key concerns raised by Balaji:
- Training AI models involves making unauthorized copies of copyrighted data
- The practice may not qualify as "fair use" under copyright law
- AI systems are threatening the viability of original content creators
- Popular websites like Stack Overflow are experiencing significant traffic drops
- Current AI regulation is insufficient to address these issues
While OpenAI has secured licensing agreements with some newspapers, it still faces lawsuits from authors whose works were used without consent. Balaji argues that regulation is necessary to address the growing concerns about AI's impact on copyright and content creation.
"If you believe what I believe, you have to just leave the company," Balaji told The New York Times, emphasizing his conviction that OpenAI's current trajectory may do society more harm than good.