In generative artificial intelligence, digital news stories have been a critical resource for teaching machines how to respond to human queries effectively. Tech companies such as OpenAI have historically used news stories to develop their AI models without paying for them. However, as demand for cutting-edge AI models intensifies, newspaper publishers and other data owners are asserting their rights and demanding compensation for the use of their data in AI training sets.
Since August, more than 535 news organizations, including prominent names like The New York Times, Reuters, and The Washington Post, have taken steps to block the collection of their content for ChatGPT training. Discussions now center on establishing payment arrangements with publishers. Such agreements would allow ChatGPT to include links to specific news stories in its responses, benefiting newspapers through direct compensation and potentially increased web traffic.
OpenAI reached an agreement in July to license content from the Associated Press as training data. The ongoing discussions with publishers have also touched on licensing content for training, but they are primarily focused on integrating news stories into ChatGPT responses.
Beyond newspapers, other data owners are also seeking compensation. Reddit, a popular social platform, has reportedly engaged with leading generative AI companies to discuss payment for its data. In the absence of a deal, Reddit is considering blocking search crawlers from Google and Bing, which could limit its site's visibility in search results. Additionally, Elon Musk began charging for bulk access to Twitter posts, which were previously available to researchers for free, citing unauthorized use by AI companies.
These developments underscore a growing sense of urgency over who benefits from online information. As generative AI promises to revolutionize how users interact with the internet, publishers and companies are pushing for fair payment for their data, which they see as an existential issue.
For example, after OpenAI released GPT-4 in March, traffic to the coding community Stack Overflow fell by 15%. Prashanth Chandrasekar, CEO of Stack Overflow, expressed concerns that AI models were trained on the site's data. Stack Overflow recently laid off 28% of its staff as a consequence.
In addition to payment demands, leading AI firms are facing copyright lawsuits from authors, artists, and coders seeking damages for infringement and a share of profits. Trade groups are also advocating for the right to collectively bargain with tech companies.
OpenAI's decision to negotiate with newspapers and data owners may be an attempt to reach agreements before legal decisions clarify tech companies' obligations to license and pay for content. The company maintains that its practices have not violated copyright law and that any deals would be for future access to otherwise inaccessible content.
The rush to build generative AI has drawn significant venture capital: nearly $16 billion was invested in the sector in the first three quarters of 2023. Because AI development remains expensive, every component, including data, is crucial.
Historically, data was the only free part of this equation, as services like Common Crawl did not charge tech companies for using their archives. However, this landscape is changing, with sites like Reddit, Stack Overflow, and Wikipedia implementing defensive measures and launching paid portals for AI companies seeking training data.
The ongoing negotiations indicate that companies are positioning themselves to secure compensation, and they underscore the need for publishers to unite in their demands. Danielle Coffey, president and CEO of the News/Media Alliance, believes that policymakers recognize the need for licensing deals and copyright protection for publishers.
The landscape of AI data access is evolving, reflecting the shifting dynamics between content creators, tech companies, and the AI industry.
