White Paper: How the Pervasive Copying of Expressive Works to Train and Fuel Generative Artificial Intelligence Systems Is Copyright Infringement and Not a Fair Use

The News/Media Alliance has produced a White Paper, “How the Pervasive Copying of Expressive Works to Train and Fuel Generative Artificial Intelligence Systems Is Copyright Infringement and Not a Fair Use.”

The Alliance also filed a comprehensive submission on copyright and artificial intelligence with the U.S. Copyright Office, to aid the Office in its study and to inform all branches of government on these issues. The Alliance’s reply submission focused on responding to flawed arguments from developers and investors that advanced incomplete and inaccurate views of copyright law.

Download the White Paper (PDF)

Download Copyright Office Comments (PDF)

Download Copyright Office Reply Comments (PDF) (December 2023)

About the White Paper and Copyright Office Comments

On October 30, 2023, the News/Media Alliance published a White Paper, which incorporates a technical analysis, along with comments submitted to the Copyright Office. Both focus on generative Artificial Intelligence (AI) developers’ unauthorized use of publisher content.

The Alliance recognizes the potential benefits of AI applications and technologies and is broadly supportive of them. While the interests of publishers and generative AI developers could align, for example, in a fair exchange of licensing revenues for access to high-quality training materials, this promise of partnership has not yet materialized except in a few narrow instances. Instead, many generative AI developers have chosen to scrape publisher content without permission and use it both for model training and in real time to create competing products. While publishers make the investments and take the risks, generative AI developers reap the rewards in terms of users, data, brand creation, and advertising dollars. The continued unlicensed use of journalistic reporting portends injury to the public interest that such reporting serves and may hinder the progress of generative AI innovations.

Together, the White Paper and the Technical Analysis make multiple findings, including:

  • Developers have copied and used news, magazine and digital media content to train LLMs.
  • Popular curated datasets underlying LLMs significantly overweight publisher content by a factor ranging from over 5 to almost 100 as compared to the generic collection of content that the well-known entity Common Crawl has scraped from the web.
  • Other studies show that news and digital media ranks third among all categories of sources in Google’s C4 training set, which was used to develop Google’s generative AI-powered products like Bard. Half of the top ten sites represented in the dataset are news outlets.
  • LLMs also copy and use publisher content in their outputs. LLMs can reproduce the content on which they were trained, demonstrating that the models retain and can memorize the expressive content of the training works.
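The overweighting finding above can be read as a simple ratio: a content category's share of a curated dataset divided by its share of the generic baseline crawl. The sketch below is only an illustration of that arithmetic. The function name `overweight_factor` and the share values are hypothetical and are not taken from the White Paper or its Technical Analysis, which may measure shares differently.

```python
def overweight_factor(curated_share: float, baseline_share: float) -> float:
    """Return how many times more heavily a category is represented in a
    curated dataset than in a baseline crawl (a ratio of the two shares)."""
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return curated_share / baseline_share


# Hypothetical shares, for illustration only (not figures from the White Paper):
# publisher content as a fraction of each collection.
curated_share = 0.010   # assumed 1.0% of a curated training dataset
baseline_share = 0.002  # assumed 0.2% of the generic web crawl

factor = overweight_factor(curated_share, baseline_share)
print(f"Publisher content is overweighted by a factor of about {factor:.1f}")
```

Under this reading, a factor above 1 means the curated dataset draws on publisher content more heavily than the baseline crawl would by default.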

View full White Paper 

The Alliance’s comments to the Copyright Office address further questions related to the use of publisher content in generative AI products and services. These include the potential for licensed solutions, including on a voluntary, collective basis; the existing legal standards for determining when textual outputs may be substantially similar to news and media articles; and methods for obtaining copyright owners’ consent to the use of their materials for AI training.

Based on the conclusions drawn from these findings, recommendations from the Alliance include:

  • The Copyright Office should clarify publicly that use of publishers’ expressive content for commercial generative AI training and development is likely to compete with and harm publisher businesses, which is disfavored as a fair use.
  • Substantial transparency measures should be developed around the ingestion of copyrighted materials for use in generative AI technologies.
  • Further development of relevant licensing models should be encouraged, including by acknowledging the potential feasibility of voluntary collective licensing to facilitate licensing for ingestion of news and media materials for generative AI purposes.
  • The Copyright Office should swiftly promulgate an updated registration option to enable online news publishers to register groups of news articles published online.
  • Considering the large bargaining power disparity between media publishers and very large online platforms, measures to correct this negotiating disparity, such as the Journalism Competition and Preservation Act, should be supported.
  • Measures to address the scraping of protected content from third-party pirate websites should be adopted.

View full Copyright Office Comments

Additional Resources:

Press Release: News/Media Alliance Study Finds Pervasive Unauthorized Use of Publisher Content to Power Generative AI Technologies (October 30, 2023)