Meta signs multi-year AI deal with News Corp worth up to $50 million a year

This article explains how AI companies like Meta are using copyrighted news content for training large language models, and the implications of such data licensing deals for publishers and the broader AI industry.

Introduction

Meta's recent multi-year agreement with News Corp, valued at up to $50 million annually, highlights a critical tension in the AI industry: models depend on large volumes of training data, and much of the best data is copyrighted. The deal underscores how AI companies increasingly rely on copyrighted content, such as news articles, to train their models, raising complex legal, ethical, and economic questions. At its core, the situation concerns the data economy and how value is extracted from content in the age of large language models (LLMs).

What is AI Training Data?

AI training data refers to the vast collections of information used to teach artificial intelligence systems how to perform specific tasks. In the case of large language models (LLMs), this data typically consists of text from books, websites, news articles, and other written content. The process of training involves feeding this data into neural networks, which learn patterns, relationships, and structures within the text. These patterns then allow the model to generate human-like responses when prompted with new input.

For example, when an LLM is trained on a dataset containing millions of news articles, it learns to recognize the structure of news reporting, the use of certain vocabulary, and how to summarize or rephrase information. This training data is not just raw text; it is a carefully curated resource that directly affects a model's performance and accuracy.
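To make "learning patterns from text" concrete, here is a deliberately tiny sketch. It swaps the neural network for a simple bigram counter, which is not how LLMs are trained but shows the same basic idea: statistics gathered from example text drive predictions on new input. The function names `train_bigram` and `most_likely_next` and the three sample sentences are hypothetical, made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical stand-in for a corpus of news sentences.
    corpus = [
        "the company signed a deal",
        "the company reported earnings",
        "the deal was signed",
    ]
    model = train_bigram(corpus)
    print(most_likely_next(model, "the"))
```

A model like this can only repeat word pairs it has seen, which is one way to see why the volume and variety of training text matter so much.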

How Does This Deal Work?

Meta's deal with News Corp is a licensing agreement for the use of copyrighted news content. In this arrangement, Meta pays News Corp for access to its content, which it can then use to train its AI models. This is not a one-time transaction but a multi-year contract, indicating a long-term strategy to secure a consistent supply of high-quality data.

From a technical standpoint, licensed content like this is typically processed and fed into AI systems through a pipeline that includes data cleaning, tokenization (breaking text into smaller units such as words or subwords), and embedding (mapping each token to a numerical vector the network can process). The quality and diversity of the data are crucial for model performance. For instance, a model trained on a broad range of news sources will likely perform better on tasks requiring general knowledge than one trained on a narrow set of sources.
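The three stages named above (cleaning, tokenization, embedding) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a description of Meta's actual pipeline, which is not public. All function names (`clean`, `tokenize`, `build_vocab`, `encode`, `make_embeddings`) are hypothetical, the word-level tokenizer stands in for the subword tokenizers production systems generally use, and the embedding table is filled with random numbers where a real model would learn the values during training.

```python
import random
import re
from collections import Counter


def clean(text):
    """Strip HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase, then split into word tokens and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def build_vocab(token_lists):
    """Assign an integer ID to every distinct token; ID 0 is reserved for unknowns."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<unk>": 0}
    for tok in sorted(counts):
        vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map tokens to IDs, falling back to the <unk> ID for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


def make_embeddings(vocab_size, dim=4, seed=0):
    """Random placeholder vectors, one per token ID (a real model learns these)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


if __name__ == "__main__":
    raw = "<p>Meta  signs\n deal.</p>"   # hypothetical scraped snippet
    tokens = tokenize(clean(raw))        # ['meta', 'signs', 'deal', '.']
    vocab = build_vocab([tokens])
    ids = encode(tokens, vocab)          # [3, 4, 2, 1]
    table = make_embeddings(len(vocab))
    vectors = [table[i] for i in ids]    # one 4-dimensional vector per token
    print(tokens, ids, len(vectors))
```

The point of the sketch is the order of operations: noisy source text is normalized first, split into units second, and only then turned into the numbers a network actually consumes.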

This deal also reflects the data scarcity problem in AI development. As AI models grow larger and more capable, training them requires ever more data. However, high-quality, diverse, and legally obtained datasets are in limited supply, especially in industries like journalism, where content is protected by copyright.

Why Does This Matter?

This agreement has significant implications for the AI industry, journalism, and intellectual property law. For AI companies, securing access to large volumes of training data is a competitive advantage. The ability to train models on diverse, high-quality sources can lead to superior performance and, ultimately, market dominance.

However, the deal also raises ethical concerns about the use of copyrighted content without explicit consent from individual authors. It highlights the tension between AI innovation and the rights of content creators. In many cases, news articles are created by individual journalists, but the rights to use that content for AI training often rest with the publishing organizations.

From an economic perspective, this deal represents a shift in how the value of content is realized. Instead of just monetizing content through advertising or subscriptions, publishers can now license their content for AI training, creating a new revenue stream. However, this also risks commodifying journalism, where the focus shifts from public service to data extraction.

Key Takeaways

AI training data is essential for developing large language models and is often sourced from copyrighted content like news articles.
Meta’s deal with News Corp is a multi-year licensing agreement that provides access to high-quality training data for AI development.
This arrangement illustrates the growing importance of data in the AI economy and the need for legal frameworks to govern its use.
While beneficial for publishers, the deal raises ethical questions about consent, author rights, and the commodification of journalism.
The value of content is shifting from traditional monetization methods to new forms of data licensing and AI-driven revenue models.
