Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"

June 5, 2026

This article explains how AI models are trained and why the source of training data matters, especially when companies claim to use only licensed data but reportedly use unlicensed web data.

What’s the big deal about how AI models are trained? It might sound like a technical question, but it’s actually one that affects all of us. When big companies like Microsoft build AI models, they need lots of data to learn from. But where that data comes from can have a big impact on how the AI works — and even how it’s used.

What is AI model training?

Think of AI model training like teaching a child to recognize cats. You show them thousands of pictures of cats and tell them, 'This is a cat.' Over time, they start to understand what makes a cat a cat. AI models do something similar. They are fed massive amounts of text, images, or other data so they can learn patterns and make predictions.

For example, when you ask a chatbot like ChatGPT a question, it’s using patterns it learned from training data to give you a helpful answer.
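To make "learning patterns from data" concrete, here is a minimal sketch in Python. It is a toy illustration, not how ChatGPT or Microsoft's models actually work: the three training sentences and the function names `train_bigram_model` and `predict_next` are made up for this example. The "model" simply counts which word tends to follow which, then uses those counts to predict the next word.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count which word follows which; these counts are the learned 'patterns'."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair each word with the word that comes right after it.
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def predict_next(model, word):
    """Return the most frequently seen follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training data: real models see billions of sentences, not three.
training_data = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(training_data)
print(predict_next(model, "the"))    # cat
print(predict_next(model, "sat"))    # on
print(predict_next(model, "zebra"))  # None (never seen in training)
```

Even in this tiny example, the predictions depend entirely on what was in the training data. Change the sentences and the answers change, which is why the source of the data matters.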

What is data licensing?

When companies train AI models, they often use data from the internet — like news articles, blog posts, books, or websites. But here’s the catch: not all data is free to use.

Data licensing is a set of rules that say who can use what data and under what conditions. For example, a company might buy a license to use a specific dataset, such as a database of medical records, for a certain price and purpose. Data used without that kind of permission is called unlicensed data.

Microsoft has said that its new AI models (called MAI models) were trained only on "enterprise grade, clean and commercially licensed data." But recent reports indicate that Microsoft also used unlicensed data from sources like Common Crawl, a massive public archive of web pages that anyone can download. Being freely accessible is not the same as being licensed for commercial use.
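The idea of licensing as a set of rules can be sketched as a simple filter. The example below is a hypothetical illustration only: the `Document` class, the license labels, and the source names are assumptions invented for this sketch, and nothing here reflects Microsoft's actual data pipeline. It shows how a team that promised to train only on licensed data might separate documents it can use from those it cannot.

```python
from dataclasses import dataclass

# Hypothetical license labels that this imaginary team is allowed to train on.
ALLOWED_LICENSES = {"commercial-license", "public-domain"}


@dataclass
class Document:
    source: str   # where the text came from
    license: str  # license label attached to the text
    text: str


def split_by_license(documents, allowed=ALLOWED_LICENSES):
    """Separate documents into those with an allowed license and the rest."""
    licensed = [doc for doc in documents if doc.license in allowed]
    excluded = [doc for doc in documents if doc.license not in allowed]
    return licensed, excluded


# Made-up corpus for illustration.
corpus = [
    Document("licensed-news-archive", "commercial-license", "..."),
    Document("web-crawl", "unknown", "..."),
    Document("old-book", "public-domain", "..."),
]

licensed, excluded = split_by_license(corpus)
print([doc.source for doc in licensed])  # ['licensed-news-archive', 'old-book']
print([doc.source for doc in excluded])  # ['web-crawl']
```

In practice the hard part is not the filter itself but knowing the license of each document in the first place. Pages collected by a web crawl often come with no license label at all.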

Why does this matter?

There are a few important reasons why the source of training data matters:

  • Legal issues: If companies use data without permission, they might face lawsuits or be forced to pay fines.
  • Quality of AI: Data from the web can be messy, outdated, or even biased. If it’s not carefully selected, it can lead to AI models that give wrong or unfair answers.
  • Trust: When companies say they use only clean, licensed data, people trust them. If they’re not honest, it can damage their reputation and the trust in AI in general.

Microsoft's case stands out because it is a global tech giant, not a small startup. When a company of that size says one thing but reportedly does another, it raises questions about how other companies might be acting behind the scenes.

Key takeaways

  • AI models are trained using large amounts of data, often from the internet.
  • Data licensing is important because it sets rules for how data can be used.
  • Microsoft claimed to use only clean, licensed data but reportedly also used unlicensed web data.
  • Using unlicensed data can lead to legal issues, lower-quality AI, and loss of trust.
  • It’s important for companies to be honest about where their data comes from.

In short, how AI companies gather their data is not just a technical detail — it’s a matter of fairness, trust, and the future of how AI affects our lives.

Source: The Decoder
