What is a tokenizer and why does it matter for AI models?
Introduction
Imagine you're trying to teach a computer to understand human language. You might think that if you give it a sentence like "The cat sat on the mat," it would immediately understand what that means. But computers don't understand words the way we do. They need everything broken down into smaller pieces they can work with.
This is where tokenizers come in. A tokenizer is like a translator that takes human language and breaks it into tiny pieces called tokens. These tokens are the building blocks that AI models use to process and understand text.
What is a Tokenizer?
Think of a tokenizer like a chef who needs to chop ingredients into small, manageable pieces. Just as a chef might chop a carrot into tiny cubes, a tokenizer takes a sentence and splits it into smaller parts. These parts can be words, parts of words, punctuation, or even individual characters.
For example, the sentence "I love AI" might be broken down into these tokens:
- I
- love
- AI
A word-level tokenizer handles longer sentences the same way, one token per word. "The quick brown fox jumps over the lazy dog." might become:
- The
- quick
- brown
- fox
- jumps
- over
- the
- lazy
- dog
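The word-level splitting above can be sketched in a few lines. This is a minimal illustration, not how production tokenizers work: it splits on whitespace and peels off punctuation (notice that the trailing period becomes its own token, which the list above omits for simplicity).

```python
import re

def simple_tokenize(text):
    # Grab runs of word characters, or single punctuation marks.
    # A toy word-level tokenizer -- real systems use subword methods like BPE.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I love AI"))
# ['I', 'love', 'AI']
print(simple_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```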
How Does a Tokenizer Work?
Tokenizers work using different methods, but most modern ones use a technique called byte pair encoding (BPE). In plain terms, BPE scans a large body of training text, finds the pair of adjacent symbols that appears together most often, and merges that pair into a single new token. Repeating this merge step thousands of times builds up a vocabulary of common chunks.
For instance, if the tokenizer sees that "ing" often appears at the end of words, it might decide to treat "ing" as a single token instead of three separate characters. This helps reduce the number of tokens needed to represent common patterns.
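The core BPE loop described above can be sketched as a single merge step: count every adjacent pair of symbols, pick the most frequent one, and fuse it into a new token. This is a bare-bones illustration (real BPE implementations weight pairs by word frequency and repeat the loop many times):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count every adjacent symbol pair across all words.
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace each occurrence of the chosen pair with one merged symbol.
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

# BPE starts from individual characters.
words = [list("ring"), list("king"), list("sing"), list("wind")]
pair = most_frequent_pair(words)   # ('i', 'n') -- it appears in all four words
print(merge_pair(words, pair)[0])  # ['r', 'in', 'g']
```

Run the loop again and "in" + "g" would likely merge next, producing the single "ing" token mentioned above.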
But here's the catch: when tokenizer developers make changes to their system, they might decide to split text differently. This can dramatically change how many tokens a piece of text requires.
Why Does This Matter for AI Pricing?
When you use an AI model, you usually pay based on the number of tokens it processes. It's like paying for each ingredient you use in a recipe.
Now, let's look at what happened with Anthropic's Claude Opus model. The price per token stayed the same, but when the tokenizer was updated in version 4.7, it started breaking the same text into up to 47% more tokens than before. Each request therefore cost significantly more, even though the listed per-token price never changed.
Let's use a real-world example:
- Before: A 100-word sentence was broken into 120 tokens
- After: The same sentence is now broken into 176 tokens (47% more)
If the price is $0.01 per token, that means:
- Before: $1.20 for the sentence
- After: $1.76 for the same sentence
So even though the price per token didn't change, the total cost went up because more tokens were needed.
Key Takeaways
- A tokenizer breaks text into smaller pieces called tokens that AI models can process
- Changes to tokenizer algorithms can dramatically increase the number of tokens needed for the same text
- Even if the price per token stays the same, more tokens mean higher total costs
- Understanding tokenization helps explain why AI costs can seem to increase without any price changes
This is a simple but important concept in AI. As AI becomes more common in our daily lives, understanding how these systems work can help us make better decisions about how we use them and how much we pay for them.


