Transformers v5 completely changes how we think about tokenizers: they are now explicit architectures you can inspect, instantiate, and train from scratch—just like an nn.Module in PyTorch. Can you imagine building a tokenizer with the exact same structure as LLaMA but trained only on your medical or legal corpus? That’s exactly what v5 makes easier.
What tokenization does and why it matters
Models don’t read raw text; they consume sequences of integers called token IDs. Tokenization converts text to those IDs and back. Why should you care? Because good tokenization compresses text better: fewer tokens per input means more of the model’s context window is left for actual content.
In everyday work you’ll see that a token can be a word, a character, or a subword like play or ##ing. How the tokenizer normalizes, pre-tokenizes, and segments text determines how efficient that compression is.
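A minimal round trip looks like this (a sketch that uses gpt2 only because it is a small public checkpoint; any tokenizer behaves the same way):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Tokenization turns text into integers.")
print(ids)                    # a short list of token IDs
print(tokenizer.decode(ids))  # reconstructs the original text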
Tokenization pipeline in v5
Tokenization is not a black box: it’s a pipeline with separate stages and clear responsibilities.
- Normalizer: cleaning and Unicode normalization, lowercasing, etc.
- Pre-tokenizer: preliminary splitting into chunks (for example, splitting on spaces)
- Model: tokenization algorithm (BPE, Unigram, WordPiece)
- Post-processor: adds special tokens (BOS, EOS, etc.)
- Decoder: reconstructs text from tokens
Each component is interchangeable. In v5 you can inspect tokenizer.normalizer, tokenizer.pre_tokenizer, tokenizer._tokenizer.model, and more. That gives you fine-grained control to adapt tokenization to specific domains.
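To see how pluggable the stages are, here is a sketch built directly on the underlying Rust-backed tokenizers library (not tied to any particular checkpoint): each stage is an object you can assign or swap.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers
# Assemble a bare pipeline: every stage is a pluggable object
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])
tok.pre_tokenizer = pre_tokenizers.Whitespace()
# Swapping a stage is plain reassignment, e.g. byte-level splitting instead of whitespace
tok.pre_tokenizer = pre_tokenizers.ByteLevel()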
Dominant algorithms
- BPE (Byte Pair Encoding): merges frequent character pairs, deterministic and widely used.
- Unigram: a probabilistic approach that selects segmentations from a large initial vocabulary.
- WordPiece: similar to BPE but selects merges with a likelihood-based criterion.
These algorithms are usually implemented in the Rust tokenizers library, which is fast and model-agnostic.
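As a minimal, self-contained illustration of one of these algorithms, here is a toy BPE trained with the tokenizers library on a two-line corpus (the corpus and vocabulary size are arbitrary):
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
corpus = ["low lower lowest", "new newer newest"]  # toy data just to show the mechanics
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tok.train_from_iterator(corpus, trainer=trainer)
print(tok.encode("newest lowest").tokens)  # tokens produced by the trained model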
The relationship between tokenizers (Rust) and transformers
The tokenizers library does the heavy lifting: speed and efficiency. transformers adds model-aware features: chat templates, special tokens, truncation, and output formats (PyTorch tensors, NumPy, etc.).
Practical example: with AutoTokenizer you get a ready-to-use interface, and if you need the fast engine it’s available at tokenizer._tokenizer.
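A short sketch of that division of labor, reusing the gemma checkpoint that appears later in this article (any model with a chat template would work):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
enc = tokenizer("Hello world", return_tensors="pt")
print(enc["input_ids"].shape)  # a PyTorch tensor, ready for the model
messages = [{"role": "user", "content": "Summarize tokenization in one line."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # chat-formatted text with the model's special tokens
print(type(tokenizer._tokenizer))  # the raw Rust engine is one attribute away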
How things were before v5 (brief)
In v4 there were two implementations per model: a slow Python one and a fast Rust one. That caused:
- duplicated files per model
- behavioral discrepancies between slow/fast versions
- difficulty seeing the real architecture of the tokenizer
- inability to instantiate an empty “template” to train from scratch easily
Sound like redundant code and a confusing user experience? Exactly.
v5’s philosophical shift: architecture separated from parameters
The big idea in v5 is to separate the architecture of the tokenizer (normalizer, pre-tokenizer, model type, post-processor, decoder) from the trained parameters (vocabulary, merges). It’s the same pattern PyTorch uses with nn.Module and weights.
Instead of loading a tokenizer as a closed box, you now instantiate the structure and then fill it with trained vocabulary. For example:
from transformers import LlamaTokenizer
# Instantiate the architecture
tokenizer = LlamaTokenizer()
# Train the learned part using your data
trained = tokenizer.train_new_from_iterator(text_iterator, vocab_size=32000)
This makes it easy to create tokenizers that behave identically to the reference ones but with vocabularies tuned to your domain.
Relevant technical changes in the library
- A single file per model (no slow/fast split).
- TokenizersBackend (Rust) is the preferred backend and wraps the Rust tokenizer; PythonBackend and SentencePieceBackend still exist for particular cases.
- PreTrainedTokenizerBase defines the interface and common functions: handling special tokens, encode, decode, apply_chat_template, save_pretrained, from_pretrained, etc.
- AutoTokenizer remains the entry point, but now maps to classes that represent the tokenizer’s clear architecture.
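A quick sanity check of that shared interface in practice (a sketch; the local directory name is arbitrary):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
ids = tokenizer.encode("one interface, many backends")
tokenizer.save_pretrained("./my_tokenizer")  # writes tokenizer.json plus config files
reloaded = AutoTokenizer.from_pretrained("./my_tokenizer")
assert reloaded.encode("one interface, many backends") == ids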
Quick inspection example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
print(tokenizer._tokenizer.normalizer)
print(tokenizer._tokenizer.pre_tokenizer)
print(tokenizer._tokenizer.model)
Training a tokenizer compatible with a model (example)
If you want a tokenizer with the same LLaMA architecture but trained on your data:
from transformers import LlamaTokenizer
from datasets import load_dataset
tokenizer = LlamaTokenizer()
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
def get_training_corpus():
    # Stream the corpus in batches of 1,000 texts to keep memory usage flat
    batch = 1000
    for i in range(0, len(dataset), batch):
        yield dataset[i : i + batch]["text"]

trained_tokenizer = tokenizer.train_new_from_iterator(
    text_iterator=get_training_corpus(),
    vocab_size=32000,
    length=len(dataset),
    show_progress=True,
)
trained_tokenizer.push_to_hub("my_custom_tokenizer")
The resulting tokenizer will keep the same tokenization behavior as LLaMA (spacing, special tokens, decoding), but with vocabulary and merges adjusted to your domain.
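Once pushed, the result loads back like any other checkpoint (a sketch; the repo id matches the push_to_hub call above and may need your Hub username as a prefix):
from transformers import AutoTokenizer
custom = AutoTokenizer.from_pretrained("my_custom_tokenizer")  # or "<username>/my_custom_tokenizer"
print(custom.tokenize("Your domain-specific sentence goes here."))  # see how the new vocabulary segments it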
Real benefits for developers and projects
- Transparency: you can see what normalization and pre-tokenization a model applies.
- Productivity: less duplicated code and a single implementation per model.
- Flexibility: trainable templates that make it easier to create domain-specific tokenizers.
- Fewer subtle bugs: you no longer need to maintain parity between slow/fast.
For research and production this means faster iteration and lower chances of tokenization-related bugs.
Recommendations and warnings
- If you work with highly segmented languages (Chinese, Japanese), try different pre-tokenizers and compare the resulting sequence lengths.
- The Rust backend is recommended for performance, but for non-standard behavior it is still valid to use PythonBackend or SentencePieceBackend.
- When training your own tokenizer, validate quality by measuring average sequence length and the unknown-token (unk) rate; see the sketch after this list.
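A rough validation sketch along those lines (it assumes a trained_tokenizer as in the example above and a list texts drawn from your domain; tokenizer_stats is just an illustrative helper name):
def tokenizer_stats(tokenizer, texts):
    # Average tokens per text and share of unknown tokens (lower is better for both)
    total_tokens, unk_count = 0, 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_tokens += len(tokens)
        unk_count += sum(1 for t in tokens if t == tokenizer.unk_token)
    return total_tokens / len(texts), unk_count / max(total_tokens, 1)

avg_len, unk_rate = tokenizer_stats(trained_tokenizer, texts)
print(f"avg tokens/text: {avg_len:.1f}, unk rate: {unk_rate:.2%}")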
A practical closing
Transformers v5 is not just code cleanup: it’s an improvement in ergonomics and control. Now you can treat tokenizers like configurable templates, see exactly what they do, and adapt tokenization to your data with the same natural flow you use to define a neural network. If you build or adapt models for specific domains, this saves work and gives more reproducible results.
