PyTorch Flax FlashAttention SDPA

# Byte Latent Transformer (BLT)

## Overview

The BLT model was proposed in *Byte Latent Transformer: Patches Scale Better Than Tokens* by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer. BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.

The abstract from the paper is the following:

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.

Usage Tips:

- **Dual Model Architecture**: BLT consists of two separately trained models:
  - **Patcher (Entropy Model)**: A smaller transformer that predicts byte-level entropy, used to determine patch boundaries and segment the input.
  - **Main Transformer Model**: The primary model that processes the patches through a Local Encoder, a Global Transformer, and a Local Decoder.
- **Dynamic Patching**: The model uses entropy-based dynamic patching (see the sketch after this list):
  - High-entropy regions (complex data) get shorter patches with more computational attention.
  - Low-entropy regions (predictable data) get longer patches for efficiency.
  - This allows the model to allocate compute where it is most needed.
- **Local Encoder**: Processes byte sequences with cross-attention to patch embeddings.
- **Global Transformer**: Processes patch-level representations with full attention across patches.
- **Local Decoder**: Generates output with cross-attention back to the original byte sequence. The interplay of these three components is illustrated in the toy forward-pass sketch below.
- **Byte-Level Tokenizer**: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID; no vocabulary is needed.
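
As a concrete illustration of entropy-based patching and byte-level tokenization, the toy sketch below converts a string to UTF-8 byte IDs and starts a new patch wherever a next-byte distribution has high entropy. The `toy_next_byte_logits` helper, the random logits, and the median threshold are illustrative assumptions only; the real patcher is a trained transformer with a tuned threshold, and the actual tokenizer may also reserve IDs for special tokens.

```python
import torch

# Toy stand-in for BLT's patcher (entropy model). In BLT this is a small
# transformer predicting a distribution over the next byte; random logits
# keep this sketch self-contained.
def toy_next_byte_logits(byte_ids: torch.Tensor) -> torch.Tensor:
    torch.manual_seed(0)
    return torch.randn(len(byte_ids), 256)  # one 256-way distribution per position

def entropy_patch_starts(byte_ids: torch.Tensor) -> list[int]:
    """Start a new patch wherever next-byte entropy is high.

    The median threshold is an illustrative choice so this demo always splits;
    the real patcher uses a tuned global threshold.
    """
    probs = torch.softmax(toy_next_byte_logits(byte_ids), dim=-1)
    entropy = -(probs * probs.log()).sum(dim=-1)  # one entropy value per byte position
    threshold = entropy.median()
    return [0] + [i for i in range(1, len(byte_ids)) if entropy[i] > threshold]

# Byte-level "tokenization": text -> UTF-8 bytes -> IDs, no learned vocabulary.
text = "my name is"
byte_ids = torch.tensor(list(text.encode("utf-8")))

starts = entropy_patch_starts(byte_ids)
ends = starts[1:] + [len(byte_ids)]
print([bytes(byte_ids[s:e].tolist()).decode("utf-8") for s, e in zip(starts, ends)])
```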
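
The division of labor between the Local Encoder, Global Transformer, and Local Decoder can be sketched as a toy forward pass. Everything below (module sizes, a single layer per stage, mean-pooling byte states into initial patch embeddings) is an assumption made purely for illustration and does not reproduce the actual BLT layers.

```python
import torch
import torch.nn as nn

class ToyBLTFlow(nn.Module):
    """Illustrative data flow only: bytes -> local encoder -> global transformer -> local decoder."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)
        self.local_encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.cross_enc = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.global_transformer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.cross_dec = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_byte_logits = nn.Linear(dim, 256)

    def forward(self, byte_ids: torch.Tensor, patch_starts: list[int]) -> torch.Tensor:
        h_bytes = self.local_encoder(self.byte_emb(byte_ids))  # (1, T, dim) byte states
        # Pool byte states into one embedding per patch (mean pooling is an
        # assumption; the real model derives patch embeddings differently).
        ends = patch_starts[1:] + [byte_ids.shape[1]]
        patches = torch.stack(
            [h_bytes[:, s:e].mean(dim=1) for s, e in zip(patch_starts, ends)], dim=1
        )
        patches, _ = self.cross_enc(patches, h_bytes, h_bytes)  # patch <- byte cross-attention
        patches = self.global_transformer(patches)              # full attention across patches
        h_out, _ = self.cross_dec(h_bytes, patches, patches)    # byte <- patch cross-attention
        return self.to_byte_logits(h_out)                       # (1, T, 256) next-byte logits

ids = torch.tensor([list("my name is".encode("utf-8"))])
with torch.no_grad():
    logits = ToyBLTFlow().eval()(ids, patch_starts=[0, 3, 8])
print(logits.shape)  # torch.Size([1, 10, 256])
```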

The model can be loaded via:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("itazap/blt-1b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "itazap/blt-1b-hf",
    device_map="auto",
)

# The byte-level tokenizer maps the prompt to UTF-8 byte IDs.
prompt = "my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

NUM_TOKENS_TO_GENERATE = 20  # any small value works for a quick check
generated_ids = model.generate(
    **inputs, max_new_tokens=NUM_TOKENS_TO_GENERATE, do_sample=False, use_cache=False
)

print(tokenizer.decode(generated_ids[0]))
```

This model was contributed by itazap. The original code can be found here.

## BltConfig

[[autodoc]] BltConfig

## BltModel

[[autodoc]] BltModel
    - forward

## BltForCausalLM

[[autodoc]] BltForCausalLM
    - forward