NLLB

This model was released on 2022-07-11 and added to Hugging Face Transformers on 2022-07-18. It was contributed by lysandre.
NLLB addresses the challenge of translating low-resource languages with a conditional-compute model based on a Sparsely Gated Mixture of Experts. The model is trained on data obtained with novel data mining techniques tailored for low-resource languages, together with architectural and training improvements that counteract overfitting while training on thousands of tasks. Evaluated on over 40,000 translation directions with the Flores-200 benchmark and a toxicity benchmark, NLLB achieves a 44% BLEU improvement over the previous state of the art, a step towards a universal translation system.
import torch
from transformers import pipeline
pipeline = pipeline(task="translation_en_to_fr", model="facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn", dtype="auto")
pipeline("Plants create energy through a process known as photosynthesis.")
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
inputs = tokenizer("Plants create energy through a process known as photosynthesis.", return_tensors="pt")
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"))
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
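
NLLB-200 is a many-to-many model, so the checkpoint itself does not fix a target language: `forced_bos_token_id` forces the FLORES-200 code of the desired target language (here `fra_Latn`) to be the first generated token, which steers decoding into that language. As a minimal sketch, reusing `model`, `tokenizer`, and `inputs` from above, switching the target only requires forcing a different code (`deu_Latn`, German, is used here purely for illustration):

# Reuse model, tokenizer, and inputs from the example above;
# forcing a different language code changes the target language.
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"))
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))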
Usage tips
- The tokenizer was updated in April 2023. It now prefixes the source sequence with the source language instead of the target language. This prioritizes zero-shot performance at a minor cost to supervised performance.
- For non-English source languages, specify the language's BCP-47 code with the `src_lang` keyword argument, as shown in the sketch below.
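
As a concrete illustration of the second tip, here is a minimal sketch of translating from a non-English source (French to English). The French example sentence and the choice of language codes are assumptions for illustration only; the checkpoint and API calls mirror the example above.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", dtype="auto")
# Passing src_lang makes the tokenizer prefix the input with the French language code
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="fra_Latn")

inputs = tokenizer("Les plantes produisent de l'énergie grâce à un processus appelé photosynthèse.", return_tensors="pt")
# The target language is selected by forcing its code as the first generated token
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"))
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

If the pre-April-2023 ordering (target-language prefix) is ever needed, the NLLB tokenizers expose a `legacy_behaviour=True` flag at initialization to restore the old behavior.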
NllbTokenizer
[[autodoc]] NllbTokenizer
    - build_inputs_with_special_tokens
NllbTokenizerFast
[[autodoc]] NllbTokenizerFast