
This model was released on 2020-05-20, added to Hugging Face Transformers on 2020-11-16, and contributed by dqnguyen.

# BERTweet

BERTweet is a large-scale pre-trained language model for English tweets. It shares the architecture of BERT-base and is trained with the RoBERTa pre-training procedure. It outperforms strong baselines such as RoBERTa-base and XLM-R-base on part-of-speech tagging, named-entity recognition, and text classification.

```py
import torch
from transformers import pipeline

# @USER and HTTPURL are the normalized placeholders BERTweet was pre-trained on
pipeline = pipeline(task="text-classification", model="vinai/bertweet-base", dtype="auto")
result = pipeline("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:")
print(f"Label: {result[0]['label']}, Score: {result[0]['score']}")
```
```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

inputs = tokenizer("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# pick the highest-scoring class and map it back to its label name
predicted_class_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_class_id]
print(f"Predicted label: {label}")
```

## Usage tips

- Use [`AutoTokenizer`] or [`BertweetTokenizer`]. They come preloaded with a custom vocabulary for tweet-specific tokens such as hashtags (#), user mentions (@), emojis, and common abbreviations. Make sure to also install the emoji library, which the tokenizer uses to translate emoji into text aliases (see the sketch below).
- Pad inputs on the right (`padding="max_length"`) because BERTweet, like BERT, uses absolute position embeddings.
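
Tweet normalization is what produces the @USER and HTTPURL placeholders seen in the examples above. A minimal sketch, assuming the tokenizer's `normalization=True` flag; the raw tweet, variable names, and `max_length` value are illustrative:

```py
# pip install emoji  # needed for translating emoji into text
from transformers import AutoTokenizer

# normalization=True rewrites user mentions to @USER, URLs to HTTPURL,
# and emoji to text aliases (e.g. a crying face becomes :cry:-style text),
# matching the preprocessing of BERTweet's pre-training corpus
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

raw_tweet = "SC has first two presumptive cases of coronavirus, DHEC confirms https://t.co/abc via @DHEC 😢"
# pad on the right up to max_length, as recommended above
encoded = tokenizer(raw_tweet, padding="max_length", max_length=64, return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())[:16])
```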

## BertweetTokenizer

[[autodoc]] BertweetTokenizer