This model was released on 2020-05-20, added to Hugging Face Transformers on 2020-11-16, and contributed by dqnguyen.
BERTweet
BERTweet is a large-scale pre-trained language model for English Tweets. It shares the architecture of BERT-base but is trained with the RoBERTa pre-training procedure. It outperforms strong baselines such as RoBERTa-base and XLM-R-base on part-of-speech tagging, named-entity recognition, and text classification tasks.
```py
import torch
from transformers import pipeline

# Note: vinai/bertweet-base is a pretrained encoder; the classification head is
# randomly initialized unless you load a checkpoint fine-tuned for this task.
pipeline = pipeline(task="text-classification", model="vinai/bertweet-base", dtype="auto")
result = pipeline("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:")
print(f"Label: {result[0]['label']}, Score: {result[0]['score']}")
```
```py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

inputs = tokenizer("SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class and map it back to its label name
predicted_class_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_class_id]
print(f"Predicted label: {label}")
```
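Because `vinai/bertweet-base` is distributed as a pretrained masked language model rather than a fine-tuned classifier, the classification heads above start from random weights. A minimal sketch of exercising the pre-training objective directly with the `fill-mask` pipeline instead (the mask token for this checkpoint is `<mask>`; the masked tweet is illustrative only):

```py
from transformers import pipeline

# Fill-mask works without fine-tuning because BERTweet was pretrained with
# masked language modeling. The example text is an assumption, not from the docs.
fill_mask = pipeline(task="fill-mask", model="vinai/bertweet-base")
predictions = fill_mask("SC has first two presumptive cases of <mask> , DHEC confirms HTTPURL via @USER :cry:")
for pred in predictions:
    print(f"{pred['token_str']}: {pred['score']:.4f}")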
Usage tips
- Use [`AutoTokenizer`] or [`BertweetTokenizer`]. They come preloaded with a custom vocabulary for tweet-specific tokens like hashtags (#), mentions (@), emojis, and common abbreviations. Install the emoji library too. See the sketch after this list.
- Pad inputs on the right (`padding="max_length"`) because BERT uses absolute position embeddings.
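A minimal sketch of loading the tokenizer with tweet normalization enabled. Passing `normalization=True` applies BERTweet's normalization from pre-training (user mentions become `@USER`, URLs become `HTTPURL`, emojis are translated to text via the emoji package); the example tweet, URL, and handle are assumptions for illustration:

```py
from transformers import AutoTokenizer

# normalization=True applies BERTweet's tweet normalization: mentions -> @USER,
# URLs -> HTTPURL, emojis -> text (requires `pip install emoji`).
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# Hypothetical raw tweet; the URL and handle are normalized before tokenization.
tokens = tokenizer.tokenize("SC has first two presumptive cases of coronavirus, DHEC confirms https://t.co/example via @someuser :cry:")
print(tokens)
```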
BertweetTokenizer
[[autodoc]] BertweetTokenizer