This model was released on 2021-03-10, added to Hugging Face Transformers on 2022-02-18, and contributed by gchhablani.

PLBart

PLBart is a sequence-to-sequence model designed for both code and natural language tasks, including code summarization, generation, and translation. It is pre-trained with a denoising autoencoding objective on large collections of Java and Python functions paired with natural language text. The model matches or outperforms state-of-the-art models across multiple programming languages and downstream tasks, including program repair, clone detection, and vulnerability detection. Analysis shows that PLBart captures key aspects of code syntax, style, and logical flow, enabling strong performance even with limited annotated data.

```py
import torch
from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")
pipeline = pipeline(task="text2text-generation", model="uclanlp/plbart-python-en_XX", dtype="auto", tokenizer=tokenizer)
pipeline("def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])")
```

```py
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")
model = AutoModelForSeq2SeqLM.from_pretrained("uclanlp/plbart-python-en_XX", dtype="auto")

inputs = tokenizer("def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])", return_tensors="pt")
outputs = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

Usage tips

  • The model expects sequences in a specific format with special language ID tokens. Source text format: X [eos, src_lang_code] where X is the source text. Target text format: [tgt_lang_code] X [eos]. The bos token is never used.
  • For fine-tuning with a single language, language tokens may not be needed. Refer to the paper for details.
  • Use the tokenizer's regular `__call__()` method to encode text in the source format (pass the text as the first argument or with the `text` keyword), and the `text_target` keyword to encode text in the target format.
  • Set `decoder_start_token_id` to the target language ID when generating text, as shown in the sketch after this list.
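
The snippet below is a minimal sketch of how these conventions fit together for a code-to-text example. It reuses the `uclanlp/plbart-python-en_XX` checkpoint from the examples above; the English summary passed to `text_target` is only an illustrative placeholder for fine-tuning labels.

```py
from transformers import AutoModelForSeq2SeqLM, PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")
model = AutoModelForSeq2SeqLM.from_pretrained("uclanlp/plbart-python-en_XX")

code = "def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])"

# Source text is encoded in the X [eos, src_lang_code] format described above.
inputs = tokenizer(code, return_tensors="pt")

# text_target switches the tokenizer to the target format; the summary here is
# only a placeholder. During fine-tuning these ids would be passed as labels:
# loss = model(**inputs, labels=labels).loss
labels = tokenizer(text_target="Return the maximum of three values.", return_tensors="pt").input_ids

# For generation, decoding starts from the target language ID rather than bos.
outputs = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```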

PLBartConfig

autodoc PLBartConfig

PLBartTokenizer

autodoc PLBartTokenizer - build_inputs_with_special_tokens

PLBartModel

autodoc PLBartModel - forward

PLBartForConditionalGeneration

autodoc PLBartForConditionalGeneration - forward

PLBartForSequenceClassification

autodoc PLBartForSequenceClassification - forward

PLBartForCausalLM

autodoc PLBartForCausalLM - forward