This model was released on 2021-03-10 and added to Hugging Face Transformers on 2022-02-18. It was contributed by gchhablani.
PLBart
PLBart is a sequence-to-sequence model designed for both code and natural language tasks, including code summarization, generation, and translation. It is pretrained with a denoising autoencoding objective on large collections of Java and Python functions paired with natural language. The model matches or outperforms the state of the art across multiple programming languages and tasks, including program repair, clone detection, and vulnerability detection. Analysis shows that PLBart captures key aspects of code syntax, style, and logical flow, enabling strong performance even with limited annotated data.
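PLBart's denoising pretraining follows the BART recipe: the input is corrupted and the model learns to reconstruct the original sequence. The snippet below is a conceptual sketch of span infilling on whitespace-split tokens, not PLBart's actual noising function, which operates on subword ids and also applies token masking and deletion:

```python
import random

def infill_noise(tokens, mask_token="<mask>", span_len=2):
    """Replace one contiguous span of tokens with a single mask token."""
    start = random.randrange(max(1, len(tokens) - span_len))
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

source = "def maximum ( a , b , c ) : return max ( [ a , b , c ] )".split()
noised = infill_noise(source)
# Training objective: map the noised sequence back to the original source.
print(" ".join(noised))
```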
The example below translates a Python function into an English description with the pipeline API:

```python
import torch
from transformers import pipeline, AutoTokenizer

# PLBart needs explicit source and target languages on the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")
pipeline = pipeline(task="text2text-generation", model="uclanlp/plbart-python-en_XX", dtype="auto", tokenizer=tokenizer)
pipeline("def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])")
```
The same translation with AutoModelForSeq2SeqLM and an explicit generate() call:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")
model = AutoModelForSeq2SeqLM.from_pretrained("uclanlp/plbart-python-en_XX", dtype="auto")

inputs = tokenizer("def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])", return_tensors="pt")
# Start decoding with the target language id so the model generates English.
outputs = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
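For fine-tuning, the same tokenizer can encode a reference summary through the `text_target` keyword, which adds `labels` to the batch so the forward pass returns a loss. A minimal sketch, reusing the `model` and `tokenizer` loaded above; the target sentence is invented for illustration:

```python
# text_target produces labels, so the model computes a cross-entropy loss.
batch = tokenizer(
    "def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])",
    text_target="Return the maximum of the three arguments.",  # illustrative target
    return_tensors="pt",
)
loss = model(**batch).loss
loss.backward()  # an optimizer step would follow in a real training loop
```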
Usage tips
- The model expects sequences in a specific format with special language ID tokens. The source text format is `X [eos, src_lang_code]`, where `X` is the source text. The target text format is `[tgt_lang_code] X [eos]`. The `bos` token is never used (see the sketch after this list).
- For fine-tuning with a single language, language tokens may not be needed. Refer to the paper for details.
- Use the regular `__call__()` method to encode the source text format (pass the text as the first argument or with the `text` keyword). Use the `text_target` keyword for the target text format.
- Set `decoder_start_token_id` to the target language ID when generating text.
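To see these formats concretely, encode a source/target pair and inspect the tokens. A small sketch, assuming the same checkpoint as above; note that the encoded `labels` end with `[eos, tgt_lang_code]`, and the model shifts them right during training so the decoder actually sees `[tgt_lang_code] X [eos]`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX"
)
enc = tokenizer("def f(): pass", text_target="a function that does nothing")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # ends with eos, then the source language code
print(tokenizer.convert_ids_to_tokens(enc["labels"]))     # ends with eos, then the target language code
```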
PLBartConfig
autodoc PLBartConfig
PLBartTokenizer
autodoc PLBartTokenizer - build_inputs_with_special_tokens
PLBartModel
autodoc PLBartModel - forward
PLBartForConditionalGeneration
autodoc PLBartForConditionalGeneration - forward
PLBartForSequenceClassification
autodoc PLBartForSequenceClassification - forward
PLBartForCausalLM
autodoc PLBartForCausalLM - forward