
Commit 0bf3c0c

committed: update readme & other scripts, and support bart
1 parent 4b5c7ca commit 0bf3c0c

File tree: 10 files changed, +278 -278 lines changed


README.md

Lines changed: 85 additions & 99 deletions
@@ -1,16 +1,27 @@
 # CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
 
 This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
 
 **Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
 
 **Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/), [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
 
 ![CodeT5 demo](codet5.gif)
 
 ## Updates
+
+**Oct 29, 2021**
+
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models) for all the downstream tasks covered in the paper.
+
 **Oct 25, 2021**
 
-We release a CodeT5-base fine-tuned checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for multi-lingual code summarization. Below is how to use this model:
+We release a CodeT5-base fine-tuned checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for multilingual code summarization. Below is how to use this model:
 
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
@@ -39,26 +50,17 @@ if __name__ == '__main__':
     # this prints: "Convert a SVG string to a QImage."
 ```
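The diff elides the middle of this snippet. As a point of reference for readers of this commit page, a minimal, self-contained sketch of the summarization usage follows; the checkpoint name comes from the README above, while the input function here is made up:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Sketch only: loads the multilingual summarization checkpoint named in the README.
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

code = "def add(a, b):\n    return a + b"  # any function body to summarize (made-up input)
input_ids = tokenizer(code, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```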
 
-It significantly outperforms previous methods on code summarization in the [CodeXGLUE benchmark](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text):
-| Model | Ruby | Javascript | Go | Python | Java | PHP | Overall |
-| ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
-| Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 |
-| Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 |
-| [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) | 11.17 | 11.90 | 17.72 | 18.14 | 16.47 | 24.02 | 16.57 |
-| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
-| [PLBART](https://aclanthology.org/2021.naacl-main.211.pdf) | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
-| [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | **15.24** | **16.18** | **19.95** | **20.42** | **20.26** | **26.10** | **19.69** |
 
 **Oct 18, 2021**
 
 We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out if you have any questions about it.
 
 **Sep 24, 2021**
 
 CodeT5 is now on [Hugging Face](https://huggingface.co/)!
 
 You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
 
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
@@ -76,22 +78,30 @@ print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
 ```
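The body of this snippet is also elided by the diff. A minimal sketch of such an inference call with the pre-trained checkpoints (hypothetical input; this is the standard T5-style masked-span usage and not necessarily the exact example in the README):

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Sketch only: CodeT5-base from the Hugging Face hub, used for masked-span prediction.
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"  # made-up input with one masked span
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```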
 
 ## Introduction
 
 This repo provides the code for reproducing the experiments in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf). CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M** functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
 
 Paper link: https://arxiv.org/abs/2109.00859
 
 Blog link: https://blog.einstein.ai/codet5/
 
-The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE.
+The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate easy replication of our paper.
 
 In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 
 - **Text-to-code generation**: generate code based on the natural language description.
 - **Code autocompletion**: complete the whole function of code given the target function name.
 - **Code summarization**: generate the summary of a function in natural language description.
 
 ## Table of Contents
 
@@ -103,7 +113,9 @@ At Salesforce, we build an [AI coding assistant demo](https://github.com/salesfo
 6. [Get Involved](#get-involved)
 
 ## Citation
+
 If you find this code to be useful for your research, please consider citing.
+
 ```
 @inproceedings{
     wang2021codet5,
@@ -115,116 +127,90 @@ If you find this code to be useful for your research, please consider citing.
 ```
 
 ## License
+
 The code is released under the BSD-3 License (see `LICENSE.txt` for details), but we also ask that users respect the following:
 
 This software should not be used to promote or profit from:
 
 violence, hate, and division,
 
 environmental destruction,
 
 abuse of human rights, or
 
 the destruction of people's physical and mental health.
 
 We encourage users of this software to tell us about the applications in which they are putting it to use by emailing codeT5@salesforce.com, and to use [appropriate](https://arxiv.org/abs/1810.03993) [documentation](https://www.partnershiponai.org/about-ml/) when developing high-stakes applications of this model.
 
 ## Dependency
+
 - Pytorch 1.7.1
 - tensorboard 2.4.1
 - transformers 4.6.1
 - tree-sitter 0.2.2
 
-## Download
-
-* [Pre-trained checkpoints & Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research)
-* Fine-tuned checkpoints (TBA)
-* Extra C/C# pre-training data (TBA)
-
-Instructions to download:
-```
-pip install gsutil
-
-gsutil -m cp -r "gs://sfr-codet5-data-research/data/" .
-
-mkdir pretrained_models; cd pretrained_models
-gsutil -m cp -r \
-  "gs://sfr-codet5-data-research/pretrained_models/codet5_small" \
-  "gs://sfr-codet5-data-research/pretrained_models/codet5_base" \
-  .
-```
-
-The repository structure will look like the following after the download:
-```
-├── CODE_OF_CONDUCT.md
-├── README.md
-├── SECURITY.md
-├── codet5.gif
-├── configs.py
-├── models.py
-├── run_clone.py
-├── run_gen.py
-├── utils.py
-├── _utils.py
-├── LICENSE.txt
-├── data
-│   ├── clone
-│   ├── concode
-│   ├── defect
-│   ├── refine
-│   │   ├── medium
-│   │   └── small
-│   ├── summarize
-│   │   ├── go
-│   │   ├── java
-│   │   ├── javascript
-│   │   ├── php
-│   │   ├── python
-│   │   └── ruby
-│   └── translate
-├── evaluator
-│   ├── bleu.py
-│   ├── smooth_bleu.py
-│   └── CodeBLEU
-├── pretrained_models
-│   ├── codet5_base
-│   └── codet5_small
-├── sh
-│   ├── exp_with_args.sh
-│   ├── run_exp.py
-│   ├── results
-│   ├── saved_models
-│   └── tensorboard
-└── tokenizer
-    └── salesforce
-        ├── codet5-merges.txt
-        └── codet5-vocab.json
-```
+## Download
+
+* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
+* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
+* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+
+Instructions to download:
+
+```
+# pip install gsutil
+cd your-cloned-codet5-path
+
+gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
+gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
+gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
+```
 
 ## Fine-tuning
 
-Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your downloaded CodeT5 repository path.
-
-You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task` arguments. In total, we support four models (i.e., ['roberta', 'codebert', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use the `sub_task` to specify which specific datasets to fine-tune on.
-
-For example, if you want to run the CodeT5-base model on the code summarization task for Ruby, you can simply run:
+Go to the `sh` folder and set the `WORKDIR` in `exp_with_args.sh` to your cloned CodeT5 repository path.
+
+You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task` arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use the `sub_task` to specify which specific datasets to fine-tune on. Below is the full list:
+
+| --task    | --sub_task                         | Description                                                                                                                      |
+| --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
+| summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs                                    |
+| concode   | none                               | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data                                                  |
+| translate | java-cs/cs-java                    | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf)                                              |
+| refine    | small/medium                       | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions                           |
+| defect    | none                               | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf)  |
+| clone     | none                               | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf)                                                         |
+
+For example, if you want to run the CodeT5-base model on the code summarization task for Python, you can simply run:
 
 ```
-python run_exp.py --model_tag codet5_base --task summarize --sub_task ruby
+python run_exp.py --model_tag codet5_base --task summarize --sub_task python
 ```
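Since this commit also adds `bart_base` to the supported model tags (per the commit message "update readme & other scripts, and support bart"), an analogous invocation, assuming the same interface, would be:

```
python run_exp.py --model_tag bart_base --task summarize --sub_task python
```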
 
 Besides, you can specify:
+
 ```
 model_dir: where to save fine-tuning checkpoints
 res_dir: where to save the performance results
 summary_dir: where to save the training curves
 data_num: how many data instances to use, the default -1 is for using the full data
 gpu: the index of the GPU to use in the cluster
 ```
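As a concrete illustration (hypothetical values, and assuming these options are passed as command-line flags to `run_exp.py`), a quick run on a 1,000-instance subset of the Python summarization data on GPU 0 might look like:

```
python run_exp.py --model_tag codet5_base --task summarize --sub_task python --data_num 1000 --gpu 0
```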
-You can also revise the suggested arguments [here](https://github.com/salesforce/CodeT5/blob/4f8818aea1bf170f019381671087e4c4f9608005/sh/run_exp.py#L14) and refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full available options.
-The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
+You can also revise the suggested arguments [here](https://github.com/salesforce/CodeT5/blob/4f8818aea1bf170f019381671087e4c4f9608005/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file. Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 
 ## Get Involved
 
-Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports.
-We welcome PRs!
+Please create a GitHub issue if you have any questions, suggestions, requests or bug reports. We welcome PRs!
 

_utils.py

Lines changed: 25 additions & 2 deletions
@@ -66,12 +66,22 @@ def convert_clone_examples_to_features(item):
     else:
         source_str = example.source
         target_str = example.target
-    code1 = tokenizer.encode(source_str, max_length=args.block_size, padding='max_length', truncation=True)
-    code2 = tokenizer.encode(target_str, max_length=args.block_size, padding='max_length', truncation=True)
+    code1 = tokenizer.encode(source_str, max_length=args.max_source_length, padding='max_length', truncation=True)
+    code2 = tokenizer.encode(target_str, max_length=args.max_source_length, padding='max_length', truncation=True)
     source_ids = code1 + code2
     return CloneInputFeatures(example_index, source_ids, example.label, example.url1, example.url2)
 
 
+def convert_defect_examples_to_features(item):
+    example, example_index, tokenizer, args = item
+    if args.model_type in ['t5', 'codet5'] and args.add_task_prefix:
+        source_str = "{}: {}".format(args.task, example.source)
+    else:
+        source_str = example.source
+    code = tokenizer.encode(source_str, max_length=args.max_source_length, padding='max_length', truncation=True)
+    return DefectInputFeatures(example_index, code, example.target)
+
+
 class CloneInputFeatures(object):
     """A single training/test features for a example."""
 
@@ -89,6 +99,19 @@ def __init__(self,
         self.url2 = url2
 
 
+class DefectInputFeatures(object):
+    """A single training/test features for a example."""
+
+    def __init__(self,
+                 example_id,
+                 source_ids,
+                 label
+                 ):
+        self.example_id = example_id
+        self.source_ids = source_ids
+        self.label = label
+
+
 class InputFeatures(object):
     """A single training/test features for a example."""
 
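For readers of this diff, a small, hypothetical usage sketch of the new defect converter (not part of the commit): it assumes example objects exposing `.source` and `.target`, a Hugging Face tokenizer, and an `args` namespace carrying the `model_type`, `add_task_prefix`, `task`, and `max_source_length` fields the converter reads above.

```python
# Hypothetical illustration only; everything except the converter's own interface is assumed.
from argparse import Namespace
from transformers import RobertaTokenizer

from _utils import convert_defect_examples_to_features

class Example:                       # stand-in for the repo's defect example objects
    def __init__(self, source, target):
        self.source = source         # the C/C++ function body
        self.target = target         # 0/1 defect label

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
args = Namespace(model_type='codet5', add_task_prefix=True, task='defect', max_source_length=512)

examples = [Example("int f(int x) { return x / 0; }", 1)]
features = [convert_defect_examples_to_features((ex, idx, tokenizer, args))
            for idx, ex in enumerate(examples)]
print(features[0].example_id, len(features[0].source_ids), features[0].label)  # 0 512 1
```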

configs.py

Lines changed: 2 additions & 3 deletions
@@ -9,7 +9,7 @@
 
 def add_args(parser):
     parser.add_argument("--task", type=str, required=True,
-                        choices=['summarize', 'refine', 'translate', 'concode', 'clone', 'defect'])
+                        choices=['summarize', 'concode', 'translate', 'refine', 'defect', 'clone'])
     parser.add_argument("--sub_task", type=str, default='')
     parser.add_argument("--lang", type=str, default='')
     parser.add_argument("--eval_task", type=str, default='')
@@ -49,7 +49,6 @@ def add_args(parser):
                         help="Pretrained config name or path if not the same as model_name")
     parser.add_argument("--tokenizer_name", default="roberta-base", type=str,
                         help="Pretrained tokenizer name or path if not the same as model_name")
-    parser.add_argument("--block_size", default=512, type=int)
     parser.add_argument("--max_source_length", default=64, type=int,
                         help="The maximum total source sequence length after tokenization. Sequences longer "
                              "than this will be truncated, sequences shorter will be padded.")
@@ -98,7 +97,7 @@ def add_args(parser):
     parser.add_argument("--local_rank", type=int, default=-1,
                         help="For distributed training: local_rank")
     parser.add_argument('--seed', type=int, default=1234,
                         help="random seed for initialization")
     args = parser.parse_args()
 
     if args.task in ['summarize']:
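To make the effect of this change concrete: the removed `--block_size` flag is gone, and sequence length is now controlled through `--max_source_length`. A self-contained toy parser mirroring just the flags shown in this diff (an illustration, not the repo's actual entry point):

```python
import argparse

# Toy mirror of the revised flags from configs.py above (illustration only).
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True,
                    choices=['summarize', 'concode', 'translate', 'refine', 'defect', 'clone'])
parser.add_argument("--max_source_length", default=64, type=int)

args = parser.parse_args(['--task', 'defect', '--max_source_length', '512'])
print(args.task, args.max_source_length)  # prints: defect 512
```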
