
Commit 0bf3c0c

committed: update readme & other scripts, and support bart
1 parent 4b5c7ca commit 0bf3c0c

File tree: 10 files changed, +278 -278 lines changed


README.md

Lines changed: 85 additions & 99 deletions
@@ -1,16 +1,27 @@
 # CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
 
 This is the official PyTorch implementation for the following EMNLP 2021 paper from Salesforce Research:
 
 **Title**: [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf)
 
 **Authors**: [Yue Wang](https://yuewang-cuhk.github.io/), [Weishi Wang](https://www.linkedin.com/in/weishi-wang/), [Shafiq Joty](https://raihanjoty.github.io/), and [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home)
 
 ![CodeT5 demo](codet5.gif)
 
 ## Updates
+
+**Oct 29, 2021**
+
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models) for all the downstream tasks covered in the paper.
+
 **Oct 25, 2021**
 
-We release a CodeT5-base fine-tuned checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for multi-lingual code summarization. Below is how to use this model:
+We release a CodeT5-base fine-tuned checkpoint ([Salesforce/codet5-base-multi-sum](https://huggingface.co/Salesforce/codet5-base-multi-sum)) for multilingual code summarization. Below is how to use this model:
 
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
@@ -39,26 +50,17 @@ if __name__ == '__main__':
     # this prints: "Convert a SVG string to a QImage."
 ```
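The diff elides the middle of this snippet. As a point of reference for readers of this commit page, a minimal, self-contained sketch of the summarization usage follows; the checkpoint name comes from the README above, while the input function here is made up:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Sketch only: loads the multilingual summarization checkpoint named in the README.
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

code = "def add(a, b):\n    return a + b"  # any function body to summarize (made-up input)
input_ids = tokenizer(code, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```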
 
-It significantly outperforms previous methods on code summarization in the [CodeXGLUE benchmark](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text):
-| Model | Ruby | Javascript | Go | Python | Java | PHP | Overall |
-| ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
-| Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 |
-| Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 |
-| [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) | 11.17 | 11.90 | 17.72 | 18.14 | 16.47 | 24.02 | 16.57 |
-| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
-| [PLBART](https://aclanthology.org/2021.naacl-main.211.pdf) | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
-| [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | **15.24** | **16.18** | **19.95** | **20.42** | **20.26** | **26.10** | **19.69** |
 
 **Oct 18, 2021**
 
 We add a [model card](https://github.com/salesforce/CodeT5/blob/main/CodeT5_model_card.pdf) for CodeT5! Please reach out if you have any questions about it.
 
 **Sep 24, 2021**
 
 CodeT5 is now on [Hugging Face](https://huggingface.co/)!
 
 You can simply load the model ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and do the inference:
 
 ```python
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
@@ -76,22 +78,30 @@ print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
 ```
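The body of this snippet is also elided by the diff. A minimal sketch of such an inference call with the pre-trained checkpoints (hypothetical input; this is the standard T5-style masked-span usage and not necessarily the exact example in the README):

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Sketch only: CodeT5-base from the Hugging Face hub, used for masked-span prediction.
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')

text = "def greet(user): print(f'hello <extra_id_0>!')"  # made-up input with one masked span
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```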
 
 ## Introduction
 
 This repo provides the code for reproducing the experiments in [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/pdf/2109.00859.pdf). CodeT5 is a new pre-trained encoder-decoder model for programming languages, which is pre-trained on **8.35M** functions in 8 programming languages (Python, Java, JavaScript, PHP, Ruby, Go, C, and C#). In total, it achieves state-of-the-art results on **14 sub-tasks** in a code intelligence benchmark - [CodeXGLUE](https://github.com/microsoft/CodeXGLUE).
 
 Paper link: https://arxiv.org/abs/2109.00859
 
 Blog link: https://blog.einstein.ai/codet5/
 
-The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE.
+The code currently includes two pre-trained checkpoints ([CodeT5-small](https://huggingface.co/Salesforce/codet5-small) and [CodeT5-base](https://huggingface.co/Salesforce/codet5-base)) and scripts to fine-tune them on 4 generation tasks (code summarization, code generation, translation, and refinement) plus 2 understanding tasks (code defect detection and clone detection) in CodeXGLUE. We also provide their fine-tuned checkpoints to facilitate easy replication of our paper.
 
 In practice, CodeT5 can be deployed as an AI-powered coding assistant to boost the productivity of software developers. At Salesforce, we build an [AI coding assistant demo](https://github.com/salesforce/CodeT5/raw/main/codet5.gif) using CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 
 - **Text-to-code generation**: generate code based on the natural language description.
 - **Code autocompletion**: complete the whole function of code given the target function name.
 - **Code summarization**: generate the summary of a function in natural language description.
 
 ## Table of Contents
 
@@ -103,7 +113,9 @@ At Salesforce, we build an [AI coding assistant demo](https://github.com/salesfo
 6. [Get Involved](#get-involved)
 
 ## Citation
+
 If you find this code to be useful for your research, please consider citing.
+
 ```
 @inproceedings{
     wang2021codet5,
@@ -115,116 +127,90 @@ If you find this code to be useful for your research, please consider citing.
 ```
 
 ## License
+
 The code is released under the BSD-3 License (see `LICENSE.txt` for details), but we also ask that users respect the following:
 
 This software should not be used to promote or profit from:
 
 violence, hate, and division,
 
 environmental destruction,
 
 abuse of human rights, or
 
 the destruction of people's physical and mental health.
 
 We encourage users of this software to tell us about the applications in which they are putting it to use by emailing codeT5@salesforce.com, and to use [appropriate](https://arxiv.org/abs/1810.03993) [documentation](https://www.partnershiponai.org/about-ml/) when developing high-stakes applications of this model.
 
 ## Dependency
+
 - Pytorch 1.7.1
 - tensorboard 2.4.1
 - transformers 4.6.1
 - tree-sitter 0.2.2
 
-## Download
-
-* [Pre-trained checkpoints & Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research)
-* Fine-tuned checkpoints (TBA)
-* Extra C/C# pre-training data (TBA)
-
-Instructions to download:
-```
-pip install gsutil
-
-gsutil -m cp -r "gs://sfr-codet5-data-research/data/" .
-
-mkdir pretrained_models; cd pretrained_models
-gsutil -m cp -r \
-  "gs://sfr-codet5-data-research/pretrained_models/codet5_small" \
-  "gs://sfr-codet5-data-research/pretrained_models/codet5_base" \
-  .
-```
-
-The repository structure will look like the following after the download:
-```
-├── CODE_OF_CONDUCT.md
-├── README.md
-├── SECURITY.md
-├── codet5.gif
-├── configs.py
-├── models.py
-├── run_clone.py
-├── run_gen.py
-├── utils.py
-├── _utils.py
-├── LICENSE.txt
-├── data
-│   ├── clone
-│   ├── concode
-│   ├── defect
-│   ├── refine
-│   │   ├── medium
-│   │   └── small
-│   ├── summarize
-│   │   ├── go
-│   │   ├── java
-│   │   ├── javascript
-│   │   ├── php
-│   │   ├── python
-│   │   └── ruby
-│   └── translate
-├── evaluator
-│   ├── bleu.py
-│   ├── smooth_bleu.py
-│   └── CodeBLEU
-├── pretrained_models
-│   ├── codet5_base
-│   └── codet5_small
-├── sh
-│   ├── exp_with_args.sh
-│   ├── run_exp.py
-│   ├── results
-│   ├── saved_models
-│   └── tensorboard
-└── tokenizer
-    └── salesforce
-        ├── codet5-merges.txt
-        └── codet5-vocab.json
-```
+## Download
+
+* [Pre-trained checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/pretrained_models)
+* [Fine-tuning data](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/data)
+* [Fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+
+Instructions to download:
+
+```
+# pip install gsutil
+cd your-cloned-codet5-path
+
+gsutil -m cp -r "gs://sfr-codet5-data-research/pretrained_models" .
+gsutil -m cp -r "gs://sfr-codet5-data-research/data" .
+gsutil -m cp -r "gs://sfr-codet5-data-research/finetuned_models" .
+```
 
 ## Fine-tuning
 
-Go to `sh` folder, set the `WORKDIR` in `exp_with_args.sh` to be your downloaded CodeT5 repository path.
-
-You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task` arguments. In total, we support four models (i.e., ['roberta', 'codebert', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use the `sub_task` to specify which specific datasets to fine-tune on.
-
-For example, if you want to run the CodeT5-base model on the code summarization task for Ruby, you can simply run:
+Go to the `sh` folder and set the `WORKDIR` in `exp_with_args.sh` to your cloned CodeT5 repository path.
+
+You can use `run_exp.py` to run a broad set of experiments by simply passing the `model_tag`, `task`, and `sub_task` arguments. In total, we support five models (i.e., ['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base']) and six tasks (i.e., ['summarize', 'concode', 'translate', 'refine', 'defect', 'clone']). For each task, we use the `sub_task` to specify which specific datasets to fine-tune on. Below is the full list:
+
+| --task    | --sub_task                         | Description                                                                                                                      |
+| --------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
+| summarize | ruby/javascript/go/python/java/php | code summarization task on [CodeSearchNet](https://arxiv.org/abs/1909.09436) data with six PLs                                    |
+| concode   | none                               | text-to-code generation on [Concode](https://aclanthology.org/D18-1192.pdf) data                                                  |
+| translate | java-cs/cs-java                    | code-to-code translation between [Java and C#](https://arxiv.org/pdf/2102.04664.pdf)                                              |
+| refine    | small/medium                       | code refinement on [code repair data](https://arxiv.org/pdf/1812.08693.pdf) with small/medium functions                           |
+| defect    | none                               | code defect detection in [C/C++ data](https://proceedings.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf)  |
+| clone     | none                               | code clone detection in [Java data](https://arxiv.org/pdf/2002.08653.pdf)                                                         |
+
+For example, if you want to run the CodeT5-base model on the code summarization task for Python, you can simply run:
 
 ```
-python run_exp.py --model_tag codet5_base --task summarize --sub_task ruby
+python run_exp.py --model_tag codet5_base --task summarize --sub_task python
 ```
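Since this commit also adds `bart_base` to the supported model tags (per the commit message "update readme & other scripts, and support bart"), an analogous invocation, assuming the same interface, would be:

```
python run_exp.py --model_tag bart_base --task summarize --sub_task python
```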
 
 Besides, you can specify:
+
 ```
 model_dir: where to save fine-tuning checkpoints
 res_dir: where to save the performance results
 summary_dir: where to save the training curves
 data_num: how many data instances to use, the default -1 is for using the full data
 gpu: the index of the GPU to use in the cluster
 ```
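As a concrete illustration (hypothetical values, and assuming these options are passed as command-line flags to `run_exp.py`), a quick run on a 1,000-instance subset of the Python summarization data on GPU 0 might look like:

```
python run_exp.py --model_tag codet5_base --task summarize --sub_task python --data_num 1000 --gpu 0
```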
-You can also revise the suggested arguments [here](https://github.com/salesforce/CodeT5/blob/4f8818aea1bf170f019381671087e4c4f9608005/sh/run_exp.py#L14) and refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full available options.
-The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
+You can also revise the suggested arguments [here](https://github.com/salesforce/CodeT5/blob/4f8818aea1bf170f019381671087e4c4f9608005/sh/run_exp.py#L14) or directly customize the [exp_with_args.sh](https://github.com/salesforce/CodeT5/blob/main/sh/exp_with_args.sh) bash file. Please refer to the argument flags in [configs.py](https://github.com/salesforce/CodeT5/blob/main/configs.py) for the full available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 
 ## Get Involved
 
-Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports.
-We welcome PRs!
+Please create a GitHub issue if you have any questions, suggestions, requests or bug reports. We welcome PRs!
 

_utils.py

Lines changed: 25 additions & 2 deletions
@@ -66,12 +66,22 @@ def convert_clone_examples_to_features(item):
     else:
         source_str = example.source
         target_str = example.target
-    code1 = tokenizer.encode(source_str, max_length=args.block_size, padding='max_length', truncation=True)
-    code2 = tokenizer.encode(target_str, max_length=args.block_size, padding='max_length', truncation=True)
+    code1 = tokenizer.encode(source_str, max_length=args.max_source_length, padding='max_length', truncation=True)
+    code2 = tokenizer.encode(target_str, max_length=args.max_source_length, padding='max_length', truncation=True)
     source_ids = code1 + code2
     return CloneInputFeatures(example_index, source_ids, example.label, example.url1, example.url2)
 
 
+def convert_defect_examples_to_features(item):
+    example, example_index, tokenizer, args = item
+    if args.model_type in ['t5', 'codet5'] and args.add_task_prefix:
+        source_str = "{}: {}".format(args.task, example.source)
+    else:
+        source_str = example.source
+    code = tokenizer.encode(source_str, max_length=args.max_source_length, padding='max_length', truncation=True)
+    return DefectInputFeatures(example_index, code, example.target)
+
+
 class CloneInputFeatures(object):
     """A single training/test features for a example."""
 
@@ -89,6 +99,19 @@ def __init__(self,
         self.url2 = url2
 
 
+class DefectInputFeatures(object):
+    """A single training/test features for a example."""
+
+    def __init__(self,
+                 example_id,
+                 source_ids,
+                 label
+                 ):
+        self.example_id = example_id
+        self.source_ids = source_ids
+        self.label = label
+
+
 class InputFeatures(object):
     """A single training/test features for a example."""
 
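For readers of this diff, a small, hypothetical usage sketch of the new defect converter (not part of the commit): it assumes example objects exposing `.source` and `.target`, a Hugging Face tokenizer, and an `args` namespace carrying the `model_type`, `add_task_prefix`, `task`, and `max_source_length` fields the converter reads above.

```python
# Hypothetical illustration only; everything except the converter's own interface is assumed.
from argparse import Namespace
from transformers import RobertaTokenizer

from _utils import convert_defect_examples_to_features

class Example:                       # stand-in for the repo's defect example objects
    def __init__(self, source, target):
        self.source = source         # the C/C++ function body
        self.target = target         # 0/1 defect label

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
args = Namespace(model_type='codet5', add_task_prefix=True, task='defect', max_source_length=512)

examples = [Example("int f(int x) { return x / 0; }", 1)]
features = [convert_defect_examples_to_features((ex, idx, tokenizer, args))
            for idx, ex in enumerate(examples)]
print(features[0].example_id, len(features[0].source_ids), features[0].label)  # 0 512 1
```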

configs.py

Lines changed: 2 additions & 3 deletions
@@ -9,7 +9,7 @@
 
 def add_args(parser):
     parser.add_argument("--task", type=str, required=True,
-                        choices=['summarize', 'refine', 'translate', 'concode', 'clone', 'defect'])
+                        choices=['summarize', 'concode', 'translate', 'refine', 'defect', 'clone'])
     parser.add_argument("--sub_task", type=str, default='')
     parser.add_argument("--lang", type=str, default='')
     parser.add_argument("--eval_task", type=str, default='')
@@ -49,7 +49,6 @@ def add_args(parser):
                         help="Pretrained config name or path if not the same as model_name")
     parser.add_argument("--tokenizer_name", default="roberta-base", type=str,
                         help="Pretrained tokenizer name or path if not the same as model_name")
-    parser.add_argument("--block_size", default=512, type=int)
     parser.add_argument("--max_source_length", default=64, type=int,
                         help="The maximum total source sequence length after tokenization. Sequences longer "
                              "than this will be truncated, sequences shorter will be padded.")
@@ -98,7 +97,7 @@ def add_args(parser):
     parser.add_argument("--local_rank", type=int, default=-1,
                         help="For distributed training: local_rank")
     parser.add_argument('--seed', type=int, default=1234,
                         help="random seed for initialization")
     args = parser.parse_args()
 
     if args.task in ['summarize']:
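To make the effect of this change concrete: the removed `--block_size` flag is gone, and sequence length is now controlled through `--max_source_length`. A self-contained toy parser mirroring just the flags shown in this diff (an illustration, not the repo's actual entry point):

```python
import argparse

# Toy mirror of the revised flags from configs.py above (illustration only).
parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, required=True,
                    choices=['summarize', 'concode', 'translate', 'refine', 'defect', 'clone'])
parser.add_argument("--max_source_length", default=64, type=int)

args = parser.parse_args(['--task', 'defect', '--max_source_length', '512'])
print(args.task, args.max_source_length)  # prints: defect 512
```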
