Commit 22e9512

Update README.md
Author: Miltos
1 parent 6b8f409 commit 22e9512

1 file changed: +5 -6 lines changed

README.md

Lines changed: 5 additions & 6 deletions
@@ -1,19 +1,18 @@
-Near-Duplicate Code Detector
-===
+# Near-Duplicate Code Detector

-This cross-platform sample tool detects exact and near duplicates of code maintained by MSRC. It has been created for the purpose of deduplicating code corpora for research purposes.
+This cross-platform sample tool detects exact and near duplicates of code maintained by the [Deep Program Understanding](https://www.microsoft.com/en-us/research/project/program/) group in Microsoft Research, Cambridge, UK. It has been created for the purpose of deduplicating code corpora for research purposes.

 *Requirements*: .NET Core 2.1 or higher. For parsing code, an appropriate runtime for each of the languages that needs to be tokenized is also required.

 To run the near-duplicate detection run:
 ```
-dotnet run /path/to/DuplicateCodeDetector.csproj detect path/to/dataFolder outputFile
+$ dotnet run /path/to/DuplicateCodeDetector.csproj detect path/to/dataFolder outputFile
 ```
 This will use all the `.jsonl.gz` files in the `dataFolder` and output an `outputFile` with the duplicate pairs and an `outputFile.json` with the groups of detected duplicates.

 ### Input Data

-The input data should be one or more `.jsonl.gz` files. These are compressed files where each line has a single JSON entry of the form
+The input data should be one or more `.jsonl.gz` files. These are compressed [JSONL](http://jsonlines.org/) files where each line has a single JSON entry of the form
 ```
 {
 "filename": "unique identifier of file, such as a path or a unique id",
@@ -22,7 +21,7 @@ The input data should be one or more `.jsonl.gz` files. These are compressed fil
 ```

 The `tokenizers` folder in this repository contains tokenizers for
-C\#, Java, JavaScript and Python. Feel free to contribute tokenizers for other languages too.
+C\#, Java, JavaScript and Python. Please, feel free to contribute tokenizers for other languages too.

 # Contributing

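The input format described in the README hunk above (gzip-compressed JSONL with one JSON object per line and a unique `filename` per entry) can be sanity-checked before running the detector. The following is a minimal Python sketch, not part of the repository: the helper names `iter_entries` and `repeated_filenames` are hypothetical, it assumes the `.jsonl.gz` files sit directly in the data folder (whether the tool also searches subfolders is not stated), and it reads only the `filename` field because the remaining fields are not shown in this diff.

```python
import gzip
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def iter_entries(data_folder: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of each .jsonl.gz file.

    Assumption: only files directly inside data_folder are read.
    """
    for path in sorted(Path(data_folder).glob("*.jsonl.gz")):
        # "rt" makes gzip decompress and decode the file as text.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def repeated_filenames(entries: Iterable[Dict]) -> List[str]:
    """Return the sorted "filename" values that occur more than once.

    Raises KeyError if an entry has no "filename" field.
    """
    counts = Counter(entry["filename"] for entry in entries)
    return sorted(name for name, count in counts.items() if count > 1)
```

Calling `repeated_filenames(iter_entries("path/to/dataFolder"))` returns an empty list when every identifier is unique, which is what the README asks for.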
