Commit 2987b76

homink authored and r9y9 committed

Adding Korean read speech corpus (#44)
* adding scripts and modification for Korean (NIKL corpus)
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* fix file searching logic
* Update README.md
* Update README.md
* Update README.md
1 parent 48d1014 commit 2987b76

File tree

9 files changed: +372 additions, −2 deletions


README.md

Lines changed: 42 additions & 1 deletion
@@ -18,6 +18,15 @@ Audio samples are available at https://r9y9.github.io/deepvoice3_pytorch/.
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets
- Language-dependent frontend text processor for English and Japanese

### Samples

- [Ja Step000380000 Predicted](https://soundcloud.com/user-623907374/ja-step000380000-predicted)
- [Ja Step000370000 Predicted](https://soundcloud.com/user-623907374/ja-step000370000-predicted)
- [Ko_single Step000410000 Predicted](https://soundcloud.com/user-623907374/ko-step000410000-predicted)
- [Ko_single Step000400000 Predicted](https://soundcloud.com/user-623907374/ko-step000400000-predicted)
- [Ko_multi Step001680000 Predicted](https://soundcloud.com/user-623907374/step001680000-predicted)
- [Ko_multi Step001700000 Predicted](https://soundcloud.com/user-623907374/step001700000-predicted)

## Pretrained models

| URL | Model | Data | Hyper parameters | Git commit | Steps |
@@ -83,6 +92,7 @@ pip install -e ".[jp]"
- LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
- VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- NIKL (ko): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464

### 1. Preprocessing

@@ -97,6 +107,8 @@ Supported `${dataset_name}`s for now are
- `ljspeech` (en, single speaker)
- `vctk` (en, multi-speaker)
- `jsut` (jp, single speaker)
- `nikl_m` (ko, multi-speaker)
- `nikl_s` (ko, single speaker)

Suppose you want to preprocess the LJSpeech dataset and have it in `~/data/LJSpeech-1.0`; then you can preprocess the data by:

@@ -132,6 +144,15 @@ python train.py --data-root=./data/jsut --hparams="frontend=jp" --hparams="build
Note that there are many hyper parameters and design choices. Some are configurable by `hparams.py` and some are hardcoded in the source (e.g., dilation factor for each convolution layer). If you find better hyper parameters, please let me know!

#### NIKL

Please check [this](https://github.com/homink/deepvoice3_pytorch/blob/master/nikl_preprocess/README.md) in advance and follow the commands below.

```
python preprocess.py nikl_s ${your_nikl_root_path} data/nikl_s

python train.py --data-root=./data/nikl_s --checkpoint-dir checkpoint_nikl_s \
    --hparams="frontend=ko,builder=deepvoice3,preset=deepvoice3_nikls"
```

### 4. Monitor with Tensorboard

@@ -167,7 +188,10 @@ python synthesis.py --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" 
### Multi-speaker model

-Currently VCTK is the only supported dataset for building a multi-speaker model. Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to [vctk_preprocess](vctk_preprocess/).
+VCTK and NIKL are the supported datasets for building a multi-speaker model.
+
+#### VCTK
+Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to [vctk_preprocess](vctk_preprocess/).

Once you have phoneme alignment for each utterance, you can extract features by:

@@ -194,6 +218,23 @@ python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
This may improve training speed a bit.

#### NIKL

You can obtain cleaned-up audio samples in ../nikl_preprocess. Details can be found [here](https://github.com/homink/speech.ko).

Once the NIKL corpus is ready from the preprocessing, you can extract features by:

```
python preprocess.py nikl_m ${your_nikl_root_path} data/nikl_m
```

Now that you have the data prepared, you can train a multi-speaker version of DeepVoice3 by:

```
python train.py --data-root=./data/nikl_m --checkpoint-dir checkpoint_nikl_m \
    --hparams="frontend=ko,preset=deepvoice3_niklm,builder=deepvoice3_multispeaker"
```

### Speaker adaptation
198239

199240
If you have very limited data, then you can consider to try fine-turn pre-trained model. For example, using pre-trained model on LJSpeech, you can adapt it to data from VCTK speaker `p225` (30 mins) by the following command:

deepvoice3_pytorch/frontend/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -19,3 +19,9 @@
    from deepvoice3_pytorch.frontend import jp
except ImportError:
    jp = None

try:
    from deepvoice3_pytorch.frontend import ko
except ImportError:
    ko = None
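For context, the frontend registered this way is looked up by name at runtime, like the existing `en` and `jp` modules. Below is a minimal sketch of such a lookup; the `getattr`-based helper and the hard-coded `"ko"` are illustrative assumptions, not part of this diff:

```python
# Hypothetical usage sketch: select the language frontend by name at runtime.
from deepvoice3_pytorch import frontend

def get_frontend(name):
    # e.g. name = "en", "jp", or (with this commit) "ko"
    mod = getattr(frontend, name, None)
    if mod is None:
        raise ImportError("frontend '%s' is not available" % name)
    return mod

_frontend = get_frontend("ko")
print(_frontend.text_to_sequence("안녕"))
```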
deepvoice3_pytorch/frontend/ko/__init__.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# coding: utf-8

# The Korean frontend encodes text directly as Unicode codepoints.
n_vocab = 0xffff

_eos = 1
_pad = 0


def text_to_sequence(text, p=0.0):
    # p is unused; kept for API compatibility with the other frontends.
    return [ord(c) for c in text] + [_eos]  # append EOS


def sequence_to_text(seq):
    return "".join(chr(n) for n in seq)

hparams.py

Lines changed: 54 additions & 0 deletions
@@ -106,6 +106,60 @@
        "clip_thresh": 0.1,
        "initial_learning_rate": 5e-4,
    },
    "deepvoice3_niklm": {
        "n_speakers": 118,
        "speaker_embed_dim": 16,
        "downsample_step": 4,
        "outputs_per_step": 1,
        "embedding_weight_std": 0.1,
        "speaker_embedding_weight_std": 0.05,
        "dropout": 1 - 0.95,
        "kernel_size": 3,
        "text_embed_dim": 256,
        "encoder_channels": 512,
        "decoder_channels": 256,
        "converter_channels": 256,
        "use_guided_attention": True,
        "guided_attention_sigma": 0.4,
        "binary_divergence_weight": 0.1,
        "use_decoder_state_for_postnet_input": True,
        "max_positions": 1200,
        "query_position_rate": 2.0,
        "key_position_rate": 7.6,
        "key_projection": True,
        "value_projection": True,
        "clip_thresh": 0.1,
        "initial_learning_rate": 5e-4,
        "batch_size": 8,
    },
    "deepvoice3_nikls": {
        "n_speakers": 1,
        "speaker_embed_dim": 16,
        "downsample_step": 4,
        "outputs_per_step": 1,
        "embedding_weight_std": 0.1,
        "speaker_embedding_weight_std": 0.05,
        "dropout": 1 - 0.95,
        "kernel_size": 3,
        "text_embed_dim": 256,
        "encoder_channels": 512,
        "decoder_channels": 256,
        "converter_channels": 256,
        "use_guided_attention": True,
        "guided_attention_sigma": 0.4,
        "binary_divergence_weight": 0.1,
        "use_decoder_state_for_postnet_input": True,
        "max_positions": 512,
        "query_position_rate": 2.0,
        "key_position_rate": 7.6,
        "key_projection": True,
        "value_projection": True,
        "clip_thresh": 0.1,
        "initial_learning_rate": 5e-4,
        "batch_size": 8,
    },
},

# Audio:
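For orientation, presets like `deepvoice3_nikls` above are dictionaries of overrides applied on top of the base hyper parameters. Here is a minimal sketch of that merge, assuming a plain-dict representation (the real `hparams.py` object differs; this is illustration only):

```python
# Illustrative preset merge, assuming plain dicts (not the project's actual hparams object).
presets = {
    "deepvoice3_nikls": {"n_speakers": 1, "batch_size": 8, "max_positions": 512},
    "deepvoice3_niklm": {"n_speakers": 118, "batch_size": 8, "max_positions": 1200},
}

def apply_preset(base, name):
    merged = dict(base)
    merged.update(presets[name])  # preset values override the base
    return merged

base = {"batch_size": 16, "max_positions": 1024, "n_speakers": 1}
print(apply_preset(base, "deepvoice3_nikls"))  # {'batch_size': 8, 'max_positions': 512, 'n_speakers': 1}
```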

nikl_m.py

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import numpy as np
import os
import audio
import re


def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
    '''Preprocesses the NIKL multi-speaker dataset from a given input path into a given output directory.

      Args:
        in_dir: The directory where you have downloaded the NIKL dataset
        out_dir: The directory to write the output into
        num_workers: Optional number of worker processes to parallelize across
        tqdm: You can optionally pass tqdm to get a nice progress bar

      Returns:
        A list of tuples describing the training examples. This should be written to train.txt
    '''

    # We use ProcessPoolExecutor to parallelize across processes. This is just an optimization and you
    # can omit it and just call _process_utterance on each input if you want.

    # You will need to modify and format the NIKL transcription file in UTF-8;
    # please check https://github.com/homink/deepspeech.pytorch.ko/blob/master/data/local/clean_corpus.sh

    executor = ProcessPoolExecutor(max_workers=num_workers)
    futures = []

    # Map each speaker id (e.g. "fv01") to an integer index.
    spk_id = {}
    with open(in_dir + '/speaker.mid', encoding='utf-8') as f:
        for i, line in enumerate(f):
            spk_id[line.rstrip()] = i

    index = 1
    with open(in_dir + '/metadata.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('|')
            wav_path = parts[0]
            text = parts[1]
            # The speaker id is parsed from the wav path: two letters and two
            # digits (e.g. "fv01") immediately followed by "_t".
            uid = re.search(r'([a-z][a-z][0-9][0-9]_t)', wav_path)
            uid = uid.group(1).replace('_t', '')
            futures.append(executor.submit(
                partial(_process_utterance, out_dir, index, spk_id[uid], wav_path, text)))
            index += 1
    return [future.result() for future in tqdm(futures)]


def _process_utterance(out_dir, index, speaker_id, wav_path, text):
    '''Preprocesses a single utterance audio/text pair.

    This writes the mel and linear scale spectrograms to disk and returns a tuple to write
    to the train.txt file.

    Args:
      out_dir: The directory to write the spectrograms into
      index: The numeric index to use in the spectrogram filenames.
      speaker_id: The integer index of the speaker for this utterance
      wav_path: Path to the audio file containing the speech input
      text: The text spoken in the input audio file

    Returns:
      A (spectrogram_filename, mel_filename, n_frames, text, speaker_id) tuple to write to train.txt
    '''

    # Load the audio to a numpy array:
    wav = audio.load_wav(wav_path)

    # Compute the linear-scale spectrogram from the wav:
    spectrogram = audio.spectrogram(wav).astype(np.float32)
    n_frames = spectrogram.shape[1]

    # Compute a mel-scale spectrogram from the wav:
    mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)

    # Write the spectrograms to disk:
    spectrogram_filename = 'nikl-multi-spec-%05d.npy' % index
    mel_filename = 'nikl-multi-mel-%05d.npy' % index
    np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
    np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)

    # Return a tuple describing this training example:
    return (spectrogram_filename, mel_filename, n_frames, text, speaker_id)
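A minimal sketch of how `build_from_path` might be driven and its result written out; `preprocess.py` in this repo does the equivalent, and the paths below are hypothetical:

```python
# Illustrative driver for nikl_m.build_from_path (paths are hypothetical).
import os
from tqdm import tqdm
import nikl_m

in_dir = os.path.expanduser("~/corpora/NIKL")
out_dir = "data/nikl_m"
os.makedirs(out_dir, exist_ok=True)

metadata = nikl_m.build_from_path(in_dir, out_dir, num_workers=4, tqdm=tqdm)
with open(os.path.join(out_dir, "train.txt"), "w", encoding="utf-8") as f:
    for row in metadata:  # (spec, mel, n_frames, text, speaker_id)
        f.write("|".join(map(str, row)) + "\n")
```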

nikl_preprocess/README.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
# Preparation for Korean speech

## Corpus
https://github.com/homink/speech.ko

## Command

### Multi-speaker
```
cd nikl_preprocess
python prepare_metadata.py --corpus_dir ${corpus location} --trans_file ${corpus location}/trans.txt --spk_id ${corpus location}/speaker.mid
```
### Single-speaker
```
cd nikl_preprocess
python prepare_metadata.py --corpus_dir ${corpus location} --trans_file ${corpus location}/trans.txt --spk_id ${corpus location}/speaker.sid
```
The default single-speaker id is fv01. You can change it to another speaker id as described [here](https://github.com/homink/speech.ko).
nikl_preprocess/prepare_metadata.py

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
from __future__ import print_function
import subprocess, re


def pwrap(args, shell=False):
    p = subprocess.Popen(args, shell=shell, stdout=subprocess.PIPE,
                         stdin=subprocess.PIPE, stderr=subprocess.PIPE,
                         universal_newlines=True)
    return p


def execute(cmd, shell=False):
    popen = pwrap(cmd, shell=shell)
    for stdout_line in iter(popen.stdout.readline, ""):
        yield stdout_line

    popen.stdout.close()
    return_code = popen.wait()
    if return_code:
        raise subprocess.CalledProcessError(return_code, cmd)


def pe(cmd, shell=False):
    """
    Print and execute command on system
    """
    ret = []
    for line in execute(cmd, shell=shell):
        ret.append(line)
        print(line, end="")
    return ret


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Produce a metadata file where each wav file path is aligned with its transcription",
                                     epilog="Example usage: python prepare_metadata.py --corpus_dir $HOME/corpora/NIKL")
    parser.add_argument("--corpus_dir", "-c",
                        help="File path for the root directory of the corpus",
                        required=True)

    parser.add_argument("--trans_file", "-t",
                        help="Extracted transcription file obtained from extract_trans.py",
                        required=True)

    parser.add_argument("--spk_id", "-sid",
                        help="Speaker ID for a single speaker, such as fv01",
                        required=False)
    args = parser.parse_args()

    print("Prepare metadata file for all speakers")
    # List all wav files, skipping directories flagged as Bad/Non/Invalid.
    pe("find %s -name '%s' | grep -v 'Bad\|Non\|Invalid' > %s/wav.lst" % (args.corpus_dir, "*.wav", args.corpus_dir), shell=True)

    # Transcription file format: "<utterance id> <transcription>" per line.
    trans = {}
    with open(args.trans_file, "r") as f:
        for line in f:
            line = line.rstrip()
            line_split = line.split(" ")
            trans[line_split[0]] = " ".join(line_split[1:])

    with open(args.corpus_dir + "/wav.lst", "r") as f:
        wavfiles = f.readlines()

    # Write "wav path|transcription" pairs keyed by the tNN_sNN utterance id.
    pe("rm -f %s/metadata.txt" % (args.corpus_dir), shell=True)
    for w in wavfiles:
        w = w.rstrip()
        tid = re.search(r'(t[0-9][0-9]_s[0-9][0-9])', w)
        if tid:
            tid_found = tid.group(1)
            if tid_found in trans:  # skip wavs without a transcription
                pe('echo %s"|"%s >> %s/metadata.txt' % (w, trans[tid_found], args.corpus_dir), shell=True)

    print("Metadata file is created in %s/metadata.txt" % (args.corpus_dir))
    # Speaker lists: all speaker directories (speaker.mid) and the first one (speaker.sid).
    pe("ls -d -- %s/*/ | grep -v 'Bad\|Non\|Invalid' | rev | cut -d'/' -f2 | rev > %s/speaker.mid" % (args.corpus_dir, args.corpus_dir), shell=True)
    pe("head -n 1 %s/speaker.mid > %s/speaker.sid" % (args.corpus_dir, args.corpus_dir), shell=True)
