Commit 2987b76

homink authored and r9y9 committed

Adding Korean read speech corpus (#44)
* adding scripts and modification for Korean (NIKL corpus)
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* fix file searching logic
* Update README.md
* Update README.md
* Update README.md
1 parent 48d1014 commit 2987b76

File tree

9 files changed: +372 additions, −2 deletions


README.md

Lines changed: 42 additions & 1 deletion
@@ -18,6 +18,15 @@ Audio samples are available at https://r9y9.github.io/deepvoice3_pytorch/.
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets
- Language-dependent frontend text processor for English and Japanese

### Samples

- [Ja Step000380000 Predicted](https://soundcloud.com/user-623907374/ja-step000380000-predicted)
- [Ja Step000370000 Predicted](https://soundcloud.com/user-623907374/ja-step000370000-predicted)
- [Ko_single Step000410000 Predicted](https://soundcloud.com/user-623907374/ko-step000410000-predicted)
- [Ko_single Step000400000 Predicted](https://soundcloud.com/user-623907374/ko-step000400000-predicted)
- [Ko_multi Step001680000 Predicted](https://soundcloud.com/user-623907374/step001680000-predicted)
- [Ko_multi Step001700000 Predicted](https://soundcloud.com/user-623907374/step001700000-predicted)

## Pretrained models

| URL | Model | Data | Hyper parameters | Git commit | Steps |
@@ -83,6 +92,7 @@ pip install -e ".[jp]"
- LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
- VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- NIKL (ko): http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464

### 1. Preprocessing

@@ -97,6 +107,8 @@ Supported `${dataset_name}`s for now are
- `ljspeech` (en, single speaker)
- `vctk` (en, multi-speaker)
- `jsut` (jp, single speaker)
- `nikl_m` (ko, multi-speaker)
- `nikl_s` (ko, single speaker)

Suppose you want to preprocess the LJSpeech dataset and have it in `~/data/LJSpeech-1.0`; then you can preprocess the data by:

@@ -132,6 +144,15 @@ python train.py --data-root=./data/jsut --hparams="frontend=jp" --hparams="build
Note that there are many hyper parameters and design choices. Some are configurable by `hparams.py` and some are hardcoded in the source (e.g., dilation factor for each convolution layer). If you find better hyper parameters, please let me know!

#### NIKL

Please check [this](https://github.com/homink/deepvoice3_pytorch/blob/master/nikl_preprocess/README.md) in advance and follow the commands below.

```
python preprocess.py nikl_s ${your_nikl_root_path} data/nikl_s

python train.py --data-root=./data/nikl_s --checkpoint-dir checkpoint_nikl_s \
    --hparams="frontend=ko,builder=deepvoice3,preset=deepvoice3_nikls"
```

### 4. Monitor with Tensorboard

@@ -167,7 +188,10 @@ python synthesis.py --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" 
### Multi-speaker model

-Currently VCTK is the only supported dataset for building a multi-speaker model. Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to [vctk_preprocess](vctk_preprocess/).
+VCTK and NIKL are the supported datasets for building a multi-speaker model.
+
+#### VCTK
+Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to [vctk_preprocess](vctk_preprocess/).

Once you have phoneme alignment for each utterance, you can extract features by:

@@ -194,6 +218,23 @@ python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
This may improve training speed a bit.

#### NIKL

You can obtain cleaned-up audio samples in ../nikl_preprocess. Details can be found [here](https://github.com/homink/speech.ko).

Once the NIKL corpus is ready from the preprocessing, you can extract features by:

```
python preprocess.py nikl_m ${your_nikl_root_path} data/nikl_m
```

Now that you have the data prepared, you can train a multi-speaker version of DeepVoice3 by:

```
python train.py --data-root=./data/nikl_m --checkpoint-dir checkpoint_nikl_m \
    --hparams="frontend=ko,preset=deepvoice3_niklm,builder=deepvoice3_multispeaker"
```

### Speaker adaptation
198239

199240
If you have very limited data, then you can consider to try fine-turn pre-trained model. For example, using pre-trained model on LJSpeech, you can adapt it to data from VCTK speaker `p225` (30 mins) by the following command:

deepvoice3_pytorch/frontend/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -19,3 +19,9 @@
    from deepvoice3_pytorch.frontend import jp
except ImportError:
    jp = None

try:
    from deepvoice3_pytorch.frontend import ko
except ImportError:
    ko = None
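For context, the frontend registered this way is looked up by name at runtime, like the existing `en` and `jp` modules. Below is a minimal sketch of such a lookup; the `getattr`-based helper and the hard-coded `"ko"` are illustrative assumptions, not part of this diff:

```python
# Hypothetical usage sketch: select the language frontend by name at runtime.
from deepvoice3_pytorch import frontend

def get_frontend(name):
    # e.g. name = "en", "jp", or (with this commit) "ko"
    mod = getattr(frontend, name, None)
    if mod is None:
        raise ImportError("frontend '%s' is not available" % name)
    return mod

_frontend = get_frontend("ko")
print(_frontend.text_to_sequence("안녕"))
```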
deepvoice3_pytorch/frontend/ko/__init__.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# coding: utf-8

# The Korean frontend encodes text directly as Unicode codepoints.
n_vocab = 0xffff

_eos = 1
_pad = 0


def text_to_sequence(text, p=0.0):
    # p is unused; kept for API compatibility with the other frontends.
    return [ord(c) for c in text] + [_eos]  # append EOS


def sequence_to_text(seq):
    return "".join(chr(n) for n in seq)

hparams.py

Lines changed: 54 additions & 0 deletions
@@ -106,6 +106,60 @@
        "clip_thresh": 0.1,
        "initial_learning_rate": 5e-4,
    },
    "deepvoice3_niklm": {
        "n_speakers": 118,
        "speaker_embed_dim": 16,
        "downsample_step": 4,
        "outputs_per_step": 1,
        "embedding_weight_std": 0.1,
        "speaker_embedding_weight_std": 0.05,
        "dropout": 1 - 0.95,
        "kernel_size": 3,
        "text_embed_dim": 256,
        "encoder_channels": 512,
        "decoder_channels": 256,
        "converter_channels": 256,
        "use_guided_attention": True,
        "guided_attention_sigma": 0.4,
        "binary_divergence_weight": 0.1,
        "use_decoder_state_for_postnet_input": True,
        "max_positions": 1200,
        "query_position_rate": 2.0,
        "key_position_rate": 7.6,
        "key_projection": True,
        "value_projection": True,
        "clip_thresh": 0.1,
        "initial_learning_rate": 5e-4,
        "batch_size": 8,
    },
    "deepvoice3_nikls": {
        "n_speakers": 1,
        "speaker_embed_dim": 16,
        "downsample_step": 4,
        "outputs_per_step": 1,
        "embedding_weight_std": 0.1,
        "speaker_embedding_weight_std": 0.05,
        "dropout": 1 - 0.95,
        "kernel_size": 3,
        "text_embed_dim": 256,
        "encoder_channels": 512,
        "decoder_channels": 256,
        "converter_channels": 256,
        "use_guided_attention": True,
        "guided_attention_sigma": 0.4,
        "binary_divergence_weight": 0.1,
        "use_decoder_state_for_postnet_input": True,
        "max_positions": 512,
        "query_position_rate": 2.0,
        "key_position_rate": 7.6,
        "key_projection": True,
        "value_projection": True,
        "clip_thresh": 0.1,
        "initial_learning_rate": 5e-4,
        "batch_size": 8,
    },
},

# Audio:
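For orientation, presets like `deepvoice3_nikls` above are dictionaries of overrides applied on top of the base hyper parameters. Here is a minimal sketch of that merge, assuming a plain-dict representation (the real `hparams.py` object differs; this is illustration only):

```python
# Illustrative preset merge, assuming plain dicts (not the project's actual hparams object).
presets = {
    "deepvoice3_nikls": {"n_speakers": 1, "batch_size": 8, "max_positions": 512},
    "deepvoice3_niklm": {"n_speakers": 118, "batch_size": 8, "max_positions": 1200},
}

def apply_preset(base, name):
    merged = dict(base)
    merged.update(presets[name])  # preset values override the base
    return merged

base = {"batch_size": 16, "max_positions": 1024, "n_speakers": 1}
print(apply_preset(base, "deepvoice3_nikls"))  # {'batch_size': 8, 'max_positions': 512, 'n_speakers': 1}
```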

nikl_m.py

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import numpy as np
import os
import audio
import re


def build_from_path(in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
    '''Preprocesses the NIKL multi-speaker dataset from a given input path into a given output directory.

      Args:
        in_dir: The directory where you have downloaded the NIKL dataset
        out_dir: The directory to write the output into
        num_workers: Optional number of worker processes to parallelize across
        tqdm: You can optionally pass tqdm to get a nice progress bar

      Returns:
        A list of tuples describing the training examples. This should be written to train.txt
    '''

    # We use ProcessPoolExecutor to parallelize across processes. This is just an optimization and you
    # can omit it and just call _process_utterance on each input if you want.

    # You will need to modify and format the NIKL transcription file in UTF-8;
    # please check https://github.com/homink/deepspeech.pytorch.ko/blob/master/data/local/clean_corpus.sh

    executor = ProcessPoolExecutor(max_workers=num_workers)
    futures = []

    # Map each speaker id (e.g. "fv01") to an integer index.
    spk_id = {}
    with open(in_dir + '/speaker.mid', encoding='utf-8') as f:
        for i, line in enumerate(f):
            spk_id[line.rstrip()] = i

    index = 1
    with open(in_dir + '/metadata.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('|')
            wav_path = parts[0]
            text = parts[1]
            # The speaker id is parsed from the wav path: two letters and two
            # digits (e.g. "fv01") immediately followed by "_t".
            uid = re.search(r'([a-z][a-z][0-9][0-9]_t)', wav_path)
            uid = uid.group(1).replace('_t', '')
            futures.append(executor.submit(
                partial(_process_utterance, out_dir, index, spk_id[uid], wav_path, text)))
            index += 1
    return [future.result() for future in tqdm(futures)]


def _process_utterance(out_dir, index, speaker_id, wav_path, text):
    '''Preprocesses a single utterance audio/text pair.

    This writes the mel and linear scale spectrograms to disk and returns a tuple to write
    to the train.txt file.

    Args:
      out_dir: The directory to write the spectrograms into
      index: The numeric index to use in the spectrogram filenames.
      speaker_id: The integer index of the speaker for this utterance
      wav_path: Path to the audio file containing the speech input
      text: The text spoken in the input audio file

    Returns:
      A (spectrogram_filename, mel_filename, n_frames, text, speaker_id) tuple to write to train.txt
    '''

    # Load the audio to a numpy array:
    wav = audio.load_wav(wav_path)

    # Compute the linear-scale spectrogram from the wav:
    spectrogram = audio.spectrogram(wav).astype(np.float32)
    n_frames = spectrogram.shape[1]

    # Compute a mel-scale spectrogram from the wav:
    mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)

    # Write the spectrograms to disk:
    spectrogram_filename = 'nikl-multi-spec-%05d.npy' % index
    mel_filename = 'nikl-multi-mel-%05d.npy' % index
    np.save(os.path.join(out_dir, spectrogram_filename), spectrogram.T, allow_pickle=False)
    np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)

    # Return a tuple describing this training example:
    return (spectrogram_filename, mel_filename, n_frames, text, speaker_id)
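A minimal sketch of how `build_from_path` might be driven and its result written out; `preprocess.py` in this repo does the equivalent, and the paths below are hypothetical:

```python
# Illustrative driver for nikl_m.build_from_path (paths are hypothetical).
import os
from tqdm import tqdm
import nikl_m

in_dir = os.path.expanduser("~/corpora/NIKL")
out_dir = "data/nikl_m"
os.makedirs(out_dir, exist_ok=True)

metadata = nikl_m.build_from_path(in_dir, out_dir, num_workers=4, tqdm=tqdm)
with open(os.path.join(out_dir, "train.txt"), "w", encoding="utf-8") as f:
    for row in metadata:  # (spec, mel, n_frames, text, speaker_id)
        f.write("|".join(map(str, row)) + "\n")
```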

nikl_preprocess/README.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
# Preparation for Korean speech

## Corpus
https://github.com/homink/speech.ko

## Command

### Multi-speaker
```
cd nikl_preprocess
python prepare_metadata.py --corpus_dir ${corpus location} --trans_file ${corpus location}/trans.txt --spk_id ${corpus location}/speaker.mid
```
### Single-speaker
```
cd nikl_preprocess
python prepare_metadata.py --corpus_dir ${corpus location} --trans_file ${corpus location}/trans.txt --spk_id ${corpus location}/speaker.sid
```
The default single-speaker id is fv01. You can change it to another speaker id as described [here](https://github.com/homink/speech.ko).
nikl_preprocess/prepare_metadata.py

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
from __future__ import print_function
import subprocess, re


def pwrap(args, shell=False):
    p = subprocess.Popen(args, shell=shell, stdout=subprocess.PIPE,
                         stdin=subprocess.PIPE, stderr=subprocess.PIPE,
                         universal_newlines=True)
    return p


def execute(cmd, shell=False):
    popen = pwrap(cmd, shell=shell)
    for stdout_line in iter(popen.stdout.readline, ""):
        yield stdout_line

    popen.stdout.close()
    return_code = popen.wait()
    if return_code:
        raise subprocess.CalledProcessError(return_code, cmd)


def pe(cmd, shell=False):
    """
    Print and execute command on system
    """
    ret = []
    for line in execute(cmd, shell=shell):
        ret.append(line)
        print(line, end="")
    return ret


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Produce a metadata file where each wav file path is aligned with its transcription",
                                     epilog="Example usage: python prepare_metadata.py --corpus_dir $HOME/corpora/NIKL")
    parser.add_argument("--corpus_dir", "-c",
                        help="File path for the root directory of the corpus",
                        required=True)

    parser.add_argument("--trans_file", "-t",
                        help="Extracted transcription file obtained from extract_trans.py",
                        required=True)

    parser.add_argument("--spk_id", "-sid",
                        help="Speaker ID for a single speaker, such as fv01",
                        required=False)
    args = parser.parse_args()

    print("Prepare metadata file for all speakers")
    # List all wav files, skipping directories flagged as Bad/Non/Invalid.
    pe("find %s -name '%s' | grep -v 'Bad\|Non\|Invalid' > %s/wav.lst" % (args.corpus_dir, "*.wav", args.corpus_dir), shell=True)

    # Transcription file format: "<utterance id> <transcription>" per line.
    trans = {}
    with open(args.trans_file, "r") as f:
        for line in f:
            line = line.rstrip()
            line_split = line.split(" ")
            trans[line_split[0]] = " ".join(line_split[1:])

    with open(args.corpus_dir + "/wav.lst", "r") as f:
        wavfiles = f.readlines()

    # Write "wav path|transcription" pairs keyed by the tNN_sNN utterance id.
    pe("rm -f %s/metadata.txt" % (args.corpus_dir), shell=True)
    for w in wavfiles:
        w = w.rstrip()
        tid = re.search(r'(t[0-9][0-9]_s[0-9][0-9])', w)
        if tid:
            tid_found = tid.group(1)
            if tid_found in trans:  # skip wavs without a transcription
                pe('echo %s"|"%s >> %s/metadata.txt' % (w, trans[tid_found], args.corpus_dir), shell=True)

    print("Metadata file is created in %s/metadata.txt" % (args.corpus_dir))
    # Speaker lists: all speaker directories (speaker.mid) and the first one (speaker.sid).
    pe("ls -d -- %s/*/ | grep -v 'Bad\|Non\|Invalid' | rev | cut -d'/' -f2 | rev > %s/speaker.mid" % (args.corpus_dir, args.corpus_dir), shell=True)
    pe("head -n 1 %s/speaker.mid > %s/speaker.sid" % (args.corpus_dir, args.corpus_dir), shell=True)
