From c78803204727e88e056bfd21d141cd41b35b647d Mon Sep 17 00:00:00 2001 From: ronliwag Date: Tue, 21 Oct 2025 01:43:09 +0800 Subject: [PATCH 1/7] Fix audio processing issues and improve demo functionality - Fixed 404 errors for /process, /output, /asr, and /translation routes - Resolved TorchAudio sox dependency issues with custom audio conversion - Fixed pydub ffmpeg dependency by using soundfile for audio processing - Added proper audio normalization to prevent loud output - Fixed audio synchronization between input and output waveforms - Added favicon route to eliminate 404 errors - Updated requirements.txt with flexible version requirements - Added comprehensive .gitignore to exclude large model files and virtual environment --- .gitignore | 139 +++++++++ SETUP_INSTRUCTIONS.md | 264 ++++++++++++++++++ configs/es-en/config_gcmvn.yaml | 6 +- configs/es-en/config_mtl_asr_st_ctcst.yaml | 12 +- demo/.gitignore | 15 + demo/app.py | 247 ++++++++++++---- demo/config.json | 11 +- demo/paths_config_template.json | 32 +++ demo/setup_paths.py | 105 +++++++ demo/templates/index.html | 18 +- fairseq/examples/speech_to_text/__init__.py | 4 + .../speech_to_speech/modules/__init__.py | 1 + requirements.txt | 51 ++++ 13 files changed, 826 insertions(+), 79 deletions(-) create mode 100644 .gitignore create mode 100644 SETUP_INSTRUCTIONS.md create mode 100644 demo/.gitignore create mode 100644 demo/paths_config_template.json create mode 100644 demo/setup_paths.py create mode 100644 fairseq/examples/speech_to_text/__init__.py create mode 100644 requirements.txt diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..3fff701 --- /dev/null +++ b/.gitignore @@ -0,0 +1,139 @@ +# Virtual Environment +streamspeech_env/ + +# Large Model Files (move to drive) +pretrain_models/ +*.pt +*.pth +*.bin +*.safetensors + +# Python Cache +__pycache__/ +*.py[cod] +*$py.class +*.so + +# Distribution / packaging +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# PyInstaller +*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +.hypothesis/ +.pytest_cache/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/_build/ + +# PyBuilder +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# pyenv +.python-version + +# celery beat schedule file +celerybeat-schedule + +# SageMath parsed files +*.sage.py + +# Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# IDE +.vscode/ +.idea/ +*.swp +*.swo +*~ + +# OS +.DS_Store +.DS_Store? +._* +.Spotlight-V100 +.Trashes +ehthumbs.db +Thumbs.db + +# Audio files (if any) +*.wav +*.mp3 +*.flac +*.ogg +*.m4a + +# Temporary files +*.tmp +*.temp diff --git a/SETUP_INSTRUCTIONS.md b/SETUP_INSTRUCTIONS.md new file mode 100644 index 0000000..2a6305f --- /dev/null +++ b/SETUP_INSTRUCTIONS.md @@ -0,0 +1,264 @@ +# StreamSpeech Setup Instructions + +This guide will help you set up StreamSpeech for simultaneous speech-to-speech translation on Windows. 
+ +## Prerequisites + +- **Python 3.10** (required - other versions may not work) +- **CUDA-capable GPU** (recommended for optimal performance) +- **Windows 10/11** (tested on Windows 10.0.19045) +- **Git** (for cloning repositories) + +## Quick Setup + +### 1. Install Python 3.10 + +If you don't have Python 3.10, install it using Windows Package Manager: + +```powershell +winget install Python.Python.3.10 +``` + +Verify installation: +```powershell +py -3.10 --version +``` + +### 2. Clone and Setup Environment + +```powershell +# Navigate to your desired directory +cd D:\StreamSpeech + +# Create virtual environment with Python 3.10 +py -3.10 -m venv streamspeech_env + +# Activate virtual environment +streamspeech_env\Scripts\activate + +# Upgrade pip +python -m pip install --upgrade pip +``` + +### 3. Install Dependencies + +#### Option A: Install from requirements.txt (Recommended) +```powershell +pip install -r requirements.txt +``` + +#### Option B: Manual installation +```powershell +# Install PyTorch with CUDA support +pip install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 + +# Install fairseq +pip install fairseq + +# Install SimulEval (editable mode) +cd SimulEval +pip install --editable ./ +cd .. + +# Install Flask for web demo +pip install flask +``` + +### 4. Verify Installation + +```powershell +python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')" +python -c "import fairseq; print('Fairseq: OK')" +python -c "import simuleval; print('SimulEval: OK')" +python -c "import flask; print('Flask: OK')" +``` + +## Model Setup + +### 1. Download Pre-trained Models + +Create a `pretrain_models` directory and download the required models: + +```powershell +mkdir pretrain_models +cd pretrain_models +``` + +#### StreamSpeech Models (choose one language pair): + +**French-English:** +- Simultaneous: [streamspeech.simultaneous.fr-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.fr-en.pt) +- Offline: [streamspeech.offline.fr-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.fr-en.pt) + +**Spanish-English:** +- Simultaneous: [streamspeech.simultaneous.es-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.es-en.pt) +- Offline: [streamspeech.offline.es-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.es-en.pt) + +**German-English:** +- Simultaneous: [streamspeech.simultaneous.de-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.de-en.pt) +- Offline: [streamspeech.offline.de-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.de-en.pt) + +#### HiFi-GAN Vocoder: +- Model: [g_00500000](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000) +- Config: [config.json](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json) + +Create the vocoder directory structure: +```powershell +mkdir unit-based_HiFi-GAN_vocoder\mHuBERT.layer11.km1000.en +# Place g_00500000 and config.json in this directory +``` + +### 2. Configure Paths + +#### Option A: Automatic Setup (Recommended) +```powershell +cd demo +python setup_paths.py +``` + +#### Option B: Manual Setup +1. 
Copy the template: `cp paths_config_template.json paths_config.json` +2. Edit `paths_config.json` with your actual paths: + +```json +{ + "streamspeech_root": "D:/StreamSpeech", + "pretrain_models_root": "D:/StreamSpeech/pretrain_models", + "language_pair": "es-en", + "models": { + "simultaneous": "D:/StreamSpeech/pretrain_models/streamspeech.simultaneous.es-en.pt", + "offline": "D:/StreamSpeech/pretrain_models/streamspeech.offline.es-en.pt" + }, + "vocoder": { + "checkpoint": "D:/StreamSpeech/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000", + "config": "D:/StreamSpeech/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json" + }, + "configs": { + "data_bin": "D:/StreamSpeech/configs/es-en", + "user_dir": "D:/StreamSpeech/researches/ctc_unity", + "agent_dir": "D:/StreamSpeech/agent" + } +} +``` + +#### Update Language Config Files +Update config files in `configs/es-en/`: +- Replace `/data/zhangshaolei/StreamSpeech` with your actual StreamSpeech path in: + - `config_gcmvn.yaml` + - `config_mtl_asr_st_ctcst.yaml` + +## Running the Application + +### 1. Command Line Interface + +```powershell +# Activate environment +streamspeech_env\Scripts\activate + +# Set CUDA device +$env:CUDA_VISIBLE_DEVICES="0" + +# Run inference +cd demo +python infer.py --data-bin ../configs/fr-en --user-dir ../researches/ctc_unity --agent-dir ../agent --model-path ../pretrain_models/streamspeech.simultaneous.fr-en.pt --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml --segment-size 320 --vocoder ../pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000 --vocoder-cfg ../pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json --dur-prediction +``` + +### 2. Web Demo + +```powershell +# Activate environment +streamspeech_env\Scripts\activate + +# Start web server +cd demo +python app.py +``` + +Open your browser to `http://localhost:7860` + +## Features + +- **Streaming ASR**: Real-time speech recognition +- **Simultaneous S2TT**: Speech-to-text translation +- **Simultaneous S2ST**: Speech-to-speech translation +- **Adjustable Latency**: 320ms to 5000ms +- **Real-time Results**: Live updates during playback + +## Troubleshooting + +### Common Issues: + +1. **CUDA out of memory**: Reduce batch size or use CPU +2. **Model loading errors**: Check file paths in config.json +3. **Audio format issues**: Ensure audio is in supported format (WAV, MP3) +4. **Permission errors**: Run PowerShell as Administrator +5. 
**Python version issues**: Ensure Python 3.10 is used

### Performance Tips:

- **GPU Recommended**: Significant speedup with CUDA
- **Memory Requirements**: ~8GB+ GPU memory for optimal performance
- **Latency**: Lower values (320ms) = faster response, higher values = better quality

## Paths Configuration System

StreamSpeech uses a flexible paths configuration system that makes it easy to deploy across different environments:

### Files:
- **`demo/paths_config.json`**: Your actual paths (not in git)
- **`demo/paths_config_template.json`**: Template for paths (in git)
- **`demo/setup_paths.py`**: Automatic setup script
- **`demo/.gitignore`**: Excludes local paths from git

### Benefits:
- ✅ Easy to change paths without modifying code
- ✅ Git-friendly (local paths not committed)
- ✅ Environment-specific configurations
- ✅ Automatic path validation

## Directory Structure

```
StreamSpeech/
├── configs/
│   └── [lang]-en/                   # Language-specific configs
├── pretrain_models/                 # Downloaded models
│   └── unit-based_HiFi-GAN_vocoder/
├── demo/
│   ├── config.json                  # Main configuration
│   ├── paths_config.json            # Your paths (auto-generated)
│   ├── paths_config_template.json   # Template
│   ├── setup_paths.py               # Setup script
│   ├── app.py                       # Flask web app
│   └── templates/
│       └── index.html               # Web interface
├── requirements.txt                 # Dependencies
└── SETUP_INSTRUCTIONS.md            # This file
```

## Supported Languages

- French → English
- Spanish → English
- German → English

## Citation

If you use StreamSpeech in your research, please cite:

```bibtex
@inproceedings{streamspeech,
    title={StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning},
    author={Shaolei Zhang and Qingkai Fang and Shoutao Guo and Zhengrui Ma and Min Zhang and Yang Feng},
    year={2024},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Long Papers)},
    publisher = {Association for Computational Linguistics}
}
```

## Links

- **Paper**: [arXiv:2406.03049](https://arxiv.org/abs/2406.03049)
- **Demo**: [StreamSpeech Demo](https://ictnlp.github.io/StreamSpeech-site/)
- **Models**: [Hugging Face](https://huggingface.co/ICTNLP/StreamSpeech_Models/tree/main)
- **GitHub**: [StreamSpeech Repository](https://github.com/ictnlp/StreamSpeech)
diff --git a/configs/es-en/config_gcmvn.yaml b/configs/es-en/config_gcmvn.yaml
index 6083d2a..9109ca1 100644
--- a/configs/es-en/config_gcmvn.yaml
+++ b/configs/es-en/config_gcmvn.yaml
@@ -1,5 +1,5 @@
 global_cmvn:
-  stats_npz_path: /data/zhangshaolei/StreamSpeech/configs/es-en/gcmvn.npz
+  stats_npz_path: D:/StreamSpeech/configs/es-en/gcmvn.npz
 input_channels: 1
 input_feat_per_channel: 80
 specaugment:
@@ -16,6 +16,6 @@ transforms:
   - global_cmvn
   - specaugment
 vocoder:
-  checkpoint: /data/zhangshaolei/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000
-  config: /data/zhangshaolei/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json
+  checkpoint: D:/StreamSpeech/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000
+  config: D:/StreamSpeech/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json
 type: code_hifigan
diff --git a/configs/es-en/config_mtl_asr_st_ctcst.yaml b/configs/es-en/config_mtl_asr_st_ctcst.yaml
index 5dba702..06b1d73 100644
--- a/configs/es-en/config_mtl_asr_st_ctcst.yaml
+++ b/configs/es-en/config_mtl_asr_st_ctcst.yaml
@@ -1,7 +1,7 @@
 target_unigram:
decoder_type: transformer - dict: /data/zhangshaolei/StreamSpeech/configs/es-en/tgt_unigram6000/spm_unigram_es.txt - data: /data/zhangshaolei/StreamSpeech/configs/es-en/tgt_unigram6000 + dict: D:/StreamSpeech/configs/es-en/tgt_unigram6000/spm_unigram_es.txt + data: D:/StreamSpeech/configs/es-en/tgt_unigram6000 loss_weight: 8.0 rdrop_alpha: 0.0 decoder_args: @@ -12,8 +12,8 @@ target_unigram: label_smoothing: 0.1 source_unigram: decoder_type: ctc - dict: /data/zhangshaolei/StreamSpeech/configs/es-en/src_unigram6000/spm_unigram_es.txt - data: /data/zhangshaolei/StreamSpeech/configs/es-en/src_unigram6000 + dict: D:/StreamSpeech/configs/es-en/src_unigram6000/spm_unigram_es.txt + data: D:/StreamSpeech/configs/es-en/src_unigram6000 loss_weight: 4.0 rdrop_alpha: 0.0 decoder_args: @@ -24,8 +24,8 @@ source_unigram: label_smoothing: 0.1 ctc_target_unigram: decoder_type: ctc - dict: /data/zhangshaolei/StreamSpeech/configs/es-en/tgt_unigram6000/spm_unigram_es.txt - data: /data/zhangshaolei/StreamSpeech/configs/es-en/tgt_unigram6000 + dict: D:/StreamSpeech/configs/es-en/tgt_unigram6000/spm_unigram_es.txt + data: D:/StreamSpeech/configs/es-en/tgt_unigram6000 loss_weight: 4.0 rdrop_alpha: 0.0 decoder_args: diff --git a/demo/.gitignore b/demo/.gitignore new file mode 100644 index 0000000..b23e4c7 --- /dev/null +++ b/demo/.gitignore @@ -0,0 +1,15 @@ +# Ignore actual paths configuration (contains local paths) +paths_config.json + +# Ignore uploads directory +uploads/ + +# Ignore Python cache +__pycache__/ +*.pyc +*.pyo +*.pyd +.Python + +# Ignore virtual environment +streamspeech_env/ diff --git a/demo/app.py b/demo/app.py index 2e27934..311b9d0 100644 --- a/demo/app.py +++ b/demo/app.py @@ -4,6 +4,11 @@ # # StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning (ACL 2024) ########################################## +import sys +import os +# Add fairseq to Python path +sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), 'fairseq')) + from flask import Flask, request, jsonify, render_template, send_from_directory,url_for import os import json @@ -23,7 +28,10 @@ from pathlib import Path from typing import Any, Dict, Optional, Union from fairseq.data.audio.audio_utils import convert_waveform -from examples.speech_to_text.data_utils import extract_fbank_features +# Import data_utils directly from the file path +import sys +sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), 'fairseq', 'examples', 'speech_to_text')) +from data_utils import extract_fbank_features import ast import math import os @@ -97,9 +105,18 @@ def __call__(self, new_samples, sr=ORG_SAMPLE_RATE): + self.len_ms_to_samples(self.window_size - self.shift_size) ) samples = samples[:effective_num_samples] - waveform, sample_rate = convert_waveform( - torch.tensor([samples]), sr, to_mono=True, to_sample_rate=16000 - ) + # Simple audio conversion without sox dependency + waveform = torch.tensor([samples]) + if sr != 16000: + # Simple resampling using torch.nn.functional.interpolate + # waveform is 2D: [1, samples_length] + target_length = int(len(samples) * 16000 / sr) + # For linear interpolation, we need 3D input: [batch, channels, length] + waveform = waveform.unsqueeze(0) # Now [1, 1, samples_length] + waveform = torch.nn.functional.interpolate( + waveform, size=target_length, mode='linear', align_corners=False + ).squeeze(0) # Back to [1, target_length] + sample_rate = 16000 output = extract_fbank_features(waveform, 16000) output = self.transform(output) return 
torch.tensor(output, device=self.device)
@@ -824,7 +841,24 @@ def policy(self):

 def run(source):
     # if len(S2ST)!=0: return
-    samples, _ = soundfile.read(source, dtype="float32")
+    samples, sr = soundfile.read(source, dtype="float32")
+
+    # Resample to expected sample rate if needed
+    if sr != ORG_SAMPLE_RATE:
+        print(f"Resampling from {sr}Hz to {ORG_SAMPLE_RATE}Hz")
+        # Simple resampling using torch
+        samples_tensor = torch.tensor(samples).unsqueeze(0).unsqueeze(0)  # [1, 1, length]
+        target_length = int(len(samples) * ORG_SAMPLE_RATE / sr)
+        samples_tensor = torch.nn.functional.interpolate(
+            samples_tensor, size=target_length, mode='linear', align_corners=False
+        )
+        samples = samples_tensor.squeeze().numpy()
+
+    # Normalize input audio to prevent loud playback
+    max_val = np.max(np.abs(samples))
+    if max_val > 0:
+        samples = samples / max_val * 0.8  # Normalize and scale to 80%
+
     agent.reset()
     interval=int(agent.segment_size*(ORG_SAMPLE_RATE/1000))
@@ -856,31 +890,108 @@ def find_largest_key_value(dictionary, N):
     return dictionary[largest_key]

 def merge_audio(left_audio_path, right_audio_path, offset_ms):
-    # Read the left and right channel audio files
-    left_audio = AudioSegment.from_file(left_audio_path)
-    right_audio = AudioSegment.from_file(right_audio_path)
-
-    right_audio=AudioSegment.silent(duration=offset_ms)+right_audio
-
+    # Use soundfile instead of pydub to avoid ffmpeg dependency
+    left_data, left_sr = soundfile.read(left_audio_path, dtype='float32')
+    right_data, right_sr = soundfile.read(right_audio_path, dtype='float32')

-    # Ensure the two audio files have the same length
-    if len(left_audio) > len(right_audio):
-        right_audio += AudioSegment.silent(duration=len(left_audio) - len(right_audio))
-    elif len(left_audio) < len(right_audio):
-        left_audio += AudioSegment.silent(duration=len(right_audio) - len(left_audio))
-
-    # # Merge the left and right channel audio
-    # merged_audio = left_audio.overlay(right_audio.pan(1))
-    # # Save the merged audio file
-    # merged_audio.export(output_file, format="wav")
+    # Convert offset from ms to samples
-    return left_audio,right_audio
-
+    offset_samples = int(offset_ms * right_sr / 1000)
+
+    # Add silence at the beginning of right audio
+    right_data = np.concatenate([np.zeros(offset_samples), right_data])
+
+    # Ensure both audio files have the same length
+    max_length = max(len(left_data), len(right_data))
+
+    if len(left_data) < max_length:
+        left_data = np.concatenate([left_data, np.zeros(max_length - len(left_data))])
+    if len(right_data) < max_length:
+        right_data = np.concatenate([right_data, np.zeros(max_length - len(right_data))])
+
+    # Normalize audio data before creating AudioSegment objects
+    left_max = np.max(np.abs(left_data))
+    if left_max > 0:
+        left_data = left_data / left_max * 0.8
+
+    right_max = np.max(np.abs(right_data))
+    if right_max > 0:
+        right_data = right_data / right_max * 0.8
+
+    # Convert to int16 for AudioSegment (standard format)
+    left_data_int16 = (left_data * 32767).astype(np.int16)
+    right_data_int16 = (right_data * 32767).astype(np.int16)
+
+    # Create AudioSegment objects for compatibility with the rest of the code
+    left_audio = AudioSegment(
+        left_data_int16.tobytes(),
+        frame_rate=left_sr,
+        sample_width=2,  # int16 = 2 bytes
+        channels=1
+    )
+    right_audio = AudioSegment(
+        right_data_int16.tobytes(),
+        frame_rate=right_sr,
+        sample_width=2,  # int16 = 2 bytes
+        channels=1
+    )
+
+    # Audio normalization is now handled at the source when writing the file
+
+    return left_audio, right_audio
+
+# Flask routes will be defined after app initialization
+
+# Load main configuration
+with open('config.json', 'r') as f:
+    main_config = json.load(f)
+
+# Load paths configuration
+with open('paths_config.json', 'r') as f:
+    paths_config = json.load(f)
+
+# Merge configurations
+args_dict = main_config.copy()
+if main_config.get('use_paths_config', False):
+    # Add paths from paths_config.json
+    args_dict.update({
+        'data-bin': paths_config['configs']['data_bin'],
+        'user-dir': paths_config['configs']['user_dir'],
+        'agent-dir': paths_config['configs']['agent_dir'],
+        'model-path': paths_config['models']['simultaneous'],
+        'vocoder': paths_config['vocoder']['checkpoint'],
+        'vocoder-cfg': paths_config['vocoder']['config']
+    })
+
+# Initialize Flask app with config
 app = Flask(__name__)
-app.config['UPLOAD_FOLDER'] = 'uploads'
+# Set upload folder from paths config
+upload_folder = paths_config.get('demo', {}).get('upload_folder', 'uploads')
+app.config['UPLOAD_FOLDER'] = upload_folder
 os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)

+# Initialize agent
+parser = argparse.ArgumentParser()
+StreamSpeechS2STAgent.add_args(parser)
+
+# Create the list of arguments from args_dict
+args_list = []
+# pdb.set_trace()
+for key, value in args_dict.items():
+    # Skip non-argument fields
+    if key.startswith('_') or key in ['use_paths_config', 'language_pair']:
+        continue
+    if isinstance(value, bool):
+        if value:
+            args_list.append(f'--{key}')
+    else:
+        args_list.append(f'--{key}')
+        args_list.append(str(value))
+
+args = parser.parse_args(args_list)
+
+agent = StreamSpeechS2STAgent(args)

+# Define Flask routes
 @app.route('/')
 def index():
     return render_template('index.html')
@@ -897,71 +1008,87 @@ def upload():
         file.save(filepath)
         return filepath

-@app.route('/uploads/<filename>')
-def uploaded_file(filename):
+@app.route('/process/<path:filepath>')
+def uploaded_file(filepath):
     latency = request.args.get('latency', default=320, type=int)
     agent.set_chunk_size(latency)
-    path=app.config['UPLOAD_FOLDER']+'/'+filename
+    # Handle both full path and just filename
+    if filepath.startswith(app.config['UPLOAD_FOLDER']):
+        path = filepath
+    else:
+        path = os.path.join(app.config['UPLOAD_FOLDER'], filepath)
     # pdb.set_trace()
     # if len(S2ST)==0:
     reset()
     run(path)
-    soundfile.write('/'.join(path.split('/')[:-1])+'/output.'+path.split('/')[-1],S2ST,SAMPLE_RATE)
-    left,right=merge_audio(path, '/'.join(path.split('/')[:-1])+'/output.'+path.split('/')[-1], OFFSET_MS)
-    left.export('/'.join(path.split('/')[:-1])+'/input.'+path.split('/')[-1], format="wav")
-    right.export('/'.join(path.split('/')[:-1])+'/output.'+path.split('/')[-1], format="wav")
+    filename = os.path.basename(path)
+    output_path = os.path.join(os.path.dirname(path), 'output.'+filename)
+
+    # Normalize the audio data to prevent it from being too loud
+    if len(S2ST) > 0:
+        # Convert to numpy array and normalize
+        audio_data = np.array(S2ST, dtype=np.float32)
+        # Normalize to [-1, 1] range
+        max_val = np.max(np.abs(audio_data))
+        if max_val > 0:
+            audio_data = audio_data / max_val * 0.8  # Scale to 80% of max to be safe
+        soundfile.write(output_path, audio_data, SAMPLE_RATE)
+    else:
+        # Create silent audio if no data
+        soundfile.write(output_path, np.zeros(1000), SAMPLE_RATE)
+    left,right=merge_audio(path, output_path, OFFSET_MS)
+    input_path = os.path.join(os.path.dirname(path), 'input.'+filename)
+    left.export(input_path, format="wav")
+    right.export(output_path, format="wav")
     # left=left.split_to_mono()[0]
     # right=right.split_to_mono()[1]
     # pdb.set_trace()
     return send_from_directory(app.config['UPLOAD_FOLDER'], 'input.'+filename)

-@app.route('/uploads/output/<filename>')
-def uploaded_output_file(filename):
+@app.route('/output/<path:filepath>')
+def uploaded_output_file(filepath):
+    # Handle both full path and just filename
+    if filepath.startswith(app.config['UPLOAD_FOLDER']):
+        filename = os.path.basename(filepath)
+    else:
+        filename = filepath
     return send_from_directory(app.config['UPLOAD_FOLDER'], 'output.'+filename)

-@app.route('/asr/<float:current_time>', methods=['GET'])
+@app.route('/asr/<current_time>', methods=['GET'])
 def asr(current_time):
+    try:
+        current_time = float(current_time)
+    except ValueError:
+        return jsonify(result="")
+
     # asr_result = f"ABCD... {int(current_time * 1000)}"
     N = current_time*ORG_SAMPLE_RATE
     asr_result=find_largest_key_value(ASR, N)
     return jsonify(result=asr_result)

-@app.route('/translation/<float:current_time>', methods=['GET'])
+@app.route('/translation/<current_time>', methods=['GET'])
 def translation(current_time):
+    try:
+        current_time = float(current_time)
+    except ValueError:
+        return jsonify(result="")
+
     N = current_time*ORG_SAMPLE_RATE
     translation_result=find_largest_key_value(S2TT, N)
     # translation_result = f"1234... {int(current_time * 1000)}"
     return jsonify(result=translation_result)

-with open('/data/zhangshaolei/StreamSpeech/demo/config.json', 'r') as f:
-    args_dict = json.load(f)
-
-# Initialize agent
-parser = argparse.ArgumentParser()
-StreamSpeechS2STAgent.add_args(parser)
-
-# Create the list of arguments from args_dict
-args_list = []
-# pdb.set_trace()
-for key, value in args_dict.items():
-    if isinstance(value, bool):
-        if value:
-            args_list.append(f'--{key}')
-    else:
-        args_list.append(f'--{key}')
-        args_list.append(str(value))
-
-args = parser.parse_args(args_list)
-
-agent = StreamSpeechS2STAgent(args)
-
-
-
+@app.route('/favicon.ico')
+def favicon():
+    # Return a simple 204 No Content response to stop the 404 error
+    return '', 204

 if __name__ == '__main__':
-    app.run(host='0.0.0.0', port=7860, debug=True)
+    host = paths_config.get('demo', {}).get('host', '0.0.0.0')
+    port = paths_config.get('demo', {}).get('port', 7860)
+    app.run(host=host, port=port, debug=True)
diff --git a/demo/config.json b/demo/config.json
index 8d9b4ac..2340d62 100644
--- a/demo/config.json
+++ b/demo/config.json
@@ -1,12 +1,9 @@
 {
-    "data-bin": "/data/zhangshaolei/StreamSpeech/configs/fr-en",
-    "user-dir": "/data/zhangshaolei/StreamSpeech/researches/ctc_unity",
-    "agent-dir": "/data/zhangshaolei/StreamSpeech/agent",
-    "model-path": "/data/zhangshaolei/StreamSpeech_model/streamspeech.simultaneous.fr-en.pt",
+    "_comment": "StreamSpeech Demo Configuration - Paths are loaded from paths_config.json",
     "config-yaml": "config_gcmvn.yaml",
     "multitask-config-yaml": "config_mtl_asr_st_ctcst.yaml",
     "segment-size": 320,
-    "vocoder": "/data/zhangshaolei/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000",
-    "vocoder-cfg": "/data/zhangshaolei/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json",
-    "dur-prediction": true
+    "dur-prediction": true,
+    "language_pair": "es-en",
+    "use_paths_config": true
 }
diff --git a/demo/paths_config_template.json b/demo/paths_config_template.json
new file mode 100644
index 0000000..c6a38de
--- /dev/null
+++ b/demo/paths_config_template.json
@@ -0,0 +1,32 @@
+{
+    "_comment": "StreamSpeech Paths Configuration Template",
+    "_instructions": "Copy this file to paths_config.json and update the paths for your environment",
+    "_note": "Use forward slashes (/) for paths, even on Windows",
+
+    "streamspeech_root": "CHANGE_THIS_TO_YOUR_STREAMSPEECH_PATH",
+    "pretrain_models_root": "CHANGE_THIS_TO_YOUR_PRETRAIN_MODELS_PATH",
+
+    
"language_pair": "es-en", + + "models": { + "simultaneous": "CHANGE_THIS_TO_YOUR_SIMULTANEOUS_MODEL_PATH", + "offline": "CHANGE_THIS_TO_YOUR_OFFLINE_MODEL_PATH" + }, + + "vocoder": { + "checkpoint": "CHANGE_THIS_TO_YOUR_VOCODER_CHECKPOINT_PATH", + "config": "CHANGE_THIS_TO_YOUR_VOCODER_CONFIG_PATH" + }, + + "configs": { + "data_bin": "CHANGE_THIS_TO_YOUR_DATA_BIN_PATH", + "user_dir": "CHANGE_THIS_TO_YOUR_USER_DIR_PATH", + "agent_dir": "CHANGE_THIS_TO_YOUR_AGENT_DIR_PATH" + }, + + "demo": { + "upload_folder": "CHANGE_THIS_TO_YOUR_UPLOAD_FOLDER_PATH", + "host": "0.0.0.0", + "port": 7860 + } +} diff --git a/demo/setup_paths.py b/demo/setup_paths.py new file mode 100644 index 0000000..afc6f2c --- /dev/null +++ b/demo/setup_paths.py @@ -0,0 +1,105 @@ +#!/usr/bin/env python3 +""" +StreamSpeech Paths Setup Script + +This script helps you set up the paths_config.json file for your environment. +Run this script to automatically generate the paths configuration. +""" + +import os +import json +import sys +from pathlib import Path + +def get_streamspeech_root(): + """Get the StreamSpeech root directory""" + current_dir = Path(__file__).parent.parent.absolute() + return str(current_dir).replace('\\', '/') + +def setup_paths(): + """Set up paths configuration""" + streamspeech_root = get_streamspeech_root() + + # Default paths based on current directory structure + paths_config = { + "_comment": "StreamSpeech Paths Configuration - Auto-generated", + "_note": "Use forward slashes (/) for paths, even on Windows", + + "streamspeech_root": streamspeech_root, + "pretrain_models_root": f"{streamspeech_root}/pretrain_models", + + "language_pair": "es-en", + + "models": { + "simultaneous": f"{streamspeech_root}/pretrain_models/streamspeech.simultaneous.es-en.pt", + "offline": f"{streamspeech_root}/pretrain_models/streamspeech.offline.es-en.pt" + }, + + "vocoder": { + "checkpoint": f"{streamspeech_root}/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000", + "config": f"{streamspeech_root}/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json" + }, + + "configs": { + "data_bin": f"{streamspeech_root}/configs/es-en", + "user_dir": f"{streamspeech_root}/researches/ctc_unity", + "agent_dir": f"{streamspeech_root}/agent" + }, + + "demo": { + "upload_folder": f"{streamspeech_root}/demo/uploads", + "host": "0.0.0.0", + "port": 7860 + } + } + + # Check if files exist + print("Checking if required files exist...") + missing_files = [] + + for key, path in [ + ("Simultaneous Model", paths_config["models"]["simultaneous"]), + ("Offline Model", paths_config["models"]["offline"]), + ("Vocoder Checkpoint", paths_config["vocoder"]["checkpoint"]), + ("Vocoder Config", paths_config["vocoder"]["config"]), + ("Data Bin", paths_config["configs"]["data_bin"]), + ("User Dir", paths_config["configs"]["user_dir"]), + ("Agent Dir", paths_config["configs"]["agent_dir"]) + ]: + if os.path.exists(path): + print(f"✅ {key}: {path}") + else: + print(f"❌ {key}: {path} (NOT FOUND)") + missing_files.append((key, path)) + + if missing_files: + print(f"\n⚠️ Warning: {len(missing_files)} files/directories are missing!") + print("Please ensure all models are downloaded and paths are correct.") + response = input("Do you want to continue anyway? 
(y/N): ") + if response.lower() != 'y': + print("Setup cancelled.") + return False + + # Write the configuration file + config_path = Path(__file__).parent / "paths_config.json" + with open(config_path, 'w') as f: + json.dump(paths_config, f, indent=4) + + print(f"\n✅ Paths configuration saved to: {config_path}") + print("You can now run the StreamSpeech demo!") + + return True + +if __name__ == "__main__": + print("StreamSpeech Paths Setup") + print("=" * 30) + + if setup_paths(): + print("\n🎉 Setup completed successfully!") + print("\nNext steps:") + print("1. Activate your virtual environment: streamspeech_env\\Scripts\\activate") + print("2. Run the demo: python app.py") + print("3. Open your browser to: http://localhost:7860") + else: + print("\n❌ Setup failed. Please check the error messages above.") + sys.exit(1) diff --git a/demo/templates/index.html b/demo/templates/index.html index be31479..7574bd4 100644 --- a/demo/templates/index.html +++ b/demo/templates/index.html @@ -262,7 +262,7 @@

Simultaneous Speech-to-Speech Translation

normalize: true // Add normalize to output waveform }); - outputWaveSurfer.load(`/uploads/output/${filename}`); + outputWaveSurfer.load(`/output/${filename}`); playButton.disabled = false; playButton.style.backgroundColor = '#4CAF50'; // Change color to green @@ -301,11 +301,16 @@

Simultaneous Speech-to-Speech Translation

inputWaveSurfer.on('finish', function() { updateASRResult(inputWaveSurfer.getCurrentTime()); updateTranslationResult(inputWaveSurfer.getCurrentTime()); + + // Continue playing output audio even after input finishes + if (outputWaveSurfer && !outputWaveSurfer.isPlaying()) { + outputWaveSurfer.play(); + } }); }); // Pass the latency parameter to the server - inputWaveSurfer.load(`/uploads/${filename}?latency=${latency}`); + inputWaveSurfer.load(`/process/${filename}?latency=${latency}`); }) .catch(error => console.error('Error:', error)); }); @@ -330,7 +335,14 @@

Simultaneous Speech-to-Speech Translation

inputWaveSurfer.playPause(); } if (outputWaveSurfer) { - outputWaveSurfer.playPause(); + // If input is finished but output is still playing, just control output + if (inputWaveSurfer && inputWaveSurfer.isFinished() && outputWaveSurfer.isPlaying()) { + outputWaveSurfer.pause(); + } else if (inputWaveSurfer && inputWaveSurfer.isFinished() && !outputWaveSurfer.isPlaying()) { + outputWaveSurfer.play(); + } else { + outputWaveSurfer.playPause(); + } } } diff --git a/fairseq/examples/speech_to_text/__init__.py b/fairseq/examples/speech_to_text/__init__.py new file mode 100644 index 0000000..6264236 --- /dev/null +++ b/fairseq/examples/speech_to_text/__init__.py @@ -0,0 +1,4 @@ +# Copyright (c) Facebook, Inc. and its affiliates. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. diff --git a/fairseq/fairseq/models/speech_to_speech/modules/__init__.py b/fairseq/fairseq/models/speech_to_speech/modules/__init__.py index e69de29..6293554 100644 --- a/fairseq/fairseq/models/speech_to_speech/modules/__init__.py +++ b/fairseq/fairseq/models/speech_to_speech/modules/__init__.py @@ -0,0 +1 @@ +# Speech-to-Speech modules diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..fe130e8 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,51 @@ +# StreamSpeech Requirements +# Python 3.9+ required + +# Core PyTorch dependencies +torch>=2.0.0 +torchvision>=0.15.0 +torchaudio>=2.0.0 + +# Core ML/AI packages +fairseq>=0.12.0 +numpy>=1.21.0 +pandas>=1.3.0 + +# Audio processing +soundfile>=0.12.0 +pydub>=0.25.0 +librosa>=0.9.0 + +# Web framework +Flask>=2.0.0 +Werkzeug>=2.0.0 + +# Configuration and utilities +PyYAML>=6.0.0 +omegaconf>=2.0.0 +hydra-core>=1.0.0 +tqdm>=4.60.0 +regex>=2022.0.0 +sacrebleu>=2.0.0 +bitarray>=2.0.0 + +# Development and testing +pytest>=7.0.0 +pytest-cov>=4.0.0 +pytest-flake8>=1.0.0 +flake8>=5.0.0 + +# Additional utilities +colorama>=0.4.0 +tabulate>=0.9.0 +lxml>=4.0.0 +portalocker>=2.0.0 +tornado>=6.0.0 +textgrid>=1.5.0 +yt-dlp>=2023.0.0 + +# System dependencies (Windows) +pywin32>=300; sys_platform == "win32" + +# Note: SimulEval should be installed separately in editable mode: +# cd SimulEval && pip install --editable ./ From 0a6c918a52cb8b034a7c0220fc34ad8a17b6ffb0 Mon Sep 17 00:00:00 2001 From: ronliwag Date: Tue, 21 Oct 2025 01:54:12 +0800 Subject: [PATCH 2/7] added drive folder for pretrained model download --- SETUP_INSTRUCTIONS.md | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/SETUP_INSTRUCTIONS.md b/SETUP_INSTRUCTIONS.md index 2a6305f..f779b8b 100644 --- a/SETUP_INSTRUCTIONS.md +++ b/SETUP_INSTRUCTIONS.md @@ -77,7 +77,27 @@ python -c "import flask; print('Flask: OK')" ### 1. Download Pre-trained Models -Create a `pretrain_models` directory and download the required models: +**🚀 Fast Download Option (Recommended):** + +Download all pre-trained models from Google Drive for faster speeds: + +**[📁 Download Pre-trained Models from Google Drive](https://drive.google.com/drive/folders/1C4Y0sq_-tSRSbbu8dt0QGRQsk4h-9v5m?usp=drive_link)** + +1. Click the link above to access the Google Drive folder +2. Download the entire `pretrain_models` folder +3. 
Extract it to your StreamSpeech root directory + +The folder contains: +- **StreamSpeech Models**: All language pairs (French-English, Spanish-English, German-English) + - `streamspeech.simultaneous.[lang]-en.pt` (simultaneous translation) + - `streamspeech.offline.[lang]-en.pt` (offline translation) +- **HiFi-GAN Vocoder**: Complete unit-based vocoder with config + - `unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000` + - `unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json` + +**Alternative Download (Original Sources):** + +If you prefer to download from original sources: ```powershell mkdir pretrain_models From cba44ccab73ef0ca2081bffbd42853b6c0fffaab Mon Sep 17 00:00:00 2001 From: ronliwag Date: Fri, 7 Nov 2025 02:56:34 +0800 Subject: [PATCH 3/7] fixed setup --- SETUP_COMPLETE.md | 210 ++++++++++++++++++++++++++++++++++++++++++++++ demo/app.py | 34 +++++--- 2 files changed, 232 insertions(+), 12 deletions(-) create mode 100644 SETUP_COMPLETE.md diff --git a/SETUP_COMPLETE.md b/SETUP_COMPLETE.md new file mode 100644 index 0000000..e56410d --- /dev/null +++ b/SETUP_COMPLETE.md @@ -0,0 +1,210 @@ +# StreamSpeech Setup Complete! 🎉 + +## Virtual Environment Status +✅ **Virtual environment created**: `streamspeech_env` +✅ **All dependencies installed** +✅ **Fairseq configured** (via Python path) +✅ **SimulEval installed** (editable mode) + +## Installed Packages +- **PyTorch 2.0.1** with CUDA 11.8 support +- **TorchVision & TorchAudio** (compatible versions) +- **Fairseq** (custom version from local directory) +- **SimulEval 1.1.0** (for evaluation) +- **Flask** (for web demo) +- **Audio processing**: soundfile, librosa, pydub +- **ML utilities**: numpy, pandas, scipy, scikit-learn +- **Configuration**: PyYAML, omegaconf, hydra-core +- **Other tools**: tensorboardX, sacrebleu, tqdm, and more + +## CUDA Status +✅ **CUDA is available** on your system - GPU acceleration is ready! 
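
To confirm which device is visible and whether it meets the ~8GB memory guidance from SETUP_INSTRUCTIONS.md, here is a quick check using only standard PyTorch APIs (run it inside the activated environment; checking device 0 is an assumption for a single-GPU machine):

```python
import torch

# Report the visible CUDA device and its memory before downloading models.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")  # ~8 GB+ recommended
else:
    print("No CUDA device visible - the demo will run on CPU (much slower).")
```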
+ +--- + +## 📥 Required Models to Download + +You need to download the following pre-trained models to use StreamSpeech: + +### Option 1: Quick Download (Recommended) +**All models are available on Hugging Face:** +https://huggingface.co/ICTNLP/StreamSpeech_Models + +### Option 2: Download Individual Models + +#### 1️⃣ **StreamSpeech Models** (Choose your language pair) + +**French → English:** +- **Simultaneous**: [streamspeech.simultaneous.fr-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.fr-en.pt) (~1.2 GB) +- **Offline**: [streamspeech.offline.fr-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.fr-en.pt) (~1.2 GB) +- **Unity baseline**: [unity.fr-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/unity.fr-en.pt) (~1.2 GB) + +**Spanish → English:** +- **Simultaneous**: [streamspeech.simultaneous.es-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.es-en.pt) (~1.2 GB) +- **Offline**: [streamspeech.offline.es-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.es-en.pt) (~1.2 GB) +- **Unity baseline**: [unity.es-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/unity.es-en.pt) (~1.2 GB) + +**German → English:** +- **Simultaneous**: [streamspeech.simultaneous.de-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.simultaneous.de-en.pt) (~1.2 GB) +- **Offline**: [streamspeech.offline.de-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/streamspeech.offline.de-en.pt) (~1.2 GB) +- **Unity baseline**: [unity.de-en.pt](https://huggingface.co/ICTNLP/StreamSpeech_Models/blob/main/unity.de-en.pt) (~1.2 GB) + +#### 2️⃣ **Unit-based HiFi-GAN Vocoder** (Required for speech synthesis) + +**For English output:** +- **Checkpoint**: [g_00500000](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000) (~55 MB) +- **Config**: [config.json](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json) (~1 KB) + +**For Spanish output (if needed):** +- **Checkpoint**: [g_00500000](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_es_css10/g_00500000) +- **Config**: [config.json](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_es_css10/config.json) + +**For French output (if needed):** +- **Checkpoint**: [g_00500000](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_fr_css10/g_00500000) +- **Config**: [config.json](https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_fr_css10/config.json) + +#### 3️⃣ **mHuBERT Model** (For unit extraction) +- **Model**: [mhubert_base_vp_en_es_fr_it3.pt](https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt) (~316 MB) +- **K-means**: [mhubert_base_vp_en_es_fr_it3_L11_km1000.bin](https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin) (~4 MB) + +--- + +## 📁 Recommended Directory Structure + +After downloading, organize your models like this: + +``` +D:\StreamSpeech\ +├── pretrain_models\ +│ ├── streamspeech.simultaneous.fr-en.pt +│ ├── streamspeech.offline.fr-en.pt +│ 
├── unit-based_HiFi-GAN_vocoder\ +│ │ ├── mHuBERT.layer11.km1000.en\ +│ │ │ ├── g_00500000 +│ │ │ └── config.json +│ │ ├── mHuBERT.layer11.km1000.es\ +│ │ │ ├── g_00500000 +│ │ │ └── config.json +│ │ └── mHuBERT.layer11.km1000.fr\ +│ │ ├── g_00500000 +│ │ └── config.json +│ └── mHuBERT\ +│ ├── mhubert_base_vp_en_es_fr_it3.pt +│ └── mhubert_base_vp_en_es_fr_it3_L11_km1000.bin +└── ... (other project files) +``` + +**Create the directories:** +```powershell +mkdir pretrain_models +mkdir pretrain_models\unit-based_HiFi-GAN_vocoder\mHuBERT.layer11.km1000.en +mkdir pretrain_models\unit-based_HiFi-GAN_vocoder\mHuBERT.layer11.km1000.es +mkdir pretrain_models\unit-based_HiFi-GAN_vocoder\mHuBERT.layer11.km1000.fr +mkdir pretrain_models\mHuBERT +``` + +Then download the models into their respective directories. + +--- + +## 🚀 Quick Start Guide + +### 1. Activate the Environment +```powershell +.\streamspeech_env\Scripts\Activate.ps1 +``` + +### 2. Test the Installation +```powershell +python -c "import torch; print('CUDA:', torch.cuda.is_available())" +``` + +### 3. Run Example Inference (after downloading models) + +**Simultaneous Speech-to-Speech Translation:** +```powershell +$env:CUDA_VISIBLE_DEVICES="0" +$ROOT="D:\StreamSpeech" +$PRETRAIN_ROOT="D:\StreamSpeech\pretrain_models" +$LANG="fr" + +$env:PYTHONPATH="$ROOT\fairseq" +simuleval --data-bin "$ROOT\configs\$LANG-en" ` + --user-dir "$ROOT\researches\ctc_unity" ` + --agent-dir "$ROOT\agent" ` + --source "$ROOT\example\wav_list.txt" ` + --target "$ROOT\example\target.txt" ` + --model-path "$PRETRAIN_ROOT\streamspeech.simultaneous.$LANG-en.pt" ` + --config-yaml config_gcmvn.yaml ` + --multitask-config-yaml config_mtl_asr_st_ctcst.yaml ` + --agent "$ROOT\agent\speech_to_speech.streamspeech.agent.py" ` + --vocoder "$PRETRAIN_ROOT\unit-based_HiFi-GAN_vocoder\mHuBERT.layer11.km1000.en\g_00500000" ` + --vocoder-cfg "$PRETRAIN_ROOT\unit-based_HiFi-GAN_vocoder\mHuBERT.layer11.km1000.en\config.json" ` + --dur-prediction ` + --source-segment-size 320 ` + --device gpu ` + --computation-aware ` + --output-asr-translation True +``` + +### 4. Run Web Demo (after downloading models) +```powershell +cd demo +python app.py +``` +Then open your browser to `http://localhost:7860` + +--- + +## 📋 Summary of What You Need + +**For basic S2ST (French→English):** +1. ✅ Environment (already set up) +2. ⬇️ `streamspeech.simultaneous.fr-en.pt` (~1.2 GB) +3. ⬇️ HiFi-GAN vocoder for English (`g_00500000` + `config.json`) (~55 MB) +4. ⬇️ mHuBERT model (`.pt` file) (~316 MB) +5. ⬇️ mHuBERT k-means (`.bin` file) (~4 MB) + +**Total download size: ~1.6 GB** + +--- + +## 💡 Next Steps + +1. **Download Models**: Start with French→English simultaneous model and English vocoder +2. **Update Config Files**: Edit paths in `configs/fr-en/config_gcmvn.yaml` and `config_mtl_asr_st_ctcst.yaml` +3. **Test with Examples**: Use the provided example audio files in `example/wavs/` +4. **Explore Features**: Try different tasks (ASR, S2TT, S2ST) with different latency settings + +--- + +## 🔧 Troubleshooting + +**Issue**: ImportError for fairseq +**Solution**: Make sure the virtual environment is activated. The `.pth` file automatically adds fairseq to the path. 
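
If the `.pth` file is ever missing (for example after recreating the environment), it can be regenerated with a short sketch like the one below; the name `fairseq_local.pth` is arbitrary, and the fairseq path should match your checkout:

```python
import sysconfig
from pathlib import Path

# A .pth file in site-packages is read at interpreter startup; every line
# in it is appended to sys.path. Point one at the local fairseq checkout.
site_packages = Path(sysconfig.get_paths()["purelib"])
pth_file = site_packages / "fairseq_local.pth"  # arbitrary file name
pth_file.write_text("D:/StreamSpeech/fairseq\n")
print(f"Wrote {pth_file}")
```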
**Issue**: CUDA out of memory
**Solution**: Use CPU mode by setting `--device cpu` or reduce batch size

**Issue**: Module not found
**Solution**: Ensure PYTHONPATH includes the fairseq directory:
```powershell
$env:PYTHONPATH="D:\StreamSpeech\fairseq"
```

---

## 📚 Resources

- **Paper**: https://arxiv.org/abs/2406.03049
- **Demo Site**: https://ictnlp.github.io/StreamSpeech-site/
- **Model Hub**: https://huggingface.co/ICTNLP/StreamSpeech_Models
- **GitHub**: https://github.com/ictnlp/StreamSpeech

---

**Environment created on**: November 6, 2025
**Python version**: 3.10
**PyTorch version**: 2.0.1 + CUDA 11.8
**GPU Support**: ✅ Enabled

diff --git a/demo/app.py b/demo/app.py
index 311b9d0..1a74b98 100644
--- a/demo/app.py
+++ b/demo/app.py
@@ -841,7 +841,22 @@ def policy(self):

 def run(source):
     # if len(S2ST)!=0: return
-    samples, sr = soundfile.read(source, dtype="float32")
+
+    # Handle MP3 files by converting to WAV first
+    if source.lower().endswith('.mp3'):
+        print(f"Converting MP3 to WAV: {source}")
+        audio = AudioSegment.from_mp3(source)
+        # Create a temporary WAV file
+        wav_path = source.rsplit('.', 1)[0] + '_temp.wav'
+        audio.export(wav_path, format='wav')
+        samples, sr = soundfile.read(wav_path, dtype="float32")
+        # Clean up temp file
+        try:
+            os.remove(wav_path)
+        except:
+            pass
+    else:
+        samples, sr = soundfile.read(source, dtype="float32")

     # Resample to expected sample rate if needed
     if sr != ORG_SAMPLE_RATE:
@@ -1006,18 +1021,16 @@ def upload():
     if file:
         filepath = os.path.join(app.config['UPLOAD_FOLDER'], file.filename)
         file.save(filepath)
-        return filepath
+        # Return just the filename, not the full path
+        return file.filename

 @app.route('/process/<path:filepath>')
 def uploaded_file(filepath):
     latency = request.args.get('latency', default=320, type=int)
     agent.set_chunk_size(latency)
-    # Handle both full path and just filename
-    if filepath.startswith(app.config['UPLOAD_FOLDER']):
-        path = filepath
-    else:
-        path = os.path.join(app.config['UPLOAD_FOLDER'], filepath)
+    # Construct full path from upload folder and filename
+    path = os.path.join(app.config['UPLOAD_FOLDER'], filepath)
     # pdb.set_trace()
     # if len(S2ST)==0:
     reset()
@@ -1048,11 +1061,8 @@ def uploaded_file(filepath):

 @app.route('/output/<path:filepath>')
 def uploaded_output_file(filepath):
-    # Handle both full path and just filename
-    if filepath.startswith(app.config['UPLOAD_FOLDER']):
-        filename = os.path.basename(filepath)
-    else:
-        filename = filepath
+    # filepath is just the filename
+    filename = filepath
     return send_from_directory(app.config['UPLOAD_FOLDER'], 'output.'+filename)

From 2fb7e9171506190495576ac6f75e640002d82acf Mon Sep 17 00:00:00 2001
From: ronliwag
Date: Fri, 7 Nov 2025 03:37:21 +0800
Subject: [PATCH 4/7] added audio and discrete unit extraction

---
 EXTRACTION_GUIDE.md           | 396 ++++++++++++++++++++++++++++++++++
 EXTRACTION_QUICK_REFERENCE.md | 103 +++++++++
 demo/app.py                   |  20 ++
 demo/extract_intermediates.py | 173 +++++++++++++++
 4 files changed, 692 insertions(+)
 create mode 100644 EXTRACTION_GUIDE.md
 create mode 100644 EXTRACTION_QUICK_REFERENCE.md
 create mode 100644 demo/extract_intermediates.py

diff --git a/EXTRACTION_GUIDE.md b/EXTRACTION_GUIDE.md
new file mode 100644
index 0000000..bc8deae
--- /dev/null
+++ b/EXTRACTION_GUIDE.md
@@ -0,0 +1,396 @@
+# StreamSpeech Intermediate Data Extraction Guide
+
+This guide shows you **exactly where** to extract the source Spanish audio and discrete speech units from StreamSpeech.
+
+---
+
+## 📁 What You'll Extract
+
+1.
**Source Spanish Audio Input** (`.wav` file) + - Original/resampled Spanish audio before feature extraction + - Location: `demo/app.py`, `run()` function + +2. **Discrete Speech Units** (`.pt` file) + - Integer codes representing phonetic units (before vocoder) + - Location: `agent/speech_to_speech.streamspeech.agent.py`, `policy()` method + +--- + +## 🎯 Code Location 1: Source Audio Input + +### File: `demo/app.py` + +**Location**: In the `run()` function, around **line 859** + +### Current Code: +```python +def run(source): + # if len(S2ST)!=0: return + + # Handle MP3 files by converting to WAV first + if source.lower().endswith('.mp3'): + print(f"Converting MP3 to WAV: {source}") + audio = AudioSegment.from_mp3(source) + # Create a temporary WAV file + wav_path = source.rsplit('.', 1)[0] + '_temp.wav' + audio.export(wav_path, format='wav') + samples, sr = soundfile.read(wav_path, dtype="float32") + # Clean up temp file + try: + os.remove(wav_path) + except: + pass + else: + samples, sr = soundfile.read(source, dtype="float32") + + # Resample to expected sample rate if needed + if sr != ORG_SAMPLE_RATE: + print(f"Resampling from {sr}Hz to {ORG_SAMPLE_RATE}Hz") + # Simple resampling using torch + samples_tensor = torch.tensor(samples).unsqueeze(0).unsqueeze(0) + target_length = int(len(samples) * ORG_SAMPLE_RATE / sr) + samples_tensor = torch.nn.functional.interpolate( + samples_tensor, size=target_length, mode='linear', align_corners=False + ) + samples = samples_tensor.squeeze().numpy() + + # Normalize input audio to prevent loud playback + max_val = np.max(np.abs(samples)) + if max_val > 0: + samples = samples / max_val * 0.8 + + # 👇 ADD EXTRACTION CODE HERE 👇 +``` + +### Modified Code (ADD THIS): +```python + # Normalize input audio to prevent loud playback + max_val = np.max(np.abs(samples)) + if max_val > 0: + samples = samples / max_val * 0.8 + + # ========================================== + # EXTRACT SOURCE AUDIO AT 16kHz + # ========================================== + # Save source audio for analysis (resampled to 16kHz) + import soundfile, torch + source_filename = os.path.basename(source).rsplit('.', 1)[0] + extract_dir = os.path.join(os.path.dirname(__file__), 'extracted_intermediates') + os.makedirs(extract_dir, exist_ok=True) + + # Resample to 16kHz (model's processing rate) + TARGET_SR = 16000 + if ORG_SAMPLE_RATE != TARGET_SR: + samples_tensor = torch.tensor(samples).unsqueeze(0).unsqueeze(0) + target_length = int(len(samples) * TARGET_SR / ORG_SAMPLE_RATE) + samples_tensor = torch.nn.functional.interpolate( + samples_tensor, size=target_length, mode='linear', align_corners=False + ) + samples_16k = samples_tensor.squeeze().numpy() + else: + samples_16k = samples + + source_audio_path = os.path.join(extract_dir, f"{source_filename}_source_audio_16k.wav") + soundfile.write(source_audio_path, samples_16k, TARGET_SR) + print(f"✅ EXTRACTED: Source audio saved to {source_audio_path}") + print(f" Sample rate: {TARGET_SR} Hz (16kHz), Duration: {len(samples_16k)/TARGET_SR:.2f}s") + # ========================================== + + agent.reset() + # ... 
rest of the function +``` + +--- + +## 🎯 Code Location 2: Discrete Speech Units + +### File: `agent/speech_to_speech.streamspeech.agent.py` + +**Location**: In the `policy()` method, around **line 713-748** + +### Current Code: +```python + for i, hypo in enumerate(finalized): + i_beam = 0 + tmp = hypo[i_beam]["tokens"].int() # hyp + eos + if tmp[-1] == self.generator.eos: + tmp = tmp[:-1] + unit = [] + for c in tmp: + u = self.generator.tgt_dict[c].replace("", "").replace("", "") + if u != "": + unit.append(int(u)) + + if len(unit) > 0 and unit[0] == " ": + unit = unit[1:] + text = " ".join([str(_) for _ in unit]) + if self.states.source_finished and not self.quiet: + with open(self.unit_file, "a") as file: + print(text, file=file) + cur_unit = unit if self.unit is None else unit[len(self.unit) :] + if len(unit) < 1 or len(cur_unit) < 1: + # ... return ReadAction or WriteAction + + x = { + "code": torch.tensor(unit, dtype=torch.long, device=self.device).view( + 1, -1 + ), + } + wav, dur = self.vocoder(x, self.dur_prediction) +``` + +### Modified Code (ADD THIS): +```python + for i, hypo in enumerate(finalized): + i_beam = 0 + tmp = hypo[i_beam]["tokens"].int() # hyp + eos + if tmp[-1] == self.generator.eos: + tmp = tmp[:-1] + unit = [] + for c in tmp: + u = self.generator.tgt_dict[c].replace("", "").replace("", "") + if u != "": + unit.append(int(u)) + + if len(unit) > 0 and unit[0] == " ": + unit = unit[1:] + text = " ".join([str(_) for _ in unit]) + if self.states.source_finished and not self.quiet: + with open(self.unit_file, "a") as file: + print(text, file=file) + cur_unit = unit if self.unit is None else unit[len(self.unit) :] + if len(unit) < 1 or len(cur_unit) < 1: + # ... return ReadAction or WriteAction + + # ========================================== + # EXTRACT DISCRETE UNITS (before vocoder) + # ========================================== + # Only save when source is finished (final complete units) + if self.states.source_finished and len(unit) > 0: + import torch + import os + from datetime import datetime + + extract_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'demo', 'extracted_intermediates') + os.makedirs(extract_dir, exist_ok=True) + + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + units_pt_path = os.path.join(extract_dir, f"discrete_units_{timestamp}.pt") + units_txt_path = os.path.join(extract_dir, f"discrete_units_{timestamp}.txt") + + # Save as PyTorch tensor + units_tensor = torch.tensor(unit, dtype=torch.long) + torch.save(units_tensor, units_pt_path) + + # Also save as readable text + with open(units_txt_path, 'w') as f: + f.write(' '.join(map(str, unit)) + '\n') + f.write(f"\n# Number of units: {len(unit)}\n") + f.write(f"# Unit range: [{min(unit)}, {max(unit)}]\n") + + print(f"✅ EXTRACTED: Discrete units saved to {units_pt_path}") + print(f" Number of units: {len(unit)}, Range: [{min(unit)}, {max(unit)}]") + # ========================================== + + x = { + "code": torch.tensor(unit, dtype=torch.long, device=self.device).view( + 1, -1 + ), + } + wav, dur = self.vocoder(x, self.dur_prediction) +``` + +--- + +## 📊 Data Flow Visualization + +``` +┌─────────────────────────────────────────────────────────────┐ +│ StreamSpeech Pipeline │ +└─────────────────────────────────────────────────────────────┘ + +1. 
Source Spanish Audio (MP3/WAV) + │ + ├─> Load & Resample (demo/app.py, run()) + │ ├─> samples: numpy array (48000 Hz) + │ └─> 💾 EXTRACT HERE: source_audio.wav + │ + ├─> Feature Extraction (agent, OnlineFeatureExtractor) + │ └─> fbank features (80-dim) + │ + ├─> StreamSpeech Model (encoder + decoder) + │ ├─> ASR output (Spanish text) + │ ├─> Translation output (English text) + │ └─> Text-to-Unit decoder + │ + ├─> Discrete Speech Units (agent, policy()) + │ ├─> unit: list of integers [234, 567, 891, ...] + │ └─> 💾 EXTRACT HERE: discrete_units.pt + │ + ├─> HiFi-GAN Vocoder (CodeHiFiGAN) + │ └─> Synthesized English speech (16000 Hz) + │ + └─> Output Audio (WAV) +``` + +--- + +## 🔍 Understanding the Extracted Data + +### Source Audio (`.wav` file) +- **Format**: WAV, float32 +- **Sample Rate**: 16000 Hz (16kHz - model's processing rate) +- **Content**: Original Spanish speech, resampled and normalized to [-0.8, 0.8] +- **Use Case**: Input to acoustic feature extraction (same rate the model uses) + +### Discrete Units (`.pt` file) +- **Format**: PyTorch tensor (torch.long) +- **Content**: Integer codes representing phonetic units +- **Range**: Typically 0-999 (for 1000-unit codebook) +- **Length**: Variable, depends on speech duration +- **Example**: `tensor([234, 567, 891, 123, 456, ...])` + +**How to Load:** +```python +import torch + +# Load units +units = torch.load('discrete_units_20251107_123456.pt') +print(f"Shape: {units.shape}") +print(f"Units: {units}") + +# Or load as text +with open('discrete_units_20251107_123456.txt', 'r') as f: + units_str = f.readline().strip() + units_list = [int(x) for x in units_str.split()] +``` + +--- + +## 📂 Output Structure + +After running the demo, you'll find: + +``` +demo/ +├── extracted_intermediates/ +│ ├── common_voice_es_18311412_source_audio_16k.wav +│ ├── discrete_units_20251107_025030.pt +│ ├── discrete_units_20251107_025030.txt +│ ├── another_audio_source_audio_16k.wav +│ └── discrete_units_20251107_030145.pt +└── ... +``` + +--- + +## 🚀 Quick Implementation + +### Option 1: Manual Copy-Paste (Recommended) +1. Open `demo/app.py` +2. Find line ~859 (after `samples = samples / max_val * 0.8`) +3. Copy-paste the "EXTRACT SOURCE AUDIO" code block +4. Open `agent/speech_to_speech.streamspeech.agent.py` +5. Find line ~742 (before `x = {"code": ...}`) +6. Copy-paste the "EXTRACT DISCRETE UNITS" code block +7. Restart the demo app + +### Option 2: Use the Helper Script +The `demo/extract_intermediates.py` file contains reusable functions you can import. + +--- + +## 🧪 Testing + +1. Start the demo: + ```powershell + cd demo + python app.py + ``` + +2. Upload a Spanish audio file + +3. Process it + +4. Check the console output for: + ``` + ✅ EXTRACTED: Source audio saved to extracted_intermediates/... + ✅ EXTRACTED: Discrete units saved to extracted_intermediates/... + ``` + +5. 
Verify files in `demo/extracted_intermediates/` + +--- + +## 📝 Notes + +- **Source audio** is saved at the beginning of processing (immediately available) +- **Discrete units** are saved only when `source_finished=True` (at the end) +- Both use timestamps to avoid overwriting files +- `.txt` files are human-readable for inspection +- `.pt` files can be loaded back into PyTorch for further processing + +--- + +## 🔬 Advanced: Using the Extracted Data + +### Analyzing Discrete Units +```python +import torch +import matplotlib.pyplot as plt + +# Load units +units = torch.load('discrete_units_20251107_025030.pt') + +# Statistics +print(f"Total units: {len(units)}") +print(f"Unique units: {len(torch.unique(units))}") +print(f"Most common unit: {torch.mode(units).values.item()}") + +# Histogram +plt.hist(units.numpy(), bins=50) +plt.xlabel('Unit Index') +plt.ylabel('Frequency') +plt.title('Discrete Unit Distribution') +plt.show() +``` + +### Analyzing Source Audio +```python +import soundfile +import numpy as np + +# Load 16kHz source audio +audio, sr = soundfile.read('common_voice_es_18311412_source_audio_16k.wav') +print(f"Sample rate: {sr} Hz (should be 16000)") +print(f"Duration: {len(audio)/sr:.2f} seconds") +print(f"Shape: {audio.shape}") +print(f"Range: [{audio.min():.3f}, {audio.max():.3f}]") +``` + +### Reusing Units with Vocoder +```python +import torch + +# Load saved units +units = torch.load('discrete_units_20251107_025030.pt') + +# Feed directly to vocoder (without running full model) +# This would be in the agent context with vocoder loaded +x = { + "code": units.view(1, -1).to(device) +} +wav, dur = vocoder(x, dur_prediction=True) +# wav is now synthesized speech! +``` + +--- + +## ✅ Summary + +**Two extraction points:** +1. 📍 `demo/app.py:859` → Save source Spanish audio WAV +2. 📍 `agent/speech_to_speech.streamspeech.agent.py:742` → Save discrete units PT + +Both files will be in `demo/extracted_intermediates/` directory. 
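
Since the `.pt` and `.txt` files are written independently, a quick cross-check confirms they describe the same unit sequence. A minimal sketch — the timestamped filename is a placeholder from the examples above, not a file this guide guarantees to exist:

```python
import torch

# Placeholder stem; substitute the files your run actually produced
stem = 'discrete_units_20251107_025030'

pt_units = torch.load(f'{stem}.pt').tolist()
with open(f'{stem}.txt') as f:
    # The first line holds the space-separated units; later lines are comments
    txt_units = [int(x) for x in f.readline().split()]

assert pt_units == txt_units, "pt/txt unit sequences disagree"
print(f"OK: {len(pt_units)} units, .pt and .txt agree")
```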
+ diff --git a/EXTRACTION_QUICK_REFERENCE.md b/EXTRACTION_QUICK_REFERENCE.md new file mode 100644 index 0000000..4e64029 --- /dev/null +++ b/EXTRACTION_QUICK_REFERENCE.md @@ -0,0 +1,103 @@ +# 🎯 Quick Reference: Extract Intermediates from StreamSpeech + +## Two Code Locations + +### 1️⃣ Source Spanish Audio @ 16kHz → `demo/app.py` line ~859 + +**Add after**: `samples = samples / max_val * 0.8` + +```python +# Save source audio at 16kHz +import soundfile, os, torch +source_filename = os.path.basename(source).rsplit('.', 1)[0] +extract_dir = os.path.join(os.path.dirname(__file__), 'extracted_intermediates') +os.makedirs(extract_dir, exist_ok=True) + +# Resample to 16kHz (model's processing rate) +TARGET_SR = 16000 +if ORG_SAMPLE_RATE != TARGET_SR: + samples_tensor = torch.tensor(samples).unsqueeze(0).unsqueeze(0) + target_length = int(len(samples) * TARGET_SR / ORG_SAMPLE_RATE) + samples_16k = torch.nn.functional.interpolate( + samples_tensor, size=target_length, mode='linear', align_corners=False + ).squeeze().numpy() +else: + samples_16k = samples + +source_audio_path = os.path.join(extract_dir, f"{source_filename}_source_audio_16k.wav") +soundfile.write(source_audio_path, samples_16k, TARGET_SR) +print(f"✅ Source audio (16kHz): {source_audio_path}") +``` + +**Output**: `demo/extracted_intermediates/_source_audio_16k.wav` + +--- + +### 2️⃣ Discrete Units → `agent/speech_to_speech.streamspeech.agent.py` line ~742 + +**Add before**: `x = {"code": torch.tensor(unit, ...}` + +```python +# Save discrete units +if self.states.source_finished and len(unit) > 0: + import torch, os + from datetime import datetime + extract_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'demo', 'extracted_intermediates') + os.makedirs(extract_dir, exist_ok=True) + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + + # Save as PyTorch tensor + units_tensor = torch.tensor(unit, dtype=torch.long) + torch.save(units_tensor, os.path.join(extract_dir, f"units_{timestamp}.pt")) + + # Save as text + with open(os.path.join(extract_dir, f"units_{timestamp}.txt"), 'w') as f: + f.write(' '.join(map(str, unit))) + print(f"✅ Discrete units: {len(unit)} units saved") +``` + +**Output**: +- `demo/extracted_intermediates/units_.pt` +- `demo/extracted_intermediates/units_.txt` + +--- + +## 📊 What You Get + +| File | Format | Content | Size | +|------|--------|---------|------| +| `*_source_audio_16k.wav` | WAV, **16kHz**, float32 | Spanish speech at model's rate | ~1-2 MB/min | +| `units_*.pt` | PyTorch tensor | Discrete phonetic codes | ~1-2 KB | +| `units_*.txt` | Text | Human-readable units | ~1-2 KB | + +--- + +## 🔬 Usage + +### Load Source Audio +```python +import soundfile +audio, sr = soundfile.read('common_voice_es_18311412_source_audio_16k.wav') +print(f"Sample rate: {sr} Hz") # Should be 16000 +``` + +### Load Discrete Units +```python +import torch +units = torch.load('units_20251107_025030.pt') +# tensor([234, 567, 891, ...]) +``` + +--- + +## ✨ Quick Test + +1. Add the two code snippets above +2. Restart demo: `python demo/app.py` +3. Upload & process Spanish audio +4. Check `demo/extracted_intermediates/` folder + +--- + +See **EXTRACTION_GUIDE.md** for detailed explanations and advanced usage. 
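
---

## 🗂️ List What Was Extracted

Step 4 of the quick test can be scripted instead of eyeballed. A minimal sketch, assuming it is run from the repository root:

```python
# List extraction outputs, newest first
from pathlib import Path

extract_dir = Path('demo/extracted_intermediates')
for p in sorted(extract_dir.iterdir(), key=lambda f: f.stat().st_mtime, reverse=True):
    print(f"{p.stat().st_size:>10,} bytes  {p.name}")
```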
+ diff --git a/demo/app.py b/demo/app.py index 1a74b98..ebaee3e 100644 --- a/demo/app.py +++ b/demo/app.py @@ -805,6 +805,18 @@ def policy(self): finished=True, ) + # Extract discrete units before vocoder + if self.states.source_finished and len(unit) > 0: + try: + from extract_intermediates import save_discrete_units + # Use source filename if available, otherwise default to "output" + filename_prefix = getattr(self, 'current_source_filename', 'output') + save_discrete_units(unit, filename_prefix=f"{filename_prefix}_output") + except Exception as e: + import traceback + print(f"⚠️ Failed to save discrete units: {e}") + traceback.print_exc() + x = { "code": torch.tensor(unit, dtype=torch.long, device=self.device).view( 1, -1 @@ -858,6 +870,14 @@ def run(source): else: samples, sr = soundfile.read(source, dtype="float32") + # Extract source audio at 16kHz + source_filename = os.path.basename(source).split('.')[0] + from extract_intermediates import save_source_audio + save_source_audio(samples, ORG_SAMPLE_RATE, filename_prefix=source_filename) + + # Store filename for later use in discrete units extraction + agent.current_source_filename = source_filename + # Resample to expected sample rate if needed if sr != ORG_SAMPLE_RATE: print(f"Resampling from {sr}Hz to {ORG_SAMPLE_RATE}Hz") diff --git a/demo/extract_intermediates.py b/demo/extract_intermediates.py new file mode 100644 index 0000000..03cc017 --- /dev/null +++ b/demo/extract_intermediates.py @@ -0,0 +1,173 @@ +""" +Extract Intermediate Outputs from StreamSpeech +================================================ + +This script modifies the demo app to save: +1. Source Spanish audio input (WAV file) +2. Discrete speech units output (PT file) + +Add this code to your demo/app.py to extract intermediates. +""" + +import torch +import soundfile +import os +from datetime import datetime + +# Directory to save extracted files - use absolute path +# This file is in demo/, so go up to project root, then into demo/extracted_intermediates +_current_dir = os.path.dirname(os.path.abspath(__file__)) +EXTRACT_DIR = os.path.join(_current_dir, "extracted_intermediates") +os.makedirs(EXTRACT_DIR, exist_ok=True) + +def save_source_audio(samples, sample_rate, filename_prefix=None, target_sample_rate=16000): + """ + Save the source audio input as WAV file at 16kHz. + + Call this after loading/processing the source audio. 
+ Location: In demo/app.py, run() function, after samples are loaded + + Args: + samples: numpy array of audio samples + sample_rate: current sample rate (e.g., 48000) + filename_prefix: optional prefix for filename + target_sample_rate: target sample rate (default 16000) + """ + if filename_prefix is None: + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + filename_prefix = f"source_{timestamp}" + + output_path = os.path.join(EXTRACT_DIR, f"{filename_prefix}_source_audio_16k.wav") + + # Resample to 16kHz if needed + if sample_rate != target_sample_rate: + import torch + samples_tensor = torch.tensor(samples).unsqueeze(0).unsqueeze(0) # [1, 1, length] + target_length = int(len(samples) * target_sample_rate / sample_rate) + samples_tensor = torch.nn.functional.interpolate( + samples_tensor, size=target_length, mode='linear', align_corners=False + ) + samples = samples_tensor.squeeze().numpy() + print(f" - Resampled from {sample_rate}Hz to {target_sample_rate}Hz") + + # Save as WAV file at 16kHz + soundfile.write(output_path, samples, target_sample_rate) + + print(f"✓ Saved source audio: {output_path}") + print(f" - Sample rate: {target_sample_rate} Hz (16kHz)") + print(f" - Duration: {len(samples)/target_sample_rate:.2f} seconds") + print(f" - Shape: {samples.shape}") + + return output_path + + +def save_discrete_units(units_tensor, filename_prefix=None, save_as_text=True): + """ + Save discrete speech units as PT file (and optionally as text). + + Call this when units are generated before being fed to vocoder. + Location: In agent/speech_to_speech.streamspeech.agent.py, + in policy() method, after line 713-724 where units are generated + + Args: + units_tensor: torch tensor of discrete units (can be list or tensor) + filename_prefix: optional prefix for filename + save_as_text: also save units as readable text file + """ + if filename_prefix is None: + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + filename_prefix = f"units_{timestamp}" + + # Convert to tensor if it's a list + if isinstance(units_tensor, list): + units_tensor = torch.tensor(units_tensor, dtype=torch.long) + + # Save as PyTorch file + pt_path = os.path.join(EXTRACT_DIR, f"{filename_prefix}_discrete_units.pt") + torch.save(units_tensor, pt_path) + + print(f"✓ Saved discrete units: {pt_path}") + print(f" - Shape: {units_tensor.shape}") + print(f" - Number of units: {units_tensor.numel()}") + print(f" - Unit range: [{units_tensor.min().item()}, {units_tensor.max().item()}]") + + # Also save as text for inspection + if save_as_text: + txt_path = os.path.join(EXTRACT_DIR, f"{filename_prefix}_discrete_units.txt") + units_list = units_tensor.cpu().tolist() if units_tensor.dim() > 0 else [units_tensor.item()] + with open(txt_path, 'w') as f: + # Save as space-separated values + if isinstance(units_list[0], list): + for row in units_list: + f.write(' '.join(map(str, row)) + '\n') + else: + f.write(' '.join(map(str, units_list)) + '\n') + print(f"✓ Saved units as text: {txt_path}") + + return pt_path + + +# Example usage metadata +def save_metadata(source_audio_path, units_path, additional_info=None): + """Save metadata about the extraction""" + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + metadata_path = os.path.join(EXTRACT_DIR, f"metadata_{timestamp}.txt") + + with open(metadata_path, 'w') as f: + f.write("StreamSpeech Intermediate Outputs\n") + f.write("=" * 50 + "\n\n") + f.write(f"Timestamp: {timestamp}\n") + f.write(f"Source Audio: {source_audio_path}\n") + f.write(f"Discrete Units: {units_path}\n") + 
if additional_info: + f.write(f"\nAdditional Info:\n") + for key, value in additional_info.items(): + f.write(f" {key}: {value}\n") + + print(f"✓ Saved metadata: {metadata_path}") + return metadata_path + + +""" +INTEGRATION INSTRUCTIONS +======================== + +1. In demo/app.py, modify the run() function: + + Add at line ~859 (after samples are loaded and resampled): + + ```python + # Import the extraction functions + from extract_intermediates import save_source_audio + + # Save source audio at 16kHz + # Note: Even though samples are at 48kHz here, the function will resample to 16kHz + save_source_audio(samples, ORG_SAMPLE_RATE, filename_prefix=os.path.basename(source).split('.')[0]) + ``` + +2. In agent/speech_to_speech.streamspeech.agent.py, modify the policy() method: + + Add at line ~744 (before units are fed to vocoder): + + ```python + # Import the extraction functions (add at top of file) + from demo.extract_intermediates import save_discrete_units + + # Save discrete units (add right before line 744) + if self.states.source_finished: # Only save final units + save_discrete_units(unit, filename_prefix="output") + + x = { + "code": torch.tensor(unit, dtype=torch.long, device=self.device).view( + 1, -1 + ), + } + ``` + +3. The extracted files will be saved in: demo/extracted_intermediates/ +""" + +if __name__ == "__main__": + print(__doc__) + print("\nExtracted files will be saved to:", os.path.abspath(EXTRACT_DIR)) + From 61394041de1cca5600fcf104913ceb185dbc8fc3 Mon Sep 17 00:00:00 2001 From: ronliwag Date: Fri, 7 Nov 2025 03:53:16 +0800 Subject: [PATCH 5/7] added frontend components for comparison --- demo/templates/index.html | 107 +++++++++++++++++++++++++++++++++----- 1 file changed, 93 insertions(+), 14 deletions(-) diff --git a/demo/templates/index.html b/demo/templates/index.html index 7574bd4..40d5abe 100644 --- a/demo/templates/index.html +++ b/demo/templates/index.html @@ -59,7 +59,14 @@ button:hover:not([disabled]) { background-color: #45a049; } - #waveform, #outputWaveform { + #playModifiedButton { + background-color: #FF9800; + margin-left: 10px; + } + #playModifiedButton:hover:not([disabled]) { + background-color: #F57C00; + } + #waveform, #outputWaveform, #outputWaveformModified { margin: 10px 0; /* Reduced the margin */ border: 1px solid #ddd; border-radius: 5px; @@ -156,17 +163,10 @@
[index.html hunk bodies garbled in extraction — HTML markup lost; recoverable changes follow]
- "ACL 2024" badge, author line ("Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng*"), tagline ("💡StreamSpeech is an "All in One" seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis under any latency."), and the arXiv / Demo / StreamSpeech Models / Visitors link row
+ "MODIFIED" badge (page title "StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning" kept as context)
@@ -179,9 +179,9 @@
  [garbled: one control line under "Streaming Inputs" replaced by two, above the "Streaming Speech Recognition" panel]
@@ -191,15 +191,20 @@
- "Simultaneous Speech-to-Speech Translation" panel
+ "Simultaneous Speech-to-Speech Translation (StreamSpeech)" panel
+ "Simultaneous Speech-to-Speech Translation (Modified Vocoder)" panel, with the new outputWaveformModified container and playModifiedButton styled in the CSS hunk above
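
The comparison UI assumes the backend can serve a second synthesized waveform alongside the original. A minimal Flask sketch of one way to wire that up — the `/output_modified` route and `output_modified.wav` filename are illustrative assumptions, not names taken from this patch:

```python
from flask import Flask, send_file, abort
import os

app = Flask(__name__)

# Hypothetical companion to the demo's existing /output route: serves the
# modified-vocoder result so the "Play Modified" button has an audio source.
@app.route('/output_modified')
def output_modified():
    wav_path = os.path.join(app.root_path, 'output', 'output_modified.wav')
    if not os.path.exists(wav_path):
        abort(404)  # nothing synthesized yet for this session
    return send_file(wav_path, mimetype='audio/wav')
```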