From 18c7779f9c6611bc2e96e0d8b8c026abfa13b0bf Mon Sep 17 00:00:00 2001
From: ethan
Date: Thu, 13 Nov 2025 20:19:26 -0800
Subject: [PATCH 01/14] add fireredtts2 notebook

---
 notebooks/fireredtts2/README.md | 34 +
 notebooks/fireredtts2/fireredtts2.ipynb | 1545 +++++++++++++++++
 notebooks/fireredtts2/gradio_helper.py | 363 ++++
 notebooks/fireredtts2/ov_fireredtts_helper.py | 1474 ++++++++++++++++
 4 files changed, 3416 insertions(+)
 create mode 100644 notebooks/fireredtts2/README.md
 create mode 100644 notebooks/fireredtts2/fireredtts2.ipynb
 create mode 100644 notebooks/fireredtts2/gradio_helper.py
 create mode 100644 notebooks/fireredtts2/ov_fireredtts_helper.py

diff --git a/notebooks/fireredtts2/README.md b/notebooks/fireredtts2/README.md
new file mode 100644
index 00000000000..dc0b53cf620
--- /dev/null
+++ b/notebooks/fireredtts2/README.md
@@ -0,0 +1,34 @@
+# Multi-speaker dialogue generation with FireRedTTS‑2 and OpenVINO
+
+FireRedTTS‑2 is a long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural speech with reliable speaker switching and context-aware prosody. It is highlighted by the following features:
+- **Long Conversational Speech Generation**: It currently supports 3-minute dialogues with 4 speakers and can be easily scaled to longer conversations
+with more speakers by extending the training corpus.
+- **Multilingual Support**: It supports multiple languages, including English, Chinese, Japanese, Korean, French, German, and Russian, and provides zero-shot voice cloning for cross-lingual and code-switching scenarios.
+- **Ultra-Low Latency**: Building on the new **12.5Hz streaming** speech tokenizer, the model employs a dual-transformer architecture that operates on a text–speech interleaved sequence, enabling flexible sentence-by-sentence generation and reducing first-packet latency. Specifically, on an L20 GPU, the reported first-packet latency is as low as 140 ms while maintaining high-quality audio output.
+- **Strong Stability**: The model achieves high speaker similarity and low WER/CER in both monologue and dialogue tests.
+- **Random Timbre Generation**: Useful for creating ASR and speech-interaction data.
+
+More details can be found in the [paper](https://arxiv.org/abs/2509.02020), original [repository](https://github.com/FireRedTeam/FireRedTTS2) and [model card](https://huggingface.co/FireRedTeam/FireRedTTS2).
+
+In this tutorial we consider how to run and optimize FireRedTTS‑2 using OpenVINO.
+
+## Notebook contents
+The tutorial consists of the following steps:
+
+- Install requirements
+- Convert and Optimize model
+- Run OpenVINO model inference
+- Launch Interactive demo
+
+In this demonstration, you'll build a multi-speaker dialogue generation pipeline that clones the voices of the provided reference speakers and synthesizes a natural-sounding conversation from a text script, as shown in the usage sketch below.
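+
+The snippet below is a minimal sketch of what inference with the converted model looks like. It mirrors the `OVFireRedTTS2` helper API used in the notebook; the dialogue text and prompt paths are placeholders for your own script and reference recordings.
+
+```python
+import torchaudio
+from ov_fireredtts_helper import OVFireRedTTS2
+
+# Compile the converted IRs (the codec part currently runs on CPU only)
+tts = OVFireRedTTS2("FireRedTTS2-ov", gen_type="dialogue", device="CPU", codec_device="CPU")
+
+audio = tts.generate_dialogue(
+    text_list=["[S1]Hello there!", "[S2]Hi, nice to meet you."],  # dialogue script, one entry per turn
+    prompt_wav_list=["S1.flac", "S2.flac"],  # one reference audio clip per speaker (placeholder paths)
+    prompt_text_list=["[S1]Reference transcript.", "[S2]Reference transcript."],  # transcripts of the clips
+    temperature=0.9,
+    topk=30,
+)
+torchaudio.save("dialogue.wav", audio, 24000)  # the model generates 24 kHz audio
+```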
+
+## Installation instructions
+This is a self-contained example that relies solely on its own code.
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, please refer to [Installation Guide](../../README.md).
+
+
diff --git a/notebooks/fireredtts2/fireredtts2.ipynb b/notebooks/fireredtts2/fireredtts2.ipynb
new file mode 100644
index 00000000000..ae3beb8ff54
--- /dev/null
+++ b/notebooks/fireredtts2/fireredtts2.ipynb
@@ -0,0 +1,1545 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Multi-speaker dialogue generation with FireRedTTS‑2 and OpenVINO\n",
+    "\n",
+    "FireRedTTS‑2 is a long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural speech with reliable speaker switching and context-aware prosody. It is highlighted by the following features:\n",
+    "- **Long Conversational Speech Generation**: It currently supports 3-minute dialogues with 4 speakers and can be easily scaled to longer conversations\n",
+    "with more speakers by extending the training corpus.\n",
+    "- **Multilingual Support**: It supports multiple languages, including English, Chinese, Japanese, Korean, French, German, and Russian, and provides zero-shot voice cloning for cross-lingual and code-switching scenarios.\n",
+    "- **Ultra-Low Latency**: Building on the new **12.5Hz streaming** speech tokenizer, the model employs a dual-transformer architecture that operates on a text–speech interleaved sequence, enabling flexible sentence-by-sentence generation and reducing first-packet latency. Specifically, on an L20 GPU, the reported first-packet latency is as low as 140 ms while maintaining high-quality audio output.\n",
+    "- **Strong Stability**: The model achieves high speaker similarity and low WER/CER in both monologue and dialogue tests.\n",
+    "- **Random Timbre Generation**: Useful for creating ASR and speech-interaction data.\n",
+    "\n",
+    "More details can be found in the [paper](https://arxiv.org/abs/2509.02020), original [repository](https://github.com/FireRedTeam/FireRedTTS2) and [model card](https://huggingface.co/FireRedTeam/FireRedTTS2).\n",
+    "\n",
+    "In this tutorial we consider how to run and optimize FireRedTTS‑2 using OpenVINO.\n",
+    "\n",
+    "#### Table of contents:\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Convert and Optimize model](#Convert-and-Optimize-model)\n",
+    "- [Create Inference Pipeline](#Create-Inference-Pipeline)\n",
+    "    - [Select Inference Device](#Select-Inference-Device)\n",
+    "    - [Run Dialogue Generation](#Run-Dialogue-Generation)\n",
+    "- [Interactive demo](#Interactive-demo)\n",
+    "\n",
+    "\n",
+    "### Installation Instructions\n",
+    "\n",
+    "This is a self-contained example that relies solely on its own code.\n",
+    "\n",
+    "We recommend running the notebook in a virtual environment. 
You only need a Jupyter server to start.\n", + "For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).\n", + "\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "[back to top ⬆️](#Table-of-contents:)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch `notebook_utils` module\n", + "import requests\n", + "from pathlib import Path\n", + "\n", + "if not Path(\"notebook_utils.py\").exists():\n", + " r = requests.get(\n", + " url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py\",\n", + " )\n", + " open(\"notebook_utils.py\", \"w\").write(r.text)\n", + "\n", + "if not Path(\"cmd_helper.py\").exists():\n", + " r = requests.get(\n", + " url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/cmd_helper.py\",\n", + " )\n", + " open(\"cmd_helper.py\", \"w\").write(r.text)\n", + "\n", + "if not Path(\"pip_helper.py\").exists():\n", + " r = requests.get(\n", + " url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/pip_helper.py\",\n", + " )\n", + " open(\"pip_helper.py\", \"w\").write(r.text)\n", + "\n", + "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n", + "from notebook_utils import collect_telemetry\n", + "\n", + "collect_telemetry(\"firetts2.ipynb\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Found existing installation: fireredtts2 0.1\n", + "Uninstalling fireredtts2-0.1:\n", + " Successfully uninstalled fireredtts2-0.1\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", + "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", + "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", + "\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Note: switching to 'bfacbfb7bb88cade9c0b9ab2644ebd7f75c6989c'.\n", + "\n", + "You are in 'detached HEAD' state. You can look around, make experimental\n", + "changes and commit them, and you can discard any commits you make in this\n", + "state without impacting any branches by switching back to a branch.\n", + "\n", + "If you want to create a new branch to retain commits you create, you may\n", + "do so (now or later) by using -c with the switch command. 
Example:\n",
+      "\n",
+      "  git switch -c \n",
+      "\n",
+      "Or undo this operation with:\n",
+      "\n",
+      "  git switch -\n",
+      "\n",
+      "Turn off this advice by setting config variable advice.detachedHead to false\n",
+      "\n",
+      "HEAD is now at bfacbfb Update llm.py\n",
+      "\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+      "\u001b[0m\u001b[33m  WARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+      "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+      "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+      "\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+      "\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+      "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+      "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+      "\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n",
+      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
+     ]
+    }
+   ],
+   "source": [
+    "from cmd_helper import clone_repo\n",
+    "from pip_helper import pip_install\n",
+    "import platform\n",
+    "\n",
+    "!pip uninstall -y FireRedTTS2\n",
+    "\n",
+    "pip_install(\n",
+    "    \"-q\",\n",
+    "    \"--extra-index-url\",\n",
+    "    \"https://download.pytorch.org/whl/cpu\",\n",
+    "    \"torch==2.7.1\",\n",
+    "    \"torchvision==0.22.1\",\n",
+    "    \"torchaudio==2.7.1\",\n",
+    "    \"nncf\",\n",
+    "    \"openvino>=2025.3.0\",\n",
+    "    \"gradio\",\n",
+    ")\n",
+    "\n",
+    "repo_dir = Path(\"FireRedTTS2\")\n",
+    "revision = \"bfacbfb7bb88cade9c0b9ab2644ebd7f75c6989c\"\n",
+    "clone_repo(\"https://github.com/openvino-dev-samples/FireRedTTS2.git\", revision)\n",
+    "\n",
+    "pip_install(\n",
+    "    \"-q -e\",\n",
+    "    str(repo_dir),\n",
+    ")\n",
+    "\n",
+    "pip_install(\n",
+    "    \"-q -r\",\n",
+    "    str(repo_dir / \"requirements.txt\"),\n",
+    ")\n",
+    "if platform.system() == \"Darwin\":\n",
+    "    pip_install(\"numpy<2.0\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Convert and Optimize model\n",
+    "[back to top ⬆️](#Table-of-contents:)\n",
+    "\n",
+    "FireRedTTS-2 is a PyTorch model. OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR). [OpenVINO model conversion API](https://docs.openvino.ai/2024/openvino-workflow/model-preparation.html#convert-a-model-with-python-convert-model) should be used for these purposes. The `ov.convert_model` function accepts an original PyTorch model instance and example input for tracing and returns an `ov.Model` object representing this model in the OpenVINO framework. The converted model can be saved to disk using `ov.save_model` or loaded on a device directly using `core.compile_model`. \n",
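+    "\n",
+    "The cell below (via `convert_fireredtts2`) applies this workflow to each FireRedTTS-2 submodule. As a minimal, generic illustration of the same pattern (using a stand-in `torch.nn.Linear` module rather than the real submodules), the conversion steps look like this:\n",
+    "\n",
+    "```python\n",
+    "import torch\n",
+    "import openvino as ov\n",
+    "\n",
+    "core = ov.Core()\n",
+    "\n",
+    "# Stand-in module and example input used only for tracing\n",
+    "example_module = torch.nn.Linear(16, 16)\n",
+    "example_input = torch.zeros(1, 16)\n",
+    "\n",
+    "ov_model = ov.convert_model(example_module, example_input=example_input)\n",
+    "ov.save_model(ov_model, \"example_submodule.xml\")  # weights are stored next to the .xml as a .bin file\n",
+    "\n",
+    "compiled_model = core.compile_model(\"example_submodule.xml\", \"CPU\")\n",
+    "```\n",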
+    "\n",
+    "The script `ov_fireredtts_helper.py` contains helper functions for model conversion."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Skipping import of cpp extensions due to incompatible torch version 2.7.1+cpu for torchao version 0.14.1 Please see https://github.com/pytorch/ao/issues/2919 for more info\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "⌛ pretrained_models conversion started. Be patient, it may takes some time.\n",
+      "⌛ Load Original model\n",
+      "🔍 Detected Configuration:\n",
+      "  num_heads: 12\n",
+      "  num_kv_heads: 2\n",
+      "  dim: 1536\n",
+      "  head_dim: 128\n",
+      "  intermediate_size: 8960\n",
+      "  num_layers: 28\n",
+      "  max_seq_len: 4096\n",
+      "  tie_word_embeddings: True\n",
+      "\n",
+      "🔧 Removing 'model.' prefix...\n",
+      "\n",
+      "🔑 Cleaned key examples:\n",
+      "  layers.0.self_attn.q_proj.weight\n",
+      "  layers.0.self_attn.q_proj.bias\n",
+      "  layers.0.self_attn.k_proj.weight\n",
+      "  layers.0.self_attn.k_proj.bias\n",
+      "  layers.0.self_attn.v_proj.weight\n",
+      "\n",
+      "⚠️ Missing keys: ['embed_tokens.weight']\n",
+      "\n",
+      "✅ Conversion completed!\n",
+      "🔍 Detected Configuration:\n",
+      "  num_heads: 12\n",
+      "  num_kv_heads: 2\n",
+      "  dim: 1536\n",
+      "  head_dim: 128\n",
+      "  intermediate_size: 8960\n",
+      "  num_layers: 4\n",
+      "  max_seq_len: 4096\n",
+      "  tie_word_embeddings: True\n",
+      "\n",
+      "🔧 Removing 'model.' prefix...\n",
+      "\n",
+      "🔑 Cleaned key examples:\n",
+      "  layers.0.self_attn.q_proj.weight\n",
+      "  layers.0.self_attn.q_proj.bias\n",
+      "  layers.0.self_attn.k_proj.weight\n",
+      "  layers.0.self_attn.k_proj.bias\n",
+      "  layers.0.self_attn.v_proj.weight\n",
+      "\n",
+      "⚠️ Missing keys: ['embed_tokens.weight']\n",
+      "\n",
+      "✅ Conversion completed!\n",
+      "[INFO] LLM Loaded...\n",
+      "[INFO] Text Tokenizer Loaded...\n",
+      "[INFO] Codec Loaded...\n",
+      "✅ Original model successfully loaded\n",
+      "⌛ Export tokenizer and config\n",
+      "⌛ Convert TEXT_EMBEDDINGS model\n",
+      "✅ TEXT_EMBEDDINGS model successfully converted\n",
+      "⌛ Convert AUDIO_EMBEDDINGS model\n",
+      "✅ AUDIO_EMBEDDINGS model successfully converted\n",
+      "⌛ Convert AUDIO_UPSAMPLER model\n",
+      "vq_out_feats shape: 1\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:760: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. 
In any other case, this might cause the trace to be incorrect.\n", + " vq_out_length = torch.tensor(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "vq_out_feats shape: 1\n", + "vq_out_feats shape: 1\n", + "✅ AUDIO_UPSAMPLER model successfully converted\n", + "⌛ Convert AUDIO_DECODER model\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:7: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " max_len = max_len if max_len > 0 else lengths.max().item()\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:26: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", + " num_blocks = torch.ceil(torch.tensor(attn_mask.shape[1] / chunk_size)).to(torch.int64)\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:26: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).\n", + " num_blocks = torch.ceil(torch.tensor(attn_mask.shape[1] / chunk_size)).to(torch.int64)\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/decoder.py:402: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " assert (window_envelope > 1e-11).all()\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ AUDIO_DECODER model successfully converted\n", + "⌛ Convert AUDIO_ENCODER model\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/model.py:221: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", + " audio16k_length = torch.tensor(\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/whisper.py:330: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. 
In any other case, this might cause the trace to be incorrect.\n", + " mel_filters = torch.from_numpy(self.mel_filters).type(torch.float32).to(device)\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:7: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " max_len = max_len if max_len > 0 else lengths.max().item()\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ AUDIO_ENCODER model successfully converted\n", + "⌛ Convert DECODER_MODEL model\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.\n", + "/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/transformers/cache_utils.py:568: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " or not self.key_cache[layer_idx].numel() # the layer has no cache\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:371: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if (padding_length := kv_length + kv_offset - attention_mask.shape[-1]) > 0:\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:503: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", + " torch.tensor(0.0, device=mask.device, dtype=dtype),\n", + "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:504: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", + " torch.tensor(torch.finfo(torch.float16).min, device=mask.device, dtype=dtype),\n", + "/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/transformers/cache_utils.py:551: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " elif (\n", + "/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py:59: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n",
+      "  is_causal = query.shape[2] > 1 and attention_mask is None and getattr(module, \"is_causal\", True)\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "✅ Decoder model successfully converted\n",
+      "⌛ Convert BACKBONE_MODEL model\n",
+      "✅ Backbone model successfully converted\n",
+      "✅ pretrained_models model conversion finished. You can find results in FireRedTTS2-ov\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "PosixPath('FireRedTTS2-ov')"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from ov_fireredtts_helper import convert_fireredtts2\n",
+    "\n",
+    "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n",
+    "from notebook_utils import collect_telemetry\n",
+    "\n",
+    "collect_telemetry(\"fireredtts2.ipynb\")\n",
+    "\n",
+    "pt_model_path = Path(\"pretrained_models\")\n",
+    "if not pt_model_path.exists():\n",
+    "    !git clone https://huggingface.co/FireRedTeam/FireRedTTS2 pretrained_models\n",
+    "\n",
+    "model_path = \"FireRedTTS2-ov\"\n",
+    "convert_fireredtts2(pt_model_path, model_path)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create Inference Pipeline\n",
+    "[back to top ⬆️](#Table-of-contents:)\n",
+    "\n",
+    "The `OVFireRedTTS2` class defined in `ov_fireredtts_helper.py` provides a unified interface for running model inference. It accepts the directory with the converted models and the target devices for inference.\n",
+    "\n",
+    "### Select Inference Device\n",
+    "[back to top ⬆️](#Table-of-contents:)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "2d379251f53b43eb805b5a5f8d501f55",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Dropdown(description='Device:', options=('CPU', 'GPU', 'AUTO'), value='CPU')"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from notebook_utils import device_widget\n",
+    "\n",
+    "device = device_widget(\"CPU\", [\"NPU\"])\n",
+    "\n",
+    "device"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `OVFireRedTTS2` class reuses the pre- and post-processing code of the original FireRedTTS-2 model; the converted models are compatible with the same processing steps, so we can reuse them as is. \n",
+    "\n",
+    "ℹ️ **Limitations**\n",
+    "- Currently, only the `dialogue` generation mode is supported. \n",
+    "- The codec models can be deployed only on `CPU`.\n",
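+    "\n",
+    "For reference, the wrapper follows the usual OpenVINO pattern of compiling each converted IR on its target device; the sketch below illustrates the idea with placeholder file names (the actual paths are defined inside `ov_fireredtts_helper.py`):\n",
+    "\n",
+    "```python\n",
+    "import openvino as ov\n",
+    "\n",
+    "core = ov.Core()\n",
+    "\n",
+    "device = \"CPU\"        # e.g. the value selected in the widget above\n",
+    "codec_device = \"CPU\"  # codec models are currently limited to CPU\n",
+    "\n",
+    "# Placeholder IR names, for illustration only\n",
+    "backbone = core.compile_model(\"FireRedTTS2-ov/backbone.xml\", device)\n",
+    "audio_decoder = core.compile_model(\"FireRedTTS2-ov/audio_decoder.xml\", codec_device)\n",
+    "```"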
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "ename": "RuntimeError", + "evalue": "Exception from src/inference/src/cpp/core.cpp:134:\nException from src/inference/src/dev/plugin.cpp:58:\nException from src/core/src/pass/graph_rewrite.cpp:298:\n[FuseBinaryEltwise] END: node: opset1::Add Add_494266 (SnippetsOpset::BrgemmCPU MatMul_494263[0]:f32[?,20,?,?], opset1::Parameter Add_494266[0]:f32[1,1,1,300]) -> (f32[?,20,?,300]) CALLBACK HAS THROWN: Exception from src/core/src/dimension.cpp:227:\nCannot get length of dynamic dimension\n\n\n\n\n", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[6], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mov_fireredtts_helper\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m OVFireRedTTS2\n\u001b[0;32m----> 3\u001b[0m ov_model \u001b[38;5;241m=\u001b[39m \u001b[43mOVFireRedTTS2\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mgen_type\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mdialogue\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdevice\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdevice\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalue\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcodec_device\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mCPU\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n", + "File \u001b[0;32m/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:1079\u001b[0m, in \u001b[0;36mOVFireRedTTS2.__init__\u001b[0;34m(self, pretrained_dir, gen_type, device, codec_device)\u001b[0m\n\u001b[1;32m 1077\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_embeddings \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m AUDIO_EMBEDDINGS_PATH, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdevice)\n\u001b[1;32m 1078\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_decoder \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m AUDIO_DECODER_PATH, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcodec_device)\n\u001b[0;32m-> 1079\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_encoder \u001b[38;5;241m=\u001b[39m \u001b[43mcore\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcompile_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel_dir\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m/\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mAUDIO_ENCODER_PATH\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcodec_device\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1080\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtext_embeddings \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m TEXT_EMBEDDINGS_PATH, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdevice)\n\u001b[1;32m 
1081\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_upsampler \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m AUDIO_UPSAMPLER_PATH, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdevice)\n",
+      "File \u001b[0;32m/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/openvino/_ov_api.py:610\u001b[0m, in \u001b[0;36mCore.compile_model\u001b[0;34m(self, model, device_name, config, weights)\u001b[0m\n\u001b[1;32m    605\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m device_name \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m    606\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m CompiledModel(\n\u001b[1;32m    607\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39mcompile_model(model, {} \u001b[38;5;28;01mif\u001b[39;00m config \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m config),\n\u001b[1;32m    608\u001b[0m )\n\u001b[1;32m    609\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m CompiledModel(\n\u001b[0;32m--> 610\u001b[0m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcompile_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdevice_name\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m{\u001b[49m\u001b[43m}\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01melse\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m)\u001b[49m,\n\u001b[1;32m    611\u001b[0m )\n\u001b[1;32m    612\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m    613\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m device_name \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
+      "\u001b[0;31mRuntimeError\u001b[0m: Exception from src/inference/src/cpp/core.cpp:134:\nException from src/inference/src/dev/plugin.cpp:58:\nException from src/core/src/pass/graph_rewrite.cpp:298:\n[FuseBinaryEltwise] END: node: opset1::Add Add_494266 (SnippetsOpset::BrgemmCPU MatMul_494263[0]:f32[?,20,?,?], opset1::Parameter Add_494266[0]:f32[1,1,1,300]) -> (f32[?,20,?,300]) CALLBACK HAS THROWN: Exception from src/core/src/dimension.cpp:227:\nCannot get length of dynamic dimension\n\n\n\n\n"
+     ]
+    }
+   ],
+   "source": [
+    "from ov_fireredtts_helper import OVFireRedTTS2\n",
+    "\n",
+    "ov_model = OVFireRedTTS2(model_path, gen_type=\"dialogue\", device=device.value, codec_device=\"CPU\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run Dialogue Generation\n",
+    "[back to top ⬆️](#Table-of-contents:)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torchaudio\n",
+    "\n",
+    "\n",
+    "text_list = [\n",
+    "    \"[S1]It's alright, we'll take a breath and plan the next pass together.\",\n",
+    "    \"[S2]Yeah, thanks. 
We'll get it right this time.\",\n", + " \"[S1]Let's review our signals tonight so we're in sync on the field tomorrow.\",\n", + "]\n", + "prompt_wav_list = [\n", + " \"FireRedTTS2/examples/chat_prompt/en/S1.flac\",\n", + " \"FireRedTTS2/examples/chat_prompt/en/S2.flac\",\n", + "]\n", + "\n", + "prompt_text_list = [\n", + " \"[S1]I think we should just talk about what happened and move on because there's going to be other jousts and Sir Saif isn't done yet. It's not, he's not, it's not done yet.\",\n", + " \"[S2]You know, maybe sorry, maybe maybe I pushed, maybe I pushed too hard. I was really excited. I didn't mean to make you snap.\",\n", + "]\n", + "\n", + "all_audio = ov_model.generate_dialogue(\n", + " text_list=text_list,\n", + " prompt_wav_list=prompt_wav_list,\n", + " prompt_text_list=prompt_text_list,\n", + " temperature=0.9,\n", + " topk=30,\n", + ")\n", + "torchaudio.save(\"chat_clone_ov.wav\", all_audio, 24000)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import IPython\n", + "\n", + "display(IPython.display.Audio(\"chat_clone_ov.wav\"))" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interactive demo\n", + "[back to top ⬆️](#Table-of-contents:)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from gradio_helper import make_demo\n", + "\n", + "demo = make_demo(ov_model)\n", + "\n", + "try:\n", + " demo.launch(debug=True)\n", + "except Exception:\n", + " demo.launch(share=True, debug=True)\n", + "# if you are launching remotely, specify server_name and server_port\n", + "# demo.launch(server_name='your server name', server_port='server port in int')\n", + "# Read more in the docs: https://gradio.app/docs/" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.7" + }, + "openvino_notebooks": { + "imageUrl": "https://github.com/user-attachments/assets/0d83b369-b8fc-423e-bc53-495022555e8c", + "tags": { + "categories": [ + "Model Demos", + "AI Trends" + ], + "libraries": [], + "other": [], + "tasks": [ + "Visual Question Answering", + "Image-to-Text", + "Text Generation", + "Text-to-Image" + ] + } + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": { + "03d2e3830e9f4a5a9df9e035445fc13b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "05c1f611c4b547db94014027cb4a60d7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "0b0a7ba8b4a9478cbeff30f4dbbb2509": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "0efc15d7628048e6ab6daa8847033342": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_edbaa56d923f4a34ad4cb1e3821696bb", + "max": 346, + "style": "IPY_MODEL_8d29e1f4516145358af0010393973f44", 
+ "value": 346 + } + }, + "111eeed4303140d8882c19361bcf67f3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_946f9982bac14fac866b22ae44481715", + "style": "IPY_MODEL_d0a1f7a536464a4fbac4310131ddeed0", + "value": " 0/576 [00:00<?, ?steps/s]" + } + }, + "19db49e350b4431fa13da9ae7ec3750c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_3c3e4546165f4a1fb3dcbc4452a9a4d2", + "style": "IPY_MODEL_70044f1bb2114ce1b352bea12f1475f0", + "value": " 1.46k/1.46k [00:00<00:00, 171kB/s]" + } + }, + "1b41c99067204a289595da6325f40d59": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "1ca51247649f47c1bd4c67c64055f947": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "1f55d14e1e33499d8c2acb7755bb620e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "1f76478266de44ce84709bdd83c9b420": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "DropdownModel", + "state": { + "_options_labels": [ + "deepseek-ai/Janus-Pro-1B", + "deepseek-ai/Janus-Pro-7B", + "deepseek-ai/Janus-1.3B" + ], + "index": 0, + "layout": "IPY_MODEL_cc99923fd96f499193ddfccde886fc7c", + "style": "IPY_MODEL_9fa56671a8b14d878ff6acf025f3cd94" + } + }, + "208406abe0e04cd39b9a5c78a7fb9374": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "252980b688df42b8905baf00a60ffe20": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_1b41c99067204a289595da6325f40d59", + "max": 344, + "style": "IPY_MODEL_82a909693fdf473c8c81aabd89785d07", + "value": 344 + } + }, + "295e4036a3c84f5494659a3ed52f85ea": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "29e99a461ee1428db28e6895253c9530": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "2fc9c65105714b618637263ab9803be2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_44d68a1ee64f4150bedd543aea6a1245", + "max": 576, + "style": "IPY_MODEL_f202f2f043dd487e909e8b4edf66d63d", + "value": 576 + } + }, + "339a9e94c9f6448ca27a027b5758eeca": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_70896cdb44e74f97be117104cf1109c9", + "style": "IPY_MODEL_1f55d14e1e33499d8c2acb7755bb620e", + "value": " 344/344 [00:00<00:00, 35.6kB/s]" + } + }, + "3781e958c24143dd89b322a9bde9dca5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + 
"state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "3c3e4546165f4a1fb3dcbc4452a9a4d2": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "3cc656c47b76492fa2a8c94da9eea89e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_7942a6e0d5394725aa97ebc5f0ddf615", + "style": "IPY_MODEL_dc6e36c1b06b4a959a45a4243ec4d5e7", + "value": "config.json: 100%" + } + }, + "3fa8ab29131a4995a7f9ebf15fcce5d8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_9f5c1568a4ad4980bca14dc585050422", + "style": "IPY_MODEL_3781e958c24143dd89b322a9bde9dca5", + "value": "100%" + } + }, + "425207e8ab2346379e9c1d8cbf63345f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "43b41ff5845046aaa53bd3dc9f063357": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "44d68a1ee64f4150bedd543aea6a1245": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "4558760de72c4af9a9bc8f155f94fee7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_5d88cb13eb55452ea602a42a2a7d9f29", + "style": "IPY_MODEL_7de650854a04404d9a3d3f51d3331cd7", + "value": "processor_config.json: 100%" + } + }, + "47fc99e537c1491881cdd0f922918176": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "483f584cbd0845ea9677e1cd75d5e030": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_5d34409b239348d7bc8d0ec4bf7e8050", + "max": 4718799, + "style": "IPY_MODEL_a192fd1ec9ed418ba9fb5eec5c762c77", + "value": 4718799 + } + }, + "51359aacb849491a97ed9f747245c807": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "518d7f4f58b440fc8fbe4a89e03a359b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "528bbf053321445081c2abb0744ae955": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_52bfc8ce36a3480dacaa9d071f3df67c", + "style": "IPY_MODEL_67d39c28015f405c85d857bd7254c6ba", + "value": " 346/346 [00:00<00:00, 35.1kB/s]" + } + }, + "52bfc8ce36a3480dacaa9d071f3df67c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "52fd01c252854b1180b3bd43886e5888": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "544b396b604844598a4fc6903443604a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "layout": 
"IPY_MODEL_295e4036a3c84f5494659a3ed52f85ea", + "max": 576, + "style": "IPY_MODEL_29e99a461ee1428db28e6895253c9530" + } + }, + "5929050d6c5a41139ac7f12dc72b5db0": { + "model_module": "@jupyter-widgets/output", + "model_module_version": "1.0.0", + "model_name": "OutputModel", + "state": { + "layout": "IPY_MODEL_7e06f8dd82c64208a7aa80939d1c4164", + "outputs": [ + { + "data": { + "text/html": "
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%0:00:440:00:00\n
\n", + "text/plain": "Applying Weight Compression \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[35m100%\u001b[0m • \u001b[38;2;0;104;181m0:00:44\u001b[0m • \u001b[38;2;0;104;181m0:00:00\u001b[0m\n" + }, + "metadata": {}, + "output_type": "display_data" + } + ] + } + }, + "5ae0ea491dab46e79bae21c0f074b2fa": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "DescriptionStyleModel", + "state": { + "description_width": "" + } + }, + "5d34409b239348d7bc8d0ec4bf7e8050": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "5d88cb13eb55452ea602a42a2a7d9f29": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "654d8381ec9048ccafaec41ac2b96e54": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_7aad089e5cf74701a27737e2877da54f", + "style": "IPY_MODEL_99c67ba283174da98bf56316c16b721b", + "value": "tokenizer.json: 100%" + } + }, + "67d39c28015f405c85d857bd7254c6ba": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "6ac0e37a11a24933aff9ac6ba821a1a2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_7f86e41c1b814f7b9c921ab8993b107a", + "max": 210, + "style": "IPY_MODEL_1ca51247649f47c1bd4c67c64055f947", + "value": 210 + } + }, + "6affdca6b9974b16bfecc161575f5b0b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_52fd01c252854b1180b3bd43886e5888", + "max": 1455, + "style": "IPY_MODEL_425207e8ab2346379e9c1d8cbf63345f", + "value": 1455 + } + }, + "6e5d75d09aeb4003bfc9930a628d84ae": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "70044f1bb2114ce1b352bea12f1475f0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "70896cdb44e74f97be117104cf1109c9": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "736dec576cde4d39b8a88b5407f5f5a4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_0b0a7ba8b4a9478cbeff30f4dbbb2509", + "style": "IPY_MODEL_208406abe0e04cd39b9a5c78a7fb9374", + "value": "special_tokens_map.json: 100%" + } + }, + "7394d989f22a42aca42adeac8539ce11": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "7942a6e0d5394725aa97ebc5f0ddf615": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "7aad089e5cf74701a27737e2877da54f": 
{ + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "7de650854a04404d9a3d3f51d3331cd7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "7df39ecc432e4144a5823141763af3c8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "7e06f8dd82c64208a7aa80939d1c4164": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "7f86e41c1b814f7b9c921ab8993b107a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "803697e0513642fe9e8ff77b775f886a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "82a909693fdf473c8c81aabd89785d07": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "8b8873eac8a04d45898cf49f1b015ebc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_cb0c534940fd46c7a256a82016e80132", + "style": "IPY_MODEL_bc05a27a0c454cc18b31312cdabc2778", + "value": "preprocessor_config.json: 100%" + } + }, + "8d29e1f4516145358af0010393973f44": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "9309a7c3eb5140379ab6d83b417f3141": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "946f9982bac14fac866b22ae44481715": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "99c67ba283174da98bf56316c16b721b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "9a3e489b2bfe4c13979978fcf77d51d9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "9e81ab54a69047b5a002ab930424d02f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "9f5c1568a4ad4980bca14dc585050422": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "9fa56671a8b14d878ff6acf025f3cd94": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "DescriptionStyleModel", + "state": { + "description_width": "" + } + }, + "a0504a67d470438a965e07d17af21637": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "a192fd1ec9ed418ba9fb5eec5c762c77": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "a4369724a2ab4ec387e1cd61ca5684e5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "a77553b9f33a42a7b76fe0f98cd87d6d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "a874912fad204464bff2da6d2fae8124": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "aa262fb1b16142e28f92accd3061089e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_e9902416e1d84f98bcb78eb532825c35", + "IPY_MODEL_e57aa292ae1442a294da4c8f6f968f68", + "IPY_MODEL_af44e6a6a46e4ea3959394e29207c03f" + ], + "layout": "IPY_MODEL_9e81ab54a69047b5a002ab930424d02f" + } + }, + "aacdf8202ab64b25bf38a31143b52a56": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_3fa8ab29131a4995a7f9ebf15fcce5d8", + "IPY_MODEL_2fc9c65105714b618637263ab9803be2", + "IPY_MODEL_f382d51c6ee84d9fb7363d04bb881315" + ], + "layout": "IPY_MODEL_be5051d90a084ca6adabc86a12ebec90" + } + }, + "ac9029315f344bbc81252ef67eb9470b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_654d8381ec9048ccafaec41ac2b96e54", + "IPY_MODEL_483f584cbd0845ea9677e1cd75d5e030", + "IPY_MODEL_e6d57f1e219048079b975ace23f2d19d" + ], + "layout": "IPY_MODEL_f25fc41aa4b74cd6a27ac4f877ad0d71" + } + }, + "af44e6a6a46e4ea3959394e29207c03f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_a4369724a2ab4ec387e1cd61ca5684e5", + "style": "IPY_MODEL_803697e0513642fe9e8ff77b775f886a", + "value": " 4.18G/4.18G [04:44<00:00, 15.1MB/s]" + } + }, + "b47b7a09401d45f380e184ed4a1ab6c1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_dbcd0c980a8b4c03bbd58a5bbf26534a", + "style": "IPY_MODEL_7394d989f22a42aca42adeac8539ce11", + "value": "  0%" + } + }, + "b79c91b3765645b78b02c7e4941e2df8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "bc05a27a0c454cc18b31312cdabc2778": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "bce056e2ae4b46b8abde82e63d8521af": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "be5051d90a084ca6adabc86a12ebec90": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "be674c5ea720499a80e15e6e96bc0c59": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "c03a4ba5262f4a55aa0f26eb794bf23d": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "c04e993c235c42a2902ae3e382064a7b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_be674c5ea720499a80e15e6e96bc0c59", + "style": "IPY_MODEL_05c1f611c4b547db94014027cb4a60d7", + "value": " 285/285 [00:00<00:00, 42.5kB/s]" + } + }, + "c227c36317b64c62b032509c0ae080db": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_8b8873eac8a04d45898cf49f1b015ebc", + "IPY_MODEL_0efc15d7628048e6ab6daa8847033342", + "IPY_MODEL_528bbf053321445081c2abb0744ae955" + ], + "layout": "IPY_MODEL_e31992c47a094b7cad6090a6b955d195" + } + }, + "cb0c534940fd46c7a256a82016e80132": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "cc65ae3442154c74bb5b0087c70b9d21": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "cc99923fd96f499193ddfccde886fc7c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "cd6d777409a14f19a23211287f98c1fb": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_e67d06ecfeff4037b97555648018182f", + "IPY_MODEL_e428ee8add0944f2ad0e34249f3d11ae", + "IPY_MODEL_c04e993c235c42a2902ae3e382064a7b" + ], + "layout": "IPY_MODEL_a0504a67d470438a965e07d17af21637" + } + }, + "d0a1f7a536464a4fbac4310131ddeed0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "d4abb15ed74d4fc99e4ef3910b89931b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "d75cd4e722244b4fb3d9c3170dd3c153": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_3cc656c47b76492fa2a8c94da9eea89e", + "IPY_MODEL_6affdca6b9974b16bfecc161575f5b0b", + "IPY_MODEL_19db49e350b4431fa13da9ae7ec3750c" + ], + "layout": "IPY_MODEL_bce056e2ae4b46b8abde82e63d8521af" + } + }, + "dbcd0c980a8b4c03bbd58a5bbf26534a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "dc6e36c1b06b4a959a45a4243ec4d5e7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "df03bd924bc64f73ae3b5423ffbe53c6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_4558760de72c4af9a9bc8f155f94fee7", + "IPY_MODEL_6ac0e37a11a24933aff9ac6ba821a1a2", + "IPY_MODEL_fc03c65d44ff4460a888184ac6ee2bd1" + ], + "layout": "IPY_MODEL_47fc99e537c1491881cdd0f922918176" + } + }, + "e31992c47a094b7cad6090a6b955d195": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + 
"state": {} + }, + "e428ee8add0944f2ad0e34249f3d11ae": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_9309a7c3eb5140379ab6d83b417f3141", + "max": 285, + "style": "IPY_MODEL_6e5d75d09aeb4003bfc9930a628d84ae", + "value": 285 + } + }, + "e57aa292ae1442a294da4c8f6f968f68": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "FloatProgressModel", + "state": { + "bar_style": "success", + "layout": "IPY_MODEL_cc65ae3442154c74bb5b0087c70b9d21", + "max": 4178890389, + "style": "IPY_MODEL_b79c91b3765645b78b02c7e4941e2df8", + "value": 4178890389 + } + }, + "e67d06ecfeff4037b97555648018182f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_51359aacb849491a97ed9f747245c807", + "style": "IPY_MODEL_9a3e489b2bfe4c13979978fcf77d51d9", + "value": "tokenizer_config.json: 100%" + } + }, + "e6d57f1e219048079b975ace23f2d19d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_a874912fad204464bff2da6d2fae8124", + "style": "IPY_MODEL_43b41ff5845046aaa53bd3dc9f063357", + "value": " 4.72M/4.72M [00:01<00:00, 3.24MB/s]" + } + }, + "e75100f2bef244ba9bc4e58c7cc9712d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLStyleModel", + "state": { + "description_width": "", + "font_size": null, + "text_color": null + } + }, + "e9902416e1d84f98bcb78eb532825c35": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_518d7f4f58b440fc8fbe4a89e03a359b", + "style": "IPY_MODEL_e75100f2bef244ba9bc4e58c7cc9712d", + "value": "pytorch_model.bin: 100%" + } + }, + "edbaa56d923f4a34ad4cb1e3821696bb": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "ee3ae7bff9e847b3a15ba0838d39b633": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_736dec576cde4d39b8a88b5407f5f5a4", + "IPY_MODEL_252980b688df42b8905baf00a60ffe20", + "IPY_MODEL_339a9e94c9f6448ca27a027b5758eeca" + ], + "layout": "IPY_MODEL_d4abb15ed74d4fc99e4ef3910b89931b" + } + }, + "efa781389a0c45d588bbd524053e4347": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "DropdownModel", + "state": { + "_options_labels": [ + "CPU", + "AUTO" + ], + "description": "Device:", + "index": 0, + "layout": "IPY_MODEL_03d2e3830e9f4a5a9df9e035445fc13b", + "style": "IPY_MODEL_5ae0ea491dab46e79bae21c0f074b2fa" + } + }, + "f202f2f043dd487e909e8b4edf66d63d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "ProgressStyleModel", + "state": { + "description_width": "" + } + }, + "f25fc41aa4b74cd6a27ac4f877ad0d71": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "f382d51c6ee84d9fb7363d04bb881315": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_f586357fdf8b4f9a918d74283f9fc3f6", + "style": 
"IPY_MODEL_7df39ecc432e4144a5823141763af3c8", + "value": " 576/576 [00:19<00:00, 29.26it/s]" + } + }, + "f43b50f3a9c94c74a9e29593ee7eea95": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "f586357fdf8b4f9a918d74283f9fc3f6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "2.0.0", + "model_name": "LayoutModel", + "state": {} + }, + "f6ca16708efa429bbf0c4157ceff02f1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HBoxModel", + "state": { + "children": [ + "IPY_MODEL_b47b7a09401d45f380e184ed4a1ab6c1", + "IPY_MODEL_544b396b604844598a4fc6903443604a", + "IPY_MODEL_111eeed4303140d8882c19361bcf67f3" + ], + "layout": "IPY_MODEL_f43b50f3a9c94c74a9e29593ee7eea95" + } + }, + "fc03c65d44ff4460a888184ac6ee2bd1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "2.0.0", + "model_name": "HTMLModel", + "state": { + "layout": "IPY_MODEL_a77553b9f33a42a7b76fe0f98cd87d6d", + "style": "IPY_MODEL_c03a4ba5262f4a55aa0f26eb794bf23d", + "value": " 210/210 [00:00<00:00, 22.0kB/s]" + } + } + }, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/fireredtts2/gradio_helper.py b/notebooks/fireredtts2/gradio_helper.py new file mode 100644 index 00000000000..c0c68bb7c85 --- /dev/null +++ b/notebooks/fireredtts2/gradio_helper.py @@ -0,0 +1,363 @@ +import re +import gradio as gr +from tqdm import tqdm +from argparse import ArgumentParser +from typing import Literal, List, Tuple +from ov_firetts_helper import OVFireRedTTS2 + + +# ================================================ +# FireRedTTS2 Model +# ================================================ +# Global model instance +model: OVFireRedTTS2 = None + +examples = [ + ["English", + "examples\chat_prompt\en\S1.flac", + "[S1]I think we should just talk about what happened and move on because there's going to be other jousts and Sir Saif isn't done yet. It's not, he's not, it's not done yet.", + "examples\chat_prompt\en\S2.flac", + "[S2]You know, maybe sorry, maybe maybe I pushed, maybe I pushed too hard. I was really excited. I didn't mean to make you snap.", + "[S1]It's alright, we'll take a breath and plan the next pass together.[S2]Yeah, thanks. We'll get it right this time.[S1]Let's review our signals tonight so we're in sync on the field tomorrow." 
+ ] + ["中文", + "examples\chat_prompt\zh\S1.flac", + "[S1]啊,可能说更适合美国市场应该是什么样子。那这这个可能说当然如果说有有机会能亲身的去考察去了解一下,那当然是有更好的帮助。", + "examples\chat_prompt\zh\S2.flac", + "[S2]比如具体一点的,他觉得最大的一个跟他预想的不一样的是在什么地方。", + "[S1]那可能说对对,没有去过美国来说去去看到美国线下。巴斯曼也好,沃尔玛也好,他们线下不管说,因为深圳出去的还是电子周边的会表达,会发现哇对这个价格真的是很高呀。都是卖三十五美金、四十美金,甚至一个手机壳,就是二十五美金开。[S2]对,没错,我每次都觉得不不可思议。我什么人会买三五十美金的手机壳?但是其实在在那个target啊,就塔吉特这种超级市场,大家都是这样的,定价也很多人买。"], +] + +def initiate_model(ov_model): + global model + model = ov_model + + +# ================================================ +# Gradio +# ================================================ + +# i18n +_i18n_key2lang_dict = dict( + # Title markdown + title_md_desc=dict( + en="FireRedTTS-2 🔥 Dialogue Generation", + zh="FireRedTTS-2 🔥 对话生成", + ), + # Voice mode radio + voice_mode_label=dict( + en="Voice Mode", + zh="音色模式", + ), + voice_model_choice1=dict( + en="Voice Clone", + zh="音色克隆", + ), + voice_model_choice2=dict( + en="Random Voice", + zh="随机音色", + ), + # Speaker1 Prompt + spk1_prompt_audio_label=dict( + en="Speaker 1 Prompt Audio", + zh="说话人 1 参考语音", + ), + spk1_prompt_text_label=dict( + en="Speaker 1 Prompt Text", + zh="说话人 1 参考文本", + ), + spk1_prompt_text_placeholder=dict( + en="[S1] text of speaker 1 prompt audio.", + zh="[S1] 说话人 1 参考文本", + ), + # Speaker2 Prompt + spk2_prompt_audio_label=dict( + en="Speaker 2 Prompt Audio", + zh="说话人 2 参考语音", + ), + spk2_prompt_text_label=dict( + en="Speaker 2 Prompt Text", + zh="说话人 2 参考文本", + ), + spk2_prompt_text_placeholder=dict( + en="[S2] text of speaker 2 prompt audio.", + zh="[S2] 说话人 2 参考文本", + ), + # Dialogue input textbox + dialogue_text_input_label=dict( + en="Dialogue Text Input", + zh="对话文本输入", + ), + dialogue_text_input_placeholder=dict( + en="[S1]text[S2]text[S1]text...", + zh="[S1]文本[S2]文本[S1]文本...", + ), + # Generate button + generate_btn_label=dict( + en="Generate Audio", + zh="合成", + ), + # Generated audio + generated_audio_label=dict( + en="Generated Dialogue Audio", + zh="合成的对话音频", + ), + # Warining1: invalid text for prompt + warn_invalid_spk1_prompt_text=dict( + en='Invalid speaker 1 prompt text, should strictly follow: "[S1]xxx"', + zh='说话人 1 参考文本不合规,格式:"[S1]xxx"', + ), + warn_invalid_spk2_prompt_text=dict( + en='Invalid speaker 2 prompt text, should strictly follow: "[S2]xxx"', + zh='说话人 2 参考文本不合规,格式:"[S2]xxx"', + ), + # Warining2: invalid text for dialogue input + warn_invalid_dialogue_text=dict( + en='Invalid dialogue input text, should strictly follow: "[S1]xxx[S2]xxx..."', + zh='对话文本输入不合规,格式:"[S1]xxx[S2]xxx..."', + ), + # Warining3: incomplete prompt info + warn_incomplete_prompt=dict( + en="Please provide prompt audio and text for both speaker 1 and speaker 2", + zh="请提供说话人 1 与说话人 2 的参考语音与参考文本", + ), +) + +global_lang: Literal["zh", "en"] = "en" + + +def i18n(key): + global global_lang + return _i18n_key2lang_dict[key][global_lang] + + +def check_monologue_text(text: str, prefix: str = None) -> bool: + text = text.strip() + # Check speaker tags + if prefix is not None and (not text.startswith(prefix)): + return False + # Remove prefix + if prefix is not None: + text = text.removeprefix(prefix) + text = text.strip() + # If empty? 
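+    # After the speaker tag has been stripped above, prompts with no remaining
+    # text are rejected.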
+ if len(text) == 0: + return False + return True + + +def check_dialogue_text(text_list: List[str]) -> bool: + if len(text_list) == 0: + return False + for text in text_list: + if not ( + check_monologue_text(text, "[S1]") + or check_monologue_text(text, "[S2]") + or check_monologue_text(text, "[S3]") + or check_monologue_text(text, "[S4]") + ): + return False + return True + +def dialogue_synthesis_function( + target_text: str, + voice_mode: Literal[0, 1] = 0, # 0 means voice clone + spk1_prompt_text: str | None = "", + spk1_prompt_audio: str | None = None, + spk2_prompt_text: str | None = "", + spk2_prompt_audio: str | None = None, +): + # Voice clone mode, check prompt info + if voice_mode == 0: + prompt_has_value = [ + spk1_prompt_text != "", + spk1_prompt_audio is not None, + spk2_prompt_text != "", + spk2_prompt_audio is not None, + ] + if not all(prompt_has_value): + gr.Warning(message=i18n("warn_incomplete_prompt")) + return None + if not check_monologue_text(spk1_prompt_text, "[S1]"): + gr.Warning(message=i18n("warn_invalid_spk1_prompt_text")) + return None + if not check_monologue_text(spk2_prompt_text, "[S2]"): + gr.Warning(message=i18n("warn_invalid_spk2_prompt_text")) + return None + # Check dialogue text + target_text_list: List[str] = re.findall(r"(\[S[0-9]\][^\[\]]*)", target_text) + target_text_list = [text.strip() for text in target_text_list] + if not check_dialogue_text(target_text_list): + gr.Warning(message=i18n("warn_invalid_dialogue_text")) + return None + + # Go synthesis + progress_bar = gr.Progress(track_tqdm=True) + prompt_wav_list = ( + None if voice_mode != 0 else [spk1_prompt_audio, spk2_prompt_audio] + ) + prompt_text_list = None if voice_mode != 0 else [spk1_prompt_text, spk2_prompt_text] + target_audio = model.generate_dialogue( + text_list=target_text_list, + prompt_wav_list=prompt_wav_list, + prompt_text_list=prompt_text_list, + temperature=0.9, + topk=30, + ) + return (24000, target_audio.squeeze(0).numpy()) + + +# UI rendering +def render_interface() -> gr.Blocks: + with gr.Blocks(title="FireRedTTS-2", theme=gr.themes.Default()) as page: + # ======================== UI ======================== + # A large title + title_desc = gr.Markdown(value="# {}".format(i18n("title_md_desc"))) + with gr.Row(): + lang_choice = gr.Radio( + choices=["中文", "English"], + value="中文", + label="Display Language/显示语言", + type="index", + interactive=True, + ) + voice_mode_choice = gr.Radio( + choices=[i18n("voice_model_choice1"), i18n("voice_model_choice2")], + value=i18n("voice_model_choice1"), + label=i18n("voice_mode_label"), + type="index", + interactive=True, + ) + with gr.Row(): + # ==== Speaker1 Prompt ==== + with gr.Column(scale=1): + with gr.Group(visible=True) as spk1_prompt_group: + spk1_prompt_audio = gr.Audio( + label=i18n("spk1_prompt_audio_label"), + type="filepath", + editable=False, + interactive=True, + ) # Audio component returns tmp audio path + spk1_prompt_text = gr.Textbox( + label=i18n("spk1_prompt_text_label"), + placeholder=i18n("spk1_prompt_text_placeholder"), + lines=3, + ) + # ==== Speaker2 Prompt ==== + with gr.Column(scale=1): + with gr.Group(visible=True) as spk2_prompt_group: + spk2_prompt_audio = gr.Audio( + label=i18n("spk2_prompt_audio_label"), + type="filepath", + editable=False, + interactive=True, + ) + spk2_prompt_text = gr.Textbox( + label=i18n("spk2_prompt_text_label"), + placeholder=i18n("spk2_prompt_text_placeholder"), + lines=3, + ) + # ==== Text input ==== + with gr.Column(scale=2): + dialogue_text_input = gr.Textbox( + 
label=i18n("dialogue_text_input_label"), + placeholder=i18n("dialogue_text_input_placeholder"), + lines=18, + ) + # Generate button + generate_btn = gr.Button( + value=i18n("generate_btn_label"), variant="primary", size="lg" + ) + # Long output audio + generate_audio = gr.Audio( + label=i18n("generated_audio_label"), + interactive=False, + ) + gr.Examples( + examples=examples, + inputs=[lang_choice, spk1_prompt_audio, spk1_prompt_text, spk2_prompt_audio, spk2_prompt_text, dialogue_text_input], + ) + + # ======================== Action ======================== + # Language action + def _change_component_language(lang): + global global_lang + global_lang = ["zh", "en"][lang] + return [ + # title_desc + gr.update(value="# {}".format(i18n("title_md_desc"))), + # voice_mode_choice + gr.update( + choices=[i18n("voice_model_choice1"), i18n("voice_model_choice2")], + value=i18n("voice_model_choice1"), + label=i18n("voice_mode_label"), + ), + # spk1_prompt_{audio,text} + gr.update(label=i18n("spk1_prompt_audio_label")), + gr.update( + label=i18n("spk1_prompt_text_label"), + placeholder=i18n("spk1_prompt_text_placeholder"), + ), + # spk2_prompt_{audio,text} + gr.update(label=i18n("spk2_prompt_audio_label")), + gr.update( + label=i18n("spk2_prompt_text_label"), + placeholder=i18n("spk2_prompt_text_placeholder"), + ), + # dialogue_text_input + gr.update( + label=i18n("dialogue_text_input_label"), + placeholder=i18n("dialogue_text_input_placeholder"), + ), + # generate_btn + gr.update(value=i18n("generate_btn_label")), + # generate_audio + gr.update(label=i18n("generated_audio_label")), + ] + + lang_choice.change( + fn=_change_component_language, + inputs=[lang_choice], + outputs=[ + title_desc, + voice_mode_choice, + spk1_prompt_audio, + spk1_prompt_text, + spk2_prompt_audio, + spk2_prompt_text, + dialogue_text_input, + generate_btn, + generate_audio, + ], + ) + + # Voice clone mode action + def _change_prompt_input_visibility(voice_mode): + enable = voice_mode == 0 + return [gr.update(visible=enable), gr.update(visible=enable)] + + voice_mode_choice.change( + fn=_change_prompt_input_visibility, + inputs=[voice_mode_choice], + outputs=[spk1_prompt_group, spk2_prompt_group], + ) + generate_btn.click( + fn=dialogue_synthesis_function, + inputs=[ + dialogue_text_input, + voice_mode_choice, + spk1_prompt_text, + spk1_prompt_audio, + spk2_prompt_text, + spk2_prompt_audio, + ], + outputs=[generate_audio], + ) + return page + + +def make_demo(model): + initiate_model(model) + # UI + page = render_interface() + return page diff --git a/notebooks/fireredtts2/ov_fireredtts_helper.py b/notebooks/fireredtts2/ov_fireredtts_helper.py new file mode 100644 index 00000000000..158adc64488 --- /dev/null +++ b/notebooks/fireredtts2/ov_fireredtts_helper.py @@ -0,0 +1,1474 @@ +import openvino as ov +import nncf +from pathlib import Path +import torch +import types +from typing import List, Optional, Tuple, Union, Callable +import openvino.opset13 as opset13 +from openvino.frontend.pytorch.patch_model import __make_16bit_traceable +import numpy as np +import gc +from fireredtts2.fireredtts2 import FireRedTTS2 +from transformers.cache_utils import Cache, DynamicCache +from transformers.utils import is_torch_xpu_available +from transformers.masking_utils import ALL_MASK_ATTENTION_FUNCTIONS, eager_mask, sdpa_mask +import shutil +import os +import json +from transformers import AutoTokenizer +import torchaudio +from dataclasses import dataclass +from torch.nn.utils.rnn import pad_sequence +import re +import string +from tqdm 
import tqdm +import math +import torch.nn.functional as F + +def patch_cos_sin_cached_fp32(model): + if ( + hasattr(model, "layers") + and hasattr(model.layers[0], "self_attn") + and hasattr(model.layers[0].self_attn, "rotary_emb") + and hasattr(model.layers[0].self_attn.rotary_emb, "dtype") + and hasattr(model.layers[0].self_attn.rotary_emb, "inv_freq") + and hasattr(model.layers[0].self_attn.rotary_emb, "max_position_embeddings") + and hasattr(model.layers[0].self_attn.rotary_emb, "_set_cos_sin_cache") + ): + for layer in model.layers: + if layer.self_attn.rotary_emb.dtype != torch.float32: + layer.self_attn.rotary_emb._set_cos_sin_cache( + seq_len=layer.self_attn.rotary_emb.max_position_embeddings, + device=layer.self_attn.rotary_emb.inv_freq.device, + dtype=torch.float32, + ) + +SYMBOLS_MAPPING = { + "\n": "", + "\t": "", + "…": ",", + "“": "", + "”": "", + "‘": "'", + "’": "'", + "【": "", + "】": "", + "[": "", + "]": "", + "(": "", + ")": "", + "(": "", + ")": "", + "・": "", + "·": "", + "「": "'", + "」": "'", + "《": "'", + "》": "'", + "—": "", + "~": ",", + "~": ",", + ":": ",", + ";": ",", + ";": ",", + ":": ",", + '"': "", + "!": ",", + # "!": ".", + "————": "", + "——": "", + "—": "", + "……": ",", + "*": "", +} + +REPLACE_SYMBOL_REGEX = re.compile( + "|".join(re.escape(p) for p in SYMBOLS_MAPPING.keys()) +) + + +EMOJI_REGEX = re.compile( + "[" + "\U0001f600-\U0001f64f" # emoticons + "\U0001f300-\U0001f5ff" # symbols & pictographs + "\U0001f680-\U0001f6ff" # transport & map symbols + "\U0001f1e0-\U0001f1ff" # flags (iOS) + "]+", + flags=re.UNICODE, +) + + +def clean_text(text): + # Clean the text + text = text.strip() + text = text.replace("\xa0", "") + + # Replace all chinese symbols with their english counterparts + text = REPLACE_SYMBOL_REGEX.sub(lambda x: SYMBOLS_MAPPING[x.group()], text) + + # Remove emojis + text = EMOJI_REGEX.sub(r"", text) + + # Remove continuous periods (...) and commas (,,,) + text = re.sub(r"[.,]{2,}", lambda m: m.group()[0], text) + + return text + + +def utf_8_len(text): + return len(text.encode("utf-8")) + + +def break_text(texts, length, splits: set): + for text in texts: + if utf_8_len(text) <= length: + yield text + continue + + curr = "" + for char in text: + curr += char + + if char in splits: + yield curr + curr = "" + + if curr: + yield curr + + +def break_text_by_length(texts, length): + for text in texts: + if utf_8_len(text) <= length: + yield text + continue + + curr = "" + for char in text: + curr += char + + if utf_8_len(curr) >= length: + yield curr + curr = "" + + if curr: + yield curr + + +def add_cleaned(curr, segments): + curr = curr.strip() + if curr and not all(c.isspace() or c in string.punctuation for c in curr): + segments.append(curr) + + +def protect_float(text): + # Turns 3.14 into <3_f_14> to prevent splitting + return re.sub(r"(\d+)\.(\d+)", r"<\1_f_\2>", text) + + +def unprotect_float(text): + # Turns <3_f_14> into 3.14 + return re.sub(r"<(\d+)_f_(\d+)>", r"\1.\2", text) + + +def split_text(text, length): + text = clean_text(text) + + # Break the text into pieces with following rules: + # 1. Split the text at ".", "!", "?" if text is NOT a float + # 2. If the text is longer than length, split at "," + # 3. If the text is still longer than length, split at " " + # 4. 
If the text is still longer than length, split at any character to length + + texts = [text] + texts = map(protect_float, texts) + texts = break_text(texts, length, {".", "!", "?", "。", "!", "?"}) + texts = map(unprotect_float, texts) + texts = break_text(texts, length, {",", ","}) + texts = break_text(texts, length, {" "}) + texts = list(break_text_by_length(texts, length)) + + # Then, merge the texts into segments with length <= length + segments = [] + curr = "" + + for text in texts: + if utf_8_len(curr) + utf_8_len(text) <= length: + curr += text + else: + add_cleaned(curr, segments) + curr = text + + if curr: + add_cleaned(curr, segments) + + return segments + + +def contains_chinese(text): + """检测文本是否包含中文字符""" + return bool(re.search(r"[\u4e00-\u9fff]", text)) + + +def count_words_english(text): + """统计英文单词数量""" + return len(text.split()) + + +def count_characters_chinese(text): + """统计中文字符数量""" + return len(text) + + +def split_by_punctuation_english(text): + """按英文标点符号分割""" + sentences = re.split(r"([.!?])", text) + result = [] + for i in range(0, len(sentences) - 1, 2): + sentence = sentences[i].strip() + if sentence: + if i + 1 < len(sentences): + sentence += sentences[i + 1] + result.append(sentence) + + if len(sentences) % 2 == 1 and sentences[-1].strip(): + result.append(sentences[-1].strip()) + + return result + + +def split_by_punctuation_chinese(text): + """按中文标点符号分割""" + sentences = re.split(r"([。!?])", text) + result = [] + for i in range(0, len(sentences) - 1, 2): + sentence = sentences[i].strip() + if sentence: + if i + 1 < len(sentences): + sentence += sentences[i + 1] + result.append(sentence) + + if len(sentences) % 2 == 1 and sentences[-1].strip(): + result.append(sentences[-1].strip()) + + return result + + +def merge_sentences_english(sentences, max_words=80): + """合并英文句子""" + result = [] + current_chunk = "" + + for sentence in sentences: + if not current_chunk: + current_chunk = sentence + else: + test_chunk = current_chunk + " " + sentence + if count_words_english(test_chunk) <= max_words: + current_chunk = test_chunk + else: + result.append(current_chunk) + current_chunk = sentence + + if current_chunk: + result.append(current_chunk) + + return result + + +def merge_sentences_chinese(sentences, max_chars=100): + """合并中文句子""" + result = [] + current_chunk = "" + + for sentence in sentences: + if not current_chunk: + current_chunk = sentence + else: + test_chunk = current_chunk + sentence + if count_characters_chinese(test_chunk) <= max_chars: + current_chunk = test_chunk + else: + result.append(current_chunk) + current_chunk = sentence + + if current_chunk: + result.append(current_chunk) + + return result + + +def process_text(text): + chinese_max_limit = 150 + english_max_limit = 80 + # 移除开头的标记如[S2] + text = re.sub(r"^\[S\d+\]", "", text).strip() + is_chinese = contains_chinese(text) + if is_chinese: + if count_characters_chinese(text) <= chinese_max_limit: + return [text] + sentences = split_by_punctuation_chinese(text) + result = merge_sentences_chinese(sentences, chinese_max_limit) + else: + if count_words_english(text) <= english_max_limit: + return [text] + sentences = split_by_punctuation_english(text) + result = merge_sentences_english(sentences, english_max_limit) + + return result + + +def process_text_list(text_list): + new_text_list = [] + for text in text_list: + speaker = text[:4] + # print("---speaker:", speaker) + assert speaker in ["[S1]", "[S2]", "[S3]", "[S4]"] + result = process_text(text=text) + # print("---result:\n", result, len(result)) 
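+        # `result` holds the length-bounded chunks produced by process_text():
+        # the leading speaker tag is stripped, the text is split on sentence
+        # punctuation (./!/? or 。/!/?), and sentences are re-merged into chunks
+        # of roughly at most 80 English words or 150 Chinese characters (a single
+        # over-long sentence is kept intact). The loop below re-attaches the
+        # speaker tag to every chunk.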
+ for chunk in result: + new_text_list.append(speaker + chunk) + return new_text_list + +def _pad_and_chunk(audio: torch.Tensor, chunk_size: int) -> List[torch.Tensor]: + pad_len = math.ceil(audio.shape[1] / chunk_size) * chunk_size - audio.shape[1] + audio = F.pad(audio, (0, pad_len), mode="constant", value=0) + audio_chunks = audio.split(chunk_size, dim=1) + return audio_chunks + +def _multinomial_sample_one_no_sync(probs): + q = torch.empty_like(probs).exponential_(1) + return torch.argmax(probs / q, dim=-1, keepdim=True).to(dtype=torch.int) + +def sample_topk(logits: torch.Tensor, topk: int, temperature: float): + logits = logits / temperature + + filter_value: float = -float("Inf") + indices_to_remove = logits < torch.topk(logits, topk)[0][..., -1, None] + scores_processed = logits.masked_fill(indices_to_remove, filter_value) + scores_processed = torch.nn.functional.log_softmax(scores_processed, dim=-1) + probs = torch.nn.functional.softmax(scores_processed, dim=-1) + + sample_token = _multinomial_sample_one_no_sync(probs) + return sample_token + +def causal_mask_function(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int) -> bool: + """ + This creates a basic lower-diagonal causal mask. + """ + return kv_idx <= q_idx + +def prepare_padding_mask( + attention_mask: Optional[torch.Tensor], kv_length: int, kv_offset: int, _slice: bool = True +) -> Optional[torch.Tensor]: + """ + From the 2D attention mask, prepare the correct padding mask to use by potentially padding it, and slicing + according to the `kv_offset` if `_slice` is `True`. + """ + local_padding_mask = attention_mask + if attention_mask is not None: + # Pad it if necessary + if (padding_length := kv_length + kv_offset - attention_mask.shape[-1]) > 0: + local_padding_mask = torch.nn.functional.pad(attention_mask, (0, padding_length)) + # For flex, we should not slice them, only use an offset + if _slice: + # Equivalent to: `local_padding_mask = attention_mask[:, kv_offset : kv_offset + kv_length]`, + # but without data-dependent slicing (i.e. torch.compile friendly) + mask_indices = torch.arange(kv_length, device=local_padding_mask.device) + mask_indices += kv_offset + local_padding_mask = local_padding_mask[:, mask_indices] + return local_padding_mask + +def and_masks(*mask_functions: list[Callable]) -> Callable: + """Returns a mask function that is the intersection of provided mask functions""" + if not all(callable(arg) for arg in mask_functions): + raise RuntimeError(f"All inputs should be callable mask_functions: {mask_functions}") + + def and_mask(batch_idx, head_idx, q_idx, kv_idx): + result = q_idx.new_ones((), dtype=torch.bool) + for mask in mask_functions: + result = result & mask(batch_idx, head_idx, q_idx, kv_idx).to(result.device) + return result + + return and_mask + +def padding_mask_function(padding_mask: torch.Tensor) -> Callable: + """ + This return the mask_function function corresponding to a 2D padding mask. + """ + + def inner_mask(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int) -> bool: + # Note that here the mask should ALWAYS be at least of the max `kv_index` size in the dimension 1. 
This is because + # we cannot pad it here in the mask_function as we don't know the final size, and we cannot try/except, as it is not + # vectorizable on accelerator devices + return padding_mask[batch_idx, kv_idx] + + return inner_mask + +def _ignore_causal_mask_sdpa( + padding_mask: Optional[torch.Tensor], + query_length: int, + kv_length: int, + kv_offset: int, + local_attention_size: Optional[int] = None, +) -> bool: + """ + Detects whether the causal mask can be ignored in case PyTorch's SDPA is used, rather relying on SDPA's `is_causal` argument. + + In case no token is masked in the 2D `padding_mask` argument, if `query_length == 1` or + `key_value_length == query_length`, we rather rely on SDPA `is_causal` argument to use causal/non-causal masks, + allowing to dispatch to the flash attention kernel (that can otherwise not be used if a custom `attn_mask` is + passed). + """ + is_tracing = torch.jit.is_tracing() or isinstance(padding_mask, torch.fx.Proxy) or is_torchdynamo_compiling() + if padding_mask is not None and padding_mask.shape[-1] > kv_length: + mask_indices = torch.arange(kv_length, device=padding_mask.device) + mask_indices += kv_offset + padding_mask = padding_mask[:, mask_indices] + + # When using `torch.export` or `torch.onnx.dynamo_export`, we must pass an example input, and `is_causal` behavior is + # hard-coded to the forward. If a user exports a model with query_length > 1, the exported model will hard-code `is_causal=True` + # which is in general wrong (see https://github.com/pytorch/pytorch/issues/108108). Thus, we only set + # `ignore_causal_mask = True` if we are not tracing + if ( + not is_tracing + # only cases when lower and upper diags are the same, see https://github.com/pytorch/pytorch/issues/108108 + and (query_length == 1 or (kv_length == query_length or is_torch_xpu_available)) + # in this case we need to add special patterns to the mask so cannot be skipped otherwise + and (local_attention_size is None or kv_length < local_attention_size) + # In this case, we need to add padding to the mask, so cannot be skipped otherwise + and ( + padding_mask is None + or ( + padding_mask.all() + if not is_torch_xpu_available or query_length == 1 + else padding_mask[:, :query_length].all() + ) + ) + ): + return True + + return False + +def sdpa_mask_without_vmap( + batch_size: int, + cache_position: torch.Tensor, + kv_length: int, + kv_offset: int = 0, + mask_function: Optional[Callable] = None, + attention_mask: Optional[torch.Tensor] = None, + local_size: Optional[int] = None, + allow_is_causal_skip: bool = True, + **kwargs, +) -> Optional[torch.Tensor]: + if mask_function is None: + mask_function = causal_mask_function + + q_length = cache_position.shape[0] + # Potentially pad the 2D mask, and slice it correctly + padding_mask = prepare_padding_mask(attention_mask, kv_length, kv_offset, _slice=False) + + # Under specific conditions, we can avoid materializing the mask, instead relying on the `is_causal` argument + if allow_is_causal_skip and _ignore_causal_mask_sdpa(padding_mask, q_length, kv_length, kv_offset, local_size): + return None + + # Potentially add the padding 2D mask + if padding_mask is not None: + mask_function = and_masks(mask_function, padding_mask_function(padding_mask)) + + # Create broadcatable indices + device = cache_position.device + q_indices = cache_position[None, None, :, None] + head_indices = torch.arange(1, dtype=torch.long, device=device)[None, :, None, None] + batch_indices = torch.arange(batch_size, dtype=torch.long, 
device=device)[:, None, None, None] + kv_indices = torch.arange(kv_length, dtype=torch.long, device=device)[None, None, None, :] + kv_offset + + # Apply mask function element-wise through broadcasting + causal_mask = mask_function(batch_indices, head_indices, q_indices, kv_indices) + # Expand the mask to match batch size and query length if they weren't used in the mask function + causal_mask = causal_mask.expand(batch_size, -1, q_length, kv_length) + + return causal_mask + +# Adapted from https://github.com/huggingface/transformers/blob/v4.53.0/src/transformers/masking_utils.py#L433 +# Specifically for OpenVINO, we use torch.finfo(torch.float16).min instead of torch.finfo(dtype).min +def eager_mask_without_vmap(*args, **kwargs) -> Optional[torch.Tensor]: + kwargs.pop("allow_is_causal_skip", None) + dtype = kwargs.get("dtype", torch.float32) + mask = sdpa_mask_without_vmap(*args, allow_is_causal_skip=False, **kwargs) + # we use torch.finfo(torch.float16).min instead torch.finfo(dtype).min to avoid an overflow but not + # sure this is the right way to handle this, we are basically pretending that -65,504 is -inf + mask = torch.where( + mask, + torch.tensor(0.0, device=mask.device, dtype=dtype), + torch.tensor(torch.finfo(torch.float16).min, device=mask.device, dtype=dtype), + ) + return mask + + +# for OpenVINO, we use torch.finfo(torch.float16).min instead of torch.finfo(dtype).min +# Although I'm not sure this is the right way to handle this, we are basically pretending that -65,504 is -inf +ALL_MASK_ATTENTION_FUNCTIONS.register("eager", eager_mask_without_vmap) + +# for decoder models, we use eager mask without vmap for sdpa as well +# to avoid a nan output issue in OpenVINO that only happens in case of: +# non-stateful models on cpu and stateful models on npu +ALL_MASK_ATTENTION_FUNCTIONS.register("sdpa", eager_mask_without_vmap) + + +def model_has_state(ov_model: ov.Model): + return len(ov_model.get_sinks()) > 0 + + +def model_has_input_output_name(ov_model: ov.Model, name: str): + """ + Helper function for checking that model has specified input or output name + + Parameters: + ov_model (ov.Model): + name (str): + name of input or output + + Returns: + True if input or output with requested name exists else False + """ + return name in sum([list(t.get_names()) for t in ov_model.inputs + ov_model.outputs], []) + + +def fuse_cache_reorder( + ov_model: ov.Model, + not_kv_inputs: list[str], + key_value_input_names: list[str], + gather_dim: int, +): + """ + Fuses reored_cache during generate cycle into ov.Model. Used with stateful models, because we can not modify model state directly. + + Adds a new beam_idx parameter and Gather op per each kv-cache input in a given model. + Should be run before make_stateful. Implements optimumum's _reorder_cache + inside the model in the beginning of each iteration. + Gather works along given gather_dim dimension that may vary from model to model. + KV-cache inputs are identified based on names in key_value_input_names. + Append the new beam_idx parameter to not_kv_inputs. 
+ + Parameters: + ov_model (`ov.Model`): + openvino model for processing + not_kv_inputs (`list[str]`): + list of input nodes in model that not related to past key values + key_value_input_names (`list[str]`): + list of names for key value input layers + gather_dim (int): + dimension for gathering cache during reorder pass + """ + + if model_has_input_output_name(ov_model, "beam_idx"): + raise ValueError("Model already has fused cache") + input_batch = ov_model.input("inputs_embeds").get_partial_shape()[0] + beam_idx = opset13.parameter(name="beam_idx", dtype=ov.Type.i32, shape=ov.PartialShape([input_batch])) + beam_idx.output(0).get_tensor().add_names({"beam_idx"}) # why list is not accepted? + ov_model.add_parameters([beam_idx]) + not_kv_inputs.append(ov_model.inputs[-1]) + # Go over all cache parameters and fuse _reorder_cache with indices provided by the new parameter beam_idx + for input_name in key_value_input_names: + parameter_output_port = ov_model.input(input_name) + consumers = parameter_output_port.get_target_inputs() + gather = opset13.gather(parameter_output_port, beam_idx, opset13.constant(gather_dim)) + for consumer in consumers: + consumer.replace_source_output(gather.output(0)) + ov_model.validate_nodes_and_infer_types() + + +def build_state_initializer(ov_model: ov.Model, batch_dim: int): + """ + Build initialization ShapeOf Expression for all ReadValue ops + + Parameters: + ov_model (ov.Model): + openvino model + batch_dim (int): + index of dimension corresponding to batch size + """ + input_ids = ov_model.input("inputs_embeds") + batch = opset13.gather( + opset13.shape_of(input_ids, output_type="i64"), + opset13.constant([0]), + opset13.constant(0), + ) + for op in ov_model.get_ops(): + if op.get_type_name() == "ReadValue": + dims = [dim.min_length for dim in list(op.get_output_partial_shape(0))] + dims[batch_dim] = batch + dims = [(opset13.constant(np.array([dim], dtype=np.int64)) if isinstance(dim, int) else dim) for dim in dims] + shape = opset13.concat(dims, axis=0) + broadcast = opset13.broadcast(opset13.constant(0.0, dtype=op.get_output_element_type(0)), shape) + op.set_arguments([broadcast]) + ov_model.validate_nodes_and_infer_types() + + +def make_stateful( + ov_model: ov.Model, + not_kv_inputs: list[str], + key_value_input_names: list[str], + key_value_output_names: list[str], + batch_dim: int, + num_attention_heads: int, + num_beams_and_batch: int = None, +): + """ + Hides kv-cache inputs and outputs inside the model as variables. 
+ + Parameters: + ov_model (ov.Model): + openvino model + not_kv_inputs (`list[str]`): + list of input nodes in model that not related to past key values + key_value_input_names (`list[str]`): + list of names for key value input layers + key_value_output_names (`list[str]`): + list of names for key value input layers + batch_dim (int): + index of batch dimension in key value layers + num_attention_heads (int): + number of attention heads for batch dimension initialization + num_beams_an_batch (int): + precalculated number of beams and batch for shapes initialization + """ + from openvino._offline_transformations import apply_make_stateful_transformation + + + input_output_map = {} + + if num_beams_and_batch is not None: + # Set batch size for input_ids and attention mask to avoid dynamic dimension got propagated from the end of the model back to ReadValue + for input in not_kv_inputs: + shape = input.get_partial_shape() + if shape.rank.get_length() <= 2: # == 1 for beam_index + shape[0] = num_beams_and_batch + input.get_node().set_partial_shape(shape) + for kv_name_pair in zip(key_value_input_names, key_value_output_names): + input_output_map[kv_name_pair[0]] = kv_name_pair[1] + if num_beams_and_batch is not None: + input = ov_model.input(kv_name_pair[0]) + shape = input.get_partial_shape() + shape[batch_dim] = num_beams_and_batch * num_attention_heads + input.get_node().set_partial_shape(shape) + + if num_beams_and_batch is not None: + # Re-validation model if shapes are altered above + ov_model.validate_nodes_and_infer_types() + + apply_make_stateful_transformation(ov_model, input_output_map) + if num_beams_and_batch is None: + build_state_initializer(ov_model, batch_dim) + + +def patch_stateful(ov_model, dim=1): + key_value_input_names = [key.get_any_name() for key in ov_model.inputs[2:-1]] + key_value_output_names = [key.get_any_name() for key in ov_model.outputs[dim:]] + not_kv_inputs = [input for input in ov_model.inputs if not any(name in key_value_input_names for name in input.get_names())] + if not key_value_input_names or not key_value_output_names: + return + batch_dim = 0 + num_attention_heads = 1 + + fuse_cache_reorder(ov_model, not_kv_inputs, key_value_input_names, batch_dim) + make_stateful( + ov_model, + not_kv_inputs, + key_value_input_names, + key_value_output_names, + batch_dim, + num_attention_heads, + None, + ) + + +core = ov.Core() + + +def cleanup_torchscript_cache(): + """ + Helper for removing cached model representation + """ + torch._C._jit_clear_class_registry() + torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore() + torch.jit._state._clear_class_state() + +TEXT_EMBEDDINGS_PATH = "openvino_text_embeddings_model.xml" +AUDIO_EMBEDDINGS_PATH = "openvino_audio_embeddings_model.xml" +AUDIO_DECODER_PATH = "openvino_audio_decoder_model.xml" +AUDIO_UPSAMPLER_PATH = "openvino_audio_upsampler_model.xml" +AUDIO_ENCODER_PATH = "openvino_audio_encoder_model.xml" +DECODER_MODEL_PATH = "openvino_decoder_model.xml" +BACKBONE_MODEL_PATH = "openvino_backbone_model.xml" + +def convert_fireredtts2(model_id, model_path=None, quantization_config=None): + + if model_path is None: + model_path = Path(model_id.split("/")[-1]) + else: + model_path = Path(model_path) + + if all((model_path / model_name).exists() for model_name in [TEXT_EMBEDDINGS_PATH, AUDIO_DECODER_PATH, AUDIO_ENCODER_PATH, AUDIO_EMBEDDINGS_PATH, DECODER_MODEL_PATH, BACKBONE_MODEL_PATH, AUDIO_UPSAMPLER_PATH]): + print(f"✅ {model_id} model already converted. 
You can find results in {model_path}") + return model_path + print(f"⌛ {model_id} conversion started. Be patient, it may takes some time.") + print("⌛ Load Original model") + pt_model = FireRedTTS2( + pretrained_dir=model_id, + gen_type="dialogue", + device="cpu", + ) + + print("✅ Original model successfully loaded") + print("⌛ Export tokenizer and config") + + pt_model._text_tokenizer.save_pretrained(model_path) + for json_file in Path(model_id).glob("*.json"): + shutil.copy(json_file, model_path / json_file.name) + + + + if not (model_path / TEXT_EMBEDDINGS_PATH).exists(): + print("⌛ Convert TEXT_EMBEDDINGS model") + + ov_model = ov.convert_model(pt_model._model.text_embeddings, example_input=torch.ones([1, 1], dtype=torch.int32)) + ov.save_model(ov_model, model_path / TEXT_EMBEDDINGS_PATH) + del ov_model + cleanup_torchscript_cache() + gc.collect() + print("✅ TEXT_EMBEDDINGS model successfully converted") + + if not (model_path / AUDIO_EMBEDDINGS_PATH).exists(): + print("⌛ Convert AUDIO_EMBEDDINGS model") + + ov_model = ov.convert_model(pt_model._model.audio_embeddings, example_input=torch.ones([10], dtype=torch.int32)) + ov.save_model(ov_model, model_path / AUDIO_EMBEDDINGS_PATH) + del ov_model + cleanup_torchscript_cache() + gc.collect() + print("✅ AUDIO_EMBEDDINGS model successfully converted") + + if not (model_path / AUDIO_UPSAMPLER_PATH).exists(): + print("⌛ Convert AUDIO_UPSAMPLER model") + def forward_wrap_audio_upsampler(self, tokens: torch.Tensor): + tokens = tokens.permute(1, 0, 2) # (B, nq, L) -> (nq, B, L) + vq_out_feats = self.rvq.decode_codes(tokens) + vq_out_feats = vq_out_feats.transpose(1, 2) + print(f"vq_out_feats shape: {vq_out_feats.shape[1]}") + vq_out_length = torch.tensor( + [vq_out_feats.size(1)], dtype=torch.long, device=vq_out_feats.device + ) + vq_out_feats, vq_out_length = self.upsample(vq_out_feats, vq_out_length) + return vq_out_feats, vq_out_length + + pt_model._audio_tokenizer._orig_forward = pt_model._audio_tokenizer.forward + pt_model._audio_tokenizer.forward = types.MethodType(forward_wrap_audio_upsampler, pt_model._audio_tokenizer) + + ov_model = ov.convert_model(pt_model._audio_tokenizer, example_input=torch.ones([1, 16, 1], dtype=torch.int32)) + ov.save_model(ov_model, model_path / AUDIO_UPSAMPLER_PATH) + del ov_model + pt_model._audio_tokenizer.forward = pt_model._audio_tokenizer._orig_forward + del pt_model._audio_tokenizer._orig_forward + cleanup_torchscript_cache() + gc.collect() + print("✅ AUDIO_UPSAMPLER model successfully converted") + + + if not (model_path / AUDIO_DECODER_PATH).exists(): + print("⌛ Convert AUDIO_DECODER model") + example_input = { + "x": torch.ones([1, 584, 768], dtype=torch.float32), + "x_lens": torch.tensor([584], dtype=torch.int64), + } + + ov_model = ov.convert_model(pt_model._audio_tokenizer.acoustic_decoder, example_input=example_input) + ov.save_model(ov_model, model_path / AUDIO_DECODER_PATH) + del ov_model + cleanup_torchscript_cache() + gc.collect() + print("✅ AUDIO_DECODER model successfully converted") + + if not (model_path / AUDIO_ENCODER_PATH).exists(): + print("⌛ Convert AUDIO_ENCODER model") + def forward_wrap_audio_encoder(self, audio16k: torch.Tensor): + return self._encode_one_batch(audio16k) + pt_model._audio_tokenizer._orig_forward = pt_model._audio_tokenizer.forward + pt_model._audio_tokenizer.forward = types.MethodType(forward_wrap_audio_encoder, pt_model._audio_tokenizer) + + ov_model = ov.convert_model(pt_model._audio_tokenizer, example_input=torch.ones([1, 96000], dtype=torch.float32)) + 
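+        # The (1, 96000) example input is a single 6-second chunk of 16 kHz audio,
+        # matching the CHUNK_SIZE = 6 * 16000 padding/chunking scheme used by
+        # OVFireRedTTS2.encode() at inference time (only the batch size varies).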
ov.save_model(ov_model, model_path / AUDIO_ENCODER_PATH) + del ov_model + pt_model._audio_tokenizer.forward = pt_model._audio_tokenizer._orig_forward + del pt_model._audio_tokenizer._orig_forward + cleanup_torchscript_cache() + gc.collect() + print("✅ AUDIO_ENCODER model successfully converted") + + + + if not (model_path / DECODER_MODEL_PATH).exists(): + print("⌛ Convert DECODER_MODEL model") + patch_cos_sin_cached_fp32(pt_model._model.decoder) + if hasattr(pt_model._model.decoder, "model"): + patch_cos_sin_cached_fp32(pt_model._model.decoder.model) + def forward_wrap_decoder( + self, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[list[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.Tensor] = None, + step: Optional[torch.Tensor] = None, + ): + if past_key_values is not None: + past_key_values = DynamicCache.from_legacy_cache(past_key_values) + inputs_embeds_proj = self.projection(inputs_embeds) + # print(f"decoder inputs: {inputs_embeds}") + outputs = self.decoder( + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds_proj, + use_cache=True + ) + + if past_key_values is not None: + outputs["past_key_values"] = outputs["past_key_values"].to_legacy_cache() + decoder_h = outputs.last_hidden_state + ci_logits = torch.mm(decoder_h[:, -1, :], self.audio_head[step - 1]) + return (ci_logits, outputs.past_key_values) + + num_pkv = pt_model._model.decoder.config.num_hidden_layers + hidden_size = pt_model._model.decoder.config.hidden_size + + pt_model._model._orig_forward = pt_model._model.forward + pt_model._model.forward = types.MethodType(forward_wrap_decoder, pt_model._model) + + pkv_shape = ( + 2, + pt_model._model.decoder.config.num_key_value_heads, + 2, + pt_model._model.decoder.config.hidden_size // pt_model._model.decoder.config.num_attention_heads, + ) + + inputs_embeds = torch.randn((2, 2, hidden_size)) + attention_mask = torch.ones([2, 4], dtype=torch.int64) + position_ids = torch.arange(2).unsqueeze(0).expand(2, -1) + + + input_names = ["attention_mask", "position_ids"] + output_names = ["logits"] + past_key_values = [] + for i in range(num_pkv): + kv = [torch.randn(pkv_shape) for _ in range(2)] + past_key_values.append(kv) + input_names.extend([f"past_key_values.{i}.key", f"past_key_values.{i}.value"]) + output_names.extend([f"present.{i}.key", f"present.{i}.value"]) + input_names.extend(["inputs_embeds"]) + example_input = { + "attention_mask": attention_mask, + "position_ids": position_ids, + "past_key_values": past_key_values, + "inputs_embeds": inputs_embeds, + "step": torch.tensor(1).to(dtype=torch.int32), + } + + input_shapes = [ + ov.PartialShape([-1, -1]), # attention_mask + ov.PartialShape([-1, -1]), # position_ids (2D for code predictor) + ] + input_shapes += ( + [ + ov.PartialShape( + [ + -1, + pt_model._model.decoder.config.num_key_value_heads, + -1, + pt_model._model.decoder.config.hidden_size // pt_model._model.decoder.config.num_attention_heads, + ] + ) + ] + * 2 + * num_pkv + ) + input_shapes += [ov.PartialShape([-1, -1, hidden_size]), ov.PartialShape([])] # inputs_embeds + __make_16bit_traceable(pt_model._model) + + ov_model = ov.convert_model(pt_model._model, example_input=example_input, input=input_shapes) + for input, input_name in zip(ov_model.inputs, input_names): + input.get_tensor().set_names({input_name}) + + for output, output_name in zip(ov_model.outputs, output_names): + 
output.get_tensor().set_names({output_name}) + patch_stateful(ov_model) + print("✅ Decoder model successfully converted") + if quantization_config is not None and "llm" in quantization_config: + print(f"⌛ Weights compression with {quantization_config['llm']['mode']} mode started") + ov_model = nncf.compress_weights(ov_model, **quantization_config["llm"]) + print("✅ Weights compression finished") + else: + ov_model.set_rt_info("f16", ["runtime_options", "KV_CACHE_PRECISION"]) + ov.save_model(ov_model, model_path / DECODER_MODEL_PATH) + del ov_model + pt_model._model.forward = pt_model._model._orig_forward + del pt_model._model._orig_forward + cleanup_torchscript_cache() + gc.collect() + + + if not (model_path / BACKBONE_MODEL_PATH).exists(): + print("⌛ Convert BACKBONE_MODEL model") + + patch_cos_sin_cached_fp32(pt_model._model.backbone) + if hasattr(pt_model._model.backbone, "model"): + patch_cos_sin_cached_fp32(pt_model._model.backbone.model) + + backbone_config = pt_model._model.backbone.config + backbone_config.save_pretrained(model_path) + def forward_wrap_backbone( + self, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[list[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.Tensor] = None, + ): + if past_key_values is not None: + past_key_values = DynamicCache.from_legacy_cache(past_key_values) + outputs = self.backbone( + inputs_embeds=inputs_embeds, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + use_cache=True + ) + if past_key_values is not None: + outputs["past_key_values"] = outputs["past_key_values"].to_legacy_cache() + h = outputs.last_hidden_state + last_h = h[:, -1, :] + c0_logits = self.codebook0_head(last_h) + output = (c0_logits, last_h, outputs.past_key_values) + return output + + num_pkv = pt_model._model.backbone.config.num_hidden_layers + hidden_size = pt_model._model.backbone.config.hidden_size + pt_model._model._orig_forward = pt_model._model.forward + pt_model._model.forward = types.MethodType(forward_wrap_backbone, pt_model._model) + pkv_shape = ( + 2, + pt_model._model.backbone.config.num_key_value_heads, + 2, + pt_model._model.backbone.config.hidden_size // pt_model._model.backbone.config.num_attention_heads, + ) + + input_embeds = torch.randn((2, 2, hidden_size)) + attention_mask = torch.ones([2, 4], dtype=torch.int64) + position_ids = torch.arange(2).unsqueeze(0).expand(2, -1) + + input_names = ["attention_mask", "position_ids"] + output_names = ["logits", "last_hidden_state"] + past_key_values = [] + for i in range(num_pkv): + kv = [torch.randn(pkv_shape) for _ in range(2)] + past_key_values.append(kv) + input_names.extend([f"past_key_values.{i}.key", f"past_key_values.{i}.value"]) + output_names.extend([f"present.{i}.key", f"present.{i}.value"]) + input_names.extend(["inputs_embeds"]) + + example_input = { + "attention_mask": attention_mask, + "position_ids": position_ids, + "past_key_values": past_key_values, + "inputs_embeds": input_embeds, + } + + input_shapes = [ + ov.PartialShape([-1, -1]), + ov.PartialShape([-1, -1]), + ] + input_shapes += ( + [ + ov.PartialShape( + [ + -1, + pt_model._model.backbone.config.num_key_value_heads, + -1, + pt_model._model.backbone.config.hidden_size // pt_model._model.backbone.config.num_attention_heads, + ] + ) + ] + * 2 + * num_pkv + ) + input_shapes += [ov.PartialShape([-1, -1, hidden_size])] # inputs_embeds + + __make_16bit_traceable(pt_model._model) + ov_model = 
ov.convert_model(pt_model._model, example_input=example_input, input=input_shapes) + for input, input_name in zip(ov_model.inputs, input_names): + input.get_tensor().set_names({input_name}) + + for output, output_name in zip(ov_model.outputs, output_names): + output.get_tensor().set_names({output_name}) + patch_stateful(ov_model, 2) + print("✅ Backbone model successfully converted") + if quantization_config is not None and "llm" in quantization_config: + print(f"⌛ Weights compression with {quantization_config['llm']['mode']} mode started") + ov_model = nncf.compress_weights(ov_model, **quantization_config["llm"]) + print("✅ Weights compression finished") + else: + ov_model.set_rt_info("f16", ["runtime_options", "KV_CACHE_PRECISION"]) + ov.save_model(ov_model, model_path / BACKBONE_MODEL_PATH) + del ov_model + cleanup_torchscript_cache() + gc.collect() + del pt_model + gc.collect() + print(f"✅ {model_id} model conversion finished. You can find results in {model_path}") + return model_path + + +@dataclass +class Segment: + speaker: str + text: str + audio: torch.Tensor + +@dataclass +class ModelArgs: + backbone_flavor: str + decoder_flavor: str + text_vocab_size: int + audio_vocab_size: int + audio_num_codebooks: int + decoder_loss_weight: float + use_text_loss: bool + +class OVFireRedTTS2: + def __init__(self, pretrained_dir, gen_type, device, codec_device="CPU"): + self.device = device + self.codec_device = codec_device + self.sample_rate = 16000 + self.max_seq_len = 3100 + + assert os.path.exists(pretrained_dir) + assert gen_type in ["monologue", "dialogue"] + llm_config_path = os.path.join(pretrained_dir, "config_llm.json") + codec_config_path = os.path.join(pretrained_dir, "config_codec.json") + + # check + assert os.path.exists(llm_config_path) + assert os.path.exists(codec_config_path) + + # ==== Load Torch LLM ==== + llm_config = json.load(open(llm_config_path)) + self.config = ModelArgs( + backbone_flavor=llm_config["llm_models"]["backbone_flavor"], + decoder_flavor=llm_config["llm_models"]["decoder_flavor"], + text_vocab_size=llm_config["llm_models"]["text_vocab_size"], + audio_vocab_size=llm_config["llm_models"]["audio_vocab_size"], + audio_num_codebooks=llm_config["llm_models"]["audio_num_codebooks"], + decoder_loss_weight=llm_config["llm_models"]["decoder_loss_weight"], + use_text_loss=True, + ) + + model_dir = Path(pretrained_dir) + self.backbone = core.compile_model(model_dir / BACKBONE_MODEL_PATH, self.device).create_infer_request() + self.decoder = core.compile_model(model_dir / DECODER_MODEL_PATH, self.device).create_infer_request() + self.audio_embeddings = core.compile_model(model_dir / AUDIO_EMBEDDINGS_PATH, self.device) + self.audio_decoder = core.compile_model(model_dir / AUDIO_DECODER_PATH, self.codec_device) + self.audio_encoder = core.compile_model(model_dir / AUDIO_ENCODER_PATH, self.codec_device) + self.text_embeddings = core.compile_model(model_dir / TEXT_EMBEDDINGS_PATH, self.device) + self.audio_upsampler = core.compile_model(model_dir / AUDIO_UPSAMPLER_PATH, self.device) + print("[INFO] OV model Loaded...") + + # ==== Load Qwen2.5 Text Tokenizer ==== + self._text_tokenizer = AutoTokenizer.from_pretrained(pretrained_dir) + print("[INFO] Text Tokenizer Loaded...") + + def encode( + self, + audio16k: torch.Tensor, + audio16k_length: torch.Tensor = None, + batch_size: int = 96, + ): + """ + Args: + audio16k: shape (b, t) + audio16k_length: (b,) + Returns: + token: shape (b, nq, l) + token_length: (b,) + """ + if audio16k_length is None: + assert audio16k.shape[0] 
== 1 + audio16k_length = torch.tensor( + [audio16k.shape[1]], dtype=torch.long, device=audio16k.device + ) + + CHUNK_SIZE = 6 * 16000 + B, T = audio16k.shape + # Pad, chunk, and batch + audio16k_batch = [] + batch_size_list = [] + for i in range(B): + # Remove extra paddings + one_audio_chunks = _pad_and_chunk( + audio16k[i : (i + 1), : audio16k_length[i]], CHUNK_SIZE + ) + audio16k_batch += one_audio_chunks + batch_size_list.append(len(one_audio_chunks)) + audio16k_batch = torch.cat(audio16k_batch, dim=0) + # Batch encode + token_batch = [] + for i in range(0, audio16k_batch.shape[0], batch_size): + one_audio_batch = audio16k_batch[i : (i + batch_size)] + one_token_batch = torch.from_numpy(self.audio_encoder(one_audio_batch)[0]) + token_batch.append(one_token_batch) + token_batch = torch.cat(token_batch, dim=0) + # Recover & concat + token_list = torch.split( + token_batch, batch_size_list, dim=0 + ) # [(B=1, nq, l), (B=3, nq, l), ...] + token_list = [ + torch.cat(token_ts.split(1, dim=0), dim=-1) # (B=1, nq, l) + for token_ts in token_list + ] + # Pad tokens + token = pad_sequence( + [ts.squeeze(0).transpose(1, 0) for ts in token_list], + batch_first=True, + padding_value=0, + ).transpose( + 1, 2 + ) # (B, nq, L) + token_length = (audio16k_length / 1280).ceil().long() + token = token[ + ..., : token_length.max() + ] # Remove extra paddings (we pad to multiples of 6s) + return token, token_length + + def load_prompt_audio(self, audio_path) -> torch.Tensor: + audio, audio_sr = torchaudio.load(audio_path) + # Audio must be single channel + if audio.shape[0] > 1: + audio = audio[0, :].unsqueeze(0) + audio16k = torchaudio.functional.resample(audio, audio_sr, 16000) + return audio16k + + def prepare_prompt(self, text, speaker, audio_path) -> Segment: + audio_tensor = self.load_prompt_audio(audio_path) + return Segment(text=text, speaker=speaker, audio=audio_tensor) + + def _tokenize_text_segment( + self, text: str, speaker: str + ) -> Tuple[torch.Tensor, torch.Tensor]: + frame_tokens = [] + frame_masks = [] + + text = speaker + "<|text_start|>" + text + "<|text_end|>" + text_tokens = self._text_tokenizer.encode(text) + text_frame = torch.zeros(len(text_tokens), 17).long() + text_frame_mask = torch.zeros(len(text_tokens), 17).bool() + text_frame[:, -1] = torch.tensor(text_tokens) + text_frame_mask[:, -1] = True + + frame_tokens.append(text_frame) + frame_masks.append(text_frame_mask) + + return torch.cat(frame_tokens, dim=0), torch.cat(frame_masks, dim=0) + + def _tokenize_audio(self, audio: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: + frame_tokens = [] + frame_masks = [] + + # (K, T) + audio_length = torch.tensor([audio.shape[1]], dtype=torch.long) + audio_tokens, token_length = self.encode( + audio, + audio_length, + batch_size=48, + ) + + audio_tokens = audio_tokens.squeeze(0) + # add EOS frame + eos_frame = torch.zeros(audio_tokens.size(0), 1) + audio_tokens = torch.cat([audio_tokens, eos_frame], dim=1) + + audio_frame = torch.zeros(audio_tokens.size(1), 17).long() + audio_frame_mask = torch.zeros(audio_tokens.size(1), 17).bool() + audio_frame[:, :-1] = audio_tokens.transpose(0, 1) + audio_frame_mask[:, :-1] = True + + frame_tokens.append(audio_frame) + frame_masks.append(audio_frame_mask) + + return torch.cat(frame_tokens, dim=0), torch.cat(frame_masks, dim=0) + + def _tokenize_segment(self, segment: Segment) -> Tuple[torch.Tensor, torch.Tensor]: + """ + Returns: + (seq_len,17), (seq_len, 17) + """ + text_tokens, text_masks = self._tokenize_text_segment( + segment.text, 
segment.speaker + ) + audio_tokens, audio_masks = self._tokenize_audio(segment.audio) + + return torch.cat([text_tokens, audio_tokens], dim=0), torch.cat( + [text_masks, audio_masks], dim=0 + ) + + def generate_frame( + self, + tokens: torch.Tensor, + tokens_mask: torch.Tensor, + input_pos: torch.Tensor, + temperature: float, + topk: int, + ) -> torch.Tensor: + """ + Args: + tokens: (batch_size, seq_len, audio_num_codebooks+1) + tokens_mask: (batch_size, seq_len, audio_num_codebooks+1) + input_pos: (batch_size, seq_len) positions for each token + mask: (batch_size, seq_len, max_seq_len + + Returns: + (batch_size, audio_num_codebooks) sampled tokens + """ + + # assert self.backbone.caches_are_enabled(), "backbone caches are not enabled" + embeds = self._embed_tokens(tokens) + masked_embeds = embeds * tokens_mask.unsqueeze(-1) + h = masked_embeds.sum(dim=2) + backbone_attention_mask = torch.ones( + tokens.size(0), tokens.size(1), + dtype=torch.long, + device=tokens.device + ) # [batch, curr_seq_len] + backbone_position_ids = input_pos + if self.backbone_past_len != 0: + backbone_attention_mask = torch.cat([ + torch.ones(tokens.size(0), self.backbone_past_len, dtype=torch.long, device=tokens.device), + backbone_attention_mask + ], dim=1) + backbone_position_ids = backbone_position_ids[:, -tokens.shape[1] :] + + inputs = { + "inputs_embeds": h, + "attention_mask": backbone_attention_mask, + "position_ids": backbone_position_ids, + "beam_idx": np.arange(h.shape[0], dtype=int) + } + + + + self.backbone.start_async(inputs, share_inputs=True) + self.backbone.wait() + logits = self.backbone.get_tensor("logits").data + last_hidden_state = self.backbone.get_tensor("last_hidden_state").data + c0_logits = torch.from_numpy(logits) + last_h = torch.from_numpy(last_hidden_state) + self.backbone_past_len += inputs["inputs_embeds"].shape[1] + c0_sample = sample_topk(c0_logits, 1, temperature) + c0_embed = self._embed_audio(0, c0_sample) + curr_h = torch.cat([last_h.unsqueeze(1), c0_embed], dim=1) + curr_sample = c0_sample.clone() + curr_pos = ( + torch.arange(0, curr_h.size(1), device=curr_h.device) + .unsqueeze(0) + .repeat(curr_h.size(0), 1) + ) + + self.decoder.reset_state() + # Set initial value for the next beam_idx input that will be used at the current iteration + # and will be optionally updated by _reorder_cache at the next iterations if beam_search is used + decoder_past_length = 0 + for i in range(1, self.config.audio_num_codebooks): + decoder_attention_mask = torch.ones( + curr_h.size(0), curr_h.size(1), + dtype=torch.long, + device=curr_h.device + ) # [batch, curr_seq_len] + decoder_position_ids = curr_pos # [batch, curr_seq_len] + if decoder_past_length != 0: + decoder_attention_mask = torch.cat([ + torch.ones(curr_h.size(0), decoder_past_length, dtype=torch.long, device=curr_h.device), + decoder_attention_mask + ], dim=1) + decoder_position_ids = decoder_position_ids[:, -curr_h.shape[1] :] + + inputs = { + "inputs_embeds": curr_h, + "attention_mask": decoder_attention_mask, + "position_ids": decoder_position_ids, + "beam_idx": np.arange(curr_h.shape[0], dtype=int), + "step": torch.tensor(i).to(dtype=torch.int32) + } + + self.decoder.start_async(inputs, share_inputs=True) + self.decoder.wait() + logits = self.decoder.get_tensor("logits").data + ci_logits = torch.from_numpy(logits) + decoder_past_length += inputs["inputs_embeds"].shape[1] + ci_sample = sample_topk(ci_logits, 1, 0.75) # fix to 10 and 0.75 + ci_embed = self._embed_audio(i, ci_sample) + curr_h = ci_embed + curr_sample = 
torch.cat([curr_sample, ci_sample], dim=1) + curr_pos = curr_pos[:, -1:] + 1 + + return curr_sample + + def reset_caches(self): + self.backbone.past_key_values = None + self.decoder.past_key_values = None + + def _embed_audio(self, codebook: int, tokens: torch.Tensor) -> torch.Tensor: + return torch.from_numpy(self.audio_embeddings((tokens + codebook * self.config.audio_vocab_size)[0])[0]).unsqueeze(0) + + def _embed_tokens(self, tokens: torch.Tensor) -> torch.Tensor: + text_embeds = torch.from_numpy(self.text_embeddings(tokens[:, :, -1])[0]).unsqueeze(-2) + + audio_tokens = tokens[:, :, :-1] + ( + self.config.audio_vocab_size + * torch.arange(self.config.audio_num_codebooks, device=tokens.device) + ) + audio_embeds = torch.from_numpy(self.audio_embeddings(audio_tokens.view(-1))[0]).reshape( + tokens.size(0), tokens.size(1), self.config.audio_num_codebooks, -1 + ) + + return torch.cat([audio_embeds, text_embeds], dim=-2) + + def generate( + self, + text: str, + speaker: str, + context: List[Segment], + max_audio_length_ms: float = 90_000, + temperature: float = 0.9, + topk: int = 20, + ) -> torch.Tensor: + self.backbone.reset_state() + self.backbone_past_len = 0 + max_generation_len = int(max_audio_length_ms / 80) + tokens, tokens_mask = [], [] + for segment in context: + segment_tokens, segment_tokens_mask = self._tokenize_segment(segment) + tokens.append(segment_tokens) + tokens_mask.append(segment_tokens_mask) + + gen_segment_tokens, gen_segment_tokens_mask = self._tokenize_text_segment( + text, speaker + ) + tokens.append(gen_segment_tokens) + tokens_mask.append(gen_segment_tokens_mask) + + prompt_tokens = torch.cat(tokens, dim=0).long() + prompt_tokens_mask = torch.cat(tokens_mask, dim=0).bool() + + samples = [] + curr_tokens = prompt_tokens.unsqueeze(0) + curr_tokens_mask = prompt_tokens_mask.unsqueeze(0) + curr_pos = ( + torch.arange(0, prompt_tokens.size(0)).unsqueeze(0).long() + ) + + max_seq_len = 3100 + max_context_len = max_seq_len - max_generation_len + if curr_tokens.size(1) >= max_context_len: + raise ValueError( + f"Inputs too long, must be below max_seq_len - max_generation_len: {max_context_len}" + ) + + for _ in range(max_generation_len): + sample = self.generate_frame( + curr_tokens, curr_tokens_mask, curr_pos, temperature, topk + ) + # eos + if torch.all(sample == 0): + break + + samples.append(sample) + + curr_tokens = torch.cat( + [sample, torch.zeros(1, 1).long()], dim=1 + ).unsqueeze(1) + curr_tokens_mask = torch.cat( + [ + torch.ones_like(sample).bool(), + torch.zeros(1, 1).bool(), + ], + dim=1, + ).unsqueeze(1) + curr_pos = curr_pos[:, -1:] + 1 + vq_out = self.audio_upsampler(torch.stack(samples).permute(1, 2, 0)) + vq_out_feats, _ = torch.from_numpy(vq_out[0]), torch.from_numpy(vq_out[1]) + vq_out_length = torch.tensor([vq_out_feats.shape[1]], dtype=torch.long) + audio = torch.from_numpy(self.audio_decoder([vq_out_feats, vq_out_length])[0]) + audio = ( + audio + .squeeze(0) + .squeeze(0) + ) + + return audio + + @torch.inference_mode() + def generate_dialogue( + self, + text_list, + prompt_wav_list=None, + prompt_text_list=None, + temperature=0.9, + topk=20, + ): + all_generated_segments = [] + all_storage_segments = [] + prompt_segments = [] + text_list = process_text_list(text_list=text_list) + if prompt_wav_list is not None: + assert len(prompt_wav_list) == len(prompt_text_list) + # Prepare prompts + for i in range(len(prompt_wav_list)): + prompt_wav = prompt_wav_list[i] + prompt_text = prompt_text_list[i] + speaker = prompt_text[:4] + assert speaker in 
["[S1]", "[S2]", "[S3]", "[S4]"] + prompt_segments.append( + self.prepare_prompt( + text=prompt_text, speaker=speaker, audio_path=prompt_wav + ) + ) + + for text in tqdm(text_list): + speaker = text[:4] + text = text[4:] + # print("---speaker:", speaker) + # print("---text:", text) + assert speaker in ["[S1]", "[S2]", "[S3]", "[S4]"] + + audio_tensor = self.generate( + text=text, + speaker=speaker, + context=prompt_segments + all_generated_segments, + max_audio_length_ms=30_000, + temperature=temperature, + topk=topk, + ) + + # 做上下文管理的时候需要将audio 转到16k + audio_16k = torchaudio.functional.resample( + audio_tensor.unsqueeze(0), 24000, 16000 + ) + all_generated_segments.append( + Segment(text=text, speaker=speaker, audio=audio_16k) + ) + + all_storage_segments.append( + Segment(text=text, speaker=speaker, audio=audio_tensor.unsqueeze(0)) + ) + + # Concatenate all generations + all_audio = torch.cat([seg.audio for seg in all_storage_segments], dim=1) + all_audio = all_audio.cpu() + return all_audio \ No newline at end of file From cf06ecd20fd7311047178c8155311f81f1e27e61 Mon Sep 17 00:00:00 2001 From: ethan Date: Thu, 13 Nov 2025 21:01:37 -0800 Subject: [PATCH 02/14] update --- notebooks/fireredtts2/README.md | 7 +- notebooks/fireredtts2/fireredtts2.ipynb | 255 +----------------- notebooks/fireredtts2/gradio_helper.py | 51 ++-- notebooks/fireredtts2/ov_fireredtts_helper.py | 232 ++++++---------- 4 files changed, 122 insertions(+), 423 deletions(-) diff --git a/notebooks/fireredtts2/README.md b/notebooks/fireredtts2/README.md index dc0b53cf620..80728c602da 100644 --- a/notebooks/fireredtts2/README.md +++ b/notebooks/fireredtts2/README.md @@ -22,13 +22,10 @@ The tutorial consists from following steps: In this demonstration, you'll create interactive assistant that can answer questions about provided image's content or generate images based on text instructions. -The images bellow illustrates example of input prompt and model answer for image understanding and generation -![example.png](https://github.com/user-attachments/assets/89a71be8-b472-4acd-a2e0-dbc97645fc1c) -![example2.png](https://github.com/user-attachments/assets/5aca2b37-52d9-403d-a773-311ccf82b375) +The images bellow illustrates example of voice cloning and dialogue generation. ## Installation instructions This is a self-contained example that relies solely on its own code.
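To make the notebook flow described above concrete, here is a minimal usage sketch of the helper API exported by `ov_fireredtts_helper.py` and used throughout this tutorial. It is an illustrative sketch rather than the notebook's own code: the checkpoint directory, output directory, prompt audio paths, and dialogue text are placeholders you would replace with your own.

```python
# Illustrative end-to-end sketch (paths and example text are placeholders, not notebook defaults).
import torchaudio

from ov_fireredtts_helper import OVFireRedTTS2, convert_fireredtts2

# One-time conversion of the downloaded checkpoint into OpenVINO IR files
# (backbone, decoder, embeddings, codec encoder/decoder, upsampler).
model_dir = convert_fireredtts2("pretrained_models", model_path="FireRedTTS2-ov")

# Compile the converted models; the codec parts can stay on CPU while the LLM parts
# run on the selected device.
ov_model = OVFireRedTTS2(model_dir, gen_type="dialogue", device="CPU", codec_device="CPU")

# Every dialogue line starts with a speaker tag [S1]..[S4]; each prompt text must start
# with the tag of the speaker its reference audio belongs to.
audio = ov_model.generate_dialogue(
    text_list=[
        "[S1]Hello, did you get a chance to try the new notebook?",
        "[S2]I did, the dialogue it produced sounded surprisingly natural.",
    ],
    prompt_wav_list=["examples/chat_prompt/en/S1.flac", "examples/chat_prompt/en/S2.flac"],
    prompt_text_list=[
        "[S1]I think we should just talk about what happened and move on.",
        "[S2]You know, maybe I pushed too hard. I was really excited.",
    ],
)

# generate_dialogue returns a (1, num_samples) tensor at 24 kHz.
torchaudio.save("generated_dialogue.wav", audio, 24000)
```

If `prompt_wav_list` and `prompt_text_list` are left as `None`, no voice prompts are used, which corresponds to the random-timbre mode mentioned in the feature list above.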
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to [Installation Guide](../../README.md). - - + diff --git a/notebooks/fireredtts2/fireredtts2.ipynb b/notebooks/fireredtts2/fireredtts2.ipynb index ae3beb8ff54..6b5c3cf890a 100644 --- a/notebooks/fireredtts2/fireredtts2.ipynb +++ b/notebooks/fireredtts2/fireredtts2.ipynb @@ -36,7 +36,10 @@ "We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.\n", "For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).\n", "\n", - "\n" + "\n", + "\n", + "\n", + "\n" ] }, { @@ -84,7 +87,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -100,41 +103,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", - "Note: switching to 'bfacbfb7bb88cade9c0b9ab2644ebd7f75c6989c'.\n", - "\n", - "You are in 'detached HEAD' state. You can look around, make experimental\n", - "changes and commit them, and you can discard any commits you make in this\n", - "state without impacting any branches by switching back to a branch.\n", - "\n", - "If you want to create a new branch to retain commits you create, you may\n", - "do so (now or later) by using -c with the switch command. 
Example:\n", - "\n", - " git switch -c \n", "\n", - "Or undo this operation with:\n", - "\n", - " git switch -\n", - "\n", - "Turn off this advice by setting config variable advice.detachedHead to false\n", - "\n", - "HEAD is now at bfacbfb Update llm.py\n", - "\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33m WARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", - "\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ptimum-intel (/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n", - "\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" ] @@ -191,179 +160,9 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Skipping import of cpp extensions due to incompatible torch version 2.7.1+cpu for torchao version 0.14.1 Please see https://github.com/pytorch/ao/issues/2919 for more info\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "⌛ pretrained_models conversion started. Be patient, it may takes some time.\n", - "⌛ Load Original model\n", - "🔍 Detected Configuration:\n", - " num_heads: 12\n", - " num_kv_heads: 2\n", - " dim: 1536\n", - " head_dim: 128\n", - " intermediate_size: 8960\n", - " num_layers: 28\n", - " max_seq_len: 4096\n", - " tie_word_embeddings: True\n", - "\n", - "🔧 Removing 'model.' 
prefix...\n", - "\n", - "🔑 Cleaned key examples:\n", - " layers.0.self_attn.q_proj.weight\n", - " layers.0.self_attn.q_proj.bias\n", - " layers.0.self_attn.k_proj.weight\n", - " layers.0.self_attn.k_proj.bias\n", - " layers.0.self_attn.v_proj.weight\n", - "\n", - "⚠️ Missing keys: ['embed_tokens.weight']\n", - "\n", - "✅ Conversion completed!\n", - "🔍 Detected Configuration:\n", - " num_heads: 12\n", - " num_kv_heads: 2\n", - " dim: 1536\n", - " head_dim: 128\n", - " intermediate_size: 8960\n", - " num_layers: 4\n", - " max_seq_len: 4096\n", - " tie_word_embeddings: True\n", - "\n", - "🔧 Removing 'model.' prefix...\n", - "\n", - "🔑 Cleaned key examples:\n", - " layers.0.self_attn.q_proj.weight\n", - " layers.0.self_attn.q_proj.bias\n", - " layers.0.self_attn.k_proj.weight\n", - " layers.0.self_attn.k_proj.bias\n", - " layers.0.self_attn.v_proj.weight\n", - "\n", - "⚠️ Missing keys: ['embed_tokens.weight']\n", - "\n", - "✅ Conversion completed!\n", - "[INFO] LLM Loaded...\n", - "[INFO] Text Tokenizer Loaded...\n", - "[INFO] Codec Loaded...\n", - "✅ Original model successfully loaded\n", - "⌛ Export tokenizer and config\n", - "⌛ Convert TEXT_EMBEDDINGS model\n", - "✅ TEXT_EMBEDDINGS model successfully converted\n", - "⌛ Convert AUDIO_EMBEDDINGS model\n", - "✅ AUDIO_EMBEDDINGS model successfully converted\n", - "⌛ Convert AUDIO_UPSAMPLER model\n", - "vq_out_feats shape: 1\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:760: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", - " vq_out_length = torch.tensor(\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "vq_out_feats shape: 1\n", - "vq_out_feats shape: 1\n", - "✅ AUDIO_UPSAMPLER model successfully converted\n", - "⌛ Convert AUDIO_DECODER model\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:7: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " max_len = max_len if max_len > 0 else lengths.max().item()\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:26: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. 
In any other case, this might cause the trace to be incorrect.\n", - " num_blocks = torch.ceil(torch.tensor(attn_mask.shape[1] / chunk_size)).to(torch.int64)\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:26: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.detach().clone() or sourceTensor.detach().clone().requires_grad_(True), rather than torch.tensor(sourceTensor).\n", - " num_blocks = torch.ceil(torch.tensor(attn_mask.shape[1] / chunk_size)).to(torch.int64)\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/decoder.py:402: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " assert (window_envelope > 1e-11).all()\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ AUDIO_DECODER model successfully converted\n", - "⌛ Convert AUDIO_ENCODER model\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/model.py:221: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", - " audio16k_length = torch.tensor(\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/whisper.py:330: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", - " mel_filters = torch.from_numpy(self.mel_filters).type(torch.float32).to(device)\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2/fireredtts2/codec/utils.py:7: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " max_len = max_len if max_len > 0 else lengths.max().item()\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ AUDIO_ENCODER model successfully converted\n", - "⌛ Convert DECODER_MODEL model\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.\n", - "/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/transformers/cache_utils.py:568: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", - " or not self.key_cache[layer_idx].numel() # the layer has no cache\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:371: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " if (padding_length := kv_length + kv_offset - attention_mask.shape[-1]) > 0:\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:503: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", - " torch.tensor(0.0, device=mask.device, dtype=dtype),\n", - "/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:504: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n", - " torch.tensor(torch.finfo(torch.float16).min, device=mask.device, dtype=dtype),\n", - "/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/transformers/cache_utils.py:551: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " elif (\n", - "/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py:59: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", - " is_causal = query.shape[2] > 1 and attention_mask is None and getattr(module, \"is_causal\", True)\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Decoder model successfully converted\n", - "⌛ Convert BACKBONE_MODEL model\n", - "✅ Backbone model successfully converted\n", - "✅ pretrained_models model conversion finished. 
You can find results in FireRedTTS2-ov\n" - ] - }, - { - "data": { - "text/plain": [ - "PosixPath('FireRedTTS2-ov')" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "from ov_fireredtts_helper import convert_fireredtts2\n", "\n", @@ -396,25 +195,9 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "2d379251f53b43eb805b5a5f8d501f55", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Dropdown(description='Device:', options=('CPU', 'GPU', 'AUTO'), value='CPU')" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "from notebook_utils import device_widget\n", "\n", @@ -437,23 +220,9 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "ename": "RuntimeError", - "evalue": "Exception from src/inference/src/cpp/core.cpp:134:\nException from src/inference/src/dev/plugin.cpp:58:\nException from src/core/src/pass/graph_rewrite.cpp:298:\n[FuseBinaryEltwise] END: node: opset1::Add Add_494266 (SnippetsOpset::BrgemmCPU MatMul_494263[0]:f32[?,20,?,?], opset1::Parameter Add_494266[0]:f32[1,1,1,300]) -> (f32[?,20,?,300]) CALLBACK HAS THROWN: Exception from src/core/src/dimension.cpp:227:\nCannot get length of dynamic dimension\n\n\n\n\n", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[6], line 3\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mov_fireredtts_helper\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m OVFireRedTTS2\n\u001b[0;32m----> 3\u001b[0m ov_model \u001b[38;5;241m=\u001b[39m \u001b[43mOVFireRedTTS2\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mgen_type\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mdialogue\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdevice\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdevice\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mvalue\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcodec_device\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mCPU\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n", - "File \u001b[0;32m/home2/ethan/intel/openvino_notebooks/notebooks/fireredtts2/ov_fireredtts_helper.py:1079\u001b[0m, in \u001b[0;36mOVFireRedTTS2.__init__\u001b[0;34m(self, pretrained_dir, gen_type, device, codec_device)\u001b[0m\n\u001b[1;32m 1077\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_embeddings \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m AUDIO_EMBEDDINGS_PATH, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdevice)\n\u001b[1;32m 1078\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_decoder \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m AUDIO_DECODER_PATH, 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcodec_device)\n\u001b[0;32m-> 1079\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_encoder \u001b[38;5;241m=\u001b[39m \u001b[43mcore\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcompile_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel_dir\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m/\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mAUDIO_ENCODER_PATH\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcodec_device\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1080\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtext_embeddings \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m TEXT_EMBEDDINGS_PATH, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdevice)\n\u001b[1;32m 1081\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maudio_upsampler \u001b[38;5;241m=\u001b[39m core\u001b[38;5;241m.\u001b[39mcompile_model(model_dir \u001b[38;5;241m/\u001b[39m AUDIO_UPSAMPLER_PATH, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdevice)\n", - "File \u001b[0;32m/home2/ethan/intel/openvino_notebooks/openvino_venv/lib/python3.10/site-packages/openvino/_ov_api.py:610\u001b[0m, in \u001b[0;36mCore.compile_model\u001b[0;34m(self, model, device_name, config, weights)\u001b[0m\n\u001b[1;32m 605\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m device_name \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m 606\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m CompiledModel(\n\u001b[1;32m 607\u001b[0m \u001b[38;5;28msuper\u001b[39m()\u001b[38;5;241m.\u001b[39mcompile_model(model, {} \u001b[38;5;28;01mif\u001b[39;00m config \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;28;01melse\u001b[39;00m config),\n\u001b[1;32m 608\u001b[0m )\n\u001b[1;32m 609\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m CompiledModel(\n\u001b[0;32m--> 610\u001b[0m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcompile_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdevice_name\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m{\u001b[49m\u001b[43m}\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01melse\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mconfig\u001b[49m\u001b[43m)\u001b[49m,\n\u001b[1;32m 611\u001b[0m )\n\u001b[1;32m 612\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 613\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m device_name \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n", - "\u001b[0;31mRuntimeError\u001b[0m: Exception from src/inference/src/cpp/core.cpp:134:\nException from src/inference/src/dev/plugin.cpp:58:\nException from src/core/src/pass/graph_rewrite.cpp:298:\n[FuseBinaryEltwise] END: node: opset1::Add Add_494266 (SnippetsOpset::BrgemmCPU MatMul_494263[0]:f32[?,20,?,?], opset1::Parameter Add_494266[0]:f32[1,1,1,300]) -> (f32[?,20,?,300]) CALLBACK HAS THROWN: Exception from src/core/src/dimension.cpp:227:\nCannot get length of dynamic dimension\n\n\n\n\n" - ] - } - ], + "outputs": 
[], "source": [ "from ov_fireredtts_helper import OVFireRedTTS2\n", "\n", @@ -559,7 +328,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.7" + "version": "3.10.12" }, "openvino_notebooks": { "imageUrl": "https://github.com/user-attachments/assets/0d83b369-b8fc-423e-bc53-495022555e8c", diff --git a/notebooks/fireredtts2/gradio_helper.py b/notebooks/fireredtts2/gradio_helper.py index c0c68bb7c85..fa8da5d51de 100644 --- a/notebooks/fireredtts2/gradio_helper.py +++ b/notebooks/fireredtts2/gradio_helper.py @@ -3,31 +3,28 @@ from tqdm import tqdm from argparse import ArgumentParser from typing import Literal, List, Tuple -from ov_firetts_helper import OVFireRedTTS2 -# ================================================ -# FireRedTTS2 Model -# ================================================ -# Global model instance -model: OVFireRedTTS2 = None - examples = [ - ["English", - "examples\chat_prompt\en\S1.flac", - "[S1]I think we should just talk about what happened and move on because there's going to be other jousts and Sir Saif isn't done yet. It's not, he's not, it's not done yet.", - "examples\chat_prompt\en\S2.flac", - "[S2]You know, maybe sorry, maybe maybe I pushed, maybe I pushed too hard. I was really excited. I didn't mean to make you snap.", - "[S1]It's alright, we'll take a breath and plan the next pass together.[S2]Yeah, thanks. We'll get it right this time.[S1]Let's review our signals tonight so we're in sync on the field tomorrow." - ] - ["中文", - "examples\chat_prompt\zh\S1.flac", - "[S1]啊,可能说更适合美国市场应该是什么样子。那这这个可能说当然如果说有有机会能亲身的去考察去了解一下,那当然是有更好的帮助。", - "examples\chat_prompt\zh\S2.flac", - "[S2]比如具体一点的,他觉得最大的一个跟他预想的不一样的是在什么地方。", - "[S1]那可能说对对,没有去过美国来说去去看到美国线下。巴斯曼也好,沃尔玛也好,他们线下不管说,因为深圳出去的还是电子周边的会表达,会发现哇对这个价格真的是很高呀。都是卖三十五美金、四十美金,甚至一个手机壳,就是二十五美金开。[S2]对,没错,我每次都觉得不不可思议。我什么人会买三五十美金的手机壳?但是其实在在那个target啊,就塔吉特这种超级市场,大家都是这样的,定价也很多人买。"], + [ + "English", + "FireRedTTS2\examples\chat_prompt\en\S1.flac", + "[S1]I think we should just talk about what happened and move on because there's going to be other jousts and Sir Saif isn't done yet. It's not, he's not, it's not done yet.", + "examples\chat_prompt\en\S2.flac", + "[S2]You know, maybe sorry, maybe maybe I pushed, maybe I pushed too hard. I was really excited. I didn't mean to make you snap.", + "[S1]It's alright, we'll take a breath and plan the next pass together.[S2]Yeah, thanks. 
We'll get it right this time.[S1]Let's review our signals tonight so we're in sync on the field tomorrow.", + ], + [ + "中文", + "FireRedTTS2\examples\chat_prompt\zh\S1.flac", + "[S1]啊,可能说更适合美国市场应该是什么样子。那这这个可能说当然如果说有有机会能亲身的去考察去了解一下,那当然是有更好的帮助。", + "examples\chat_prompt\zh\S2.flac", + "[S2]比如具体一点的,他觉得最大的一个跟他预想的不一样的是在什么地方。", + "[S1]那可能说对对,没有去过美国来说去去看到美国线下。巴斯曼也好,沃尔玛也好,他们线下不管说,因为深圳出去的还是电子周边的会表达,会发现哇对这个价格真的是很高呀。都是卖三十五美金、四十美金,甚至一个手机壳,就是二十五美金开。[S2]对,没错,我每次都觉得不不可思议。我什么人会买三五十美金的手机壳?但是其实在在那个target啊,就塔吉特这种超级市场,大家都是这样的,定价也很多人买。", + ], ] + def initiate_model(ov_model): global model model = ov_model @@ -151,14 +148,12 @@ def check_dialogue_text(text_list: List[str]) -> bool: return False for text in text_list: if not ( - check_monologue_text(text, "[S1]") - or check_monologue_text(text, "[S2]") - or check_monologue_text(text, "[S3]") - or check_monologue_text(text, "[S4]") + check_monologue_text(text, "[S1]") or check_monologue_text(text, "[S2]") or check_monologue_text(text, "[S3]") or check_monologue_text(text, "[S4]") ): return False return True + def dialogue_synthesis_function( target_text: str, voice_mode: Literal[0, 1] = 0, # 0 means voice clone @@ -193,9 +188,7 @@ def dialogue_synthesis_function( # Go synthesis progress_bar = gr.Progress(track_tqdm=True) - prompt_wav_list = ( - None if voice_mode != 0 else [spk1_prompt_audio, spk2_prompt_audio] - ) + prompt_wav_list = None if voice_mode != 0 else [spk1_prompt_audio, spk2_prompt_audio] prompt_text_list = None if voice_mode != 0 else [spk1_prompt_text, spk2_prompt_text] target_audio = model.generate_dialogue( text_list=target_text_list, @@ -265,9 +258,7 @@ def render_interface() -> gr.Blocks: lines=18, ) # Generate button - generate_btn = gr.Button( - value=i18n("generate_btn_label"), variant="primary", size="lg" - ) + generate_btn = gr.Button(value=i18n("generate_btn_label"), variant="primary", size="lg") # Long output audio generate_audio = gr.Audio( label=i18n("generated_audio_label"), diff --git a/notebooks/fireredtts2/ov_fireredtts_helper.py b/notebooks/fireredtts2/ov_fireredtts_helper.py index 158adc64488..30e2f6f5f04 100644 --- a/notebooks/fireredtts2/ov_fireredtts_helper.py +++ b/notebooks/fireredtts2/ov_fireredtts_helper.py @@ -25,6 +25,7 @@ import math import torch.nn.functional as F + def patch_cos_sin_cached_fp32(model): if ( hasattr(model, "layers") @@ -43,6 +44,7 @@ def patch_cos_sin_cached_fp32(model): dtype=torch.float32, ) + SYMBOLS_MAPPING = { "\n": "", "\t": "", @@ -82,9 +84,7 @@ def patch_cos_sin_cached_fp32(model): "*": "", } -REPLACE_SYMBOL_REGEX = re.compile( - "|".join(re.escape(p) for p in SYMBOLS_MAPPING.keys()) -) +REPLACE_SYMBOL_REGEX = re.compile("|".join(re.escape(p) for p in SYMBOLS_MAPPING.keys())) EMOJI_REGEX = re.compile( @@ -330,16 +330,19 @@ def process_text_list(text_list): new_text_list.append(speaker + chunk) return new_text_list + def _pad_and_chunk(audio: torch.Tensor, chunk_size: int) -> List[torch.Tensor]: pad_len = math.ceil(audio.shape[1] / chunk_size) * chunk_size - audio.shape[1] audio = F.pad(audio, (0, pad_len), mode="constant", value=0) audio_chunks = audio.split(chunk_size, dim=1) return audio_chunks + def _multinomial_sample_one_no_sync(probs): q = torch.empty_like(probs).exponential_(1) return torch.argmax(probs / q, dim=-1, keepdim=True).to(dtype=torch.int) + def sample_topk(logits: torch.Tensor, topk: int, temperature: float): logits = logits / temperature @@ -352,15 +355,15 @@ def sample_topk(logits: torch.Tensor, topk: int, temperature: float): sample_token = 
_multinomial_sample_one_no_sync(probs) return sample_token + def causal_mask_function(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int) -> bool: """ This creates a basic lower-diagonal causal mask. """ return kv_idx <= q_idx -def prepare_padding_mask( - attention_mask: Optional[torch.Tensor], kv_length: int, kv_offset: int, _slice: bool = True -) -> Optional[torch.Tensor]: + +def prepare_padding_mask(attention_mask: Optional[torch.Tensor], kv_length: int, kv_offset: int, _slice: bool = True) -> Optional[torch.Tensor]: """ From the 2D attention mask, prepare the correct padding mask to use by potentially padding it, and slicing according to the `kv_offset` if `_slice` is `True`. @@ -379,6 +382,7 @@ def prepare_padding_mask( local_padding_mask = local_padding_mask[:, mask_indices] return local_padding_mask + def and_masks(*mask_functions: list[Callable]) -> Callable: """Returns a mask function that is the intersection of provided mask functions""" if not all(callable(arg) for arg in mask_functions): @@ -392,6 +396,7 @@ def and_mask(batch_idx, head_idx, q_idx, kv_idx): return and_mask + def padding_mask_function(padding_mask: torch.Tensor) -> Callable: """ This return the mask_function function corresponding to a 2D padding mask. @@ -405,6 +410,7 @@ def inner_mask(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int) -> bool: return inner_mask + def _ignore_causal_mask_sdpa( padding_mask: Optional[torch.Tensor], query_length: int, @@ -437,19 +443,13 @@ def _ignore_causal_mask_sdpa( # in this case we need to add special patterns to the mask so cannot be skipped otherwise and (local_attention_size is None or kv_length < local_attention_size) # In this case, we need to add padding to the mask, so cannot be skipped otherwise - and ( - padding_mask is None - or ( - padding_mask.all() - if not is_torch_xpu_available or query_length == 1 - else padding_mask[:, :query_length].all() - ) - ) + and (padding_mask is None or (padding_mask.all() if not is_torch_xpu_available or query_length == 1 else padding_mask[:, :query_length].all())) ): return True return False + def sdpa_mask_without_vmap( batch_size: int, cache_position: torch.Tensor, @@ -490,6 +490,7 @@ def sdpa_mask_without_vmap( return causal_mask + # Adapted from https://github.com/huggingface/transformers/blob/v4.53.0/src/transformers/masking_utils.py#L433 # Specifically for OpenVINO, we use torch.finfo(torch.float16).min instead of torch.finfo(dtype).min def eager_mask_without_vmap(*args, **kwargs) -> Optional[torch.Tensor]: @@ -636,7 +637,6 @@ def make_stateful( """ from openvino._offline_transformations import apply_make_stateful_transformation - input_output_map = {} if num_beams_and_batch is not None: @@ -695,6 +695,7 @@ def cleanup_torchscript_cache(): torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore() torch.jit._state._clear_class_state() + TEXT_EMBEDDINGS_PATH = "openvino_text_embeddings_model.xml" AUDIO_EMBEDDINGS_PATH = "openvino_audio_embeddings_model.xml" AUDIO_DECODER_PATH = "openvino_audio_decoder_model.xml" @@ -703,14 +704,25 @@ def cleanup_torchscript_cache(): DECODER_MODEL_PATH = "openvino_decoder_model.xml" BACKBONE_MODEL_PATH = "openvino_backbone_model.xml" + def convert_fireredtts2(model_id, model_path=None, quantization_config=None): - if model_path is None: model_path = Path(model_id.split("/")[-1]) else: model_path = Path(model_path) - if all((model_path / model_name).exists() for model_name in [TEXT_EMBEDDINGS_PATH, AUDIO_DECODER_PATH, AUDIO_ENCODER_PATH, AUDIO_EMBEDDINGS_PATH, 
DECODER_MODEL_PATH, BACKBONE_MODEL_PATH, AUDIO_UPSAMPLER_PATH]): + if all( + (model_path / model_name).exists() + for model_name in [ + TEXT_EMBEDDINGS_PATH, + AUDIO_DECODER_PATH, + AUDIO_ENCODER_PATH, + AUDIO_EMBEDDINGS_PATH, + DECODER_MODEL_PATH, + BACKBONE_MODEL_PATH, + AUDIO_UPSAMPLER_PATH, + ] + ): print(f"✅ {model_id} model already converted. You can find results in {model_path}") return model_path print(f"⌛ {model_id} conversion started. Be patient, it may takes some time.") @@ -727,8 +739,6 @@ def convert_fireredtts2(model_id, model_path=None, quantization_config=None): pt_model._text_tokenizer.save_pretrained(model_path) for json_file in Path(model_id).glob("*.json"): shutil.copy(json_file, model_path / json_file.name) - - if not (model_path / TEXT_EMBEDDINGS_PATH).exists(): print("⌛ Convert TEXT_EMBEDDINGS model") @@ -739,7 +749,7 @@ def convert_fireredtts2(model_id, model_path=None, quantization_config=None): cleanup_torchscript_cache() gc.collect() print("✅ TEXT_EMBEDDINGS model successfully converted") - + if not (model_path / AUDIO_EMBEDDINGS_PATH).exists(): print("⌛ Convert AUDIO_EMBEDDINGS model") @@ -752,14 +762,13 @@ def convert_fireredtts2(model_id, model_path=None, quantization_config=None): if not (model_path / AUDIO_UPSAMPLER_PATH).exists(): print("⌛ Convert AUDIO_UPSAMPLER model") + def forward_wrap_audio_upsampler(self, tokens: torch.Tensor): tokens = tokens.permute(1, 0, 2) # (B, nq, L) -> (nq, B, L) vq_out_feats = self.rvq.decode_codes(tokens) vq_out_feats = vq_out_feats.transpose(1, 2) print(f"vq_out_feats shape: {vq_out_feats.shape[1]}") - vq_out_length = torch.tensor( - [vq_out_feats.size(1)], dtype=torch.long, device=vq_out_feats.device - ) + vq_out_length = torch.tensor([vq_out_feats.size(1)], dtype=torch.long, device=vq_out_feats.device) vq_out_feats, vq_out_length = self.upsample(vq_out_feats, vq_out_length) return vq_out_feats, vq_out_length @@ -774,8 +783,7 @@ def forward_wrap_audio_upsampler(self, tokens: torch.Tensor): cleanup_torchscript_cache() gc.collect() print("✅ AUDIO_UPSAMPLER model successfully converted") - - + if not (model_path / AUDIO_DECODER_PATH).exists(): print("⌛ Convert AUDIO_DECODER model") example_input = { @@ -789,11 +797,13 @@ def forward_wrap_audio_upsampler(self, tokens: torch.Tensor): cleanup_torchscript_cache() gc.collect() print("✅ AUDIO_DECODER model successfully converted") - + if not (model_path / AUDIO_ENCODER_PATH).exists(): print("⌛ Convert AUDIO_ENCODER model") + def forward_wrap_audio_encoder(self, audio16k: torch.Tensor): return self._encode_one_batch(audio16k) + pt_model._audio_tokenizer._orig_forward = pt_model._audio_tokenizer.forward pt_model._audio_tokenizer.forward = types.MethodType(forward_wrap_audio_encoder, pt_model._audio_tokenizer) @@ -806,19 +816,18 @@ def forward_wrap_audio_encoder(self, audio16k: torch.Tensor): gc.collect() print("✅ AUDIO_ENCODER model successfully converted") - - if not (model_path / DECODER_MODEL_PATH).exists(): print("⌛ Convert DECODER_MODEL model") patch_cos_sin_cached_fp32(pt_model._model.decoder) if hasattr(pt_model._model.decoder, "model"): patch_cos_sin_cached_fp32(pt_model._model.decoder.model) + def forward_wrap_decoder( self, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[list[torch.FloatTensor]] = None, - inputs_embeds: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, step: Optional[torch.Tensor] = None, ): if past_key_values is not None: @@ -826,19 +835,15 @@ def 
forward_wrap_decoder( inputs_embeds_proj = self.projection(inputs_embeds) # print(f"decoder inputs: {inputs_embeds}") outputs = self.decoder( - attention_mask=attention_mask, - position_ids=position_ids, - past_key_values=past_key_values, - inputs_embeds=inputs_embeds_proj, - use_cache=True + attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds_proj, use_cache=True ) - + if past_key_values is not None: outputs["past_key_values"] = outputs["past_key_values"].to_legacy_cache() decoder_h = outputs.last_hidden_state ci_logits = torch.mm(decoder_h[:, -1, :], self.audio_head[step - 1]) return (ci_logits, outputs.past_key_values) - + num_pkv = pt_model._model.decoder.config.num_hidden_layers hidden_size = pt_model._model.decoder.config.hidden_size @@ -855,7 +860,6 @@ def forward_wrap_decoder( inputs_embeds = torch.randn((2, 2, hidden_size)) attention_mask = torch.ones([2, 4], dtype=torch.int64) position_ids = torch.arange(2).unsqueeze(0).expand(2, -1) - input_names = ["attention_mask", "position_ids"] output_names = ["logits"] @@ -875,8 +879,8 @@ def forward_wrap_decoder( } input_shapes = [ - ov.PartialShape([-1, -1]), # attention_mask - ov.PartialShape([-1, -1]), # position_ids (2D for code predictor) + ov.PartialShape([-1, -1]), # attention_mask + ov.PartialShape([-1, -1]), # position_ids (2D for code predictor) ] input_shapes += ( [ @@ -916,16 +920,16 @@ def forward_wrap_decoder( cleanup_torchscript_cache() gc.collect() - if not (model_path / BACKBONE_MODEL_PATH).exists(): print("⌛ Convert BACKBONE_MODEL model") - + patch_cos_sin_cached_fp32(pt_model._model.backbone) if hasattr(pt_model._model.backbone, "model"): patch_cos_sin_cached_fp32(pt_model._model.backbone.model) backbone_config = pt_model._model.backbone.config backbone_config.save_pretrained(model_path) + def forward_wrap_backbone( self, attention_mask: Optional[torch.Tensor] = None, @@ -936,11 +940,7 @@ def forward_wrap_backbone( if past_key_values is not None: past_key_values = DynamicCache.from_legacy_cache(past_key_values) outputs = self.backbone( - inputs_embeds=inputs_embeds, - attention_mask=attention_mask, - position_ids=position_ids, - past_key_values=past_key_values, - use_cache=True + inputs_embeds=inputs_embeds, attention_mask=attention_mask, position_ids=position_ids, past_key_values=past_key_values, use_cache=True ) if past_key_values is not None: outputs["past_key_values"] = outputs["past_key_values"].to_legacy_cache() @@ -960,7 +960,7 @@ def forward_wrap_backbone( 2, pt_model._model.backbone.config.hidden_size // pt_model._model.backbone.config.num_attention_heads, ) - + input_embeds = torch.randn((2, 2, hidden_size)) attention_mask = torch.ones([2, 4], dtype=torch.int64) position_ids = torch.arange(2).unsqueeze(0).expand(2, -1) @@ -974,7 +974,7 @@ def forward_wrap_backbone( input_names.extend([f"past_key_values.{i}.key", f"past_key_values.{i}.value"]) output_names.extend([f"present.{i}.key", f"present.{i}.value"]) input_names.extend(["inputs_embeds"]) - + example_input = { "attention_mask": attention_mask, "position_ids": position_ids, @@ -1032,7 +1032,8 @@ class Segment: speaker: str text: str audio: torch.Tensor - + + @dataclass class ModelArgs: backbone_flavor: str @@ -1043,6 +1044,7 @@ class ModelArgs: decoder_loss_weight: float use_text_loss: bool + class OVFireRedTTS2: def __init__(self, pretrained_dir, gen_type, device, codec_device="CPU"): self.device = device @@ -1070,7 +1072,7 @@ def __init__(self, pretrained_dir, gen_type, device, 
codec_device="CPU"): decoder_loss_weight=llm_config["llm_models"]["decoder_loss_weight"], use_text_loss=True, ) - + model_dir = Path(pretrained_dir) self.backbone = core.compile_model(model_dir / BACKBONE_MODEL_PATH, self.device).create_infer_request() self.decoder = core.compile_model(model_dir / DECODER_MODEL_PATH, self.device).create_infer_request() @@ -1084,7 +1086,7 @@ def __init__(self, pretrained_dir, gen_type, device, codec_device="CPU"): # ==== Load Qwen2.5 Text Tokenizer ==== self._text_tokenizer = AutoTokenizer.from_pretrained(pretrained_dir) print("[INFO] Text Tokenizer Loaded...") - + def encode( self, audio16k: torch.Tensor, @@ -1101,9 +1103,7 @@ def encode( """ if audio16k_length is None: assert audio16k.shape[0] == 1 - audio16k_length = torch.tensor( - [audio16k.shape[1]], dtype=torch.long, device=audio16k.device - ) + audio16k_length = torch.tensor([audio16k.shape[1]], dtype=torch.long, device=audio16k.device) CHUNK_SIZE = 6 * 16000 B, T = audio16k.shape @@ -1112,9 +1112,7 @@ def encode( batch_size_list = [] for i in range(B): # Remove extra paddings - one_audio_chunks = _pad_and_chunk( - audio16k[i : (i + 1), : audio16k_length[i]], CHUNK_SIZE - ) + one_audio_chunks = _pad_and_chunk(audio16k[i : (i + 1), : audio16k_length[i]], CHUNK_SIZE) audio16k_batch += one_audio_chunks batch_size_list.append(len(one_audio_chunks)) audio16k_batch = torch.cat(audio16k_batch, dim=0) @@ -1126,13 +1124,8 @@ def encode( token_batch.append(one_token_batch) token_batch = torch.cat(token_batch, dim=0) # Recover & concat - token_list = torch.split( - token_batch, batch_size_list, dim=0 - ) # [(B=1, nq, l), (B=3, nq, l), ...] - token_list = [ - torch.cat(token_ts.split(1, dim=0), dim=-1) # (B=1, nq, l) - for token_ts in token_list - ] + token_list = torch.split(token_batch, batch_size_list, dim=0) # [(B=1, nq, l), (B=3, nq, l), ...] 
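+        # The prompt audio was padded and cut into 6-second chunks above so the codec
+        # encoder could run on fixed-size batches. torch.split regroups the chunk tokens
+        # per input utterance, and the comprehension below concatenates each utterance's
+        # chunks back along the time axis before padding to a common length.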
+ token_list = [torch.cat(token_ts.split(1, dim=0), dim=-1) for token_ts in token_list] # (B=1, nq, l) # Pad tokens token = pad_sequence( [ts.squeeze(0).transpose(1, 0) for ts in token_list], @@ -1142,11 +1135,9 @@ def encode( 1, 2 ) # (B, nq, L) token_length = (audio16k_length / 1280).ceil().long() - token = token[ - ..., : token_length.max() - ] # Remove extra paddings (we pad to multiples of 6s) + token = token[..., : token_length.max()] # Remove extra paddings (we pad to multiples of 6s) return token, token_length - + def load_prompt_audio(self, audio_path) -> torch.Tensor: audio, audio_sr = torchaudio.load(audio_path) # Audio must be single channel @@ -1159,9 +1150,7 @@ def prepare_prompt(self, text, speaker, audio_path) -> Segment: audio_tensor = self.load_prompt_audio(audio_path) return Segment(text=text, speaker=speaker, audio=audio_tensor) - def _tokenize_text_segment( - self, text: str, speaker: str - ) -> Tuple[torch.Tensor, torch.Tensor]: + def _tokenize_text_segment(self, text: str, speaker: str) -> Tuple[torch.Tensor, torch.Tensor]: frame_tokens = [] frame_masks = [] @@ -1209,15 +1198,11 @@ def _tokenize_segment(self, segment: Segment) -> Tuple[torch.Tensor, torch.Tenso Returns: (seq_len,17), (seq_len, 17) """ - text_tokens, text_masks = self._tokenize_text_segment( - segment.text, segment.speaker - ) + text_tokens, text_masks = self._tokenize_text_segment(segment.text, segment.speaker) audio_tokens, audio_masks = self._tokenize_audio(segment.audio) - return torch.cat([text_tokens, audio_tokens], dim=0), torch.cat( - [text_masks, audio_masks], dim=0 - ) - + return torch.cat([text_tokens, audio_tokens], dim=0), torch.cat([text_masks, audio_masks], dim=0) + def generate_frame( self, tokens: torch.Tensor, @@ -1241,28 +1226,21 @@ def generate_frame( embeds = self._embed_tokens(tokens) masked_embeds = embeds * tokens_mask.unsqueeze(-1) h = masked_embeds.sum(dim=2) - backbone_attention_mask = torch.ones( - tokens.size(0), tokens.size(1), - dtype=torch.long, - device=tokens.device - ) # [batch, curr_seq_len] + backbone_attention_mask = torch.ones(tokens.size(0), tokens.size(1), dtype=torch.long, device=tokens.device) # [batch, curr_seq_len] backbone_position_ids = input_pos if self.backbone_past_len != 0: - backbone_attention_mask = torch.cat([ - torch.ones(tokens.size(0), self.backbone_past_len, dtype=torch.long, device=tokens.device), - backbone_attention_mask - ], dim=1) + backbone_attention_mask = torch.cat( + [torch.ones(tokens.size(0), self.backbone_past_len, dtype=torch.long, device=tokens.device), backbone_attention_mask], dim=1 + ) backbone_position_ids = backbone_position_ids[:, -tokens.shape[1] :] - + inputs = { "inputs_embeds": h, "attention_mask": backbone_attention_mask, "position_ids": backbone_position_ids, - "beam_idx": np.arange(h.shape[0], dtype=int) + "beam_idx": np.arange(h.shape[0], dtype=int), } - - self.backbone.start_async(inputs, share_inputs=True) self.backbone.wait() logits = self.backbone.get_tensor("logits").data @@ -1274,28 +1252,19 @@ def generate_frame( c0_embed = self._embed_audio(0, c0_sample) curr_h = torch.cat([last_h.unsqueeze(1), c0_embed], dim=1) curr_sample = c0_sample.clone() - curr_pos = ( - torch.arange(0, curr_h.size(1), device=curr_h.device) - .unsqueeze(0) - .repeat(curr_h.size(0), 1) - ) + curr_pos = torch.arange(0, curr_h.size(1), device=curr_h.device).unsqueeze(0).repeat(curr_h.size(0), 1) self.decoder.reset_state() # Set initial value for the next beam_idx input that will be used at the current iteration # and will be optionally 
updated by _reorder_cache at the next iterations if beam_search is used decoder_past_length = 0 for i in range(1, self.config.audio_num_codebooks): - decoder_attention_mask = torch.ones( - curr_h.size(0), curr_h.size(1), - dtype=torch.long, - device=curr_h.device - ) # [batch, curr_seq_len] + decoder_attention_mask = torch.ones(curr_h.size(0), curr_h.size(1), dtype=torch.long, device=curr_h.device) # [batch, curr_seq_len] decoder_position_ids = curr_pos # [batch, curr_seq_len] if decoder_past_length != 0: - decoder_attention_mask = torch.cat([ - torch.ones(curr_h.size(0), decoder_past_length, dtype=torch.long, device=curr_h.device), - decoder_attention_mask - ], dim=1) + decoder_attention_mask = torch.cat( + [torch.ones(curr_h.size(0), decoder_past_length, dtype=torch.long, device=curr_h.device), decoder_attention_mask], dim=1 + ) decoder_position_ids = decoder_position_ids[:, -curr_h.shape[1] :] inputs = { @@ -1303,7 +1272,7 @@ def generate_frame( "attention_mask": decoder_attention_mask, "position_ids": decoder_position_ids, "beam_idx": np.arange(curr_h.shape[0], dtype=int), - "step": torch.tensor(i).to(dtype=torch.int32) + "step": torch.tensor(i).to(dtype=torch.int32), } self.decoder.start_async(inputs, share_inputs=True) @@ -1329,10 +1298,7 @@ def _embed_audio(self, codebook: int, tokens: torch.Tensor) -> torch.Tensor: def _embed_tokens(self, tokens: torch.Tensor) -> torch.Tensor: text_embeds = torch.from_numpy(self.text_embeddings(tokens[:, :, -1])[0]).unsqueeze(-2) - audio_tokens = tokens[:, :, :-1] + ( - self.config.audio_vocab_size - * torch.arange(self.config.audio_num_codebooks, device=tokens.device) - ) + audio_tokens = tokens[:, :, :-1] + (self.config.audio_vocab_size * torch.arange(self.config.audio_num_codebooks, device=tokens.device)) audio_embeds = torch.from_numpy(self.audio_embeddings(audio_tokens.view(-1))[0]).reshape( tokens.size(0), tokens.size(1), self.config.audio_num_codebooks, -1 ) @@ -1357,9 +1323,7 @@ def generate( tokens.append(segment_tokens) tokens_mask.append(segment_tokens_mask) - gen_segment_tokens, gen_segment_tokens_mask = self._tokenize_text_segment( - text, speaker - ) + gen_segment_tokens, gen_segment_tokens_mask = self._tokenize_text_segment(text, speaker) tokens.append(gen_segment_tokens) tokens_mask.append(gen_segment_tokens_mask) @@ -1369,30 +1333,22 @@ def generate( samples = [] curr_tokens = prompt_tokens.unsqueeze(0) curr_tokens_mask = prompt_tokens_mask.unsqueeze(0) - curr_pos = ( - torch.arange(0, prompt_tokens.size(0)).unsqueeze(0).long() - ) + curr_pos = torch.arange(0, prompt_tokens.size(0)).unsqueeze(0).long() max_seq_len = 3100 max_context_len = max_seq_len - max_generation_len if curr_tokens.size(1) >= max_context_len: - raise ValueError( - f"Inputs too long, must be below max_seq_len - max_generation_len: {max_context_len}" - ) + raise ValueError(f"Inputs too long, must be below max_seq_len - max_generation_len: {max_context_len}") for _ in range(max_generation_len): - sample = self.generate_frame( - curr_tokens, curr_tokens_mask, curr_pos, temperature, topk - ) + sample = self.generate_frame(curr_tokens, curr_tokens_mask, curr_pos, temperature, topk) # eos if torch.all(sample == 0): break samples.append(sample) - curr_tokens = torch.cat( - [sample, torch.zeros(1, 1).long()], dim=1 - ).unsqueeze(1) + curr_tokens = torch.cat([sample, torch.zeros(1, 1).long()], dim=1).unsqueeze(1) curr_tokens_mask = torch.cat( [ torch.ones_like(sample).bool(), @@ -1405,11 +1361,7 @@ def generate( vq_out_feats, _ = torch.from_numpy(vq_out[0]), 
torch.from_numpy(vq_out[1]) vq_out_length = torch.tensor([vq_out_feats.shape[1]], dtype=torch.long) audio = torch.from_numpy(self.audio_decoder([vq_out_feats, vq_out_length])[0]) - audio = ( - audio - .squeeze(0) - .squeeze(0) - ) + audio = audio.squeeze(0).squeeze(0) return audio @@ -1434,11 +1386,7 @@ def generate_dialogue( prompt_text = prompt_text_list[i] speaker = prompt_text[:4] assert speaker in ["[S1]", "[S2]", "[S3]", "[S4]"] - prompt_segments.append( - self.prepare_prompt( - text=prompt_text, speaker=speaker, audio_path=prompt_wav - ) - ) + prompt_segments.append(self.prepare_prompt(text=prompt_text, speaker=speaker, audio_path=prompt_wav)) for text in tqdm(text_list): speaker = text[:4] @@ -1457,18 +1405,12 @@ def generate_dialogue( ) # 做上下文管理的时候需要将audio 转到16k - audio_16k = torchaudio.functional.resample( - audio_tensor.unsqueeze(0), 24000, 16000 - ) - all_generated_segments.append( - Segment(text=text, speaker=speaker, audio=audio_16k) - ) + audio_16k = torchaudio.functional.resample(audio_tensor.unsqueeze(0), 24000, 16000) + all_generated_segments.append(Segment(text=text, speaker=speaker, audio=audio_16k)) - all_storage_segments.append( - Segment(text=text, speaker=speaker, audio=audio_tensor.unsqueeze(0)) - ) + all_storage_segments.append(Segment(text=text, speaker=speaker, audio=audio_tensor.unsqueeze(0))) # Concatenate all generations all_audio = torch.cat([seg.audio for seg in all_storage_segments], dim=1) all_audio = all_audio.cpu() - return all_audio \ No newline at end of file + return all_audio From ac40f154809428d2cbe3db70dc4690b6787a1045 Mon Sep 17 00:00:00 2001 From: ethan Date: Thu, 13 Nov 2025 21:06:51 -0800 Subject: [PATCH 03/14] add picture --- notebooks/fireredtts2/README.md | 2 ++ notebooks/fireredtts2/fireredtts2.ipynb | 8 +++----- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/notebooks/fireredtts2/README.md b/notebooks/fireredtts2/README.md index 80728c602da..e91d6c5f52f 100644 --- a/notebooks/fireredtts2/README.md +++ b/notebooks/fireredtts2/README.md @@ -24,6 +24,8 @@ In this demonstration, you'll create interactive assistant that can answer quest The images bellow illustrates example of voice cloning and dialogue generation. +image + ## Installation instructions This is a self-contained example that relies solely on its own code.
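For readers wondering where the 12.5 Hz figure in the feature list comes from, the constants scattered through `ov_fireredtts_helper.py` (`1280`, `80`, the 6-second chunk size) all encode the same frame rate. The snippet below is only an editorial sanity check of that arithmetic, not part of the notebook:

```python
# 12.5 Hz tokenizer frame math, matching the constants used in ov_fireredtts_helper.py.
import math

sample_rate = 16_000           # prompt audio is resampled to 16 kHz before encoding
samples_per_frame = 1280       # token_length = (audio16k_length / 1280).ceil()
frame_ms = 1000 * samples_per_frame / sample_rate    # 80.0 ms per token frame
frames_per_second = sample_rate / samples_per_frame  # 12.5 frames per second

# generate() converts its time budget into a frame budget the same way:
max_audio_length_ms = 30_000
max_generation_len = int(max_audio_length_ms / frame_ms)  # 375 frames for 30 s of audio

# each 6-second encoder chunk therefore produces 75 token frames
assert math.ceil(6 * sample_rate / samples_per_frame) == 75
print(frame_ms, frames_per_second, max_generation_len)
```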
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. diff --git a/notebooks/fireredtts2/fireredtts2.ipynb b/notebooks/fireredtts2/fireredtts2.ipynb index 6b5c3cf890a..60987046a5f 100644 --- a/notebooks/fireredtts2/fireredtts2.ipynb +++ b/notebooks/fireredtts2/fireredtts2.ipynb @@ -331,7 +331,7 @@ "version": "3.10.12" }, "openvino_notebooks": { - "imageUrl": "https://github.com/user-attachments/assets/0d83b369-b8fc-423e-bc53-495022555e8c", + "imageUrl": "https://github.com/user-attachments/assets/a7512db5-78cd-4379-956b-893c13534862", "tags": { "categories": [ "Model Demos", @@ -340,10 +340,8 @@ "libraries": [], "other": [], "tasks": [ - "Visual Question Answering", - "Image-to-Text", - "Text Generation", - "Text-to-Image" + "Text-to-Audio", + "Text-to-Speech" ] } }, From d24b2843b0fb0520a1c7cda7e113d6cad743469c Mon Sep 17 00:00:00 2001 From: ethan Date: Thu, 13 Nov 2025 21:25:34 -0800 Subject: [PATCH 04/14] fix spelling --- notebooks/fireredtts2/fireredtts2.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/fireredtts2/fireredtts2.ipynb b/notebooks/fireredtts2/fireredtts2.ipynb index 60987046a5f..89667fb0476 100644 --- a/notebooks/fireredtts2/fireredtts2.ipynb +++ b/notebooks/fireredtts2/fireredtts2.ipynb @@ -11,7 +11,7 @@ "- **Long Conversational Speech Generation**: It currently supports 3 minutes dialogues with 4 speakers and can be easily scaled to longer conversations\n", "with more speakers by extending training corpus.\n", "- **Multilingual Support**: It supports multiple languages including English, Chinese, Japanese, Korean, French, German, and Russian. Support zero-shot voice cloning for cross-lingual and code-switching scenarios.\n", - "- **Ultra-Low Latency**: Building on the new **12.5Hz streaming** speech tokenizer, we employ a dual-transformer architecture that operates on a text–speech interleaved sequence, enabling flexible sentence-bysentence generation and reducing first-packet latency,Specifically, on an L20 GPU, our first-packet latency as low as 140ms while maintaining high-quality audio output.\n", + "- **Ultra-Low Latency**: Building on the new **12.5Hz streaming** speech tokenizer, we employ a dual-transformer architecture that operates on a text–speech interleaved sequence, enabling flexible sentence-by-sentence generation and reducing first-packet latency,Specifically, on an L20 GPU, our first-packet latency as low as 140ms while maintaining high-quality audio output.\n", "- **Strong Stability**:Our model achieves high similarity and low WER/CER in both monologue and dialogue tests.\n", "- **Random Timbre Generation**:Useful for creating ASR/speech interaction data.\n", "\n", From 47114c92f721947268bae8de45c68ac926d9ada3 Mon Sep 17 00:00:00 2001 From: ethan Date: Thu, 13 Nov 2025 21:26:03 -0800 Subject: [PATCH 05/14] fix spelling --- .ci/spellcheck/.pyspelling.wordlist.txt | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt index 5c666227d8b..7c53d65dae3 100644 --- a/.ci/spellcheck/.pyspelling.wordlist.txt +++ b/.ci/spellcheck/.pyspelling.wordlist.txt @@ -91,6 +91,7 @@ BLACKBOX boolean CatVTON CausVid +CER CentOS centric CFG @@ -300,6 +301,7 @@ feedforward FeedForward FFN FFmpeg +FireRedTTS FIL FEIL finetuned From cf6d83ab5f06b3f41b7b6a864062bd8275cd6b74 Mon Sep 17 00:00:00 2001 From: ethan Date: Thu, 13 Nov 2025 23:22:42 -0800 Subject: [PATCH 06/14] fix spelling --- notebooks/fireredtts2/README.md | 2 +- 
1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/fireredtts2/README.md b/notebooks/fireredtts2/README.md index e91d6c5f52f..2d0213d8bd3 100644 --- a/notebooks/fireredtts2/README.md +++ b/notebooks/fireredtts2/README.md @@ -4,7 +4,7 @@ FireRedTTS‑2 is a long-form streaming TTS system for multi-speaker dialogue ge - **Long Conversational Speech Generation**: It currently supports 3 minutes dialogues with 4 speakers and can be easily scaled to longer conversations with more speakers by extending training corpus. - **Multilingual Support**: It supports multiple languages including English, Chinese, Japanese, Korean, French, German, and Russian. Support zero-shot voice cloning for cross-lingual and code-switching scenarios. -- **Ultra-Low Latency**: Building on the new **12.5Hz streaming** speech tokenizer, we employ a dual-transformer architecture that operates on a text–speech interleaved sequence, enabling flexible sentence-bysentence generation and reducing first-packet latency,Specifically, on an L20 GPU, our first-packet latency as low as 140ms while maintaining high-quality audio output. +- **Ultra-Low Latency**: Building on the new **12.5Hz streaming** speech tokenizer, we employ a dual-transformer architecture that operates on a text–speech interleaved sequence, enabling flexible sentence-by-sentence generation and reducing first-packet latency,Specifically, on an L20 GPU, our first-packet latency as low as 140ms while maintaining high-quality audio output. - **Strong Stability**:Our model achieves high similarity and low WER/CER in both monologue and dialogue tests. - **Random Timbre Generation**:Useful for creating ASR/speech interaction data. From a1ada99bd14fa7308f30b9b9788aac93bb67c6bf Mon Sep 17 00:00:00 2001 From: ethan Date: Wed, 19 Nov 2025 20:10:56 -0800 Subject: [PATCH 07/14] add skip for macos --- .ci/skipped_notebooks.yml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/.ci/skipped_notebooks.yml b/.ci/skipped_notebooks.yml index 245627d14f2..a74fe91f623 100644 --- a/.ci/skipped_notebooks.yml +++ b/.ci/skipped_notebooks.yml @@ -536,3 +536,7 @@ skips: - os: - macos-13 +- notebook: notebooks/fireredtts2/fireredtts2.ipynb + skips: + - os: + - macos-13 From 34666972deb45ff68bc2d09331e0e583bfea4ee3 Mon Sep 17 00:00:00 2001 From: ethan Date: Wed, 26 Nov 2025 03:54:29 -0800 Subject: [PATCH 08/14] skip all tests --- .ci/ignore_treon_docker.txt | 3 ++- .ci/skipped_notebooks.yml | 2 ++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/.ci/ignore_treon_docker.txt b/.ci/ignore_treon_docker.txt index 6373cdcb838..91a020ba448 100644 --- a/.ci/ignore_treon_docker.txt +++ b/.ci/ignore_treon_docker.txt @@ -78,4 +78,5 @@ notebooks/qwen2.5-omni-chatbot/qwen2.5-omni-chatbot.ipynb notebooks/intern-video2-classiciation/intern-video2-classification.ipynb notebooks/flex.2-image-generation/flex.2-image-generation.ipynb notebooks/wan2.1-text-to-video/wan2.1-text-to-video.ipynb -notebooks/ace-step-music-generation/ace-step-music-generation.ipynb \ No newline at end of file +notebooks/ace-step-music-generation/ace-step-music-generation.ipynb +notebooks/fireredtts2/fireredtts2.ipynb \ No newline at end of file diff --git a/.ci/skipped_notebooks.yml b/.ci/skipped_notebooks.yml index 29d4dc7c491..a130b1a8470 100644 --- a/.ci/skipped_notebooks.yml +++ b/.ci/skipped_notebooks.yml @@ -542,6 +542,8 @@ skips: - os: - macos-13 + - ubuntu-22.04 + - windows-2022 - notebook: notebooks/qwen3-vl/qwen3-vl.ipynb skips: - os: From ed9163a0b65800663b0be080fdfbc6c07599250b Mon Sep 17 
00:00:00 2001 From: ethan Date: Thu, 27 Nov 2025 16:47:14 -0800 Subject: [PATCH 09/14] update model download method --- notebooks/fireredtts2/fireredtts2.ipynb | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/notebooks/fireredtts2/fireredtts2.ipynb b/notebooks/fireredtts2/fireredtts2.ipynb index 89667fb0476..d03e136c1c0 100644 --- a/notebooks/fireredtts2/fireredtts2.ipynb +++ b/notebooks/fireredtts2/fireredtts2.ipynb @@ -126,6 +126,7 @@ " \"nncf\",\n", " \"openvino>=2025.3.0\",\n", " \"gradio\",\n", + " \"huggingface_hub\",\n", ")\n", "\n", "repo_dir = Path(\"FireRedTTS2\")\n", @@ -165,15 +166,14 @@ "outputs": [], "source": [ "from ov_fireredtts_helper import convert_fireredtts2\n", - "\n", "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n", "from notebook_utils import collect_telemetry\n", + "from huggingface_hub import snapshot_download\n", "\n", "collect_telemetry(\"fireredtts2.ipynb\")\n", "\n", - "pt_model_path = Path(\"pretrained_models\")\n", - "if not pt_model_path.exists():\n", - " !git clone https://huggingface.co/FireRedTeam/FireRedTTS2 pretrained_models\n", + "pt_model_path = \"pretrained_models\"\n", + "snapshot_download(repo_id=\"FireRedTeam/FireRedTTS2\", local_dir=Path(pt_model_path))\n", "\n", "model_path = \"FireRedTTS2-ov\"\n", "convert_fireredtts2(pt_model_path, model_path)" @@ -201,7 +201,7 @@ "source": [ "from notebook_utils import device_widget\n", "\n", - "device = device_widget(\"CPU\", [\"NPU\"])\n", + "device = device_widget(\"CPU\", exclude=[\"NPU\"])\n", "\n", "device" ] From ffba9fa83089ee736fd4886476fdb06efb3325b5 Mon Sep 17 00:00:00 2001 From: ethan Date: Fri, 28 Nov 2025 06:38:49 -0800 Subject: [PATCH 10/14] add model descriptions --- notebooks/fireredtts2/fireredtts2.ipynb | 126 ++++++++++++++++-- notebooks/fireredtts2/ov_fireredtts_helper.py | 1 - 2 files changed, 113 insertions(+), 14 deletions(-) diff --git a/notebooks/fireredtts2/fireredtts2.ipynb b/notebooks/fireredtts2/fireredtts2.ipynb index d03e136c1c0..777301a30ba 100644 --- a/notebooks/fireredtts2/fireredtts2.ipynb +++ b/notebooks/fireredtts2/fireredtts2.ipynb @@ -103,6 +103,13 @@ "name": "stderr", "output_type": "stream", "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "\u001b[33m DEPRECATION: Legacy editable install of fireredtts2==0.1 from file:///home/ethan/intel/openvino_notebooks/notebooks/fireredtts2/FireRedTTS2 (setup.py develop) is deprecated. pip 25.3 will enforce this behaviour change. A possible replacement is to add a pyproject.toml or enable --use-pep517, and use setuptools >= 64. If the resulting installation is not behaving as expected, try using --config-settings editable_mode=compat. Please consult the setuptools documentation for more information. 
Discussion can be found at https://github.com/pypa/pip/issues/11457\u001b[0m\u001b[33m\n", + "\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", "\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.1.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" @@ -114,7 +121,7 @@ "from pip_helper import pip_install\n", "import platform\n", "\n", - "!pip uninstall -y FireRedTTS2\n", + "!pip uninstall -y fireredtts2\n", "\n", "pip_install(\n", " \"-q\",\n", @@ -154,16 +161,58 @@ "## Convert and Optimize model\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", - " Janus is PyTorch model. OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR). [OpenVINO model conversion API](https://docs.openvino.ai/2024/openvino-workflow/model-preparation.html#convert-a-model-with-python-convert-model) should be used for these purposes. `ov.convert_model` function accepts original PyTorch model instance and example input for tracing and returns `ov.Model` representing this model in OpenVINO framework. Converted model can be used for saving on disk using `ov.save_model` function or directly loading on device using `core.complie_model`. \n", + "FireRedTTS2 is a PyTorch model. OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR). [OpenVINO model conversion API](https://docs.openvino.ai/2024/openvino-workflow/model-preparation.html#convert-a-model-with-python-convert-model) should be used for these purposes. `ov.convert_model` function accepts original PyTorch model instance and example input for tracing and returns `ov.Model` representing this model in OpenVINO framework. Converted model can be used for saving on disk using `ov.save_model` function or directly loading on device using `core.compile_model`.\n", + "\n", + "The script `ov_fireredtts_helper.py` contains helper function for model conversion.\n", "\n", - "The script `ov_firetts_helper.py` contains helper function for model conversion." 
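As a quick illustration of the conversion flow this markdown cell describes, the sketch below converts a toy PyTorch module with `ov.convert_model`, saves the IR, and compiles it. The `torch.nn.Linear` stand-in and the output file name are placeholders only, not part of the notebook; `ov_fireredtts_helper.py` performs the equivalent steps for the FireRedTTS-2 components listed below.

```python
import openvino as ov
import torch

# Minimal sketch of the convert -> save -> compile flow (placeholder module, not a FireRedTTS-2 component).
pt_module = torch.nn.Linear(16, 8).eval()
example_input = torch.zeros(1, 16)

ov_model = ov.convert_model(pt_module, example_input=example_input)  # trace PyTorch module and convert to ov.Model
ov.save_model(ov_model, "toy_submodule.xml")                         # serialize the IR to disk
compiled = ov.Core().compile_model(ov_model, "CPU")                  # or reload the saved IR later and compile
print(compiled(example_input.numpy())[0].shape)                      # (1, 8)
```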
+ "**Model Components:**\n", + "\n", + "- `openvino_text_embeddings_model` - Converts text tokens into embedding vectors for the language model processing\n", + "- `openvino_audio_embeddings_model` - Converts audio codebook tokens into embedding vectors for audio processing\n", + "- `openvino_audio_decoder_model` - Decodes acoustic features into audio waveforms from the encoded representations\n", + "- `openvino_audio_upsampler_model` - Upsamples audio tokens through RVQ decoding to generate higher quality audio features\n", + "- `openvino_audio_encoder_model` - Encodes raw audio waveforms into compressed token representations\n", + "- `openvino_decoder_model` - Transformer-based decoder that generates subsequent audio codebook levels (codebooks 1-15) autoregressively\n", + "- `openvino_backbone_model` - Main transformer backbone that processes text and audio embeddings to generate the first level audio codebook (codebook 0) and contextual representations\n" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "cc2a94a13e85433497f7229fb2abaa33", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 17 files: 0%| | 0/17 [00:00\n", + " \n", + " Your browser does not support the audio element.\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "import IPython\n", "\n", diff --git a/notebooks/fireredtts2/ov_fireredtts_helper.py b/notebooks/fireredtts2/ov_fireredtts_helper.py index 30e2f6f5f04..7b35291502f 100644 --- a/notebooks/fireredtts2/ov_fireredtts_helper.py +++ b/notebooks/fireredtts2/ov_fireredtts_helper.py @@ -767,7 +767,6 @@ def forward_wrap_audio_upsampler(self, tokens: torch.Tensor): tokens = tokens.permute(1, 0, 2) # (B, nq, L) -> (nq, B, L) vq_out_feats = self.rvq.decode_codes(tokens) vq_out_feats = vq_out_feats.transpose(1, 2) - print(f"vq_out_feats shape: {vq_out_feats.shape[1]}") vq_out_length = torch.tensor([vq_out_feats.size(1)], dtype=torch.long, device=vq_out_feats.device) vq_out_feats, vq_out_length = self.upsample(vq_out_feats, vq_out_length) return vq_out_feats, vq_out_length From 4e29392a5cd103a0c397f649952fdf80567e47c9 Mon Sep 17 00:00:00 2001 From: ethan Date: Mon, 1 Dec 2025 21:48:08 -0800 Subject: [PATCH 11/14] update spelling list --- .ci/spellcheck/.pyspelling.wordlist.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt index 5d8f1cc85d7..3ba62b449a1 100644 --- a/.ci/spellcheck/.pyspelling.wordlist.txt +++ b/.ci/spellcheck/.pyspelling.wordlist.txt @@ -914,6 +914,7 @@ Ruizhongtai Runtime runtime runtimes +RVQ Safetensors SageMaker sagittal From 382f0ead35c916c424e3699040e5f320bea507e5 Mon Sep 17 00:00:00 2001 From: ethan Date: Mon, 1 Dec 2025 21:58:50 -0800 Subject: [PATCH 12/14] update --- notebooks/fireredtts2/fireredtts2.ipynb | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/notebooks/fireredtts2/fireredtts2.ipynb b/notebooks/fireredtts2/fireredtts2.ipynb index 777301a30ba..aa66e9358db 100644 --- a/notebooks/fireredtts2/fireredtts2.ipynb +++ b/notebooks/fireredtts2/fireredtts2.ipynb @@ -82,7 +82,12 @@ "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n", "from notebook_utils import collect_telemetry\n", "\n", - 
"collect_telemetry(\"firetts2.ipynb\")" + "collect_telemetry(\"firetts2.ipynb\")\n", + "\n", + "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n", + "from notebook_utils import collect_telemetry\n", + "\n", + "collect_telemetry(\"fireredtts2.ipynb\")" ] }, { @@ -215,6 +220,7 @@ ], "source": [ "from ov_fireredtts_helper import convert_fireredtts2\n", + "\n", "# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry\n", "from notebook_utils import collect_telemetry\n", "from huggingface_hub import snapshot_download\n", From aaca5cfdcf5af2ff7e79ad48a28c7629865ee60f Mon Sep 17 00:00:00 2001 From: Ethan Yang Date: Tue, 2 Dec 2025 14:13:57 +0800 Subject: [PATCH 13/14] Update notebooks/fireredtts2/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- notebooks/fireredtts2/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/fireredtts2/README.md b/notebooks/fireredtts2/README.md index 2d0213d8bd3..921df8b8b1b 100644 --- a/notebooks/fireredtts2/README.md +++ b/notebooks/fireredtts2/README.md @@ -20,7 +20,7 @@ The tutorial consists from following steps: - Run OpenVINO model inference - Launch Interactive demo -In this demonstration, you'll create interactive assistant that can answer questions about provided image's content or generate images based on text instructions. +In this demonstration, you'll create an interactive assistant that can generate multi-speaker dialogues, perform voice cloning, and synthesize natural speech using FireRedTTS-2 and OpenVINO. The images bellow illustrates example of voice cloning and dialogue generation. From 29119f88cabdb59622559c1a6868b3c0ae526295 Mon Sep 17 00:00:00 2001 From: Ethan Yang Date: Tue, 2 Dec 2025 14:14:21 +0800 Subject: [PATCH 14/14] Update notebooks/fireredtts2/gradio_helper.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- notebooks/fireredtts2/gradio_helper.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/notebooks/fireredtts2/gradio_helper.py b/notebooks/fireredtts2/gradio_helper.py index fa8da5d51de..bc25f268524 100644 --- a/notebooks/fireredtts2/gradio_helper.py +++ b/notebooks/fireredtts2/gradio_helper.py @@ -8,17 +8,17 @@ examples = [ [ "English", - "FireRedTTS2\examples\chat_prompt\en\S1.flac", + "FireRedTTS2/examples/chat_prompt/en/S1.flac", "[S1]I think we should just talk about what happened and move on because there's going to be other jousts and Sir Saif isn't done yet. It's not, he's not, it's not done yet.", - "examples\chat_prompt\en\S2.flac", + "examples/chat_prompt/en/S2.flac", "[S2]You know, maybe sorry, maybe maybe I pushed, maybe I pushed too hard. I was really excited. I didn't mean to make you snap.", "[S1]It's alright, we'll take a breath and plan the next pass together.[S2]Yeah, thanks. 
We'll get it right this time.[S1]Let's review our signals tonight so we're in sync on the field tomorrow.",
    ],
    [
        "中文",
-        "FireRedTTS2\examples\chat_prompt\zh\S1.flac",
+        "FireRedTTS2/examples/chat_prompt/zh/S1.flac",
        "[S1]啊,可能说更适合美国市场应该是什么样子。那这这个可能说当然如果说有有机会能亲身的去考察去了解一下,那当然是有更好的帮助。",
-        "examples\chat_prompt\zh\S2.flac",
+        "examples/chat_prompt/zh/S2.flac",
        "[S2]比如具体一点的,他觉得最大的一个跟他预想的不一样的是在什么地方。",
        "[S1]那可能说对对,没有去过美国来说去去看到美国线下。巴斯曼也好,沃尔玛也好,他们线下不管说,因为深圳出去的还是电子周边的会表达,会发现哇对这个价格真的是很高呀。都是卖三十五美金、四十美金,甚至一个手机壳,就是二十五美金开。[S2]对,没错,我每次都觉得不不可思议。我什么人会买三五十美金的手机壳?但是其实在在那个target啊,就塔吉特这种超级市场,大家都是这样的,定价也很多人买。",
    ],
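Each example row above bundles, per speaker, a reference audio path and its transcript, followed by the target dialogue written as `[S1]`/`[S2]`-tagged text. The snippet below is a rough, self-contained sketch of how such tagged text can be split into per-speaker turns before synthesis; the `split_dialogue` helper is hypothetical and is not part of `gradio_helper.py`.

```python
import re

def split_dialogue(text: str) -> list[tuple[str, str]]:
    """Split a "[S1]...[S2]..."-tagged string into (speaker, utterance) pairs."""
    parts = re.split(r"(\[S\d+\])", text)  # keep the speaker tags as separate tokens
    return [(tag.strip("[]"), utt.strip()) for tag, utt in zip(parts[1::2], parts[2::2])]

dialogue = "[S1]It's alright, we'll take a breath and plan the next pass together.[S2]Yeah, thanks."
print(split_dialogue(dialogue))
# [('S1', "It's alright, we'll take a breath and plan the next pass together."), ('S2', 'Yeah, thanks.')]
```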