diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb
new file mode 100644
index 00000000..83da15e5
--- /dev/null
+++ b/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb
@@ -0,0 +1,504 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "630e3e17",
+   "metadata": {},
+   "source": [
+    "# 🎛️ NeMo Safe Synthesizer 101: Extrinsic Evaluation\n",
+    "\n",
+    "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n",
+    "\n",
+    "In this notebook, we build on the foundational concepts from the *NeMo Safe Synthesizer 101: Data Generation* notebook. While the first notebook focused on *how* to generate synthetic data, this one focuses on **how to measure its quality and utility** for real-world applications.\n",
+    "\n",
+    "We'll do this using a common method called **extrinsic evaluation**, which involves testing the synthetic data's performance on a downstream machine learning task.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 🎯 What is Extrinsic Evaluation?\n",
+    "\n",
+    "Extrinsic evaluation measures the **utility** of synthetic data by using it to train a model for a specific task. This contrasts with *intrinsic* evaluation, which might only measure the statistical similarity between the synthetic and real data.\n",
+    "\n",
+    "In this notebook, we'll use a **simple classification task** as our benchmark. The core idea is to answer the question:\n",
+    "\n",
+    "> \"Can a model trained **only** on our *synthetic data* achieve comparable performance to a model trained on the *real data*?\"\n",
+    "\n",
+    "If the answer is yes, it's a strong signal that our synthetic data has successfully captured the important patterns, relationships, and statistical properties of the original dataset. This is the \"Train-on-Synthetic, Test-on-Real\" (TSTR) approach."
+   ]
+  },
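+  {
+   "cell_type": "markdown",
+   "id": "1a2b3c4d",
+   "metadata": {},
+   "source": [
+    "Before running the full workflow, here is a minimal, self-contained sketch of the TSTR pattern on random toy arrays. It is purely illustrative (the arrays and scores are meaningless); the real comparison later in this notebook uses the clothing-reviews dataset.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e6f7a8b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Minimal TSTR sketch on random toy data (illustrative only, not the tutorial dataset).\n",
+    "import numpy as np\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import roc_auc_score\n",
+    "\n",
+    "rng = np.random.default_rng(0)\n",
+    "X_real, y_real = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)\n",
+    "X_synth, y_synth = rng.normal(size=(200, 3)), rng.integers(0, 2, size=200)\n",
+    "X_eval, y_eval = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)\n",
+    "\n",
+    "# Train one model per training source, then evaluate both on the SAME real holdout.\n",
+    "auc_real = roc_auc_score(y_eval, LogisticRegression().fit(X_real, y_real).predict_proba(X_eval)[:, 1])\n",
+    "auc_synth = roc_auc_score(y_eval, LogisticRegression().fit(X_synth, y_synth).predict_proba(X_eval)[:, 1])\n",
+    "print(f\"Train-on-real AUC: {auc_real:.3f} | Train-on-synthetic AUC: {auc_synth:.3f}\")\n"
+   ]
+  },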
+  {
+   "cell_type": "markdown",
+   "id": "8be84f5d",
+   "metadata": {},
+   "source": [
+    "#### 💾 Install dependencies\n",
+    "\n",
+    "**IMPORTANT** 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.\n"
+   ]
+  },
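+  {
+   "cell_type": "markdown",
+   "id": "9c0d1e2f",
+   "metadata": {},
+   "source": [
+    "The commented install command below is a sketch that assumes the SDK is published on PyPI as `nemo-microservices`; uncomment it and adjust the package name or index for your deployment.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a4b5c6d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Assumed package name -- verify against your deployment's documentation.\n",
+    "# %pip install nemo-microservices pandas\n"
+   ]
+  },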
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9f5d6f5a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "from nemo_microservices import NeMoMicroservices\n",
+    "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n",
+    "\n",
+    "# Keep log output quiet; httpx is the SDK's underlying HTTP client.\n",
+    "import logging\n",
+    "logging.basicConfig(level=logging.WARNING)\n",
+    "logging.getLogger(\"httpx\").setLevel(logging.WARNING)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53bb2807",
+   "metadata": {},
+   "source": [
+    "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n",
+    "\n",
+    "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n",
+    "- `http://localhost:8080` is the default URL for the client's `base_url` in the quickstart.\n",
+    "- If using a managed or remote deployment, ensure correct base URLs and tokens.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8c15ab93",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = NeMoMicroservices(\n",
+    "    base_url=\"http://localhost:8080\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "74d72ef7",
+   "metadata": {},
+   "source": [
+    "NeMo DataStore is launched as one of the platform services, and we'll use it to manage storage for job artifacts, so we'll set the following:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ab037a3a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "datastore_config = {\n",
+    "    \"endpoint\": \"http://localhost:3000/v1/hf\",\n",
+    "    \"token\": \"placeholder\",\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d66c819",
+   "metadata": {},
+   "source": [
+    "## 📥 Load input data\n",
+    "\n",
+    "Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.\n",
+    "\n",
+    "The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "daa955b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# !uv pip install kagglehub scikit-learn tabulate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7204f213",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import kagglehub\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Download the latest version of the dataset\n",
+    "path = kagglehub.dataset_download(\"nicapotato/womens-ecommerce-clothing-reviews\")\n",
+    "raw_df = pd.read_csv(f\"{path}/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n",
+    "raw_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c6c331b7",
+   "metadata": {},
+   "source": [
+    "We create a holdout dataset that will only be used for evaluating the final classifiers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "162876c3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "df, test_df = train_test_split(raw_df, test_size=0.2, random_state=42)\n",
+    "\n",
+    "print(f\"Original df length: {len(raw_df)}\")\n",
+    "print(f\"Training df length: {len(df)}\")\n",
+    "print(f\"Testing df length: {len(test_df)}\")"
+   ]
+  },
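+  {
+   "cell_type": "markdown",
+   "id": "7e8f9a0b",
+   "metadata": {},
+   "source": [
+    "As an added sanity check (not part of the original flow), it's worth inspecting the balance of `Recommended IND`, the label our downstream classifier will predict. Review datasets like this one tend to skew heavily toward positive recommendations, so plain accuracy can look flattering on its own.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0c1d2e3f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Inspect the target class balance in both splits.\n",
+    "print(\"Train split:\")\n",
+    "print(df['Recommended IND'].value_counts(normalize=True).round(3))\n",
+    "print(\"\\nHoldout split:\")\n",
+    "print(test_df['Recommended IND'].value_counts(normalize=True).round(3))\n"
+   ]
+  },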
+  {
+   "cell_type": "markdown",
+   "id": "87d72c68",
+   "metadata": {},
+   "source": [
+    "## 🏗️ Create a Safe Synthesizer job\n",
+    "\n",
+    "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n",
+    "\n",
+    "The following code creates and submits a job:\n",
+    "- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.\n",
+    "- `.from_data_source(df)`: set the input data source.\n",
+    "- `.with_datastore(datastore_config)`: configure model artifact storage.\n",
+    "- `.with_replace_pii()`: enable automatic replacement of PII.\n",
+    "- `.synthesize()`: train and generate synthetic data.\n",
+    "- `.with_generate(num_records=15000)`: request 15,000 synthetic records.\n",
+    "- `.create_job()`: submit the job to the platform.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85d9de56",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "job = (\n",
+    "    SafeSynthesizerBuilder(client)\n",
+    "    .from_data_source(df)\n",
+    "    .with_datastore(datastore_config)\n",
+    "    .with_replace_pii()\n",
+    "    .synthesize()\n",
+    "    .with_generate(num_records=15000)\n",
+    "    .create_job()\n",
+    ")\n",
+    "\n",
+    "print(f\"job_id = {job.job_id}\")\n",
+    "job.wait_for_completion()\n",
+    "\n",
+    "print(f\"Job finished with status {job.fetch_status()}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa2eacb2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# If your notebook shuts down, that's okay: your job is still running on the microservices platform.\n",
+    "# You can get the same job object and interact with it again by uncommenting the following code\n",
+    "# snippet and modifying it with the job id from the previous cell output.\n",
+    "\n",
+    "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n",
+    "# job = SafeSynthesizerJob(job_id=\"\", client=client)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "285d4a9d",
+   "metadata": {},
+   "source": [
+    "## 👀 View synthetic data\n",
+    "\n",
+    "After the job completes, fetch the generated synthetic dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f25574a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Fetch the synthetic data created by the job\n",
+    "synthetic_df = job.fetch_data()\n",
+    "synthetic_df\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b25f152",
+   "metadata": {},
+   "source": [
+    "## 📊 View evaluation report\n",
+    "\n",
+    "An evaluation comparing the synthetic data to the input data is performed automatically. You can:\n",
+    "\n",
+    "- **Inspect key scores**: overall synthetic data quality and privacy.\n",
+    "- **Download the full HTML report**: includes charts and detailed metrics.\n",
+    "- **Display the report inline**: useful when viewing in notebook environments.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7b691127",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Print selected information from the job summary\n",
+    "summary = job.fetch_summary()\n",
+    "print(\n",
+    "    f\"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}\"\n",
+    ")\n",
+    "print(f\"Data privacy score (0-10, higher is better): {summary.data_privacy_score}\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "39e62ea9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Download the full evaluation report to your local machine\n",
+    "job.save_report(\"evaluation_report.html\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "45f7e22b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Fetch and display the full evaluation report inline\n",
+    "# job.display_report_in_notebook()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd1e4925-3620-4b31-bc17-16f74d10fbb5",
+   "metadata": {},
+   "source": [
+    "## 🧪 Extrinsic Evaluation\n",
+    "\n",
+    "This section details the **extrinsic evaluation** process, where the quality of the synthetic data is assessed based on how well a model trained on it performs on a real-world task. This comparison is critical for validating the synthetic data's utility.\n",
+    "\n",
+    "- **Train Benchmark Model**: A model is trained on the **original training split** to establish a performance baseline.\n",
+    "- **Train Synthetic Model**: A second model, using the same structure, is trained on the **entire synthetic dataset**.\n",
+    "- **Compare Performance**: Both models are evaluated against the same **fixed holdout test set** ($X_{\\text{test}}, y_{\\text{test}}$).\n",
+    "- **Inspect Key Metrics**: The comparison focuses on metrics such as **ROC AUC**, **accuracy**, and per-class **precision and recall** to determine if the synthetic model performs comparably to the benchmark."
+   ]
+  },
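+  {
+   "cell_type": "markdown",
+   "id": "4f5a6b7c",
+   "metadata": {},
+   "source": [
+    "Before training, a quick sanity check (an addition to the original flow): the synthetic frame must expose the same columns as the real data, since we reuse one preprocessing pipeline for both, and its label balance should roughly match the real data's.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d9e0f1a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Verify column parity between real and synthetic data, and compare label balance.\n",
+    "missing = set(df.columns) - set(synthetic_df.columns)\n",
+    "extra = set(synthetic_df.columns) - set(df.columns)\n",
+    "print(f\"Columns missing from synthetic data: {missing or 'none'}\")\n",
+    "print(f\"Unexpected extra columns: {extra or 'none'}\")\n",
+    "print(\"\\nLabel balance (real vs. synthetic):\")\n",
+    "print(df['Recommended IND'].value_counts(normalize=True).round(3))\n",
+    "print(synthetic_df['Recommended IND'].value_counts(normalize=True).round(3))\n"
+   ]
+  },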
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "37b6df30-6627-4a40-8604-e905ada571b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define a scikit-learn pipeline for the classification task:\n",
+    "# TF-IDF for the review text, scaling for numeric columns, one-hot for categoricals.\n",
+    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "\n",
+    "X_train = df.drop('Recommended IND', axis=1)\n",
+    "y_train = df['Recommended IND']\n",
+    "\n",
+    "X_train['Review Text'] = X_train['Review Text'].fillna('')\n",
+    "X_train['Title'] = X_train['Title'].fillna('')\n",
+    "\n",
+    "X_test = test_df.drop('Recommended IND', axis=1)\n",
+    "y_test = test_df['Recommended IND']\n",
+    "\n",
+    "X_test['Review Text'] = X_test['Review Text'].fillna('')\n",
+    "X_test['Title'] = X_test['Title'].fillna('')\n",
+    "\n",
+    "text_features = ['Review Text']\n",
+    "numerical_features = ['Age', 'Rating', 'Positive Feedback Count']\n",
+    "categorical_features = ['Division Name', 'Department Name', 'Class Name']\n",
+    "\n",
+    "text_transformer = TfidfVectorizer(stop_words='english', max_features=5000)\n",
+    "numerical_transformer = StandardScaler()\n",
+    "categorical_transformer = OneHotEncoder(handle_unknown='ignore')\n",
+    "\n",
+    "preprocessor = ColumnTransformer(\n",
+    "    transformers=[\n",
+    "        # TfidfVectorizer expects a 1-D column of strings, so pass the column name, not a list.\n",
+    "        ('text', text_transformer, text_features[0]),\n",
+    "        ('num', numerical_transformer, numerical_features),\n",
+    "        ('cat', categorical_transformer, categorical_features)\n",
+    "    ],\n",
+    "    remainder='drop'\n",
+    ")\n",
+    "\n",
+    "model = LogisticRegression(solver='liblinear', random_state=42)\n",
+    "\n",
+    "full_pipeline = Pipeline(steps=[\n",
+    "    ('preprocessor', preprocessor),\n",
+    "    ('classifier', model)\n",
+    "])\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ee747c80-d42f-4ec5-b27b-2b2462436b92",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Train and evaluate a benchmark model pipeline, storing its performance metrics.\n",
+    "from sklearn.base import clone\n",
+    "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
+    "\n",
+    "# Clone so the unfitted template pipeline can be reused for the synthetic model below.\n",
+    "original_pipeline = clone(full_pipeline)\n",
+    "print(f\"\\n--- Training Benchmark Model on Original Data ({len(X_train)} rows) ---\")\n",
+    "original_pipeline.fit(X_train, y_train)\n",
+    "\n",
+    "y_pred_original = original_pipeline.predict(X_test)\n",
+    "y_prob_original = original_pipeline.predict_proba(X_test)[:, 1]\n",
+    "\n",
+    "results = {}\n",
+    "results['Original'] = {\n",
+    "    'Accuracy': accuracy_score(y_test, y_pred_original),\n",
+    "    'ROC AUC': roc_auc_score(y_test, y_prob_original),\n",
+    "    'Classification Report': classification_report(y_test, y_pred_original, output_dict=True)\n",
+    "}\n",
+    "print(\"Benchmark training and evaluation complete.\")\n"
+   ]
+  },
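+  {
+   "cell_type": "markdown",
+   "id": "2b3c4d5e",
+   "metadata": {},
+   "source": [
+    "If you want more than the headline numbers, the full per-class report (including F1-scores) is already stored in `results`; printing the human-readable version is a one-liner. This optional cell is an addition to the original flow.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6f7a8b9c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: show the benchmark's full per-class precision/recall/F1 breakdown.\n",
+    "print(classification_report(y_test, y_pred_original, digits=4))\n"
+   ]
+  },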
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cf3f1d59-8c46-4d84-b813-a4adf88a3422",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Train a new model pipeline on synthetic data and evaluate it against the test set.\n",
+    "from sklearn.base import clone\n",
+    "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
+    "\n",
+    "X_synthetic = synthetic_df.drop('Recommended IND', axis=1).fillna({'Review Text': '', 'Title': ''})\n",
+    "y_synthetic = synthetic_df['Recommended IND']\n",
+    "\n",
+    "synthetic_pipeline = clone(full_pipeline)\n",
+    "\n",
+    "print(\"\\n--- Training Model on Synthetic Data ---\")\n",
+    "synthetic_pipeline.fit(X_synthetic, y_synthetic)\n",
+    "\n",
+    "y_pred_synthetic = synthetic_pipeline.predict(X_test)\n",
+    "y_prob_synthetic = synthetic_pipeline.predict_proba(X_test)[:, 1]\n",
+    "\n",
+    "results['Synthetic'] = {\n",
+    "    'Accuracy': accuracy_score(y_test, y_pred_synthetic),\n",
+    "    'ROC AUC': roc_auc_score(y_test, y_prob_synthetic),\n",
+    "    'Classification Report': classification_report(y_test, y_pred_synthetic, output_dict=True)\n",
+    "}\n",
+    "print(\"Synthetic training and evaluation complete.\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d83e681e-aac2-44d0-83cb-1d93002a725d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Compare the performance of the original and synthetic models and print a summary.\n",
+    "import pandas as pd\n",
+    "\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "print(\" SIDE-BY-SIDE MODEL COMPARISON\")\n",
+    "print(f\" (Tested on {len(test_df)}-Row Holdout Set)\")\n",
+    "print(\"=\"*60)\n",
+    "\n",
+    "summary_data = {\n",
+    "    'Model': ['Original (Benchmark)', 'Synthetic'],\n",
+    "    'Train Size': [len(X_train), len(X_synthetic)],\n",
+    "    'Accuracy': [results['Original']['Accuracy'], results['Synthetic']['Accuracy']],\n",
+    "    'ROC AUC Score': [results['Original']['ROC AUC'], results['Synthetic']['ROC AUC']],\n",
+    "    'Precision (Class 1)': [results['Original']['Classification Report']['1']['precision'], results['Synthetic']['Classification Report']['1']['precision']],\n",
+    "    'Recall (Class 1)': [results['Original']['Classification Report']['1']['recall'], results['Synthetic']['Classification Report']['1']['recall']],\n",
+    "}\n",
+    "\n",
+    "summary_df = pd.DataFrame(summary_data).set_index('Model').T\n",
+    "summary_df.columns.name = 'Metric'\n",
+    "\n",
+    "print(summary_df.to_markdown(floatfmt=\".4f\"))\n",
+    "\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "\n",
+    "print(\"Key Finding:\")\n",
+    "if results['Synthetic']['ROC AUC'] >= results['Original']['ROC AUC']:\n",
+    "    print(\"The Synthetic Model performs AS WELL AS OR BETTER than the Original Benchmark.\")\n",
+    "else:\n",
+    "    print(\"The Synthetic Model's performance is lower than the Original Benchmark's.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "169f443d",
+   "metadata": {},
+   "source": [
+    "Your end result should look similar to this:\n",
+    "\n",
+    "| Metric              |   Original (Benchmark) |   Synthetic |\n",
+    "|:--------------------|-----------------------:|------------:|\n",
+    "| Train Size          |                 18,788 |      15,000 |\n",
+    "| Accuracy            |                 0.9404 |      0.9278 |\n",
+    "| ROC AUC Score       |                 0.9782 |      0.9762 |\n",
+    "| Precision (Class 1) |                 0.9626 |      0.9423 |\n",
+    "| Recall (Class 1)    |                 0.9646 |      0.9714 |\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9b55961-ddac-4d91-aa4d-9646fb72c7be",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "My Virtual Env",
+   "language": "python",
+   "name": "myenv"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}