Commit f458e3a

modified healthcare notebooks for 25.10
1 parent 8b9be04 commit f458e3a

6 files changed

+2515
-10
lines changed

nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb

Lines changed: 15 additions & 10 deletions
@@ -11,18 +11,20 @@
 "\n",
 "#### 📚 What you'll learn\n",
 "\n",
-"The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic dataset of W-2 forms (US Wage & Tax Statements).\n",
+"The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic\\\n",
+" dataset of W-2 forms (US Wage & Tax Statements).\n",
 "\n",
 "- We will use generate numerical fields using [statistics published by the IRS](https://www.irs.gov/pub/irs-pdf/p5385.pdf) for the year 2021:\n",
 "\n",
-"- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics for generated persons reflect real-world census data.\n",
+"- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics\\\n",
+" for generated persons reflect real-world census data.\n",
 "\n",
 "\n",
 "<br>\n",
 "\n",
 "> 👋 **IMPORTANT** – Environment Setup\n",
 ">\n",
-"> - If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies.\n",
+"> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n",
 ">\n",
 "> - You may need to restart your notebook's kernel after setting up the environment.\n",
 "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n",
@@ -167,13 +169,14 @@
 "id": "bbcb3538",
 "metadata": {},
 "source": [
-"### 🎲 Setting Up Taxpayer and Employer Sampling\n",
+"## 🎲 Setting Up Taxpayer and Employer Sampling\n",
 "\n",
 "- Sampler columns offer non-LLM based generation of synthetic data.\n",
 "\n",
 "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n",
 "\n",
-"- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census. If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker"
+"- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n",
+" If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker"
 ]
 },
 {
@@ -212,7 +215,7 @@
 "id": "28397d74",
 "metadata": {},
 "source": [
-"### ⚡️ Defining the Fields\n",
+"## ⚡️ Defining the Fields\n",
 "\n",
 "We will focus on the following:\n",
 "- Box 1 (Wages, tips, and other compensation)\n",
@@ -231,7 +234,8 @@
 "\n",
 "### Numerical fields\n",
 "\n",
-"Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). We'll use the W-2 statistics from the IRS linked above to generate realistic samples."
+"Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). \\\n",
+"We'll use the W-2 statistics from the IRS linked above to generate realistic samples."
 ]
 },
 {
@@ -411,7 +415,8 @@
 "source": [
 "### 🦜 Non-numerical Fields\n",
 "\n",
-"The remaining fields contain information about the employee (taxpayer) and the employer. We'll use the person sampler in combination with an LLM to generate values here."
+"The remaining fields contain information about the employee (taxpayer) and the employer. \\\n",
+"We'll use the person sampler in combination with an LLM to generate values here."
 ]
 },
 {
@@ -574,7 +579,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"job_results = data_designer_client.create(config_builder, num_records=2)\n",
+"job_results = data_designer_client.create(config_builder, num_records=20)\n",
 "\n",
 "# This will block until the job is complete.\n",
 "job_results.wait_until_done()"
@@ -618,7 +623,7 @@
 "# Download the job artifacts and save them to disk.\n",
 "job_results.download_artifacts(\n",
 " output_path=TUTORIAL_OUTPUT_PATH,\n",
-" artifacts_folder_name=\"artifacts-community-contributions-w2-dataset\",\n",
+" artifacts_folder_name=\"artifacts-community-contributions-forms-w2-dataset\",\n",
 ");"
 ]
 }
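
The README link change in the first hunk (`../README.md` to `../../../README.md`) can be sanity-checked by resolving both relative links against the notebook path shown in the file header. The sketch below is only an illustration using Python's standard library; the helper name `resolve_link` is hypothetical, and whether a `README.md` actually exists at either resolved location is an assumption that this diff does not confirm.

```python
import posixpath

# Path of the notebook as shown in the file header of this diff.
NOTEBOOK_PATH = (
    "nemo/NeMo-Data-Designer/self-hosted-tutorials/"
    "community-contributions/forms/w2-dataset.ipynb"
)


def resolve_link(notebook_path: str, relative_link: str) -> str:
    """Resolve a relative Markdown link against the notebook's directory.

    Hypothetical helper for illustration; it only manipulates path strings
    and never touches the filesystem.
    """
    notebook_dir = posixpath.dirname(notebook_path)
    return posixpath.normpath(posixpath.join(notebook_dir, relative_link))


old_target = resolve_link(NOTEBOOK_PATH, "../README.md")
new_target = resolve_link(NOTEBOOK_PATH, "../../../README.md")

print("old link resolves to:", old_target)
print("new link resolves to:", new_target)
```

Under this reading, the old link points one directory above `forms/`, while the new link points three directories up, at the `NeMo-Data-Designer` level.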
