|
11 | 11 | "\n", |
12 | 12 | "#### 📚 What you'll learn\n", |
13 | 13 | "\n", |
14 | | - "The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic dataset of W-2 forms (US Wage & Tax Statements).\n", |
| 14 | + "The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic\\\n", |
| 15 | + " dataset of W-2 forms (US Wage & Tax Statements).\n", |
15 | 16 | "\n", |
16 | 17 | "- We will generate numerical fields using [statistics published by the IRS](https://www.irs.gov/pub/irs-pdf/p5385.pdf) for the year 2021:\n", |
17 | 18 | "\n", |
18 | | - "- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics for generated persons reflect real-world census data.\n", |
| 19 | + "- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics\\\n", |
| 20 | + " for generated persons reflect real-world census data.\n", |
19 | 21 | "\n", |
20 | 22 | "\n", |
21 | 23 | "<br>\n", |
22 | 24 | "\n", |
23 | 25 | "> 👋 **IMPORTANT** – Environment Setup\n", |
24 | 26 | ">\n", |
25 | | - "> - If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies.\n", |
| 27 | + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", |
26 | 28 | ">\n", |
27 | 29 | "> - You may need to restart your notebook's kernel after setting up the environment.\n", |
28 | 30 | "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", |
|
167 | 169 | "id": "bbcb3538", |
168 | 170 | "metadata": {}, |
169 | 171 | "source": [ |
170 | | - "### 🎲 Setting Up Taxpayer and Employer Sampling\n", |
| 172 | + "## 🎲 Setting Up Taxpayer and Employer Sampling\n", |
171 | 173 | "\n", |
172 | 174 | "- Sampler columns offer non-LLM based generation of synthetic data.\n", |
173 | 175 | "\n", |
174 | 176 | "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n", |
175 | 177 | "\n", |
176 | | - "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census. If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" |
| 178 | + "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", |
| 179 | + " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" |
177 | 180 | ] |
178 | 181 | }, |
179 | 182 | { |
|
212 | 215 | "id": "28397d74", |
213 | 216 | "metadata": {}, |
214 | 217 | "source": [ |
215 | | - "### ⚡️ Defining the Fields\n", |
| 218 | + "## ⚡️ Defining the Fields\n", |
216 | 219 | "\n", |
217 | 220 | "We will focus on the following:\n", |
218 | 221 | "- Box 1 (Wages, tips, and other compensation)\n", |
|
231 | 234 | "\n", |
232 | 235 | "### Numerical fields\n", |
233 | 236 | "\n", |
234 | | - "Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). We'll use the W-2 statistics from the IRS linked above to generate realistic samples." |
| 237 | + "Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). \\\n", |
| 238 | + "We'll use the W-2 statistics from the IRS linked above to generate realistic samples." |
235 | 239 | ] |
236 | 240 | }, |
237 | 241 | { |
|
411 | 415 | "source": [ |
412 | 416 | "### 🦜 Non-numerical Fields\n", |
413 | 417 | "\n", |
414 | | - "The remaining fields contain information about the employee (taxpayer) and the employer. We'll use the person sampler in combination with an LLM to generate values here." |
| 418 | + "The remaining fields contain information about the employee (taxpayer) and the employer. \\\n", |
| 419 | + "We'll use the person sampler in combination with an LLM to generate values here." |
415 | 420 | ] |
416 | 421 | }, |
417 | 422 | { |
|
574 | 579 | "metadata": {}, |
575 | 580 | "outputs": [], |
576 | 581 | "source": [ |
577 | | - "job_results = data_designer_client.create(config_builder, num_records=2)\n", |
| 582 | + "job_results = data_designer_client.create(config_builder, num_records=20)\n", |
578 | 583 | "\n", |
579 | 584 | "# This will block until the job is complete.\n", |
580 | 585 | "job_results.wait_until_done()" |
|
618 | 623 | "# Download the job artifacts and save them to disk.\n", |
619 | 624 | "job_results.download_artifacts(\n", |
620 | 625 | " output_path=TUTORIAL_OUTPUT_PATH,\n", |
621 | | - " artifacts_folder_name=\"artifacts-community-contributions-w2-dataset\",\n", |
| 626 | + " artifacts_folder_name=\"artifacts-community-contributions-forms-w2-dataset\",\n", |
622 | 627 | ");" |
623 | 628 | ] |
624 | 629 | } |
|