[no ci] add draft for data processing section

vpratz · vpratz · commit 728b1301d353 · 2025-05-07T07:59:27.000Z
diff --git a/docsrc/source/user_guide/data_processing.ipynb b/docsrc/source/user_guide/data_processing.ipynb
@@ -0,0 +1,238 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d030332b-2bb3-4b6c-b332-164206123b8f",
+   "metadata": {},
+   "source": [
+    "# Data Processing: Adapters\n",
+    "\n",
+    "Before the training data generated by a simulator can be used for deep learning, we have to bring it into the structure required by BayesFlow. The {py:class}`~bayesflow.adapters.Adapter` class provides a flexible set of transformations for this purpose, from standardization to renaming, and many more.\n",
+    "\n",
+    "## BayesFlow's Data Structure\n",
+    "\n",
+    "BayesFlow offers a standardized interface for training neural networks. Data and parameters are organized in dictionaries, and the inputs to the networks are read from specific dictionary entries:\n",
+    "\n",
+    "- `inference_variables` (required): The variables of the distribution we try to approximate. For a posterior distribution, this would be the parameters. For a likelihood function, this would be the data.\n",
+    "- `summary_variables` (optional): Variables that are passed through the summary network, and subsequently used as a condition for the inference network. In a posterior estimation setting, this would be the data (if a summary network is used).\n",
+    "- `inference_conditions` (optional): Conditions for the inference network that are passed directly, without going through a summary network. This is useful for context variables, as well as for the data when no summary network is used.\n",
+    "\n",
+    "In addition, we have to ensure that the correct data type is passed, usually `float32`. The {py:class}`~bayesflow.adapters.Adapter` class makes it easy to transform the data into the required structure."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9380af80-0638-4059-ab5e-ed8c181e9a93",
+   "metadata": {},
+   "source": [
+    "### Example: Posterior Estimation\n",
+    "\n",
+    "Let's start with a simple posterior estimation example, where we want to approximate the posterior distribution for parameters `theta_1` and `theta_2`, conditional on data `x`. First, we construct a simple dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "30505f99-db0f-4651-9de6-efcb282f578c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Shapes: {'theta_1': (2, 1), 'theta_2': (2, 1), 'x': (2, 3)}\n"
+     ]
+    }
+   ],
+   "source": [
+    "import bayesflow as bf\n",
+    "import numpy as np\n",
+    "\n",
+    "batch_size = 2\n",
+    "rng = np.random.default_rng(seed=2025)\n",
+    "data = {\n",
+    "    \"theta_1\": np.zeros((batch_size, 1)),\n",
+    "    \"theta_2\": np.ones((batch_size, 1)),\n",
+    "    \"x\": rng.uniform(size=(batch_size, 3)),\n",
+    "}\n",
+    "print(\"Shapes:\", {k: v.shape for k, v in data.items()})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "823b183c-a993-451f-b3a9-1908694a6448",
+   "metadata": {},
+   "source": [
+    "Next, we create an {py:class}`~bayesflow.adapters.Adapter` to convert it into the desired format (assuming we want to use a summary network later on)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "bd217c66-a748-455d-8cbd-03e74405bc86",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Adapter([0: ConvertDType -> 1: Concatenate(['theta_1', 'theta_2'] -> 'inference_variables') -> 2: Rename('x' -> 'summary_variables')])\n"
+     ]
+    }
+   ],
+   "source": [
+    "adapter = (\n",
+    "    bf.Adapter()\n",
+    "    .convert_dtype(\"float64\", \"float32\")\n",
+    "    .concatenate([\"theta_1\", \"theta_2\"], into=\"inference_variables\")\n",
+    "    .rename(\"x\", \"summary_variables\")\n",
+    ")\n",
+    "\n",
+    "print(adapter)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2616f412-db48-49e6-a876-89e95b435472",
+   "metadata": {},
+   "source": [
+    "When we now apply the adapter to our data, it executes the specified transformations:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "9da822ba-bc14-4945-8eb5-db5b26eb8a3b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'inference_variables': array([[0., 1.],\n",
+      "       [0., 1.]], dtype=float32), 'summary_variables': array([[0.9944578 , 0.38200974, 0.827148  ],\n",
+      "       [0.8372553 , 0.97580904, 0.07722503]], dtype=float32)}\n",
+      "Shapes: {'inference_variables': (2, 2), 'summary_variables': (2, 3)}\n"
+     ]
+    }
+   ],
+   "source": [
+    "transformed_data = adapter(data)\n",
+    "print(transformed_data)\n",
+    "print(\"Shapes:\", {k: v.shape for k, v in transformed_data.items()})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58def124-c41c-4059-b1ee-21319043ad06",
+   "metadata": {},
+   "source": [
+    "Many of the transforms in the adapter are invertible, so that we can also call the adapter in the inverse direction:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "963df7dd-e641-44e7-86fa-3d0444556b95",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Shapes: {'x': (2, 3), 'theta_1': (2, 1), 'theta_2': (2, 1)}\n"
+     ]
+    }
+   ],
+   "source": [
+    "cycled_data = adapter(transformed_data, inverse=True)\n",
+    "print(\"Shapes:\", {k: v.shape for k, v in cycled_data.items()})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "352acf17-7880-4b24-8358-c7f66b405159",
+   "metadata": {},
+   "source": [
+    "### Example: Likelihood Estimation\n",
+    "\n",
+    "For likelihood estimation, the roles are switched. We want to estimate the distribution of the data `x` conditional on the parameters `theta_1` and `theta_2`. We supply the parameters to the inference network directly without a summary network."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "af55755f-200e-436d-9e06-ba26761ae859",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Adapter([0: ConvertDType -> 1: Concatenate(['theta_1', 'theta_2'] -> 'inference_conditions') -> 2: Rename('x' -> 'inference_variables')])\n",
+      "Shapes: {'inference_conditions': (2, 2), 'inference_variables': (2, 3)}\n"
+     ]
+    }
+   ],
+   "source": [
+    "adapter = (\n",
+    "    bf.Adapter()\n",
+    "    .convert_dtype(\"float64\", \"float32\")\n",
+    "    .concatenate([\"theta_1\", \"theta_2\"], into=\"inference_conditions\")\n",
+    "    .rename(\"x\", \"inference_variables\")\n",
+    ")\n",
+    "\n",
+    "print(adapter)\n",
+    "transformed_data = adapter(data)\n",
+    "print(\"Shapes:\", {k: v.shape for k, v in transformed_data.items()})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f88a14b1-7b18-4d44-a5f6-8ca7a54dda7f",
+   "metadata": {},
+   "source": [
+    "You can find many more configurations in the {doc}`../../examples` section."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "16ba9fa3-6ad6-476d-afbd-13e260ea56b0",
+   "metadata": {},
+   "source": [
+    "## Pre-processing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "06f0f749-32fb-42a7-b56e-12ca0e396abb",
+   "metadata": {},
+   "source": [
+    "Besides the structure and the data types, there are pre-processing steps that can make network training more efficient. These include standardization, transforming constrained variables to an unconstrained space, and various non-linear transformations that simplify the space the network has to operate in. In addition, array operations like broadcasting and concatenating simplify the transformation into the required structure.\n",
+    "\n",
+    "The {py:class}`~bayesflow.adapters.Adapter` features a large set of methods. Please refer to the API documentation for a complete list, and to the {doc}`../../examples` section for applied examples."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docsrc/source/user_guide/index.md b/docsrc/source/user_guide/index.md
@@ -11,4 +11,5 @@ If you want to contribute, feel free to open an issue or a pull request, we welc
 
 introduction
 generative_models.ipynb
+data_processing.ipynb
 ```