tests(manual): add tensorflow-test.ipynb (#1975)

jiridanek · web-flow · commit 1b70595a5b99 · 2025-09-16T17:56:57.000+02:00
This is a basic TensorFlow smoke test created by @daniellutz. CodeRabbit asked for it to be enhanced in the areas listed in #2491.
diff --git a/tests/manual/tensorflow-test.ipynb b/tests/manual/tensorflow-test.ipynb
@@ -0,0 +1,347 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "06fb33f2-64d3-4555-97ae-1fbd4d35bcf3",
+   "metadata": {},
+   "source": [
+    "# TensorFlow test notebook\n",
+    "\n",
+    "This notebook provides a very basic set of manual checks for Jupyter notebooks with GPU support. It covers the following:\n",
+    "\n",
+    "1. Verify the installed Python version\n",
+    "2. Find the GPU on the devices list\n",
+    "3. Check if nvidia-smi loads properly\n",
+    "4. CUDA/ROCm drivers are installed\n",
+    "5. TensorFlow quickstart for beginners\n",
+    "6. TensorFlow tests (basic operations on GPU)\n",
+    "7. TensorBoard spins up properly"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c5d228b-1e87-4dde-a469-7e89ac27e0f3",
+   "metadata": {},
+   "source": [
+    "## 1. Verify the installed Python version\n",
+    "Multiple notebooks are available, and they can differ because of system upgrades, builds with different Python versions, and other possible changes.\n",
+    "\n",
+    "The following test only prints the Python version installed in this notebook, so you can verify that the expected Python is actually running.\n",
+    "\n",
+    "> Note: this is still a manual test; you need to know which version is supposed to run here and compare it with the output below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "5964db1a-5a7c-4988-a3ee-2b7794cdfc38",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "3.12.9 (main, Jun 20 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)]\n"
+     ]
+    }
+   ],
+   "source": [
+    "import sys\n",
+    "print(sys.version)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14291502-9d3b-40dc-b1a3-0721ab26e12d",
+   "metadata": {},
+   "source": [
+    "## 2. Find the GPU on the devices list\n",
+    "To check whether GPUs are present in the current setup, we will rely on the TensorFlow Python client, which provides a Python API for querying the system's properties and devices.\n",
+    "\n",
+    "- If the following code returns a non-empty list, GPUs are available on this server.\n",
+    "- If the following code returns an empty list, no GPUs are available."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "612058a9-cbfe-4e4b-a5cc-bc08e31df830",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tensorflow.python.client import device_lib\n",
+    "\n",
+    "def get_available_gpus():\n",
+    "    \"\"\"\n",
+    "    List all devices visible to TensorFlow on this server and return\n",
+    "    only the names of those whose device type is \"GPU\"\n",
+    "    \"\"\"\n",
+    "    local_device_protos = device_lib.list_local_devices()\n",
+    "    return [x.name for x in local_device_protos if x.device_type == 'GPU']\n",
+    "\n",
+    "get_available_gpus()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "40f07595-a51e-4b17-b537-53d2c21ea83f",
+   "metadata": {},
+   "source": [
+    "## 3. Check if nvidia-smi loads properly\n",
+    "The NVIDIA System Management Interface (nvidia-smi) is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.\n",
+    "\n",
+    "The following command simply runs the `nvidia-smi` utility:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e4307f74-b172-4a4e-8d35-c8f6b98421f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-smi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "86ed7b78-1171-4347-a99a-cedc5e513415",
+   "metadata": {},
+   "source": [
+    "## 4. CUDA/ROCm drivers are installed\n",
+    "This test simply checks whether the CUDA/ROCm drivers are properly installed. To do so, run the `nvcc` command for NVIDIA GPUs or the `hipcc` command for ROCm GPUs.\n",
+    "\n",
+    "> Note: the code is intentionally kept as simple as possible. Run only the cells that make sense for the tests you are doing; there is no extra logic to detect the GPU type automatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "98a7a87d-c814-458a-b47a-36681c384223",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run the following for NVIDIA GPUs\n",
+    "!nvcc --version"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "58878971-ddde-4310-bee3-992cb60e6444",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run the following for ROCm GPUs\n",
+    "!hipcc --version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3d3fbdc6-fafb-48a7-bfdb-4bf8f00ea414",
+   "metadata": {},
+   "source": [
+    "## 5. TensorFlow quickstart for beginners\n",
+    "This test uses the basic Getting Started tutorial from the TensorFlow website to check that this installation is working properly:\n",
+    "\n",
+    "- Load a prebuilt dataset.\n",
+    "- Build a neural network machine learning model that classifies images.\n",
+    "- Train this neural network.\n",
+    "- Evaluate the accuracy of the model.\n",
+    "\n",
+    "The expected output here is a matrix of probabilities, for example:\n",
+    "\n",
+    "```\n",
+    "<tf.Tensor: shape=(5, 10), dtype=float32, numpy=\n",
+    "array([[1.6603454e-07, 1.1202768e-09, 3.5647324e-06, 2.7514023e-05,\n",
+    "        1.0256378e-10, 1.7054673e-07...\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9a84a9ec-f867-44aa-b078-61156f71e132",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tensorflow as tf\n",
+    "\n",
+    "# 1. Load a dataset\n",
+    "# ----------------------------------------------------------------------------------------------\n",
+    "# Load and prepare the MNIST dataset. The pixel values of the images range from 0 through 255.\n",
+    "# Scale these values to a range of 0 to 1 by dividing the values by 255.0.\n",
+    "# This also converts the sample data from integers to floating-point numbers:\n",
+    "mnist = tf.keras.datasets.mnist\n",
+    "\n",
+    "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n",
+    "x_train, x_test = x_train / 255.0, x_test / 255.0\n",
+    "\n",
+    "\n",
+    "# 2. Build a machine learning model\n",
+    "# ----------------------------------------------------------------------------------------------\n",
+    "# Build a tf.keras.Sequential model:\n",
+    "model = tf.keras.models.Sequential([\n",
+    "  tf.keras.layers.Flatten(input_shape=(28, 28)),\n",
+    "  tf.keras.layers.Dense(128, activation='relu'),\n",
+    "  tf.keras.layers.Dropout(0.2),\n",
+    "  tf.keras.layers.Dense(10)\n",
+    "])\n",
+    "\n",
+    "# Sequential is useful for stacking layers where each layer has one input tensor and one output tensor.\n",
+    "# Layers are functions with a known mathematical structure that can be reused and have trainable variables.\n",
+    "# Most TensorFlow models are composed of layers. This model uses the Flatten, Dense, and Dropout layers.\n",
+    "#\n",
+    "# For each example, the model returns a vector of logits or log-odds scores, one for each class.\n",
+    "predictions = model(x_train[:1]).numpy()\n",
+    "print(predictions)\n",
+    "\n",
+    "# The tf.nn.softmax function converts these logits to probabilities for each class:\n",
+    "print(tf.nn.softmax(predictions).numpy())\n",
+    "\n",
+    "# Define a loss function for training using losses.SparseCategoricalCrossentropy:\n",
+    "loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n",
+    "\n",
+    "# The loss function takes a vector of ground truth values and a vector of logits and returns a scalar loss for each example.\n",
+    "# This loss is equal to the negative log probability of the true class: The loss is zero if the model is sure of the correct class.\n",
+    "#\n",
+    "# This untrained model gives probabilities close to random (1/10 for each class), so the initial loss should be close to -tf.math.log(1/10) ~= 2.3.\n",
+    "print(loss_fn(y_train[:1], predictions).numpy())\n",
+    "\n",
+    "# Before you start training, configure and compile the model using Keras Model.compile. Set the optimizer class to adam, set the loss to the loss_fn\n",
+    "# function you defined earlier, and specify a metric to be evaluated for the model by setting the metrics parameter to accuracy.\n",
+    "model.compile(optimizer='adam',\n",
+    "              loss=loss_fn,\n",
+    "              metrics=['accuracy'])\n",
+    "\n",
+    "\n",
+    "# 3. Train and evaluate your model\n",
+    "# ----------------------------------------------------------------------------------------------\n",
+    "# Use the Model.fit method to adjust your model parameters and minimize the loss:\n",
+    "model.fit(x_train, y_train, epochs=5)\n",
+    "\n",
+    "# The Model.evaluate method checks the model's performance, usually on a validation set or test set.\n",
+    "model.evaluate(x_test,  y_test, verbose=2)\n",
+    "\n",
+    "# The image classifier is now trained to ~98% accuracy on this dataset. To learn more, read the TensorFlow tutorials.\n",
+    "#\n",
+    "# If you want your model to return a probability, you can wrap the trained model, and attach the softmax to it:\n",
+    "probability_model = tf.keras.Sequential([\n",
+    "  model,\n",
+    "  tf.keras.layers.Softmax()\n",
+    "])\n",
+    "probability_model(x_test[:5])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1a0ed9e-7372-4b3b-b6c0-0d252ee88c01",
+   "metadata": {},
+   "source": [
+    "## 6. TensorFlow tests (basic operations on GPU)\n",
+    "This test tries out basic operations on the GPU, such as `add`, `multiply`, `matmul`, `reduce_sum` and `reduce_mean`, to confirm that TensorFlow is running properly.\n",
+    "\n",
+    "For more information on the methods tested here:\n",
+    "- [tf.add](https://www.tensorflow.org/api_docs/python/tf/math/add)\n",
+    "- [tf.multiply](https://www.tensorflow.org/api_docs/python/tf/math/multiply)\n",
+    "- [tf.matmul](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul)\n",
+    "- [tf.reduce_sum](https://www.tensorflow.org/api_docs/python/tf/math/reduce_sum)\n",
+    "- [tf.reduce_mean](https://www.tensorflow.org/api_docs/python/tf/math/reduce_mean)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e2e44f63-6c50-493b-9742-125e2f09073b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tensorflow as tf\n",
+    "import numpy as np\n",
+    "\n",
+    "# Basic arithmetic\n",
+    "a = tf.constant([1.0, 2.0, 3.0, 4.0])\n",
+    "b = tf.constant([4.0, 3.0, 2.0, 1.0])\n",
+    "\n",
+    "# Addition\n",
+    "add_result = tf.add(a, b)\n",
+    "print(f\"Addition: {a.numpy()} + {b.numpy()} = {add_result.numpy()}\")\n",
+    "\n",
+    "# Multiplication\n",
+    "mul_result = tf.multiply(a, b)\n",
+    "print(f\"Multiplication: {mul_result.numpy()}\")\n",
+    "\n",
+    "# Matrix operations\n",
+    "matrix_a = tf.constant([[1.0, 2.0], [3.0, 4.0]])\n",
+    "matrix_b = tf.constant([[2.0, 1.0], [1.0, 3.0]])\n",
+    "matmul_result = tf.matmul(matrix_a, matrix_b)\n",
+    "print(f\"Matrix multiplication result: \\n{matmul_result.numpy()}\")\n",
+    "\n",
+    "# Reduction operations\n",
+    "reduce_sum = tf.reduce_sum(a)\n",
+    "reduce_mean = tf.reduce_mean(a)\n",
+    "print(f\"Sum: {reduce_sum.numpy()}, Mean: {reduce_mean.numpy()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "988e8953-6957-44da-ab2e-3a7272d5815d",
+   "metadata": {},
+   "source": [
+    "## 7. TensorBoard spins up properly\n",
+    "[TensorBoard](https://www.tensorflow.org/tensorboard) provides the visualization and tooling needed for machine learning experimentation:\n",
+    "\n",
+    "- Tracking and visualizing metrics such as loss and accuracy\n",
+    "- Visualizing the model graph (ops and layers)\n",
+    "- Viewing histograms of weights, biases, or other tensors as they change over time\n",
+    "- Projecting embeddings to a lower dimensional space\n",
+    "- Displaying images, text, and audio data\n",
+    "- Profiling TensorFlow programs\n",
+    "- and more...\n",
+    "\n",
+    "This basic test checks that the TensorBoard UI spins up properly and that there are no major issues (see [RHOAIENG-20553](https://issues.redhat.com/browse/RHOAIENG-20553))."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9361e574-2e50-46b3-852c-c5b9d06c6f7b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# TensorBoard needs to run through a proxy because the Jupyter notebook is served through a proxy / internal connection\n",
+    "# More information: https://medium.com/@adrian.punga_29809/how-to-use-tensorboard-inline-in-jupyter-lab-or-colab-d28519619d28\n",
+    "os.environ[\"TENSORBOARD_PROXY_URL\"] = os.environ[\"NB_PREFIX\"]+\"/proxy/6006/\"\n",
+    "\n",
+    "# Load TensorBoard extension\n",
+    "%load_ext tensorboard\n",
+    "\n",
+    "# Show TensorBoard in the Jupyter Notebook\n",
+    "%tensorboard --logdir /opt/app-root/src/shared"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.12",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}