|
931 | 931 | "outputs": [], |
932 | 932 | "source": [ |
933 | 933 | "java_package_shap_values = explain_anomalies_with_shap(\n", |
934 | | - " # random_forest_model=java_package_proxy_random_forest,\n", |
935 | 934 | " random_forest_model=java_package_anomaly_detection_results.random_forest_classifier,\n", |
936 | 935 | " prepared_features=java_package_anomaly_detection_features_prepared\n", |
937 | 936 | ")\n", |
|
944 | 943 | ")" |
945 | 944 | ] |
946 | 945 | }, |
| 946 | + { |
| 947 | + "cell_type": "code", |
| 948 | + "execution_count": null, |
| 949 | + "id": "50ce9cbb", |
| 950 | + "metadata": {}, |
| 951 | + "outputs": [], |
| 952 | + "source": [ |
| 953 | + "# TODO delete next section if not used anymore" |
| 954 | + ] |
| 955 | + }, |
| 956 | + { |
| 957 | + "cell_type": "markdown", |
| 958 | + "id": "34a35377", |
| 959 | + "metadata": {}, |
| 960 | + "source": [ |
| 961 | + "\n", |
| 962 | + "\n", |
| 963 | + "### 🔍 1. **Summary Plot (Beeswarm)**\n", |
| 964 | + "\n", |
| 965 | + "```python\n", |
| 966 | + "shap.summary_plot(shap_values, X, feature_names=feature_names)\n", |
| 967 | + "```\n", |
| 968 | + "\n", |
| 969 | + "* **Best for:** Global understanding of which features drive anomalies.\n", |
| 970 | + "* **Adds:** Direction of impact (color shows feature value).\n", |
| 971 | + "* **Why:** Useful when you want to see how values push predictions toward normal or anomalous.\n", |
| 972 | + "\n", |
| 973 | + "---\n", |
| 974 | + "\n", |
| 975 | + "### 🧠 2. **Force Plot**\n", |
| 976 | + "\n", |
| 977 | + "```python\n", |
| 978 | + "shap.initjs()\n", |
| 979 | + "shap.force_plot(\n", |
| 980 | + " explainer.expected_value[1], # For class \"anomaly\"\n", |
| 981 | + " shap_values[1][i], # For specific instance i\n", |
| 982 | + " X.iloc[i], # Same instance input\n", |
| 983 | + " feature_names=feature_names\n", |
| 984 | + ")\n", |
| 985 | + "```\n", |
| 986 | + "\n", |
| 987 | + "* **Best for:** Explaining *why a specific data point* is anomalous.\n", |
| 988 | + "* **Adds:** Visual breakdown of how each feature contributes to the score.\n", |
| 989 | + "* **Why:** Highly interpretable for debugging single nodes.\n", |
| 990 | + "\n", |
| 991 | + "---\n", |
| 992 | + "\n", |
| 993 | + "### 📈 3. **Dependence Plot**\n", |
| 994 | + "\n", |
| 995 | + "```python\n", |
| 996 | + "shap.dependence_plot(\"PageRank\", shap_values[1], X, feature_names=feature_names)\n", |
| 997 | + "```\n", |
| 998 | + "\n", |
| 999 | + "* **Best for:** Understanding how *one feature* affects anomaly scores.\n", |
| 1000 | + "* **Adds:** Color can show interaction with another feature.\n", |
| 1001 | + "* **Why:** Helps discover *nonlinear effects or interaction terms*.\n", |
| 1002 | + "\n", |
| 1003 | + "---\n", |
| 1004 | + "\n", |
| 1005 | + "### 🔗 4. **Interaction Value Plots**\n", |
| 1006 | + "\n", |
| 1007 | + "If your model was trained with `TreeExplainer(model, feature_perturbation=\"tree_path_dependent\")`, you can use:\n", |
| 1008 | + "\n", |
| 1009 | + "```python\n", |
| 1010 | + "shap_interaction_values = explainer.shap_interaction_values(X)\n", |
| 1011 | + "shap.summary_plot(shap_interaction_values[1], X)\n", |
| 1012 | + "```\n", |
| 1013 | + "\n", |
| 1014 | + "* **Best for:** Revealing how features *interact* in creating anomalies.\n", |
| 1015 | + "* **Adds:** Pairs of features contributing together.\n", |
| 1016 | + "* **Why:** Especially interesting with graph metrics + embedding components.\n", |
| 1017 | + "\n", |
| 1018 | + "---\n", |
| 1019 | + "\n", |
| 1020 | + "### 🧭 5. **Decision Plot**\n", |
| 1021 | + "\n", |
| 1022 | + "```python\n", |
| 1023 | + "shap.decision_plot(\n", |
| 1024 | + " explainer.expected_value[1],\n", |
| 1025 | + " shap_values[1][sample_indices],\n", |
| 1026 | + " X.iloc[sample_indices],\n", |
| 1027 | + " feature_names=feature_names\n", |
| 1028 | + ")\n", |
| 1029 | + "```\n", |
| 1030 | + "\n", |
| 1031 | + "* **Best for:** Tracing how a model arrives at a decision.\n", |
| 1032 | + "* **Adds:** Shows cumulative impact of features.\n", |
| 1033 | + "* **Why:** Good for **comparing multiple instances** and identifying tipping-point features.\n", |
| 1034 | + "\n", |
| 1035 | + "---\n", |
| 1036 | + "\n", |
| 1037 | + "### 🧊 6. **Waterfall Plot**\n", |
| 1038 | + "\n", |
| 1039 | + "```python\n", |
| 1040 | + "shap.plots.waterfall(shap.Explanation(\n", |
| 1041 | + " values=shap_values[1][i],\n", |
| 1042 | + " base_values=explainer.expected_value[1],\n", |
| 1043 | + " data=X.iloc[i],\n", |
| 1044 | + " feature_names=feature_names\n", |
| 1045 | + "))\n", |
| 1046 | + "```\n", |
| 1047 | + "\n", |
| 1048 | + "* **Best for:** Clear breakdown of prediction into additive components.\n", |
| 1049 | + "* **Why:** Cleaner than force plot; great in reports or UI.\n", |
| 1050 | + "\n", |
| 1051 | + "---\n", |
| 1052 | + "\n", |
| 1053 | + "### ✅ Recommendations for Your Use Case (Code Graph Anomaly Detection):\n", |
| 1054 | + "\n", |
| 1055 | + "| Goal | Recommended Plot |\n", |
| 1056 | + "| -------------------------------- | ------------------------------ |\n", |
| 1057 | + "| Global feature influence | Summary Plot (bar or beeswarm) |\n", |
| 1058 | + "| Understand single anomaly | Force Plot / Waterfall |\n", |
| 1059 | + "| Explore how a feature influences | Dependence Plot |\n", |
| 1060 | + "| Discover interactions | Interaction Plot |\n", |
| 1061 | + "| Debug how decision was made | Decision Plot |\n", |
| 1062 | + "\n", |
| 1063 | + "Let me know what type of insight you're most interested in (e.g., per node, across the graph, per anomaly cluster), and I can recommend specific plot setups or generate templates for you.\n" |
| 1064 | + ] |
| 1065 | + }, |
| 1066 | + { |
| 1067 | + "cell_type": "code", |
| 1068 | + "execution_count": null, |
| 1069 | + "id": "7025dd65", |
| 1070 | + "metadata": {}, |
| 1071 | + "outputs": [], |
| 1072 | + "source": [ |
| 1073 | + "def plot_shap_explained_beeswarm(\n", |
| 1074 | + " shap_values: numpy_typing.NDArray,\n", |
| 1075 | + " prepared_features: numpy_typing.NDArray,\n", |
| 1076 | + " feature_names: list[str],\n", |
| 1077 | + " title_prefix: str = \"\",\n", |
| 1078 | + ") -> None:\n", |
| 1079 | + " \"\"\"\n", |
| 1080 | + " Explain anomalies using SHAP values and plot the global feature importance as a \"beeswarm\".\n", |
| 1081 | + " This function uses the SHAP library to visualize the impact of features on the model's predictions\n", |
| 1082 | + " for anomalies detected by the Isolation Forest model via the Random Forest proxy model.\n", |
| 1083 | + " \"\"\"\n", |
| 1084 | + "\n", |
| 1085 | + " shap.summary_plot(\n", |
| 1086 | + " shap_values[:, :, 1], # Class 1 = anomaly\n", |
| 1087 | + " prepared_features[:],\n", |
| 1088 | + " feature_names=feature_names,\n", |
| 1089 | + " plot_type=\"dot\",\n", |
| 1090 | + " max_display=20,\n", |
| 1091 | + " plot_size=(12, 6), # (width, height) in inches\n", |
| 1092 | + " show=False\n", |
| 1093 | + " )\n", |
| 1094 | + " plot.title(f\"How {title_prefix} features contribute to the anomaly score (beeswarm plot)\", fontsize=12)\n", |
| 1095 | + " plot.show()" |
| 1096 | + ] |
| 1097 | + }, |
| 1098 | + { |
| 1099 | + "cell_type": "code", |
| 1100 | + "execution_count": null, |
| 1101 | + "id": "ec6676c7", |
| 1102 | + "metadata": {}, |
| 1103 | + "outputs": [], |
| 1104 | + "source": [ |
| 1105 | + "plot_shap_explained_beeswarm(\n", |
| 1106 | + " shap_values=java_package_shap_values,\n", |
| 1107 | + " prepared_features=java_package_anomaly_detection_features_prepared,\n", |
| 1108 | + " feature_names=java_package_anomaly_detection_feature_names,\n", |
| 1109 | + " title_prefix=\"Java Package\"\n", |
| 1110 | + ")" |
| 1111 | + ] |
| 1112 | + }, |
| 1113 | + { |
| 1114 | + "cell_type": "code", |
| 1115 | + "execution_count": null, |
| 1116 | + "id": "9b5a523d", |
| 1117 | + "metadata": {}, |
| 1118 | + "outputs": [], |
| 1119 | + "source": [ |
| 1120 | + "def plot_shap_explained_local_feature_importance(\n", |
| 1121 | + " index_to_explain,\n", |
| 1122 | + " random_forest_model: RandomForestClassifier,\n", |
| 1123 | + " prepared_features: np.ndarray,\n", |
| 1124 | + " feature_names: list[str],\n", |
| 1125 | + " title_prefix: str = \"\",\n", |
| 1126 | + " rounding_precision: int = 3,\n", |
| 1127 | + "):\n", |
| 1128 | + " # TODO Take explainer as input parameter\n", |
| 1129 | + " explainer = shap.TreeExplainer(random_forest_model)\n", |
| 1130 | + " shap_values = explainer.shap_values(prepared_features)\n", |
| 1131 | + "\n", |
| 1132 | + " # print(f\"Input data with prepared features: shape={prepared_features.shape}\")\n", |
| 1133 | + " # print(f\"Explainable AI SHAP values: shape={np.shape(shap_values)}\")\n", |
| 1134 | + " # print(f\"Explainable AI SHAP expected_value: shape={np.shape(explainer.expected_value)}\")\n", |
| 1135 | + " # print(f\"Explainable AI SHAP expected_value: type={type(explainer.expected_value)}\")\n", |
| 1136 | + " # print(f\"Explaining instance at index {index_to_explain} with anomaly label: {original_features.iloc[index_to_explain][anomaly_label_column]}\")\n", |
| 1137 | + "\n", |
| 1138 | + " shap_values_rounded = np.round(shap_values[:,:, 1][index_to_explain], rounding_precision)\n", |
| 1139 | + " prepared_features_rounded = prepared_features[:][index_to_explain].round(rounding_precision)\n", |
| 1140 | + " base_value_rounded = np.round(typing.cast(np.ndarray,explainer.expected_value)[1], rounding_precision)\n", |
| 1141 | + "\n", |
| 1142 | + " shap.force_plot(\n", |
| 1143 | + " base_value_rounded, # For class \"anomaly\"\n", |
| 1144 | + " # typing.cast(np.ndarray,explainer.expected_value)[1], # For class \"anomaly\"\n", |
| 1145 | + " # shap_values[:,:, 1][index_to_explain],\n", |
| 1146 | + " shap_values_rounded,\n", |
| 1147 | + " prepared_features_rounded,\n", |
| 1148 | + " # prepared_features[:][index_to_explain],\n", |
| 1149 | + " feature_names=feature_names,\n", |
| 1150 | + " matplotlib=True,\n", |
| 1151 | + " show=False,\n", |
| 1152 | + " contribution_threshold=0.07\n", |
| 1153 | + " )\n", |
| 1154 | + " plot.title(f\"{title_prefix} anomaly feature {feature_names[index_to_explain]} explained\", fontsize=14, loc='left')\n", |
| 1155 | + " plot.show()" |
| 1156 | + ] |
| 1157 | + }, |
| 1158 | + { |
| 1159 | + "cell_type": "code", |
| 1160 | + "execution_count": null, |
| 1161 | + "id": "77b0852c", |
| 1162 | + "metadata": {}, |
| 1163 | + "outputs": [], |
| 1164 | + "source": [ |
| 1165 | + "plot_shap_explained_local_feature_importance(\n", |
| 1166 | + " index_to_explain=4,\n", |
| 1167 | + " random_forest_model=java_package_anomaly_detection_results.random_forest_classifier,\n", |
| 1168 | + " prepared_features=java_package_anomaly_detection_features_prepared,\n", |
| 1169 | + " feature_names=java_package_anomaly_detection_feature_names,\n", |
| 1170 | + " title_prefix=\"Java Package\",\n", |
| 1171 | + ")" |
| 1172 | + ] |
| 1173 | + }, |
| 1174 | + { |
| 1175 | + "cell_type": "raw", |
| 1176 | + "id": "2df453b4", |
| 1177 | + "metadata": { |
| 1178 | + "vscode": { |
| 1179 | + "languageId": "raw" |
| 1180 | + } |
| 1181 | + }, |
| 1182 | + "source": [ |
| 1183 | + "# TODO delete if not needed anymore\n", |
| 1184 | + "def plot_shap_explained_feature_dependency(\n", |
| 1185 | + " index_to_explain: int,\n", |
| 1186 | + " random_forest_model: RandomForestClassifier,\n", |
| 1187 | + " prepared_features: np.ndarray,\n", |
| 1188 | + " feature_names: list[str],\n", |
| 1189 | + " title_prefix: str = \"\",\n", |
| 1190 | + "):\n", |
| 1191 | + " explainer = shap.TreeExplainer(random_forest_model)\n", |
| 1192 | + " shap_values = explainer.shap_values(prepared_features)\n", |
| 1193 | + "\n", |
| 1194 | + " shap.dependence_plot(\n", |
| 1195 | + " ind=index_to_explain, # Feature name or index\n", |
| 1196 | + " shap_values=shap_values[:, :, 1],\n", |
| 1197 | + " features=prepared_features[:],\n", |
| 1198 | + " feature_names=feature_names,\n", |
| 1199 | + " interaction_index=None, # Set to a feature name/index to see interactions\n", |
| 1200 | + " show=False,\n", |
| 1201 | + " )\n", |
| 1202 | + " plot.title(f\"{title_prefix} Feature contribution to anomaly score\")\n", |
| 1203 | + " plot.show()\n", |
| 1204 | + "\n", |
| 1205 | + "plot_shap_explained_feature_dependency(\n", |
| 1206 | + " index_to_explain=2,\n", |
| 1207 | + " random_forest_model=java_package_anomaly_detection_results.random_forest_classifier,\n", |
| 1208 | + " prepared_features=java_package_anomaly_detection_features_prepared,\n", |
| 1209 | + " feature_names=java_package_anomaly_detection_feature_names,\n", |
| 1210 | + " title_prefix=\"Java Package\"\n", |
| 1211 | + ")" |
| 1212 | + ] |
| 1213 | + }, |
| 1214 | + { |
| 1215 | + "cell_type": "code", |
| 1216 | + "execution_count": null, |
| 1217 | + "id": "fb7e14f9", |
| 1218 | + "metadata": {}, |
| 1219 | + "outputs": [], |
| 1220 | + "source": [ |
| 1221 | + "def plot_shap_explained_top_10_feature_dependence(\n", |
| 1222 | + " random_forest_model: RandomForestClassifier,\n", |
| 1223 | + " prepared_features: np.ndarray,\n", |
| 1224 | + " feature_names: list[str],\n", |
| 1225 | + " title_prefix: str = \"\",\n", |
| 1226 | + "):\n", |
| 1227 | + " explainer = shap.TreeExplainer(random_forest_model)\n", |
| 1228 | + " shap_values = explainer.shap_values(prepared_features)\n", |
| 1229 | + "\n", |
| 1230 | + " mean_abs_shap = np.abs(shap_values[:, :, 1]).mean(axis=0)\n", |
| 1231 | + " top_features = np.argsort(mean_abs_shap)[-10:][::-1] # top 10 indices\n", |
| 1232 | + " top_feature_names = [feature_names[i] for i in top_features] # Get names of top features\n", |
| 1233 | + " \n", |
| 1234 | + " figure, axes = plot.subplots(5, 2, figsize=(15, 20)) # 5 rows x 2 columns\n", |
| 1235 | + " figure.suptitle(f\"{title_prefix} Anomalies: Top 10 feature dependence plots\", fontsize=16)\n", |
| 1236 | + " axes = axes.flatten() # Flatten for easy iteration\n", |
| 1237 | + "\n", |
| 1238 | + " for index, feature in enumerate(top_feature_names):\n", |
| 1239 | + " shap.dependence_plot(\n", |
| 1240 | + " ind=feature, # Feature name or index\n", |
| 1241 | + " shap_values=shap_values[:, :, 1],\n", |
| 1242 | + " features=prepared_features[:],\n", |
| 1243 | + " feature_names=feature_names,\n", |
| 1244 | + " interaction_index=None, # Set to a feature name/index to see interactions\n", |
| 1245 | + " show=False,\n", |
| 1246 | + " ax=axes[index]\n", |
| 1247 | + " )\n", |
| 1248 | + "\n", |
| 1249 | + " plot.tight_layout(rect=(0.0, 0.02, 1.0, 0.98))\n", |
| 1250 | + " plot.show()\n", |
| 1251 | + "\n", |
| 1252 | + "plot_shap_explained_top_10_feature_dependence(\n", |
| 1253 | + " random_forest_model=java_package_anomaly_detection_results.random_forest_classifier,\n", |
| 1254 | + " prepared_features=java_package_anomaly_detection_features_prepared,\n", |
| 1255 | + " feature_names=java_package_anomaly_detection_feature_names,\n", |
| 1256 | + " title_prefix=\"Java Package\"\n", |
| 1257 | + ")" |
| 1258 | + ] |
| 1259 | + }, |
| 1260 | + { |
| 1261 | + "cell_type": "raw", |
| 1262 | + "id": "1ced99f1", |
| 1263 | + "metadata": { |
| 1264 | + "vscode": { |
| 1265 | + "languageId": "raw" |
| 1266 | + } |
| 1267 | + }, |
| 1268 | + "source": [ |
| 1269 | + "# TODO delete if not needed anymore\n", |
| 1270 | + "def plot_shap_explained_heatmap(\n", |
| 1271 | + " random_forest_model: RandomForestClassifier,\n", |
| 1272 | + " prepared_features: np.ndarray,\n", |
| 1273 | + " original_features: pd.DataFrame, \n", |
| 1274 | + " feature_names: list[str],\n", |
| 1275 | + " title_prefix: str = \"\",\n", |
| 1276 | + " anomaly_label_column: str = \"anomalyLabel\"\n", |
| 1277 | + "):\n", |
| 1278 | + " explainer = shap.TreeExplainer(random_forest_model)\n", |
| 1279 | + " shap_values = explainer.shap_values(prepared_features)\n", |
| 1280 | + "\n", |
| 1281 | + " # Create SHAP Explanation object\n", |
| 1282 | + " shap_explanation = shap.Explanation(\n", |
| 1283 | + " values=shap_values[:, :, 1],\n", |
| 1284 | + " base_values=typing.cast(np.ndarray, explainer.expected_value)[1], # For class \"anomaly\"\n", |
| 1285 | + " data=prepared_features[:],\n", |
| 1286 | + " feature_names=feature_names\n", |
| 1287 | + " )\n", |
| 1288 | + "\n", |
| 1289 | + " shap.heatmap_plot(\n", |
| 1290 | + " shap_explanation, \n", |
| 1291 | + " instance_order=\"leaves\", # Optional: use clustering to sort rows\n", |
| 1292 | + " show=False,\n", |
| 1293 | + " )\n", |
| 1294 | + " plot.title(f\"{title_prefix} Anomaly feature heatmap\")\n", |
| 1295 | + " plot.show()\n", |
| 1296 | + "\n", |
| 1297 | + "plot_shap_explained_heatmap(\n", |
| 1298 | + " random_forest_model=java_package_anomaly_detection_results.random_forest_classifier,\n", |
| 1299 | + " prepared_features=java_package_anomaly_detection_features_prepared,\n", |
| 1300 | + " original_features=java_package_anomaly_detection_features,\n", |
| 1301 | + " feature_names=java_package_anomaly_detection_feature_names,\n", |
| 1302 | + " title_prefix=\"Java Package\"\n", |
| 1303 | + ")" |
| 1304 | + ] |
| 1305 | + }, |
947 | 1306 | { |
948 | 1307 | "cell_type": "markdown", |
949 | 1308 | "id": "27b33560", |
|