|
11 | 11 | .. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
12 | 12 | .. See the License for the specific language governing permissions and |
13 | 13 | .. limitations under the License. |
14 | | -
|
| 14 | +.. include:: substitutions.rst |
15 | 15 | .. _array_api: |
16 | 16 |
|
17 | 17 | ================= |
18 | 18 | Array API support |
19 | 19 | ================= |
20 | | -The `Array API <https://data-apis.org/array-api/latest/>`_ specification defines |
21 | | -a standard API for all array manipulation libraries with a NumPy-like API. |
22 | | -Extension for Scikit-learn doesn't require |
23 | | -`array-api-compat <https://github.com/data-apis/array-api-compat>`__ to be installed for |
24 | | -functional support of the array API standard. |
25 | | -In the current implementation, the functional support of array api follows the functional |
26 | | -support of different array or DataFrame inputs and does not modify the precision of the |
27 | | -input and output data formats unless necessary. Any array API input will be converted to host |
28 | | -numpy.ndarrays and all internal manipulations with data will be done with these representations of |
29 | | -the input data. DPNP's 'ndarray' and Data Parallel Control's 'usm_ndarray' have special handling |
30 | | -requirements that are described in the relevant section of this document. Output values will in |
31 | | -all relevant cases match the input data format. |
32 | | - |
33 | | -.. note:: |
34 | | - Currently, only `array-api-strict <https://github.com/data-apis/array-api-strict>`__, |
35 | | - `dpctl <https://intelpython.github.io/dpctl/latest/index.html>`__, `dpnp <https://github.com/IntelPython/dpnp>`__ |
36 | | - and `numpy <https://numpy.org/>`__ are known to work with sklearnex estimators. |
37 | | -.. note:: |
38 | | - Stock Scikit-learn’s array API support requires `array-api-compat <https://github.com/data-apis/array-api-compat>`__ to be installed. |
39 | 20 |
|
40 | | - |
41 | | -Support for DPNP and DPCTL |
42 | | -========================== |
43 | | -The functional support of input data for sklearnex estimators also extended for SYCL USM array types. |
44 | | -These include SYCL USM arrays `dpnp's <https://github.com/IntelPython/dpnp>`__ ndarray and |
45 | | -`Data Parallel Control usm_ndarray <https://intelpython.github.io/dpctl/latest/index.html>`__. |
46 | | -DPNP ndarray and Data Parallel Control usm_ndarray contain SYCL contexts which can be used for |
47 | | -`sklearnex` device offloading. |
| 21 | +Overview |
| 22 | +======== |
| 23 | + |
| 24 | +Many estimators from the |sklearnex| support passing data classes that conform to the |
| 25 | +`Array API <https://data-apis.org/array-api/>`_ specification as inputs to methods like ``.fit()`` |
| 26 | +and ``.predict()``, such as :external+dpnp:doc:`dpnp.ndarray <reference/ndarray>` or |
| 27 | +`torch.Tensor <https://docs.pytorch.org/docs/stable/tensors.html>`__. This is particularly |
| 28 | +useful for GPU computations, as it allows performing operations on inputs that are already |
| 29 | +on GPU without moving the data from host to device. |
| 30 | + |
| 31 | +.. important:: |
| 32 | + Array API is disabled by default in |sklearn|. In order to get array API support in the |sklearnex|, it must |
| 33 | + be :external+sklearn:doc:`enabled in scikit-learn <modules/array_api>`, which requires either changing |
| 34 | + global settings or using a ``config_context``, plus installing additional dependencies such as ``array-api-compat``. |
| 35 | + |
| 36 | +When passing array API inputs whose data is on a SYCL-enabled device (e.g. an Intel GPU), as |
| 37 | +supported for example by `PyTorch <https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html>`__ |
| 38 | +and |dpnp|, if array API support is enabled and the requested operation (e.g. call to ``.fit()`` / ``.predict()`` |
| 39 | +on the estimator class being used) is :ref:`supported on device/GPU <sklearn_algorithms_gpu>`, computations |
| 40 | +will be performed on the device where the data lives, without involving any data transfers. Note that all of |
| 41 | +the inputs (e.g. ``X`` and ``y`` passed to ``.fit()`` methods) must be allocated on the same device for this to |
| 42 | +work. If the requested operation is not supported on the device where the data lives, it will fall back either |
| 43 | +to |sklearn| or, when supported, to an accelerated CPU version from the |sklearnex|. These fallbacks are controlled |
| 44 | +through the options ``allow_sklearn_after_onedal`` (default is ``True``) and ``allow_fallback_to_host`` (default is |
| 45 | +``False``), respectively, which are accepted by ``config_context`` and ``set_config`` after |
| 46 | +:ref:`patching scikit-learn <patching>` or when importing those directly from ``sklearnex``. |
48 | 47 |
|
49 | 48 | .. note:: |
50 | | - Current support for DPNP and DPCTL usm_ndarray data can be copied and moved to and from device in sklearnex and have |
51 | | - impacts on memory utilization. |
52 | | - |
53 | | -DPCTL or DPNP inputs are not required to use `config_context(target_offload=device)`. |
54 | | -`sklearnex` will use input usm_ndarray sycl context for device offloading. |
| 49 | + Under default settings for ``set_config`` / ``config_context``, operations that are not supported on GPU will |
| 50 | + fall back to |sklearn| instead of falling back to CPU versions from the |sklearnex|. |
| 51 | + |
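The fallback rules described above can be summarized as a small decision function. This is only an illustrative model and not code from the |sklearnex|: ``FallbackOptions`` and ``resolve_backend`` are hypothetical names, and the order in which the two options are checked when both are enabled is an assumption. Only the option names and their defaults are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class FallbackOptions:
    """Hypothetical container; field names and defaults follow the text above."""

    allow_sklearn_after_onedal: bool = True
    allow_fallback_to_host: bool = False


def resolve_backend(supported_on_device, supported_on_cpu, opts=None):
    """Illustrative model of where an operation on device data would run."""
    if opts is None:
        opts = FallbackOptions()
    if supported_on_device:
        return "sklearnex (device)"
    # Assumption: the accelerated host version is tried before stock scikit-learn.
    if opts.allow_fallback_to_host and supported_on_cpu:
        return "sklearnex (CPU)"
    if opts.allow_sklearn_after_onedal:
        return "scikit-learn"
    raise RuntimeError("operation is not supported and no fallback is allowed")


# Under default options, an operation without GPU support goes to scikit-learn.
default_choice = resolve_backend(supported_on_device=False, supported_on_cpu=True)
# With host fallback allowed, the accelerated CPU version is used instead.
host_choice = resolve_backend(
    supported_on_device=False,
    supported_on_cpu=True,
    opts=FallbackOptions(allow_fallback_to_host=True),
)
```

In this sketch ``default_choice`` matches the behavior in the note above, and enabling ``allow_fallback_to_host`` changes the outcome to the accelerated CPU path.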
| 52 | +If array API is enabled for |sklearn| and the estimator being used has array API support in |sklearn| (which can be |
| 53 | +verified through the ``array_api_support`` attribute of the tags returned by :obj:`sklearn.utils.get_tags`), then array API inputs whose data |
| 54 | +is allocated neither on CPU nor on a SYCL device will be forwarded directly to the unpatched methods from |sklearn|, |
| 55 | +without using the accelerated versions from this library, regardless of the ``allow_sklearn_after_onedal`` option. |
| 56 | + |
| 57 | +While other array API inputs (e.g. torch tensors with data allocated on a non-SYCL device) might be supported |
| 58 | +by the |sklearnex| in cases where the same class from |sklearn| doesn't support array API, note that the data will |
| 59 | +be transferred to host if it isn't already, and the computations will happen on CPU. |
| 60 | + |
| 61 | +.. hint:: |
| 62 | + Enable :ref:`verbose` to see information about whether data transfers happen during an operation or not, |
| 63 | + whether an accelerated version from the extension is used, and where (CPU/device) the operation is executed. |
| 64 | + |
| 65 | +When passing array API inputs to methods such as ``.predict()`` of estimators with array API support, the output |
| 66 | +will always be of the same class as the inputs, but be aware that array attributes of fitted models (e.g. ``coef_`` |
| 67 | +in a linear model) will not necessarily be of the same class as array API inputs passed to ``.fit()``, even though |
| 68 | +in many cases they are. |
| 69 | + |
| 70 | +.. warning:: |
| 71 | + If array API inputs are passed to an estimator's ``.fit()``, subsequent data passed to methods such as |
| 72 | + ``.predict()`` or ``.score()`` of the fitted model might be of a different class than the ``X``/``y`` passed to |
| 73 | + ``.fit()``, but **it must reside on the same device**. That is, a model that was fitted with GPU arrays cannot |
| 74 | + make predictions on CPU arrays, and a model fitted with CPU array API inputs cannot make predictions on GPU |
| 75 | + arrays, even if they are of the same class. Attempting to pass data on the wrong device might lead to |
| 76 | + process-wide crashes. |
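Since passing data on the wrong device might crash the process, it can help to check the device before calling ``.predict()``. The following is a minimal sketch and not part of the |sklearnex| API: ``check_same_device`` and ``FakeArray`` are hypothetical names. It assumes that inputs expose the ``.device`` attribute defined by the array API standard, that device objects can be compared with ``!=``, and that objects without the attribute live on host.

```python
class FakeArray:
    """Hypothetical stand-in for an array API object with a ``device`` attribute."""

    def __init__(self, device):
        self.device = device


def check_same_device(fit_device, data):
    """Raise a regular Python error if ``data`` is not on the fitting device."""
    # Assumption: objects without a ``device`` attribute are host (CPU) data.
    data_device = getattr(data, "device", "cpu")
    if data_device != fit_device:
        raise ValueError(
            f"model was fitted on {fit_device!r} but data is on {data_device!r}"
        )
    return data


X_train = FakeArray(device="gpu")
X_new_gpu = FakeArray(device="gpu")
X_new_cpu = FakeArray(device="cpu")

# Remember the device at fit time, then validate inputs before predicting.
fit_device = X_train.device
checked = check_same_device(fit_device, X_new_gpu)
```

Calling ``check_same_device(fit_device, X_new_cpu)`` raises ``ValueError``, which is easier to handle than a process-wide crash.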
55 | 77 |
|
56 | 78 | .. note:: |
57 | | - As DPCTL or DPNP inputs contain SYCL contexts, they do not require `config_context(target_offload=device)`. |
58 | | - However, the use of `config_context`` will override the contained SYCL context and will force movement |
59 | | - of data to the targeted device. |
| 79 | + The ``target_offload`` option in config contexts and settings is not intended to work with array API |
| 80 | + classes that have :external+dpctl:doc:`USM data <api_reference/dpctl/memory>`. In order to ensure that computations |
| 81 | + happen on the intended device under array API, make sure that the data is already on the desired device. |
60 | 82 |
|
61 | 83 |
|
62 | | -Support for Array API-compatible inputs |
63 | | -======================================= |
64 | | -All patched estimators, metrics, tools and non-scikit-learn estimators functionally support Array API standard. |
65 | | -Extension for Scikit-learn preserves input data format for all outputs. For all array inputs except |
66 | | -SYCL USM arrays `dpnp's <https://github.com/IntelPython/dpnp>`__ ndarray and |
67 | | -`Data Parallel Control usm_ndarray <https://intelpython.github.io/dpctl/latest/index.html>`__ all computation |
68 | | -will be only accomplished on CPU unless specified by a `config_context`` with an available GPU device. |
| 84 | +Supported classes |
| 85 | +================= |
69 | 86 |
|
70 | | -Stock scikit-learn uses `config_context(array_api_dispatch=True)` for enabling Array API |
71 | | -`support <https://scikit-learn.org/1.5/modules/array_api.html>`__. |
72 | | -If `array_api_dispatch` is enabled and the installed Scikit-Learn version supports array API, then the original |
73 | | -inputs are used when falling back to Scikit-Learn functionality. |
| 87 | +The following patched classes have support for array API inputs: |
| 88 | + |
| 89 | +- :obj:`sklearnex.basic_statistics.BasicStatistics` |
| 90 | +- :obj:`sklearnex.basic_statistics.IncrementalBasicStatistics` |
| 91 | +- :obj:`sklearn.cluster.DBSCAN` |
| 92 | +- :obj:`sklearn.covariance.EmpiricalCovariance` |
| 93 | +- :obj:`sklearnex.covariance.IncrementalEmpiricalCovariance` |
| 94 | +- :obj:`sklearn.decomposition.PCA` |
| 95 | +- :obj:`sklearn.linear_model.LinearRegression` |
| 96 | +- :obj:`sklearn.linear_model.Ridge` |
| 97 | +- :obj:`sklearnex.linear_model.IncrementalLinearRegression` |
| 98 | +- :obj:`sklearnex.linear_model.IncrementalRidge` |
74 | 99 |
|
75 | 100 | .. note:: |
76 | | - Data Parallel Control usm_ndarray or DPNP ndarray inputs will use host numpy data copies when |
77 | | - falling back to Scikit-Learn since they are not array API compliant. |
78 | | -.. note:: |
79 | | - Functional support doesn't guarantee that after the model is trained, fitted attributes that are arrays |
80 | | - will also be from the same namespace as the training data. |
| 101 | + While full array API support is currently not implemented for all classes, :external+dpnp:doc:`dpnp.ndarray <reference/ndarray>` |
| 102 | + and :external+dpctl:doc:`dpctl.tensor <api_reference/dpctl/tensor>` inputs are supported by all the classes |
| 103 | + that have :ref:`GPU support <oneapi_gpu>`. Note however that if array API support is not enabled in |sklearn|, |
| 104 | + when passing these classes as inputs, data will be transferred to host and then back to device instead of being |
| 105 | + used directly. |
81 | 106 |
|
82 | 107 |
|
83 | 108 | Example usage |
84 | 109 | ============= |
85 | 110 |
|
86 | | -DPNP ndarrays |
87 | | -------------- |
| 111 | +GPU operations on GPU arrays |
| 112 | +---------------------------- |
88 | 113 |
|
89 | | -Here is an example code to demonstrate how to use `dpnp <https://github.com/IntelPython/dpnp>`__ arrays to |
90 | | -run `RandomForestRegressor` on a GPU without `config_context(array_api_dispatch=True)`: |
| 114 | +.. code-block:: python |
91 | 115 |
|
92 | | -.. literalinclude:: ../../examples/sklearnex/random_forest_regressor_dpnp.py |
93 | | - :language: python |
| 116 | + # Array API support from sklearn requires enabling it on SciPy too |
| 117 | + import os |
| 118 | + os.environ["SCIPY_ARRAY_API"] = "1" |
94 | 119 |
|
| 120 | + import numpy as np |
| 121 | + import dpnp |
| 122 | + from sklearnex import config_context |
| 123 | + from sklearnex.linear_model import LinearRegression |
95 | 124 |
|
96 | | -.. note:: |
97 | | - Functional support doesn't guarantee that after the model is trained, fitted attributes that are arrays |
98 | | - will also be from the same namespace as the training data. |
| 125 | + # Random data for a regression problem |
| 126 | + rng = np.random.default_rng(seed=123) |
| 127 | + X_np = rng.standard_normal(size=(100, 10), dtype=np.float32) |
| 128 | + y_np = rng.standard_normal(size=100, dtype=np.float32) |
| 129 | +
|
| 130 | + # DPNP offers an array-API-compliant class where data can be on GPU |
| 131 | + X = dpnp.array(X_np, device="gpu") |
| 132 | + y = dpnp.array(y_np, device="gpu") |
| 133 | +
|
| 134 | + # Important to note again that array API must be enabled on scikit-learn |
| 135 | + model = LinearRegression() |
| 136 | + with config_context(array_api_dispatch=True): |
| 137 | + model.fit(X, y) |
| 138 | +
|
| 139 | + # Fitted attributes are now of the same class as inputs |
| 140 | + assert isinstance(model.coef_, X.__class__) |
99 | 141 |
|
100 | | -For example, if `dpnp's <https://github.com/IntelPython/dpnp>`__ namespace was used for training, |
101 | | -then fitted attributes will be on the CPU and `numpy.ndarray` data format. |
| 142 | + # Predictions are also of the same class |
| 143 | + with config_context(array_api_dispatch=True): |
| 144 | + pred = model.predict(X[:5]) |
| 145 | + assert isinstance(pred, X.__class__) |
102 | 146 |
|
103 | | -DPCTL usm_ndarrays |
104 | | ------------------- |
105 | | -Here is an example code to demonstrate how to use `dpctl <https://intelpython.github.io/dpctl/latest/index.html>`__ |
106 | | -arrays to run `RandomForestClassifier` on a GPU without `config_context(array_api_dispatch=True)`: |
| 147 | + # Fitted models can be passed array API inputs of a different class |
| 148 | + # than the training data, as long as their data resides in the same |
| 149 | + # device. This now fits a model using a non-NumPy class whose data is on CPU. |
| 150 | + X_cpu = dpnp.array(X_np, device="cpu") |
| 151 | + y_cpu = dpnp.array(y_np, device="cpu") |
| 152 | + model_cpu = LinearRegression() |
| 153 | + with config_context(array_api_dispatch=True): |
| 154 | + model_cpu.fit(X_cpu, y_cpu) |
| 155 | + pred_dpnp = model_cpu.predict(X_cpu[:5]) |
| 156 | + pred_np = model_cpu.predict(X_cpu[:5].asnumpy()) |
| 157 | + assert isinstance(pred_dpnp, X_cpu.__class__) |
| 158 | + assert isinstance(pred_np, np.ndarray) |
| 159 | + assert pred_dpnp.__class__ != pred_np.__class__ |
107 | 160 |
|
108 | | -.. literalinclude:: ../../examples/sklearnex/random_forest_classifier_dpctl.py |
109 | | - :language: python |
110 | 161 |
|
111 | | -As on previous example, if `dpctl <https://intelpython.github.io/dpctl/latest/index.html>`__ Array API namespace was |
112 | | -used for training, then fitted attributes will be on the CPU and `numpy.ndarray` data format. |
| 162 | +``array-api-strict`` |
| 163 | +-------------------- |
113 | 164 |
|
114 | | -Use of `array-api-strict` |
115 | | -------------------------- |
| 165 | +Example code showcasing how to use `array-api-strict <https://github.com/data-apis/array-api-strict>`__ |
| 166 | +arrays to run patched :obj:`sklearn.cluster.DBSCAN`. |
116 | 167 |
|
117 | | -Here is an example code to demonstrate how to use `array-api-strict <https://github.com/data-apis/array-api-strict>`__ |
118 | | -arrays to run `DBSCAN`. |
| 168 | +.. toggle:: |
119 | 169 |
|
120 | | -.. literalinclude:: ../../examples/sklearnex/dbscan_array_api.py |
121 | | - :language: python |
| 170 | + .. literalinclude:: ../../examples/sklearnex/dbscan_array_api.py |
| 171 | + :language: python |