Commit 3d86aeb

DOC: Describe behaviors from array API (#2707)
* update array api docs to reflect current situation
* reword
* correct behavior around target offload with array api
* fix link
* remove incorrect note about array-api-compat
1 parent 48078c8 commit 3d86aeb

5 files changed: +131 -77 lines changed

doc/sources/algorithms.rst

Lines changed: 1 addition & 0 deletions
@@ -264,6 +264,7 @@ Other Tasks
264264

265265
on GPU
266266
------
267+
.. _sklearn_algorithms_gpu:
267268

268269
.. seealso:: :ref:`oneapi_gpu`
269270

doc/sources/array_api.rst

Lines changed: 127 additions & 77 deletions
@@ -11,111 +11,161 @@
1111
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1212
.. See the License for the specific language governing permissions and
1313
.. limitations under the License.
14-
14+
.. include:: substitutions.rst
1515
.. _array_api:
1616

1717
=================
1818
Array API support
1919
=================
20-
The `Array API <https://data-apis.org/array-api/latest/>`_ specification defines
21-
a standard API for all array manipulation libraries with a NumPy-like API.
22-
Extension for Scikit-learn doesn't require
23-
`array-api-compat <https://github.com/data-apis/array-api-compat>`__ to be installed for
24-
functional support of the array API standard.
25-
In the current implementation, the functional support of array api follows the functional
26-
support of different array or DataFrame inputs and does not modify the precision of the
27-
input and output data formats unless necessary. Any array API input will be converted to host
28-
numpy.ndarrays and all internal manipulations with data will be done with these representations of
29-
the input data. DPNP's 'ndarray' and Data Parallel Control's 'usm_ndarray' have special handling
30-
requirements that are described in the relevant section of this document. Output values will in
31-
all relevant cases match the input data format.
32-
33-
.. note::
34-
Currently, only `array-api-strict <https://github.com/data-apis/array-api-strict>`__,
35-
`dpctl <https://intelpython.github.io/dpctl/latest/index.html>`__, `dpnp <https://github.com/IntelPython/dpnp>`__
36-
and `numpy <https://numpy.org/>`__ are known to work with sklearnex estimators.
37-
.. note::
38-
Stock Scikit-learn’s array API support requires `array-api-compat <https://github.com/data-apis/array-api-compat>`__ to be installed.
3920

40-
41-
Support for DPNP and DPCTL
42-
==========================
43-
The functional support of input data for sklearnex estimators also extended for SYCL USM array types.
44-
These include SYCL USM arrays `dpnp's <https://github.com/IntelPython/dpnp>`__ ndarray and
45-
`Data Parallel Control usm_ndarray <https://intelpython.github.io/dpctl/latest/index.html>`__.
46-
DPNP ndarray and Data Parallel Control usm_ndarray contain SYCL contexts which can be used for
47-
`sklearnex` device offloading.
21+
Overview
22+
========
23+
24+
Many estimators from the |sklearnex| support passing data classes that conform to the
25+
`Array API <https://data-apis.org/array-api/>`_ specification as inputs to methods like ``.fit()``
26+
and ``.predict()``, such as :external+dpnp:doc:`dpnp.ndarray <reference/ndarray>` or
27+
`torch.Tensor <https://docs.pytorch.org/docs/stable/tensors.html>`__. This is particularly
28+
useful for GPU computations, as it allows performing operations on inputs that are already
29+
on GPU without moving the data from host to device.
30+
31+
.. important::
32+
Array API is disabled by default in |sklearn|. In order to get array API support in the |sklearnex|, it must
33+
be :external+sklearn:doc:`enabled in scikit-learn <modules/array_api>`, which requires either changing
34+
global settings or using a ``config_context``, plus installing additional dependencies such as ``array-api-compat``.
35+
36+
When passing array API inputs whose data is on a SYCL-enabled device (e.g. an Intel GPU), as
37+
supported for example by `PyTorch <https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html>`__
38+
and |dpnp|, if array API support is enabled and the requested operation (e.g. call to ``.fit()`` / ``.predict()``
39+
on the estimator class being used) is :ref:`supported on device/GPU <sklearn_algorithms_gpu>`, computations
40+
will be performed on the device where the data lives, without involving any data transfers. Note that all of
41+
the inputs (e.g. ``X`` and ``y`` passed to ``.fit()`` methods) must be allocated on the same device for this to
42+
work. If the requested operation is not supported on the device where the data lives, then it will either fall
43+
back to |sklearn|, or to an accelerated CPU version from the |sklearnex| when supported - these are controllable
44+
through options ``allow_sklearn_after_onedal`` (default is ``True``) and ``allow_fallback_to_host`` (default is
45+
``False``), respectively, which are accepted by ``config_context`` and ``set_config`` after
46+
:ref:`patching scikit-learn <patching>` or when importing those directly from ``sklearnex``.
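The fallback rules described above can be summarized with a small toy model. This is only a sketch of the documented behavior, not the dispatch code of the |sklearnex|: the function ``choose_backend`` and its return labels are hypothetical, and the order in which the two options are consulted when both are enabled is an assumption.

```python
def choose_backend(
    gpu_supported: bool,
    cpu_supported: bool,
    allow_fallback_to_host: bool = False,
    allow_sklearn_after_onedal: bool = True,
) -> str:
    """Toy model of the documented fallback rules (hypothetical helper).

    The parameter defaults mirror the documented defaults of the two options.
    """
    if gpu_supported:
        # Operation is supported on the device where the data lives.
        return "sklearnex-device"
    if allow_fallback_to_host and cpu_supported:
        # Assumption: the accelerated CPU version is tried before scikit-learn.
        return "sklearnex-host"
    if allow_sklearn_after_onedal:
        return "sklearn"
    raise RuntimeError("no permitted fallback for this operation")


# Under default settings, an operation without GPU support goes to scikit-learn.
default_choice = choose_backend(gpu_supported=False, cpu_supported=True)
```

With ``allow_fallback_to_host=True`` the same call would select the accelerated CPU version in this model instead.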
4847

4948
.. note::
50-
Current support for DPNP and DPCTL usm_ndarray data can be copied and moved to and from device in sklearnex and have
51-
impacts on memory utilization.
52-
53-
DPCTL or DPNP inputs are not required to use `config_context(target_offload=device)`.
54-
`sklearnex` will use input usm_ndarray sycl context for device offloading.
49+
Under default settings for ``set_config`` / ``config_context``, operations that are not supported on GPU will
50+
fall back to |sklearn| instead of falling back to CPU versions from the |sklearnex|.
51+
52+
If array API is enabled for |sklearn| and the estimator being used has array API support on |sklearn| (which can be
53+
verified by attribute ``array_api_support`` from :obj:`sklearn.utils.get_tags`), then array API inputs whose data
54+
is allocated neither on CPU nor on a SYCL device will be forwarded directly to the unpatched methods from |sklearn|,
55+
without using the accelerated versions from this library, regardless of option ``allow_sklearn_after_onedal``.
56+
57+
While other array API inputs (e.g. torch arrays with data allocated on a non-SYCL device) might be supported
58+
by the |sklearnex| in cases where the same class from |sklearn| doesn't support array API, note that the data will
59+
be transferred to host if it isn't already, and the computations will happen on CPU.
60+
61+
.. hint::
62+
Enable :ref:`verbose` to see information about whether data transfers happen during an operation or not,
63+
whether an accelerated version from the extension is used, and where (CPU/device) the operation is executed.
64+
65+
When passing array API inputs to methods such as ``.predict()`` of estimators with array API support, the output
66+
will always be of the same class as the inputs, but be aware that array attributes of fitted models (e.g. ``coef_``
67+
in a linear model) will not necessarily be of the same class as array API inputs passed to ``.fit()``, even though
68+
in many cases they are.
69+
70+
.. warning::
71+
If array API inputs are passed to an estimator's ``.fit()``, subsequent data passed to methods such as
72+
``.predict()`` or ``.score()`` of the fitted model might be of a different class than the ``X``/``y`` passed to
73+
``.fit()``, but **it must reside on the same device** - meaning: a model that was fitted with GPU arrays cannot
74+
make predictions on CPU arrays, and a model fitted with CPU array API inputs cannot make predictions on GPU
75+
arrays, even if they are of the same class. Attempting to pass data on the wrong device might lead to
76+
process-wide crashes.
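Because a device mismatch can crash the process, it may be worth checking devices before calling ``.predict()`` or ``.score()``. The following is a minimal sketch, assuming the inputs expose the ``.device`` attribute defined by the array API standard and that device objects compare with ``==``. The helper ``check_same_device`` and the stand-in class ``FakeArray`` are hypothetical and not part of the |sklearnex|. Treating objects without a ``.device`` attribute as ``"cpu"`` is also an assumption.

```python
def check_same_device(*arrays):
    """Return the common device of all arrays, or raise ValueError on mismatch."""
    if not arrays:
        raise ValueError("at least one array is required")
    # Assumption: objects without a ``.device`` attribute live on the host.
    devices = [getattr(a, "device", "cpu") for a in arrays]
    first = devices[0]
    for dev in devices[1:]:
        if dev != first:
            raise ValueError(f"device mismatch: {first!r} vs {dev!r}")
    return first


class FakeArray:
    """Hypothetical stand-in for an array API object, used for illustration only."""

    def __init__(self, device):
        self.device = device


X_train = FakeArray("gpu")
X_new = FakeArray("gpu")
common = check_same_device(X_train, X_new)  # both report the same device
```

In practice, the arguments would be the array passed to ``.fit()`` and the array about to be passed to ``.predict()``.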
5577

5678
.. note::
57-
As DPCTL or DPNP inputs contain SYCL contexts, they do not require `config_context(target_offload=device)`.
58-
However, the use of `config_context`` will override the contained SYCL context and will force movement
59-
of data to the targeted device.
79+
The ``target_offload`` option in config contexts and settings is not intended to work with array API
80+
classes that have :external+dpctl:doc:`USM data <api_reference/dpctl/memory>`. In order to ensure that computations
81+
happen on the intended device under array API, make sure that the data is already on the desired device.
6082

6183

62-
Support for Array API-compatible inputs
63-
=======================================
64-
All patched estimators, metrics, tools and non-scikit-learn estimators functionally support Array API standard.
65-
Extension for Scikit-learn preserves input data format for all outputs. For all array inputs except
66-
SYCL USM arrays `dpnp's <https://github.com/IntelPython/dpnp>`__ ndarray and
67-
`Data Parallel Control usm_ndarray <https://intelpython.github.io/dpctl/latest/index.html>`__ all computation
68-
will be only accomplished on CPU unless specified by a `config_context`` with an available GPU device.
84+
Supported classes
85+
=================
6986

70-
Stock scikit-learn uses `config_context(array_api_dispatch=True)` for enabling Array API
71-
`support <https://scikit-learn.org/1.5/modules/array_api.html>`__.
72-
If `array_api_dispatch` is enabled and the installed Scikit-Learn version supports array API, then the original
73-
inputs are used when falling back to Scikit-Learn functionality.
87+
The following patched classes have support for array API inputs:
88+
89+
- :obj:`sklearnex.basic_statistics.BasicStatistics`
90+
- :obj:`sklearnex.basic_statistics.IncrementalBasicStatistics`
91+
- :obj:`sklearn.cluster.DBSCAN`
92+
- :obj:`sklearn.covariance.EmpiricalCovariance`
93+
- :obj:`sklearnex.covariance.IncrementalEmpiricalCovariance`
94+
- :obj:`sklearn.decomposition.PCA`
95+
- :obj:`sklearn.linear_model.LinearRegression`
96+
- :obj:`sklearn.linear_model.Ridge`
97+
- :obj:`sklearnex.linear_model.IncrementalLinearRegression`
98+
- :obj:`sklearnex.linear_model.IncrementalRidge`
7499

75100
.. note::
76-
Data Parallel Control usm_ndarray or DPNP ndarray inputs will use host numpy data copies when
77-
falling back to Scikit-Learn since they are not array API compliant.
78-
.. note::
79-
Functional support doesn't guarantee that after the model is trained, fitted attributes that are arrays
80-
will also be from the same namespace as the training data.
101+
While full array API support is currently not implemented for all classes, :external+dpnp:doc:`dpnp.ndarray <reference/ndarray>`
102+
and :external+dpctl:doc:`dpctl.tensor <api_reference/dpctl/tensor>` inputs are supported by all the classes
103+
that have :ref:`GPU support <oneapi_gpu>`. Note however that if array API support is not enabled in |sklearn|,
104+
when passing these classes as inputs, data will be transferred to host and then back to device instead of being
105+
used directly.
81106

82107

83108
Example usage
84109
=============
85110

86-
DPNP ndarrays
87-
-------------
111+
GPU operations on GPU arrays
112+
----------------------------
88113

89-
Here is an example code to demonstrate how to use `dpnp <https://github.com/IntelPython/dpnp>`__ arrays to
90-
run `RandomForestRegressor` on a GPU without `config_context(array_api_dispatch=True)`:
114+
.. code-block:: python
91115
92-
.. literalinclude:: ../../examples/sklearnex/random_forest_regressor_dpnp.py
93-
:language: python
116+
# Array API support from sklearn requires enabling it on SciPy too
117+
import os
118+
os.environ["SCIPY_ARRAY_API"] = "1"
94119
120+
import numpy as np
121+
import dpnp
122+
from sklearnex import config_context
123+
from sklearnex.linear_model import LinearRegression
95124
96-
.. note::
97-
Functional support doesn't guarantee that after the model is trained, fitted attributes that are arrays
98-
will also be from the same namespace as the training data.
125+
# Random data for a regression problem
126+
rng = np.random.default_rng(seed=123)
127+
X_np = rng.standard_normal(size=(100, 10), dtype=np.float32)
128+
y_np = rng.standard_normal(size=100, dtype=np.float32)
129+
130+
# DPNP offers an array-API-compliant class where data can be on GPU
131+
X = dpnp.array(X_np, device="gpu")
132+
y = dpnp.array(y_np, device="gpu")
133+
134+
# Important to note again that array API must be enabled on scikit-learn
135+
model = LinearRegression()
136+
with config_context(array_api_dispatch=True):
137+
model.fit(X, y)
138+
139+
# Fitted attributes are now of the same class as inputs
140+
assert isinstance(model.coef_, X.__class__)
99141
100-
For example, if `dpnp's <https://github.com/IntelPython/dpnp>`__ namespace was used for training,
101-
then fitted attributes will be on the CPU and `numpy.ndarray` data format.
142+
# Predictions are also of the same class
143+
with config_context(array_api_dispatch=True):
144+
pred = model.predict(X[:5])
145+
assert isinstance(pred, X.__class__)
102146
103-
DPCTL usm_ndarrays
104-
------------------
105-
Here is an example code to demonstrate how to use `dpctl <https://intelpython.github.io/dpctl/latest/index.html>`__
106-
arrays to run `RandomForestClassifier` on a GPU without `config_context(array_api_dispatch=True)`:
147+
# Fitted models can be passed array API inputs of a different class
148+
# than the training data, as long as their data resides on the same
149+
# device. This now fits a model using a non-NumPy class whose data is on CPU.
150+
X_cpu = dpnp.array(X_np, device="cpu")
151+
y_cpu = dpnp.array(y_np, device="cpu")
152+
model_cpu = LinearRegression()
153+
with config_context(array_api_dispatch=True):
154+
model_cpu.fit(X_cpu, y_cpu)
155+
pred_dpnp = model_cpu.predict(X_cpu[:5])
156+
pred_np = model_cpu.predict(X_cpu[:5].asnumpy())
157+
assert isinstance(pred_dpnp, X_cpu.__class__)
158+
assert isinstance(pred_np, np.ndarray)
159+
assert pred_dpnp.__class__ != pred_np.__class__
107160
108-
.. literalinclude:: ../../examples/sklearnex/random_forest_classifier_dpctl.py
109-
:language: python
110161
111-
As on previous example, if `dpctl <https://intelpython.github.io/dpctl/latest/index.html>`__ Array API namespace was
112-
used for training, then fitted attributes will be on the CPU and `numpy.ndarray` data format.
162+
``array-api-strict``
163+
--------------------
113164

114-
Use of `array-api-strict`
115-
-------------------------
165+
Example code showcasing how to use `array-api-strict <https://github.com/data-apis/array-api-strict>`__
166+
arrays to run patched :obj:`sklearn.cluster.DBSCAN`.
116167

117-
Here is an example code to demonstrate how to use `array-api-strict <https://github.com/data-apis/array-api-strict>`__
118-
arrays to run `DBSCAN`.
168+
.. toggle::
119169

120-
.. literalinclude:: ../../examples/sklearnex/dbscan_array_api.py
121-
:language: python
170+
.. literalinclude:: ../../examples/sklearnex/dbscan_array_api.py
171+
:language: python

doc/sources/conf.py

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@
6565
"sphinx.ext.autodoc",
6666
"nbsphinx",
6767
"sphinx_tabs.tabs",
68+
"sphinx_togglebutton",
6869
"notfound.extension",
6970
"sphinx_design",
7071
"sphinx_copybutton",

doc/sources/substitutions.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
.. limitations under the License.
1414
1515
.. |dpctl| replace:: :external+dpctl:doc:`dpctl <index>`
16+
.. |dpnp| replace:: :external+dpnp:doc:`dpnp <index>`
1617
.. |sklearn| replace:: :external+sklearn:doc:`scikit-learn <index>`
1718
.. |intelex_repo| replace:: |sklearnex| repository
1819
.. _intelex_repo: https://github.com/uxlfoundation/scikit-learn-intelex

requirements-doc.txt

Lines changed: 1 addition & 0 deletions
@@ -58,6 +58,7 @@ sphinx-design==0.6.1
5858
sphinx_copybutton==0.5.2
5959
sphinx-notfound-page==1.1.0
6060
sphinx-tabs==3.4.7
61+
sphinx-togglebutton==0.3.2
6162
sphinx_rtd_theme==3.0.2
6263
sphinxcontrib-applehelp==2.0.0
6364
sphinxcontrib-devhelp==2.0.0
