docs/source/index.rst (4 additions, 0 deletions)
@@ -19,6 +19,9 @@ vendors and compilers.
 `xsimd` provides a unified means for using these features for library authors. Namely, it enables manipulation of batches of numbers with the same arithmetic
 operators as for single values. It also provides accelerated implementation of common mathematical functions operating on batches.
 
+`xsimd` makes it easy to write a single algorithm, generate one version of the algorithm per micro-architecture and pick the best one at runtime, based on the
+running processor capability.
+
 You can find out more about this implementation of C++ wrappers for SIMD intrinsics at `The C++ Scientist`_. The mathematical functions are a
 lightweight implementation of the algorithms also used in `boost.SIMD`_.
 
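The "same arithmetic operators as for single values" idea above can be sketched in plain C++. The following `toy_batch` type is a hypothetical illustration, not xsimd's implementation: a fixed-size batch whose operators apply element-wise, which is the abstraction `xsimd` provides on top of real SIMD registers.

```cpp
#include <array>
#include <cstddef>

// Hypothetical illustration (NOT xsimd's API): a fixed-size "batch" whose
// arithmetic operators act element-wise, mimicking how xsimd lets you use
// the same operators on batches as on single values.
template <class T, std::size_t N>
struct toy_batch
{
    std::array<T, N> data;

    friend toy_batch operator+(toy_batch lhs, const toy_batch& rhs)
    {
        for (std::size_t i = 0; i < N; ++i)
            lhs.data[i] += rhs.data[i];  // element-wise addition
        return lhs;
    }

    friend toy_batch operator*(toy_batch lhs, T scalar)
    {
        for (std::size_t i = 0; i < N; ++i)
            lhs.data[i] *= scalar;  // element-wise scaling
        return lhs;
    }
};

// (a + b) * 0.5 computes the element-wise mean, written exactly as for scalars.
inline toy_batch<double, 4> toy_mean(const toy_batch<double, 4>& a,
                                     const toy_batch<double, 4>& b)
{
    return (a + b) * 0.5;
}
```

With a real `xsimd::batch`, the same expression would compile down to SIMD instructions instead of a scalar loop.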
@@ -80,6 +83,7 @@ This software is licensed under the BSD-3-Clause license. See the LICENSE file f
   api/batch_manip
   api/math_index
   api/aligned_allocator
+  api/dispatching
 
 .. _The C++ Scientist: http://johanmabille.github.io/blog/archives/
docs/source/installation.rst (6 additions, 6 deletions)
@@ -21,27 +21,27 @@
 Installation
 ============
 
-Although ``xsimd`` is a header-only library, we provide standardized means to install it, with package managers or with cmake.
+Although `xsimd` is a header-only library, we provide standardized means to install it, with package managers or with cmake.
 
-Besides the xsimd headers, all these methods place the ``cmake`` project configuration file in the right location so that third-party projects can use cmake's ``find_package`` to locate xsimd headers.
+Besides the `xsimd` headers, all these methods place the ``cmake`` project configuration file in the right location so that third-party projects can use cmake's ``find_package`` to locate `xsimd` headers.
 
 .. image:: conda.svg
 
 Using the conda-forge package
 -----------------------------
 
-A package for xsimd is available for the mamba (or conda) package manager.
+A package for `xsimd` is available for the `mamba <https://mamba.readthedocs.io>`_ (or `conda <https://conda.io>`_) package manager.
 
 .. code::
 
-    mamba install -c conda-forge xsimd
+    mamba install -c conda-forge xsimd
 
 .. image:: spack.svg
 
 Using the Spack package
 -----------------------
 
-A package for xsimd is available on the Spack package manager.
+A package for `xsimd` is available on the `Spack <https://spack.io>`_ package manager.
 
 .. code::
 
@@ -53,7 +53,7 @@ A package for xsimd is available on the Spack package manager.
 From source with cmake
 ----------------------
 
-You can also install ``xsimd`` from source with cmake. On Unix platforms, from the source directory:
+You can also install `xsimd` from source with `cmake <https://cmake.org/>`_. On Unix platforms, from the source directory:
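Once installed by any of these methods, the ``find_package`` workflow mentioned above looks roughly like the following from a consuming project's side. This is a hedged sketch: the project name is made up, and the ``xsimd`` target name is an assumption based on typical exported cmake package configurations.

```cmake
# Hypothetical consumer project; the exported target name is an assumption.
cmake_minimum_required(VERSION 3.15)
project(my_simd_app CXX)

find_package(xsimd REQUIRED)          # locates the installed xsimd cmake config

add_executable(my_simd_app main.cpp)
target_link_libraries(my_simd_app PRIVATE xsimd)  # header-only interface target
```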
docs/source/vectorized_code.rst (50 additions, 10 deletions)
@@ -28,8 +28,8 @@ How can we use `xsimd` to take advantage of vectorization?
 Explicit use of an instruction set
 ----------------------------------
 
-`xsimd` provides the template class ``batch<T, A>`` where ``A`` is the target architecture and ``T`` the type of the values involved in SIMD
-instructions. If you know which intruction set is available on your machine, you can directly use the corresponding specialization
+`xsimd` provides the template class :cpp:class:`xsimd::batch` parametrized by the types ``T`` and ``A``, where ``T`` is the type of the values involved in SIMD
+instructions and ``A`` is the target architecture. If you know which instruction set is available on your machine, you can directly use the corresponding specialization
 of ``batch``. For instance, assuming the AVX instruction set is available, the previous code can be vectorized the following way:
 
 .. code::
@@ -60,19 +60,19 @@ of ``batch``. For instance, assuming the AVX instruction set is available, the p
     }
 
 However, if you want to write code that is portable, you cannot rely on the use of ``batch<double, xsimd::avx>``.
-Indeed this won't compile on a CPU where only SSE2 instruction set is available for instance. Fortuantely, if you don't set the second template parameter, ``xsimd`` picks the best architecture among the one available, based on the compiler flag you use.
+Indeed, this won't compile on a CPU where only the SSE2 instruction set is available, for instance. Fortunately, if you don't set the second template parameter, `xsimd` picks the best architecture among the available ones, based on the compiler flags you use.
 
 
 Aligned vs unaligned memory
 ---------------------------
 
-In the previous example, you may have noticed the ``load_unaligned/store_unaligned`` functions. These
+In the previous example, you may have noticed the :cpp:func:`xsimd::batch::load_unaligned` and :cpp:func:`xsimd::batch::store_unaligned` functions. These
 are meant for loading values from contiguous dynamically allocated memory into SIMD registers and
 reciprocally. When dealing with memory transfer operations, some instruction sets require the memory
 to be aligned by a given amount, while others can handle both aligned and unaligned modes. In the latter case,
-operating on aligned memory is always faster than operating on unaligned memory.
+operating on aligned memory is generally faster than operating on unaligned memory.
 
-`xsimd` provides an aligned memory allocator which follows the standard requirements, so it can be used
+`xsimd` provides an aligned memory allocator, namely :cpp:class:`xsimd::aligned_allocator`, which follows the standard requirements, so it can be used
 with STL containers. Let's change the previous code so it can take advantage of this allocator:
 
 .. code::
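What an aligned allocator does can be sketched in plain C++17 with the aligned form of ``operator new``. This is an illustrative stand-in, not xsimd's actual `aligned_allocator` implementation; the class name and the 32-byte alignment choice are assumptions for the example.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Minimal standard-conforming aligned allocator, sketching what
// xsimd::aligned_allocator provides (this is NOT xsimd's implementation).
template <class T, std::size_t Alignment>
struct aligned_allocator_sketch
{
    using value_type = T;

    template <class U>
    struct rebind { using other = aligned_allocator_sketch<U, Alignment>; };

    aligned_allocator_sketch() = default;
    template <class U>
    aligned_allocator_sketch(const aligned_allocator_sketch<U, Alignment>&) {}

    T* allocate(std::size_t n)
    {
        // C++17 aligned operator new guarantees the requested alignment.
        return static_cast<T*>(::operator new(n * sizeof(T),
                                              std::align_val_t(Alignment)));
    }

    void deallocate(T* p, std::size_t)
    {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator_sketch<T, A>&,
                const aligned_allocator_sketch<U, A>&) { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator_sketch<T, A>&,
                const aligned_allocator_sketch<U, A>&) { return false; }

// Usage with an STL container: every buffer is 32-byte aligned (AVX-friendly).
using avx_vector = std::vector<double, aligned_allocator_sketch<double, 32>>;
```

With such a container, the aligned ``load_aligned``/``store_aligned`` entry points can be used safely instead of the unaligned ones.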
@@ -118,7 +118,7 @@ mechanism that allows you to easily write such a generic code:
     #include "xsimd/xsimd.hpp"
 
     template <class C, class Tag>
-    void mean(const C& a, const C& b, C& res)
+    void mean(const C& a, const C& b, C& res, Tag)
     {
         using b_type = xsimd::batch<double>;
         std::size_t inc = b_type::size;
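The ``Tag`` parameter added to ``mean`` above is an instance of the classic tag-dispatch idiom: empty tag types select an overload at compile time, at zero runtime cost. A minimal sketch of the idiom follows; the tag names mirror xsimd's ``aligned_mode``/``unaligned_mode``, but the ``load_hint`` functions are hypothetical illustrations, not xsimd API.

```cpp
#include <string>

// Empty tag types: their only job is to steer overload resolution.
// The names mirror xsimd's aligned_mode / unaligned_mode.
struct aligned_mode {};
struct unaligned_mode {};

// Hypothetical overload pair selected by the tag (not xsimd functions).
inline std::string load_hint(aligned_mode)   { return "aligned load"; }
inline std::string load_hint(unaligned_mode) { return "unaligned load"; }

// A generic algorithm just forwards the tag; the right overload is picked
// at compile time, exactly like the Tag parameter of mean() above.
template <class Tag>
std::string describe(Tag tag)
{
    return load_hint(tag);
}
```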
@@ -139,10 +139,50 @@ mechanism that allows you to easily write such a generic code:
         }
     }
 
-Here, the ``Tag`` template parameter can be ``xsimd::aligned_mode`` or ``xsimd::unaligned_mode``. Assuming the existence
-of a ``get_alignment_tag`` metafunction in the code, the previous code can be invoked this way:
+Here, the ``Tag`` template parameter can be :cpp:struct:`xsimd::aligned_mode` or :cpp:struct:`xsimd::unaligned_mode`. Assuming the existence
+of a ``get_alignment_tag`` meta-function in the code, the previous code can be invoked this way:

[…]

+This can be useful to implement runtime dispatching, based on the instruction set detected at runtime. `xsimd` provides a generic mechanism, :cpp:func:`xsimd::dispatch`, to implement
+this pattern. Based on the above example, instead of calling ``mean{}(arch, a, b, res, tag)``, one can use ``xsimd::dispatch(mean{})(a, b, res, tag)``.
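The runtime-dispatch pattern that ``xsimd::dispatch`` automates can be sketched in plain C++: compile several kernels, probe the CPU once, then route calls through the selected function pointer. Everything here is a hedged stand-in; in particular ``cpu_supports_avx()`` is a hypothetical placeholder for a real CPUID-based capability check, and ``sum_avx`` would be a genuinely AVX-compiled kernel in a real build.

```cpp
#include <cstddef>

namespace sketch
{
    // Baseline kernel, always available.
    inline double sum_scalar(const double* p, std::size_t n)
    {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            s += p[i];
        return s;
    }

    // Placeholder: a real build would compile this translation unit with AVX
    // flags and use SIMD batches here.
    inline double sum_avx(const double* p, std::size_t n)
    {
        return sum_scalar(p, n);
    }

    // Hypothetical capability probe (a real one would inspect CPUID).
    inline bool cpu_supports_avx() { return false; }

    using sum_fn = double (*)(const double*, std::size_t);

    // Chosen once, based on the running processor's capability; every later
    // call goes through the selected pointer. This is the pattern that
    // xsimd::dispatch implements generically over all supported architectures.
    inline sum_fn select_sum()
    {
        return cpu_supports_avx() ? sum_avx : sum_scalar;
    }
}
```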