|
| 1 | +--- |
| 2 | +title: "Wrapping up GSoC 2025: Implement and improve an efficient, layered tape with prefetching capabilities" |
| 3 | +layout: post |
| 4 | +excerpt: "A summary of my GSoC 2025 project focusing on optimizing Clad's tape data structure by introducing slab-based memory, small buffer optimization, thread safety and multilayer storage." |
| 5 | +sitemap: true |
| 6 | +author: Aditi Milind Joshi |
| 7 | +permalink: blogs/gsoc25_aditi_final_blog/ |
| 8 | +banner_image: /images/blog/gsoc-clad-banner.png |
| 9 | +date: 2025-11-11 |
| 10 | +tags: gsoc clad clang c++ |
| 11 | +--- |
| 12 | + |
| 13 | +**Mentors:** Aaron Jomy, David Lange, Vassil Vassilev |
| 14 | + |
| 15 | +## A Brief Introduction |
| 16 | + |
| 17 | +### What is Automatic Differentiation? |
| 18 | + |
| 19 | +Automatic Differentiation (AD) is a computational technique that enables efficient and precise evaluation of derivatives for functions expressed in code. |
| 20 | + |
| 21 | +### What is Clad? |
| 22 | + |
| 23 | +Clad is a Clang-based automatic differentiation tool that transforms C++ source code to compute derivatives efficiently. |
| 24 | + |
| 25 | +### Tape in Clad |
| 26 | + |
| 27 | +The tape is a stack-like data structure used in reverse-mode AD: it stores intermediate values during the forward pass for use during the backward (gradient) pass. |
| 28 | + |
| 29 | +This project focuses on improving the implementation and efficiency of the tape by removing unnecessary allocations, adding support for features such as thread safety and offloading to disk, and enhancing the related benchmarks. |
| 30 | + |
| 31 | +### Understanding the Previous Implementation of the Tape and its Limitations |
| 32 | + |
| 33 | +In Clad’s previous implementation, the tape was a monolithic memory buffer: a contiguous dynamic array. Whenever a push exceeded the current capacity, the array grew by allocating a new block with double the capacity, moving all existing entries into it, and deallocating the old block. |
| 34 | + |
| 35 | +```cpp |
| 36 | +constexpr static std::size_t _init_capacity = 32; |
| 37 | +CUDA_HOST_DEVICE void grow() { |
| 38 | + // If empty, use initial capacity. |
| 39 | + if (!_capacity) |
| 40 | + _capacity = _init_capacity; |
| 41 | + else |
| 42 | + // Double the capacity on each reallocation. |
| 43 | + _capacity *= 2; |
| 44 | + T* new_data = AllocateRawStorage(_capacity); |
| 45 | + |
| 46 | + // Move values from old storage to the new storage. Should call move |
| 47 | + MoveData(begin(), end(), new_data); |
| 48 | + // Destroy all values in the old storage. |
| 49 | + destroy(begin(), end()); |
| 50 | + // delete the old data here to make sure we do not leak anything. |
| 51 | + ::operator delete(const_cast<void*>( |
| 52 | + static_cast<const volatile void*>(_data))); |
| 53 | + _data = new_data; |
| 54 | +} |
| 55 | +``` |
| 56 | + |
| 57 | +Because the storage doubled every time the capacity was exceeded, each resize paid for a fresh allocation plus a move of every existing entry. While this approach was lightweight for small problems, it scaled poorly for larger applications and parallel workloads. Frequent reallocations, the lack of thread safety, and the absence of offloading support limited Clad’s usability in complex scenarios. |
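To get a feel for the cost described above, here is a minimal sketch that replays the doubling policy and counts how many elements get moved. `GrowthCost` and `simulate_doubling` are hypothetical names made up for this post and are not part of Clad; the sketch only assumes the initial capacity of 32 and the 2x growth factor shown in `grow()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Clad): bookkeeping for the simulation.
struct GrowthCost {
  std::size_t capacity = 0;    // Capacity after the last push.
  std::size_t moved = 0;       // Total elements moved during reallocations.
  std::size_t allocations = 0; // Number of blocks allocated, including the first.
};

// Replays `pushes` push operations against a buffer that starts empty,
// allocates `init_capacity` elements first and doubles whenever it is full.
inline GrowthCost simulate_doubling(std::size_t pushes,
                                    std::size_t init_capacity = 32) {
  assert(init_capacity > 0);
  GrowthCost cost;
  for (std::size_t size = 0; size < pushes; ++size) {
    if (size == cost.capacity) {
      cost.capacity = cost.capacity ? cost.capacity * 2 : init_capacity;
      cost.moved += size; // Every live element moves to the new block.
      ++cost.allocations;
    }
  }
  return cost;
}
```

For one million pushes, `simulate_doubling(1000000)` reports 16 allocations and 1,048,544 moved elements, so the old tape moved slightly more elements than it ever stored.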
| 58 | + |
| 59 | +## Project Implementation |
| 60 | + |
| 61 | +### 1. Slab-based Tape |
| 62 | + |
| 63 | +Instead of reallocating memory, the new tape uses a slab-based allocation strategy: it allocates fixed-size memory chunks (slabs) and links them dynamically as the tape grows, so existing entries are never copied. Each time an element is pushed and the capacity is exceeded, a new slab is allocated and linked to the last slab, forming a linked list. |
| 64 | + |
| 65 | +```cpp |
| 66 | +struct Slab { |
| 67 | +  alignas(T) char raw_data[SLAB_SIZE * sizeof(T)]; |
| 68 | +  Slab* prev; |
| 69 | +  Slab* next; |
| 70 | +  CUDA_HOST_DEVICE Slab() : prev(nullptr), next(nullptr) {} |
| 71 | +  CUDA_HOST_DEVICE T* elements() { |
| 72 | +#if __cplusplus >= 201703L |
| 73 | +    return std::launder(reinterpret_cast<T*>(raw_data)); |
| 74 | +#else |
| 75 | +    return reinterpret_cast<T*>(raw_data); |
| 76 | +#endif |
| 77 | +  } |
| 78 | +}; |
| 79 | +``` |
| 80 | + |
| 81 | +### 2. Small Buffer Optimization |
| 82 | + |
| 83 | +To further optimize performance for small or short-lived tapes, a small buffer optimization (SBO) was introduced as part of the design. With SBO, elements are first pushed onto a small, fixed-size buffer held inside the tape itself. Only when this buffer overflows does the tape transition to heap-allocated slabs. |
| 84 | + |
| 85 | +```cpp |
| 86 | +alignas(T) char m_static_buffer[SBO_SIZE * sizeof(T)]; |
| 87 | +``` |
| 88 | + |
| 89 | +### 3. Further Tape Improvements |
| 90 | + |
| 91 | +Several further improvements were made to the slab-based tape during the project: a tail pointer to the last slab in use makes push and pop O(1), a capacity variable lets already allocated slabs be reused after pops, and doubly linking the slabs lets the tail pointer step back after a pop without traversing the whole tape. |
| 92 | + |
| 93 | +*Push Function:* |
| 94 | +```cpp |
| 95 | +template <typename... ArgsT> |
| 96 | +CUDA_HOST_DEVICE void emplace_back(ArgsT&&... args) { |
| 97 | +  if (m_size < SBO_SIZE) { |
| 98 | +    // Store in SBO buffer |
| 99 | +    ::new (const_cast<void*>(static_cast<const volatile void*>( |
| 100 | +        sbo_elements() + m_size))) T(std::forward<ArgsT>(args)...); |
| 101 | +  } else { |
| 102 | +    const auto offset = (m_size - SBO_SIZE) % SLAB_SIZE; |
| 103 | +    // Allocate new slab if required |
| 104 | +    if (!offset) { |
| 105 | +      if (m_size == m_capacity) { |
| 106 | +        Slab* new_slab = new Slab(); |
| 107 | +        if (!m_head) |
| 108 | +          m_head = new_slab; |
| 109 | +        else { |
| 110 | +          m_tail->next = new_slab; |
| 111 | +          new_slab->prev = m_tail; |
| 112 | +        } |
| 113 | +        m_capacity += SLAB_SIZE; |
| 114 | +      } |
| 115 | +      if (m_size == SBO_SIZE) |
| 116 | +        m_tail = m_head; |
| 117 | +      else |
| 118 | +        m_tail = m_tail->next; |
| 119 | +    } |
| 120 | + |
| 121 | +    // Construct element in-place |
| 122 | +    ::new (const_cast<void*>(static_cast<const volatile void*>( |
| 123 | +        m_tail->elements() + offset))) T(std::forward<ArgsT>(args)...); |
| 124 | +  } |
| 125 | +  m_size++; |
| 126 | +} |
| 127 | +``` |
| 128 | + |
| 129 | +*Pop Function:* |
| 130 | +```cpp |
| 131 | +CUDA_HOST_DEVICE void pop_back() { |
| 132 | + assert(m_size); |
| 133 | + m_size--; |
| 134 | + if (m_size < SBO_SIZE) |
| 135 | + destroy_element(sbo_elements() + m_size); |
| 136 | + else { |
| 137 | + std::size_t offset = (m_size - SBO_SIZE) % SLAB_SIZE; |
| 138 | + destroy_element(m_tail->elements() + offset); |
| 139 | + if (offset == 0) { |
| 140 | + if (m_tail != m_head) |
| 141 | + m_tail = m_tail->prev; |
| 142 | + } |
| 143 | + } |
| 144 | +} |
| 145 | +``` |
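The push and pop logic above can be tried out in isolation with a stripped-down version. The following is only a sketch: `MiniTape`, `slabs_allocated()` and the tiny default sizes are hypothetical and chosen for illustration, it handles trivially copyable element types only, and it leaves out the placement new, `CUDA_HOST_DEVICE` and destruction details of the real tape.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified tape: SBO buffer first, then doubly linked slabs.
template <typename T, std::size_t SBO_SIZE = 4, std::size_t SLAB_SIZE = 8>
class MiniTape {
  struct Slab {
    T data[SLAB_SIZE];
    Slab* prev = nullptr;
    Slab* next = nullptr;
  };
  T m_sbo[SBO_SIZE];
  Slab* m_head = nullptr;
  Slab* m_tail = nullptr;            // Slab currently being written to.
  std::size_t m_size = 0;
  std::size_t m_capacity = SBO_SIZE; // SBO slots plus all allocated slabs.
  std::size_t m_slabs = 0;

public:
  MiniTape() = default;
  MiniTape(const MiniTape&) = delete;
  MiniTape& operator=(const MiniTape&) = delete;
  ~MiniTape() {
    while (m_head) {
      Slab* next = m_head->next;
      delete m_head;
      m_head = next;
    }
  }

  void push(const T& value) {
    if (m_size < SBO_SIZE) {
      m_sbo[m_size] = value;
    } else {
      const std::size_t offset = (m_size - SBO_SIZE) % SLAB_SIZE;
      if (offset == 0) {
        if (m_size == m_capacity) { // No spare slab left: allocate one.
          Slab* slab = new Slab();
          if (!m_head)
            m_head = slab;
          else {
            m_tail->next = slab;
            slab->prev = m_tail;
          }
          m_capacity += SLAB_SIZE;
          ++m_slabs;
        }
        // Step onto the next slab, which may be a reused one.
        m_tail = (m_size == SBO_SIZE) ? m_head : m_tail->next;
      }
      m_tail->data[offset] = value;
    }
    ++m_size;
  }

  T pop() {
    assert(m_size > 0);
    --m_size;
    if (m_size < SBO_SIZE)
      return m_sbo[m_size];
    const std::size_t offset = (m_size - SBO_SIZE) % SLAB_SIZE;
    T value = m_tail->data[offset];
    // Slab emptied: step back, but keep the slab allocated for reuse.
    if (offset == 0 && m_tail != m_head)
      m_tail = m_tail->prev;
    return value;
  }

  std::size_t size() const { return m_size; }
  std::size_t slabs_allocated() const { return m_slabs; }
};
```

With these sizes, pushing 20 values fills the buffer and two slabs. Popping ten of them and pushing ten new ones should not allocate a third slab, because the capacity check finds the second slab still linked.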
| 146 | + |
| 147 | +### 3. Enhancements in Benchmarks |
| 148 | + |
| 149 | +- Benchmark Script: |
| 150 | +Added a benchmark script that takes two revisions (baseline and current), runs the benchmarks on each, and compares the results. |
| 151 | + |
| 152 | +- Configurable Benchmarks: |
| 153 | +Added configurable tape memory benchmarks that are parameterized on the slab and SBO sizes, making it possible to test different combinations and find the best-performing one. |
| 154 | + |
| 155 | +```cpp |
| 156 | +template <std::size_t SBO_SIZE, std::size_t SLAB_SIZE> |
| 157 | +static void BM_TapeMemory_Templated(benchmark::State& state) { |
| 158 | + int block = state.range(0); |
| 159 | + AddBMCounterRAII MemCounters(*mm.get(), state); |
| 160 | + for (auto _ : state) { |
| 161 | + clad::tape<double, SBO_SIZE, SLAB_SIZE> t; |
| 162 | + func<double, SBO_SIZE, SLAB_SIZE>(t, 1, block * 2 + 1); |
| 163 | + } |
| 164 | +} |
| 165 | + |
| 166 | +#define REGISTER_TAPE_BENCHMARK(sbo, slab) \ |
| 167 | + BENCHMARK_TEMPLATE(BM_TapeMemory_Templated, sbo, slab) \ |
| 168 | + ->RangeMultiplier(2) \ |
| 169 | + ->Range(0, 4096) \ |
| 170 | + ->Name("BM_TapeMemory/SBO_" #sbo "_SLAB_" #slab) |
| 171 | + |
| 172 | +REGISTER_TAPE_BENCHMARK(64, 1024); |
| 173 | +REGISTER_TAPE_BENCHMARK(32, 512); |
| 174 | +``` |
| 175 | +- Fixes in Benchmarks: |
| 176 | + - Removed ```Iterations(1)``` to get a better estimate from the benchmarks. |
| 177 | + - Fixed memory manager counters. |
| 178 | + - Added ```DoNotOptimize()``` to prevent the compiler from optimizing out the pop function. |
| 179 | + |
| 180 | +### 4. Tape Thread-Safety |
| 181 | + |
| 182 | +Added thread-safe tape access functions that guard each operation with a mutex, allowing concurrent access. Because locking has significant overhead, the access functions are overloaded: the separate thread-safe versions become the default tape access functions only when the ```is_multithread``` template parameter is set to ```true``` at tape initialization, so single-threaded code does not pay for the lock. |
| 183 | + |
| 184 | +```cpp |
| 185 | +/// Thread safe tape access functions with mutex locking mechanism |
| 186 | +#ifndef __CUDACC__ |
| 187 | + /// Add value to the end of the tape, return the same value. |
| 188 | + template <typename T, std::size_t SBO_SIZE = 64, std::size_t SLAB_SIZE = 1024, |
| 189 | + typename... ArgsT> |
| 190 | + T push(tape<T, SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to, |
| 191 | + ArgsT&&... val) { |
| 192 | + std::lock_guard<std::mutex> lock(to.mutex()); |
| 193 | + to.emplace_back(std::forward<ArgsT>(val)...); |
| 194 | + return to.back(); |
| 195 | + } |
| 196 | + |
| 197 | + /// A specialization for C arrays |
| 198 | + template <typename T, typename U, size_t N, std::size_t SBO_SIZE = 64, |
| 199 | + std::size_t SLAB_SIZE = 1024> |
| 200 | + void push(tape<T[N], SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to, |
| 201 | + const U& val) { |
| 202 | + std::lock_guard<std::mutex> lock(to.mutex()); |
| 203 | + to.emplace_back(); |
| 204 | + std::copy(std::begin(val), std::end(val), std::begin(to.back())); |
| 205 | + } |
| 206 | + |
| 207 | + /// Remove the last value from the tape, return it. |
| 208 | + template <typename T, std::size_t SBO_SIZE = 64, std::size_t SLAB_SIZE = 1024> |
| 209 | + T pop(tape<T, SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to) { |
| 210 | + std::lock_guard<std::mutex> lock(to.mutex()); |
| 211 | + T val = std::move(to.back()); |
| 212 | + to.pop_back(); |
| 213 | + return val; |
| 214 | + } |
| 215 | + |
| 216 | + /// A specialization for C arrays |
| 217 | + template <typename T, std::size_t N, std::size_t SBO_SIZE = 64, |
| 218 | + std::size_t SLAB_SIZE = 1024> |
| 219 | + void pop(tape<T[N], SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to) { |
| 220 | + std::lock_guard<std::mutex> lock(to.mutex()); |
| 221 | + to.pop_back(); |
| 222 | + } |
| 223 | + |
| 224 | + /// Access the last value in the tape. |
| 225 | + template <typename T, std::size_t SBO_SIZE = 64, std::size_t SLAB_SIZE = 1024> |
| 226 | + T& back(tape<T, SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& of) { |
| 227 | + std::lock_guard<std::mutex> lock(of.mutex()); |
| 228 | + return of.back(); |
| 229 | + } |
| 230 | +#endif |
| 231 | +``` |
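How the overloads get picked can be shown with a much smaller stand-in. This is a sketch under stated assumptions: `FlaggedTape` is a hypothetical vector-backed stack, not Clad's tape, and `lock_count` exists only so the effect of overload selection can be observed. A real implementation would likely avoid carrying a mutex in the single-threaded case.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the tape: the second template parameter mirrors
// the multithreading flag described above.
template <typename T, bool is_multithreaded = false>
class FlaggedTape {
  std::vector<T> m_data;
  std::mutex m_mutex;

public:
  std::size_t lock_count = 0; // Instrumentation for this sketch only.
  std::mutex& mutex() { return m_mutex; }
  template <typename... ArgsT> void emplace_back(ArgsT&&... args) {
    m_data.emplace_back(std::forward<ArgsT>(args)...);
  }
  T& back() {
    assert(!m_data.empty());
    return m_data.back();
  }
  void pop_back() {
    assert(!m_data.empty());
    m_data.pop_back();
  }
};

// Unsynchronized overloads, selected when the flag is false.
template <typename T, typename... ArgsT>
T push(FlaggedTape<T, false>& to, ArgsT&&... val) {
  to.emplace_back(std::forward<ArgsT>(val)...);
  return to.back();
}
template <typename T> T pop(FlaggedTape<T, false>& to) {
  T val = std::move(to.back());
  to.pop_back();
  return val;
}

// Locking overloads, selected when the flag is true.
template <typename T, typename... ArgsT>
T push(FlaggedTape<T, true>& to, ArgsT&&... val) {
  std::lock_guard<std::mutex> lock(to.mutex());
  ++to.lock_count;
  to.emplace_back(std::forward<ArgsT>(val)...);
  return to.back();
}
template <typename T> T pop(FlaggedTape<T, true>& to) {
  std::lock_guard<std::mutex> lock(to.mutex());
  ++to.lock_count;
  T val = std::move(to.back());
  to.pop_back();
  return val;
}
```

The call sites look identical for both kinds of tape. The compiler chooses the locking or the unsynchronized version purely from the tape's type, so the cost of the mutex is opt-in.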
| 232 | + |
| 233 | +### 5. Multilayer Storage (Ongoing) |
| 234 | + |
| 235 | +To scale AD beyond memory limits, a mechanism for offloading slabs to disk and loading them back into memory is being introduced. Instead of keeping all slabs in memory, only the last N slabs stay resident and the rest are offloaded to disk. One additional slab-sized slot is reserved for random access: if a requested element is not in memory, the slab containing it is loaded into that slot. |
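Since this part is still in progress, the following is only a rough sketch of the idea and not the actual design. `OffloadingTape`, its member names and the tiny sizes are all hypothetical, the `m_disk` vector merely stands in for slabs written to a real file, and the extra random-access slot is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical sketch: keeps at most MAX_RESIDENT slabs "in memory" and
// moves the oldest full slab to a simulated disk when that limit is hit.
template <std::size_t SLAB_SIZE = 4, std::size_t MAX_RESIDENT = 2>
class OffloadingTape {
  std::deque<std::vector<double>> m_resident; // Newest slab at the back.
  std::vector<std::vector<double>> m_disk;    // Stand-in for a file on disk.
  std::size_t m_size = 0;

public:
  void push(double value) {
    if (m_resident.empty() || m_resident.back().size() == SLAB_SIZE) {
      if (m_resident.size() == MAX_RESIDENT) {
        // Offload the oldest resident slab; it is always full at this point.
        m_disk.push_back(std::move(m_resident.front()));
        m_resident.pop_front();
      }
      m_resident.emplace_back();
      m_resident.back().reserve(SLAB_SIZE);
    }
    m_resident.back().push_back(value);
    ++m_size;
  }

  double pop() {
    assert(m_size > 0);
    if (m_resident.empty()) {
      // All resident slabs are consumed: load the newest offloaded slab.
      m_resident.push_back(std::move(m_disk.back()));
      m_disk.pop_back();
    }
    double value = m_resident.back().back();
    m_resident.back().pop_back();
    if (m_resident.back().empty())
      m_resident.pop_back();
    --m_size;
    return value;
  }

  std::size_t size() const { return m_size; }
  std::size_t resident_slabs() const { return m_resident.size(); }
  std::size_t offloaded_slabs() const { return m_disk.size(); }
};
```

Because the tape is used as a stack, the slabs needed next are always the most recently offloaded ones, which is what makes keeping only the last N slabs resident a natural fit.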
| 236 | + |
| 237 | +## Results and Benchmarks |
| 238 | + |
| 239 | +The current tape implementation was tested against the old tape, ```std::vector```, ```std::stack``` and Tapenade. |
| 240 | +The following were the results obtained: |
| 241 | + |
| 242 | + |
| 243 | + |
| 244 | +## Future Work |
| 245 | + |
| 246 | +- Supporting CPU-GPU memory transfers for future heterogeneous computing use cases. |
| 247 | +- Introducing checkpointing for optimal memory-computation trade-offs. |
| 248 | + |
| 249 | +--- |
| 250 | + |
| 251 | +## Related Links |
| 252 | + |
| 253 | +- [Clad Repository](https://github.com/vgvassilev/clad) |
| 254 | +- [Project Description](https://hepsoftwarefoundation.org/gsoc/2025/proposal_Clad-ImproveTape.html) |
| 255 | +- [GSoC Project Proposal](/assets/docs/Aditi_Milind_Joshi_Proposal_2025.pdf) |
| 256 | +- [My GitHub Profile](https://github.com/aditimjoshi) |