Commit b678900

Add GSoC final presentation and wrap-up blog (#342)

* add GSoC final presentation and wrap-up blog
* updated spellings

1 parent 791cb06 · commit b678900

6 files changed: +265 −2 lines changed
.github/actions/spelling/allow/terms.txt — 4 additions & 0 deletions

```diff
@@ -7,6 +7,7 @@ CMSSW
 Caa
 Codegen
 Cppyy
+CUDACC
 Debian
 EPC
 forw
@@ -48,6 +49,7 @@ blogs
 cms
 codegen
 consteval
+cplusplus
 cppyy
 cytokine
 cytokines
@@ -70,9 +72,11 @@ pythonized
 ramview
 reoptimize
 samtools
+sbo
 sitemap
 softsusy
 superbuilds
+tapenade
 vimeo
 www
 xcolors
```

_data/crconlist2025.yml — 1 addition & 2 deletions

```diff
@@ -171,9 +171,8 @@
 tape operations. Ongoing work includes developing a multilayer tape system with
 offloading capabilities, which will allow only the most recent slabs to remain in
 memory.
-
 
-  # slides: /assets/presentations/...
+  slides: /assets/presentations/Aditi_Joshi_gsoc25_final_presentation.pdf
 
 - title: "Support usage of Thrust API in Clad"
   speaker:
```

_data/standing_meetings.yml — 4 additions & 0 deletions

```diff
@@ -3,6 +3,10 @@
 time_cest: "17:00"
 connect: "[Link to zoom](https://princeton.zoom.us/j/97915651167?pwd=MXJ1T2lhc3Z5QWlYbUFnMTZYQlNRdz09)"
 agenda:
+- title: "Wrap-Up: Implement and improve an efficient, layered tape with prefetching capabilities"
+  date: 2025-10-30 15:40:00 +0200
+  speaker: "Aditi Milind Joshi"
+  link: "[Slides](/assets/presentations/Aditi_Joshi_gsoc25_final_presentation.pdf)"
 - title: "Wrap-Up: Support usage of Thrust API in Clad"
   date: 2025-10-30 15:00:00 +0200
   speaker: "Abdelrhman Elrawy"
```
Lines changed: 256 additions & 0 deletions (new file)

---
title: "Wrapping up GSoC 2025: Implement and improve an efficient, layered tape with prefetching capabilities"
layout: post
excerpt: "A summary of my GSoC 2025 project focusing on optimizing Clad's tape data structure by introducing slab-based memory, small buffer optimization, thread safety and multilayer storage."
sitemap: true
author: Aditi Milind Joshi
permalink: blogs/gsoc25_aditi_final_blog/
banner_image: /images/blog/gsoc-clad-banner.png
date: 2025-11-11
tags: gsoc clad clang c++
---
**Mentors:** Aaron Jomy, David Lange, Vassil Vassilev

## A Brief Introduction

### What is Automatic Differentiation?

Automatic Differentiation (AD) is a computational technique that enables efficient and precise evaluation of derivatives for functions expressed in code.
### What is Clad?

Clad is a Clang-based automatic differentiation tool that transforms C++ source code to compute derivatives efficiently.

### Tape in Clad

The tape is a stack-like data structure used in reverse-mode AD: it stores intermediate values during the forward pass so they can be consumed during the backward (gradient) pass.
This project focuses on improving the implementation and efficiency of the tape by removing unnecessary allocations, adding support for features such as thread safety and offloading to disk, and enhancing the related benchmarks.

### Understanding the Previous Implementation of the Tape and its Limitations

In Clad's previous implementation, the tape was a monolithic memory buffer: a contiguous dynamic array. Each time a new entry was pushed onto the tape and the underlying capacity was exceeded, the array grew by allocating a new block of double the capacity, copying all existing entries into the new block, and deallocating the old one.
```cpp
constexpr static std::size_t _init_capacity = 32;

CUDA_HOST_DEVICE void grow() {
  // If empty, use the initial capacity.
  if (!_capacity)
    _capacity = _init_capacity;
  else
    // Double the capacity on each reallocation.
    _capacity *= 2;
  T* new_data = AllocateRawStorage(_capacity);

  // Move values from the old storage to the new storage.
  MoveData(begin(), end(), new_data);
  // Destroy all values in the old storage.
  destroy(begin(), end());
  // Delete the old data here to make sure we do not leak anything.
  ::operator delete(const_cast<void*>(
      static_cast<const volatile void*>(_data)));
  _data = new_data;
}
```
The storage was dynamically resized with a 2x growth factor whenever capacity was exceeded, which incurred expensive reallocation and copying overhead. While this approach was lightweight for small problems, it did not scale to larger applications or parallel workloads: frequent memory reallocations, the lack of thread safety, and the absence of offloading support limited Clad's usability in complex scenarios.
## Project Implementation

### 1. Slab-based Tape

Instead of reallocating memory, a slab-based memory allocation strategy is used: fixed-size memory chunks (slabs) are allocated and linked dynamically as the tape grows, eliminating the reallocation-and-copy cycle. Each time an element is pushed onto the tape and the capacity is exceeded, a new slab is allocated and linked to the last slab, forming a linked-list structure.
```cpp
struct Slab {
  alignas(T) char raw_data[SLAB_SIZE * sizeof(T)];
  Slab* prev;
  Slab* next;
  CUDA_HOST_DEVICE Slab() : prev(nullptr), next(nullptr) {}
  CUDA_HOST_DEVICE T* elements() {
#if __cplusplus >= 201703L
    return std::launder(reinterpret_cast<T*>(raw_data));
#else
    return reinterpret_cast<T*>(raw_data);
#endif
  }
};
```
### 2. Small Buffer Optimization

Additionally, to further optimize performance for small-scale or short-lived tapes, a small buffer optimization (SBO) was introduced as part of the design. With SBO, elements are initially pushed onto a small, statically allocated buffer. Only when this buffer overflows does the tape transition to heap-allocated slabs.

```cpp
alignas(T) char m_static_buffer[SBO_SIZE * sizeof(T)];
```
### 3. Further Tape Improvements

Further improvements were made to the slab-based tape during the course of the project: a tail pointer to the last slab reduces push and pop operations from O(n) traversals to O(1); a capacity variable allows slabs to be reused instead of freed; and the slab list was made doubly linked so the tail pointer can be updated after pop operations without traversing the entire tape.

*Push Function:*
```cpp
template <typename... ArgsT>
CUDA_HOST_DEVICE void emplace_back(ArgsT&&... args) {
  if (m_size < SBO_SIZE) {
    // Store in the SBO buffer.
    ::new (const_cast<void*>(static_cast<const volatile void*>(
        sbo_elements() + m_size))) T(std::forward<ArgsT>(args)...);
  } else {
    const auto offset = (m_size - SBO_SIZE) % SLAB_SIZE;
    // Allocate a new slab if required.
    if (!offset) {
      if (m_size == m_capacity) {
        Slab* new_slab = new Slab();
        if (!m_head)
          m_head = new_slab;
        else {
          m_tail->next = new_slab;
          new_slab->prev = m_tail;
        }
        m_capacity += SLAB_SIZE;
      }
      if (m_size == SBO_SIZE)
        m_tail = m_head;
      else
        m_tail = m_tail->next;
    }

    // Construct the element in place.
    ::new (const_cast<void*>(static_cast<const volatile void*>(
        m_tail->elements() + offset))) T(std::forward<ArgsT>(args)...);
  }
  m_size++;
}
```
*Pop Function:*
```cpp
CUDA_HOST_DEVICE void pop_back() {
  assert(m_size);
  m_size--;
  if (m_size < SBO_SIZE)
    destroy_element(sbo_elements() + m_size);
  else {
    std::size_t offset = (m_size - SBO_SIZE) % SLAB_SIZE;
    destroy_element(m_tail->elements() + offset);
    if (offset == 0) {
      if (m_tail != m_head)
        m_tail = m_tail->prev;
    }
  }
}
```
### 4. Enhancements in Benchmarks

- Benchmark Script: Added a benchmark script that takes two revisions (baseline and current), runs the benchmarks for both, and compares the results.
- Configurable Benchmarks: Added configurable tape memory benchmarks that are parameterized over slab and SBO sizes, to test and find the optimal sizes.
```cpp
template <std::size_t SBO_SIZE, std::size_t SLAB_SIZE>
static void BM_TapeMemory_Templated(benchmark::State& state) {
  int block = state.range(0);
  AddBMCounterRAII MemCounters(*mm.get(), state);
  for (auto _ : state) {
    clad::tape<double, SBO_SIZE, SLAB_SIZE> t;
    func<double, SBO_SIZE, SLAB_SIZE>(t, 1, block * 2 + 1);
  }
}

#define REGISTER_TAPE_BENCHMARK(sbo, slab)                 \
  BENCHMARK_TEMPLATE(BM_TapeMemory_Templated, sbo, slab)   \
      ->RangeMultiplier(2)                                 \
      ->Range(0, 4096)                                     \
      ->Name("BM_TapeMemory/SBO_" #sbo "_SLAB_" #slab)

REGISTER_TAPE_BENCHMARK(64, 1024);
REGISTER_TAPE_BENCHMARK(32, 512);
```
- Fixes in Benchmarks:
  - Removed `Iterations(1)` to get a better estimate of the benchmark timings.
  - Fixed the memory manager counters.
  - Added `DoNotOptimize()` to prevent the compiler from optimizing out the pop function.
### 5. Tape Thread-Safety

Thread-safe tape access functions with a mutex locking mechanism were added to allow concurrent access. Since the locking mechanism has significant overhead, the tape access functions were overloaded: separate thread-safe functions are selected as the default tape access functions by setting the `is_multithreaded` template parameter to `true` during tape initialization.
```cpp
/// Thread-safe tape access functions with a mutex locking mechanism.
#ifndef __CUDACC__
/// Add value to the end of the tape, return the same value.
template <typename T, std::size_t SBO_SIZE = 64, std::size_t SLAB_SIZE = 1024,
          typename... ArgsT>
T push(tape<T, SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to,
       ArgsT&&... val) {
  std::lock_guard<std::mutex> lock(to.mutex());
  to.emplace_back(std::forward<ArgsT>(val)...);
  return to.back();
}

/// A specialization for C arrays.
template <typename T, typename U, size_t N, std::size_t SBO_SIZE = 64,
          std::size_t SLAB_SIZE = 1024>
void push(tape<T[N], SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to,
          const U& val) {
  std::lock_guard<std::mutex> lock(to.mutex());
  to.emplace_back();
  std::copy(std::begin(val), std::end(val), std::begin(to.back()));
}

/// Remove the last value from the tape, return it.
template <typename T, std::size_t SBO_SIZE = 64, std::size_t SLAB_SIZE = 1024>
T pop(tape<T, SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to) {
  std::lock_guard<std::mutex> lock(to.mutex());
  T val = std::move(to.back());
  to.pop_back();
  return val;
}

/// A specialization for C arrays.
template <typename T, std::size_t N, std::size_t SBO_SIZE = 64,
          std::size_t SLAB_SIZE = 1024>
void pop(tape<T[N], SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& to) {
  std::lock_guard<std::mutex> lock(to.mutex());
  to.pop_back();
}

/// Access and return the last value in the tape.
template <typename T, std::size_t SBO_SIZE = 64, std::size_t SLAB_SIZE = 1024>
T& back(tape<T, SBO_SIZE, SLAB_SIZE, /*is_multithreaded=*/true>& of) {
  std::lock_guard<std::mutex> lock(of.mutex());
  return of.back();
}
#endif
```
### 6. Multilayer Storage (Ongoing)

To scale AD beyond memory limits, an offloading mechanism that writes slabs to disk and loads them back into memory is being introduced. Instead of keeping all slabs in memory, only the last N slabs are kept resident at a time and the rest are offloaded to disk. One slab slot is reserved for random access: if the element to be read is not in memory, its slab is loaded into this slot.
## Results and Benchmarks

The current tape implementation was benchmarked against the old tape, `std::vector`, `std::stack` and Tapenade. The results obtained were as follows:

![Benchmarks](/images/blog/gsoc25_tape_benchmarks.png)
## Future Work

- Supporting CPU-GPU memory transfers for future heterogeneous computing use cases.
- Introducing checkpointing for optimal memory-computation trade-offs.

---

## Related Links

- [Clad Repository](https://github.com/vgvassilev/clad)
- [Project Description](https://hepsoftwarefoundation.org/gsoc/2025/proposal_Clad-ImproveTape.html)
- [GSoC Project Proposal](/assets/docs/Aditi_Milind_Joshi_Proposal_2025.pdf)
- [My GitHub Profile](https://github.com/aditimjoshi)
Two binary files changed (275 KB and 78.2 KB); contents not shown.

0 commit comments
