
Commit 791cb06

Add final presentation slides and GSoC wrap-up blog (#341)
* Add final presentation slides and GSoC wrap-up blog

* Fix spellcheck

check-spelling run (pull_request) for final

Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
on-behalf-of: @check-spelling <check-spelling-bot@check-spelling.dev>

---------

Signed-off-by: check-spelling-bot <check-spelling-bot@users.noreply.github.com>
Co-authored-by: Rohan Timmaraju <r-timmaraju@users.noreply.github.com>
1 parent 9b83926 commit 791cb06

File tree: 9 files changed, +143 −5 lines
Lines changed: 11 additions & 0 deletions
@@ -1,12 +1,23 @@
+Akida
+backpropation
+cblas
 copyable
 endunless
 Galin
 genomics
+gpt
+inp
+Karpathy
+layernorm
+Loihi
+Neuromorphic
+neuromorphic
 optimizing
 previndex
 pubtit
 Reoptimization
 reoptimization
 Resugaring
+sgemm
 sustainability
 transitioning

_data/contributors.yml

Lines changed: 1 addition & 1 deletion
@@ -370,7 +370,7 @@
 info: "Google Summer of Code 2025 Contributor"
 email: rohan.timmaraju@gmail.com
 education: "B.S. Computer Science, Columbia University"
-github: "https://github.com/Rohan-T144"
+github: "https://github.com/r-timmaraju"
 active: 1
 linkedin: "https://www.linkedin.com/in/rohan-timmaraju-650ba3221/"
 projects:

_data/crconlist2025.yml

Lines changed: 1 addition & 1 deletion
@@ -151,7 +151,7 @@
 provides a strong foundation for future hardware acceleration, such as porting
 the implementation to CUDA.
 
-# slides: /assets/presentations/...
+slides: /assets/presentations/Rohan_Timmaraju_GSoC25_final.pdf
 
 - title: "Implement and improve an efficient, layered tape with prefetching capabilities"
 speaker:

_data/standing_meetings.yml

Lines changed: 5 additions & 0 deletions
@@ -433,3 +433,8 @@
 date: 2025-10-30 15:00:00 +0200
 speaker: "Salvador de la Torre Gonzalez"
 link: "[Slides](/assets/presentations/Salva_GSoC25_final_presentation_CART.pdf)"
+- title: "Final Presentation: Efficient LLM Training in C++ via Compiler-Level Autodiff with Clad"
+date: 2025-10-30 15:20:00 +0200
+speaker: "Rohan Timmaraju"
+link: "[Slides](/assets/presentations/Rohan_Timmaraju_GSoC25_final.pdf)"
+

_posts/2025-05-21-enhancing-llm-training.md

Lines changed: 2 additions & 2 deletions
@@ -7,7 +7,7 @@ author: Rohan Timmaraju
 permalink: blogs/gsoc25_rohan_introduction_blog/
 banner_image: /images/blog/LLM_project_banner.jpg
 date: 2025-05-21
-tags: gsoc c++ clang clad llm
+tags: gsoc c++ clang clad llm rohan-timmaraju
 ---
 
 ### Introduction
@@ -49,4 +49,4 @@ This project has the potential to make a valuable contribution to both the compi
 
 - [Project Description](https://hepsoftwarefoundation.org/gsoc/2025/proposal_Clad-LLM.html)
 - [Clad Repository](https://github.com/vgvassilev/clad)
-- [My GitHub Profile](https://github.com/Rohan-T144)
+- [My GitHub Profile](https://github.com/r-timmaraju)

_posts/2025-08-10-rohan-timmaraju-neo-flood-nasa.md

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@ date: 2025-10-08
 tags: [
 nasa,
 neo-flood,
-rohan-timmaraju,
 compiler-research,
 neuromorphic-computing,
 satellite-ai,
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
---
title: "GSoC 2025 Final: Compiler-Driven LLM Training with Clad and C++"
layout: post
excerpt: "My GSoC 2025 project: implementing LLM training in C++ using Clad for compiler-level automatic differentiation. We explore two implementation approaches, culminating in performance improvements over PyTorch on CPU."
sitemap: true
author: Rohan Timmaraju
permalink: blogs/gsoc25_rohan_final_blog/
banner_image: /images/blog/LLM_project_banner.jpg
date: 2025-11-10
tags: gsoc c++ clang clad llm
---

## Project Summary

When I began Google Summer of Code 2025, the goal was ambitious: to demonstrate that a C++-centric approach with compile-time automatic differentiation (AD) could be used to train Large Language Models (LLMs) efficiently. The hypothesis was that by using **Clad**, a Clang plugin for source-to-source AD, we could eliminate the overhead of other frameworks and enable deeper compiler optimizations.

This post details the journey from the initial concept to a fully functional end-to-end training pipeline. We explore the main technical decisions and the results that validate compiler-driven ML as a promising approach for high-performance computing.

---

### Phase 1: The `cladtorch` Library

Following the initial plan, I built `cladtorch`, a PyTorch-style tensor library with a clean, object-oriented API. It featured a `Tensor` class, encapsulation of state, and automatic memory management. We successfully implemented a full GPT-2 forward pass and, critically, achieved our core technical objective: **applying `clad::gradient` to the entire model's loss function**.

```cpp
// Our target: differentiate the whole model at once
auto grad_fn = clad::gradient(gpt2_loss, "model");
```
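
The same pattern works at toy scale, which is the easiest way to see the mechanics. The sketch below is illustrative only: `toy_loss` and its arguments are placeholders, not the project's actual API; the real `gpt2_loss` takes the full model and a token batch.

```cpp
#include "clad/Differentiator/Differentiator.h"

// Toy stand-in for the model loss, differentiated with respect to "w".
double toy_loss(double w, double x, double y) {
    double pred = w * x;
    double diff = pred - y;
    return diff * diff;
}

int main() {
    auto grad_fn = clad::gradient(toy_loss, "w");
    double w = 0.5, dw = 0.0;
    grad_fn.execute(w, /*x=*/2.0, /*y=*/3.0, &dw);  // dw = 2*(w*x - y)*x
    // A training step would then update the parameter: w -= lr * dw;
    return 0;
}
```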

This allows us to write neural networks like this:
```cpp
// Inside gpt2::LayerNorm
Tensor forward(const Tensor& input) const {
    auto norm = input.norm();
    auto tmp = norm * weight;
    return tmp + bias;
}
```
Clad will then automatically generate the backpropagation pass for this code via `clad::gradient`:
```cpp
// Simplified sketch of the pullback Clad generates for LayerNorm::forward;
// tmp and norm are the forward intermediates (stored or recomputed by Clad).
void forward_pullback(
    const Tensor& input, Tensor _d_y,
    gpt2::LayerNorm* _d_this, Tensor* _d_input
) const {
    Tensor _d_norm, _d_tmp;  // adjoints of the intermediates
    op_plus_pullback(&tmp, this->bias, _d_y, &_d_tmp, &_d_this->bias);
    op_star_pullback(&norm, this->weight, _d_tmp, &_d_norm, &_d_this->weight);
    norm_pullback(&input, _d_norm, _d_input);
}
```

Clad successfully processed loops, custom classes, and nested calls to generate the complete backward pass. This alone was a significant validation of Clad's capabilities. Initial benchmarks were encouraging: we matched the order-of-magnitude performance of Karpathy's `llm.c`. Profiling, however, revealed we were still 3-4x slower than PyTorch. The culprit wasn't matrix multiplication (our BLAS kernels were already optimized), but rather the abstraction overhead of our design: temporary object creation, dynamic memory allocation mid-training, and memory access patterns that weren't cache-friendly.

### Phase 2: Optimized Training Loop

Faced with this overhead, we made a decisive pivot. I rewrote the engine from scratch, trading API familiarity for raw speed. Inspired by `llm.c`'s simplified approach, the new design was built on two main ideas (a minimal sketch of both follows the list):

1. **Single Memory Arena**: One massive, pre-allocated `float*` buffer holds *all* model parameters, gradients, and activations. The `GPT2` struct simply contains pointers into this arena. This eliminates all dynamic allocation during training and dramatically improves data locality.

2. **Stateless Kernels**: Every operation (`matmul`, `layernorm`, `softmax`) became a pure C-style function operating on raw pointers. No classes, no hidden state, no RAII overhead in tight loops: just simple, predictable code that is much easier for the compiler to optimize.
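
To make this concrete, here is a minimal sketch of the two ideas together. Everything in it is illustrative (hypothetical struct and kernel names, toy sizes), not the engine's actual layout:

```cpp
#include <cstddef>
#include <vector>

// One contiguous allocation; tensors are just offsets into it.
struct Arena {
    std::vector<float> buf;
    std::size_t used = 0;
    explicit Arena(std::size_t total) : buf(total, 0.0f) {}
    float* alloc(std::size_t n) {  // hand out a slice; no per-tensor malloc
        float* p = buf.data() + used;
        used += n;
        return p;
    }
};

// The "model" is just a bag of raw pointers into the parameter arena.
struct TinyModel {
    float* ln_weight;  // [C]
    float* ln_bias;    // [C]
};

// A stateless kernel: no classes, no hidden state, only pointers and sizes.
void bias_add_forward(float* out, const float* inp, const float* bias,
                      int N, int C) {
    for (int n = 0; n < N; ++n)
        for (int c = 0; c < C; ++c)
            out[n * C + c] = inp[n * C + c] + bias[c];
}
```

Because parameters, gradients, and activations each come from one contiguous slice, a training step touches predictable memory and never calls `malloc`/`free` in the hot loop.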

This stateless kernel design turned out to be a great match for Clad. Instead of asking Clad to differentiate complex class methods, we could provide custom derivatives for each simple kernel using `clad::custom_derivatives`. Clad then orchestrates these hand-optimized pullbacks into the full backward pass.

```cpp
// The forward kernel: simple and stateless
void layernorm_forward(float* out, float* inp, float* weight, float* bias, int N, int C);

// Custom derivative registered with Clad
namespace clad::custom_derivatives {
void forward_pullback(...) { /* calls layernorm_forward_pullback */ }
}
```
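
What goes inside such a hand-written pullback? Below is a sketch of a LayerNorm backward kernel in the same stateless style, to show what Clad is orchestrating. It recomputes the per-row mean and reciprocal standard deviation for brevity (the real kernel would likely cache them from the forward pass), and the names are illustrative rather than the project's exact signatures:

```cpp
#include <cmath>

// Given dout = dL/dout, accumulate dL/dinp, dL/dweight, dL/dbias for a
// row-major [N, C] LayerNorm. Mean/rstd are recomputed here for simplicity.
void layernorm_forward_pullback(const float* dout, const float* inp,
                                const float* weight, float* dinp,
                                float* dweight, float* dbias, int N, int C) {
    const float eps = 1e-5f;
    for (int n = 0; n < N; ++n) {
        const float* x = inp + n * C;
        const float* dy = dout + n * C;
        float* dx = dinp + n * C;

        // Recompute mean and reciprocal standard deviation for this row.
        float mean = 0.0f, var = 0.0f;
        for (int c = 0; c < C; ++c) mean += x[c];
        mean /= C;
        for (int c = 0; c < C; ++c) { float d = x[c] - mean; var += d * d; }
        float rstd = 1.0f / std::sqrt(var / C + eps);

        // Two row-wise reductions needed by the input gradient.
        float dnorm_mean = 0.0f, dnorm_norm_mean = 0.0f;
        for (int c = 0; c < C; ++c) {
            float norm = (x[c] - mean) * rstd;
            float dnorm = dy[c] * weight[c];
            dnorm_mean += dnorm;
            dnorm_norm_mean += dnorm * norm;
        }
        dnorm_mean /= C;
        dnorm_norm_mean /= C;

        // Accumulate parameter and input gradients.
        for (int c = 0; c < C; ++c) {
            float norm = (x[c] - mean) * rstd;
            float dnorm = dy[c] * weight[c];
            dbias[c] += dy[c];
            dweight[c] += dy[c] * norm;
            dx[c] += (dnorm - dnorm_mean - norm * dnorm_norm_mean) * rstd;
        }
    }
}
```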

This approach gave us the best of both worlds: Clad's automated, compiler-level orchestration of the backpropagation graph and our manual optimization of the most performance-critical kernels.

---

### Benchmark Results

The results on an Apple M3 Max CPU speak for themselves. We measured the time for a full training iteration (forward + backward pass) of GPT-2 (124M parameters) across different batch sizes (B) and sequence lengths (T).
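
For reference, per-iteration timing of this kind needs nothing more than a steady clock around the training step. The sketch below is a hypothetical harness (the `train_step` callable stands in for one forward + backward pass), not the project's actual benchmarking code:

```cpp
#include <chrono>
#include <cstdio>

// Average wall-clock time per call of train_step, in milliseconds.
template <typename Step>
double ms_per_iteration(Step train_step, int iters = 10) {
    using clock = std::chrono::steady_clock;
    train_step();  // warm-up iteration, not timed
    auto t0 = clock::now();
    for (int i = 0; i < iters; ++i) train_step();
    auto t1 = clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}

// Usage sketch: std::printf("B=4, T=64: %.1f ms/iter\n", ms_per_iteration(step));
```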

<img src="/images/blog/llm-training-benchmarks.png" alt="LLM training benchmarks" style="max-width: 70%; height: auto; display: block; margin: 0 auto;">

**Key Findings**: Our implementation was consistently faster than PyTorch on CPU, with speedups approaching 2x. This demonstrates that a compiled, C++-first approach can surpass even the highly tuned PyTorch engine in a CPU-only environment.

The speedup stems directly from our design choices:

1. **Zero Python Overhead**: The entire training loop is a single, compiled binary. There are no Python-to-C++ context switches and no dynamic dispatch overhead. The compiler sees and can statically optimize the whole program, including the entire backpropagation graph.

2. **Cache-Friendly Memory Layout**: The single pre-allocated buffer ensures that model parameters and activations are laid out contiguously in memory. This improves cache line utilization and minimizes expensive fetches. Critically, it also eliminates the overhead from freeing and reallocating temporaries that more RAII-based C++ designs incur.

3. **Direct Hardware Access**: We call optimized BLAS libraries (such as Apple Accelerate) directly for `cblas_sgemm`, without any framework abstraction layers (see the sketch after this list). The stateless kernels also open the door for manual kernel fusion, further reducing memory bandwidth pressure.

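As an illustration of that last point, a matmul forward in this engine reduces to a single BLAS call. The wrapper below is a hedged sketch (hypothetical name and weight layout, following the common GPT-2 convention of a `[OC, C]` weight matrix), not the project's exact code:

```cpp
#include <cblas.h>  // Apple Accelerate also exposes this interface

// out[BT, OC] = inp[BT, C] * weight[OC, C]^T  (bias handled elsewhere)
void matmul_forward(float* out, const float* inp, const float* weight,
                    int BT, int C, int OC) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                BT, OC, C,
                1.0f, inp, C,      // A: [BT, C], lda = C
                weight, C,         // B: [OC, C], transposed, ldb = C
                0.0f, out, OC);    // C: [BT, OC], ldc = OC
}
```
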
---

### Key Achievements & Impact

- **Delivered Two Functional Implementations**: A flexible `cladtorch` prototype and a high-performance C-style engine, providing a comprehensive study in design trade-offs.
- **Validated Clad for Complex ML**: Successfully demonstrated end-to-end differentiation of a production-level GPT-2 model, showcasing Clad's readiness for real-world compiler research applications.
- **Surpassed PyTorch on CPU**: Achieved a significant performance milestone, proving the viability of compile-time AD for high-performance ML.
- **Created a Research Foundation**: The optimized, kernel-based architecture provides an ideal base for exploring GPU acceleration and novel compiler optimizations.

This project also illustrates the trade-off between developer ergonomics and raw speed. While `cladtorch`'s PyTorch-like API was pleasant to use, its abstraction overhead was fundamentally at odds with peak performance, and closing that gap with PyTorch would have required significant additional work. The C-style engine, though less "modern C++", is what allowed us to beat PyTorch.

### Future Work & Next Steps

The project's architecture opens several interesting avenues to explore further:

1. **GPU Acceleration**: The stateless, pointer-based kernel design is a good candidate for porting to CUDA. This will let us test our hypothesis on the hardware where training is done at scale.
2. **Clad-Driven Kernel Fusion**: We can leverage Clad's static analysis to automatically fuse sequences of operations (e.g., `softmax` + `cross_entropy`) into single, more efficient kernels, reducing memory bandwidth and kernel launch overhead. A hand-fused example of the kind of kernel this could produce is sketched below.

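To illustrate what such a fusion buys, here is a hand-fused softmax + cross-entropy forward pass in the same stateless style. It is an illustrative sketch (hypothetical name and layout), not something Clad currently generates:

```cpp
#include <cmath>

// Mean cross-entropy over N rows of logits[N, V] with integer targets,
// computed in one pass without materializing the softmax probabilities.
float fused_softmax_crossentropy(const float* logits, const int* targets,
                                 int N, int V) {
    float total = 0.0f;
    for (int n = 0; n < N; ++n) {
        const float* row = logits + n * V;
        float maxv = row[0];
        for (int v = 1; v < V; ++v) maxv = row[v] > maxv ? row[v] : maxv;
        float sum = 0.0f;
        for (int v = 0; v < V; ++v) sum += std::exp(row[v] - maxv);
        // loss_n = -log(softmax(row)[target_n]) = log(sum) - (row[t] - maxv)
        total += std::log(sum) - (row[targets[n]] - maxv);
    }
    return total / N;
}
```
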
### Conclusion

This GSoC project set out to explore whether compiler-level AD could make LLM training more efficient in C++ environments. Our implementation delivers clear performance improvements over PyTorch on CPU while also highlighting important trade-offs in C++ ML system design. By integrating deeply with the compiler via Clad, we've shown that compiler-driven ML can surpass even mature Python frameworks. This work provides a tangible, high-performance alternative for C++-centric HPC environments and offers the Compiler Research Group a powerful real-world benchmark for future Clad enhancements.

### Links & Resources

- [Clad Repository](https://github.com/vgvassilev/clad)
- [My GitHub Profile](https://github.com/r-timmaraju)

Binary files not shown (214 KB, 133 KB).

0 commit comments
