Commit f0892c3

Add post about GPU CI in scikit-learn (#227)

* Add post about GPU CI in scikit-learn
* Fix lint
* Use earlier publishing date
* Fix date?
* Add links to page sections
* Add canonical URL
* Fix author tag
1 parent 6841c0f commit f0892c3

File tree

3 files changed: +181 -0 lines changed

content/authors/timhead.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
title: Tim Head
---

👋 Hi!

I am a scikit-learn core developer. Find out more about [Tim Head](https://betatim.github.io/) on
his website.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
---
title: Scikit-learn
---
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
---
title: "Automated tests with GPUs for your project"
date: 2024-07-19T15:24:17+01:00
draft: false
description: "Setting up CI with a GPU to test your code"
tags: ["scikit-learn", "ci", "gpu", "cuda"]
displayInList: true
author: ["Tim Head <timhead>"]
canonicalURL: https://betatim.github.io/posts/github-action-with-gpu/
---

TL;DR: If you have GPU code in your project, set up a GitHub-hosted GPU runner today.
It is fairly quick to do and will free you from having to run tests manually.

Writing automated tests for your code base, and certainly for its more complex parts,
has become as normal as brushing your teeth in the morning. Having a system
that automatically runs a project's tests for every Pull Request
is completely normal. However, until recently it was very complex and expensive
to set up a system that can run tests on a machine with a GPU. This meant that,
when dealing with GPU-related code, we were thrown back into the dark ages where
you had to rely on manual testing.

In this blog post I will describe how we set up a GitHub Actions based GPU runner
for the scikit-learn project and the things we learnt along the way. The goal is
to give you some additional information and details about the setup we now use.

- [Setting up larger runners for your project](#larger-runners-with-gpus)
- [VM image contents and setup](#vm-image-contents)
- [Workflow configuration](#workflow-configuration)
- [Bonus material](#bonus-material)

## Larger runners with GPUs

All workflows for your GitHub project are executed on a runner. Normally all your
workflows run on the default runner, but you can have additional runners too. If you
wanted to, you could host a runner yourself on your own infrastructure. Until recently
this was the only way to get access to a runner with a GPU. However, hosting your
own runner is complicated and comes with security pitfalls.

Since about April 2024 GitHub has made [larger runners with a
GPU](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners) generally available.

To use these you will have to [set up a credit card for your organisation](https://docs.github.com/en/billing/managing-your-github-billing-settings/adding-or-editing-a-payment-method#updating-your-organizations-payment-method). Configure a spending limit so that you do not end up getting surprised
by a very large bill. For scikit-learn we currently use a limit of $50.

When [adding a new GitHub hosted runner](https://github.com/organizations/YOUR_OWN_ORG_NAME/settings/actions/runners) make sure to select the "Partner" tab when
choosing the VM's image. You need to select the "NVIDIA GPU-Optimized Image for AI and HPC"
image in order to be able to choose the GPU runner later on.

The group the runner is assigned to can be configured to only allow particular repositories
and workflows to use it. It makes sense to only enable the runner
group for the repository in which you plan to use it. Limiting which workflows your
runner will pick up requires an additional level of indirection in your workflow
setup, so I will not cover it in this blog post.

Name your runner group `cuda-gpu-runner-group` to match the name used in the examples
below.

## VM image contents

The GPU runner uses a disk image provided by NVIDIA. This means that there are
some differences from the image that the default runner uses.

The `gh` command-line utility is not installed by default. Keep this in mind
if you want to do things like removing a label from the Pull Request or
other such tasks.
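
Since `gh` is packaged on conda-forge, one way to get it is an extra workflow step along
these lines (a sketch, assuming the job already has a usable conda installation on its `PATH`):

```yaml
# Hypothetical step: gh is not preinstalled on the NVIDIA image,
# but it is available from the conda-forge channel.
- name: Install the gh CLI
  run: conda install --yes --channel conda-forge gh
```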

The biggest difference from the standard image is that the GPU image contains
a conda installation, but the file permissions do not allow the workflow user
to modify the existing environment or create new environments. As a result,
for scikit-learn we install conda a second time via Miniforge. The conda environment is
created from a lockfile, so we do not need to run the dependency solver.
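
As a sketch of what that second installation can look like (the download URL pattern,
environment name, and lockfile path are illustrative, not scikit-learn's exact setup),
a workflow step might do:

```yaml
# Hypothetical step: install a user-writable Miniforge, because the
# preinstalled conda cannot be modified by the workflow user.
- name: Install Miniforge and create the test environment
  run: |
    curl -L -o miniforge.sh \
      https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
    bash miniforge.sh -b -p "$HOME/miniforge3"
    # An explicit lockfile means conda only downloads packages; no solver run
    "$HOME/miniforge3/bin/conda" create --yes --name testenv \
      --file build_tools/gpu_lock.conf
```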

## Workflow configuration

A key difference between the GPU runner and the default runner is that a project
has to pay for the GPU runner's time. This means that you might want to
execute your GPU workflow only for some Pull Requests instead of all of them.

The GPU available in the runner is not very powerful, which makes it a less
attractive target for people looking to abuse free GPU resources.
Nevertheless, once in a while someone might try. This is another reason not to run
the GPU workflow by default.

A nice way to run the workflow only after some form of human review
is to use a label. To mark a Pull Request (PR) for execution on the GPU runner a
reviewer applies a particular label. Applying a label does not cause a notification
to be sent to all PR participants, unlike using a special comment to trigger the
workflow.

In the following example the `CUDA CI` label is used to mark a PR for execution and
the `runs-on` directive is used to select the GPU runner. This is a snippet from
[the full GPU workflow](https://github.com/scikit-learn/scikit-learn/blob/9d39f57399d6f1f7d8e8d4351dbc3e9244b98d28/.github/workflows/cuda-ci.yml) used in the scikit-learn repository.

```yaml
name: CUDA GPU
on:
  pull_request:
    types:
      - labeled

jobs:
  tests:
    if: contains(github.event.pull_request.labels.*.name, 'CUDA CI')
    runs-on:
      group: cuda-gpu-runner-group
    steps:
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12.3'
      - name: Checkout main repository
        uses: actions/checkout@v4
      ...
```

To remove the label again we need a workflow with elevated
permissions, because it has to be able to edit a Pull Request. This privilege is not
available to workflows triggered by Pull Requests from forks. Instead
the workflow has to run in the context of the main repository and should only
do the minimum amount of work.

```yaml
on:
  # Using `pull_request_target` gives us the possibility to get an API token
  # with write permissions
  pull_request_target:
    types:
      - labeled

# In order to remove the "CUDA CI" label we need write permissions for PRs
permissions:
  pull-requests: write

jobs:
  label-remover:
    if: contains(github.event.pull_request.labels.*.name, 'CUDA CI')
    runs-on: ubuntu-20.04
    steps:
      - uses: actions-ecosystem/action-remove-labels@v1
        with:
          labels: CUDA CI
```

This snippet is from the [label remover workflow](https://github.com/scikit-learn/scikit-learn/blob/9d39f57399d6f1f7d8e8d4351dbc3e9244b98d28/.github/workflows/cuda-label-remover.yml)
we use in scikit-learn.

## Bonus material

For scikit-learn we have been using the GPU runner for about six weeks. So far we have stayed
below the $50 monthly spending limit we set. This includes some runs to debug the setup at the
start.

One of the scikit-learn contributors created a [Colab notebook that people can use to set up and run the scikit-learn test suite on Colab](https://gist.github.com/EdAbati/ff3bdc06bafeb92452b3740686cc8d7c). This is useful
for contributors who do not have easy access to a GPU. They can test their changes or debug
failures without having to wait for a maintainer to label the Pull Request. We plan to add
a workflow that comments on PRs with information on how to use this notebook to increase its
discoverability.
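
Such a commenting workflow does not exist yet; a minimal sketch (the workflow name,
trigger, and comment text are hypothetical) could use `actions/github-script` to post
the pointer on newly opened PRs:

```yaml
# Hypothetical workflow: comment a link to the Colab notebook on new PRs.
name: Notebook pointer
on:
  pull_request_target:
    types:
      - opened

permissions:
  pull-requests: write

jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.payload.pull_request.number,
              body: "To run the GPU tests yourself, see the Colab notebook: " +
                "https://gist.github.com/EdAbati/ff3bdc06bafeb92452b3740686cc8d7c",
            });
```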

## Conclusion

Overall it was not too difficult to set up the GPU runner. It took a little bit of fiddling to
deal with the differences in VM image content, as well as a few iterations on how to set up
the workflows, given that we wanted to trigger them manually.

The GPU runner has been working reliably and picking up work when requested. It saves us (the
maintainers) a lot of time, as we do not have to check out a PR locally and run the tests
by hand.

The costs so far have been manageable, and it has been worth spending the money as it removes
a repetitive and tedious manual task from the reviewing workflow. However, it does require
having the funds and a credit card.
