---
title: "Automated tests with GPUs for your project"
date: 2024-07-19T15:24:17+01:00
draft: false
description: "Setting up CI with a GPU to test your code"
tags: ["scikit-learn", "ci", "gpu", "cuda"]
displayInList: true
author: ["Tim Head <timhead>"]
canonicalURL: https://betatim.github.io/posts/github-action-with-gpu/
---

TL;DR: If you have GPU code in your project, set up a GitHub-hosted GPU runner today.
It is fairly quick to do and will free you from having to run tests manually.

Writing automated tests for your code base, certainly for its more complex parts,
has become as normal as brushing your teeth in the morning. Having a system
that automatically runs a project's tests for every Pull Request
is completely normal. However, until recently it was very complex and expensive
to set up a system that can run tests on a machine with a GPU. This meant that,
when dealing with GPU-related code, we were thrown back into the dark ages where
we had to rely on manual testing.

In this blog post I will describe how we set up a GitHub Actions based GPU runner
for the scikit-learn project and the things we learnt along the way. The goal is
to give you some additional details about the setup we now use.

- [Setting up larger runners for your project](#larger-runners-with-gpus)
- [VM image contents and setup](#vm-image-contents)
- [Workflow configuration](#workflow-configuration)
- [Bonus material](#bonus-material)

## Larger runners with GPUs

All workflows for your GitHub project are executed on a runner. Normally all
your workflows run on the default runner, but you can have additional runners
too. If you wanted to, you could host a runner yourself on your own
infrastructure. Until recently this was the only way to get access to a runner
with a GPU. However, hosting your own runner is complicated and comes with
pitfalls regarding security.

Since about April 2024, GitHub has made [larger runners with a
GPU](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners) generally available.

To use these you will have to [set up a credit card for your organisation](https://docs.github.com/en/billing/managing-your-github-billing-settings/adding-or-editing-a-payment-method#updating-your-organizations-payment-method). Configure a spending limit so that you do not end up being surprised
by a very large bill. For scikit-learn we currently use a limit of $50.

When [adding a new GitHub-hosted runner](https://github.com/organizations/YOUR_OWN_ORG_NAME/settings/actions/runners) make sure to select the "Partner" tab when
choosing the VM's image. You need to select the "NVIDIA GPU-Optimized Image for AI and HPC"
image in order to be able to choose the GPU runner later on.

The group the runner is assigned to can be configured to only allow particular
repositories and workflows to use it. It makes sense to only enable the runner
group for the repository in which you plan to use it. Limiting which workflows your
runner will pick up requires an additional level of indirection in your workflow
setup, so I will not cover it in this blog post.

Name your runner group `cuda-gpu-runner-group` to match the name used in the examples
below.

## VM Image contents

The GPU runner uses a disk image provided by NVIDIA. This means that there are
some differences to the image that the default runner uses.

The `gh` command-line utility is not installed by default. Keep this in mind
if you want to do things like removing a label from the Pull Request or
other such tasks.
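
If a step in your workflow needs `gh`, you can install it yourself. Below is a
minimal sketch of such a step, following GitHub's documented apt-based
installation; the step name and its placement in the workflow are my own, not
part of the scikit-learn setup:

```
- name: Install the gh CLI (the NVIDIA image does not ship it)
  run: |
    # Add GitHub's signing key and apt repository, then install gh,
    # as described in GitHub's official installation instructions
    curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg \
      | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" \
      | sudo tee /etc/apt/sources.list.d/github-cli.list
    sudo apt-get update && sudo apt-get install -y gh
```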

The biggest difference to the standard image is that the GPU image contains
a conda installation, but the file permissions do not allow the workflow user
to modify the existing environment or create new environments. As a result,
for scikit-learn we install conda a second time via Miniforge. The conda environment is
created from a lockfile, so we do not need to run the dependency solver.
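
A minimal sketch of what such a step could look like; the environment name and
the lockfile path `build_tools/cuda.lock` are hypothetical placeholders, not
scikit-learn's actual files:

```
- name: Install a user-writable conda via Miniforge
  run: |
    # Install Miniforge into the home directory, where the workflow
    # user has write access, leaving the system conda untouched
    curl -L -o miniforge.sh \
      https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
    bash miniforge.sh -b -p "$HOME/miniforge3"
    # Create the environment from an explicit lockfile, skipping the solver
    "$HOME/miniforge3/bin/conda" create -y -n testenv --file build_tools/cuda.lock
```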

## Workflow configuration

A key difference between the GPU runner and the default runner is that a project
has to pay for the GPU runner's time. This means that you might want to
execute your GPU workflow only for some Pull Requests instead of all of them.

The GPU available in the runner is not very powerful, which means it is not
that attractive a target for people looking to abuse free GPU resources.
Nevertheless, once in a while someone might try, which is another reason not to
run the GPU workflow by default.

A nice way to deal with running the workflow only after some form of human review
is to use a label. To mark a Pull Request (PR) for execution on the GPU runner a
reviewer applies a particular label. Applying a label does not cause a notification
to be sent to all PR participants, unlike using a special comment to trigger the
workflow.
In the following example the `CUDA CI` label is used to mark a PR for execution and
the `runs-on` directive is used to select the GPU runner. This is a snippet from
[the full GPU workflow](https://github.com/scikit-learn/scikit-learn/blob/9d39f57399d6f1f7d8e8d4351dbc3e9244b98d28/.github/workflows/cuda-ci.yml) used in the scikit-learn repository.

```
name: CUDA GPU
on:
  pull_request:
    types:
      - labeled

jobs:
  tests:
    if: contains(github.event.pull_request.labels.*.name, 'CUDA CI')
    runs-on:
      group: cuda-gpu-runner-group
    steps:
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12.3'
      - name: Checkout main repository
        uses: actions/checkout@v4
      ...
```
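
One sanity check worth adding early in such a workflow is a step that confirms
the job really landed on a machine with a working GPU. A minimal sketch of such
a step; this is my own addition, not part of the workflow linked above:

```
- name: Check that a GPU is available
  run: |
    # Fails the job early if the NVIDIA driver or the GPU is missing
    nvidia-smi
```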

In order to remove the label again we need a workflow with elevated
permissions: it needs to be able to edit a Pull Request. This privilege is not
available to workflows triggered by Pull Requests from forks. Instead
the workflow has to run in the context of the main repository and should only
do the minimum amount of work.
| 120 | + |
| 121 | +``` |
| 122 | +on: |
| 123 | + # Using `pull_request_target` gives us the possibility to get a API token |
| 124 | + # with write permissions |
| 125 | + pull_request_target: |
| 126 | + types: |
| 127 | + - labeled |
| 128 | +
|
| 129 | +# In order to remove the "CUDA CI" label we need to have write permissions for PRs |
| 130 | +permissions: |
| 131 | + pull-requests: write |
| 132 | +
|
| 133 | +jobs: |
| 134 | + label-remover: |
| 135 | + if: contains(github.event.pull_request.labels.*.name, 'CUDA CI') |
| 136 | + runs-on: ubuntu-20.04 |
| 137 | + steps: |
| 138 | + - uses: actions-ecosystem/action-remove-labels@v1 |
| 139 | + with: |
| 140 | + labels: CUDA CI |
| 141 | +``` |

This snippet is from the [label remover workflow](https://github.com/scikit-learn/scikit-learn/blob/9d39f57399d6f1f7d8e8d4351dbc3e9244b98d28/.github/workflows/cuda-label-remover.yml)
we use in scikit-learn.

## Bonus Material

For scikit-learn we have been using the GPU runner for about six weeks. So far we have stayed
below the $50 monthly spending limit we set. This includes some runs to debug the setup at the
start.

One of the scikit-learn contributors created a [Colab notebook that people can use to set up and run the scikit-learn test suite on Colab](https://gist.github.com/EdAbati/ff3bdc06bafeb92452b3740686cc8d7c). This is useful
for contributors who do not have easy access to a GPU. They can test their changes or debug
failures without having to wait for a maintainer to label the Pull Request. We plan to add
a workflow that comments on PRs with information on how to use this notebook to increase its
discoverability; a sketch of what that could look like follows below.
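
A minimal sketch of such a commenting workflow, reusing the `CUDA CI` label as
the trigger; the job name and comment text are hypothetical, and this is not
the final scikit-learn implementation:

```
on:
  pull_request_target:
    types:
      - labeled

permissions:
  pull-requests: write

jobs:
  colab-hint:
    if: contains(github.event.pull_request.labels.*.name, 'CUDA CI')
    runs-on: ubuntu-latest
    steps:
      - run: |
          # Post a comment pointing contributors at the Colab notebook
          gh pr comment "$PR_URL" --body \
            "You can also run the GPU tests yourself on Colab: https://gist.github.com/EdAbati/ff3bdc06bafeb92452b3740686cc8d7c"
        env:
          GH_TOKEN: ${{ github.token }}
          PR_URL: ${{ github.event.pull_request.html_url }}
```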

## Conclusion

Overall it was not too difficult to set up the GPU runner. It took a little bit of fiddling to
deal with the differences in VM image content, as well as a few iterations on how to set up
the workflows, given that we wanted to trigger them manually.

The GPU runner has been working reliably and picking up work when requested. It saves us (the
maintainers) a lot of time, as we do not have to check out a PR locally and run the tests
by hand.

The costs so far have been manageable, and it has been worth spending the money as it removes
a repetitive and tedious manual task from the reviewing workflow. However, it does require
having the funds and a credit card.