|
| 1 | +# Post Submit Testing |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +While this infrastructure is focused on premerge testing, it is also important |
| 6 | +to make sure that the specific configuration we are testing is tested post |
| 7 | +commit as well. This document outlines the motivation for the need to test this |
| 8 | +configuration post commit, how we plan on implementing this to ensure we get |
| 9 | +fast feedback scalably, and why we are utilizing this design over others. |
| 10 | + |
| 11 | +## Background/Motivation |
| 12 | + |
| 13 | +LLVM has two types of testing upstream: premerge and postcommit. The premerge |
| 14 | +testing is performed using Github Actions every time a pull request (PR) is |
| 15 | +updated before it is merged. Premerge testing is performed using this |
| 16 | +infrastructure (specifically the `./premerge` folder in llvm-zorg). Landing a PR |
| 17 | +consists of squashing the changes into a single commit and adding that commit to |
| 18 | +the `main` branch in the LLVM monorepo. We care specifically about the state of |
| 19 | +the `main` branch because it is what the community considers to be the canonical |
| 20 | +tree. Currently, commits can also be added to the `main` branch by directly |
| 21 | +pushing to the main branch. Commits pushed directly to `main` are not tested |
| 22 | +through the premerge pipeline as they skip the PR merge process. After a new |
| 23 | +commit lands in the `main` branch, postcommit testing is performed. Most |
| 24 | +postcommit testing is performed through the Buildbot infrastructure. The main |
| 25 | +Buildbot instance for LLVM has a web instance hosted at |
| 26 | +[lab.llvm.org](https://lab.llvm.org/buildbot/#/builders). When a new commit |
| 27 | +lands in `main` the Buildbot instance (sometimes referred to as the Buildbot |
| 28 | +master) will trigger many different builds, base on the configurations |
| 29 | +defined in the llvm-zorg repository under the `buildbot/` folder. These |
| 30 | +configurations are run on Buildbot workers that are hosted by the community. |
| 31 | +Some builders build too slowly to keep up with the pace of commits to `main`, |
| 32 | +so test batches of commits. This often results in a large number of |
| 33 | +erroneous notifications due to the list of possible culprits for a breakage |
| 34 | +being more than a single commit. |
| 35 | + |
| 36 | +For premerge testing, we do not want to notify LLVM developers about failures |
| 37 | +already happening in `main` irrelevant to their changes. This requires knowing |
| 38 | +the state of `main` at the time the premerge testing for a PR was started. We |
| 39 | +also want information on the current state of `main` to empower the community |
| 40 | +with information that they need to revert or forward-fix problematic commits. |
| 41 | +Problematic commits can occur without being caught by the premerge system due to |
| 42 | +someone directly pushing a commit to `main`, or if multiple PRs become |
| 43 | +problematic only when combined. This means we need to test the premerge |
| 44 | +configuration postcommit as well so that we can determine the state of `main` |
| 45 | +(in terms of whether the build passed/failed and what tests failed, if any) at |
| 46 | +any given point in time. We can use this data to implement a "premerge advisor" |
| 47 | +that would prevent sending notifications about build/test failures not caused by |
| 48 | +the changes in a user's PR. |
| 49 | + |
| 50 | +## Design |
| 51 | + |
| 52 | +The LLVM Premerge system has two clusters, namely the central cluster in the |
| 53 | +Google Cloud Platform (GCP) zone `us-central1-a` and the west cluster in the GCP |
| 54 | +zone `us-west1`. We run two clusters in different zones for redundancy so that |
| 55 | +if one fails, we can still run jobs on the other cluster. For postcommit |
| 56 | +testing, we plan on setting up builders attached to the Buildbot master |
| 57 | +described above. We will run one builder on the central cluster and one in the |
| 58 | +west cluster. This ensures the configuration is highly available (able to |
| 59 | +tolerate an entire cluster going down), similar to the premerge testing. The |
| 60 | +builders will be configured to use a script that will launch testing on each |
| 61 | +commit to `main` as if it was being run through the premerge testing pipeline, with some small but significant differences. The post submit |
| 62 | +testing is intended to be close to the premerge configuration. but will be |
| 63 | +different in some key ways. The differences and motivation for them is described |
| 64 | +more thoroughly in the [testing configuration](#testing-configuration) section. |
| 65 | +These builds will be run inside containers that are distributed onto the cluster |
| 66 | +inside kubernetes pods (the fundamental schedulable unit inside kubernetes). |
| 67 | +This allows for kubernetes to handle details like what machine a build should |
| 68 | +run on. Allowing kubernetes to handle these details also enables Google |
| 69 | +Kubernetes Engine (GKE) to autoscale the node pools so we are not paying for |
| 70 | +unneeded capacity. Launching builds inside pods also allows for each builder to |
| 71 | +handle multiple builds at the same time. |
| 72 | + |
| 73 | +In terms of the full flow, any commit (which can be from direct pushes or |
| 74 | +merging pull requests) pushed to the LLVM monorepo will get detected by the |
| 75 | +buildbot master. The Buildbot master will invoke Buildbot workers running on our |
| 76 | +clusters. These Buildbot workers will use custom builders to launch a build |
| 77 | +wrapped in a kubernetes pod and report the results back to the buildbot master. |
| 78 | +When the job is finished, the pod will complete and capacity will be available |
| 79 | +for another build, or if there is nothing left to test GKE will see that there |
| 80 | +is nothing running on one of the nodes and downscale the node pool. |
| 81 | + |
| 82 | +### Annotated Builder |
| 83 | + |
| 84 | +llvm-zorg has multiple types of builders. We plan on using an AnnotatedBuilder. |
| 85 | +AnnotatedBuilders allow for the build to be driven using a custom python script |
| 86 | +rather than directly dictating the shell commands that should be run to perform |
| 87 | +the build. We need the flexibility of the AnnotatedBuilder to deploy jobs on the |
| 88 | +cluster. AnnotatedBuilder based builders also enable deploying changes without |
| 89 | +needing to restart the buildbot master. Without this, we have to wait for an |
| 90 | +administrator of the LLVM buildbot master to restart it before our changes get |
| 91 | +deployed. This could significantly delay updates or responses to incidents, |
| 92 | +especially before the system is fully stable. |
| 93 | + |
| 94 | +### Build Distribution |
| 95 | + |
| 96 | +We want to be able to take advantage of the autoscaling functionality of the new |
| 97 | +cluster to efficiently utilize resources. To do this, we plan on having the |
| 98 | +AnnotatedBuilder script launch builds as kubernetes pods. This allows for |
| 99 | +kubernetes to assign the builds to nodes and also allows autoscaling through the |
| 100 | +same mechanism that Github Actions Runner Controller (ARC) uses to autoscale. |
| 101 | +This enables us to quickly process builds at peak times and not pay for extra |
| 102 | +capacity when commit traffic is quiet, ensuring our resource use is efficient |
| 103 | +while still providing fast feedback. |
| 104 | + |
| 105 | +Using the kubernetes API inside of a python script (our AnnotatedBuilder |
| 106 | +implementation) to launch builds does add some complexity. However, we belive |
| 107 | +the additional complexity is justified as it allows us to achieve our goals |
| 108 | +while maintaining efficient resource usage. |
| 109 | + |
| 110 | +### Testing Configuration |
| 111 | + |
| 112 | +By testing configuration, we mean both the environment that the tests run in, |
| 113 | +and the set of tests that run. The testing configuration will be as close to the |
| 114 | +premerge configuration as possible. We will be running all tests inside the same |
| 115 | +container with the same scripts (the `monolithic-linux.sh` and |
| 116 | +`monolithic-windows.sh` scripts) used by the premerge testing. However, there |
| 117 | +will be one main difference between the premerge and postcommit testing |
| 118 | +configurations. In the postcommit configuration we propose testing all projects |
| 119 | +on every commit rather than only testing the projects that themselves changed or |
| 120 | +had dependencies that changed. We propose this for two main reasons. Firstly, |
| 121 | +Buildbot does not have good support for heterogenous build configurations. This |
| 122 | +means that testing a different set of projects within a single builder depending |
| 123 | +upon the contents of the commit could easily cause problems. More notifications |
| 124 | +could be produced if certain projects (that were only triggered by some files) |
| 125 | +were failing and some were passing which would significantly increase false |
| 126 | +positive notifications. For example, supposed that we have three commits that |
| 127 | +land in `main` and run through postcommit testing: commit A that touches MLIR, |
| 128 | +commit B that touches clang-tidy, and commit C that modifies MLIR. Commit A |
| 129 | +lands, then commit B, then commit C. If commit A introduces MLIR test failures |
| 130 | +into an otherwise clean slate, we would see the following events: |
| 131 | + |
| 132 | +1. Commit A lands. Because it touches MLIR, the buildbot worker runs the MLIR |
| 133 | + tests. Some of the tests fail. The buildbot "turns red" and a notification is |
| 134 | + sent out to the PR author. |
| 135 | +2. Commit B lands. Since it touches clang-tidy, the buildbot worker runs the |
| 136 | + clang-tidy tests. All of the tests pass. The buildbot "turns green". No |
| 137 | + notifications are sent out since everything is passing. |
| 138 | +3. Commit C lands. Since it touches MLIR, the buildbot workers runs the MLIR |
| 139 | + tests. The problem introduced in commit A still exists, so some tests fail. |
| 140 | + No new tests fail. Since the buildbot was previously green due to the |
| 141 | + interspersed clang-tidy commit, a notification is still sent out to the |
| 142 | + author of commit C. |
| 143 | + |
| 144 | +By running the tests for all projects in every postsubmit test run, we avoid |
| 145 | +the problematic situation described above. |
| 146 | + |
| 147 | +Another reason for running all the tests in every postsubmit run: When running |
| 148 | +premerge tests on a PR, we also explicitly do not test certain projects even |
| 149 | +though their dependencies change. While we do this because we suspect |
| 150 | +interactions resulting in test failures would be quite rare, it is possible, and |
| 151 | +having a postcommit configuration catch these rare failures would be useful. |
| 152 | + |
| 153 | +### Data Storage |
| 154 | + |
| 155 | +The hosted Buildbot master instance at [lab.llvm.org](https://lab.llvm.org) |
| 156 | +contains results for all recent postcommit runs. We plan on querying the results |
| 157 | +from the buildbot master because they are already available and that is where |
| 158 | +they will natively be reported after the infrastructure is set up. Buildbot |
| 159 | +supports a [REST API](https://docs.buildbot.net/latest/developer/rest.html) that |
| 160 | +would allow for easily querying the state of a commit in `main`. |
| 161 | + |
| 162 | +In the future, we may implement a "premerge advisor" that tells the user what |
| 163 | +tests/build failures they can safely ignore, we need to know what is currently |
| 164 | +failing on `main`. Each pull request is tested as if it was merged into main, |
| 165 | +which means the commit underneath the PR is very recent. If a premerge run |
| 166 | +fails, the premerge advisor will find the commit from `main` the PR is being |
| 167 | +tested on. It will then query the Buildbot master using the REST API for the |
| 168 | +status of that commit, or the preceeding commits if testing for the requested |
| 169 | +commit has not yet completed. It can then report the appropriate status to the |
| 170 | +user. Having the status will let the premerge advisor avoid pestering LLVM |
| 171 | +developers with failures unrelated to their changes. |
| 172 | + |
| 173 | +## Alternatives Considered |
| 174 | + |
| 175 | +Originally, we were looking at running postcommit testing through Github |
| 176 | +Actions, like the premerge tests. This is primarily due to it being easy to |
| 177 | +implement (a single line change in the Github Actions workflow config) and also |
| 178 | +easy to integrate with the Github API for implementation of the premerge testing |
| 179 | +advisor. More detailed motivation for the doing postcommit testing directly |
| 180 | +through Github is available in the |
| 181 | +[discourse RFC thread](https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124) |
| 182 | +where we proposed doing this. We eventually decided against implementation in |
| 183 | +this way for a couple of reasons: |
| 184 | + |
| 185 | +1. Nonstandard - The standard postcommit testing infrastructure for LLVM is |
| 186 | + through Buildbot. Doing postcommit testing for the premerge configuration |
| 187 | + through Github would represent a significant departure from this. This means |
| 188 | + we are leaving behind some common infrastructure and are also forcing a new |
| 189 | + unfamiliar postcommit interface on LLVM contributors. |
| 190 | +2. Notifications - This is the biggest issue. Github currently gives very little |
| 191 | + control over the notifications that are sent out when the build fails or gets |
| 192 | + cancelled. This is specifically a problem with Github sending out |
| 193 | + notifications for build failures even if the previous build has failed. This |
| 194 | + can easily create a lot of warning fatigue which is something we are putting |
| 195 | + a lot of effort in to avoid. We want the premerge system to be perceived as |
| 196 | + reliable, have people trust its results, and most importantly, have people |
| 197 | + pay attention to failures when they do occur. They are significantly more |
| 198 | + likely to pay attention when they are the author of the patch getting the |
| 199 | + notification and the feedback is actionable. |
| 200 | +3. Customization - Buildbot can be customized around issues like notifications |
| 201 | + whereas Github cannot. Github is not particularly responsive on feature |
| 202 | + requests and their notification story has been poor for a while, so their |
| 203 | + lack of customization is a strategic risk. |
0 commit comments