Commit 651c1f3 (parent 5b4634b) — [CI] Add design document for post submit testing (#512)

This patch adds the design document outlining how we plan on implementing post-submit testing for the premerge configuration, along with the motivation and alternatives considered.

1 file changed: premerge/post-submit-testing.md (203 additions, 0 deletions)

# Post Submit Testing

## Introduction

While this infrastructure is focused on premerge testing, it is also important
to make sure that the specific configuration we test premerge is tested post
commit as well. This document outlines the motivation for testing this
configuration post commit, how we plan to implement it so that we get fast
feedback scalably, and why we chose this design over the alternatives.

## Background/Motivation

LLVM has two types of testing upstream: premerge and postcommit. Premerge
testing is performed using GitHub Actions every time a pull request (PR) is
updated, before it is merged, and uses this infrastructure (specifically the
`./premerge` folder in llvm-zorg). Landing a PR consists of squashing the
changes into a single commit and adding that commit to the `main` branch of the
LLVM monorepo. We care specifically about the state of the `main` branch
because it is what the community considers to be the canonical tree. Currently,
commits can also be added to `main` by pushing directly to the branch. Commits
pushed directly to `main` are not tested through the premerge pipeline, as they
skip the PR merge process. After a new commit lands in `main`, postcommit
testing is performed, mostly through the Buildbot infrastructure. The main
Buildbot instance for LLVM has a web interface hosted at
[lab.llvm.org](https://lab.llvm.org/buildbot/#/builders). When a new commit
lands in `main`, the Buildbot instance (sometimes referred to as the Buildbot
master) triggers many different builds, based on the configurations defined in
the llvm-zorg repository under the `buildbot/` folder. These configurations run
on Buildbot workers hosted by the community. Some builders build too slowly to
keep up with the pace of commits to `main`, so they test batches of commits.
This often results in a large number of erroneous notifications, because the
list of possible culprits for a breakage contains more than a single commit.

For premerge testing, we do not want to notify LLVM developers about failures
already happening in `main` that are irrelevant to their changes. This requires
knowing the state of `main` at the time the premerge testing for a PR was
started. We also want information on the current state of `main` to empower the
community with the information they need to revert or forward-fix problematic
commits. Problematic commits can land without being caught by the premerge
system when someone pushes a commit directly to `main`, or when multiple PRs
become problematic only in combination. This means we need to test the premerge
configuration postcommit as well, so that we can determine the state of `main`
(in terms of whether the build passed or failed and what tests failed, if any)
at any given point in time. We can use this data to implement a "premerge
advisor" that would prevent sending notifications about build/test failures not
caused by the changes in a user's PR.

## Design

The LLVM Premerge system has two clusters: the central cluster in the Google
Cloud Platform (GCP) zone `us-central1-a` and the west cluster in the GCP zone
`us-west1`. We run two clusters in different zones for redundancy, so that if
one fails, we can still run jobs on the other. For postcommit testing, we plan
on setting up builders attached to the Buildbot master described above. We will
run one builder on the central cluster and one on the west cluster. This
ensures the configuration is highly available (able to tolerate an entire
cluster going down), similar to the premerge testing. The builders will be
configured to use a script that launches testing on each commit to `main` as if
it were being run through the premerge testing pipeline, with some small but
significant differences. The postsubmit testing is intended to stay close to
the premerge configuration, but will differ in some key ways. The differences
and the motivation for them are described more thoroughly in the
[testing configuration](#testing-configuration) section. These builds will run
inside containers that are distributed onto the cluster inside Kubernetes pods
(the fundamental schedulable unit in Kubernetes). This allows Kubernetes to
handle details like which machine a build should run on. Letting Kubernetes
handle these details also enables Google Kubernetes Engine (GKE) to autoscale
the node pools so we are not paying for unneeded capacity. Launching builds
inside pods also allows each builder to handle multiple builds at the same
time.

In terms of the full flow, any commit pushed to the LLVM monorepo (whether from
a direct push or from merging a pull request) will be detected by the Buildbot
master. The Buildbot master will invoke Buildbot workers running on our
clusters. These Buildbot workers will use custom builders to launch a build
wrapped in a Kubernetes pod and report the results back to the Buildbot master.
When the job is finished, the pod will complete and its capacity will become
available for another build; if there is nothing left to test, GKE will see
that nothing is running on one of the nodes and downscale the node pool.

### Annotated Builder

llvm-zorg has multiple types of builders. We plan on using an AnnotatedBuilder.
AnnotatedBuilders allow the build to be driven by a custom Python script rather
than directly dictating the shell commands that should be run to perform the
build. We need the flexibility of the AnnotatedBuilder to deploy jobs on the
cluster. AnnotatedBuilder-based builders also enable deploying changes without
needing to restart the Buildbot master. Without this, we would have to wait for
an administrator of the LLVM Buildbot master to restart it before our changes
were deployed. This could significantly delay updates or responses to
incidents, especially before the system is fully stable.
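
As a rough illustration of how an annotated build script works: the script
prints step annotations to stdout, which Buildbot parses into separate named
build steps. The sketch below assumes the `@@@BUILD_STEP@@@`/`@@@STEP_FAILURE@@@`
marker convention used by annotated builders; the helper functions and commands
are hypothetical, not the actual llvm-zorg implementation.

```python
import subprocess
import sys

def begin_step(name: str) -> None:
    # The annotated protocol recognizes markers like this on stdout and
    # turns them into separately named steps in the Buildbot UI.
    print(f"@@@BUILD_STEP {name}@@@", flush=True)

def run_step(name: str, cmd: list[str]) -> bool:
    """Run one command as a named build step, reporting failure via markers."""
    begin_step(name)
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Marks the current step as failed without killing the whole build.
        print("@@@STEP_FAILURE@@@", flush=True)
        return False
    return True

# Demo with harmless commands: one step that succeeds, one that fails.
assert run_step("passing step", [sys.executable, "-c", "pass"])
assert not run_step("failing step", [sys.executable, "-c", "raise SystemExit(1)"])
```

Because the step logic lives in a script fetched at build time rather than in
the master's configuration, updating it does not require a master restart.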

### Build Distribution

We want to take advantage of the autoscaling functionality of the new cluster
to utilize resources efficiently. To do this, we plan on having the
AnnotatedBuilder script launch builds as Kubernetes pods. This allows
Kubernetes to assign the builds to nodes and also allows autoscaling through
the same mechanism that the GitHub Actions Runner Controller (ARC) uses. This
enables us to process builds quickly at peak times and avoid paying for extra
capacity when commit traffic is quiet, ensuring our resource use is efficient
while still providing fast feedback.

Using the Kubernetes API inside of a Python script (our AnnotatedBuilder
implementation) to launch builds does add some complexity. However, we believe
the additional complexity is justified, as it allows us to achieve our goals
while maintaining efficient resource usage.
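
Concretely, the builder script could construct one pod manifest per commit and
submit it through the Kubernetes API. The sketch below only builds the manifest
as a plain dictionary; the image name, labels, and resource figures are
placeholders rather than the real cluster configuration.

```python
def make_build_pod(commit_sha: str, image: str) -> dict:
    """Build a Kubernetes pod manifest for testing one commit.

    All names and numbers here are illustrative; the real values would live
    in the premerge cluster configuration.
    """
    name = f"postsubmit-{commit_sha[:12]}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {"app": "premerge-postsubmit", "commit": commit_sha},
        },
        "spec": {
            # Never restart: one pod corresponds to exactly one build, so a
            # completed pod frees capacity and lets GKE scale the pool down.
            "restartPolicy": "Never",
            "containers": [{
                "name": "build",
                "image": image,
                "command": ["bash", "monolithic-linux.sh"],
                "env": [{"name": "COMMIT_SHA", "value": commit_sha}],
                "resources": {"requests": {"cpu": "32", "memory": "64Gi"}},
            }],
        },
    }

# In the real script, this manifest would be handed to the cluster with
# something like kubernetes.client.CoreV1Api().create_namespaced_pod(...),
# after which the script polls the pod phase until Succeeded or Failed.
pod = make_build_pod("0123456789abcdef0123", "example.test/premerge-linux:latest")
```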

### Testing Configuration

By testing configuration, we mean both the environment that the tests run in
and the set of tests that run. The testing configuration will be as close to
the premerge configuration as possible. We will run all tests inside the same
container with the same scripts (the `monolithic-linux.sh` and
`monolithic-windows.sh` scripts) used by the premerge testing. However, there
will be one main difference between the premerge and postcommit testing
configurations. In the postcommit configuration we propose testing all projects
on every commit, rather than only testing the projects that themselves changed
or had dependencies that changed. We propose this for two main reasons.
Firstly, Buildbot does not have good support for heterogeneous build
configurations. This means that testing a different set of projects within a
single builder, depending upon the contents of the commit, could easily cause
problems: if projects that are only triggered by certain files were failing
while others were passing, the builder's status would flip back and forth,
significantly increasing false-positive notifications. For example, suppose
that we have three commits that land in `main` and run through postcommit
testing: commit A that touches MLIR, commit B that touches clang-tidy, and
commit C that modifies MLIR. Commit A lands, then commit B, then commit C. If
commit A introduces MLIR test failures into an otherwise clean slate, we would
see the following events:

1. Commit A lands. Because it touches MLIR, the Buildbot worker runs the MLIR
   tests. Some of the tests fail. The buildbot "turns red" and a notification
   is sent out to the PR author.
2. Commit B lands. Since it touches clang-tidy, the Buildbot worker runs the
   clang-tidy tests. All of the tests pass. The buildbot "turns green". No
   notifications are sent out since everything is passing.
3. Commit C lands. Since it touches MLIR, the Buildbot worker runs the MLIR
   tests. The problem introduced in commit A still exists, so some tests fail.
   No new tests fail. But since the buildbot was previously green due to the
   interspersed clang-tidy commit, a notification is still sent out to the
   author of commit C.

By running the tests for all projects in every postsubmit test run, we avoid
the problematic situation described above.
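
The scenario above can be made concrete with a small simulation of the
"notify on red transitions" policy. This models the notification behavior only;
real Buildbot notification logic is more nuanced.

```python
def notified_authors(runs: list[tuple[str, bool]]) -> list[str]:
    """Simulate Buildbot-style notifications: an author is emailed when
    their build fails and the previous build was green (a red transition).

    `runs` is a list of (author, passed) pairs in landing order.
    """
    notified = []
    previous_passed = True  # start from an otherwise clean slate
    for author, passed in runs:
        if not passed and previous_passed:
            notified.append(author)
        previous_passed = passed
    return notified

# Selective testing: commit B's clang-tidy-only run "turns the bot green"
# between the two failing MLIR runs, so C is blamed for A's breakage.
selective = [("A", False), ("B", True), ("C", False)]
assert notified_authors(selective) == ["A", "C"]

# Testing every project on every commit: the bot stays red after A, so
# only A (the actual culprit) is notified.
everything = [("A", False), ("B", False), ("C", False)]
assert notified_authors(everything) == ["A"]
```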

Another reason for running all the tests in every postsubmit run: when running
premerge tests on a PR, we also explicitly skip certain projects even though
their dependencies changed. We do this because we expect interactions that
result in test failures to be quite rare, but they are possible, and having a
postcommit configuration that catches these rare failures would be useful.

### Data Storage

The hosted Buildbot master instance at [lab.llvm.org](https://lab.llvm.org)
contains results for all recent postcommit runs. We plan on querying the
results from the Buildbot master because they are already available there, and
that is where they will natively be reported once the infrastructure is set up.
Buildbot supports a [REST API](https://docs.buildbot.net/latest/developer/rest.html)
that allows easily querying the state of a commit in `main`.

In the future, we may implement a "premerge advisor" that tells users which
test and build failures they can safely ignore. For that, we need to know what
is currently failing on `main`. Each pull request is tested as if it were
merged into `main`, which means the commit underneath the PR is very recent. If
a premerge run fails, the premerge advisor will find the commit from `main`
that the PR is being tested on. It will then query the Buildbot master through
the REST API for the status of that commit, or of the preceding commits if
testing for the requested commit has not yet completed. It can then report the
appropriate status to the user. Having this status will let the premerge
advisor avoid pestering LLVM developers with failures unrelated to their
changes.
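
The advisor's lookup can be sketched as a walk back through the history of
`main` until a finished build is found. Everything here (the function name, the
stubbed responses) is illustrative; in practice `fetch_status` would wrap a
query against the Buildbot master's REST API.

```python
from typing import Callable, Optional

def main_status_for(commit: str, history: list[str],
                    fetch_status: Callable[[str], Optional[bool]]) -> Optional[bool]:
    """Find the newest known postcommit status at or before `commit`.

    `history` lists commits on `main`, newest first, beginning at the commit
    the PR was tested against. `fetch_status` returns True/False for a
    finished pass/fail, or None if that commit has not finished testing yet.
    """
    start = history.index(commit)
    for sha in history[start:]:
        status = fetch_status(sha)
        if status is not None:
            return status
    return None  # nothing at or before this commit has finished testing

# Stubbed responses: the tested commit is still running, its parent failed,
# so the advisor reports that `main` is currently broken.
results = {"c3": None, "c2": False, "c1": True}
assert main_status_for("c3", ["c3", "c2", "c1"], results.get) is False
```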

## Alternatives Considered

Originally, we were looking at running postcommit testing through GitHub
Actions, like the premerge tests. This was primarily due to it being easy to
implement (a single-line change in the GitHub Actions workflow config) and easy
to integrate with the GitHub API for the implementation of the premerge
advisor. More detailed motivation for doing postcommit testing directly through
GitHub is available in the
[discourse RFC thread](https://discourse.llvm.org/t/rfc-running-premerge-postcommit-through-github-actions/86124)
where we proposed doing this. We eventually decided against implementing it
this way for a few reasons:

1. Nonstandard - The standard postcommit testing infrastructure for LLVM is
   Buildbot. Doing postcommit testing for the premerge configuration through
   GitHub would represent a significant departure from this. It would mean
   leaving behind common infrastructure and forcing a new, unfamiliar
   postcommit interface on LLVM contributors.
2. Notifications - This is the biggest issue. GitHub currently gives very
   little control over the notifications that are sent out when a build fails
   or gets cancelled. In particular, GitHub sends out notifications for build
   failures even if the previous build had already failed. This can easily
   create a lot of warning fatigue, which is something we are putting a lot of
   effort into avoiding. We want the premerge system to be perceived as
   reliable, to have people trust its results, and most importantly, to have
   people pay attention to failures when they do occur. They are significantly
   more likely to pay attention when they are the author of the patch getting
   the notification and the feedback is actionable.
3. Customization - Buildbot can be customized around issues like notifications,
   whereas GitHub cannot. GitHub is not particularly responsive to feature
   requests, and its notification story has been poor for a while, so this
   lack of customization is a strategic risk.
