Summary
Pipelines-as-Code calls GetFiles() serially for each PipelineRun's CEL expression evaluation, causing severe performance degradation when repositories have many PipelineRuns using files.* patterns. This results in minutes to hours of delay between webhook receipt and pipeline creation, with the delay scaling linearly with the number of PipelineRuns.
Problem Statement
When a webhook event is received, PaC evaluates each PipelineRun's CEL expression to determine if it should be triggered. If a CEL expression references files.* (e.g., files.all.exists()), PaC fetches the list of changed files from the Git provider's API. However, this fetch happens separately for every single PipelineRun, even though all PipelineRuns are evaluating the same webhook event with the same changed files.
This causes:
- Linear scaling: N PipelineRuns with `files.*` = N API calls (~3 seconds each)
- API saturation: HTTP/2 stream errors under load
- Retry cascades: Exponential backoff extends delays from minutes to hours
- Production impact: Real-world example showed 3 hours 36 minutes delay for 132 PipelineRuns
Real-World Example
A production GitLab repository with 132 PipelineRuns using file-based triggers experienced:
- Webhook received: 17:52:31
- First PipelineRun created: 21:28:46 (3 hours 36 minutes later)
- Root cause: 132 serial API calls per attempt (6+ attempts due to HTTP/2 stream errors)
- Expected time with caching: ~6 seconds
Details: https://gitlab.com/redhat/hummingbird/containers/-/merge_requests/1477
Technical Root Cause
The issue is in how PaC's CEL evaluation interacts with file fetching:
Current Implementation
```go
func celEvaluate(ctx context.Context, expr string, event *info.Event, vcx provider.Interface) (ref.Val, error) {
	r := regexp.MustCompile(reChangedFilesTags) // matches "files." in the CEL expression
	changedFiles := changedfiles.ChangedFiles{}
	if r.MatchString(expr) {
		var err error
		changedFiles, err = vcx.GetFiles(ctx, event) // ⚠️ API call, made once per PipelineRun
		if err != nil {
			return nil, err
		}
	}
	// ... evaluate CEL with changedFiles
}
```

Called from `pkg/matcher/annotation_matcher.go`:
```go
func MatchPipelinerunByAnnotation(...) {
	for _, prun := range pipelineruns {
		if celExpr, ok := prun.GetObjectMeta().GetAnnotations()[keys.OnCelExpression]; ok {
			out, err := celEvaluate(ctx, celExpr, event, vcx) // ⚠️ called in a loop
			// ...
		}
	}
}
```

The Problem
- Per-expression optimization: The `reChangedFilesTags` regex skips the API call if an expression doesn't use `files.*`
- Wrong level: This optimization is at the per-expression level, not per-webhook-event
- Repeated fetches: For a single webhook event, the same file list is fetched N times for N PipelineRuns
- Common pattern: Many repositories use `files.all.exists(x, x.matches("path/..."))` for path-based triggering
Result: In repositories with many PipelineRuns using file-based triggers, vcx.GetFiles() is called repeatedly with identical parameters, fetching the same data N times per webhook event.
Proposed Solution
Cache file fetching at the webhook processing level in pkg/matcher/annotation_matcher.go:
```go
// Pseudo-code
func MatchPipelinerunByAnnotation(ctx context.Context, ...) ([]*tektonv1.PipelineRun, error) {
	// Fetch files ONCE for the entire webhook event
	var cachedFiles changedfiles.ChangedFiles
	var filesErr error
	var filesFetched bool
	filesRe := regexp.MustCompile(`files\.`) // compile once, outside the loop
	for _, prun := range pipelineruns {
		if celExpr, ok := prun.GetObjectMeta().GetAnnotations()[keys.OnCelExpression]; ok {
			// Fetch files only once
			if !filesFetched && filesRe.MatchString(celExpr) {
				cachedFiles, filesErr = vcx.GetFiles(ctx, event)
				filesFetched = true
				if filesErr != nil {
					return nil, filesErr
				}
			}
			// Evaluate with cached files
			out, err := celEvaluateWithFiles(ctx, celExpr, event, cachedFiles)
			// ...
		}
	}
}
```

Alternatively, cache at a higher level (per webhook event ID) with TTL-based invalidation.
Expected Improvement
For 132 PipelineRuns using file-based triggers:
- Current (single pass): ~7 minutes (132 × ~3s per API call)
- Current (with HTTP/2 failures): Hours (exponential backoff on retries)
- With caching: ~6 seconds (1 API call + fast CEL evaluation)
- Improvement: ~65x faster per attempt, eliminates retry failures
Key insight: CEL evaluation is fast (~1ms per expression). The bottleneck is repeated API calls for identical data.
Impact by repository scale:
- 10-20 PipelineRuns: Minor delay (~30-60s), likely unnoticed
- 50+ PipelineRuns: Noticeable delay (2-3 minutes)
- 100+ PipelineRuns: Significant delay (5+ minutes), risk of HTTP/2 errors
- 150+ PipelineRuns: High likelihood of failures requiring retries
Reproduction
This repository contains a standalone Go reproducer that demonstrates the issue using real production data. It can test any GitLab merge request with PipelineRuns.
Test Results
Without API Calls (Optimal - Files Cached):

```
97 PipelineRuns:
  Initial API call: 2.67s
  Avg evaluation:   83.7ms (5 iterations)
  Total time:       2.75s
  Matched:          97/97
```

With API Calls (Current PaC Behavior):

```
3 PipelineRuns tested (of 97 total):
  Total time:    9.41s
  API call time: 9.40s (99.9%)
  CEL eval time: 6.78ms (0.1%)
  Avg per expr:  3.14s
  Estimated for all 97: ~5m4s
```
Extrapolated for 132 PipelineRuns:
- Optimal (cached): ~6.4 seconds
- Current (per-eval): ~414 seconds (~7 minutes per successful attempt)
- Observed in production: 3 hours 36 minutes (due to HTTP/2 stream errors and retries)
Running the Reproducer
The reproducer is containerized and runs both scenarios automatically.
Quick start (tests example MR with 97 PipelineRuns):
```shell
make run
```

Test a different GitLab MR:

```shell
make run MR_URL=https://gitlab.com/<group>/<project>/-/merge_requests/<number>
```

Other commands:

```shell
make build   # Build container image
make clean   # Remove container image
```

Manual usage:

```shell
# Build
podman build -t pac-performance-test .

# Run with any GitLab MR URL
podman run --rm pac-performance-test \
  -mr https://gitlab.com/<group>/<project>/-/merge_requests/<number>
```

The reproducer automatically:
- Fetches `.tekton/*.yaml` PipelineRuns from the repository
- Retrieves changed files from the merge request
- Runs both scenarios (cached vs. per-eval API calls)
- Compares performance and extrapolates to full scale
What it does:
- Scenario 1 (Files Cached): Fetches files once, runs 5 iterations of CEL evaluations
- Scenario 2 (API Per Evaluation): Tests 3 expressions with API calls per evaluation (simulates current PaC behavior)
- Comparison: Shows timing difference and extrapolates to production scale (132 PipelineRuns)
Output example:
```
========================================
Scenario 1: Files Cached
========================================
Initial API call:    2.67s
Avg evaluation time: 83.7ms (5 iterations)
Total time:          2.75s
Matched PipelineRuns: 97/97

========================================
Scenario 2: API Call Per Evaluation
========================================
Testing 3 expressions (of 97 total)
Total time (3 tested): 9.41s
API call time:         9.40s (99.9%)
CEL evaluation time:   6.78ms (0.1%)
Avg per expression:    3.14s
Matched PipelineRuns:  3/3 (tested)
Estimated for all 97:  ~5m4s

========================================
Comparison
========================================
Scenario 1 (cached):              2.75s
Scenario 2 (per-eval, estimated): 5m4s

For 132 PipelineRuns in production:
  Scenario 1: ~6.4s
  Scenario 2: ~414.1s
========================================
```
Note: The reproducer demonstrates timing for a single successful evaluation pass. In production, HTTP/2 stream errors can cause multiple failed attempts with exponential backoff, extending delays from minutes to hours.
Reproducer Files
- `main.go` - Reproducer implementation
- `Dockerfile` - Container build configuration
- `Makefile` - Build and run automation
- `go.mod`, `go.sum` - Go dependencies
The reproducer:
- Accepts any GitLab MR URL as input
- Fetches `.tekton/*.yaml` PipelineRuns from the repository
- Retrieves actual changed files from the merge request
- Evaluates CEL expressions using `github.com/google/cel-go` (same as PaC)
- Compares cached vs. per-evaluation API call scenarios
- Extrapolates timing to production scale (132 PipelineRuns)
Environment
- PaC Version: Observed on latest (main branch as of November 2024)
- Git Provider: GitLab (issue likely affects GitHub/Gitea/etc. similarly)
- Affected Code:
  - `pkg/matcher/cel.go` - API call inside evaluation
  - `pkg/matcher/annotation_matcher.go` - Evaluation loop
When This Issue Occurs
This affects repositories that:
- Have many PipelineRuns (50+) with CEL-based triggers
- Use `files.*` patterns for path-based triggering (e.g., `files.all.exists(x, x.matches("path/..."))`)
- Are common in monorepo setups with per-component or per-image pipelines
Typical use case: A container image repository with multiple images, each having separate PipelineRuns for different events and branches, all using file path matching to trigger only on relevant changes.
Scaling behavior:
- 10-20 PipelineRuns: ~30-60s delay
- 50 PipelineRuns: ~2.5 minutes
- 100 PipelineRuns: ~5 minutes (HTTP/2 errors possible)
- 150+ PipelineRuns: High likelihood of timeout/retry failures
Attempted Workarounds
None of these address the root cause:
- Add `on-event`/`on-target-branch` annotations: Reduces the number of PipelineRuns evaluated but doesn't eliminate serial API calls for the remaining ones
- Reduce PipelineRuns: Not feasible for legitimate multi-component repositories
- Remove `files.*` patterns: Triggers all pipelines on every change, wasting resources
- Increase timeouts: Delays the problem but doesn't prevent HTTP/2 saturation
Additional Context
Example repository that hit this issue: https://gitlab.com/redhat/hummingbird/containers
- 132 PipelineRuns using file-based triggers
- Real delay observed: 3 hours 36 minutes for a single merge request
- Expected delay with caching: ~6 seconds
Log evidence from production: Multiple INTERNAL_ERROR HTTP/2 stream failures during CEL evaluation, requiring retries with exponential backoff.
AI Disclaimer
The analysis and writing were assisted by Claude.
cc @scoheb