
Performance Issue: Serial GetFiles() Calls Causing Extended Delays #2324

@mh21

Description


Summary

Pipelines-as-Code calls GetFiles() serially for each PipelineRun's CEL expression evaluation, causing severe performance degradation when repositories have many PipelineRuns using files.* patterns. This results in minutes to hours of delay between webhook receipt and pipeline creation, with the delay scaling linearly with the number of PipelineRuns.

Problem Statement

When a webhook event is received, PaC evaluates each PipelineRun's CEL expression to determine if it should be triggered. If a CEL expression references files.* (e.g., files.all.exists()), PaC fetches the list of changed files from the Git provider's API. However, this fetch happens separately for every single PipelineRun, even though all PipelineRuns are evaluating the same webhook event with the same changed files.

This causes:

  • Linear scaling: N PipelineRuns with files.* = N API calls (~3 seconds each)
  • API saturation: HTTP/2 stream errors under load
  • Retry cascades: Exponential backoff extends delays from minutes to hours
  • Production impact: Real-world example showed 3 hours 36 minutes delay for 132 PipelineRuns

Real-World Example

A production GitLab repository with 132 PipelineRuns using file-based triggers experienced:

  • Webhook received: 17:52:31
  • First PipelineRun created: 21:28:46 (3 hours 36 minutes later)
  • Root cause: 132 serial API calls per attempt (6+ attempts due to HTTP/2 stream errors)
  • Expected time with caching: ~6 seconds

Details: https://gitlab.com/redhat/hummingbird/containers/-/merge_requests/1477

Technical Root Cause

The issue is in how PaC's CEL evaluation interacts with file fetching:

Current Implementation

In pkg/matcher/cel.go:

func celEvaluate(ctx context.Context, expr string, event *info.Event, vcx provider.Interface) (ref.Val, error) {
    r := regexp.MustCompile(reChangedFilesTags)  // Matches "files\." in CEL expression
    changedFiles := changedfiles.ChangedFiles{}
    var err error

    if r.MatchString(expr) {
        changedFiles, err = vcx.GetFiles(ctx, event)  // ⚠️ API call, executed once per PipelineRun by the caller's loop
        if err != nil {
            return nil, err
        }
    }
    // ... evaluate CEL with changedFiles
}

Called from pkg/matcher/annotation_matcher.go:

func MatchPipelinerunByAnnotation(...) {
    for _, prun := range pipelineruns {
        if celExpr, ok := prun.GetObjectMeta().GetAnnotations()[keys.OnCelExpression]; ok {
            out, err := celEvaluate(ctx, celExpr, event, vcx)  // ⚠️ Called in loop
            // ...
        }
    }
}
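For context, the value that GetFiles() returns and that gets re-fetched for every PipelineRun is just a small set of file-path lists. The sketch below is inferred from the field names that the CEL expressions use (files.all, files.added, and so on); the authoritative definition lives in pkg/changedfiles:

// Approximate shape of the data returned by a single GetFiles() call —
// identical for every PipelineRun evaluated against one webhook event.
type ChangedFiles struct {
    All      []string // every path touched by the push or merge request
    Added    []string
    Deleted  []string
    Modified []string
    Renamed  []string
}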

The Problem

  1. Per-expression optimization: The reChangedFilesTags regex skips the API call if an expression doesn't use files.*
  2. Wrong level: This optimization is at the per-expression level, not per-webhook-event
  3. Repeated fetches: For a webhook event, the same file list is fetched N times for N PipelineRuns
  4. Common pattern: Many repositories use files.all.exists(x, x.matches("path/...")) for path-based triggering

Result: In repositories with many PipelineRuns using file-based triggers, vcx.GetFiles() is called repeatedly with identical parameters, fetching the same data N times per webhook event.

Proposed Solution

Cache file fetching at the webhook processing level in pkg/matcher/annotation_matcher.go:

// Pseudo-code
func MatchPipelinerunByAnnotation(ctx context.Context, ...) ([]*tektonv1.PipelineRun, error) {
    // Fetch files ONCE for the entire webhook event
    var cachedFiles changedfiles.ChangedFiles
    var filesErr error
    var filesFetched bool
    filesRe := regexp.MustCompile(`files\.`)  // compiled once, outside the loop

    for _, prun := range pipelineruns {
        if celExpr, ok := prun.GetObjectMeta().GetAnnotations()[keys.OnCelExpression]; ok {
            // Fetch files only once per webhook event
            if !filesFetched && filesRe.MatchString(celExpr) {
                cachedFiles, filesErr = vcx.GetFiles(ctx, event)
                filesFetched = true
                if filesErr != nil {
                    return nil, filesErr
                }
            }

            // Evaluate with cached files
            out, err := celEvaluateWithFiles(ctx, celExpr, event, cachedFiles)
            // ...
        }
    }
}

Alternatively, cache at a higher level (per webhook event ID) with TTL-based invalidation.
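One way to do that is a small TTL cache keyed by an event identifier, shared by whatever goroutines process the event. The sketch below is illustrative only: fileCache, cacheEntry, and the key scheme are hypothetical names, not existing PaC code (imports elided: context, sync, time, plus PaC's changedfiles, info, and provider packages).

// Hypothetical per-event TTL cache for changed files; names and wiring are
// illustrative, not the actual PaC implementation.
type fileCache struct {
    mu      sync.Mutex
    ttl     time.Duration
    entries map[string]cacheEntry
}

type cacheEntry struct {
    files   changedfiles.ChangedFiles
    fetched time.Time
}

// get returns the changed-file list for an event key (e.g. provider + repo +
// SHA, or the webhook delivery ID), fetching from the Git provider at most
// once per TTL window. The lock is held across the fetch on purpose, so
// concurrent evaluations of the same event don't each hit the API.
func (c *fileCache) get(ctx context.Context, key string, event *info.Event, vcx provider.Interface) (changedfiles.ChangedFiles, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    if e, ok := c.entries[key]; ok && time.Since(e.fetched) < c.ttl {
        return e.files, nil
    }
    files, err := vcx.GetFiles(ctx, event)
    if err != nil {
        return changedfiles.ChangedFiles{}, err
    }
    if c.entries == nil {
        c.entries = map[string]cacheEntry{}
    }
    c.entries[key] = cacheEntry{files: files, fetched: time.Now()}
    return files, nil
}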

Expected Improvement

For 132 PipelineRuns using file-based triggers:

  • Current (single pass): ~7 minutes (132 × ~3s per API call)
  • Current (with HTTP/2 failures): Hours (exponential backoff on retries)
  • With caching: ~6 seconds (1 API call + fast CEL evaluation)
  • Improvement: ~65x faster per attempt, eliminates retry failures

Key insight: CEL evaluation is fast (~1ms per expression). The bottleneck is repeated API calls for identical data.
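The arithmetic behind these figures, using the reproducer's measurements below: 132 expressions × ~3.14s per API call ≈ 414s of serial fetching per attempt, versus ~6.4s extrapolated for the cached path, and 414 / 6.4 ≈ 65.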

Impact by repository scale:

  • 10-20 PipelineRuns: Minor delay (~30-60s), likely unnoticed
  • 50+ PipelineRuns: Noticeable delay (2-3 minutes)
  • 100+ PipelineRuns: Significant delay (5+ minutes), risk of HTTP/2 errors
  • 150+ PipelineRuns: High likelihood of failures requiring retries

Reproduction

This repository contains a standalone Go reproducer that demonstrates the issue using real production data. It can test any GitLab merge request with PipelineRuns.

Test Results

Without API Calls (Optimal - Files Cached):

97 PipelineRuns:
  Initial API call:    2.67s
  Avg evaluation:      83.7ms (5 iterations)
  Total time:          2.75s
  Matched:             97/97

With API Calls (Current PaC Behavior):

3 PipelineRuns tested (of 97 total):
  Total time:        9.41s
  API call time:     9.40s (99.9%)
  CEL eval time:     6.78ms (0.1%)
  Avg per expr:      3.14s

Estimated for all 97: ~5m4s

Extrapolated for 132 PipelineRuns:

  • Optimal (cached): ~6.4 seconds
  • Current (per-eval): ~414 seconds (~7 minutes per successful attempt)
  • Observed in production: 3 hours 36 minutes (due to HTTP/2 stream errors and retries)

Running the Reproducer

The reproducer is containerized and runs both scenarios automatically.

Quick start (tests example MR with 97 PipelineRuns):

make run

Test a different GitLab MR:

make run MR_URL=https://gitlab.com/<group>/<project>/-/merge_requests/<number>

Other commands:

make build     # Build container image
make clean     # Remove container image

Manual usage:

# Build
podman build -t pac-performance-test .

# Run with any GitLab MR URL
podman run --rm pac-performance-test \
  -mr https://gitlab.com/<group>/<project>/-/merge_requests/<number>

The reproducer automatically:

  • Fetches .tekton/*.yaml PipelineRuns from the repository
  • Retrieves changed files from the merge request
  • Runs both scenarios (cached vs. per-eval API calls)
  • Compares performance and extrapolates to full scale

What it does:

  1. Scenario 1 (Files Cached): Fetches files once, runs 5 iterations of CEL evaluations
  2. Scenario 2 (API Per Evaluation): Tests 3 expressions with API calls per evaluation (simulates current PaC behavior)
  3. Comparison: Shows timing difference and extrapolates to production scale (132 PipelineRuns)

Output example:

========================================
Scenario 1: Files Cached
========================================
Initial API call:        2.67s
Avg evaluation time:     83.7ms (5 iterations)
Total time:              2.75s
Matched PipelineRuns:    97/97

========================================
Scenario 2: API Call Per Evaluation
========================================
Testing 3 expressions (of 97 total)

Total time (3 tested):  9.41s
API call time:           9.40s (99.9%)
CEL evaluation time:     6.78ms (0.1%)
Avg per expression:      3.14s
Matched PipelineRuns:    3/3 (tested)

Estimated for all 97:    ~5m4s

========================================
Comparison
========================================
Scenario 1 (cached):              2.75s
Scenario 2 (per-eval, estimated): 5m4s

For 132 PipelineRuns in production:
  Scenario 1:                   ~6.4s
  Scenario 2:                   ~414.1s
========================================

Note: The reproducer demonstrates timing for a single successful evaluation pass. In production, HTTP/2 stream errors can cause multiple failed attempts with exponential backoff, extending delays from minutes to hours.
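As a rough sanity check against the production numbers: 6+ attempts at ~7 minutes of API time each already exceeds 40 minutes before any waiting, and exponential backoff between the failed attempts would plausibly account for the rest of the observed 3 hours 36 minutes.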

Reproducer Files

  • main.go - Reproducer implementation
  • Dockerfile - Container build configuration
  • Makefile - Build and run automation
  • go.mod, go.sum - Go dependencies

The reproducer:

  • Accepts any GitLab MR URL as input
  • Fetches .tekton/*.yaml PipelineRuns from the repository
  • Retrieves actual changed files from the merge request
  • Evaluates CEL expressions using github.com/google/cel-go, the same library PaC uses (see the sketch after this list)
  • Compares cached vs. per-evaluation API call scenarios
  • Extrapolates timing to production scale (132 PipelineRuns)
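Since the reproducer evaluates expressions with cel-go the same way PaC does, the cached scenario boils down to compiling each expression and evaluating it against the already-fetched file list, with no further API calls. A minimal, self-contained sketch of that idea (not the reproducer's actual code; the files variable is modelled here as a plain map of sample paths):

package main

import (
    "fmt"

    "github.com/google/cel-go/cel"
)

func main() {
    // Declare the "files" variable that expressions such as
    // files.all.exists(x, x.matches("images/foo/.*")) refer to.
    env, err := cel.NewEnv(
        cel.Variable("files", cel.MapType(cel.StringType, cel.ListType(cel.StringType))),
    )
    if err != nil {
        panic(err)
    }

    // In PaC this would come from a single GetFiles() call per webhook event;
    // here it is hard-coded sample data.
    changed := map[string][]string{
        "all":      {"docs/readme.md", "images/foo/Dockerfile"},
        "added":    {"docs/readme.md"},
        "modified": {"images/foo/Dockerfile"},
    }

    exprs := []string{
        `files.all.exists(x, x.matches("images/foo/.*"))`,
        `files.all.exists(x, x.matches("images/bar/.*"))`,
    }

    for _, expr := range exprs {
        ast, iss := env.Compile(expr)
        if iss != nil && iss.Err() != nil {
            panic(iss.Err())
        }
        prg, err := env.Program(ast)
        if err != nil {
            panic(err)
        }
        // No API call here: every expression is evaluated against the same
        // cached file list.
        out, _, err := prg.Eval(map[string]any{"files": changed})
        if err != nil {
            panic(err)
        }
        fmt.Printf("%-50s => %v\n", expr, out.Value())
    }
}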

Environment

When This Issue Occurs

This affects repositories that:

  1. Have many PipelineRuns (50+) with CEL-based triggers
  2. Use files.* patterns for path-based triggering (e.g., files.all.exists(x, x.matches("path/...")))
  3. Follow a monorepo layout with per-component or per-image pipelines, where this pattern is common

Typical use case: A container image repository with multiple images, each having separate PipelineRuns for different events and branches, all using file path matching to trigger only on relevant changes.

Scaling behavior:

  • 10-20 PipelineRuns: ~30-60s delay
  • 50 PipelineRuns: ~2.5 minutes
  • 100 PipelineRuns: ~5 minutes (HTTP/2 errors possible)
  • 150+ PipelineRuns: High likelihood of timeout/retry failures

Attempted Workarounds

None of these address the root cause:

  1. Add on-event/on-target-branch annotations: Reduces PipelineRuns evaluated but doesn't eliminate serial API calls for remaining ones
  2. Reduce PipelineRuns: Not feasible for legitimate multi-component repositories
  3. Remove files.* patterns: Triggers all pipelines on every change, wasting resources
  4. Increase timeouts: Delays the problem but doesn't prevent HTTP/2 saturation

Additional Context

Example repository that hit this issue: https://gitlab.com/redhat/hummingbird/containers

  • 132 PipelineRuns using file-based triggers
  • Real delay observed: 3 hours 36 minutes for a single merge request
  • Expected delay with caching: ~6 seconds

Log evidence from production: Multiple INTERNAL_ERROR HTTP/2 stream failures during CEL evaluation, requiring retries with exponential backoff.

AI Disclaimer

The analysis and write-up were assisted by Claude.

cc @scoheb
