
Performance Issue: Serial GetFiles() Calls Causing Extended Delays #2324

@mh21

Description


Summary

Pipelines-as-Code calls GetFiles() serially for each PipelineRun's CEL expression evaluation, causing severe performance degradation when repositories have many PipelineRuns using files.* patterns. This results in minutes to hours of delay between webhook receipt and pipeline creation, with the delay scaling linearly with the number of PipelineRuns.

Problem Statement

When a webhook event is received, PaC evaluates each PipelineRun's CEL expression to determine if it should be triggered. If a CEL expression references files.* (e.g., files.all.exists()), PaC fetches the list of changed files from the Git provider's API. However, this fetch happens separately for every single PipelineRun, even though all PipelineRuns are evaluating the same webhook event with the same changed files.

This causes:

  • Linear scaling: N PipelineRuns with files.* = N API calls (~3 seconds each)
  • API saturation: HTTP/2 stream errors under load
  • Retry cascades: Exponential backoff extends delays from minutes to hours
  • Production impact: Real-world example showed 3 hours 36 minutes delay for 132 PipelineRuns

Real-World Example

A production GitLab repository with 132 PipelineRuns using file-based triggers experienced:

  • Webhook received: 17:52:31
  • First PipelineRun created: 21:28:46 (3 hours 36 minutes later)
  • Root cause: 132 serial API calls per attempt (6+ attempts due to HTTP/2 stream errors)
  • Expected time with caching: ~6 seconds

Details: https://gitlab.com/redhat/hummingbird/containers/-/merge_requests/1477

Technical Root Cause

The issue is in how PaC's CEL evaluation interacts with file fetching:

Current Implementation

In pkg/matcher/cel.go:

func celEvaluate(ctx context.Context, expr string, event *info.Event, vcx provider.Interface) (ref.Val, error) {
    r := regexp.MustCompile(reChangedFilesTags)  // Matches "files\." in CEL expression
    changedFiles := changedfiles.ChangedFiles{}
    var err error

    if r.MatchString(expr) {
        changedFiles, err = vcx.GetFiles(ctx, event)  // ⚠️ API call, executed once per PipelineRun by the caller's loop
        if err != nil {
            return nil, err
        }
    }
    // ... evaluate CEL with changedFiles
}

Called from pkg/matcher/annotation_matcher.go:

func MatchPipelinerunByAnnotation(...) {
    for _, prun := range pipelineruns {
        if celExpr, ok := prun.GetObjectMeta().GetAnnotations()[keys.OnCelExpression]; ok {
            out, err := celEvaluate(ctx, celExpr, event, vcx)  // ⚠️ Called in loop
            // ...
        }
    }
}
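For context, the value that GetFiles() returns and that gets re-fetched for every PipelineRun is just a small set of file-path lists. The sketch below is inferred from the field names that the CEL expressions use (files.all, files.added, and so on); the authoritative definition lives in pkg/changedfiles:

// Approximate shape of the data returned by a single GetFiles() call —
// identical for every PipelineRun evaluated against one webhook event.
type ChangedFiles struct {
    All      []string // every path touched by the push or merge request
    Added    []string
    Deleted  []string
    Modified []string
    Renamed  []string
}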

The Problem

  1. Per-expression optimization: The reChangedFilesTags regex skips the API call if an expression doesn't use files.*
  2. Wrong level: This optimization is at the per-expression level, not per-webhook-event
  3. Repeated fetches: For a webhook event, the same file list is fetched N times for N PipelineRuns
  4. Common pattern: Many repositories use files.all.exists(x, x.matches("path/...")) for path-based triggering

Result: In repositories with many PipelineRuns using file-based triggers, vcx.GetFiles() is called repeatedly with identical parameters, fetching the same data N times per webhook event.

Proposed Solution

Cache file fetching at the webhook processing level in pkg/matcher/annotation_matcher.go:

// Pseudo-code
func MatchPipelinerunByAnnotation(ctx context.Context, ...) ([]*tektonv1.PipelineRun, error) {
    // Fetch files ONCE for the entire webhook event
    var cachedFiles changedfiles.ChangedFiles
    var filesErr error
    var filesFetched bool
    filesRe := regexp.MustCompile(`files\.`)  // compiled once, outside the loop

    for _, prun := range pipelineruns {
        if celExpr, ok := prun.GetObjectMeta().GetAnnotations()[keys.OnCelExpression]; ok {
            // Fetch files only once per webhook event
            if !filesFetched && filesRe.MatchString(celExpr) {
                cachedFiles, filesErr = vcx.GetFiles(ctx, event)
                filesFetched = true
                if filesErr != nil {
                    return nil, filesErr
                }
            }

            // Evaluate with cached files
            out, err := celEvaluateWithFiles(ctx, celExpr, event, cachedFiles)
            // ...
        }
    }
}

Alternatively, cache at a higher level (per webhook event ID) with TTL-based invalidation.
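One way to do that is a small TTL cache keyed by an event identifier, shared by whatever goroutines process the event. The sketch below is illustrative only: fileCache, cacheEntry, and the key scheme are hypothetical names, not existing PaC code (imports elided: context, sync, time, plus PaC's changedfiles, info, and provider packages).

// Hypothetical per-event TTL cache for changed files; names and wiring are
// illustrative, not the actual PaC implementation.
type fileCache struct {
    mu      sync.Mutex
    ttl     time.Duration
    entries map[string]cacheEntry
}

type cacheEntry struct {
    files   changedfiles.ChangedFiles
    fetched time.Time
}

// get returns the changed-file list for an event key (e.g. provider + repo +
// SHA, or the webhook delivery ID), fetching from the Git provider at most
// once per TTL window. The lock is held across the fetch on purpose, so
// concurrent evaluations of the same event don't each hit the API.
func (c *fileCache) get(ctx context.Context, key string, event *info.Event, vcx provider.Interface) (changedfiles.ChangedFiles, error) {
    c.mu.Lock()
    defer c.mu.Unlock()

    if e, ok := c.entries[key]; ok && time.Since(e.fetched) < c.ttl {
        return e.files, nil
    }
    files, err := vcx.GetFiles(ctx, event)
    if err != nil {
        return changedfiles.ChangedFiles{}, err
    }
    if c.entries == nil {
        c.entries = map[string]cacheEntry{}
    }
    c.entries[key] = cacheEntry{files: files, fetched: time.Now()}
    return files, nil
}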

Expected Improvement

For 132 PipelineRuns using file-based triggers:

  • Current (single pass): ~7 minutes (132 × ~3s per API call)
  • Current (with HTTP/2 failures): Hours (exponential backoff on retries)
  • With caching: ~6 seconds (1 API call + fast CEL evaluation)
  • Improvement: ~65x faster per attempt, eliminates retry failures

Key insight: CEL evaluation is fast (~1ms per expression). The bottleneck is repeated API calls for identical data.
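The arithmetic behind these figures, using the reproducer's measurements below: 132 expressions × ~3.14s per API call ≈ 414s of serial fetching per attempt, versus ~6.4s extrapolated for the cached path, and 414 / 6.4 ≈ 65.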

Impact by repository scale:

  • 10-20 PipelineRuns: Minor delay (~30-60s), likely unnoticed
  • 50+ PipelineRuns: Noticeable delay (2-3 minutes)
  • 100+ PipelineRuns: Significant delay (5+ minutes), risk of HTTP/2 errors
  • 150+ PipelineRuns: High likelihood of failures requiring retries

Reproduction

This repository contains a standalone Go reproducer that demonstrates the issue using real production data. It can test any GitLab merge request with PipelineRuns.

Test Results

Without API Calls (Optimal - Files Cached):

97 PipelineRuns:
  Initial API call:    2.67s
  Avg evaluation:      83.7ms (5 iterations)
  Total time:          2.75s
  Matched:             97/97

With API Calls (Current PaC Behavior):

3 PipelineRuns tested (of 97 total):
  Total time:        9.41s
  API call time:     9.40s (99.9%)
  CEL eval time:     6.78ms (0.1%)
  Avg per expr:      3.14s

Estimated for all 97: ~5m4s

Extrapolated for 132 PipelineRuns:

  • Optimal (cached): ~6.4 seconds
  • Current (per-eval): ~414 seconds (~7 minutes per successful attempt)
  • Observed in production: 3 hours 36 minutes (due to HTTP/2 stream errors and retries)

Running the Reproducer

The reproducer is containerized and runs both scenarios automatically.

Quick start (tests example MR with 97 PipelineRuns):

make run

Test a different GitLab MR:

make run MR_URL=https://gitlab.com/<group>/<project>/-/merge_requests/<number>

Other commands:

make build     # Build container image
make clean     # Remove container image

Manual usage:

# Build
podman build -t pac-performance-test .

# Run with any GitLab MR URL
podman run --rm pac-performance-test \
  -mr https://gitlab.com/<group>/<project>/-/merge_requests/<number>

The reproducer automatically:

  • Fetches .tekton/*.yaml PipelineRuns from the repository
  • Retrieves changed files from the merge request
  • Runs both scenarios (cached vs. per-eval API calls)
  • Compares performance and extrapolates to full scale

What it does:

  1. Scenario 1 (Files Cached): Fetches files once, runs 5 iterations of CEL evaluations
  2. Scenario 2 (API Per Evaluation): Tests 3 expressions with API calls per evaluation (simulates current PaC behavior)
  3. Comparison: Shows timing difference and extrapolates to production scale (132 PipelineRuns)

Output example:

========================================
Scenario 1: Files Cached
========================================
Initial API call:        2.67s
Avg evaluation time:     83.7ms (5 iterations)
Total time:              2.75s
Matched PipelineRuns:    97/97

========================================
Scenario 2: API Call Per Evaluation
========================================
Testing 3 expressions (of 97 total)

Total time (3 tested):  9.41s
API call time:           9.40s (99.9%)
CEL evaluation time:     6.78ms (0.1%)
Avg per expression:      3.14s
Matched PipelineRuns:    3/3 (tested)

Estimated for all 97:    ~5m4s

========================================
Comparison
========================================
Scenario 1 (cached):              2.75s
Scenario 2 (per-eval, estimated): 5m4s

For 132 PipelineRuns in production:
  Scenario 1:                   ~6.4s
  Scenario 2:                   ~414.1s
========================================

Note: The reproducer demonstrates timing for a single successful evaluation pass. In production, HTTP/2 stream errors can cause multiple failed attempts with exponential backoff, extending delays from minutes to hours.
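As a rough sanity check against the production numbers: 6+ attempts at ~7 minutes of API time each already exceeds 40 minutes before any waiting, and exponential backoff between the failed attempts would plausibly account for the rest of the observed 3 hours 36 minutes.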

Reproducer Files

  • main.go - Reproducer implementation
  • Dockerfile - Container build configuration
  • Makefile - Build and run automation
  • go.mod, go.sum - Go dependencies

The reproducer:

  • Accepts any GitLab MR URL as input
  • Fetches .tekton/*.yaml PipelineRuns from the repository
  • Retrieves actual changed files from the merge request
  • Evaluates CEL expressions using github.com/google/cel-go, the same library PaC uses (see the sketch after this list)
  • Compares cached vs. per-evaluation API call scenarios
  • Extrapolates timing to production scale (132 PipelineRuns)
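Since the reproducer evaluates expressions with cel-go the same way PaC does, the cached scenario boils down to compiling each expression and evaluating it against the already-fetched file list, with no further API calls. A minimal, self-contained sketch of that idea (not the reproducer's actual code; the files variable is modelled here as a plain map of sample paths):

package main

import (
    "fmt"

    "github.com/google/cel-go/cel"
)

func main() {
    // Declare the "files" variable that expressions such as
    // files.all.exists(x, x.matches("images/foo/.*")) refer to.
    env, err := cel.NewEnv(
        cel.Variable("files", cel.MapType(cel.StringType, cel.ListType(cel.StringType))),
    )
    if err != nil {
        panic(err)
    }

    // In PaC this would come from a single GetFiles() call per webhook event;
    // here it is hard-coded sample data.
    changed := map[string][]string{
        "all":      {"docs/readme.md", "images/foo/Dockerfile"},
        "added":    {"docs/readme.md"},
        "modified": {"images/foo/Dockerfile"},
    }

    exprs := []string{
        `files.all.exists(x, x.matches("images/foo/.*"))`,
        `files.all.exists(x, x.matches("images/bar/.*"))`,
    }

    for _, expr := range exprs {
        ast, iss := env.Compile(expr)
        if iss != nil && iss.Err() != nil {
            panic(iss.Err())
        }
        prg, err := env.Program(ast)
        if err != nil {
            panic(err)
        }
        // No API call here: every expression is evaluated against the same
        // cached file list.
        out, _, err := prg.Eval(map[string]any{"files": changed})
        if err != nil {
            panic(err)
        }
        fmt.Printf("%-50s => %v\n", expr, out.Value())
    }
}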

Environment

When This Issue Occurs

This affects repositories that:

  1. Have many PipelineRuns (50+) with CEL-based triggers
  2. Use files.* patterns for path-based triggering (e.g., files.all.exists(x, x.matches("path/...")))
  3. Follow a monorepo layout with per-component or per-image pipelines, where this pattern is common

Typical use case: A container image repository with multiple images, each having separate PipelineRuns for different events and branches, all using file path matching to trigger only on relevant changes.

Scaling behavior:

  • 10-20 PipelineRuns: ~30-60s delay
  • 50 PipelineRuns: ~2.5 minutes
  • 100 PipelineRuns: ~5 minutes (HTTP/2 errors possible)
  • 150+ PipelineRuns: High likelihood of timeout/retry failures

Attempted Workarounds

None of these address the root cause:

  1. Add on-event/on-target-branch annotations: Reduces PipelineRuns evaluated but doesn't eliminate serial API calls for remaining ones
  2. Reduce PipelineRuns: Not feasible for legitimate multi-component repositories
  3. Remove files.* patterns: Triggers all pipelines on every change, wasting resources
  4. Increase timeouts: Delays the problem but doesn't prevent HTTP/2 saturation

Additional Context

Example repository that hit this issue: https://gitlab.com/redhat/hummingbird/containers

  • 132 PipelineRuns using file-based triggers
  • Real delay observed: 3 hours 36 minutes for a single merge request
  • Expected delay with caching: ~6 seconds

Log evidence from production: Multiple INTERNAL_ERROR HTTP/2 stream failures during CEL evaluation, requiring retries with exponential backoff.

AI Disclaimer

The analysis and write-up were assisted by Claude.

cc @scoheb
