C#: Overlay extraction support. #20425

michaelnebel · 2025-09-12T15:08:56Z

In this PR we:

- Add support for overlay extraction for C# in BMN.
- Add discard predicates.
- Enable overlay compilation.
- Add some unit and QL tests.

The strategy for doing overlay extraction in C# is as follows.
The CLI supplies a JSON file with "changed" files. This information is used in the extraction context in C#.

- If a file is considered changed, we extract it as usual.
- If a file is considered unchanged, we extract a skeleton implementation. That is, we extract all signature-like information, but we do not extract expressions or statements (body implementations). Furthermore, we don't extract locations, as we re-use those from the base variant.
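The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in `OverlayInfo.cs`: the class name `ChangedFileSet`, the method names, and the assumed JSON shape (a top-level `"changes"` array of paths) are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper for illustration; not the PR's OverlayInfo implementation.
public sealed class ChangedFileSet
{
    private readonly HashSet<string> changed;

    private ChangedFileSet(IEnumerable<string> paths) =>
        changed = new HashSet<string>(paths, StringComparer.Ordinal);

    // Assumed JSON shape: { "changes": [ "src/A.cs", "src/B.cs" ] }
    public static ChangedFileSet Parse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var paths = new List<string>();
        if (doc.RootElement.TryGetProperty("changes", out var entries)
            && entries.ValueKind == JsonValueKind.Array)
        {
            foreach (var entry in entries.EnumerateArray())
            {
                // Ignore anything that is not a string path.
                if (entry.ValueKind == JsonValueKind.String)
                {
                    paths.Add(Normalize(entry.GetString()!));
                }
            }
        }
        return new ChangedFileSet(paths);
    }

    // Compare paths with forward slashes only (an assumption of this sketch).
    private static string Normalize(string path) => path.Replace('\\', '/');

    // True when the file is unchanged and only a skeleton should be extracted.
    public bool OnlyScaffold(string path) => !changed.Contains(Normalize(path));
}
```

With this sketch, `ChangedFileSet.Parse(json).OnlyScaffold("src/B.cs")` returns `true` for any file that is not listed, which corresponds to the skeleton case above.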

There are three DCA experiments:

1. A normal DCA execution for nightly/nightly without overlay (internally available here). This experiment shows no change in performance (but an increase in DIL sizes) and a small alert discrepancy, which is likely due to wobbliness. This experiment validates that enabling overlay compilation in the C# QL library doesn't have any negative side effects.
2. An accuracy experiment (internally available here) to check whether we get the same results when running the code-scanning security queries on full extraction compared to overlay extraction. According to DCA we get 97.4% accuracy (which is above the 90% minimum target). It looks like the missing results are all located in generated files. At the moment we only extract "changed" files fully in overlay mode, which could explain the missing results, as the files are generated on the fly by the extractor (by running source generators). That is, we are only doing a skeleton extraction of the generated files, as they are considered "unchanged".
3. A performance experiment to compare a full extraction with overlay extraction (internally available here). According to DCA we get:

   - Analysis time +4%
   - Database build time -19%
   - TRAP import time -34%
   - End-to-end time -11%
   - Database size +56%

There is still plenty that could be done to improve the quality of the overlay database and query execution. Known possible improvements:

- Address the accuracy problem mentioned above.
- It looks like there is a discrepancy in the number of "locatable" entities (from global code) in an overlay-generated database, which indicates that we are not removing everything that could be removed (discarding could be improved).
- All the QL code needs to be decorated with the relevant overlay annotations.

Copilot

Pull Request Overview

This PR introduces overlay extraction support for C#, enabling incremental database creation. The strategy extracts only signature information (types, methods, parameters) for unchanged files while fully extracting changed files, reusing locations from the base database. Key changes include adding discard predicates, enabling overlay compilation, and modifying the extraction context to skip expression/statement extraction for unchanged files.

Reviewed Changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `csharp/ql/lib/semmlecode.csharp.dbscheme` | Adds `databaseMetadata`, `overlayChangedFiles` relations and `@locatable` type for overlay support |
| `csharp/ql/lib/semmle/code/csharp/Overlay.qll` | Implements entity discard predicates for overlay analysis |
| `csharp/ql/lib/semmle/code/csharp/Conversion.qll` | Refactors type constraint logic by extracting `typeConstraintToBaseType` predicate |
| `csharp/ql/lib/qlpack.yml` | Enables overlay compilation with `compileForOverlayEval: true` |
| `csharp/ql/lib/csharp.qll` | Imports Overlay module |
| `csharp/extractor/Semmle.Util/EnvironmentVariables.cs` | Adds methods to detect overlay mode and read overlay changes file path |
| `csharp/extractor/Semmle.Extraction.CSharp/Extractor/OverlayInfo.cs` | Implements overlay information tracking and changed file detection |
| `csharp/extractor/Semmle.Extraction.CSharp/Extractor/Context.cs` | Adds `OnlyScaffold` flag to control skeleton extraction for unchanged files |
| Multiple entity extractors | Conditionally skip location extraction when `OnlyScaffold` is true |
| `csharp/codeql-extractor.yml` | Declares overlay support version |
| Test files | Adds overlay test infrastructure and expected results |
| Upgrade/downgrade files | Provides database schema migration support |

…ion context.

…ns and some named TRAP entities.

…s when not scaffolding and only extract attributes when they are in source code in overlay mode.

jbj · 2025-11-05T10:21:03Z

Based on your DCA results (I haven't reviewed the code), it sounds to me like this is good enough to staff-ship 🎉 but that more follow-up work is needed.

> It looks like the missing results all are located in generated files.

That should be fine. With or without incrementality, the PR view doesn't show alerts in generated files because they're not in the changed-files list in the UI.

> If a file is considered un-changed, we extract a skeleton implementation.

That's fine for now, but do you see a path forward to only extracting skeletons of the (transitive) dependencies of changed files? That's what we do for Java. The ideal is that the extraction process should take time proportional to the number of changed files (plus a minimal set of dependencies).

> Analysis time +4%
> Database size +56%

This is comparable to what we've seen for other languages. Analysis time should come down when overlay[local] annotations are added in later PRs. Maybe the database size increase is on the high side; possible causes:

- The string pool (default/cache/cached-strings/pools) might be large. We have a known bug about that, but QL-level changes can also reduce the string pool (like avoiding getQualifiedName, as you know).
- The tuple pool (default/cache/cached-strings/tuple-pool) might be large. It's a known issue that unnecessary IPA types persist in the tuple pool of base databases, but that should matter less when more code gets marked as local, so I wouldn't worry about it.
- The id pool (default/idPool) might be large. You've already minimised the use of TRAP keys for locations, so there are probably no easy wins here.

In summary, these numbers look good to me.

> Database build time -19%

That's a small reduction compared to what we've seen in other languages. I'm guessing it's because the C# extractor does a lot of work setting up the build and the dependencies before it gets to the actual code. I hope github/codeql-action#3117 will help here, or at least it will become clearer what work remains. If there's any computation you want to cache between the full analysis and the PR, it can be stored in $CODEQL_EXTRACTOR_CSHARP_OVERLAY_BASE_METADATA_OUT.

Extracting only the necessary dependencies as skeleton files should also help.

michaelnebel · 2025-11-05T12:58:45Z

@jbj : Thank you very much for the feedback! I really appreciate it!

> That should be fine. With or without incrementality, the PR view doesn't show alerts in generated files because they're not in the changed-files list in the UI.

Yes, exactly. We could consider always extracting generated files, if we can't get a mapping between original <-> generated files. As you mention, this is less urgent.

> do you see a path forward to only extracting skeletons of the (transitive) dependencies of changed files? That's what we do for Java. The ideal is that the extraction process should take time proportional to the number of changed files (plus a minimal set of dependencies).

Really good question!
At the moment, we use the context (changed files) to decide whether we make a full extraction or a scaffold of a given symbol just before the entity is written to TRAP. However, if we are only interested in scaffolding the dependencies (transitively), then maybe we can do the filtering on changed files earlier (at the syntax tree level). At least for a call to a method outside the changed files, I think we would extract the corresponding symbol Method entity, which would be populated (as this is a cached entity in the extractor). However, for dependencies we would need to supply information that only scaffolding is needed. Not sure whether this will work. At least there is something to think about.
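To make the idea of scaffolding only transitive dependencies concrete, here is a rough sketch, assuming a file-level dependency map is available. The names `DependencyPlanner`, `ExtractionMode`, and the `dependsOn` map are hypothetical and are not part of the extractor; how such a map would be obtained is exactly the open question.

```csharp
using System.Collections.Generic;

// Hypothetical classification for this sketch; Skip is the default value.
public enum ExtractionMode { Skip, Scaffold, Full }

public static class DependencyPlanner
{
    // Changed files get Full; anything reachable from them through
    // dependsOn gets Scaffold; files absent from the result are skipped.
    public static Dictionary<string, ExtractionMode> Plan(
        IEnumerable<string> changedFiles,
        IReadOnlyDictionary<string, string[]> dependsOn)
    {
        var plan = new Dictionary<string, ExtractionMode>();
        var queue = new Queue<string>();

        // Mark every changed file first so a dependency edge can never
        // downgrade a changed file to Scaffold.
        foreach (var file in changedFiles)
        {
            if (plan.TryAdd(file, ExtractionMode.Full))
            {
                queue.Enqueue(file);
            }
        }

        // Breadth-first walk over the (assumed) dependency edges.
        while (queue.Count > 0)
        {
            var file = queue.Dequeue();
            if (!dependsOn.TryGetValue(file, out var deps))
            {
                continue;
            }
            foreach (var dep in deps)
            {
                if (plan.TryAdd(dep, ExtractionMode.Scaffold))
                {
                    queue.Enqueue(dep);
                }
            }
        }
        return plan;
    }
}
```

The work done by `Plan` is proportional to the changed files plus their reachable dependencies, which is the property the question above asks for. A lookup such as `plan.GetValueOrDefault(file)` yields `ExtractionMode.Skip` for files that were never reached.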

> That's a small reduction compared to what we've seen in other languages. I'm guessing it's because the C# extractor does a lot of work setting up the build and the dependencies before it gets to the actual code.

Yes, I think downloading dependencies is a large portion of the overall time. Do you know whether it will be possible to measure the benefit of cached dependencies using DCA?

jbj · 2025-11-05T13:01:41Z

> Do you know, whether it will be possible to measure the benefit of cached dependencies using DCA?

I know that @ginsbach has experimented with this for Java. I'll pull you into the relevant thread on Slack.

hvitved

Overall LGTM, great work 💪

hvitved · 2025-11-07T09:31:36Z

csharp/codeql-extractor.yml

 display_name: "C#"
 version: 1.22.1
 column_kind: "utf16"
+overlay_support_version: 20250626


OOI, where does this date come from?

hvitved · 2025-11-07T09:33:30Z

csharp/extractor/Semmle.Extraction.CSharp/Extractor/OverlayInfo.cs

+        ///  ]
+        /// }
+        /// </summary>
+        public record ChangedFiles


Can it actually be made private?

Yes, it can 😄

hvitved · 2025-11-07T09:39:35Z

csharp/extractor/Semmle.Extraction.CSharp/Entities/CommentBlock.cs

        {
            trapFile.commentblock(this);
-            WriteLocationToTrap(trapFile.commentblock_location, this, Context.CreateLocation(Symbol.Location));
+            if (!Context.OnlyScaffold)


Why don't we simply return when Context.OnlyScaffold holds?

For the "named" TRAP entities, I have just tried to avoid re-extracting the source locations. Will try and make the change. In this case the child information is extracted below. I will restructure a bit to streamline implementation.
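The difference between skipping only the location and returning early can be sketched as below. This is an illustration with hypothetical types (`SketchContext`, `CommentBlockSketch`) and made-up tuple strings, not the extractor's real `TrapWriter` API; it only mirrors the shape of the diff under review, where the entity tuple and child tuples are still written in scaffold mode.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the extraction context in this sketch.
public sealed class SketchContext
{
    public SketchContext(bool onlyScaffold) => OnlyScaffold = onlyScaffold;
    public bool OnlyScaffold { get; }
}

public static class CommentBlockSketch
{
    // Returns the tuples that would be emitted, as illustrative strings.
    public static List<string> Populate(SketchContext context, IReadOnlyList<string> lines)
    {
        var trap = new List<string> { "commentblock(#block)" };

        // Locations are re-used from the base variant when scaffolding,
        // so only emit them for fully extracted files.
        if (!context.OnlyScaffold)
        {
            trap.Add("commentblock_location(#block, #loc)");
        }

        // Child information is still written after the location check,
        // which is why an early return would change what gets extracted.
        for (var i = 0; i < lines.Count; i++)
        {
            trap.Add($"commentblock_child(#block, #line{i}, {i})");
        }
        return trap;
    }
}
```

In this sketch, an early `return` right after the first tuple would drop the child tuples in scaffold mode, so whether that is acceptable depends on whether those tuples are needed from unchanged files.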

hvitved · 2025-11-07T09:39:42Z

csharp/extractor/Semmle.Extraction.CSharp/Entities/CommentLine.cs

            location = Context.CreateLocation(Location);
            trapFile.commentline(this, Type == CommentLineType.MultilineContinuation ? CommentLineType.Multiline : Type, Text, RawText);
-            WriteLocationToTrap(trapFile.commentline_location, this, location);
+            if (!Context.OnlyScaffold)


hvitved · 2025-11-07T09:42:47Z

csharp/extractor/Semmle.Extraction.CSharp/Entities/UsingDirective.cs

                if (info.Symbol is INamespaceSymbol namespaceSymbol)
                {
                    var ns = Namespace.Create(Context, namespaceSymbol);
                    trapFile.using_namespace_directives(this, ns);


Why are we extracting this in scaffold mode?

Uh, yes, they are indeed * ID elements, so we should probably avoid extracting them altogether when scaffolding. Will try and make that change.

hvitved · 2025-11-07T10:30:43Z