Skip to content

Conversation

@michaelnebel
Copy link
Contributor

@michaelnebel michaelnebel commented Sep 12, 2025

In this PR we

  • Add support for overlay extraction for C# in BMN.
  • Add discard predicates.
  • Enable overlay compilation.
  • Add some unit and QL tests.

The strategy for doing overlay extraction in C# is as follows.
The CLI supplies a JSON file with "changed" files. This information is used in the extraction context in C#.

  • If a file is considered changed, we extract it as usual.
  • If a file is considered un-changed, we extract a skeleton implementation. That is, we extract all signature like information, but we do not extract expressions or statements (body implementations). Furthermore, we don't extract locations as we re-use those from the base variant.

There are three DCA experiments

  1. A normal DCA execution for nightly/nightly without overlay (internally available here). This experiment shows no change in performance (but an increase in DIL sizes) and a small alert discrepancy which is likely due to wobliness. This experiment is to validate that enabling overlay compilation in the C# QL library don't have any negative side effects.
  2. An accuracy experiment to check (internally available here), whether we get the same results when running the code-scanning security queries on full extraction compared to overlay extraction. According to DCA we get 97.4% accuracy (which is above the 90% min target). It looks like the missing results all are located in generated files. At the moment we only extract "changed" files fully in overlay mode, which could explain the missing results as the files are generated on the fly by the extractor (by running source generators). That is, we are only doing a skeleton extraction of the generated files as they are considered "unchanged".
  3. A performance experiment to compare a full extraction with overlay extraction (internally available here). According to DCA we get
  • Analysis time +4%
  • Database build time -19%
  • TRAP import time -34%
  • End-to-end time -11%
  • Database size +56%

There is still plenty that could be done to improve the quality of the overlay database and query execution. Known possible improvements

  1. Address the accuracy problem mentioned above.
  2. It looks like there is a discrepancy between the number of "locatable" entities (from global) code between an overlay generated database, which indicates that we are not removing everything that could be removed (discarding could be improved).
  3. All the QL code needs to be decorated with the relevant overlay annotations.

@github-actions github-actions bot added the C# label Sep 12, 2025
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch 2 times, most recently from 19c75fe to 41e57d9 Compare September 17, 2025 12:41
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch from 41e57d9 to beb6cd3 Compare October 21, 2025 11:23
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch from beb6cd3 to cd1ba5f Compare October 21, 2025 12:24
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch 2 times, most recently from 1155ee2 to d158b1f Compare October 22, 2025 11:43
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch from d158b1f to 871c6c9 Compare October 23, 2025 14:41
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch from cd0b31c to b749636 Compare October 27, 2025 09:50
@michaelnebel michaelnebel changed the title [WIP] C#: Basic overlay extraction support. C#: Basic overlay extraction support. Oct 27, 2025
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch 2 times, most recently from 6db88cf to 915770b Compare October 27, 2025 13:42
@michaelnebel michaelnebel changed the title C#: Basic overlay extraction support. C#: Overlay extraction support. Oct 27, 2025
@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch 3 times, most recently from 5fe0d0e to 6ea8338 Compare October 30, 2025 12:46
@michaelnebel michaelnebel requested a review from hvitved October 31, 2025 07:27
@michaelnebel michaelnebel marked this pull request as ready for review October 31, 2025 07:27
@michaelnebel michaelnebel requested a review from a team as a code owner October 31, 2025 07:27
Copilot AI review requested due to automatic review settings October 31, 2025 07:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces overlay extraction support for C#, enabling incremental database creation. The strategy extracts only signature information (types, methods, parameters) for unchanged files while fully extracting changed files, reusing locations from the base database. Key changes include adding discard predicates, enabling overlay compilation, and modifying the extraction context to skip expression/statement extraction for unchanged files.

Reviewed Changes

Copilot reviewed 59 out of 59 changed files in this pull request and generated no comments.

Show a summary per file
File Description
csharp/ql/lib/semmlecode.csharp.dbscheme Adds databaseMetadata, overlayChangedFiles relations and @locatable type for overlay support
csharp/ql/lib/semmle/code/csharp/Overlay.qll Implements entity discard predicates for overlay analysis
csharp/ql/lib/semmle/code/csharp/Conversion.qll Refactors type constraint logic by extracting typeConstraintToBaseType predicate
csharp/ql/lib/qlpack.yml Enables overlay compilation with compileForOverlayEval: true
csharp/ql/lib/csharp.qll Imports Overlay module
csharp/extractor/Semmle.Util/EnvironmentVariables.cs Adds methods to detect overlay mode and read overlay changes file path
csharp/extractor/Semmle.Extraction.CSharp/Extractor/OverlayInfo.cs Implements overlay information tracking and changed file detection
csharp/extractor/Semmle.Extraction.CSharp/Extractor/Context.cs Adds OnlyScaffold flag to control skeleton extraction for unchanged files
Multiple entity extractors Conditionally skip location extraction when OnlyScaffold is true
csharp/codeql-extractor.yml Declares overlay support version
Test files Adds overlay test infrastructure and expected results
Upgrade/downgrade files Provides database schema migration support

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jbj
Copy link
Contributor

jbj commented Nov 5, 2025

Based on your DCA results (I haven't reviewed the code), it sounds to me like this is good enough to staff-ship 🎉 but that more follow-up work is needed.

It looks like the missing results all are located in generated files.

That should be fine. With or without incrementality, the PR view doesn't show alerts in generated files because they're not in the changed-files list in the UI.

If a file is considered un-changed, we extract a skeleton implementation.

That's fine for now, but do you see a path forward to only extracting skeletons of the (transitive) dependencies of changed files? That's what we do for Java. The ideal is that the extraction process should take time proportional to the number of changed files (plus a minimal set of dependencies).

Analysis time +4%
Database size +56%

This is comparable to what we've seen for other languages. Analysis time should come down when overlay[local] annotations are added in later PRs. Maybe the database size increase is on the high side; possible causes:

  • The string pool (default/cache/cached-strings/pools) might be large. We have a known bug about that, but QL-level changes can also reduce the string pool (like avoiding getQualifiedName, as you know).
  • The tuple pool (default/cache/cached-strings/tuple-pool) might be large. It's a known issue that unnecessary IPA types persist in the tuple pool of base databases, but that should matter less when more code gets marked as local, so I wouldn't worry about it.
  • The id pool (default/idPool) might be large. You've already minimised the use of TRAP keys for locations, so there are probably no easy wins here.

In summary, these numbers look good to me.

Database build time -19%

That's a small reduction compared to what we've seen in other languages. I'm guessing it's because the C# extractor does a lot of work setting up the build and the dependencies before it gets to the actual code. I hope github/codeql-action#3117 will help here, or at least it will become clearer what work remains. If there's any computation you want to cache between the full analysis and the PR, it can be stored in $CODEQL_EXTRACTOR_CSHARP_OVERLAY_BASE_METADATA_OUT.

Extracting only the necessary dependencies as skeleton files should also help.

@michaelnebel
Copy link
Contributor Author

michaelnebel commented Nov 5, 2025

@jbj : Thank you very much for the feedback! I really appreciate it!

That should be fine. With or without incrementality, the PR view doesn't show alerts in generated files because they're not in the changed-files list in the UI.

Yes, exactly. We could consider always extracting generated files, if we can't get a mapping between original <-> generated files. As you mention, this is less urgent.

do you see a path forward to only extracting skeletons of the (transitive) dependencies of changed files? That's what we do for Java. The ideal is that the extraction process should take time proportional to the number of changed files (plus a minimal set of dependencies).

Really good question!
At the moment, we use the context (changed files) to decide whether we make a full extraction/scaffold of a given symbol just before the entity is written to TRAP. However, if we are only interested in scaffolding the dependencies (transitively), then maybe we can do the filtering on changed files earlier (On syntax tree level). At least for a call that calls a method outside the changed files, I think we would extract the corresponding symbol Method entity - which would be populated (as this is a cached entity in the extractor). However, for dependencies we would need to supply information that only scaffolding is needed. Not sure whether this will work. At least there is someting to think about.

That's a small reduction compared to what we've seen in other languages. I'm guessing it's because the C# extractor does a lot of work setting up the build and the dependencies before it gets to the actual code.

Yes, I think downloading dependencies is a large portion of the overall time. Do you know, whether it will be possible to measure the benefit of cached dependencies using DCA?

@jbj
Copy link
Contributor

jbj commented Nov 5, 2025

Do you know, whether it will be possible to measure the benefit of cached dependencies using DCA?

I know that @ginsbach has experimented with this for Java. I'll pull you into the relevant thread on Slack.

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, great work 💪

display_name: "C#"
version: 1.22.1
column_kind: "utf16"
overlay_support_version: 20250626
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOI, where does this date come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here.

/// ]
/// }
/// </summary>
public record ChangedFiles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it actually be made private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can 😄

{
trapFile.commentblock(this);
WriteLocationToTrap(trapFile.commentblock_location, this, Context.CreateLocation(Symbol.Location));
if (!Context.OnlyScaffold)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it that we not simply return when Context.OnlyScaffold holds?

Copy link
Contributor Author

@michaelnebel michaelnebel Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the "named" TRAP entities, I have just tried to avoid re-extracting the source locations. Will try and make the change. In this case the child information is extracted below. I will restructure a bit to streamline implementation.

location = Context.CreateLocation(Location);
trapFile.commentline(this, Type == CommentLineType.MultilineContinuation ? CommentLineType.Multiline : Type, Text, RawText);
WriteLocationToTrap(trapFile.commentline_location, this, location);
if (!Context.OnlyScaffold)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

if (info.Symbol is INamespaceSymbol namespaceSymbol)
{
var ns = Namespace.Create(Context, namespaceSymbol);
trapFile.using_namespace_directives(this, ns);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we extracting this in scaffold mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, yes, they are indeed * ID elements, so we should probably avoid extracting them altogether when scaffolding. Will try and make that change.

*/
overlay[local]
private class StarEntity =
@expr or @stmt or @diagnostic or @extractor_message or @using_directive or @type_mention;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@local_variable uses * as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, of course, @location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only source locations are * entities. Furthermore, a location is not considered a @locatable. There is a separate discard predicate for handling locations.

private predicate discardStarEntity(@locatable e) {
e instanceof StarEntity and
// Entities with *-ids can exist either in base or overlay, but not both.
exists(DiscardableEntity de | de = e |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e = any(DiscardableEntity de | is slightly less verbose.

private predicate discardNamedEntity(@locatable e) {
not e instanceof StarEntity and
// Entities with named IDs can exist both in base, overlay, or both.
exists(DiscardableEntity de | de = e |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

}

overlay[local]
private class PossibleDiscardableEntity extends DiscardableEntity instanceof @locatable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instanceof @locatable is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put all the logic directly in the DiscardableEntity class instead.

overlay[local]
abstract private class DiscardableEntity extends @locatable {
/** Gets the path to the file in which this element occurs. */
abstract string getPath();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getFilePath would be better, I think.

@michaelnebel michaelnebel force-pushed the csharp/basicextractoroverlay branch from 1ca50aa to 9d300e3 Compare November 7, 2025 14:53
@michaelnebel michaelnebel marked this pull request as draft November 7, 2025 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants