Skip to content

Conversation

@jamescrosswell
Copy link
Collaborator

@jamescrosswell jamescrosswell commented Sep 5, 2025

Resolves #2033, Resolves #1067

Todo

We're seeing an error in some entirely unrelated code on Windows ARM64. This is happening for other PRs as well, so unrelated to this PR.

Interprocess Locking

This turns out to be surprisingly difficult.

Something like a named Mutex could be used for this. However Mutexes are thread affine so they need to be released on the same thread that acquired them. Other than setting aside a dedicated thread for the lifetime of the Hub, there's no easy way to make that work.

A named Semaphore would be another option, but these aren't supported on macOS.

In the end I went with the "lock file" solution that @lucas-zimerman suggested. Keeping a lock file specific to each cache directory and opening this with FileShare.None gives us a kind of file based named mutex that we can use to avoid having multiple instances of the Hub trying to process files from the same cache directory.

@jamescrosswell jamescrosswell marked this pull request as ready for review September 11, 2025 22:25
cursor[bot]

This comment was marked as outdated.

@codecov
Copy link

codecov bot commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 77.16049% with 37 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (version6@522f80e). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/Sentry/Internal/Http/CachingTransport.cs 70.73% 19 Missing and 5 partials ⚠️
src/Sentry/Internal/CacheDirectoryCoordinator.cs 81.96% 9 Missing and 2 partials ⚠️
src/Sentry/Internal/ReadOnlyFilesystem.cs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             version6    #4498   +/-   ##
===========================================
  Coverage            ?   73.36%           
===========================================
  Files               ?      482           
  Lines               ?    17567           
  Branches            ?     3473           
===========================================
  Hits                ?    12888           
  Misses              ?     3826           
  Partials            ?      853           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bruno-garcia
Copy link
Member

@sentry review

Comment on lines +86 to +88
{
throw new InvalidOperationException("Cache directory already locked.");
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache coordinator acquisition failure throws an InvalidOperationException, but this could make it impossible to create multiple SDK instances in some scenarios. Consider making this non-fatal and falling back to a non-cached transport or using a different isolation strategy.

Did we get this right? 👍 / 👎 to inform future reviews.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Could this be a severe issue?

But was should happen if this fails?
I believe the code path comes from creating a SentryClient via the Hub.

But we don't have too much control in this CachingTransport-.ctor here.

Hmmm ...
I wonder if we should avoid trying to Acquire the Lock-File in the .ctor,
but instead in the Initialize ... or the Worker-Task ... but then the error may happen quite late.

🤔 something is troubling me but I can't quite put my finger on it just yet 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the SDK currently works... so I don't believe it's an issue. If it is, I think we could address in a separate PR.

_isolatedCacheDirectoryPath =
options.TryGetIsolatedCacheDirectoryPath() ??
throw new InvalidOperationException("Cache directory or DSN is not set.");

Comment on lines +31 to +46
/// <remarks>
/// This method can throw all of the same exceptions that the <see cref="FileStream"/> constructor can throw. The
/// caller is responsible for handling those exceptions.
/// </remarks>
public override bool TryCreateLockFile(string path, out Stream fileStream)
{
// Note that FileShare.None is implemented via advisory locks only on macOS/Linux... so it will stop
// other .NET processes from accessing the file but not other non-.NET processes. This should be fine
// in our case - we just want to avoid multiple instances of the SDK concurrently accessing the cache
fileStream = new FileStream(
path,
FileMode.OpenOrCreate,
FileAccess.ReadWrite,
FileShare.None);
return true;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TryCreateLockFile method can throw all FileStream constructor exceptions but always returns true. This is inconsistent with the interface design pattern where 'Try' methods typically return false on failure rather than throwing. Consider catching exceptions and returning false, or rename the method to indicate it can throw.

Suggested change
/// <remarks>
/// This method can throw all of the same exceptions that the <see cref="FileStream"/> constructor can throw. The
/// caller is responsible for handling those exceptions.
/// </remarks>
public override bool TryCreateLockFile(string path, out Stream fileStream)
{
// Note that FileShare.None is implemented via advisory locks only on macOS/Linux... so it will stop
// other .NET processes from accessing the file but not other non-.NET processes. This should be fine
// in our case - we just want to avoid multiple instances of the SDK concurrently accessing the cache
fileStream = new FileStream(
path,
FileMode.OpenOrCreate,
FileAccess.ReadWrite,
FileShare.None);
return true;
}
public override bool TryCreateLockFile(string path, out Stream fileStream)
{
try
{
fileStream = new FileStream(
path,
FileMode.OpenOrCreate,
FileAccess.ReadWrite,
FileShare.None);
return true;
}
catch
{
fileStream = Stream.Null;
return false;
}
}

Did we get this right? 👍 / 👎 to inform future reviews.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consistent with the behaviour for CreateFileStream and makes it easier to surface errors to callers... so I'd rather stick with the API we have.

Comment on lines 2204 to 2206
if (options.Transport is CachingTransport cachingTransport)
{
cachingTransport.Dispose(); // Release cache lock so that the cacheDirectory can be removed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Don't we need to do this for every test?

The method CapturesEventWithContextKey_Implementation does it too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only need to do it when options.CacheDirectoryPath is set... it's just to ensure the cache dir gets cleaned up. Most tests aren't using a cache dir.

var cachingTransport = Assert.IsType<CachingTransport>(_fixture.SentryOptions.Transport);
_ = Assert.IsType<FakeTransport>(cachingTransport.InnerTransport);

cachingTransport.Dispose(); // Release cache lock so that the cacheDirectory can be removed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: don't we need to do this on a Test-Class level instead, so that it applies to every Test-Method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only necessary because we set CacheDirectoryPath on line 1628.

It's a bit messy but only really a problem in tests, I think... SDK users won't need to worry about this.

Comment on lines +86 to +88
{
throw new InvalidOperationException("Cache directory already locked.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Could this be a severe issue?

But was should happen if this fails?
I believe the code path comes from creating a SentryClient via the Hub.

But we don't have too much control in this CachingTransport-.ctor here.

Hmmm ...
I wonder if we should avoid trying to Acquire the Lock-File in the .ctor,
but instead in the Initialize ... or the Worker-Task ... but then the error may happen quite late.

🤔 something is troubling me but I can't quite put my finger on it just yet 🤔


_options.LogDebug("found abandoned cache: {0}", dir);
using var coordinator = new CacheDirectoryCoordinator(dir, _options.DiagnosticLogger, _options.FileSystem);
if (!coordinator.TryAcquire())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Could we run into a concurrency issue?

E.g.

  • this recycler has started and tries to acquire a lock-file
  • in the .ctor we have also acquired a lock-file already, in a directory below
  • now we try to move files that contain a lock-file that we still have a reference via Stream on

Essentially now retrying 3 times without any effect?
But I may misunderstand something there 🤔

Copy link
Collaborator Author

@jamescrosswell jamescrosswell Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock files sit in the parent/root cache directory. The isolated cache folders are all children of that root folder, so none of the operations on the isolated cache folders should affect the locks. I've added a code comment to make this clearer:

// Note this creates a lock file in the cache directory's parent directory... not in the cache directory itself
_lockFilePath = $"{cacheDir}.lock";

@jamescrosswell jamescrosswell changed the base branch from main to version6 November 5, 2025 01:04
@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2025

Fails
🚫 Please consider adding a changelog entry for the next release.
Messages
📖 Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Instructions and example for changelog

Please add an entry to CHANGELOG.md to the "Unreleased" section. Make sure the entry includes this PR's number.

Example:

## Unreleased

### Features

- Implement process and instance isolation ([#4498](https://github.com/getsentry/sentry-dotnet/pull/4498))

If none of the above apply, you can opt out of this check by adding #skip-changelog to the PR description or adding a skip-changelog label.

Generated by 🚫 dangerJS against 2be400d

@jamescrosswell
Copy link
Collaborator Author

When I changed the merge target, it didn't automatically trigger a build, so I triggered this manually:
https://github.com/getsentry/sentry-dotnet/actions/runs/19088039748

All green 🟢

@jamescrosswell
Copy link
Collaborator Author

@Flash0ver I think the only remaining piece of feedback to address (not sure what happened to it) is how we handle upgrading from previous SDK versions, if people have files in their cache when they do so (since we're changing the cache directory).

Options:

  1. Deal with it via upgrade docs, in the Breaking Changes section of the changelog
  2. Modify the code to automatically import envelopes from the cache root directory (we could maybe remove that code in a future v7 release).

I think since this could potentially result in lost client data, it's worth opting for door number 2 here... but only if we're broadly happy with the PR aside from that.

I addressed all the review feedback and changed the merge target to version 6.

{
return ResultMigrationLockNotAcquired;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Race-Condition in Migration Locking Mechanism

The MigrateVersion5Cache method checks if the marker file exists before acquiring the migration lock, creating a race condition. Multiple processes could check the marker file simultaneously, all see it doesn't exist, then compete for the lock. The marker file check needs to happen after acquiring the lock to ensure atomicity of the check-and-migrate operation. Without this, concurrent processes may incorrectly attempt migration or return misleading status codes.

Fix in Cursor Fix in Web

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While technically possible, worst case, two threads run the migration one after the other (and the second thread does nothing).

The current design will be more optimal 99.999% of the time (i.e. when not upgrading), since it avoids unnecessarily creating and deleting lock files for the root cache directory on every run.

_options.FileSystem.WriteAllTextToFile(procFile, "dummy content");

// Act
transport.MigrateVersion5Cache(CancellationToken.None);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Assert that the result is ResultMigrationOk

}

[Fact]
public async Task MigrateVersion5Cache_MovesEnvelopesFromBaseAndProcessing()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Should we add a test for ResultMigrationNoBaseCacheDir?

/// In version 5, there were no isolated subdirectories.
/// </summary>
/// <param name="cancellationToken"></param>
/// <returns>An integer return code to facillitate testing</returns>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// <returns>An integer return code to facillitate testing</returns>
/// <returns>An integer return code to facilitate testing.</returns>

return Task.CompletedTask;
}

// Stop worker and wait until it finishes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: This comment is confusing to me ... unless I misunderstand something 😟:

We don't wait until the stopped worker is finished here ... but the caller is awaiting it's completion/cancellation.

public void GetIsolatedFolderName_MissingDsn_UniqueForInitcount()
{
var o = new SentryOptions { CacheDirectoryPath = "c:\\cache", Dsn = null };
var folder =o.GetIsolatedFolderName();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick

Suggested change
var folder =o.GetIsolatedFolderName();
var folder = o.GetIsolatedFolderName();

CHANGELOG.md Outdated
Comment on lines 5 to 7
### Features

- Implement process and instance isolation so that multiple instances of the Sentry SDK can be instantiated inside the same process when using the Caching Transport ([#4498](https://github.com/getsentry/sentry-dotnet/pull/4498))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: merge into already existing Unreleased Features section

@jamescrosswell jamescrosswell merged commit 570eae6 into version6 Nov 12, 2025
14 of 15 checks passed
@jamescrosswell jamescrosswell deleted the isolated-cache branch November 12, 2025 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Offline caching should support concurrent process instances Errors when initializing Sentry twice with caching enabled

5 participants