-
-
Notifications
You must be signed in to change notification settings - Fork 227
Implement process and instance isolation #4498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## version6 #4498 +/- ##
===========================================
Coverage ? 73.36%
===========================================
Files ? 482
Lines ? 17567
Branches ? 3473
===========================================
Hits ? 12888
Misses ? 3826
Partials ? 853 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@sentry review |
| { | ||
| throw new InvalidOperationException("Cache directory already locked."); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache coordinator acquisition failure throws an InvalidOperationException, but this could make it impossible to create multiple SDK instances in some scenarios. Consider making this non-fatal and falling back to a non-cached transport or using a different isolation strategy.
Did we get this right? 👍 / 👎 to inform future reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Could this be a severe issue?
But was should happen if this fails?
I believe the code path comes from creating a SentryClient via the Hub.
But we don't have too much control in this CachingTransport-.ctor here.
Hmmm ...
I wonder if we should avoid trying to Acquire the Lock-File in the .ctor,
but instead in the Initialize ... or the Worker-Task ... but then the error may happen quite late.
🤔 something is troubling me but I can't quite put my finger on it just yet 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how the SDK currently works... so I don't believe it's an issue. If it is, I think we could address in a separate PR.
sentry-dotnet/src/Sentry/Internal/Http/CachingTransport.cs
Lines 81 to 83 in ee30f23
| _isolatedCacheDirectoryPath = | |
| options.TryGetIsolatedCacheDirectoryPath() ?? | |
| throw new InvalidOperationException("Cache directory or DSN is not set."); |
| /// <remarks> | ||
| /// This method can throw all of the same exceptions that the <see cref="FileStream"/> constructor can throw. The | ||
| /// caller is responsible for handling those exceptions. | ||
| /// </remarks> | ||
| public override bool TryCreateLockFile(string path, out Stream fileStream) | ||
| { | ||
| // Note that FileShare.None is implemented via advisory locks only on macOS/Linux... so it will stop | ||
| // other .NET processes from accessing the file but not other non-.NET processes. This should be fine | ||
| // in our case - we just want to avoid multiple instances of the SDK concurrently accessing the cache | ||
| fileStream = new FileStream( | ||
| path, | ||
| FileMode.OpenOrCreate, | ||
| FileAccess.ReadWrite, | ||
| FileShare.None); | ||
| return true; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TryCreateLockFile method can throw all FileStream constructor exceptions but always returns true. This is inconsistent with the interface design pattern where 'Try' methods typically return false on failure rather than throwing. Consider catching exceptions and returning false, or rename the method to indicate it can throw.
| /// <remarks> | |
| /// This method can throw all of the same exceptions that the <see cref="FileStream"/> constructor can throw. The | |
| /// caller is responsible for handling those exceptions. | |
| /// </remarks> | |
| public override bool TryCreateLockFile(string path, out Stream fileStream) | |
| { | |
| // Note that FileShare.None is implemented via advisory locks only on macOS/Linux... so it will stop | |
| // other .NET processes from accessing the file but not other non-.NET processes. This should be fine | |
| // in our case - we just want to avoid multiple instances of the SDK concurrently accessing the cache | |
| fileStream = new FileStream( | |
| path, | |
| FileMode.OpenOrCreate, | |
| FileAccess.ReadWrite, | |
| FileShare.None); | |
| return true; | |
| } | |
| public override bool TryCreateLockFile(string path, out Stream fileStream) | |
| { | |
| try | |
| { | |
| fileStream = new FileStream( | |
| path, | |
| FileMode.OpenOrCreate, | |
| FileAccess.ReadWrite, | |
| FileShare.None); | |
| return true; | |
| } | |
| catch | |
| { | |
| fileStream = Stream.Null; | |
| return false; | |
| } | |
| } |
Did we get this right? 👍 / 👎 to inform future reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is consistent with the behaviour for CreateFileStream and makes it easier to surface errors to callers... so I'd rather stick with the API we have.
test/Sentry.Tests/HubTests.cs
Outdated
| if (options.Transport is CachingTransport cachingTransport) | ||
| { | ||
| cachingTransport.Dispose(); // Release cache lock so that the cacheDirectory can be removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Don't we need to do this for every test?
The method CapturesEventWithContextKey_Implementation does it too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only need to do it when options.CacheDirectoryPath is set... it's just to ensure the cache dir gets cleaned up. Most tests aren't using a cache dir.
| var cachingTransport = Assert.IsType<CachingTransport>(_fixture.SentryOptions.Transport); | ||
| _ = Assert.IsType<FakeTransport>(cachingTransport.InnerTransport); | ||
|
|
||
| cachingTransport.Dispose(); // Release cache lock so that the cacheDirectory can be removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: don't we need to do this on a Test-Class level instead, so that it applies to every Test-Method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only necessary because we set CacheDirectoryPath on line 1628.
It's a bit messy but only really a problem in tests, I think... SDK users won't need to worry about this.
| { | ||
| throw new InvalidOperationException("Cache directory already locked."); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Could this be a severe issue?
But was should happen if this fails?
I believe the code path comes from creating a SentryClient via the Hub.
But we don't have too much control in this CachingTransport-.ctor here.
Hmmm ...
I wonder if we should avoid trying to Acquire the Lock-File in the .ctor,
but instead in the Initialize ... or the Worker-Task ... but then the error may happen quite late.
🤔 something is troubling me but I can't quite put my finger on it just yet 🤔
|
|
||
| _options.LogDebug("found abandoned cache: {0}", dir); | ||
| using var coordinator = new CacheDirectoryCoordinator(dir, _options.DiagnosticLogger, _options.FileSystem); | ||
| if (!coordinator.TryAcquire()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Could we run into a concurrency issue?
E.g.
- this recycler has started and tries to acquire a lock-file
- in the
.ctorwe have also acquired a lock-file already, in a directory below - now we try to move files that contain a lock-file that we still have a reference via
Streamon
Essentially now retrying 3 times without any effect?
But I may misunderstand something there 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock files sit in the parent/root cache directory. The isolated cache folders are all children of that root folder, so none of the operations on the isolated cache folders should affect the locks. I've added a code comment to make this clearer:
sentry-dotnet/src/Sentry/Internal/CacheDirectoryCoordinator.cs
Lines 22 to 23 in ada094b
| // Note this creates a lock file in the cache directory's parent directory... not in the cache directory itself | |
| _lockFilePath = $"{cacheDir}.lock"; |
Instructions and example for changelogPlease add an entry to Example: ## Unreleased
### Features
- Implement process and instance isolation ([#4498](https://github.com/getsentry/sentry-dotnet/pull/4498))If none of the above apply, you can opt out of this check by adding |
|
When I changed the merge target, it didn't automatically trigger a build, so I triggered this manually: All green 🟢 |
|
@Flash0ver I think the only remaining piece of feedback to address (not sure what happened to it) is how we handle upgrading from previous SDK versions, if people have files in their cache when they do so (since we're changing the cache directory). Options:
I think since this could potentially result in lost client data, it's worth opting for door number 2 here... but only if we're broadly happy with the PR aside from that. I addressed all the review feedback and changed the merge target to version 6. |
| { | ||
| return ResultMigrationLockNotAcquired; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Race-Condition in Migration Locking Mechanism
The MigrateVersion5Cache method checks if the marker file exists before acquiring the migration lock, creating a race condition. Multiple processes could check the marker file simultaneously, all see it doesn't exist, then compete for the lock. The marker file check needs to happen after acquiring the lock to ensure atomicity of the check-and-migrate operation. Without this, concurrent processes may incorrectly attempt migration or return misleading status codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While technically possible, worst case, two threads run the migration one after the other (and the second thread does nothing).
The current design will be more optimal 99.999% of the time (i.e. when not upgrading), since it avoids unnecessarily creating and deleting lock files for the root cache directory on every run.
| _options.FileSystem.WriteAllTextToFile(procFile, "dummy content"); | ||
|
|
||
| // Act | ||
| transport.MigrateVersion5Cache(CancellationToken.None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Assert that the result is ResultMigrationOk
| } | ||
|
|
||
| [Fact] | ||
| public async Task MigrateVersion5Cache_MovesEnvelopesFromBaseAndProcessing() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Should we add a test for ResultMigrationNoBaseCacheDir?
| /// In version 5, there were no isolated subdirectories. | ||
| /// </summary> | ||
| /// <param name="cancellationToken"></param> | ||
| /// <returns>An integer return code to facillitate testing</returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// <returns>An integer return code to facillitate testing</returns> | |
| /// <returns>An integer return code to facilitate testing.</returns> |
| return Task.CompletedTask; | ||
| } | ||
|
|
||
| // Stop worker and wait until it finishes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: This comment is confusing to me ... unless I misunderstand something 😟:
We don't wait until the stopped worker is finished here ... but the caller is awaiting it's completion/cancellation.
| public void GetIsolatedFolderName_MissingDsn_UniqueForInitcount() | ||
| { | ||
| var o = new SentryOptions { CacheDirectoryPath = "c:\\cache", Dsn = null }; | ||
| var folder =o.GetIsolatedFolderName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick
| var folder =o.GetIsolatedFolderName(); | |
| var folder = o.GetIsolatedFolderName(); |
CHANGELOG.md
Outdated
| ### Features | ||
|
|
||
| - Implement process and instance isolation so that multiple instances of the Sentry SDK can be instantiated inside the same process when using the Caching Transport ([#4498](https://github.com/getsentry/sentry-dotnet/pull/4498)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: merge into already existing Unreleased Features section
Resolves #2033, Resolves #1067
Todo
We're seeing an error in some entirely unrelated code on Windows ARM64. This is happening for other PRs as well, so unrelated to this PR.
Interprocess Locking
This turns out to be surprisingly difficult.
Something like a named Mutex could be used for this. However Mutexes are thread affine so they need to be released on the same thread that acquired them. Other than setting aside a dedicated thread for the lifetime of the Hub, there's no easy way to make that work.
A named Semaphore would be another option, but these aren't supported on macOS.
In the end I went with the "lock file" solution that @lucas-zimerman suggested. Keeping a lock file specific to each cache directory and opening this with FileShare.None gives us a kind of file based named mutex that we can use to avoid having multiple instances of the Hub trying to process files from the same cache directory.