Skip to content

Conversation

@ScottDugas
Copy link
Collaborator

@ScottDugas ScottDugas commented Sep 9, 2025

This introduces a new KeySpacePath.importData that will import DataInKeySpacePath as gathered by KeySpacePath.exportAllData.

The new method works when importing data exported from other clusters.

Resolves: #3573

@ScottDugas ScottDugas added the enhancement New feature or request label Nov 9, 2025
@ScottDugas ScottDugas changed the title Keyspace import Introduce KeySpacePath.importData to import previously exported data Nov 9, 2025
@ScottDugas ScottDugas marked this pull request as ready for review November 10, 2025 16:08

@Nonnull
@Override
public CompletableFuture<Void> importData(@Nonnull FDBRecordContext context,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the assumption here that resource control be handled externally? Meaning, if the given stream of data is too large, someone else is going to trim it and retry? In that sense, would it make sense to provide a single import method (for a single item) and return the future, as there seems no efficiency by working on the whole collection?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea would be that the calling process would limit dataToImport, and decrease that limit in response to failures.
I hadn't really considered the possibility of limiting in response to the number that were imported, our existing code does have some dynamic response (e.g. ThrottlingRetryingIterator and IndexingThrottle), so it is probably a good idea to support that here. It could do something like add the number imported to the exception, but that doesn't seem great.
There is some performance advantage to doing it as an iterable, namely toTupleAsync may have a cost if it is DirectoryLayerDirectory. That should be minimal, but it could go away completely by putting it on ResolvedKeySpacePath.
It's unfortunate that would put exportAllData and importData on different classes.

I can try moving this if you think it's worthwhile at this point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot/misread one other advantage: This imports all the data concurrently.
The set should be incredibly efficient, so the primary thing you are gaining in performance is resolving the path for each data item.
If a client wants to do imports one-at-a-time they can always provide dataToImport with just one element, and call it multiple times.

I feel like that is a good enough reason to stick with the current implementation, at least for now.
That being said, it may make sense to use a MapPipelinedCursor so that we are only resolving X at a time.

}

// Store the data
byte[] keyBytes = keyTuple.pack();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add some fdb timer metrics for future use (imported_count)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think a timer around importFuture makes sense.

* @return a random subset of the databases available. This may be less than {@code count} if there aren't that many
* databases available.
*/
public List<FDBDatabase> getDatabases(int count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public List<FDBDatabase> getDatabases(int count) {
public List<FDBDatabase> getRandomDatabasesSubset(int count) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible it will make less sense to exist when I remove the databases field, but if I keep this method around, I'll rename it.


@BeforeEach
void setUp() {
databases = dbExtension.getDatabases(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe sourceDb and targetDb, and some logic (like assume) around whether target DB is null?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that running with multiple clusters was only recently added, I felt it would be better to have these tests run re-using the same cluster, rather than have them not run if a developer doesn't have two FDB clusters.
(Also, I wrote these tests before that functionality existed).

Changing it to be sourceDB and destDB fields seems reasonable enough. I think I would have them both be Nonnull and have special logic to clear if they are the same.

It would be good to have a couple of these tests parameterized to copy within a cluster vs across clusters. I'll make sure to do that. At the very least copying back to where it was, clearing in-between, copying to a different path, and copying with DirectoryLayer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that importData works with a DataInKeySpacePath, I actually think copying to a different path doesn't make sense, because the DataInKeySpacePath needs to be scoped to this path.

final List<DataInKeySpacePath> exportedData = getExportedData(sourcePath);

if (databases.size() > 1) {
database = databases.get(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this intended to change the class' field value?
(I guess, it is, as after the copy all operations would be running on the "target" DB, but it is a little obscure to the reader)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was intended, but I can see how it would be confusing.
When I do the refactoring to have sourceDB and targetDB, I'll see how that shakes out. I think the only code interacting with the targetDB would be verifySingleKey so I'll make sure to check that.
At least one thing I want to ensure is clear is that we aren't accidentally reading back from the source.

List<CompletableFuture<Void>> importFutures = new ArrayList<>();

for (DataInKeySpacePath dataItem : dataToImport) {
CompletableFuture<Void> importFuture = dataItem.getPath().toTupleAsync(context).thenCompose(itemPathTuple -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that the directory layer implementation of Directory is creating a new entry in case the requested value is not found, right? This is done in the same transaction and is rolled back in case of failure, right? Can there be cases where multiple entries are created for the same resolved value? Do we care?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the directory layer will create any entries that don't exist, but I should ensure there are tests of that.
The DirectoryLayer creates entries in separate transactions, borrowing the read version. I believe it does this to:

  1. Isolate transaction conflict risk
  2. Once it is created it is cached

The general assumption is that anything that you're interacting with in the DirectoryLayer should almost always already exist, and most likely already be in the cache.

And the DirectoryLayer ensures that both the logical values (String) and resolved values (Long) are unique.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a test where the directory-layer entry won't exist: importDataWithDirectoryLayer is using a uuid for the string value, so when copying between clusters it won't exist on the destination.

context.commit();
}

copyData(root.path("company"), root.path("company"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be cases where we have to import into a different root? Say where we export from "/local/company/..." and import into "/remote/company/..."?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it is the responsibility of the caller to convert the path.
In my draft for serialization I cover this: https://github.com/FoundationDB/fdb-record-layer/pull/3747/files#diff-9422726206a402dab506820d748c97cd0b4a1b76c5f9ed8ff0f60c50a8bad8b0R342
but it's not really a test that makes sense at this point, because the DataInKeySpacePath should be already referring to /remote/company by the time it gets to import.

private void clearPath(final FDBDatabase database, final KeySpacePath path) {
try (FDBRecordContext context = database.openContext()) {
Transaction tr = context.ensureActive();
tr.clear(path.toSubspace(context).range());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would not clear the directory layer, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

final KeySpacePath dataPath = root.path("tenant").add("user_id", 999L);
setSingleKey(dataPath, Tuple.from("data"), Tuple.from("directory_test"));

copyData(root.path("tenant"), root.path("tenant"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add assertions for the resolved exported value and the way it was resolved by directory layer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is implicitly done by the verifySingleKey, which takes the dataPath. If copyData didn't resolve the path correctly, it wouldn't be found in verifySingleKey


verifySingleKey(dataPath, Tuple.from("item"), Tuple.from("final_value"));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional potential tests:

  • Large data (or any out of band error) during import
  • Import into partial path (no leaves in import data) + some remainders
  • import where data is of the wrong type

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, a test of more data than can be inserted into a single transaction would make sense, but not if I move it to ResolvedKeySpacePath and just have it take a single DataInKeySpacePath.
  2. I'm not sure what you mean by a partial path.
  3. If by data you mean the value, there is no validation, and it is not KeySpacePaths responsibility to know what is in the data. If you mean the object in the path, that should be validated above this call, and should be trust-worthy by the time you get a DataInKeySpacePath. Ideally this would be validated when you create the KeySpacePath, but it is covered in the serialization work, and I explain a bit more on the situation there: https://github.com/FoundationDB/fdb-record-layer/pull/3747/files#diff-15120b2e222e6bb7c2647b670f676b719cce8602e410487604bc87e9ea30a3b0R179

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a KeySpacePath.import method to import the results of an export

2 participants