LMDB Store: working on new ID based join iterator #5549

hmottestad · 2025-11-03T10:02:07Z

GitHub issue resolved: #

Briefly describe the changes proposed in this PR:

PR Author Checklist (see the contributor guidelines for more details):

my pull request is self-contained
I've added tests for the changes I made
I've applied code formatting (you can use mvn process-resources to format from the command line)
I've squashed my commits where necessary
every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

Copilot

Pull Request Overview

This PR adds timeout protections to test cases and implements a safeguard against infinite query result iterations in the LMDB store. The changes focus on preventing tests from hanging indefinitely by adding explicit timeouts and result count limits.

Key Changes

Added 5-second timeouts to 28 test methods across snapshot and serializable isolation level tests
Added 10-second timeout to one delete-insert test
Implemented a safety check in the eval() method to prevent collecting more than 1 million results
Added timeout handling with explicit time units to CountDownLatch.await() calls in concurrency tests

Reviewed Changes

Copilot reviewed 95 out of 96 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| `SnapshotTest.java` | Added 5-second timeout annotations to 28 test methods |
| `SerializableTest.java` | Added 5-second timeout annotations to 19 test methods |
| `DeleteInsertTest.java` | Added 10-second timeout annotation to test method |
| `IsolationLevelTest.java` | Replaced indefinite `await()` calls with 10-second timeouts on CountDownLatches |
| `DeadLockTest.java` | Added TimeUnit import and 10-second timeouts to CountDownLatch waits |
| `SparqlOrderByTest.java` | Added class-level 10-second timeout annotation |
| `pom.xml` | Updated jacoco-maven-plugin version from 0.8.13 to 0.8.14 |

Copilot · 2025-11-03T10:11:31Z

...repository/src/main/java/org/eclipse/rdf4j/testsuite/repository/optimistic/SnapshotTest.java

+				if (list.size() > 1000000) {
+					throw new RuntimeException("Too many results: " + list.size());
+				}


The error message includes list.size() which will always be 1000001 when the exception is thrown due to the condition checking > 1000000. Consider using a constant for the limit value to keep the message and condition in sync, or simplify the message to just state 'Too many results (exceeded limit of 1000000)'.
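
A minimal sketch of the constant-based variant suggested here. `BoundedCollector`, `MAX_RESULTS` and `collect` are hypothetical names chosen for illustration; this is not the PR's actual `eval()` method.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not the PR's actual eval() method.
final class BoundedCollector {
	// Single source of truth for the default limit.
	static final int MAX_RESULTS = 1_000_000;

	// Collects results and fails as soon as more than maxResults have been seen.
	static <T> List<T> collect(Iterator<T> results, int maxResults) {
		List<T> list = new ArrayList<>();
		while (results.hasNext()) {
			list.add(results.next());
			if (list.size() > maxResults) {
				// The message is built from the same value as the check, so they cannot drift apart.
				throw new IllegalStateException("Too many results (exceeded limit of " + maxResults + ")");
			}
		}
		return list;
	}

	static <T> List<T> collect(Iterator<T> results) {
		return collect(results, MAX_RESULTS);
	}
}
```

Passing the limit as a parameter also lets a test exercise the guard with a small value instead of a million elements.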

Copilot · 2025-11-03T10:11:31Z

...tory/src/main/java/org/eclipse/rdf4j/testsuite/repository/optimistic/IsolationLevelTest.java

 			try (RepositoryConnection write = store.getConnection()) {
 				start.countDown();
-				start.await();
+				start.await(10, TimeUnit.SECONDS);


The timeout value of 10 seconds is hardcoded across multiple await() calls in this file. Consider extracting this as a named constant (e.g., private static final long LATCH_TIMEOUT_SECONDS = 10;) to improve maintainability and make it easier to adjust if needed.
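
A possible shape for that constant, sketched under assumptions: `LatchTimeouts` and `awaitOrFail` are hypothetical names, not code from the PR. The sketch also checks the boolean returned by `CountDownLatch.await(long, TimeUnit)`, which is `false` when the wait times out; the diff above discards that value, so a timed-out latch would let the test continue silently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical test helper, not part of the PR.
final class LatchTimeouts {
	static final long LATCH_TIMEOUT_SECONDS = 10;

	// Waits for the latch and fails loudly if it is not released in time.
	static void awaitOrFail(CountDownLatch latch, long timeout, TimeUnit unit) throws InterruptedException {
		if (!latch.await(timeout, unit)) {
			throw new AssertionError("Latch not released within " + timeout + " " + unit);
		}
	}

	// Convenience overload using the shared default timeout.
	static void awaitOrFail(CountDownLatch latch) throws InterruptedException {
		awaitOrFail(latch, LATCH_TIMEOUT_SECONDS, TimeUnit.SECONDS);
	}
}
```

With such a helper, a call like `start.await(10, TimeUnit.SECONDS)` would become `LatchTimeouts.awaitOrFail(start)`.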

Copilot · 2025-11-03T10:11:32Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TripleStore.java

+			Varint.writeUnsigned(bb, pred);
+		}
+
+		void toDupValue(ByteBuffer bb, long subj, long pred, long obj, long context) {


The parameter 'subj' is never used.

Copilot · 2025-11-03T10:11:32Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TripleStore.java

+			Varint.writeUnsigned(bb, pred);
+		}
+
+		void toDupValue(ByteBuffer bb, long subj, long pred, long obj, long context) {


The parameter 'pred' is never used.

Copilot · 2025-11-03T10:11:32Z

.../sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/join/LmdbIdJoinQueryEvaluationStep.java

+	private final QueryEvaluationStep fallbackStep;
+	private final boolean fallbackImmediately;
+
+	public LmdbIdJoinQueryEvaluationStep(EvaluationStrategy strategy, Join join, QueryEvaluationContext context,


The parameter 'strategy' is never used.

Copilot · 2025-11-03T10:11:33Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/IdBindingInfo.java

+			}
+		}
+		for (String v : right.getVariableNames()) {
+			Integer mask = right.getPositionsMask(v);


The variable 'mask' is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Integer'.

Suggested change

-			Integer mask = right.getPositionsMask(v);
+			int mask = right.getPositionsMask(v);

Copilot · 2025-11-03T10:11:34Z

core/sail/lmdb/src/test/java/org/eclipse/rdf4j/sail/lmdb/LmdbIdJoinEvaluationTest.java

+
+		ValueFactory vf = SimpleValueFactory.getInstance();
+		IRI alice = vf.createIRI(NS, "alice");
+		IRI bob = vf.createIRI(NS, "bob");


Variable 'IRI bob' is never read.

Copilot · 2025-11-03T10:11:34Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TripleStore.java

 		 * The higher the score, the better the index is suited for matching the pattern. Lowest score is 0, which means
 		 * that the index will perform a sequential scan.
 		 */
 		public int getPatternScore(long subj, long pred, long obj, long context) {


This method overrides DupIndex.getPatternScore; it is advisable to add an Override annotation.

Copilot · 2025-11-03T10:11:34Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TripleStore.java

+			return dupsortEnabled;
+		}
+
+		public int getDupDB(boolean explicit) {


This method overrides DupIndex.getDupDB; it is advisable to add an Override annotation.

Copilot · 2025-11-03T10:11:34Z

core/sail/lmdb/src/main/java/org/eclipse/rdf4j/sail/lmdb/TripleStore.java

+			this(fieldSeq, false);
 		}

 		public char[] getFieldSeq() {


This method overrides DupIndex.getFieldSeq; it is advisable to add an Override annotation.

hmottestad · 2025-11-03T10:15:19Z

@kenwenzel I’m working on an ID-based join for the LMDB store where we don’t need to create LmdbValue objects or binding sets. There is something not quite right though, causing a bunch of tests that test SNAPSHOT and SERIALIZABLE isolation to fail. I think that ID-based joins won’t work for SERIALIZABLE isolation, but I don’t quite understand what makes SNAPSHOT fail.

kenwenzel · 2025-11-03T10:46:16Z

@hmottestad This is a great idea. Can you please explain why you would expect ID-based joins to fail for SERIALIZABLE? What is different in this case from joining on the value objects?

kenwenzel · 2025-11-03T11:06:12Z

@hmottestad I think in both cases the triples are only in memory - managed by SnapshotSailStore.
Maybe we can assign a temporary ID to the unsaved values if we introduce something like the extensible ID scheme for the LMDB store (which is still WIP).

hmottestad · 2025-11-03T12:20:50Z

I thought that SERIALIZABLE relied on reporting which statement patterns have been read, so that you can monitor for concurrent writes that would potentially invalidate your reads. I don't really want to have to reimplement that using IDs, so I would much rather rely on the existing solution that uses Values.

kenwenzel · 2025-11-03T12:35:11Z

> I thought that SERIALIZABLE relied on reporting which statement patterns have been read, so that you can monitor for concurrent writes that would potentially invalidate your reads. I don't really want to have to reimplement that using IDs, so I would much rather rely on the existing solution that uses Values.

Yes, I understand. Nevertheless, something like an ID service for query evaluation (a query-local map with long IDs and weak keys) could help to get rid of value objects altogether.
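
A rough sketch of what such a query-local ID service could look like. Everything here is an assumption for illustration: the class name `QueryLocalIds`, the generic parameter `V` standing in for an RDF value type with proper `equals`/`hashCode`, and the choice of negative numbers for temporary IDs (which presumes that persisted IDs are non-negative).

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch; V stands in for a value type with value-based equals/hashCode.
final class QueryLocalIds<V> {
	// Weak keys: an entry may disappear once its value is no longer referenced elsewhere.
	private final Map<V, Long> ids = new WeakHashMap<>();

	// Assumption: persisted IDs are non-negative, so temporary IDs count down from -1.
	private long nextTemporaryId = -1;

	// Returns the temporary ID for the value, assigning a new one on first sight.
	long idFor(V value) {
		Long existing = ids.get(value);
		if (existing != null) {
			return existing;
		}
		long id = nextTemporaryId--;
		ids.put(value, id);
		return id;
	}
}
```

The sketch is not thread-safe and offers no lookup from ID back to value; a join that needs to materialise results would need that direction as well.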

hmottestad · 2025-11-06T19:00:08Z

Before

Benchmark                                                     Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5    4.130 ±  0.019  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5    2.301 ±  0.012  ms/op
QueryBenchmark.groupByQuery                                   avgt    5    0.865 ±  0.017  ms/op
QueryBenchmark.long_chain                                     avgt    5  722.689 ±  8.042  ms/op
QueryBenchmark.lots_of_optional                               avgt    5  235.701 ±  5.209  ms/op
QueryBenchmark.minus                                          avgt    5   10.684 ±  0.422  ms/op
QueryBenchmark.multiple_sub_select                            avgt    5   58.338 ±  1.059  ms/op
QueryBenchmark.nested_optionals                               avgt    5  172.021 ±  2.225  ms/op
QueryBenchmark.optional_lhs_filter                            avgt    5   38.475 ±  1.542  ms/op
QueryBenchmark.optional_rhs_filter                            avgt    5   55.941 ±  0.600  ms/op
QueryBenchmark.ordered_union_limit                            avgt    5   75.317 ±  2.092  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5   21.376 ±  0.178  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5    4.061 ±  0.024  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5   49.389 ±  1.005  ms/op
QueryBenchmark.simple_filter_not                              avgt    5    6.079 ±  0.062  ms/op
QueryBenchmark.sub_select                                     avgt    5   70.426 ±  0.617  ms/op
QueryBenchmarkFoaf.groupByCount                               avgt    5  742.845 ± 14.336  ms/op
QueryBenchmarkFoaf.groupByCountSorted                         avgt    5  654.954 ± 14.461  ms/op
QueryBenchmarkFoaf.personsAndFriends                          avgt    5  209.077 ±  3.944  ms/op

After

Benchmark                                                     Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5    2.504 ±  0.091  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5    1.675 ±  0.049  ms/op
QueryBenchmark.groupByQuery                                   avgt    5    0.860 ±  0.002  ms/op
QueryBenchmark.long_chain                                     avgt    5  574.390 ±  6.417  ms/op
QueryBenchmark.lots_of_optional                               avgt    5  144.543 ±  1.339  ms/op
QueryBenchmark.minus                                          avgt    5    9.242 ±  0.228  ms/op
QueryBenchmark.multiple_sub_select                            avgt    5   40.386 ±  0.674  ms/op
QueryBenchmark.nested_optionals                               avgt    5  145.671 ±  1.434  ms/op
QueryBenchmark.optional_lhs_filter                            avgt    5   20.091 ±  0.259  ms/op
QueryBenchmark.optional_rhs_filter                            avgt    5   32.161 ±  0.594  ms/op
QueryBenchmark.ordered_union_limit                            avgt    5   59.742 ±  0.751  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5    9.315 ±  0.102  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5    0.896 ±  0.010  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5   59.969 ±  0.638  ms/op
QueryBenchmark.simple_filter_not                              avgt    5    4.654 ±  0.183  ms/op
QueryBenchmark.sub_select                                     avgt    5   63.857 ±  1.435  ms/op
QueryBenchmarkFoaf.groupByCount                               avgt    5  637.206 ± 29.751  ms/op
QueryBenchmarkFoaf.groupByCountSorted                         avgt    5  557.630 ± 16.917  ms/op
QueryBenchmarkFoaf.personsAndFriends                          avgt    5   49.959 ±  0.689  ms/op

QueryBenchmark.query_distinct_predicates is actually a bit slower (49.389 → 59.969 ms/op), and groupByQuery is essentially unchanged. Everything else is faster, in several cases by a wide margin.

I still haven't fixed SNAPSHOT isolation.
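
To compare the two tables, the speedup factor for a benchmark is simply the "before" score divided by the "after" score. A small sketch using a few rows copied from the tables above; the class name `Speedup` is made up for illustration.

```java
// Hypothetical helper for reading the before/after tables.
final class Speedup {
	// Values above 1.0 mean the "after" run is faster; below 1.0 means a regression.
	static double factor(double beforeMsPerOp, double afterMsPerOp) {
		return beforeMsPerOp / afterMsPerOp;
	}

	public static void main(String[] args) {
		// Scores in ms/op, taken from the benchmark output above.
		System.out.printf("complexQuery: %.2fx%n", factor(4.130, 2.504));
		System.out.printf("pathExpressionQuery2: %.2fx%n", factor(4.061, 0.896));
		System.out.printf("personsAndFriends: %.2fx%n", factor(209.077, 49.959));
		System.out.printf("query_distinct_predicates: %.2fx%n", factor(49.389, 59.969));
	}
}
```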

kenwenzel · 2025-11-06T21:38:34Z

The numbers are looking really good :-)

kenwenzel · 2025-11-10T15:39:59Z

@hmottestad Maybe you would also like to check out #5558

I've played a bit with DUPSORT and have the following results:

database is a bit smaller if only DUPSORT with variable values is used
database is way larger if DUPFIXED is used (at least around 80% more)
benchmarks are a bit slower due to matching keys and values

I've seen that DUPFIXED allows us to fetch multiple values at once, but due to the "explosion" of database size I would not recommend going that route.
Maybe it is also possible to split the keys into pairs, for example SPOC into SP and OC, and then use SP as the key for a block in the index. This block is then self-managed by us and contains a sorted list of OC pairs. It could even be compressed via Snappy or something comparable.
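
The pair-splitting idea can be modelled in memory as follows. This is only a sketch under assumptions: `IdPair` and `PairBlockIndex` are hypothetical names, the blocks live in a `TreeMap` rather than in LMDB, and compression is left out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical pair of value IDs, ordered by first and then second component.
final class IdPair implements Comparable<IdPair> {
	final long first;
	final long second;

	IdPair(long first, long second) {
		this.first = first;
		this.second = second;
	}

	@Override
	public int compareTo(IdPair other) {
		int c = Long.compare(first, other.first);
		return c != 0 ? c : Long.compare(second, other.second);
	}

	@Override
	public boolean equals(Object o) {
		return o instanceof IdPair && compareTo((IdPair) o) == 0;
	}

	@Override
	public int hashCode() {
		return 31 * Long.hashCode(first) + Long.hashCode(second);
	}
}

// Hypothetical in-memory model: one sorted block of (object, context) pairs per (subject, predicate) key.
final class PairBlockIndex {
	private final TreeMap<IdPair, TreeSet<IdPair>> blocks = new TreeMap<>();

	void add(long subj, long pred, long obj, long context) {
		blocks.computeIfAbsent(new IdPair(subj, pred), k -> new TreeSet<>()).add(new IdPair(obj, context));
	}

	// Returns the sorted (object, context) pairs stored under the given subject and predicate.
	List<IdPair> lookup(long subj, long pred) {
		TreeSet<IdPair> block = blocks.get(new IdPair(subj, pred));
		return block == null ? Collections.emptyList() : new ArrayList<>(block);
	}
}
```

Because each block is already sorted, it could be serialised as one contiguous value, which is where a compressor would plug in.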

QLever seems to be pretty fast when querying indexes. Maybe we can draw some inspiration from it:
https://ad-publications.cs.uni-freiburg.de/CIKM_qlever_BB_2017.pdf

kenwenzel · 2025-11-20T07:32:19Z

@hmottestad For a DUPSORT-based implementation see also #5558
I'm sure we can get this faster.

Could you remove everything related to DUPSORT from this PR so that we can implement it somewhere else?
I'm interested in reducing the DB size as much as we can.

kenwenzel · 2025-11-20T16:41:19Z

@hmottestad This commit here also implements pooling for cursors and internal state of LmdbRecordIterator:
c815e5d

It has no problems with different isolation levels, as cursors are closed if they become invalid.
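
The general pattern of pooling with invalidation might look like the sketch below. It is not taken from commit c815e5d: `PooledCursor` and `CursorPool` are hypothetical stand-ins, and validity is modelled, as an assumption, by comparing the transaction version a cursor was opened under with the current one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a native cursor; it only remembers the transaction version it belongs to.
final class PooledCursor {
	final long txnVersion;
	boolean closed;

	PooledCursor(long txnVersion) {
		this.txnVersion = txnVersion;
	}

	void close() {
		closed = true;
	}
}

// Hypothetical single-threaded pool; a real implementation would need synchronisation.
final class CursorPool {
	private final Deque<PooledCursor> idle = new ArrayDeque<>();

	// Reuses an idle cursor that is still valid, closing stale ones found along the way.
	PooledCursor acquire(long currentTxnVersion) {
		PooledCursor cursor;
		while ((cursor = idle.poll()) != null) {
			if (cursor.txnVersion == currentTxnVersion) {
				return cursor;
			}
			cursor.close();
		}
		return new PooledCursor(currentTxnVersion);
	}

	// Returns a cursor to the pool unless it has already been closed.
	void release(PooledCursor cursor) {
		if (!cursor.closed) {
			idle.push(cursor);
		}
	}
}
```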

hmottestad · 2025-11-20T19:43:50Z

I got a decent performance improvement from pooling cursors. Did you benchmark your code?

kenwenzel · 2025-11-20T19:50:06Z

> I got a decent performance improvement from pooling cursors. Did you benchmark your code?

Not fully, but for the complexQuery benchmark I have seen an improvement of about 10-20%.
Pooling the internal state of the iterator does not seem to help that much.

hmottestad requested a review from Copilot November 3, 2025 10:02

hmottestad changed the title ~~working on new ID based join iterator~~ LMDB Store: working on new ID based join iterator Nov 3, 2025

hmottestad marked this pull request as draft November 3, 2025 10:03

Copilot AI reviewed Nov 3, 2025

hmottestad added 20 commits November 7, 2025 04:07

cache lmdb cursor

eed68e4

wip

e5d2dec

wip

617deff

wip

4dc188b

wip

450793c

wip

fcebff5

wip

9bcc7f4

wip

05cff45

wip

02ea306

wip

a86554c

wip

9651025

wip

a04b1f2

wip

fc9a68e

wip

db32dd3

wip

c4da941

wip

af82390

wip

2b295e6

wip

3eac7e9

wip

419e62c

all tests pass

4f4518f

hmottestad added 22 commits November 7, 2025 04:07

working on new ID based join iterator

19b9ec5

working on new ID based join iterator

e30c859

working on new ID based join iterator

b23fc6f

working on new ID based join iterator

d09591a

working on new ID based join iterator

353965c

working on new ID based join iterator

7c72ba9

working on new ID based join iterator

b5973b6

working on new ID based join iterator

8a00d62

working on new ID based join iterator

81a7e40

working on new ID based join iterator

6b54eed

working on new ID based join iterator

7812d82

working on new ID based join iterator

2b5354d

working on new ID based join iterator

980cddd

working on new ID based join iterator

ab00f92

working on new ID based join iterator

9e7e2c4

working on new ID based join iterator

c7b2f15

working on new ID based join iterator

18648e9

working on new ID based join iterator

46974b1

working on new ID based join iterator

17395e8

working on new ID based join iterator

10f1631

working on new ID based join iterator

e1f28f5

working on new ID based join iterator

9852af0

hmottestad force-pushed the optimise-lmdb-record-iterator branch from e73f154 to 9852af0 Compare November 6, 2025 19:08

hmottestad mentioned this pull request Nov 17, 2025

Query result inconsistency btwn 5.2.0 and 5.1.6 on sp2b #5569

Closed
