Skip to content

Conversation

@hmottestad
Copy link
Contributor

GitHub issue resolved: #

Briefly describe the changes proposed in this PR:


PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@hmottestad hmottestad requested a review from Copilot November 3, 2025 10:02
@hmottestad hmottestad changed the title working on new ID based join iterator LMDB Store: working on new ID based join iterator Nov 3, 2025
@hmottestad hmottestad marked this pull request as draft November 3, 2025 10:03
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds timeout protections to test cases and implements a safeguard against infinite query result iterations in the LMDB store. The changes focus on preventing tests from hanging indefinitely by adding explicit timeouts and result count limits.

Key Changes

  • Added 5-second timeouts to 28 test methods across snapshot and serializable isolation level tests
  • Added 10-second timeout to one delete-insert test
  • Implemented a safety check in the eval() method to prevent collecting more than 1 million results
  • Added timeout handling with explicit time units to CountDownLatch.await() calls in concurrency tests

Reviewed Changes

Copilot reviewed 95 out of 96 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
SnapshotTest.java Added 5-second timeout annotations to 28 test methods
SerializableTest.java Added 5-second timeout annotations to 19 test methods
DeleteInsertTest.java Added 10-second timeout annotation to test method
IsolationLevelTest.java Replaced indefinite await() calls with 10-second timeouts on CountDownLatches
DeadLockTest.java Added TimeUnit import and 10-second timeouts to CountDownLatch waits
SparqlOrderByTest.java Added class-level 10-second timeout annotation
pom.xml Updated jacoco-maven-plugin version from 0.8.13 to 0.8.14

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +831 to +833
if (list.size() > 1000000) {
throw new RuntimeException("Too many results: " + list.size());
}
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message includes list.size() which will always be 1000001 when the exception is thrown due to the condition checking > 1000000. Consider using a constant for the limit value to keep the message and condition in sync, or simplify the message to just state 'Too many results (exceeded limit of 1000000)'.

Copilot uses AI. Check for mistakes.
try (RepositoryConnection write = store.getConnection()) {
start.countDown();
start.await();
start.await(10, TimeUnit.SECONDS);
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout value of 10 seconds is hardcoded across multiple await() calls in this file. Consider extracting this as a named constant (e.g., private static final long LATCH_TIMEOUT_SECONDS = 10;) to improve maintainability and make it easier to adjust if needed.

Copilot uses AI. Check for mistakes.
Varint.writeUnsigned(bb, pred);
}

void toDupValue(ByteBuffer bb, long subj, long pred, long obj, long context) {
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter 'subj' is never used.

Copilot uses AI. Check for mistakes.
Varint.writeUnsigned(bb, pred);
}

void toDupValue(ByteBuffer bb, long subj, long pred, long obj, long context) {
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter 'pred' is never used.

Copilot uses AI. Check for mistakes.
private final QueryEvaluationStep fallbackStep;
private final boolean fallbackImmediately;

public LmdbIdJoinQueryEvaluationStep(EvaluationStrategy strategy, Join join, QueryEvaluationContext context,
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter 'strategy' is never used.

Copilot uses AI. Check for mistakes.
}
}
for (String v : right.getVariableNames()) {
Integer mask = right.getPositionsMask(v);
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable 'mask' is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Integer'.

Suggested change
Integer mask = right.getPositionsMask(v);
int mask = right.getPositionsMask(v);

Copilot uses AI. Check for mistakes.

ValueFactory vf = SimpleValueFactory.getInstance();
IRI alice = vf.createIRI(NS, "alice");
IRI bob = vf.createIRI(NS, "bob");
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable 'IRI bob' is never read.

Copilot uses AI. Check for mistakes.
* The higher the score, the better the index is suited for matching the pattern. Lowest score is 0, which means
* that the index will perform a sequential scan.
*/
public int getPatternScore(long subj, long pred, long obj, long context) {
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method overrides DupIndex.getPatternScore; it is advisable to add an Override annotation.

Copilot uses AI. Check for mistakes.
return dupsortEnabled;
}

public int getDupDB(boolean explicit) {
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method overrides DupIndex.getDupDB; it is advisable to add an Override annotation.

Copilot uses AI. Check for mistakes.
this(fieldSeq, false);
}

public char[] getFieldSeq() {
Copy link

Copilot AI Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method overrides DupIndex.getFieldSeq; it is advisable to add an Override annotation.

Copilot uses AI. Check for mistakes.
@hmottestad
Copy link
Contributor Author

hmottestad commented Nov 3, 2025

@kenwenzel I’m working on an ID-based join for the LMDB store where we don’t need to create LmdbValue objects or binding sets. There is something not quite right though, causing a bunch of tests that test SNAPSHOT and SERIALIZABLE isolation to fail. I think that ID-based joins won’t work for SERIALIZABLE isolation, but I don’t quite understand what makes SNAPSHOT fail.

@kenwenzel
Copy link
Contributor

@hmottestad This is a great idea. Can you please explain why you would expect ID-based joins to fail for SERIALIZABLE? What is different in this case from joining on the value objects?

@kenwenzel
Copy link
Contributor

kenwenzel commented Nov 3, 2025

@hmottestad I think in both cases the triples are only in memory - managed by SnapshotSailStore.
Maybe we can assign a temporary ID to the unsaved values if we introduce something like the extensible ID scheme for the LMDB store (which is still WIP).

@hmottestad
Copy link
Contributor Author

I thought that SERIALIZABLE relied on reporting which statement patterns that have been read so that you can monitor for concurrent writes that would potentially invalidate your reads. I don't really want to have to reimplement that using IDs, so I would much rather prefer to rely on the existing solution that uses Values.

@kenwenzel
Copy link
Contributor

kenwenzel commented Nov 3, 2025

I thought that SERIALIZABLE relied on reporting which statement patterns that have been read so that you can monitor for concurrent writes that would potentially invalidate your reads. I don't really want to have to reimplement that using IDs, so I would much rather prefer to rely on the existing solution that uses Values.

Yes, I understand. Nevertheless, something like an ID service for query evaluation (a query-local map with long IDs and weak keys) could help to get rid of value objects at all.

@hmottestad
Copy link
Contributor Author

hmottestad commented Nov 6, 2025

Before

Benchmark                                                     Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5    4.130 ±  0.019  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5    2.301 ±  0.012  ms/op
QueryBenchmark.groupByQuery                                   avgt    5    0.865 ±  0.017  ms/op
QueryBenchmark.long_chain                                     avgt    5  722.689 ±  8.042  ms/op
QueryBenchmark.lots_of_optional                               avgt    5  235.701 ±  5.209  ms/op
QueryBenchmark.minus                                          avgt    5   10.684 ±  0.422  ms/op
QueryBenchmark.multiple_sub_select                            avgt    5   58.338 ±  1.059  ms/op
QueryBenchmark.nested_optionals                               avgt    5  172.021 ±  2.225  ms/op
QueryBenchmark.optional_lhs_filter                            avgt    5   38.475 ±  1.542  ms/op
QueryBenchmark.optional_rhs_filter                            avgt    5   55.941 ±  0.600  ms/op
QueryBenchmark.ordered_union_limit                            avgt    5   75.317 ±  2.092  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5   21.376 ±  0.178  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5    4.061 ±  0.024  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5   49.389 ±  1.005  ms/op
QueryBenchmark.simple_filter_not                              avgt    5    6.079 ±  0.062  ms/op
QueryBenchmark.sub_select                                     avgt    5   70.426 ±  0.617  ms/op
QueryBenchmarkFoaf.groupByCount                               avgt    5  742.845 ± 14.336  ms/op
QueryBenchmarkFoaf.groupByCountSorted                         avgt    5  654.954 ± 14.461  ms/op
QueryBenchmarkFoaf.personsAndFriends                          avgt    5  209.077 ±  3.944  ms/op

After

Benchmark                                                     Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5    2.504 ±  0.091  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5    1.675 ±  0.049  ms/op
QueryBenchmark.groupByQuery                                   avgt    5    0.860 ±  0.002  ms/op
QueryBenchmark.long_chain                                     avgt    5  574.390 ±  6.417  ms/op
QueryBenchmark.lots_of_optional                               avgt    5  144.543 ±  1.339  ms/op
QueryBenchmark.minus                                          avgt    5    9.242 ±  0.228  ms/op
QueryBenchmark.multiple_sub_select                            avgt    5   40.386 ±  0.674  ms/op
QueryBenchmark.nested_optionals                               avgt    5  145.671 ±  1.434  ms/op
QueryBenchmark.optional_lhs_filter                            avgt    5   20.091 ±  0.259  ms/op
QueryBenchmark.optional_rhs_filter                            avgt    5   32.161 ±  0.594  ms/op
QueryBenchmark.ordered_union_limit                            avgt    5   59.742 ±  0.751  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5    9.315 ±  0.102  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5    0.896 ±  0.010  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5   59.969 ±  0.638  ms/op
QueryBenchmark.simple_filter_not                              avgt    5    4.654 ±  0.183  ms/op
QueryBenchmark.sub_select                                     avgt    5   63.857 ±  1.435  ms/op
QueryBenchmarkFoaf.groupByCount                               avgt    5  637.206 ± 29.751  ms/op
QueryBenchmarkFoaf.groupByCountSorted                         avgt    5  557.630 ± 16.917  ms/op
QueryBenchmarkFoaf.personsAndFriends                          avgt    5   49.959 ±  0.689  ms/op

QueryBenchmark.query_distinct_predicates is actually a bit slower. Otherwise everything else is a lot faster.

I still haven't fixed SNAPSHOT isolation.

@hmottestad hmottestad force-pushed the optimise-lmdb-record-iterator branch from e73f154 to 9852af0 Compare November 6, 2025 19:08
@kenwenzel
Copy link
Contributor

The numbers are looking really good :-)

@kenwenzel
Copy link
Contributor

kenwenzel commented Nov 10, 2025

@hmottestad Maybe you also like to check out #5558

I've played a bit with DUPSORT and have the following results:

  • database is a bit smaller if only DUPSORT with variable values is used
  • database is way larger if DUPFIXED is used (at least around 80% more)
  • benchmarks are a bit slower due to matching keys and values

I've seen that DUPFIXED allows us to fetch multiple values but due to the "explosion" of database size I would not recommend to go that route.
Maybe it is also possible to split the keys into pairs like SPOC to SP, OC and then use SP as key for a block in the index. This block is then self-managed by us and contains a sorted list of OC pairs. It even could be compressed via Snappy or something comparable.

QLever seems to be pretty fast when querying indexes. Maybe we can draw some inspiration from it:
https://ad-publications.cs.uni-freiburg.de/CIKM_qlever_BB_2017.pdf

@kenwenzel
Copy link
Contributor

@hmottestad For a DUPSORT-based implementation see also #5558
I'm sure we get this faster.

Is it possible that you remove anything related to DUPSORT from this PR and we implement it somewhere else?
I'm interested in reducing the DB size as much as we can.

@kenwenzel
Copy link
Contributor

@hmottestad This commit here also implements pooling for cursors and internal state of LmdbRecordIterator:
c815e5d

It has no problems with different isolation levels as cursors are closed if they get invalid.

@hmottestad
Copy link
Contributor Author

I got a decent performance improvement from pooling cursors. Did you benchmark your code?

@kenwenzel
Copy link
Contributor

I got a decent performance improvement from pooling cursors. Did you benchmark your code?

Not fully, but for the complexQuery benchmark I have seen something like 10-20%.
Pooling the internal state of the iterator seems to not help that much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants