Conversation

@anatoly-scherbakov
Collaborator

@anatoly-scherbakov anatoly-scherbakov commented Oct 18, 2023

Eliminate the manual step of downloading specifications for the tests to use. Instead, put the specifications into a dedicated directory using the Git submodules mechanism.

@anatoly-scherbakov
Collaborator Author

closes #181

@BigBlueHat
Contributor

@anatoly-scherbakov let's give this a try. Can you expand the README in this PR also to explain what the pre-testing steps would include now (i.e. git submodule init etc.)?
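
Presumably something like the following, though the exact paths depend on how this PR lays things out:

    git submodule update --init --recursive                # check out the pinned spec repositories
    python tests/runtests.py specifications/json-ld-api    # path is illustrative only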

It would be great to simplify testing.

Thanks for exploring this idea!
🎩

@anatoly-scherbakov anatoly-scherbakov marked this pull request as ready for review November 12, 2023 16:37
@anatoly-scherbakov
Collaborator Author

@BigBlueHat @davidlehn My apologies for the delay; I think I have something to show here now.

@BigBlueHat
Contributor

Hey @anatoly-scherbakov, sorry for the delay here. I continue to feel a bit squeamish about using git submodules. It's not a pattern we typically use in our other implementations, and it's a rarely used (and oft derided...) git feature. I'm also concerned about the overhead of keeping the SHAs pinned accurately.

However, I did find that the jsonld.js implementation has a nice approach to this same problem using npm run commands: https://github.com/digitalbazaar/jsonld.js/blob/main/package.json#L96-L101

My thought is that we could do something similar here: provide a simple Python script that runs the git clone --depth 1 commands, and ideally also honors environment variables (as the JS implementation does) in case someone already has the test suites on hand in different locations (as we often do, because we work on those as well).
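
A rough sketch of what such a script might look like; the environment variable names, suite list, and URLs below are only illustrative, not part of jsonld.js or this PR:

    import os
    import subprocess
    from pathlib import Path

    # Illustrative only: variable names, directories, and URLs are placeholders.
    SUITES = {
        'JSONLD_API_TESTS': ('test-suites/json-ld-api', 'https://github.com/w3c/json-ld-api.git'),
        'JSONLD_FRAMING_TESTS': ('test-suites/json-ld-framing', 'https://github.com/w3c/json-ld-framing.git'),
    }

    def fetch_suites():
        for env_var, (default_dir, url) in SUITES.items():
            # An environment variable can point at an existing sibling checkout instead.
            target = Path(os.environ.get(env_var, default_dir))
            if target.exists():
                continue
            subprocess.run(['git', 'clone', '--depth', '1', url, str(target)], check=True)

    if __name__ == '__main__':
        fetch_suites()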

Would you be up for taking a crack at that?

@anatoly-scherbakov
Collaborator Author

@BigBlueHat thank you for your review. I understand the concern, but I might disagree in certain regards; let me try to elaborate.

Let's analyze the approach taken in jsonld.js.

    "fetch-json-ld-org-test-suite": "if [ ! -e test-suites/json-ld.org ]; then git clone --depth 1 https://github.com/json-ld/json-ld.org.git test-suites/json-ld.org; fi",

What happens if the JSON-LD test suite is updated?

  1. A developer already has a downloaded copy of the JSON-LD test suite locally.
    • The developer is not notified about the change.
    • Even if they do notice it, they have to manually remove the spec directory to refresh it.
    • Different developers will inevitably debug the implementation against different versions of the test suite, which leads to issues that are potentially complicated to debug.
    • It is hard to track which exact version of the test suite one has, and Git doesn't help with that.
  2. A developer submits their changes to jsonld.js to CI, and CI fails.
    • That's because the spec has been updated and CI downloaded the latest version.
    • First of all, the developer needs to realize that this has happened.
    • Then, they have to download the latest spec and debug what has changed,
    • which wasn't in the original scope of their pull request.

I feel these cases can be very annoying and might hurt developer productivity.

Updating the spec version is not tied to an explicit human decision, which goes against normal dependency-management practice in many environments.

For instance, with Python's Poetry, you have to issue poetry update to update your dependencies; otherwise, their versions stay explicitly pinned in poetry.lock. The same applies to cargo (the Rust dependency manager), to npm (with its package-lock.json), and others.

Git submodules provide essentially the same level of control over dependency versions. It is easy to provide a make command which upgrades them, say

make upgrade-submodules

I'll add it in a moment in another commit. This will upgrade each submodule to its latest version; afterwards, the developer can run tests, make sure everything is 🟢, and submit to CI.
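
For illustration, upgrading a single pin and recording it stays an explicit, commit-level operation, roughly:

    git submodule update --remote specifications/json-ld-api   # move the pin to the latest upstream commit
    git add specifications/json-ld-api
    git commit -m "Upgrade json-ld-api test suite"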

As to the popularity of git submodules, a quick search on GitHub shows a number of popular projects relying on them: pytorch, ghc, draw.io, mono, ceph, obs-studio, micropython, qemu, and more.

Let me know what you think!

@anatoly-scherbakov
Collaborator Author

@BigBlueHat

P.S. To address the concern about testing pyld against a test suite that already lives elsewhere on disk, one can run:

python tests/runtests.py ~/projects/json-ld-api

…as documented in README.rst. I don't think git submodules would be an obstacle for local development in this fashion.

@BigBlueHat
Contributor

@anatoly-scherbakov coming back to this again--sorry for the wait.

The experience of using the submodules was fine--not much different from the approach we took in jsonld.js--so I'm good with it.

@davidlehn unless you're diametrically opposed to this, I'd like to get it merged sooner than later.

Contributor

@BigBlueHat BigBlueHat left a comment

Small tweak which isn't a blocker--and could be done later--so approving.

@anatoly-scherbakov
Collaborator Author

Thanks for the review @BigBlueHat! I will write up an issue about the Makefile shortly, but I do not believe I have the rights to merge this PR.

This might also be related to my being unable to create a PR in this repository that has another PR as its base.

@@ -0,0 +1,11 @@
[submodule "specifications/json-ld-api"]
Member

These are the specs, but since their sole purpose here is for testing, could this be named test-suites/? That's what the scripts in jsonld.js use.

Collaborator Author

Thanks for reviewing!

I can rename those. A few things concern me a bit though:

  • Each repository we're checking out contains not just tests but the whole specification, which might be confusing and violate the principle of least surprise;
  • We will likely add more tests to the tests directory which will not be using the specifications, and the existence of both tests and test-suites will be confusing;
  • At some point, we might want to use the specifications for purposes other than tests, for instance to aid documentation generation.

What do you think?

Contributor

I prefer the specifications/ folder name, as I do hope/expect we'll get more unit tests added to tests/ at some point, and test-suites/ would certainly be confusing.

Collaborator

@mielvds mielvds Nov 6, 2025

I agree with keeping specifications/, but should we not place it under tests/?

Collaborator Author

@mielvds this would mean a path like tests/specifications/json-ld-api/tests/, which is a lot of nesting. Also: what if we want to reuse the specifications for purposes other than tests?

For instance, they could be extremely useful for LLM-assisted work on the library, because the specs are authoritative.

@@ -0,0 +1,2 @@
upgrade-submodules:
git submodule update --remote --init --recursive
Member

How do you remove all the checked out dirs without leaving git thinking there are unstaged changes? Put another way, how do you undo this command?

Collaborator Author

If you want to undo the version upgrade of the submodules, you can do:

git reset --hard

If you've modified the submodules locally after having upgraded them (for debugging purposes, perhaps), then you do:

git submodule foreach git reset --hard

Collaborator

Should we add those to the Makefile for convenience?

Collaborator Author

  • During a normal workflow, I guess modification of the submodules won't happen often. reset --hard is a destructive operation; I'm not sure we should create a shortcut for it 🤔

@davidlehn
Member

If this PR is going to go all-in, then consider updating docs to handle what #157 was doing to address #156.

@davidlehn
Member

  • I don't necessarily disagree with the reasoning to use submodules. As a counterpoint, pyld and jsonld.js are over a decade old and have gotten by without using that feature. It's a matter of tradeoffs between using different commands and dealing with different problems. I'm not sure what the best approach is. Certainly better docs and tooling would help in whatever approach is used.
  • Adding this feature forces the maintainers to learn to use it. Help us (me) out with a few more snippets of docs perhaps in the README and/or CONTRIBUTING. Maybe even more script/makefile targets to run. Specifically, what do we (I) need to type to keep the test suites synced up all the time?
  • I have concerns about forgetting to update the test suite repo refs and not noticing newly added tests are failing. I can imagine complex processes to address this, but what is the theory here? I guess the idea is to trade off possibly running with updated tests unintentionally for possibly running with outdated tests unintentionally.
  • I put in some comments, partly directed towards making the former workflow possible with minimal hassle. I found that worked well for me. Especially when I was working on js and python at the same time and explicitly wanted to use shared sibling test suites. That's a particular use case for sure, but a real one.
  • I suppose I'm willing to try this change if pressed. To me it seems either method is fine for any potential contributors. There is almost no input here, so I'm not sure others feel strongly at all.

@davidlehn
Member

@anatoly-scherbakov Could you also change the commit message with the fancy "+", and the one with ellipses, arrow, and math symbol, to use simple ASCII? It's nicer to use chars people can easily type. Feel free to argue. :-)

@anatoly-scherbakov anatoly-scherbakov force-pushed the 181-git-submodules branch 2 times, most recently from 5798f0b to 440a251 Compare February 6, 2024 13:36
@anatoly-scherbakov
Collaborator Author

@davidlehn thanks again for the review!

I have applied a few requested changes.

Help us (me) out with a few more snippets of docs perhaps in the README and/or CONTRIBUTING. Maybe even more script/makefile targets to run.

Yes! I feel that less documentation and more automation is for the better :)

I have concerns about forgetting to update the test suite repo refs and not noticing newly added tests are failing.

Indeed, this is the other side of the coin. Most dependency management tools (poetry, npm, ...) seem to be okay with that, staying on the safe side to make builds predictable.

Tools like Dependabot, which post notifications about out-of-date dependencies and even create PRs automatically, are often leveraged to address this. I can easily imagine a rather simple scheduled GitHub Actions workflow that would update the spec submodules every week, run the tests, and create a PR with that upgrade. Do you think we should create an issue for that?
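
Roughly, such a scheduled job would only need to run something like the following and open a PR if everything stays green (the test command is illustrative):

    git submodule update --remote --recursive               # move every spec pin to its latest commit
    python tests/runtests.py specifications/json-ld-api     # run the suite against the updated pins
    # if green: commit the new pins and open a pull request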

I put in some comments, partly directed towards making the former workflow possible with minimal hassle. I found that worked well for me. Especially when I was working on js and python at the same time and explicitly wanted to use shared sibling test suites. That's a particular use case for sure, but a real one.

I see. The runtests.py script allows you to specify the path to the specification. Say you have a development version of a spec somewhere on your local disk; you can easily run the tests against that version by passing its path to runtests.py.

Could you also change the commit message with the fancy "+" and one with ellipses, arrow, and math symbol to use simple ASCII? Nicer to use chars people can easily type. Feel free to argue. :-)

No argument here: I just like those fancy Unicode characters, especially in commit messages 🙂, and have set up my Linux to type them easily, but this is purely a matter of taste. Removed ✅

@BigBlueHat BigBlueHat removed this from the v2.0.4 milestone Feb 6, 2024
@anatoly-scherbakov anatoly-scherbakov changed the base branch from master to add-github-workflow February 6, 2024 17:56
@anatoly-scherbakov
Collaborator Author

I've rebased this branch on top of #185.

@BigBlueHat
Contributor

@BigBlueHat the previous address was https://github.com/json-ld/normalization/tests. Right now, https://github.com/json-ld/normalization redirects to https://github.com/w3c-ccg/rdf-dataset-canonicalization. Is that the link we should use?

This should be the correct location of the old test suites:
https://github.com/w3c-ccg/rdf-dataset-canonicalization/tree/main/tests

So, this still needs an updated URL in this (or another) PR. The rdf-canon ones, though, introduce a whole new host of concerns.

Sorry I didn't see that earlier!

@anatoly-scherbakov
Collaborator Author

  • Added rdf-dataset-canonicalization spec submodule,
  • Reconfigured the tests to use it and to ignore rdf-canon submodule,
  • Added make test command.

Should we replace rdf-canon references everywhere in the docs as well, given that we're going to get them back anyway later?

@BigBlueHat
Contributor

  • Added rdf-dataset-canonicalization spec submodule,
  • Reconfigured the tests to use it and to ignore rdf-canon submodule,
  • Added make test command.

Should we replace rdf-canon references everywhere in the docs as well, given that we're going to get them back anyway later?

Yeah... I think we need to add the rdf-canon stuff as a separate PR. Otherwise, we're pulling all this code and content down and then not using it at all... which would be very confusing.

@mielvds
Collaborator

mielvds commented Nov 6, 2025

I understand @davidlehn's concerns; submodules can be a hassle. That said, support for submodules has improved a lot over the years and, in this case, they solve the version alignment between the repos. Technically, you can do this with a git clone of a specific commit plus make, but it won't be as robust or transparent (see the sketch below). I suggest we try it. It's not very destructive, so rolling back should be easy if need be.
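
A rough sketch of that manual alternative, for comparison (the URL and path are illustrative):

    git clone https://github.com/w3c/json-ld-api.git specifications/json-ld-api
    git -C specifications/json-ld-api checkout <some-pinned-commit>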

@davidlehn davidlehn force-pushed the add-github-workflow branch from 01fe0f8 to f9772ea Compare November 7, 2025 15:14
# Conflicts:
#	.github/workflows/main.yaml
#	CHANGELOG.md
#	README.rst
@anatoly-scherbakov
Collaborator Author

I created a replacement PR in the main repo instead of my fork. Closing this one.
