🎉 feature(storage) Add multiple storage backends #274

casenave · 2025-11-22T20:53:36Z

Checklist

Factored out the functions needed by all backends, refactored the HF_datasets backend, and added a zarr backend.
The webdataset backend is not provided yet because of unexpected difficulties when writing one tar per feature per sample: iterating over the dataset does not make it easy to group files per sample. An efficient trick is still needed, perhaps storing a dict that maps each sample id to the list of tar files for that sample. Would this be efficient for iteration and plaid sample reconstruction?
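
A minimal sketch of the dict idea raised above, not an implementation from this PR. It assumes a hypothetical file naming scheme `<sample_id>__<feature>.tar`; the function name `build_sample_index` and the example paths are made up for illustration, and only the Python standard library is used.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_sample_index(tar_paths):
    """Group tar file paths by sample id.

    Assumes the hypothetical naming scheme '<sample_id>__<feature>.tar'.
    """
    index = defaultdict(list)
    for path in tar_paths:
        # Drop the directory and the '.tar' suffix, then split on '__'.
        stem = PurePosixPath(path).stem
        sample_id, _, _feature = stem.partition("__")
        index[sample_id].append(path)
    # Sort each list so the feature order is deterministic per sample.
    return {sid: sorted(paths) for sid, paths in index.items()}


# Hypothetical file listing, deliberately not grouped by sample.
paths = [
    "train/000001__pressure.tar",
    "train/000000__mesh.tar",
    "train/000000__pressure.tar",
    "train/000001__mesh.tar",
]
index = build_sample_index(paths)
```

With such an index, fetching all files of one sample is a single dict lookup, but the index has to be built once and kept in memory, and iteration would follow the index rather than the on-disk order. Whether that is fast enough for sample reconstruction remains the open question.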

Implemented:

- advanced parallel and OOM writers for hf_datasets and zarr (based on a generator iterating over samples)
- three reading functions common to the hf_datasets and zarr backends (see "I downloaded a PLAID dataset. How do I read it from disk?" #277 and "[STORAGE] implement a download_dataset for hf_dataset storage mode" #278):
  - load_datasetdict
  - download_datasetdict
  - init_streamed_datasetdict
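
To illustrate what a generator-based writer means for memory use, here is a rough sketch. It is not the code in `src/plaid/storage/writer.py`; `iter_shards` and `fake_samples` are hypothetical names, and the step that would write a shard to hf_datasets or zarr is left out.

```python
from itertools import islice


def iter_shards(sample_generator, shard_size):
    """Yield lists of at most shard_size samples from a generator.

    Only one shard is held in memory at a time.
    """
    it = iter(sample_generator)
    while True:
        shard = list(islice(it, shard_size))
        if not shard:
            return
        yield shard


def fake_samples(n):
    """Hypothetical stand-in for a generator over real samples."""
    for i in range(n):
        yield {"id": i, "value": i * i}


# A real writer would persist each shard here instead of measuring it.
shard_sizes = [len(shard) for shard in iter_shards(fake_samples(5), shard_size=2)]
```

Because each shard is independent, separate shards could also be handed to separate worker processes, which is one possible way to combine the parallel and low-memory goals.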

Tests, examples and tutorials to come

🔗 Related issues

Closes #270, #277

codecov · 2025-11-22T20:56:24Z

Codecov Report

❌ Patch coverage is 38.63636% with 216 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/plaid/storage/reader.py | 0.00% | 144 Missing ⚠️ |
| src/plaid/storage/writer.py | 0.00% | 64 Missing ⚠️ |
| src/plaid/problem_definition.py | 45.45% | 6 Missing ⚠️ |
| src/plaid/storage/__init__.py | 0.00% | 2 Missing ⚠️ |

casenave and others added 20 commits November 1, 2025 21:54

fix(huggingface_bridge) correct per split constant detection ; feat(s…

e4c6af9

…ample) add string support to globals

continue

4345030

continue

77226e1

continute

1ac20e2

continue

1018128

continue

cb8c4b2

continue

da4a745

continue

dd7a544

continue

9fbd75e

continue

062d51e

continue

93d2a80

continue

cbf44c4

continue

28f0a41

Merge remote-tracking branch 'refs/remotes/origin/improve_bridge' int…

1c089a6

…o improve_bridge

Merge branch 'main' into improve_bridge

3bb77e4

continue

c4fc1c2

continue

43bc647

continue

06c2461

continue

4f5f793

start storage refactoring

47c1f94

casenave requested a review from a team as a code owner November 22, 2025 20:53

casenave marked this pull request as draft November 22, 2025 20:53

casenave and others added 7 commits November 23, 2025 11:40

continue

c717833

continue

4f1df0f

contineu

9728323

remove webdataset experiments

e713903

continue

1c0ea94

continue

3a277c4

update

a8e12db

casenave mentioned this pull request Nov 27, 2025

[STORAGE] implement a download_dataset for hf_dataset storage mode #278

Open

casenave added 6 commits November 29, 2025 09:29

continue

7a35b32

continue

6170f07

continue

852a49b

continue

d94697a

merge

eaa2043

continue

9c90c7c
