@stebloev
Member

@stebloev stebloev commented Nov 7, 2025

The ydb workload vector command now supports importing files to populate a table from CSV and Parquet

The vector workload now supports importing data from files. The supported file formats are CSV, TSV, and Parquet. Embeddings can also be converted into the YDB vector representation with the -t|--transform option. Reference: https://ydb.tech/docs/yql/reference/udf/list/knn#functions-convert
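
To illustrate what such a conversion might produce, here is a minimal sketch that packs a list of floats into a byte string. The layout is an assumption, not something stated in this PR: raw float32 values followed by a single trailing format byte, with the tag value 1 assumed for float vectors. The names `ToBinaryEmbedding` and `kFloatVectorTag` are hypothetical; check the linked Knn reference for the authoritative format.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// ASSUMPTION: tag value for float vectors; verify against the Knn UDF reference.
constexpr std::uint8_t kFloatVectorTag = 1;

// Pack floats into a byte string: raw float32 values, then one trailing tag byte.
// ASSUMPTION: the host is little-endian, so memcpy yields little-endian bytes.
std::string ToBinaryEmbedding(const std::vector<float>& values) {
    std::string out;
    out.resize(values.size() * sizeof(float) + 1);
    if (!values.empty()) {
        std::memcpy(&out[0], values.data(), values.size() * sizeof(float));
    }
    out.back() = static_cast<char>(kFloatVectorTag);
    return out;
}
```

Under these assumptions, a two-element vector becomes a 9-byte string: 8 bytes of float data plus the tag.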

Changelog category

  • New feature

Description for reviewers

...

@github-actions

github-actions bot commented Nov 7, 2025

2025-11-07 12:43:37 UTC Pre-commit check linux-x86_64-relwithdebinfo for ee59eb5 has started.
2025-11-07 12:43:55 UTC Artifacts will be uploaded here
2025-11-07 12:46:20 UTC ya make is running...
🟡 2025-11-07 13:13:54 UTC Some tests failed, follow the links below. Going to retry failed tests...

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
2531 2517 0 1 6 7

2025-11-07 13:14:04 UTC ya make is running... (failed tests rerun, try 2)
🟢 2025-11-07 13:22:18 UTC Tests successful.

Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
16 (only retried tests) 13 0 0 0 3

🟢 2025-11-07 13:22:28 UTC Build successful.
🟢 2025-11-07 13:22:47 UTC ydbd size 2.3 GiB changed* by 0 Bytes, which is <= 0 Bytes vs main: OK

ydbd size dash main: b694880 merge: ee59eb5 diff diff %
ydbd size 2 431 728 272 Bytes 2 431 728 272 Bytes 0 Bytes 0.000%
ydbd stripped size 516 517 424 Bytes 516 517 424 Bytes 0 Bytes 0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@stebloev stebloev linked an issue Nov 7, 2025 that may be closed by this pull request
@github-actions

github-actions bot commented Nov 7, 2025

2025-11-07 12:45:41 UTC Pre-commit check linux-x86_64-release-asan for ee59eb5 has started.
2025-11-07 12:45:59 UTC Artifacts will be uploaded here
2025-11-07 12:48:11 UTC ya make is running...
🟡 2025-11-07 13:22:21 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1386 1232 0 18 134 2

🟢 2025-11-07 13:22:31 UTC Build successful.
🟢 2025-11-07 13:22:51 UTC ydbd size 3.8 GiB changed* by -96 Bytes, which is <= 0 Bytes vs main: OK

ydbd size dash main: b694880 merge: ee59eb5 diff diff %
ydbd size 4 072 867 456 Bytes 4 072 867 360 Bytes -96 Bytes -0.000%
ydbd stripped size 1 511 799 400 Bytes 1 511 799 336 Bytes -64 Bytes -0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions

github-actions bot commented Nov 7, 2025

🔴 2025-11-07 16:41:02 UTC The validation of the Pull Request description has failed. Please update the description.

The changelog entry is less than 20 characters or missing.

@stebloev stebloev force-pushed the vector-workload-import-files branch from dc54405 to ebd1e39 Compare November 7, 2025 16:33
@stebloev stebloev force-pushed the vector-workload-import-files branch from ebd1e39 to 1c29779 Compare November 7, 2025 16:35
@stebloev stebloev marked this pull request as ready for review November 7, 2025 16:38
@stebloev stebloev requested a review from a team as a code owner November 7, 2025 16:38
@github-actions

github-actions bot commented Nov 7, 2025

2025-11-07 16:38:35 UTC Pre-commit check linux-x86_64-relwithdebinfo for 848e9a4 has started.
2025-11-07 16:40:08 UTC Artifacts will be uploaded here
2025-11-07 16:42:25 UTC ya make is running...
🟢 2025-11-07 17:17:43 UTC Tests successful.

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
2531 2520 0 0 6 5

🟢 2025-11-07 17:17:54 UTC Build successful.
🟢 2025-11-07 17:18:11 UTC ydbd size 2.3 GiB changed* by +2.5 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 5d8445c merge: 848e9a4 diff diff %
ydbd size 2 431 796 456 Bytes 2 431 798 992 Bytes +2.5 KiB +0.000%
ydbd stripped size 516 528 304 Bytes 516 528 816 Bytes +512 Bytes +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@github-actions

github-actions bot commented Nov 7, 2025

2025-11-07 16:39:41 UTC Pre-commit check linux-x86_64-release-asan for 848e9a4 has started.
2025-11-07 16:39:58 UTC Artifacts will be uploaded here
2025-11-07 16:42:12 UTC ya make is running...
🟡 2025-11-07 17:18:11 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1386 1216 0 20 148 2

🟢 2025-11-07 17:18:21 UTC Build successful.
🟢 2025-11-07 17:18:42 UTC ydbd size 3.8 GiB changed* by +2.3 KiB, which is < 100.0 KiB vs main: OK

ydbd size dash main: 5d8445c merge: 848e9a4 diff diff %
ydbd size 4 072 974 536 Bytes 4 072 976 904 Bytes +2.3 KiB +0.000%
ydbd stripped size 1 511 833 864 Bytes 1 511 834 248 Bytes +384 Bytes +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

{ }

void TWorkloadVectorFilesDataInitializer::ConfigureOpts(NLastGetopt::TOpts& opts) {
opts.AddLongOption('i', "input",
Collaborator

We discussed the user interface for this functionality a bit in person. I also asked an AI for some ideas, and here is what we came up with. I think these options fit much better and are explained in more detail:

// --input
{
    TStringStream description;
    description
        << "File or directory with the dataset to import. Only two columns are imported: "
        << colors.BoldColor() << "id" << colors.OldColor() << " and "
        << colors.BoldColor() << "embedding" << colors.OldColor() << ". "
        << "If a directory is set, all supported files inside will be used."
        << "\nSupported formats: CSV/TSV (zipped or unzipped) and Parquet."
        << "\nIn " << colors.BoldColor() << "convert" << colors.OldColor() << " mode, "
        << "embedding is converted from list of floats to YDB binary embedding format."
        << "\nIn " << colors.BoldColor() << "raw" << colors.OldColor() << " mode, "
        << "embedding must already be binary; for CSV/TSV its encoding is controlled by --input-binary-strings."
        << "\nExample dataset: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings";
    config.Opts->AddLongOption('i', "input", description.Str())
        .RequiredArgument("PATH")
        .Required()
        .StoreResult(&DataFiles);
}

// --mode
{
    TStringStream description;
    description
        << "Import mode. Controls whether input data are converted or taken as-is."
        << "\n  " << colors.BoldColor() << "auto" << colors.OldColor()
        << "\n    " << "Detect mode from input schema/content:"
        << "\n    " << "- Parquet: list<float> embedding -> convert; binary/string embedding -> raw."
        << "\n    " << "- CSV/TSV: numeric array-like embedding -> convert; otherwise -> raw (requires --input-binary-strings)."
        << "\n  " << colors.BoldColor() << "convert" << colors.OldColor()
        << "\n    " << "Pick columns " << colors.BoldColor() << "id" << colors.OldColor() << " and "
                      << colors.BoldColor() << "embedding" << colors.OldColor()
        << ", cast id to Int64, convert embedding (list<float>) to YDB binary embedding."
        << "\n    " << "Reference: https://ydb.tech/docs/yql/reference/udf/list/knn#functions-convert"
        << "\n  " << colors.BoldColor() << "raw" << colors.OldColor()
        << "\n    " << "Load as-is: id must be Int64, embedding must be binary."
        << "\n    " << "For CSV/TSV, set embedding binary encoding with --input-binary-strings."
        << "\nDefault: " << colors.CyanColor() << "\"auto\"" << colors.OldColor() << ".";
    config.Opts->AddLongOption("mode", description.Str())
        .RequiredArgument("MODE")
        .DefaultValue("auto")
        .StoreResult(&Mode);
}
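
The mode selection described in the --mode help text could be sketched roughly as follows. This is only an illustration of the rules as worded above; the enum and function names (`EFileFormat`, `EEmbeddingKind`, `EImportMode`, `ResolveMode`, `UsesBinaryStringEncoding`) are hypothetical and not taken from the YDB sources, and schema detection itself is left out.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical types for illustration only.
enum class EFileFormat { Csv, Tsv, Parquet };
enum class EEmbeddingKind { FloatList, Binary };  // what the embedding column looks like
enum class EImportMode { Convert, Raw };

// Map the --mode argument to a concrete mode; "auto" defers to the detected column kind.
EImportMode ResolveMode(const std::string& mode, EEmbeddingKind kind) {
    if (mode == "convert") {
        return EImportMode::Convert;
    }
    if (mode == "raw") {
        return EImportMode::Raw;
    }
    if (mode == "auto") {
        // A float list is converted; anything else is taken as-is.
        return kind == EEmbeddingKind::FloatList ? EImportMode::Convert : EImportMode::Raw;
    }
    throw std::invalid_argument("unknown mode: " + mode);
}

// --input-binary-strings matters only for CSV/TSV input loaded in raw mode.
bool UsesBinaryStringEncoding(EFileFormat format, EImportMode mode) {
    return mode == EImportMode::Raw && format != EFileFormat::Parquet;
}
```

Keeping the "does the encoding option apply" check in one predicate would make it easy to warn when a user passes --input-binary-strings in a combination where it is ignored.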

// --embedding-column-name
{
    TStringStream description;
    description
        << "Alternative source column name for the embedding field in input files."
        << "\nUsed in " << colors.BoldColor() << "convert" << colors.OldColor()
        << " (and " << colors.BoldColor() << "auto" << colors.OldColor() << " when it chooses convert)."
        << "\nIf not set, the column is expected to be named "
        << colors.BoldColor() << "\"embedding\"" << colors.OldColor() << ".";
    config.Opts->AddLongOption("embedding-column-name", description.Str())
        .RequiredArgument("NAME")
        .DefaultValue("embedding")
        .StoreResult(&EmbeddingColumnName);
}

// --input-binary-strings
{
    TStringStream description;
    description
        << "Binary encoding of the " << colors.BoldColor() << "embedding" << colors.OldColor()
        << " column in CSV/TSV when importing in " << colors.BoldColor() << "raw" << colors.OldColor()
        << " mode (or in " << colors.BoldColor() << "auto" << colors.OldColor() << " when it selects raw)."
        << "\nIgnored for Parquet and for " << colors.BoldColor() << "convert" << colors.OldColor() << " mode."
        << "\nAvailable options:"
        << "\n  " << colors.BoldColor() << "unicode" << colors.OldColor()
        << "\n    " << "Every byte in binary strings that is not a printable ASCII symbol (codes 32-126) should be encoded as UTF-8."
        << "\n  " << colors.BoldColor() << "base64" << colors.OldColor()
        << "\n    " << "Binary strings should be fully encoded with base64."
        << "\nDefault: " << colors.CyanColor() << "\"unicode\"" << colors.OldColor() << ".";
    config.Opts->AddLongOption("input-binary-strings", description.Str())
        .RequiredArgument("STRING")
        .DefaultValue("unicode")
        .StoreResult(&InputBinaryStringEncodingFormat);
}
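
For the base64 option, a raw-mode CSV/TSV cell would have to be decoded back to bytes before being stored. The following is a minimal standalone sketch of standard padded base64 decoding, not the decoder the CLI actually uses; `Base64Value` and `DecodeBase64` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Map one base64 character to its 6-bit value, or -1 if it is outside the alphabet.
int Base64Value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decode standard padded base64; returns std::nullopt on malformed input.
std::optional<std::string> DecodeBase64(std::string_view text) {
    if (text.size() % 4 != 0) {
        return std::nullopt;
    }
    std::string out;
    out.reserve(text.size() / 4 * 3);
    for (std::size_t i = 0; i < text.size(); i += 4) {
        const bool lastGroup = (i + 4 == text.size());
        int pad = 0;
        std::uint32_t bits = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            const char c = text[i + j];
            if (c == '=') {
                // Padding is valid only in the last two positions of the final group.
                if (!lastGroup || j < 2) return std::nullopt;
                ++pad;
                bits <<= 6;
            } else {
                const int v = Base64Value(c);
                // Reject unknown characters and data that follows padding.
                if (v < 0 || pad > 0) return std::nullopt;
                bits = (bits << 6) | static_cast<std::uint32_t>(v);
            }
        }
        out.push_back(static_cast<char>((bits >> 16) & 0xFF));
        if (pad < 2) out.push_back(static_cast<char>((bits >> 8) & 0xFF));
        if (pad < 1) out.push_back(static_cast<char>(bits & 0xFF));
    }
    return out;
}
```

Returning `std::optional` lets the importer report a malformed cell instead of silently loading a truncated embedding.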

Development

Successfully merging this pull request may close these issues.

Workload vector should support import from files

2 participants