Add import files for workload vector
#28393
base: main
🔴 The changelog entry is less than 20 characters or missing. |
dc54405 to ebd1e39
ebd1e39 to 1c29779
{ }

void TWorkloadVectorFilesDataInitializer::ConfigureOpts(NLastGetopt::TOpts& opts) {
    opts.AddLongOption('i', "input",
We briefly discussed the user interface for this functionality in person. I also asked an AI for ideas, and here is what we came up with. I think these options fit much better and are explained in more detail:
// --input
{
    TStringStream description;
    description
        << "File or directory with the dataset to import. Only two columns are imported: "
        << colors.BoldColor() << "id" << colors.OldColor() << " and "
        << colors.BoldColor() << "embedding" << colors.OldColor() << ". "
        << "If a directory is set, all supported files inside will be used."
        << "\nSupported formats: CSV/TSV (zipped or unzipped) and Parquet."
        << "\nIn " << colors.BoldColor() << "convert" << colors.OldColor() << " mode, "
        << "embedding is converted from list of floats to YDB binary embedding format."
        << "\nIn " << colors.BoldColor() << "raw" << colors.OldColor() << " mode, "
        << "embedding must already be binary; for CSV/TSV its encoding is controlled by --input-binary-strings."
        << "\nExample dataset: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings";
    config.Opts->AddLongOption('i', "input", description.Str())
        .RequiredArgument("PATH")
        .Required()
        .StoreResult(&DataFiles);
}
// --mode
{
    TStringStream description;
    description
        << "Import mode. Controls whether input data are converted or taken as-is."
        << "\n " << colors.BoldColor() << "auto" << colors.OldColor()
        << "\n " << "Detect mode from input schema/content:"
        << "\n " << "- Parquet: list<float> embedding -> convert; binary/string embedding -> raw."
        << "\n " << "- CSV/TSV: numeric array-like embedding -> convert; otherwise -> raw (requires --input-binary-strings)."
        << "\n " << colors.BoldColor() << "convert" << colors.OldColor()
        << "\n " << "Pick columns " << colors.BoldColor() << "id" << colors.OldColor() << " and "
        << colors.BoldColor() << "embedding" << colors.OldColor()
        << ", cast id to Int64, convert embedding (list<float>) to YDB binary embedding."
        << "\n " << "Reference: https://ydb.tech/docs/yql/reference/udf/list/knn#functions-convert"
        << "\n " << colors.BoldColor() << "raw" << colors.OldColor()
        << "\n " << "Load as-is: id must be Int64, embedding must be binary."
        << "\n " << "For CSV/TSV, set embedding binary encoding with --input-binary-strings."
        << "\nDefault: " << colors.CyanColor() << "\"auto\"" << colors.OldColor() << ".";
    config.Opts->AddLongOption("mode", description.Str())
        .RequiredArgument("MODE")
        .DefaultValue("auto")
        .StoreResult(&Mode);
}
// --embedding-column-name
{
    TStringStream description;
    description
        << "Alternative source column name for the embedding field in input files."
        << "\nUsed in " << colors.BoldColor() << "convert" << colors.OldColor()
        << " (and " << colors.BoldColor() << "auto" << colors.OldColor() << " when it chooses convert)."
        << "\nIf not set, the column is expected to be named "
        << colors.BoldColor() << "\"embedding\"" << colors.OldColor() << ".";
    config.Opts->AddLongOption("embedding-column-name", description.Str())
        .RequiredArgument("NAME")
        .DefaultValue("embedding")
        .StoreResult(&EmbeddingColumnName);
}
// --input-binary-strings
{
    TStringStream description;
    description
        << "Binary encoding of the " << colors.BoldColor() << "embedding" << colors.OldColor()
        << " column in CSV/TSV when importing in " << colors.BoldColor() << "raw" << colors.OldColor()
        << " mode (or in " << colors.BoldColor() << "auto" << colors.OldColor() << " when it selects raw)."
        << "\nIgnored for Parquet and for " << colors.BoldColor() << "convert" << colors.OldColor() << " mode."
        << "\nAvailable options:"
        << "\n " << colors.BoldColor() << "unicode" << colors.OldColor()
        << "\n " << "Every byte in binary strings that is not a printable ASCII symbol (codes 32-126) should be encoded as UTF-8."
        << "\n " << colors.BoldColor() << "base64" << colors.OldColor()
        << "\n " << "Binary strings should be fully encoded with base64."
        << "\nDefault: " << colors.CyanColor() << "\"unicode\"" << colors.OldColor() << ".";
    config.Opts->AddLongOption("input-binary-strings", description.Str())
        .RequiredArgument("STRING")
        .DefaultValue("unicode")
        .StoreResult(&InputBinaryStringEncodingFormat);
}
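For illustration, the auto rules described in the --mode help text above could be reduced to a small decision function. This is only a sketch under stated assumptions: the names EInputFormat, EEmbeddingKind, EMode, ResolveAutoMode and UsesInputBinaryStrings are hypothetical and do not come from the PR, and detecting the embedding kind from a Parquet schema or CSV/TSV content is left out.

```cpp
// Hypothetical types for illustration; none of these names come from the PR.
enum class EInputFormat { Parquet, Csv, Tsv };
enum class EEmbeddingKind { FloatList, Binary };
enum class EMode { Convert, Raw };

// Resolves --mode=auto following the rules in the option description:
// a list-of-floats embedding selects convert, anything else selects raw.
// The rule is the same for Parquet and CSV/TSV; only the way the kind is
// detected differs (schema vs. content), which is not modelled here.
EMode ResolveAutoMode(EEmbeddingKind kind) {
    return kind == EEmbeddingKind::FloatList ? EMode::Convert : EMode::Raw;
}

// Per the help text, --input-binary-strings only matters for CSV/TSV input
// loaded in raw mode; it is ignored for Parquet and for convert mode.
bool UsesInputBinaryStrings(EInputFormat format, EMode mode) {
    return mode == EMode::Raw && format != EInputFormat::Parquet;
}
```

Keeping the decision separate from the file readers would make it easy to unit-test the option semantics without touching any input files.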
The ydb workload vector command now supports import files to populate a table from CSV and Parquet.
The vector workload now supports import from files. Supported file formats are CSV, TSV, and Parquet. Embeddings can also be converted into the YDB vector representation using the -t|--transform option. Reference: https://ydb.tech/docs/yql/reference/udf/list/knn#functions-convert
Changelog category
Description for reviewers
...
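As a rough picture of what converting an embedding into the YDB vector representation involves, here is a minimal sketch. It assumes the binary layout is the raw 4-byte float values followed by a single format-tag byte, with tag value 1 for float vectors; both the layout and the tag value are assumptions to be checked against the Knn reference linked in the description. ToBinaryEmbedding and kFloatVectorTag are hypothetical names, not the PR's code.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Assumed tag byte for float vectors; verify against the YDB Knn reference.
constexpr char kFloatVectorTag = 1;

// Packs floats in host byte order (little-endian on common platforms)
// and appends the assumed one-byte format tag.
std::string ToBinaryEmbedding(const std::vector<float>& values) {
    const size_t payload = values.size() * sizeof(float);
    std::string out(payload + 1, '\0');
    if (payload > 0) {
        std::memcpy(&out[0], values.data(), payload);
    }
    out.back() = kFloatVectorTag;
    return out;
}
```

Under these assumptions a vector of N floats always yields 4 * N + 1 bytes, which gives a cheap sanity check on converted data.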