Summary
The Result class in valkey/commands/search/result.py inappropriately applies UTF-8 decoding to all field values, including binary vector data. This corrupts VECTOR field embeddings and makes valkey-py unsuitable for vector search applications.
Root Cause
In valkey/commands/search/result.py line ~45:
    map(to_string, res[i + fields_offset][1::2])
The to_string() function calls decode("utf-8", "ignore") on every field value. VECTOR fields hold raw FLOAT32 bytes, many of which are not valid UTF-8, and the "ignore" error handler silently drops those bytes, corrupting the data.
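The byte loss can be reproduced without a server. The following is a minimal sketch that uses only the standard library; it assumes the embedding is packed as little-endian FLOAT32 (the variable names are illustrative and not taken from valkey-py):

```python
import struct

# Pack the doc:1 test vector as 4 little-endian FLOAT32 values (16 bytes).
raw = struct.pack("<4f", 0.1, 0.2, 0.3, 0.4)

# Same decoding call the issue attributes to to_string().
# Bytes that do not form valid UTF-8 sequences are dropped silently.
decoded = raw.decode("utf-8", "ignore")

print(len(raw), len(decoded))  # 16 5
print(repr(decoded))           # '=L>>>'

# Re-encoding the string cannot restore the dropped bytes,
# so the vector can no longer be unpacked.
try:
    struct.unpack("<4f", decoded.encode("utf-8"))
    recovered = True
except struct.error:
    recovered = False

print(recovered)  # False
```

The 5-character result matches the '=L>>>' string reported for doc:1 in the reproduction output.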
Reproduction
Run https://gist.github.com/swarnaprakash/0ae1e454cfae3ab2a0f628181d4e8045 to see the issue:
Valkey-py Vector Search Corruption Test
=============================================
Connected to Valkey server
Created vector search index with 2 vector fields
Inserting test documents with vector embeddings:
doc:1: embedding1=[0.1, 0.2, 0.3, 0.4], embedding2=[0.5, 0.6, 0.7]
doc:2: embedding1=[0.8, 0.9, 1.0, 1.1], embedding2=[1.2, 1.3, 1.4]
doc:3: embedding1=[1.5, 1.6, 1.7, 1.8], embedding2=[1.9, 2.0, 2.1]
Performing vector search with query: [0.11, 0.21, 0.31, 0.41]
Found 3 documents
==================================================
TESTING VECTOR RECONSTRUCTION
==================================================
Document 1: doc:1
embedding1 field type: <class 'str'>
[FAIL] embedding1 is string, not bytes: '=L>>>'
[FAIL] Cannot reconstruct vector: unpack requires a buffer of 16 bytes
[FAIL] Original bytes: 16 bytes
[FAIL] Corrupted string: 5 chars
[FAIL] DATA CORRUPTION DETECTED
[FAIL] embedding2 corrupted: <class 'str'>
Document 2: doc:2
embedding1 field type: <class 'str'>
[FAIL] embedding1 is string, not bytes: 'L?fff?\x00\x00?̌?'
[FAIL] Cannot reconstruct vector: unpack requires a buffer of 16 bytes
[FAIL] Original bytes: 16 bytes
[FAIL] Corrupted string: 11 chars
[FAIL] DATA CORRUPTION DETECTED
[FAIL] embedding2 corrupted: <class 'str'>
Document 3: doc:3
embedding1 field type: <class 'str'>
[FAIL] embedding1 is string, not bytes: '\x00\x00???ff?'
[FAIL] Cannot reconstruct vector: unpack requires a buffer of 16 bytes
[FAIL] Original bytes: 16 bytes
[FAIL] Corrupted string: 8 chars
[FAIL] DATA CORRUPTION DETECTED
[FAIL] embedding2 corrupted: <class 'str'>
==================================================
TEST RESULT
==================================================
[FAIL] CORRUPTION DETECTED: Vector embeddings are corrupted by UTF-8 decoding
[FAIL] Vector search functionality is broken in current valkey-py
The fix should:
1. Add preserve_bytes=True parameter to search methods
2. Keep binary fields as bytes instead of UTF-8 decoding them
3. Allow explicit binary_fields parameter for custom control
Cleaned up test index
==================================================
SUMMARY
==================================================
Vector search is broken due to UTF-8 corruption in Result class.
This demonstrates the need for the proposed fix:
• Add preserve_bytes parameter to search methods
• Preserve binary data for VECTOR fields
• Maintain backward compatibility
Run this script before and after applying the fix to validate it works.
Proposed Solution: Optional Binary Preservation
Add optional parameters to the search methods so that binary data can be preserved without breaking existing code.
We propose introducing two parameters:
- preserve_bytes: boolean, default False. When True, field values in the response that are inferred to be binary strings are returned as bytes instead of str.
- binary_fields: list of str, optional. When provided, only the listed fields are returned as bytes. This can restrict the behavior to VECTOR fields so that TAG fields containing binary string data are still decoded to str.
# Current behavior (unchanged) - all fields UTF-8 decoded
results = client.ft("idx").search("*")
# Auto-preserve binary fields (VECTOR fields + any binary TAG fields)
results = client.ft("idx").search("*", preserve_bytes=True)
# Explicit control for specific fields
results = client.ft("idx").search("*", preserve_bytes=True, binary_fields=["embedding1"])
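One way the two proposed parameters could interact is sketched below. This is not valkey-py code: decode_fields is a hypothetical helper, the flat [name, value, ...] input layout is an assumption, and using a strict UTF-8 decode failure to infer "binary" is only one possible heuristic.

```python
import struct


def decode_fields(flat, preserve_bytes=False, binary_fields=None):
    """Hypothetical helper: turn [name, value, name, value, ...] into a dict."""
    keep_raw = set(binary_fields or [])
    out = {}
    for raw_name, raw_value in zip(flat[0::2], flat[1::2]):
        name = raw_name.decode("utf-8", "ignore")
        if not preserve_bytes:
            # Current behavior: every value is decoded, invalid bytes dropped.
            out[name] = raw_value.decode("utf-8", "ignore")
        elif binary_fields is not None:
            # Explicit control: only the listed fields stay as bytes.
            if name in keep_raw:
                out[name] = raw_value
            else:
                out[name] = raw_value.decode("utf-8", "ignore")
        else:
            # Inference (assumed heuristic): keep bytes when strict decoding fails.
            try:
                out[name] = raw_value.decode("utf-8")
            except UnicodeDecodeError:
                out[name] = raw_value
    return out


vector = struct.pack("<4f", 0.1, 0.2, 0.3, 0.4)
flat = [b"title", b"hello", b"embedding1", vector]

default = decode_fields(flat)
inferred = decode_fields(flat, preserve_bytes=True)
explicit = decode_fields(flat, preserve_bytes=True, binary_fields=["embedding1"])
```

Under this heuristic, a vector whose bytes happen to form valid UTF-8 (for example, an all-zero vector) would still be decoded to str, which is one reason an explicit binary_fields list may be the safer option for VECTOR fields.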