
UTF-8 Encoding Issue That Corrupts Binary Vector Data in Search Results #242

@swarnaprakash

Summary

The Result class in valkey/commands/search/result.py applies UTF-8 decoding to every field value, including binary vector data. This corrupts the VECTOR field embeddings returned in search results, so callers cannot reconstruct the original vectors, which makes valkey-py unsuitable for vector search applications.

Root Cause

In valkey/commands/search/result.py line ~45:

map(to_string, res[i + fields_offset][1::2])

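The to_string() function calls decode("utf-8", "ignore") on every field value. VECTOR fields hold binary FLOAT32 data, which generally contains byte sequences that are not valid UTF-8. The "ignore" error handler silently drops each of those bytes, so the resulting string is shorter than the original buffer and the vector cannot be recovered from it.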
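The byte loss can be reproduced without a server. The sketch below packs the doc:1 embedding1 values from the reproduction as little-endian FLOAT32 (the byte order is an assumption about what the test script uses) and applies the same decode call that to_string() is described as using above:

```python
import struct

# Pack [0.1, 0.2, 0.3, 0.4] as four little-endian FLOAT32 values (16 bytes).
raw = struct.pack("<4f", 0.1, 0.2, 0.3, 0.4)

# The same lossy decoding the issue attributes to to_string().
decoded = raw.decode("utf-8", "ignore")

print(len(raw), repr(decoded))  # 16 '=L>>>'

# The dropped bytes cannot be restored: re-encoding yields 5 bytes, not 16.
roundtrip = decoded.encode("utf-8")
try:
    struct.unpack("<4f", roundtrip)
    recovered = True
except struct.error:
    recovered = False

print(len(roundtrip), recovered)  # 5 False
```

The 5-character result matches the corrupted string reported for doc:1 in the output below.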

Reproduction

Run https://gist.github.com/swarnaprakash/0ae1e454cfae3ab2a0f628181d4e8045 to see the issue:

Valkey-py Vector Search Corruption Test
=============================================
Connected to Valkey server
Created vector search index with 2 vector fields

Inserting test documents with vector embeddings:
  doc:1: embedding1=[0.1, 0.2, 0.3, 0.4], embedding2=[0.5, 0.6, 0.7]
  doc:2: embedding1=[0.8, 0.9, 1.0, 1.1], embedding2=[1.2, 1.3, 1.4]
  doc:3: embedding1=[1.5, 1.6, 1.7, 1.8], embedding2=[1.9, 2.0, 2.1]

Performing vector search with query: [0.11, 0.21, 0.31, 0.41]
Found 3 documents

==================================================
TESTING VECTOR RECONSTRUCTION
==================================================

Document 1: doc:1
  embedding1 field type: <class 'str'>
    [FAIL] embedding1 is string, not bytes: '=L>>>'
    [FAIL] Cannot reconstruct vector: unpack requires a buffer of 16 bytes
    [FAIL] Original bytes: 16 bytes
    [FAIL] Corrupted string: 5 chars
    [FAIL] DATA CORRUPTION DETECTED
    [FAIL] embedding2 corrupted: <class 'str'>

Document 2: doc:2
  embedding1 field type: <class 'str'>
    [FAIL] embedding1 is string, not bytes: 'L?fff?\x00\x00?̌?'
    [FAIL] Cannot reconstruct vector: unpack requires a buffer of 16 bytes
    [FAIL] Original bytes: 16 bytes
    [FAIL] Corrupted string: 11 chars
    [FAIL] DATA CORRUPTION DETECTED
    [FAIL] embedding2 corrupted: <class 'str'>

Document 3: doc:3
  embedding1 field type: <class 'str'>
    [FAIL] embedding1 is string, not bytes: '\x00\x00???ff?'
    [FAIL] Cannot reconstruct vector: unpack requires a buffer of 16 bytes
    [FAIL] Original bytes: 16 bytes
    [FAIL] Corrupted string: 8 chars
    [FAIL] DATA CORRUPTION DETECTED
    [FAIL] embedding2 corrupted: <class 'str'>

==================================================
TEST RESULT
==================================================
[FAIL] CORRUPTION DETECTED: Vector embeddings are corrupted by UTF-8 decoding
[FAIL] Vector search functionality is broken in current valkey-py

The fix should:
  1. Add preserve_bytes=True parameter to search methods
  2. Keep binary fields as bytes instead of UTF-8 decoding them
  3. Allow explicit binary_fields parameter for custom control

Cleaned up test index

==================================================
SUMMARY
==================================================
Vector search is broken due to UTF-8 corruption in Result class.

This demonstrates the need for the proposed fix:
• Add preserve_bytes parameter to search methods
• Preserve binary data for VECTOR fields
• Maintain backward compatibility

Run this script before and after applying the fix to validate it works.

Proposed Solution: Optional Binary Preservation

Add optional parameters to the search methods so that binary data can be preserved without breaking existing code. We propose two parameters:

  1. preserve_bytes: boolean, defaults to False. When True, field values in the response that are binary strings are detected and returned as bytes instead of str.
  2. binary_fields: list of strings, optional. When provided, only the named fields are returned as bytes. This can restrict the conversion to VECTOR fields so that TAG fields holding binary string data are still decoded to strings.

# Current behavior (unchanged) - all fields UTF-8 decoded
results = client.ft("idx").search("*")

# Auto-preserve binary fields (VECTOR fields + any binary TAG fields)
results = client.ft("idx").search("*", preserve_bytes=True)


# Explicit control for specific fields
results = client.ft("idx").search("*", preserve_bytes=True, binary_fields=["embedding1"])
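One possible way to implement the per-field decision is sketched below. decode_fields is a hypothetical helper written for illustration and is not part of valkey-py. It also assumes the server reply is a flat list of alternating field names and values as bytes, and that "infer binary" means "strict UTF-8 decoding fails":

```python
import struct


def decode_fields(flat, preserve_bytes=False, binary_fields=None):
    """Turn a flat [name, value, name, value, ...] reply into a dict.

    Hypothetical helper for illustration only; not valkey-py code.
    """
    doc = {}
    for name, value in zip(flat[0::2], flat[1::2]):
        key = name.decode("utf-8")
        if not preserve_bytes:
            # Current behavior: lossy decode of every value.
            doc[key] = value.decode("utf-8", "ignore")
        elif binary_fields is not None:
            # Explicit control: only the listed fields stay as bytes.
            if key in binary_fields:
                doc[key] = value
            else:
                doc[key] = value.decode("utf-8", "ignore")
        else:
            # Inference (assumed rule): keep bytes if strict decoding fails.
            try:
                doc[key] = value.decode("utf-8")
            except UnicodeDecodeError:
                doc[key] = value
    return doc


vec = struct.pack("<4f", 0.1, 0.2, 0.3, 0.4)
reply = [b"title", b"hello", b"embedding1", vec]

legacy = decode_fields(reply)
preserved = decode_fields(reply, preserve_bytes=True)
explicit = decode_fields(reply, preserve_bytes=True, binary_fields=["embedding1"])

# With the bytes intact, the vector can be unpacked again.
restored = struct.unpack("<4f", explicit["embedding1"])
```

Under this inference rule, a vector whose bytes happen to be valid UTF-8 (an all-zero vector, for instance) would still come back as str, which is one reason an explicit binary_fields list may be the safer option.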
