@codeflash-ai codeflash-ai bot commented Nov 5, 2025

📄 34% (0.34x) speedup for NeptuneAnalyticsVector._get_where_clause in mem0/vector_stores/neptune_analytics.py

⏱️ Runtime : 519 microseconds → 386 microseconds (best of 250 runs)

📝 Explanation and details

The optimization replaces an inefficient iterative string concatenation approach with a list comprehension and join() operation, delivering a 34% speedup.

Key Changes:

  1. Early exit for empty filters: Added if not filters: return "" to avoid unnecessary processing
  2. Eliminated enumerate() and conditional logic: Replaced the enumerate() loop and its if i == 0 check with a single list comprehension
  3. Used join() instead of string concatenation: Built all clauses in a list, then joined with ' AND ' in a single operation
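The three changes can be sketched as follows. This is a minimal reconstruction from the description above and the expected strings in the generated tests, not the actual diff; the function names `get_where_clause_original` and `get_where_clause_optimized` are hypothetical, and the trailing space is kept only because the tests expect it.

```python
def get_where_clause_original(filters: dict) -> str:
    """Loop-and-concatenate version, reconstructed from the description (assumption)."""
    where_clause = ""
    for i, (k, v) in enumerate(filters.items()):
        if i == 0:
            # First filter opens the clause
            where_clause += f"WHERE n.{k} = '{v}' "
        else:
            # Every later filter is chained with AND
            where_clause += f"AND n.{k} = '{v}' "
    return where_clause


def get_where_clause_optimized(filters: dict) -> str:
    """Early exit, list comprehension, and a single join (assumption)."""
    if not filters:
        return ""
    clauses = [f"n.{k} = '{v}'" for k, v in filters.items()]
    return "WHERE " + " AND ".join(clauses) + " "
```

Both functions should return identical strings for any dict, which is the property the regression tests below are checking.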

Why This Is Faster:

  • String concatenation inefficiency: The original code repeatedly concatenates strings (where_clause += ...). Because Python strings are immutable, each concatenation can allocate a new string and copy the accumulated text; CPython can sometimes extend the string in place, but that is an implementation detail rather than a guarantee
  • Unnecessary enumeration overhead: enumerate() tracks an index that is only used to detect the first item
  • Per-iteration branching: The if i == 0 comparison is evaluated by the interpreter on every iteration, even though it is true only once
  • List comprehension + join() efficiency: join() computes the final length once and copies each clause a single time, which scales better than repeated concatenation for larger filter sets
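The comparison above can be reproduced with a small `timeit` harness. This is a sketch, not the benchmark Codeflash ran: `concat_clause`, `join_clause`, and `measure` are hypothetical names, and absolute timings will vary by machine and Python version.

```python
import timeit


def concat_clause(filters: dict) -> str:
    # Repeated += concatenation (stand-in for the original approach)
    out = ""
    for i, (k, v) in enumerate(filters.items()):
        out += f"WHERE n.{k} = '{v}' " if i == 0 else f"AND n.{k} = '{v}' "
    return out


def join_clause(filters: dict) -> str:
    # List comprehension plus a single join (stand-in for the optimized approach)
    if not filters:
        return ""
    return "WHERE " + " AND ".join([f"n.{k} = '{v}'" for k, v in filters.items()]) + " "


def measure(fn, filters: dict, number: int = 200, repeat: int = 3) -> float:
    """Return the best per-call time in seconds over `repeat` timing runs."""
    timer = timeit.Timer(lambda: fn(filters))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for n in (1, 3, 1000):
        filters = {f"k{i}": f"v{i}" for i in range(n)}
        t_concat = measure(concat_clause, filters)
        t_join = measure(join_clause, filters)
        print(f"{n:>4} filters: concat {t_concat * 1e6:.2f}us, join {t_join * 1e6:.2f}us")
```

Taking the minimum over several runs mirrors the "best of N runs" convention used in the report, since the minimum is the value least affected by background noise.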

Performance Benefits by Test Case:

  • Empty filters: 181% faster (839ns → 299ns) due to early exit
  • Large-scale tests: 18-65% faster for 500-1000 filters, where a single join() replaces hundreds of concatenations; the 100-filter test with 500-character values was about 6% slower
  • Small filters (2-3 items): Mostly modest improvements of up to about 8%
  • Single filters: Typically slower, by up to about 13% (a fraction of a microsecond per call), likely due to list creation overhead

The optimization is most effective for large filter dictionaries containing hundreds of conditions. For the one-to-three-key filters exercised in most of these tests, the difference is within a fraction of a microsecond in either direction.
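The percentages in this report appear consistent with measuring the time saved relative to the optimized runtime. The helper below is a hypothetical illustration of that arithmetic, not Codeflash's own code.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Time saved relative to the optimized runtime, as a percentage.

    Positive values mean the optimized code is faster; negative values
    mean it is slower. Both arguments must use the same time unit.
    """
    return (original - optimized) / optimized * 100


# Overall runtime from the report: 519 us -> 386 us
overall = speedup_percent(519, 386)        # about 34.5

# Empty-filter test: 839 ns -> 299 ns
empty_filters = speedup_percent(839, 299)  # about 180.6

print(f"overall: {overall:.1f}%, empty filters: {empty_filters:.1f}%")
```

Under this reading, the "(0.34x)" in the headline is the same quantity expressed as a fraction rather than a percentage.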

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 41 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from mem0.vector_stores.neptune_analytics import NeptuneAnalyticsVector

# unit tests

# 1. Basic Test Cases

def test_empty_filters_dict_returns_empty_string():
    # Test that an empty filter dict returns an empty string (no WHERE clause)
    codeflash_output = NeptuneAnalyticsVector._get_where_clause({}); result = codeflash_output # 839ns -> 299ns (181% faster)

def test_single_filter():
    # Test a single filter key-value pair
    codeflash_output = NeptuneAnalyticsVector._get_where_clause({'foo': 'bar'}); result = codeflash_output # 1.21μs -> 1.33μs (9.08% slower)

def test_multiple_filters():
    # Test multiple key-value pairs
    filters = {'foo': 'bar', 'baz': 'qux'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.54μs -> 1.52μs (1.45% faster)
    # The order in dicts is preserved since Python 3.7
    expected = "WHERE n.foo = 'bar' AND n.baz = 'qux' "

def test_three_filters():
    # Test three filters for correct clause
    filters = {'a': '1', 'b': '2', 'c': '3'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.79μs -> 1.71μs (5.04% faster)
    expected = "WHERE n.a = '1' AND n.b = '2' AND n.c = '3' "

# 2. Edge Test Cases

def test_key_with_special_characters():
    # Test keys with special characters
    filters = {'foo-bar': 'baz'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.16μs -> 1.28μs (10.1% slower)
    expected = "WHERE n.foo-bar = 'baz' "

def test_value_with_special_characters():
    # Test values with special characters
    filters = {'foo': "ba'z"}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.11μs -> 1.26μs (11.2% slower)
    # The function does not escape single quotes, so this should be reflected
    expected = "WHERE n.foo = 'ba\'z' "

def test_key_and_value_are_empty_strings():
    # Test empty string key and value
    filters = {'': ''}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.13μs -> 1.19μs (4.73% slower)
    expected = "WHERE n. = '' "

def test_numeric_and_boolean_values():
    # Test numeric and boolean values (they are converted to strings)
    filters = {'num': 42, 'flag': True}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 2.08μs -> 2.03μs (2.11% faster)
    expected = "WHERE n.num = '42' AND n.flag = 'True' "

def test_none_value():
    # Test None value (should be stringified as 'None')
    filters = {'foo': None}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.29μs -> 1.38μs (6.86% slower)
    expected = "WHERE n.foo = 'None' "

def test_order_of_filters_is_preserved():
    # Test that order of keys is preserved
    filters = {'a': 'x', 'b': 'y', 'c': 'z'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.78μs -> 1.70μs (4.66% faster)
    expected = "WHERE n.a = 'x' AND n.b = 'y' AND n.c = 'z' "

def test_key_with_spaces():
    # Test key with spaces
    filters = {'foo bar': 'baz'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.13μs -> 1.25μs (9.55% slower)
    expected = "WHERE n.foo bar = 'baz' "

def test_value_with_double_quotes():
    # Test value with double quotes
    filters = {'foo': 'ba"z'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.13μs -> 1.24μs (8.78% slower)
    expected = "WHERE n.foo = 'ba\"z' "

def test_key_with_unicode():
    # Test key with unicode characters
    filters = {'ключ': 'значение'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.53μs -> 1.77μs (13.4% slower)
    expected = "WHERE n.ключ = 'значение' "

def test_value_with_newline_and_tab():
    # Test value with newline and tab characters
    filters = {'foo': 'bar\nbaz\tqux'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.12μs -> 1.24μs (9.35% slower)
    expected = "WHERE n.foo = 'bar\nbaz\tqux' "

# 3. Large Scale Test Cases

def test_large_number_of_filters():
    # Test with a large number of filters (up to 1000)
    filters = {f'key{i}': f'value{i}' for i in range(1000)}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 100μs -> 77.9μs (29.5% faster)
    # Check the number of ANDs is 999 (since first is WHERE)
    and_count = result.count('AND ')

def test_large_scale_performance():
    # Test performance for large number of filters (not a strict timing test, but should not hang)
    filters = {f'k{i}': f'v{i}' for i in range(1000)}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 100μs -> 61.0μs (64.9% faster)

def test_large_scale_with_long_strings():
    # Test with long string values
    long_string = "x" * 500
    filters = {f'key{i}': long_string for i in range(100)}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 14.7μs -> 15.7μs (5.97% slower)
    # Check that all long values are present
    for i in range(100):
        expected = f"n.key{i} = '{long_string}'"

# Additional edge: test with non-string keys (should work, but keys are stringified)
def test_non_string_keys():
    filters = {42: 'answer', True: 'yes'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 2.11μs -> 2.06μs (2.42% faster)
    expected = "WHERE n.42 = 'answer' AND n.True = 'yes' "

# Additional edge: test with tuple values (should be stringified)
def test_tuple_value():
    filters = {'foo': (1, 2, 3)}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 2.35μs -> 2.35μs (0.043% faster)
    expected = "WHERE n.foo = '(1, 2, 3)' "

# Additional edge: test with list value (should be stringified)
def test_list_value():
    filters = {'foo': [1, 2, 3]}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 2.08μs -> 2.05μs (1.22% faster)
    expected = "WHERE n.foo = '[1, 2, 3]' "

# Additional edge: test with dict value (should be stringified)
def test_dict_value():
    filters = {'foo': {'bar': 'baz'}}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 2.19μs -> 2.37μs (7.55% slower)
    expected = "WHERE n.foo = '{\'bar\': \'baz\'}' "

# Additional edge: test with None key
def test_none_key():
    filters = {None: 'value'}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.30μs -> 1.38μs (5.58% slower)
    expected = "WHERE n.None = 'value' "

# Additional edge: test with both None key and value
def test_none_key_and_value():
    filters = {None: None}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.42μs -> 1.44μs (1.18% slower)
    expected = "WHERE n.None = 'None' "

# Additional edge: test with boolean key and value
def test_boolean_key_and_value():
    filters = {True: False}
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.54μs -> 1.70μs (9.34% slower)
    expected = "WHERE n.True = 'False' "
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
from mem0.vector_stores.neptune_analytics import NeptuneAnalyticsVector

# unit tests

# ----------- BASIC TEST CASES -----------

def test_single_filter():
    # Test with a single key-value pair
    filters = {"foo": "bar"}
    expected = "WHERE n.foo = 'bar' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.31μs -> 1.42μs (7.88% slower)

def test_multiple_filters():
    # Test with multiple key-value pairs
    filters = {"foo": "bar", "baz": "qux"}
    expected = "WHERE n.foo = 'bar' AND n.baz = 'qux' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.64μs -> 1.59μs (3.46% faster)

def test_three_filters():
    # Test with three key-value pairs
    filters = {"a": "1", "b": "2", "c": "3"}
    expected = "WHERE n.a = '1' AND n.b = '2' AND n.c = '3' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.80μs -> 1.69μs (6.69% faster)

def test_empty_filters():
    # Test with empty dict
    filters = {}
    expected = ""
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 762ns -> 295ns (158% faster)

# ----------- EDGE TEST CASES -----------

def test_non_string_values():
    # Test with non-string values (int, float, bool)
    filters = {"foo": 1, "bar": 3.14, "baz": True}
    expected = "WHERE n.foo = '1' AND n.bar = '3.14' AND n.baz = 'True' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 3.96μs -> 3.77μs (5.03% faster)

def test_non_string_keys():
    # Test with non-string keys (int, float, bool)
    filters = {1: "a", 3.14: "b", True: "c"}
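    # Note: True == 1 and both hash alike, so the True key collapses into key 1,
    # which keeps its position but takes the later value "c"
    expected = "WHERE n.1 = 'c' AND n.3.14 = 'b' "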
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 2.66μs -> 2.65μs (0.339% faster)

def test_key_with_spaces():
    # Test with keys containing spaces
    filters = {"first name": "Alice", "last name": "Smith"}
    expected = "WHERE n.first name = 'Alice' AND n.last name = 'Smith' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.56μs -> 1.53μs (1.90% faster)

def test_value_with_single_quote():
    # Test with value containing single quote
    filters = {"name": "O'Reilly"}
    expected = "WHERE n.name = 'O'Reilly' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.17μs -> 1.26μs (7.20% slower)
    # Note: This exposes a potential query injection/escaping bug (the embedded quote is not escaped)

def test_key_with_special_characters():
    # Test with keys containing special characters
    filters = {"user-id": "123", "email@address": "test@example.com"}
    expected = "WHERE n.user-id = '123' AND n.email@address = 'test@example.com' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.62μs -> 1.61μs (0.933% faster)

def test_value_with_special_characters():
    # Test with values containing special characters
    filters = {"desc": "Hello, world!", "emoji": "😊"}
    expected = "WHERE n.desc = 'Hello, world!' AND n.emoji = '😊' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 2.05μs -> 2.13μs (4.03% slower)

def test_value_is_none():
    # Test with None value
    filters = {"foo": None}
    expected = "WHERE n.foo = 'None' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.38μs -> 1.44μs (3.83% slower)

def test_key_is_empty_string():
    # Test with empty string key
    filters = {"": "value"}
    expected = "WHERE n. = 'value' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.13μs -> 1.23μs (8.53% slower)

def test_value_is_empty_string():
    # Test with empty string value
    filters = {"foo": ""}
    expected = "WHERE n.foo = '' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.16μs -> 1.17μs (0.257% slower)

def test_order_of_filters_is_preserved():
    # Test that order of filters is preserved (Python 3.7+ dicts are ordered)
    filters = {"a": "1", "b": "2", "c": "3"}
    expected = "WHERE n.a = '1' AND n.b = '2' AND n.c = '3' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 1.80μs -> 1.67μs (8.03% faster)

# ----------- LARGE SCALE TEST CASES -----------

def test_large_number_of_filters():
    # Test with a large number of filters (1000)
    filters = {f"key{i}": f"value{i}" for i in range(1000)}
    # Build expected string
    expected = ""
    for i in range(1000):
        if i == 0:
            expected += f"WHERE n.key0 = 'value0' "
        else:
            expected += f"AND n.key{i} = 'value{i}' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 97.8μs -> 82.5μs (18.6% faster)

def test_large_number_of_filters_with_special_chars():
    # Test with a large number of filters with special characters
    filters = {f"key-{i}": f"value_{i}@!" for i in range(500)}
    expected = ""
    for i in range(500):
        if i == 0:
            expected += f"WHERE n.key-0 = 'value_0@!' "
        else:
            expected += f"AND n.key-{i} = 'value_{i}@!' "
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 49.0μs -> 32.1μs (52.8% faster)

def test_performance_large_filters():
    # Test that function completes within reasonable time for 1000 filters
    import time
    filters = {f"k{i}": f"v{i}" for i in range(1000)}
    start = time.time()
    codeflash_output = NeptuneAnalyticsVector._get_where_clause(filters); result = codeflash_output # 99.9μs -> 61.1μs (63.4% faster)
    elapsed = time.time() - start
    # Also check correctness
    expected_start = "WHERE n.k0 = 'v0' "
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
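
The single-quote test above notes an escaping problem: values are interpolated directly into the query text, so a value such as O'Reilly produces unbalanced quotes. One common mitigation, sketched here under assumptions, is to emit `$name` placeholders and pass the values separately. `build_parameterized_where` and the `p0`, `p1`, ... parameter names are hypothetical, and this PR does not show whether the surrounding query call accepts a parameter map. Unlike the current function, this sketch does not stringify values, and keys are still interpolated, so they would need separate validation.

```python
from typing import Any, Dict, Tuple


def build_parameterized_where(filters: Dict[Any, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return a WHERE clause with placeholders plus the matching parameter map.

    Hypothetical helper: values never appear in the query text, so quotes
    inside them cannot break the clause.
    """
    if not filters:
        return "", {}
    clauses = []
    params: Dict[str, Any] = {}
    for i, (key, value) in enumerate(filters.items()):
        name = f"p{i}"                          # hypothetical parameter name
        clauses.append(f"n.{key} = ${name}")    # key is still interpolated
        params[name] = value
    return "WHERE " + " AND ".join(clauses) + " ", params


clause, params = build_parameterized_where({"name": "O'Reilly"})
print(clause)   # the clause text contains no user-supplied value
print(params)
```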

To edit these changes, run git checkout codeflash/optimize-NeptuneAnalyticsVector._get_where_clause-mhlgl9wa and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 5, 2025 03:49
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 5, 2025