⚡️ Speed up function natural_sort by 102%
#594
+3
−7
📄 102% (1.02x) speedup for `natural_sort` in `marimo/_utils/files.py`
⏱️ Runtime: 4.83 milliseconds → 2.39 milliseconds (best of 132 runs)
📝 Explanation and details
The optimized version achieves a 102% speedup by eliminating function call overhead and moving regex compilation to module level.
Key optimizations:
- Module-level regex compilation: `_num_split = re.compile("([0-9]+)").split` compiles the regex once at import time and binds its `split` method, so each call skips the pattern-cache lookup and extra call layer of the module-level `re.split`. (CPython caches compiled patterns, so the original did not recompile from scratch on every call, but it still paid for the lookup.)
- Eliminated nested functions: Removed the `convert()` and `alphanum_key()` helper functions, reducing Python function call overhead and stack frame creation.
- Direct list comprehension: Replaced the nested function calls with a single list comprehension that directly processes the split result.
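The three changes above can be sketched side by side. This is a reconstruction from the description, not the actual diff: the helper names `convert` and `alphanum_key`, the `_num_split` binding, and the regex come from the text, while the function names `natural_sort_original` and `natural_sort_optimized` and the type hints are assumptions made for illustration.

```python
import re
from typing import Union


# Assumed shape of the original: two nested helpers are re-created on every
# call, and the module-level re.split is invoked with a string pattern.
def natural_sort_original(filename: str) -> list[Union[int, str]]:
    def convert(text: str) -> Union[int, str]:
        return int(text) if text.isdigit() else text.lower()

    def alphanum_key(key: str) -> list[Union[int, str]]:
        return [convert(c) for c in re.split("([0-9]+)", key)]

    return alphanum_key(filename)


# Assumed shape of the optimized version: the pattern is compiled once at
# import time and its bound split method is reused by every call.
_num_split = re.compile("([0-9]+)").split


def natural_sort_optimized(filename: str) -> list[Union[int, str]]:
    return [int(c) if c.isdigit() else c.lower() for c in _num_split(filename)]


# Quick equivalence check on a few inputs taken from the generated tests.
samples = ["file10.txt", "README.md", "", "FiLe2.TXT", "a12b34c"]
assert all(natural_sort_original(s) == natural_sort_optimized(s) for s in samples)
```

Both versions run the same per-segment conversion, so the returned keys should be identical. Only the per-call setup differs.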
Why it's faster:
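return alphanum_key(filename))
Performance characteristics:
The per-test timings below show speedups ranging from roughly 20% on long, many-segment inputs to about 250% on the empty string, with most short filenames in the 80-140% range. The strongest gains appear where fixed per-call overhead dominates: empty strings (232-250%) and short names without digits such as `README.md` and `abc` (137-176%).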
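To reproduce this kind of comparison locally, a rough micro-benchmark along the following lines could be used. It is only a sketch: `with_helpers` and `inlined` are hypothetical stand-ins reconstructed from the description, `best_of` is an illustrative helper, and the absolute numbers will differ from the ones reported here depending on machine and Python version.

```python
import re
import timeit

_num_split = re.compile("([0-9]+)").split


def with_helpers(filename):
    # Stand-in for the original: nested helpers plus module-level re.split.
    def convert(text):
        return int(text) if text.isdigit() else text.lower()

    def alphanum_key(key):
        return [convert(c) for c in re.split("([0-9]+)", key)]

    return alphanum_key(filename)


def inlined(filename):
    # Stand-in for the optimized version: precompiled split, one comprehension.
    return [int(c) if c.isdigit() else c.lower() for c in _num_split(filename)]


def best_of(func, arg, repeat=3, number=2000):
    """Return the best observed per-call time in microseconds."""
    timer = timeit.Timer(lambda: func(arg))  # lambda overhead hits both sides
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e6


if __name__ == "__main__":
    for name in ["", "README.md", "file10.txt", "x1" * 50]:
        before = best_of(with_helpers, name)
        after = best_of(inlined, name)
        print(f"{name[:12]!r:16} {before:8.2f}us -> {after:8.2f}us")
```

Short inputs should show the largest relative difference, since the saved setup work is a fixed cost per call while the split and conversion work grows with the input.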
The optimization maintains identical behavior while being universally faster across filename types, making it ideal for file sorting operations where this function may be called frequently.
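For context, a key function like this is typically passed to `sorted`, where it runs once per filename. The snippet below is a minimal illustration using a local stand-in for `natural_sort` (assumed to match the optimized form described above), so it can be run without marimo installed. The filenames are made up for the example.

```python
import re

_num_split = re.compile("([0-9]+)").split


def natural_sort(filename):
    # Local stand-in, assumed equivalent to marimo._utils.files.natural_sort.
    return [int(c) if c.isdigit() else c.lower() for c in _num_split(filename)]


files = ["file10.txt", "file2.txt", "File1.txt", "file1.md"]

# Plain string comparison puts "file10" before "file2" and uppercase first.
print(sorted(files))
# ['File1.txt', 'file1.md', 'file10.txt', 'file2.txt']

# With the natural key, digit runs compare as integers and case is ignored.
print(sorted(files, key=natural_sort))
# ['file1.md', 'File1.txt', 'file2.txt', 'file10.txt']
```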
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
import re
from typing import Union

# imports
import pytest  # used for our unit tests
from marimo._utils.files import natural_sort

# unit tests

# 1. Basic Test Cases

def test_basic_filename_with_number():
    # Typical filename with numbers in the middle
    codeflash_output = natural_sort("file10.txt")  # 5.96μs -> 2.96μs (101% faster)

def test_basic_filename_with_multiple_numbers():
    # Multiple numbers separated by text
    codeflash_output = natural_sort("img12version3.png")  # 6.76μs -> 3.66μs (84.9% faster)

def test_basic_filename_no_numbers():
    # Filename with no digits
    codeflash_output = natural_sort("README.md")  # 4.77μs -> 1.82μs (162% faster)

def test_basic_filename_starts_with_number():
    # Filename starting with a number
    codeflash_output = natural_sort("123file.txt")  # 6.09μs -> 2.85μs (113% faster)

def test_basic_filename_ends_with_number():
    # Filename ending with a number
    codeflash_output = natural_sort("file42")  # 5.60μs -> 2.57μs (118% faster)

def test_basic_filename_only_number():
    # Filename is just a number
    codeflash_output = natural_sort("2024")  # 5.29μs -> 2.32μs (128% faster)

def test_basic_filename_mixed_case():
    # Filename with mixed case to ensure lowercasing
    codeflash_output = natural_sort("FiLe2.TXT")  # 5.85μs -> 2.89μs (102% faster)

def test_basic_filename_with_leading_zeros():
    # Numbers with leading zeros should be parsed as int
    codeflash_output = natural_sort("file007.txt")  # 5.58μs -> 2.86μs (95.1% faster)

def test_basic_filename_with_multiple_consecutive_numbers():
    # Consecutive numbers separated by text
    codeflash_output = natural_sort("run10step20")  # 6.34μs -> 3.22μs (96.9% faster)

# 2. Edge Test Cases

def test_edge_empty_string():
    # Empty string input
    codeflash_output = natural_sort("")  # 4.02μs -> 1.21μs (232% faster)

def test_edge_only_numbers():
    # String with only digits
    codeflash_output = natural_sort("000123")  # 5.45μs -> 2.46μs (122% faster)

def test_edge_only_letters():
    # String with only letters
    codeflash_output = natural_sort("abcXYZ")  # 4.74μs -> 2.00μs (137% faster)

def test_edge_number_at_start_and_end():
    # Number at both start and end
    codeflash_output = natural_sort("42file99")  # 6.46μs -> 3.36μs (92.2% faster)

def test_edge_filename_with_special_characters():
    # Filename with special characters (should be part of text)
    codeflash_output = natural_sort("file_1-2#3.txt")  # 6.83μs -> 3.75μs (82.4% faster)

def test_edge_filename_with_spaces():
    # Filename with spaces
    codeflash_output = natural_sort("my file 12.txt")  # 6.00μs -> 2.96μs (103% faster)

def test_edge_filename_with_multiple_adjacent_numbers():
    # Adjacent numbers, e.g. "file1234"
    codeflash_output = natural_sort("file1234")  # 5.76μs -> 2.64μs (118% faster)

def test_edge_filename_with_no_text():
    # Only numbers and special characters
    codeflash_output = natural_sort("123_456")  # 6.04μs -> 3.16μs (91.1% faster)

def test_edge_filename_with_unicode():
    # Unicode characters should be lowercased
    codeflash_output = natural_sort("Fíle2.txt")  # 6.52μs -> 3.53μs (84.8% faster)

def test_edge_filename_with_empty_segments():
    # Filename with empty segments due to consecutive numbers
    codeflash_output = natural_sort("a12b34c")  # 6.20μs -> 3.14μs (97.5% faster)

def test_edge_filename_with_leading_and_trailing_spaces():
    # Leading and trailing spaces should be preserved in segments
    codeflash_output = natural_sort(" file 99 ")  # 5.99μs -> 2.88μs (108% faster)

def test_edge_filename_with_multiple_periods():
    # Multiple periods in filename
    codeflash_output = natural_sort("archive.v2.10.tar.gz")  # 6.62μs -> 3.70μs (79.1% faster)

def test_edge_filename_with_empty_number_segment():
    # Empty segment between two numbers
    codeflash_output = natural_sort("12.34")  # 5.91μs -> 2.93μs (101% faster)

def test_edge_filename_with_zero():
    # Zero as a number
    codeflash_output = natural_sort("file0.txt")  # 5.83μs -> 2.81μs (107% faster)

def test_edge_filename_with_large_number():
    # Large number in filename
    codeflash_output = natural_sort("data999999.txt")  # 5.85μs -> 2.78μs (110% faster)

def test_edge_filename_with_dash_and_number():
    # Dash between text and number
    codeflash_output = natural_sort("file-100.txt")  # 5.65μs -> 2.75μs (106% faster)

def test_edge_filename_with_multiple_types():
    # Mix of letters, numbers, symbols
    codeflash_output = natural_sort("a1b2_c3.d4")  # 7.00μs -> 3.94μs (77.5% faster)

def test_edge_filename_with_multiple_empty_strings():
    # String with only empty segments
    codeflash_output = natural_sort("")  # 4.15μs -> 1.21μs (244% faster)

# 3. Large Scale Test Cases

def test_large_scale_long_filename():
    # Very long filename with many numbers and letters
    filename = "a" * 100 + "123" + "b" * 100 + "456" + "c" * 100 + "789"
    expected = ["a" * 100, 123, "b" * 100, 456, "c" * 100, 789, ""]
    codeflash_output = natural_sort(filename)  # 12.2μs -> 7.83μs (56.1% faster)

def test_large_scale_filename_with_many_segments():
    # Filename with alternating text and numbers, 500 segments
    filename = "".join(f"x{i}" for i in range(500))
    # Should split into ["x", 0, "x", 1, ..., "x", 499, ""]
    expected = []
    for i in range(500):
        expected.extend(["x", i])
    expected.append("")
    codeflash_output = natural_sort(filename)  # 126μs -> 98.0μs (29.4% faster)

def test_large_scale_filename_with_large_numbers():
    # Filename with large numbers, ensure int conversion
    filename = "big" + "999999" * 10
    # The ten repetitions form one 60-digit run, so they parse as a single int
    expected = ["big", int("999999" * 10), ""]
    codeflash_output = natural_sort(filename)  # 6.39μs -> 3.15μs (103% faster)

def test_large_scale_filename_with_repeated_pattern():
    # Repeated pattern with numbers and text
    filename = ("data42_" * 100)[:-1]  # remove last underscore
    # Should split into ["data", 42, "_data", 42, ..., "_data", 42, ""]
    expected = ["data", 42] + ["_data", 42] * 99 + [""]
    codeflash_output = natural_sort(filename)  # 38.9μs -> 29.9μs (30.4% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
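Several `expected` lists in these generated tests end with an empty string. That follows from how `re.split` behaves with a capturing group: the matched digit runs are kept in the result, and a match at the very start or end of the input leaves an empty string on that side. A small illustration using only the standard library, with the same pattern as the description:

```python
import re

pattern = "([0-9]+)"

# A digit run at the end leaves a trailing empty segment.
print(re.split(pattern, "file10"))      # ['file', '10', '']

# Text after the last digit run means there is no trailing empty segment.
print(re.split(pattern, "file10.txt"))  # ['file', '10', '.txt']

# A string that is only digits gets empty segments on both sides.
print(re.split(pattern, "123"))         # ['', '123', '']

# With no match at all, the whole input comes back as a single segment.
print(re.split(pattern, ""))            # ['']
```

Because text and digit segments always alternate, keys built this way line up position by position when two filenames are compared.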
import re
import string  # used for generating large scale test cases
from typing import Union

# imports
import pytest  # used for our unit tests
from marimo._utils.files import natural_sort

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_basic_alpha():
    # Purely alphabetic string
    codeflash_output = natural_sort("abc")  # 5.94μs -> 2.15μs (176% faster)

def test_basic_numeric():
    # Purely numeric string
    codeflash_output = natural_sort("123")  # 5.65μs -> 2.50μs (126% faster)

def test_alpha_numeric_simple():
    # Simple alpha-numeric mix
    codeflash_output = natural_sort("file10")  # 6.02μs -> 2.69μs (124% faster)

def test_alpha_numeric_multiple_numbers():
    # Multiple numbers in the string
    codeflash_output = natural_sort("ver2.4beta5")  # 7.05μs -> 3.87μs (82.2% faster)

def test_alpha_numeric_leading_trailing():
    # Number at the start and end
    codeflash_output = natural_sort("1file2")  # 6.15μs -> 2.93μs (110% faster)

def test_alpha_numeric_embedded():
    # Embedded numbers
    codeflash_output = natural_sort("foo123bar456")  # 6.38μs -> 3.19μs (99.9% faster)

def test_alpha_numeric_with_spaces():
    # Spaces included
    codeflash_output = natural_sort("file 10 version 2")  # 6.62μs -> 3.39μs (95.5% faster)

def test_alpha_numeric_with_uppercase():
    # Uppercase letters should be lowercased
    codeflash_output = natural_sort("File10A")  # 5.65μs -> 2.68μs (111% faster)

def test_alpha_numeric_with_mixed_case():
    # Mixed case, should all be lowercased
    codeflash_output = natural_sort("FiLe123bAr")  # 5.76μs -> 2.74μs (111% faster)

def test_alpha_numeric_with_underscore():
    # Underscores are treated as part of the string
    codeflash_output = natural_sort("file_10")  # 5.61μs -> 2.56μs (119% faster)

def test_alpha_numeric_with_dash():
    # Dashes are treated as part of the string
    codeflash_output = natural_sort("file-10")  # 5.46μs -> 2.59μs (111% faster)

def test_alpha_numeric_with_dot():
    # Dots are treated as part of the string
    codeflash_output = natural_sort("file.10")  # 5.42μs -> 2.50μs (117% faster)

def test_alpha_numeric_with_empty_string():
    # Empty string should return a single empty segment: [""]
    codeflash_output = natural_sort("")  # 4.05μs -> 1.17μs (245% faster)

def test_alpha_numeric_with_single_digit():
    # Single digit number
    codeflash_output = natural_sort("a1b")  # 5.72μs -> 2.60μs (120% faster)

def test_alpha_numeric_with_multiple_adjacent_numbers():
    # Adjacent numbers
    codeflash_output = natural_sort("a12b34c56")  # 6.62μs -> 3.46μs (91.5% faster)

# -------------------------
# Edge Test Cases
# -------------------------

def test_edge_leading_zeros():
    # Leading zeros in numbers
    codeflash_output = natural_sort("file007")  # 5.71μs -> 2.67μs (114% faster)

def test_edge_only_zeros():
    # String with only zeros
    codeflash_output = natural_sort("000")  # 5.44μs -> 2.28μs (139% faster)

def test_edge_large_number():
    # Very large number
    codeflash_output = natural_sort("file123456789")  # 5.70μs -> 2.69μs (112% faster)

def test_edge_multiple_consecutive_numbers():
    # Multiple consecutive numbers
    codeflash_output = natural_sort("a1b2c3d4e5")  # 7.23μs -> 3.97μs (82.3% faster)

def test_edge_no_alpha():
    # String with only numbers
    codeflash_output = natural_sort("123456")  # 5.39μs -> 2.25μs (139% faster)

def test_edge_no_numeric():
    # String with only letters
    codeflash_output = natural_sort("abcdef")  # 5.23μs -> 2.03μs (157% faster)

def test_edge_special_characters():
    # Special characters between numbers and letters
    codeflash_output = natural_sort("a$1#b@2!")  # 6.62μs -> 3.51μs (88.8% faster)

def test_edge_unicode_characters():
    # Unicode characters in the string
    codeflash_output = natural_sort("naïve42café")  # 6.67μs -> 3.60μs (85.5% faster)

def test_edge_empty_between_numbers():
    # Empty string between numbers
    codeflash_output = natural_sort("123456")  # 5.54μs -> 2.40μs (131% faster)

def test_edge_number_at_start():
    # Number at the very start
    codeflash_output = natural_sort("42answer")  # 5.90μs -> 2.85μs (107% faster)

def test_edge_number_at_end():
    # Number at the very end
    codeflash_output = natural_sort("answer42")  # 5.64μs -> 2.62μs (116% faster)

def test_edge_adjacent_numbers():
    # Adjacent numbers with no separator
    codeflash_output = natural_sort("a12b34")  # 6.17μs -> 3.09μs (99.9% faster)

def test_edge_multiple_empty_splits():
    # Multiple empty splits
    codeflash_output = natural_sort("123abc456def")  # 6.58μs -> 3.28μs (101% faster)

def test_edge_mixed_case_and_numbers():
    # Mixed case and numbers
    codeflash_output = natural_sort("A1b2C3")  # 6.53μs -> 3.17μs (106% faster)

def test_edge_long_string():
    # Long string with mixed content
    input_str = "a" * 100 + "123" + "b" * 100 + "456"
    expected = ["a" * 100, 123, "b" * 100, 456, ""]
    codeflash_output = natural_sort(input_str)  # 8.33μs -> 5.39μs (54.6% faster)

def test_edge_numbers_within_special_chars():
    # Numbers surrounded by special chars
    codeflash_output = natural_sort("foo#123$bar")  # 5.83μs -> 2.68μs (118% faster)

def test_edge_string_with_multiple_dots():
    # Dots between numbers and letters
    codeflash_output = natural_sort("v1.2.3")  # 6.48μs -> 3.18μs (104% faster)

def test_edge_string_with_multiple_underscores():
    # Underscores between numbers and letters
    codeflash_output = natural_sort("v_1_2_3")  # 6.41μs -> 3.26μs (96.7% faster)

def test_edge_string_with_multiple_hyphens():
    # Hyphens between numbers and letters
    codeflash_output = natural_sort("v-1-2-3")  # 6.48μs -> 3.14μs (107% faster)

def test_edge_string_with_mixed_separators():
    # Mixed separators
    codeflash_output = natural_sort("foo_1-bar.2")  # 6.25μs -> 3.08μs (103% faster)

def test_edge_string_with_multiple_empty_strings():
    # String with only empty splits
    codeflash_output = natural_sort("")  # 4.14μs -> 1.18μs (250% faster)

def test_edge_string_with_only_special_chars():
    # Only special chars, no numbers
    codeflash_output = natural_sort("!@#")  # 4.87μs -> 1.79μs (172% faster)

def test_edge_string_with_special_chars_and_numbers():
    # Special chars and numbers
    codeflash_output = natural_sort("!@#123$%^")  # 6.07μs -> 2.94μs (106% faster)

def test_edge_string_with_spaces_and_numbers():
    # Spaces and numbers
    codeflash_output = natural_sort("foo 123 bar")  # 5.98μs -> 2.91μs (106% faster)

def test_edge_string_with_tab_and_newline():
    # Tabs and newlines
    codeflash_output = natural_sort("foo\t123\nbar")  # 5.98μs -> 2.88μs (107% faster)

def test_edge_string_with_non_ascii_digits():
    # Non-ASCII digits should not be parsed as numbers
    codeflash_output = natural_sort("foo١٢٣bar")  # 5.66μs -> 2.49μs (127% faster)

def test_edge_string_with_multiple_empty_numbers():
    # Multiple empty splits with numbers
    codeflash_output = natural_sort("123456789")  # 5.63μs -> 2.37μs (137% faster)

def test_edge_string_with_number_in_middle():
    # Number in the middle
    codeflash_output = natural_sort("foo123bar")  # 6.04μs -> 2.93μs (106% faster)

def test_edge_string_with_number_and_special_char():
    # Number and special char
    codeflash_output = natural_sort("foo123!bar")  # 5.86μs -> 2.87μs (105% faster)

def test_edge_string_with_number_and_space():
    # Number and space
    codeflash_output = natural_sort("foo 123 bar")  # 5.90μs -> 2.86μs (107% faster)

def test_edge_string_with_number_and_tab():
    # Number and tab
    codeflash_output = natural_sort("foo\t123\tbar")  # 6.02μs -> 2.89μs (108% faster)

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_large_scale_many_files():
    # 1000 file names, check sorting keys
    for i in range(1, 1001):
        fname = f"file{i}"
        expected = ["file", i, ""]
        codeflash_output = natural_sort(fname)  # 1.59ms -> 630μs (153% faster)

def test_large_scale_long_string():
    # Long string with 500 'a's, a number, and 500 'b's
    s = "a" * 500 + "123" + "b" * 500
    # Text follows the number, so there is no trailing empty segment
    expected = ["a" * 500, 123, "b" * 500]
    codeflash_output = natural_sort(s)  # 16.3μs -> 13.1μs (25.1% faster)

def test_large_scale_multiple_numbers():
    # String with 100 numbers separated by 'x'
    s = "x".join(str(i) for i in range(1, 101))
    # The expected result alternates between "" (for the first split), int, and "x"
    expected = []
    parts = re.split("([0-9]+)", s)
    for part in parts:
        if part.isdigit():
            expected.append(int(part))
        else:
            expected.append(part.lower())
    codeflash_output = natural_sort(s)  # 30.7μs -> 21.6μs (41.9% faster)

def test_large_scale_varied_file_names():
    # 1000 varied file names
    for i in range(1, 1001):
        fname = f"TestFile_{i}Ver{i*2}"
        # The underscore stays with the first text segment
        expected = ["testfile_", i, "ver", i * 2, ""]
        codeflash_output = natural_sort(fname)  # 2.02ms -> 1.01ms (101% faster)

def test_large_scale_alpha_numeric_pattern():
    # Pattern: a1b2c3...z26
    s = "".join(f"{c}{i}" for i, c in enumerate(string.ascii_lowercase, 1))
    expected = []
    for i, c in enumerate(string.ascii_lowercase, 1):
        expected.append(c)
        expected.append(i)
    expected.append("")
    codeflash_output = natural_sort(s)  # 14.1μs -> 9.17μs (54.3% faster)

def test_large_scale_with_special_chars():
    # 500 repetitions of 'foo#123$bar'
    s = "foo#123$bar" * 500
    # Text between two numbers spans the repetition boundary: "$bar" + "foo#"
    expected = ["foo#", 123] + ["$barfoo#", 123] * 499 + ["$bar"]
    codeflash_output = natural_sort(s)  # 176μs -> 146μs (20.5% faster)

def test_large_scale_with_mixed_case():
    # 500 repetitions of 'Foo123Bar'
    s = "Foo123Bar" * 500
    # Text between two numbers spans the repetition boundary: "bar" + "foo"
    expected = ["foo", 123] + ["barfoo", 123] * 499 + ["bar"]
    codeflash_output = natural_sort(s)  # 161μs -> 126μs (27.3% faster)

def test_large_scale_numbers_with_leading_zeros():
    # 100 file names with leading zeros
    for i in range(1, 101):
        fname = f"file{str(i).zfill(5)}"
        expected = ["file", i, ""]
        codeflash_output = natural_sort(fname)  # 168μs -> 69.4μs (143% faster)

def test_large_scale_long_numeric_string():
    # Very long numeric string (999 digits)
    s = "1" * 999
    # A digits-only string has empty text segments on both sides of the number
    expected = ["", int(s), ""]
    codeflash_output = natural_sort(s)  # 12.9μs -> 9.96μs (29.6% faster)

def test_large_scale_long_alpha_string():
    # Very long alpha string (999 'a's)
    s = "a" * 999
    expected = [s]
    codeflash_output = natural_sort(s)  # 13.8μs -> 11.2μs (23.3% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from marimo._utils.files import natural_sort

def test_natural_sort():
    natural_sort('')
🔎 Concolic Coverage Tests and Runtime
codeflash_concolic_bps3n5s8/tmpevzf92ds/test_concolic_coverage.py::test_natural_sort
To edit these changes, `git checkout codeflash/optimize-natural_sort-mhv4uc52` and push.