⚡️ Speed up function describe by 230%
#179
Open
📄 230% (2.30x) speedup for `describe` in `src/statistics/descriptive.py`
⏱️ Runtime: 3.16 milliseconds → 958 microseconds (best of 250 runs)
📝 Explanation and details
The optimized code achieves a 230% speedup by replacing element-by-element Python processing of the pandas Series with vectorized NumPy operations. The key optimizations are:
What was optimized:
- Replaced the list comprehension `[v for v in series if not pd.isna(v)]` with vectorized operations: `arr = series.to_numpy()`, `mask = ~pd.isna(arr)`, and `values = arr[mask]`
- Changed `sorted(values)` to NumPy's `np.sort(values)`
- Used `values.mean()` instead of `sum(values) / n`, and `((values - mean) ** 2).mean()` for variance

Why it's faster:
Performance breakdown from profiling:
Test case performance:
The optimization particularly benefits larger datasets: the large-scale test cases with 1000+ elements see the largest improvements, because vectorized operations scale much better than the original element-by-element processing.
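As a rough illustration of the changes listed under "What was optimized", the sketch below puts an element-by-element version next to a vectorized one. It is not the actual code in `src/statistics/descriptive.py`: the function names `describe_loop` and `describe_vectorized`, the returned keys, and the median calculation are assumptions made for the example. Only the filtering, sorting, mean, and variance expressions come from the description above.

```python
import math

import numpy as np
import pandas as pd


def describe_loop(series: pd.Series) -> dict:
    """Element-by-element version (hypothetical reconstruction of the original)."""
    # Filter missing values one element at a time.
    values = [v for v in series if not pd.isna(v)]
    n = len(values)
    if n == 0:
        return {"count": 0, "mean": math.nan, "var": math.nan, "median": math.nan}
    ordered = sorted(values)
    mean = sum(values) / n
    # Population variance, computed with a Python-level loop.
    var = sum((v - mean) ** 2 for v in values) / n
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {"count": n, "mean": mean, "var": var, "median": median}


def describe_vectorized(series: pd.Series) -> dict:
    """Vectorized version using the operations named in the PR description."""
    arr = series.to_numpy()
    mask = ~pd.isna(arr)      # boolean mask of non-missing entries
    values = arr[mask]
    n = int(values.size)
    if n == 0:
        return {"count": 0, "mean": math.nan, "var": math.nan, "median": math.nan}
    ordered = np.sort(values)
    mean = values.mean()
    var = ((values - mean) ** 2).mean()  # population variance
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {"count": n, "mean": float(mean), "var": float(var), "median": float(median)}


if __name__ == "__main__":
    sample = pd.Series([4.0, 1.0, np.nan, 3.0, 2.0])
    print(describe_loop(sample))
    print(describe_vectorized(sample))
```

Both functions should agree up to floating-point rounding on numeric input; the vectorized one moves the filtering, sorting, and arithmetic out of the Python interpreter loop and into NumPy.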
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-describe-midsk9vv` and push.