Describe the bug
mlxtend 0.23.4 introduced a severe performance regression in the fpgrowth function when processing large sparse boolean DataFrames with low min_support values (e.g. 0.01). The execution time increased from seconds to hours or days, making the function unusable for realistic datasets.
The regression appears to be caused by inefficient NULL/NaN handling introduced in 0.23.4 that processes every cell of the DataFrame unnecessarily, even when the DataFrame contains only boolean values and no NULL values at all. There is even a null_values parameter that lets the user declare whether the DataFrame contains NULL values, but setting it to False does not actually skip the unnecessary NULL handling logic.
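For reference, this is the call shape meant here (a minimal example assuming the null_values keyword as exposed in 0.23.4); per the observations above, passing null_values=False still goes through the per-row NULL logic:

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# Purely boolean frame, no NULLs anywhere.
df = pd.DataFrame({"a": [True, False, True], "b": [True, True, False]})

# Declaring the data NULL-free does not skip the NULL handling,
# so large inputs remain as slow as with the default settings.
fpgrowth(df, min_support=0.01, use_colnames=True, null_values=False)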
The additional NULL handling logic appears to increase the generate_itemsets() time complexity from O(n_generated_itemsets) to O(n_generated_itemsets * n_df_rows), because of a nested for loop that iterates over rows. When processing a DataFrame of size 300,000 x 3,600 with min_support=0.01, roughly 300,000 itemsets are generated, so the amount of work is effectively squared compared to the previous version (0.23.1): 300,000 rows * 300,000 itemsets ≈ 10^11 operations. This suddenly makes fpgrowth unusable, where previously the same call took less than a minute. That dataset is intractable with the current version of fpgrowth, even though it is only small to moderate by business-application standards.
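To make the complexity argument concrete, here is a minimal sketch (not the actual mlxtend source; function and variable names are illustrative only) contrasting a result build that reuses the counts already produced by the FP-tree with one that rescans every row per itemset, which is where the extra n_df_rows factor comes from:

def build_result_fast(itemset_counts, n_rows):
    # itemset_counts: {frozenset_of_column_labels: absolute_count} from the FP-tree.
    # One division per itemset -> O(n_generated_itemsets).
    return {itemset: count / n_rows for itemset, count in itemset_counts.items()}


def build_result_with_row_scan(itemset_counts, df):
    # Rescanning every row for every itemset -> O(n_generated_itemsets * n_df_rows),
    # even when df is purely boolean and the scan cannot change the result.
    supports = {}
    for itemset, count in itemset_counts.items():
        valid_rows = 0
        for _, row in df.iterrows():  # the costly inner loop over rows
            if not row[list(itemset)].isna().any():
                valid_rows += 1
        supports[itemset] = count / valid_rows
    return supports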
Steps/Code to Reproduce
The example below uses a smaller DataFrame (10,000 rows x 400 columns with rather low support) so that it showcases the large execution-time difference between mlxtend 0.23.1 and 0.23.4 instead of simply stalling indefinitely (as happens with larger data).
import time
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth


def create_dataframe(n_rows=10000, n_cols=400):
    np.random.seed(42)
    # Generate realistic sparse boolean data based on real categorical data patterns
    # 90% columns have support < 0.01, 6% have 0.01-0.1, 4% have 0.1-0.65
    support_values = np.zeros(n_cols)
    # Most columns: very low support (don't participate in itemsets with min_support=0.01)
    n_very_low = int(n_cols * 0.9)
    support_values[:n_very_low] = np.random.uniform(0.0001, 0.009, n_very_low)
    # Some columns: medium support (participate in itemsets)
    n_medium = int(n_cols * 0.06)
    support_values[n_very_low:n_very_low + n_medium] = np.random.uniform(0.01, 0.1, n_medium)
    # Few columns: high support (common categories)
    n_high = n_cols - n_very_low - n_medium
    support_values[n_very_low + n_medium:] = np.random.uniform(0.1, 0.65, n_high)
    # Generate boolean DataFrame (no NULL values)
    data = {}
    for i in range(n_cols):
        col_name = f"feature_{i:04d}"
        prob = support_values[i]
        data[col_name] = np.random.random(n_rows) < prob
    return pd.DataFrame(data)


def test_performance():
    """Test fpgrowth performance."""
    df = create_dataframe()
    print(f"Testing {df.shape} DataFrame with {df.values.mean():.4f} density")
    print(f"mlxtend version: {__import__('mlxtend').__version__}")
    start_time = time.time()
    frequent_itemsets = fpgrowth(df, min_support=0.01, use_colnames=True)
    elapsed_time = time.time() - start_time
    print(f"Execution time: {elapsed_time:.2f} seconds")
    print(f"Found itemsets: {len(frequent_itemsets):,}")
    return elapsed_time


if __name__ == "__main__":
    test_performance()

Expected Results
With mlxtend 0.23.1:
- Execution time: < 1 second
Actual Results
With mlxtend 0.23.4:
- Execution time: ~3 minutes!
- This shows that the newly added NULL handling dramatically increases complexity and takes up nearly all of the computing time.
The issue becomes even worse with larger DataFrames or higher-support columns (it scales poorly with both n_rows and n_generated_itemsets).
Workaround: Downgrade to mlxtend 0.23.1.
Suggested Fix: Review the NULL handling logic introduced in 0.23.4. The function should skip expensive NULL processing when the DataFrame contains only boolean values.
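One possible shape of such a fast-path guard, sketched here (this is not a patch against the actual mlxtend internals, and the helper name is assumed):

import pandas as pd

def needs_null_handling(df: pd.DataFrame, null_values: bool) -> bool:
    # Skip the expensive per-row NULL logic when the caller declares the data clean,
    # when the dtypes cannot hold NaN/None at all, or when no NaN is actually present.
    if not null_values:
        return False
    if all(pd.api.types.is_bool_dtype(dtype) for dtype in df.dtypes):
        return False
    return bool(df.isna().any().any())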
The NULL logic itself could also be sped up somewhat by using vectorized operations in generate_itemsets() instead of the row loop and the costly conversions from pandas objects to Python lists. However, this does not solve the underlying O(n_generated_itemsets * n_df_rows) complexity issue.
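For illustration, a vectorized per-itemset computation along these lines (a sketch under the assumption that the intended semantics are "rows where every item is True, divided by rows where no item is NULL"; it reduces the constant factor but keeps the O(n_generated_itemsets * n_df_rows) total work):

import pandas as pd

def null_aware_support(df: pd.DataFrame, itemset) -> float:
    block = df[list(itemset)]
    not_null = block.notna().all(axis=1)              # rows usable for this itemset
    all_true = block.fillna(False).astype(bool).all(axis=1)
    denominator = int(not_null.sum())
    return float((all_true & not_null).sum()) / denominator if denominator else 0.0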
Versions
MLxtend 0.23.4
Linux-5.15.0-152-generic-x86_64-with-glibc2.35
Python 3.11.13 (main, Jun 4 2025, 08:57:29) [GCC 11.4.0]
Scikit-learn 1.7.0
NumPy 1.26.4
SciPy 1.15.2