Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
The various methods for calculating the Pearson correlation between DataFrame columns or between Series have distinct implementations.
Series.corr calls nanops.nancorr, which simply drops NaNs pairwise and then calls np.corrcoef; DataFrame.corr calls _libs/algos.nancorr, a custom Cython implementation; and DataFrame.corrwith has yet another implementation, this time in pure Python without numpy.
Is there a reason for these different approaches? If not, wouldn't it make more sense to use a single consistent approach everywhere?
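As a quick illustration (made-up data, not from any pandas test), all three entry points can be exercised on the same inputs and are expected to agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 11.0],
})

# Series.corr -> nanops.nancorr (pairwise NaN drop, then np.corrcoef)
r_series = df["a"].corr(df["b"])

# DataFrame.corr -> _libs/algos.nancorr (Cython)
r_frame = df.corr().loc["a", "b"]

# DataFrame.corrwith with another DataFrame -> the Python-level path
other = pd.DataFrame({"a": df["b"].to_numpy()}, index=df.index)
r_corrwith = df.corrwith(other)["a"]

print(r_series, r_frame, r_corrwith)  # all three should print the same value
```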
Feature Description
Standardize on a single approach. Given how often pandas methods end up calling numpy under the hood, it seems sensible to have everything use nanops.nancorr.
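For reference, the nanops-style approach amounts to roughly the following (a rough sketch, not the actual pandas internals): mask NaNs pairwise, then hand the rest to np.corrcoef.

```python
import numpy as np

def pairwise_pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation of two 1-D arrays, ignoring positions where
    either value is NaN (pairwise deletion)."""
    mask = ~(np.isnan(a) | np.isnan(b))
    if mask.sum() < 2:
        return np.nan  # not enough overlapping observations
    return float(np.corrcoef(a[mask], b[mask])[0, 1])
```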
Alternative Solutions
If the Cython implementation in _libs/algos is notably faster than np.corrcoef, it could make sense to prefer that approach instead.
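A quick (and admittedly unscientific) way to check would be to time the two on a reasonably large NaN-free frame, something along these lines (shapes and repeat counts are arbitrary):

```python
import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame(np.random.default_rng(0).standard_normal((100_000, 20)))

# DataFrame.corr goes through the Cython _libs/algos.nancorr path
t_cython = timeit(lambda: df.corr(), number=10)
# np.corrcoef on the raw ndarray, treating columns as variables
t_numpy = timeit(lambda: np.corrcoef(df.to_numpy(), rowvar=False), number=10)

print(f"DataFrame.corr: {t_cython:.3f}s  np.corrcoef: {t_numpy:.3f}s")
```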
Additional Context
I have not checked how corr is implemented in GroupBy or Window, but even among just the three methods mentioned above the inconsistency seems odd. I also haven't fully checked whether this issue affects the other correlation methods (Kendall and Spearman), but at least DataFrame.corrwith seems to defer to nanops for those.
All three methods should and do seem to produce the same result, so this is probably low-priority. It is more of an issue from a design and maintenance perspective.