ENH: Standardize calculation of Pearson correlation #63030

@Zelpuz

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The various methods for calculating the Pearson correlation between DataFrame columns or between Series have distinct implementations.

Series.corr calls nanops.nancorr, which drops NaNs pairwise and then calls np.corrcoef. DataFrame.corr calls _libs/algos.nancorr, which is a custom Cython implementation. Finally, DataFrame.corrwith has yet another implementation, this time written in Python but not using numpy.

Is there a reason for these different approaches? If not, wouldn't it make more sense to use one consistent approach for all three?
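
For readers unfamiliar with the first approach, here is a minimal sketch of "drop NaNs pairwise, then call np.corrcoef". It is not the pandas source; the function name `pairwise_nan_corr` and the sample data are made up for illustration, and the early return for fewer than two valid pairs is my own assumption.

```python
import numpy as np


def pairwise_nan_corr(a, b):
    """Pearson correlation over positions where both inputs are non-NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Keep only the positions where neither value is missing.
    valid = ~(np.isnan(a) | np.isnan(b))
    if valid.sum() < 2:
        # Assumption: treat fewer than two valid pairs as undefined.
        return np.nan
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
    return np.corrcoef(a[valid], b[valid])[0, 1]


x = [1.0, 2.0, np.nan, 4.0, 5.0]
y = [2.0, 4.0, 6.0, np.nan, 10.0]
r = pairwise_nan_corr(x, y)  # uses positions 0, 1 and 4 only
```

The three surviving pairs lie exactly on the line y = 2x, so `r` comes out as 1 up to floating-point error.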

Feature Description

Standardize which approach is used. Based on how often pandas methods end up calling numpy methods under the hood, it seems sensible to change everything to use nanops.nancorr.

Alternative Solutions

If the Cython implementation in _libs/algos is notably faster than np.corrcoef, it could make sense to prefer that approach.
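
A rough way to get a first impression of the speed question, using only public API. This is a sketch, not a proper benchmark: the array shape and repeat count are arbitrary, the data has no NaNs, and `DataFrame.corr` includes pandas overhead beyond the core computation. Which internal routine `DataFrame.corr` dispatches to is taken from the description above and may differ between pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.normal(size=(2_000, 20))  # arbitrary size, NaN-free
df = pd.DataFrame(values)

# Time the public DataFrame.corr call against plain np.corrcoef on the same data.
t_pandas = timeit.timeit(lambda: df.corr(), number=5)
t_numpy = timeit.timeit(lambda: np.corrcoef(values, rowvar=False), number=5)

print(f"DataFrame.corr: {t_pandas:.4f}s  np.corrcoef: {t_numpy:.4f}s")
```

A serious comparison would also need to cover data with missing values, since that is where the implementations differ most in what they have to do.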

Additional Context

I have not checked how corr is implemented in GroupBy or Window, but even with these three methods I've mentioned it seems weird to be so inconsistent. I also haven't fully checked if this issue is present for other correlation methods (Kendall and Spearman), but at least DataFrame.corrwith seems to defer to nanops.

All three methods should and do seem to produce the same result, so this is probably low-priority. It is more of an issue from a design and maintenance perspective.
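
A quick check of the "same result" claim through the public API, as a sketch. The column names, the random data, and the NaN positions are arbitrary. Passing a DataFrame (rather than a Series) to `corrwith` is a deliberate choice here, on the assumption that this is the case handled by the separate Python implementation; I have not verified that for every pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["a", "b"])
# Sprinkle NaNs into both columns at different positions.
df.iloc[::17, 0] = np.nan
df.iloc[5::23, 1] = np.nan

# 1. Series.corr
r_series = df["a"].corr(df["b"])

# 2. DataFrame.corr
r_frame = df.corr().loc["a", "b"]

# 3. DataFrame.corrwith with a DataFrame argument; columns are matched by
#    label, so "b" is renamed to "a" to pair it with column "a".
other = df[["b"]].rename(columns={"b": "a"})
r_corrwith = df[["a"]].corrwith(other).loc["a"]

print(r_series, r_frame, r_corrwith)
```

On this data the three values agree up to floating-point tolerance, which can be confirmed with `np.isclose`.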

Labels: Enhancement, Needs Triage
