Unnecessary repeated iteration in split_dataframe_by_prefix #607

@ptgolden

Description

I've been trying to track down some performance issues in the mondo pipeline, and one bottleneck I've identified is in a function that calls sssom.parsers.split_dataframe_by_prefix. Specifically, this part:

sssom-py/src/sssom/parsers.py

Lines 980 to 1022 in cf75d07

    def split_dataframe_by_prefix(
        msdf: MappingSetDataFrame,
        subject_prefixes: Iterable[str],
        object_prefixes: Iterable[str],
        relations: Iterable[str],
    ) -> Dict[str, MappingSetDataFrame]:
        """Split a mapping set dataframe by prefix.

        :param msdf: An SSSOM MappingSetDataFrame
        :param subject_prefixes: a list of prefixes pertaining to the subject
        :param object_prefixes: a list of prefixes pertaining to the object
        :param relations: a list of relations of interest
        :return: a dict of SSSOM data frame names to MappingSetDataFrame
        """
        df = msdf.df
        meta = msdf.metadata
        split_to_msdf: Dict[str, MappingSetDataFrame] = {}
        for subject_prefix, object_prefix, relation in itt.product(
            subject_prefixes, object_prefixes, relations
        ):
            relation_prefix, relation_id = relation.split(":")
            split = f"{subject_prefix.lower()}_{relation_id.lower()}_{object_prefix.lower()}"
            if subject_prefix not in msdf.converter.bimap:
                logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
                continue
            if object_prefix not in msdf.converter.bimap:
                logging.warning(f"{split} - missing object prefix - {object_prefix}")
                continue
            df_subset = df[
                (df[SUBJECT_ID].str.startswith(subject_prefix + ":"))
                & (df[PREDICATE_ID] == relation)
                & (df[OBJECT_ID].str.startswith(object_prefix + ":"))
            ]
            if 0 == len(df_subset):
                logging.debug(f"No matches ({len(df_subset)} matches found)")
                continue
            subconverter = msdf.converter.get_subconverter(
                [subject_prefix, object_prefix, relation_prefix]
            )
            split_to_msdf[split] = from_sssom_dataframe(
                df_subset, prefix_map=dict(subconverter.bimap), meta=meta
            )
        return split_to_msdf

(full implementation here)
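
For context, a typical call looks something like this (the prefixes and relations here are made up for illustration):

    splits = split_dataframe_by_prefix(
        msdf,
        subject_prefixes=["MONDO"],
        object_prefixes=["DOID", "OMIM"],
        relations=["skos:exactMatch"],
    )
    # e.g. {"mondo_exactmatch_doid": <MappingSetDataFrame>, ...}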

The issue is a bit clearer before the refactor in 1fc45ec:

    for pre_subj in subject_prefixes:
        for pre_obj in object_prefixes:
            for rel in relations:
                ...
                if df is not None:
                    dfs = df[
                        (df[SUBJECT_ID].str.startswith(pre_subj + ":"))
                        & (df[PREDICATE_ID] == rel)
                        & (df[OBJECT_ID].str.startswith(pre_obj + ":"))
                    ]

The issue is that the entire MappingSetDataFrame is scanned once for every element of the product len(subject_prefixes) * len(object_prefixes) * len(relations). Each scan creates a new string and calls .startswith twice per row. While these are small operations, they add up: in the case I'm looking at, the split takes ~10 minutes. There might be a big performance gain from only going through the MappingSetDataFrame once.
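
To make the scaling concrete, here's a standalone sketch of the access pattern (synthetic data; the prefixes, relation, and sizes are all made up). The full frame gets re-filtered once per combination:

    import itertools as itt
    import time

    import pandas as pd

    # Synthetic frame: 1M rows, 5 subject prefixes, 5 object prefixes, 2 relations
    n = 1_000_000
    df = pd.DataFrame({
        "subject_id": [f"A{i % 5}:{i}" for i in range(n)],
        "predicate_id": ["skos:exactMatch"] * n,
        "object_id": [f"B{i % 5}:{i}" for i in range(n)],
    })
    subject_prefixes = [f"A{i}" for i in range(5)]
    object_prefixes = [f"B{i}" for i in range(5)]
    relations = ["skos:exactMatch", "skos:closeMatch"]

    start = time.perf_counter()
    for s, o, r in itt.product(subject_prefixes, object_prefixes, relations):
        # Every combination re-scans all three columns of the full frame,
        # so the work is O(len(df) * S * O * R) rather than O(len(df)).
        subset = df[
            df["subject_id"].str.startswith(s + ":")
            & (df["predicate_id"] == r)
            & df["object_id"].str.startswith(o + ":")
        ]
    n_scans = len(subject_prefixes) * len(object_prefixes) * len(relations)
    print(f"{time.perf_counter() - start:.2f}s for {n_scans} full scans")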

A quickly written attempt that I can formalize as a PR if the approach looks OK:

    from dataclasses import dataclass

    import pandas as pd

    # (df, meta, and msdf are assumed to be in scope, as in the current function body)

    @dataclass(frozen=True)  # frozen so instances are hashable and usable as dict keys
    class SSSOMTriple:
        subject_prefix: str
        object_prefix: str
        relation: str

        @property
        def relation_prefix(self) -> str:
            return self.relation.split(":")[0]

        @property
        def relation_id(self) -> str:
            return self.relation.split(":", 1)[1]

        def as_identifier(self) -> str:
            return f"{self.subject_prefix.lower()}_{self.relation_id.lower()}_{self.object_prefix.lower()}"

    splits_by_triple: dict[SSSOMTriple, list[dict]] = {}
    split_to_msdf: dict[str, MappingSetDataFrame] = {}

    # Still iterate through the product of SxPxO initially to pre-populate a dict
    for subject_prefix, object_prefix, relation in itt.product(
        subject_prefixes, object_prefixes, relations
    ):
        triple = SSSOMTriple(subject_prefix, object_prefix, relation)
        split = triple.as_identifier()
        if subject_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
            continue
        if object_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing object prefix - {object_prefix}")
            continue
        splits_by_triple[triple] = []

    # Iterate through `msdf.df` once, building up a list of matches in the pre-populated dict
    for entry in df.itertuples(index=False):
        row = entry._asdict()
        subject_prefix = row[SUBJECT_ID].split(":")[0]
        object_prefix = row[OBJECT_ID].split(":")[0]
        relation = row[PREDICATE_ID]
        triple = SSSOMTriple(subject_prefix, object_prefix, relation)
        if triple in splits_by_triple:
            splits_by_triple[triple].append(row)

    # Iterate through the pre-populated dict, building one MappingSetDataFrame per split
    for triple, values in splits_by_triple.items():
        split = triple.as_identifier()
        if len(values) == 0:
            logging.debug(f"{split} - no matches found")
            continue
        subconverter = msdf.converter.get_subconverter(
            [triple.subject_prefix, triple.object_prefix, triple.relation_prefix]
        )
        split_to_msdf[split] = from_sssom_dataframe(
            pd.DataFrame(values), prefix_map=dict(subconverter.bimap), meta=meta
        )

    return split_to_msdf

This would drastically reduce the runtime of the algorithm as the size of msdf.df increases (assuming non-trivial amounts of subject_prefixes/object_prefixes/relations). In my case, a similar approach took the split from ~10m to nearly instant. The only added complexity is constructing an SSSOMTriple object for every row in the dataframe and for every element of the SxPxO product, but that is a cheap operation and should not add a noticeable performance penalty in any case.
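
As a variation on the same single-pass idea, the itertuples loop could also be replaced with vectorized pandas operations. A rough sketch (untested against real data, reusing the names from the snippet above and storing row dicts the same way):

    # Derive the prefix columns once with vectorized string ops,
    # then group the whole frame in a single pass.
    subject_prefix_col = df[SUBJECT_ID].str.split(":", n=1).str[0]
    object_prefix_col = df[OBJECT_ID].str.split(":", n=1).str[0]
    for (s, r, o), df_subset in df.groupby(
        [subject_prefix_col, df[PREDICATE_ID], object_prefix_col]
    ):
        triple = SSSOMTriple(s, o, r)
        # Only keep groups for combinations that were pre-populated above
        if triple in splits_by_triple:
            splits_by_triple[triple] = df_subset.to_dict("records")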
