Unnecessary repeated iteration in split_dataframe_by_prefix #607

@ptgolden

Description

I've been trying to track down some performance issues in the mondo pipeline, and one bottleneck I've identified is in a function that calls sssom.parsers.split_dataframe_by_prefix. Specifically, this part:

sssom-py/src/sssom/parsers.py

Lines 980 to 1022 in cf75d07

    def split_dataframe_by_prefix(
        msdf: MappingSetDataFrame,
        subject_prefixes: Iterable[str],
        object_prefixes: Iterable[str],
        relations: Iterable[str],
    ) -> Dict[str, MappingSetDataFrame]:
        """Split a mapping set dataframe by prefix.

        :param msdf: An SSSOM MappingSetDataFrame
        :param subject_prefixes: a list of prefixes pertaining to the subject
        :param object_prefixes: a list of prefixes pertaining to the object
        :param relations: a list of relations of interest
        :return: a dict of SSSOM data frame names to MappingSetDataFrame
        """
        df = msdf.df
        meta = msdf.metadata
        split_to_msdf: Dict[str, MappingSetDataFrame] = {}
        for subject_prefix, object_prefix, relation in itt.product(
            subject_prefixes, object_prefixes, relations
        ):
            relation_prefix, relation_id = relation.split(":")
            split = f"{subject_prefix.lower()}_{relation_id.lower()}_{object_prefix.lower()}"
            if subject_prefix not in msdf.converter.bimap:
                logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
                continue
            if object_prefix not in msdf.converter.bimap:
                logging.warning(f"{split} - missing object prefix - {object_prefix}")
                continue
            df_subset = df[
                (df[SUBJECT_ID].str.startswith(subject_prefix + ":"))
                & (df[PREDICATE_ID] == relation)
                & (df[OBJECT_ID].str.startswith(object_prefix + ":"))
            ]
            if 0 == len(df_subset):
                logging.debug(f"No matches ({len(df_subset)} matches found)")
                continue
            subconverter = msdf.converter.get_subconverter(
                [subject_prefix, object_prefix, relation_prefix]
            )
            split_to_msdf[split] = from_sssom_dataframe(
                df_subset, prefix_map=dict(subconverter.bimap), meta=meta
            )
        return split_to_msdf

(full implementation here)
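
For context, a typical call looks something like this (the prefixes and relations here are made up for illustration):

    splits = split_dataframe_by_prefix(
        msdf,
        subject_prefixes=["MONDO"],
        object_prefixes=["DOID", "OMIM"],
        relations=["skos:exactMatch"],
    )
    # e.g. {"mondo_exactmatch_doid": <MappingSetDataFrame>, ...}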

The issue is a bit clearer before the refactor in 1fc45ec:

    for pre_subj in subject_prefixes:
        for pre_obj in object_prefixes:
            for rel in relations:
                ...
                if df is not None:
                    dfs = df[
                        (df[SUBJECT_ID].str.startswith(pre_subj + ":"))
                        & (df[PREDICATE_ID] == rel)
                        & (df[OBJECT_ID].str.startswith(pre_obj + ":"))
                    ]

The issue is that the entire MappingSetDataFrame is scanned once for every element of the product len(subject_prefixes) * len(object_prefixes) * len(relations). Each scan creates a new string and calls .startswith twice per row. While these are small operations, they add up: in the case I'm looking at, the split takes ~10 minutes. There might be a big performance gain from only going through the MappingSetDataFrame once.
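
To make the scaling concrete, here's a standalone sketch of the access pattern (synthetic data; the prefixes, relation, and sizes are all made up). The full frame gets re-filtered once per combination:

    import itertools as itt
    import time

    import pandas as pd

    # Synthetic frame: 1M rows, 5 subject prefixes, 5 object prefixes, 2 relations
    n = 1_000_000
    df = pd.DataFrame({
        "subject_id": [f"A{i % 5}:{i}" for i in range(n)],
        "predicate_id": ["skos:exactMatch"] * n,
        "object_id": [f"B{i % 5}:{i}" for i in range(n)],
    })
    subject_prefixes = [f"A{i}" for i in range(5)]
    object_prefixes = [f"B{i}" for i in range(5)]
    relations = ["skos:exactMatch", "skos:closeMatch"]

    start = time.perf_counter()
    for s, o, r in itt.product(subject_prefixes, object_prefixes, relations):
        # Every combination re-scans all three columns of the full frame,
        # so the work is O(len(df) * S * O * R) rather than O(len(df)).
        subset = df[
            df["subject_id"].str.startswith(s + ":")
            & (df["predicate_id"] == r)
            & df["object_id"].str.startswith(o + ":")
        ]
    n_scans = len(subject_prefixes) * len(object_prefixes) * len(relations)
    print(f"{time.perf_counter() - start:.2f}s for {n_scans} full scans")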

A quickly written attempt that I can formalize as a PR if the approach looks OK:

    from dataclasses import dataclass

    import pandas as pd

    # (df, meta, and msdf are assumed to be in scope, as in the current function body)

    @dataclass(frozen=True)  # frozen so instances are hashable and usable as dict keys
    class SSSOMTriple:
        subject_prefix: str
        object_prefix: str
        relation: str

        @property
        def relation_prefix(self) -> str:
            return self.relation.split(":")[0]

        @property
        def relation_id(self) -> str:
            return self.relation.split(":", 1)[1]

        def as_identifier(self) -> str:
            return f"{self.subject_prefix.lower()}_{self.relation_id.lower()}_{self.object_prefix.lower()}"

    splits_by_triple: dict[SSSOMTriple, list[dict]] = {}
    split_to_msdf: dict[str, MappingSetDataFrame] = {}

    # Still iterate through the product of SxPxO initially to pre-populate a dict
    for subject_prefix, object_prefix, relation in itt.product(
        subject_prefixes, object_prefixes, relations
    ):
        triple = SSSOMTriple(subject_prefix, object_prefix, relation)
        split = triple.as_identifier()
        if subject_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
            continue
        if object_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing object prefix - {object_prefix}")
            continue
        splits_by_triple[triple] = []

    # Iterate through `msdf.df` once, building up a list of matches in the pre-populated dict
    for entry in df.itertuples(index=False):
        row = entry._asdict()
        subject_prefix = row[SUBJECT_ID].split(":")[0]
        object_prefix = row[OBJECT_ID].split(":")[0]
        relation = row[PREDICATE_ID]
        triple = SSSOMTriple(subject_prefix, object_prefix, relation)
        if triple in splits_by_triple:
            splits_by_triple[triple].append(row)

    # Iterate through the pre-populated dict, building one MappingSetDataFrame per split
    for triple, values in splits_by_triple.items():
        split = triple.as_identifier()
        if len(values) == 0:
            logging.debug(f"{split} - no matches found")
            continue
        subconverter = msdf.converter.get_subconverter(
            [triple.subject_prefix, triple.object_prefix, triple.relation_prefix]
        )
        split_to_msdf[split] = from_sssom_dataframe(
            pd.DataFrame(values), prefix_map=dict(subconverter.bimap), meta=meta
        )

    return split_to_msdf

This would drastically reduce the runtime of the algorithm as the size of msdf.df increases (assuming non-trivial amounts of subject_prefixes/object_prefixes/relations). In my case, a similar approach took the split from ~10m to nearly instant. The only added complexity is constructing an SSSOMTriple object for every row in the dataframe and for every element of the SxPxO product, but that is a cheap operation and should not add a noticeable performance penalty in any case.
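
As a variation on the same single-pass idea, the itertuples loop could also be replaced with vectorized pandas operations. A rough sketch (untested against real data, reusing the names from the snippet above and storing row dicts the same way):

    # Derive the prefix columns once with vectorized string ops,
    # then group the whole frame in a single pass.
    subject_prefix_col = df[SUBJECT_ID].str.split(":", n=1).str[0]
    object_prefix_col = df[OBJECT_ID].str.split(":", n=1).str[0]
    for (s, r, o), df_subset in df.groupby(
        [subject_prefix_col, df[PREDICATE_ID], object_prefix_col]
    ):
        triple = SSSOMTriple(s, o, r)
        # Only keep groups for combinations that were pre-populated above
        if triple in splits_by_triple:
            splits_by_triple[triple] = df_subset.to_dict("records")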
