Description
I've been trying to track some performance issues in the mondo pipeline, and one bottleneck I've identified is in a function that calls sssom.parsers.split_dataframe_by_prefix. Specifically, this part here:
Lines 980 to 1022 in cf75d07
def split_dataframe_by_prefix(
    msdf: MappingSetDataFrame,
    subject_prefixes: Iterable[str],
    object_prefixes: Iterable[str],
    relations: Iterable[str],
) -> Dict[str, MappingSetDataFrame]:
    """Split a mapping set dataframe by prefix.

    :param msdf: An SSSOM MappingSetDataFrame
    :param subject_prefixes: a list of prefixes pertaining to the subject
    :param object_prefixes: a list of prefixes pertaining to the object
    :param relations: a list of relations of interest
    :return: a dict of SSSOM data frame names to MappingSetDataFrame
    """
    df = msdf.df
    meta = msdf.metadata
    split_to_msdf: Dict[str, MappingSetDataFrame] = {}
    for subject_prefix, object_prefix, relation in itt.product(
        subject_prefixes, object_prefixes, relations
    ):
        relation_prefix, relation_id = relation.split(":")
        split = f"{subject_prefix.lower()}_{relation_id.lower()}_{object_prefix.lower()}"
        if subject_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
            continue
        if object_prefix not in msdf.converter.bimap:
            logging.warning(f"{split} - missing object prefix - {object_prefix}")
            continue
        df_subset = df[
            (df[SUBJECT_ID].str.startswith(subject_prefix + ":"))
            & (df[PREDICATE_ID] == relation)
            & (df[OBJECT_ID].str.startswith(object_prefix + ":"))
        ]
        if 0 == len(df_subset):
            logging.debug(f"No matches ({len(df_subset)} matches found)")
            continue
        subconverter = msdf.converter.get_subconverter(
            [subject_prefix, object_prefix, relation_prefix]
        )
        split_to_msdf[split] = from_sssom_dataframe(
            df_subset, prefix_map=dict(subconverter.bimap), meta=meta
        )
    return split_to_msdf
(full implementation here)
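For context, here is a minimal usage sketch of how this function gets called (the file name and the prefix/relation lists are made up, just to show the call shape and the resulting split keys):

from sssom.parsers import parse_sssom_table, split_dataframe_by_prefix

# Hypothetical inputs; real pipelines pass much longer prefix and relation lists
msdf = parse_sssom_table("mappings.sssom.tsv")
splits = split_dataframe_by_prefix(
    msdf,
    subject_prefixes=["MONDO"],
    object_prefixes=["DOID", "OMIM"],
    relations=["skos:exactMatch"],
)
# Keys follow the "{subject}_{relation_id}_{object}" pattern, e.g. "mondo_exactmatch_doid"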
The issue is a bit clearer before the refactor in 1fc45ec:
for pre_subj in subject_prefixes:
    for pre_obj in object_prefixes:
        for rel in relations:
            ...
            if df is not None:
                dfs = df[
                    (df[SUBJECT_ID].str.startswith(pre_subj + ":"))
                    & (df[PREDICATE_ID] == rel)
                    & (df[OBJECT_ID].str.startswith(pre_obj + ":"))
                ]
The issue is that the entire MappingSetDataFrame gets filtered once for every element of the product, i.e. len(subject_prefixes) * len(object_prefixes) * len(relations) times. Each combination builds the prefix strings and runs two .startswith checks plus a predicate comparison across every row. These are small operations individually, but they add up: in the case I'm looking at, the call takes ~10 minutes. There might be a big performance gain from only going through the MappingSetDataFrame once.
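To make the scaling concrete, here is a minimal synthetic benchmark sketch (not the sssom code; the row count, column names, and prefix/relation lists are made up) that mimics the repeated whole-dataframe filtering:

import itertools as itt
import time

import pandas as pd

rows = 100_000
subject_prefixes = [f"S{i}" for i in range(10)]
object_prefixes = [f"O{i}" for i in range(10)]
relations = [f"skos:rel{i}" for i in range(5)]

# Synthetic mapping table with the same three columns the real code filters on
df = pd.DataFrame(
    {
        "subject_id": [f"S{i % 10}:{i}" for i in range(rows)],
        "predicate_id": [f"skos:rel{i % 5}" for i in range(rows)],
        "object_id": [f"O{i % 10}:{i}" for i in range(rows)],
    }
)

start = time.perf_counter()
for s, o, r in itt.product(subject_prefixes, object_prefixes, relations):
    # Every combination re-scans all rows: two .startswith calls plus an equality check
    _ = df[
        df["subject_id"].str.startswith(s + ":")
        & (df["predicate_id"] == r)
        & df["object_id"].str.startswith(o + ":")
    ]
elapsed = time.perf_counter() - start
print(f"{len(subject_prefixes) * len(object_prefixes) * len(relations)} full scans: {elapsed:.1f}s")

The total work grows with rows * combinations, so it blows up as soon as the prefix and relation lists get long; a single pass over the rows pays the per-row cost only once, no matter how many combinations there are.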
A quickly written attempt that I can formalize as a PR if the approach looks OK:
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen so instances are hashable and can be used as dict keys
class SSSOMTriple:
    subject_prefix: str
    object_prefix: str
    relation: str

    @property
    def relation_prefix(self) -> str:
        return self.relation.split(":")[0]

    @property
    def relation_id(self) -> str:
        return self.relation.split(":")[1]

    def as_identifier(self) -> str:
        return f"{self.subject_prefix.lower()}_{self.relation_id.lower()}_{self.object_prefix.lower()}"
split_to_msdf: dict[str, MappingSetDataFrame] = {}
splits_by_triple: dict[SSSOMTriple, list[dict]] = {}
# Still iterate through the product of SxPxO initially to pre-populate a dict
for subject_prefix, object_prefix, relation in itt.product(
    subject_prefixes, object_prefixes, relations
):
    triple = SSSOMTriple(subject_prefix, object_prefix, relation)
    split = triple.as_identifier()
    if subject_prefix not in msdf.converter.bimap:
        logging.warning(f"{split} - missing subject prefix - {subject_prefix}")
        continue
    if object_prefix not in msdf.converter.bimap:
        logging.warning(f"{split} - missing object prefix - {object_prefix}")
        continue
    splits_by_triple[triple] = []
# Iterate through `msdf.df` once, building up a list of matches in the pre-populated dict
for entry in df.itertuples(index=False):
    _entry = entry._asdict()
    subject_prefix = _entry[SUBJECT_ID].split(":")[0]
    object_prefix = _entry[OBJECT_ID].split(":")[0]
    relation = _entry[PREDICATE_ID]
    triple = SSSOMTriple(subject_prefix, object_prefix, relation)
    if triple in splits_by_triple:
        splits_by_triple[triple].append(_entry)
# Iterate through the prepopulated dict
for triple, values in splits_by_triple.items():
    split = triple.as_identifier()
    if len(values) == 0:
        logging.debug(f"{split} - No matches (0 matches found)")
        continue
    subconverter = msdf.converter.get_subconverter(
        [triple.subject_prefix, triple.object_prefix, triple.relation_prefix]
    )
    split_to_msdf[split] = from_sssom_dataframe(
        pd.DataFrame(values), prefix_map=dict(subconverter.bimap), meta=meta
    )
return split_to_msdf

This would drastically reduce the runtime as the size of msdf.df grows (whenever the subject_prefixes/object_prefixes/relations iterables are non-trivially long). In my case it went from ~10 minutes to nearly instant with a similar approach. The only added complexity is constructing an SSSOMTriple object for every row in the dataframe and for every element of the SxPxO product, but that is computationally cheap and should not add a noticeable performance penalty.
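For comparison, the same single pass could also be pushed down into pandas itself. This is only a rough sketch, assuming the same surrounding context as split_dataframe_by_prefix (df, msdf, meta, from_sssom_dataframe, the column constants, and Dict are all in scope); it groups on derived prefix columns, so there is no per-row Python loop at all:

# Build the set of (subject, relation, object) combinations we actually want,
# skipping prefixes that are missing from the converter (warnings omitted here)
wanted = {
    (subject_prefix, relation, object_prefix)
    for subject_prefix, object_prefix, relation in itt.product(
        subject_prefixes, object_prefixes, relations
    )
    if subject_prefix in msdf.converter.bimap and object_prefix in msdf.converter.bimap
}
split_to_msdf: Dict[str, MappingSetDataFrame] = {}
# Derive each row's subject/object prefix once, then group the whole frame in a single pass
subject_prefix_col = df[SUBJECT_ID].str.split(":", n=1).str[0]
object_prefix_col = df[OBJECT_ID].str.split(":", n=1).str[0]
for (subject_prefix, relation, object_prefix), df_subset in df.groupby(
    [subject_prefix_col, df[PREDICATE_ID], object_prefix_col]
):
    if (subject_prefix, relation, object_prefix) not in wanted:
        continue
    relation_prefix, relation_id = relation.split(":")
    split = f"{subject_prefix.lower()}_{relation_id.lower()}_{object_prefix.lower()}"
    subconverter = msdf.converter.get_subconverter(
        [subject_prefix, object_prefix, relation_prefix]
    )
    split_to_msdf[split] = from_sssom_dataframe(
        df_subset, prefix_map=dict(subconverter.bimap), meta=meta
    )

Whether the itertuples version or the groupby version is faster probably depends on the data, but either way the dataframe is only scanned once.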