Skip to content

Conversation

@cthoyt
Copy link
Member

@cthoyt cthoyt commented Sep 5, 2025

Related to #607

@ptgolden
Copy link

ptgolden commented Sep 5, 2025

Nice! Could you benchmark against https://github.com/monarch-initiative/mondo/raw/refs/heads/master/src/ontology/mappings/mondo.sssom.tsv? Here's my implementation in #609:

$ /usr/bin/time poetry run sssom split ../mondo/src/ontology/tmp/mondo.sssom.tsv -d mondo-test
54.97user 0.99system 0:55.21elapsed 101%CPU (0avgtext+0avgdata 1906496maxresident)k
42208inputs+49768outputs (5major+598126minor)pagefaults 0swaps

$ du mondo-test/
24888	mondo-test/

$ ls -lah mondo-test/ | wc -l
74

@ptgolden
Copy link

ptgolden commented Sep 5, 2025

Also to note-- those results from du and ls are identical to the invocation before #609.

@cthoyt
Copy link
Member Author

cthoyt commented Sep 5, 2025

(biopragmatics) cthoyt@Mac ~/d/sssom-py> /usr/bin/time sssom split ~/Downloads/mondo.sssom.tsv -d mondo-test
       12.97 real        12.87 user         3.13 sys
(biopragmatics) cthoyt@Mac ~/d/sssom-py> du mondo-test/
24936	mondo-test/
(biopragmatics) cthoyt@Mac ~/d/sssom-py> ls -lah mondo-test/ | wc -l
      21
(biopragmatics) cthoyt@Mac ~/d/sssom-py> 

@cthoyt
Copy link
Member Author

cthoyt commented Sep 5, 2025

i guess I should check to see how much of that is poetry nonsense, huh...

@ptgolden
Copy link

ptgolden commented Sep 5, 2025

Wow, interesting! When I profiled, ~90% of the runtime was in parsers.from_sssom_dataframe

(also, I may have a slow computer :-) )

@ptgolden
Copy link

ptgolden commented Sep 5, 2025

What list of files were created for you?

$ ls -lah mondo-test
total 25M
drwxrwxr-x  2 ptgolden ptgolden 4.0K Sep  5 11:25 .
drwxrwxr-x 14 ptgolden ptgolden 4.0K Sep  5 11:24 ..
-rw-rw-r--  1 ptgolden ptgolden  677 Sep  5 11:25 bfo_hasdbxref_bfo.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 8.1K Sep  5 11:25 mondo_broadmatch_icd10cm.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  756 Sep  5 11:25 mondo_broadmatch_mesh.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  696 Sep  5 11:25 mondo_broadmatch_ncit.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  675 Sep  5 11:25 mondo_broadmatch_sctid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  684 Sep  5 11:25 mondo_closematch_icd10cm.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 141K Sep  5 11:25 mondo_closematch_meddra.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  686 Sep  5 11:25 mondo_closematch_ncit.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  677 Sep  5 11:25 mondo_closematch_sctid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.2M Sep  5 11:25 mondo_exactmatch_doid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 187K Sep  5 11:25 mondo_exactmatch_icd10cm.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  19K Sep  5 11:25 mondo_exactmatch_icd10who.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 461K Sep  5 11:25 mondo_exactmatch_icd11.foundation.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  757 Sep  5 11:25 mondo_exactmatch_meddra.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.2M Sep  5 11:25 mondo_exactmatch_medgen.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 812K Sep  5 11:25 mondo_exactmatch_mesh.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 695K Sep  5 11:25 mondo_exactmatch_ncit.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  61K Sep  5 11:25 mondo_exactmatch_omimps.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.1M Sep  5 11:25 mondo_exactmatch_omim.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.1M Sep  5 11:25 mondo_exactmatch_orphanet.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 900K Sep  5 11:25 mondo_exactmatch_sctid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.2M Sep  5 11:25 mondo_exactmatch_umls.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.0K Sep  5 11:25 mondo_hasdbxref_birnlex.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 4.4K Sep  5 11:25 mondo_hasdbxref_csp.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 7.1K Sep  5 11:25 mondo_hasdbxref_decipher.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.2M Sep  5 11:25 mondo_hasdbxref_doid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.2M Sep  5 11:25 mondo_hasdbxref_gard.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 8.3K Sep  5 11:25 mondo_hasdbxref_gtr.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 6.5K Sep  5 11:25 mondo_hasdbxref_hgnc.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  53K Sep  5 11:25 mondo_hasdbxref_hp.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 256K Sep  5 11:25 mondo_hasdbxref_icd10cm.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 9.5K Sep  5 11:25 mondo_hasdbxref_icd10exp.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  20K Sep  5 11:25 mondo_hasdbxref_icd10who.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 473K Sep  5 11:25 mondo_hasdbxref_icd11.foundation.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  869 Sep  5 11:25 mondo_hasdbxref_icd9cm.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 536K Sep  5 11:25 mondo_hasdbxref_icd9.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  76K Sep  5 11:25 mondo_hasdbxref_icdo.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  735 Sep  5 11:25 mondo_hasdbxref_ido.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 147K Sep  5 11:25 mondo_hasdbxref_meddra.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.3M Sep  5 11:25 mondo_hasdbxref_medgen.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 856K Sep  5 11:25 mondo_hasdbxref_mesh.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  937 Sep  5 11:25 mondo_hasdbxref_mfomd.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  732 Sep  5 11:25 mondo_hasdbxref_mpath.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  747 Sep  5 11:25 mondo_hasdbxref_mth.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 737K Sep  5 11:25 mondo_hasdbxref_ncit.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  770 Sep  5 11:25 mondo_hasdbxref_ndfrt.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  735 Sep  5 11:25 mondo_hasdbxref_obi.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  818 Sep  5 11:25 mondo_hasdbxref_ogms.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 294K Sep  5 11:25 mondo_hasdbxref_omia.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  64K Sep  5 11:25 mondo_hasdbxref_omimps.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.1M Sep  5 11:25 mondo_hasdbxref_omim.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  57K Sep  5 11:25 mondo_hasdbxref_oncotree.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.2M Sep  5 11:25 mondo_hasdbxref_orphanet.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 4.1K Sep  5 11:25 mondo_hasdbxref_pmid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  836 Sep  5 11:25 mondo_hasdbxref_scdo.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 944K Sep  5 11:25 mondo_hasdbxref_sctid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.3M Sep  5 11:25 mondo_hasdbxref_umls.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  10K Sep  5 11:25 mondo_hasdbxref_wikipedia.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  958 Sep  5 11:25 mondo_narrowmatch_icd10cm.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  703 Sep  5 11:25 mondo_narrowmatch_mesh.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  778 Sep  5 11:25 mondo_narrowmatch_ncit.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  797 Sep  5 11:25 mondo_narrowmatch_sctid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.3K Sep  5 11:25 mondo_relatedmatch_icd10cm.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  909 Sep  5 11:25 mondo_relatedmatch_meddra.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 1.8K Sep  5 11:25 mondo_relatedmatch_mesh.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 4.1K Sep  5 11:25 mondo_relatedmatch_ncit.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.7K Sep  5 11:25 mondo_relatedmatch_sctid.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  695 Sep  5 11:25 omo_hasdbxref_omo.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  651 Sep  5 11:25 rdfs_hasdbxref_rdfs.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden 2.2K Sep  5 11:25 ro_hasdbxref_ro.sssom.tsv
-rw-rw-r--  1 ptgolden ptgolden  673 Sep  5 11:25 sssom_hasdbxref_sssom.sssom.tsv

@cthoyt cthoyt changed the title [do not merge] Demonstrate reimplementing dataframe split Add alternate dataframe split implementations Sep 22, 2025
@cthoyt cthoyt marked this pull request as ready for review September 22, 2025 06:26
@cthoyt cthoyt requested a review from matentzn September 22, 2025 06:44
@cthoyt
Copy link
Member Author

cthoyt commented Sep 22, 2025

@matentzn @ptgolden I have refactored this PR to retain the original implementation of the dataframe split method and add two new methods. This PR does not change any functionality of SSSOM-py, only allows opting into alternate (more idiomatic) implementations

All three implementations are tested against the existing unit tests. These unit tests have been iteratively improved over several previous PRs. This PR does not change any unit tests other than the minor refactor to allow passing the implementation name and running a loop to test all of them.

I suggest merging this so @ptgolden can better test their implementation in #611 and we can do a more direct benchmark

Copy link
Collaborator

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review the changes to src/sssom/parsers.py. If you want @cthoyt you can trust yourself and merge this (the test gave me that confidence). Or you can request a more careful review from @ptgolden - I leave it to you.

@input_argument
@output_directory_option
def split(input: str, output_directory: str) -> None:
@click.option("--method", type=click.Choice(typing.get_args(SplitMethod)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a sensible default to --method? So that an execultion with --method does not fail because of click config?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're understanding. This PR keeps the status quo 100% but adds a new method to opt into one of two better implementations. After we merge this, then we can test against @ptgolden 's PR, and ultimately decide on the final implementation.

WRT the --method, if none is given, it keeps the existing implementation as default.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? Wont the click parser complain when no --method is passed on the CMD?

@cthoyt cthoyt merged commit b5077e7 into master Sep 22, 2025
6 checks passed
@cthoyt cthoyt deleted the update-split-dataframe branch September 22, 2025 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants