-
Notifications
You must be signed in to change notification settings - Fork 15
Add alternate dataframe split implementations #608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Nice! Could you benchmark against https://github.com/monarch-initiative/mondo/raw/refs/heads/master/src/ontology/mappings/mondo.sssom.tsv? Here's my implementation in #609: $ /usr/bin/time poetry run sssom split ../mondo/src/ontology/tmp/mondo.sssom.tsv -d mondo-test
54.97user 0.99system 0:55.21elapsed 101%CPU (0avgtext+0avgdata 1906496maxresident)k
42208inputs+49768outputs (5major+598126minor)pagefaults 0swaps
$ du mondo-test/
24888 mondo-test/
$ ls -lah mondo-test/ | wc -l
74 |
|
Also to note-- those results from |
|
|
i guess I should check to see how much of that is poetry nonsense, huh... |
|
Wow, interesting! When I profiled, ~90% of the runtime was in (also, I may have a slow computer :-) ) |
|
What list of files were created for you? |
|
@matentzn @ptgolden I have refactored this PR to retain the original implementation of the dataframe split method and add two new methods. This PR does not change any functionality of SSSOM-py, only allows opting into alternate (more idiomatic) implementations All three implementations are tested against the existing unit tests. These unit tests have been iteratively improved over several previous PRs. This PR does not change any unit tests other than the minor refactor to allow passing the implementation name and running a loop to test all of them. I suggest merging this so @ptgolden can better test their implementation in #611 and we can do a more direct benchmark |
matentzn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @input_argument | ||
| @output_directory_option | ||
| def split(input: str, output_directory: str) -> None: | ||
| @click.option("--method", type=click.Choice(typing.get_args(SplitMethod))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a sensible default to --method? So that an execultion with --method does not fail because of click config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're understanding. This PR keeps the status quo 100% but adds a new method to opt into one of two better implementations. After we merge this, then we can test against @ptgolden 's PR, and ultimately decide on the final implementation.
WRT the --method, if none is given, it keeps the existing implementation as default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? Wont the click parser complain when no --method is passed on the CMD?
Related to #607