-
Notifications
You must be signed in to change notification settings - Fork 25
Add Coverage Statistics Calculation Script for Mosdepth #460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
72cb659
310826f
44b6878
686ff54
bdc4f4e
13d51b3
13e1c11
ef3574f
229eb1d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,166 @@ | ||||||||||||||||||||||
| import argparse | ||||||||||||||||||||||
| import gzip | ||||||||||||||||||||||
| import json | ||||||||||||||||||||||
| import logging | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| import numpy as np | ||||||||||||||||||||||
| import pandas as pd | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| def parse_arguments(): | ||||||||||||||||||||||
| parser = argparse.ArgumentParser(description="Process coverage values.") | ||||||||||||||||||||||
| parser.add_argument("mosdepth_regions", type=str, | ||||||||||||||||||||||
| help="Mosdepth output file containing coverage values.") | ||||||||||||||||||||||
| parser.add_argument("--cov_col", type=int, default=4, | ||||||||||||||||||||||
| help="Column holding the coverage values.") | ||||||||||||||||||||||
| parser.add_argument("--output_prefix", type=str, | ||||||||||||||||||||||
| help="Prefix to use for output files.") | ||||||||||||||||||||||
| parser.add_argument("--round", type=int, default=2, help="Rounding precision.") | ||||||||||||||||||||||
| parser.add_argument("--debug", action="store_true", help="Debug mode.") | ||||||||||||||||||||||
| return parser.parse_args() | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
||||||||||||||||||||||
| def calculate_summary_statistics( | ||||||||||||||||||||||
| df: pd.DataFrame, cov_col: int, round_precision: int | ||||||||||||||||||||||
| ) -> dict: | ||||||||||||||||||||||
| """ | ||||||||||||||||||||||
| Calculate summary statistics for the coverage values. | ||||||||||||||||||||||
| @param df: Dataframe containing coverage values | ||||||||||||||||||||||
| @param cov_col: Column containing coverage values | ||||||||||||||||||||||
| @param round_precision: Rounding precision | ||||||||||||||||||||||
| @return: | ||||||||||||||||||||||
| """ | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| cov_data: pd.Series = df.iloc[:, cov_col - 1] | ||||||||||||||||||||||
| mean_val: float = cov_data.mean() | ||||||||||||||||||||||
| mad_val: float = np.mean(np.abs(cov_data - mean_val)) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| statistics = { | ||||||||||||||||||||||
| "cov_mean": round(mean_val, round_precision), | ||||||||||||||||||||||
| "cov_q1": round(cov_data.quantile(0.25), round_precision), | ||||||||||||||||||||||
| "cov_median": round(cov_data.median(), round_precision), | ||||||||||||||||||||||
| "cov_q3": round(cov_data.quantile(0.75), round_precision), | ||||||||||||||||||||||
| "cov_iqr": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision), | ||||||||||||||||||||||
| "cov_stdev": round(cov_data.std(), round_precision), | ||||||||||||||||||||||
| "cov_mad": round(mad_val, round_precision), | ||||||||||||||||||||||
| "cov_percent_above_4x": percentage_greater_than_4x(df, cov_col, round_precision), | ||||||||||||||||||||||
| "cov_evenness_score": calculate_evenness_score(df, cov_col, round_precision) | ||||||||||||||||||||||
| } | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| # Replace Nan values with null | ||||||||||||||||||||||
| for key, value in statistics.items(): | ||||||||||||||||||||||
| if pd.isna(value): | ||||||||||||||||||||||
| statistics[key] = "null" | ||||||||||||||||||||||
|
Comment on lines
+50
to
+53
|
||||||||||||||||||||||
| # Replace Nan values with null | |
| for key, value in statistics.items(): | |
| if pd.isna(value): | |
| statistics[key] = "null" | |
| # Replace NaN values with None | |
| for key, value in statistics.items(): | |
| if pd.isna(value): | |
| statistics[key] = None |
Copilot
AI
Jun 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting the series to a list and then summing in Python is less efficient. Consider using vectorized operations: d2 = cov_data[cov_data <= mean_coverage], then d2_count = d2.size and d2_sum = d2.sum().
| d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist() | |
| # count of coverage values that are less than or equal to the mean coverage | |
| d2_count = len(d2) | |
| # sum of coverage values that are less than or equal to the mean coverage | |
| d2_sum = sum(d2) | |
| d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1] | |
| # count of coverage values that are less than or equal to the mean coverage | |
| d2_count = d2.size | |
| # sum of coverage values that are less than or equal to the mean coverage | |
| d2_sum = d2.sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSON key names (e.g.
cov_mean,cov_q1, etc.) don’t match the example output (mean_cov,q1_cov, etc.). Update the keys to align with the documented format.