Commit e580493

docs: add explanations and sources to config/config.yaml (#162)
Following a number of setups of this workflow with a number of different users, I realized that the configuration is not well explained, so far. So this is a major overhaul that:

1. Moves all information on `config/config.yaml` entries out of the `config/README.md` and into comments in the file itself.
2. Explains every `config/config.yaml` entry in detail, with lots of linkouts to documentation and reference sources.

This should make it a lot easier for new users to configure the workflow to what they want.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Documentation**
  * Configuration files now include detailed explanatory comments and updated guidance for all major RNA-seq workflow settings.
  * New sections added for meta comparisons and bootstrap plots, with comprehensive usage instructions.
  * Default values and file paths updated for reference data, gene lists, and gene sets.
  * Documentation now directs users to rely on in-file comments for configuration details, streamlining external documentation.
  * Simplified alternate config files by removing most explanatory comments and consolidating guidance.
* **New Features**
  * Added configuration options for meta comparisons and bootstrap plotting.
* **Chores**
  * Improved clarity and organization of configuration files and documentation.
  * Updated version numbers for reference releases and adjusted default parameters for analysis tools.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Adrian Prinz <44083468+Addimator@users.noreply.github.com>
1 parent 59dd1c7 commit e580493

4 files changed, +187 -249 lines changed

.test/config/config.yaml

Lines changed: 13 additions & 84 deletions
@@ -1,163 +1,92 @@
 samples: config/samples.tsv
 units: config/units.tsv
 
+##########################################################
+# FOR A COMMENTED VERSION OF THIS CONFIGURATION FILE, #
+# PLEASE SEE THE MAIN `config/config.yaml` FILE #
+##########################################################
+
 experiment:
-  # If set to `true`, this option allows the workflow to analyse 3-prime RNA seq data obtained from Quantseq protocol by Lexogen.
-  # For more information https://www.lexogen.com/quantseq-3mrna-sequencing/
   3-prime-rna-seq:
     activate: false
-    # Specify vendor of the used protocol. Currently, only lexogene is supported.
     vendor: lexogen
-    # this allows to plot QC of aligned read postion for specific transcripts (or 'all' transcripts)
     plot-qc: all
 
 resources:
   ref:
-    # ensembl species name
     species: homo_sapiens
-    # ensembl release version
-    release: "113"
-    # genome build
+    release: "114"
     build: GRCh38
-    # pfam release to use for annotation of domains in differential splicing analysis
-    pfam: "33.0"
-    # Choose strategy for selecting representative transcripts for each gene.
-    # Possible values:
-    # - canonical (use the canonical transcript from ensembl, only works for human at the moment)
-    # - mostsignificant (use the most significant transcript)
-    # - path/to/any/file.txt (a path to a file with ensembl transcript IDs to use;
-    # the user has to ensure that there is only one ID per gene given)
+    pfam: "37.0"
     representative_transcripts: canonical
   ontology:
-    # gene ontology to download, used e.g. in goatools
-    gene_ontology: "http://current.geneontology.org/ontology/go-basic.obo"
+    gene_ontology: "https://release.geneontology.org/2025-07-22/ontology/go-basic.obo"
 
 pca:
-  # If set to true, samples with NA values in the specified covariate column will be removed for PCA computation;
   exclude_nas: false
   labels:
-    # columns of sample sheet to use for PCA
     - condition
 
 scatter:
-  # for use as diagnostic plots
-  # all samples are compared in pairs to assess their correlation
-  # scatter plots are only created if parameter 'activate' is set to 'true'
   activate: true
 
 diffexp:
-  # samples to exclude (e.g. outliers due to technical problems)
   exclude:
-  # model for sleuth differential expression analysis
   models:
     model_X:
       full: ~condition + batch_effect
       reduced: ~batch_effect
-      # Covariate / sample sheet column that shall be used for fold
-      # change/effect size based downstream analyses.
       primary_variable: condition
-      # base level of the primary variable (this should be one of the entries
-      # in the primary_variable sample sheet column and will be considered as
-      # denominator in the fold change/effect size estimation).
       base_level: untreated
-  # significance level to use for volcano, ma- and qq-plots
   sig-level:
     volcano-plot: 0.05
     ma-plot: 0.05
     qq-plot: 0.05
-  # Optional (comment in to use): provide a list of genes that shall be shown in a heatmap
-  # and for which bootstrap plots (see below) shall be created.
   genes_of_interest:
     activate: true
     genelist: "resources/gene_list.tsv"
 
 diffsplice:
   activate: false
-  # codingCutoff parameter of isoformSwitchAnalyzer, see
-  # https://rdrr.io/bioc/IsoformSwitchAnalyzeR/man/analyzeCPAT.html
   coding_cutoff: 0.725
-  # Should be set to true when using de-novo assembled transcripts.
   remove_noncoding_orfs: false
-  # False discovery rate to control for.
   fdr: 1.0
-  # Minimum size of differential isoform usage effect
-  # (see dIFcutoff, https://rdrr.io/github/kvittingseerup/IsoformSwitchAnalyzeR/man/IsoformSwitchTestDEXSeq.html)
-  min_effect_size: 0.0
+  min_effect_size: 0.1
 
 enrichment:
   goatools:
-    # tool is only run if set to `true`
     activate: true
     fdr_genes: 0.05
     fdr_go_terms: 0.05
   fgsea:
-    # If activated, you need to provide a GMT file with gene sets of interest.
-    # A GMT file can contain multiple gene sets. So if you want to test for
-    # enrichment in multiple sets of gene sets, please merge your gene sets
-    # into one GMT file and provide this here. Don't forget to document your
-    # gene set sources and to cite them upon publication.
-    gene_sets_file: "ngs-test-data/ref/dummy.gmt"
-    # tool is only run if set to `true`
     activate: true
-    # if activated, you need to provide a GMT file with gene sets of interest
+    gene_sets_file: "ngs-test-data/ref/dummy.gmt"
     fdr_gene_set: 0.05
     eps: 0.0001
   spia:
-    # tool is only run if set to `true`
     activate: true
-    # pathway databases to use in SPIA
-    # The database needs to be available for the species specified by
-    # resources -> ref -> species above, which you can check via the graphite
-    # package function pathwayDatabases():
-    # https://rdrr.io/bioc/graphite/man/pathwayDatabases.html
-    # The actual list is maintained in the code (make sure to select the
-    # correct commit/version of the repository):
-    # https://github.com/sales-lab/graphite/blob/1fa6ebfb41c1cd01f8b6e7f76f45ee76a15a3450/R/fetch.R#L43
-    # Or you can look it up in the graphite documentation PDF (page 6):
-    # https://bioconductor.org/packages/release/bioc/vignettes/graphite/inst/doc/graphite.pdf
     pathway_databases:
       - panther
 
 meta_comparisons:
-  # comparison is only run if set to `true`
   activate: false
-  # Define here the comparisons under interest
   comparisons:
-    # Define any name for comparison. You can add as many comparisions as you want
-    model_X_vs_model_Y:
+    treated_vs_untreated_against_other_model:
       items:
-        # Define the two underlying models for the comparison. The models must be defined in the diffexp/models in the config
-        # items must be of form <arbitrary label>: <existing diffexp model from config>
-        X: model_X
+        treated_vs_untreated: treated_vs_untreated
         Y: model_Y
-      # Define label for datavzrd report
-      label: model X vs. model Y
+      label: treated_vs_untreated compared to other_model
 
 report:
-  # make this `true`, to get excel files for download in the snakemake
-  # report, BUT: this can drastically increase the runtime of datavzrd report
-  # generation, especially on larger cohorts
   offer_excel: true
 
 bootstrap_plots:
-  # desired false discovery rate for bootstrap plots, i.e. a lower FDR will result in fewer boxplots generated
   FDR: 0.01
-  # maximum number of bootstrap plots to generate, i.e. top n discoveries to plot
   top_n: 3
   color_by: condition
-  # for now, this will plot the sleuth-normalised kallisto count estimations with kallisto
-  # for all the transcripts of the respective genes
 
 plot_vars:
-  # significance level used for plot_vars() plots
   sig_level: 0.1
 
 params:
-  # for kallisto parameters, see the kallisto manual:
-  # https://pachterlab.github.io/kallisto/manual
-  # reasoning behind parameters:
-  # * `-b 100`: Doing 100 bootstrap samples was used by the tool authors
-  # [when originally introducing the feature](https://github.com/pachterlab/kallisto/issues/11#issuecomment-74346385).
-  # If you want to decrease this for larger datasets, there paper and
-  # [a reply on GitHub suggest a value of `-b 30`](https://github.com/pachterlab/kallisto/issues/353#issuecomment-1215742328).
   kallisto: "-b 30"
Lines changed: 12 additions & 80 deletions
@@ -1,157 +1,89 @@
 samples: config/samples.tsv
 units: config/units.tsv
 
+##########################################################
+# FOR A COMMENTED VERSION OF THIS CONFIGURATION FILE, #
+# PLEASE SEE THE MAIN `config/config.yaml` FILE #
+##########################################################
+
 experiment:
-  # If set to `true`, this option allows the workflow to analyse 3-prime RNA seq data obtained from Quantseq protocol by Lexogen.
-  # For more information https://www.lexogen.com/quantseq-3mrna-sequencing/
   3-prime-rna-seq:
     activate: true
-    # this allows to plot QC of aligned read postion for specific transcripts (or 'all' transcripts)
-    # Specify vendor of the used protocol. Currently, only lexogene is supported.
     vendor: lexogen
     plot-qc: all
 
 resources:
   ref:
-    # ensembl species name
     species: homo_sapiens
-    # ensembl release version
-    release: "113"
-    # genome build
+    release: "114"
     build: GRCh38
-    # pfam release to use for annotation of domains in differential splicing analysis
-    pfam: "33.0"
-    # Choose strategy for selecting representative transcripts for each gene.
-    # Possible values:
-    # - canonical (use the canonical transcript from ensembl, only works for human at the moment)
-    # - mostsignificant (use the most significant transcript)
-    # - path/to/any/file.txt (a path to a file with ensembl transcript IDs to use;
-    # the user has to ensure that there is only one ID per gene given)
+    pfam: "37.0"
     representative_transcripts: canonical
   ontology:
-    # gene ontology to download, used e.g. in goatools
-    gene_ontology: "http://current.geneontology.org/ontology/go-basic.obo"
+    gene_ontology: "https://release.geneontology.org/2025-07-22/ontology/go-basic.obo"
 
 pca:
-  # If set to true, samples with NA values in the specified covariate column will be removed for PCA computation.
   exclude_nas: false
   labels:
-    # columns of sample sheet to use for PCA
     - condition
 
 scatter:
-  # for use as diagnostic plots
-  # all samples are compared in pairs to assess their correlation
-  # scatter plots are only created if parameter 'activate' is set to 'true'
   activate: true
 
 diffexp:
-  # samples to exclude (e.g. outliers due to technical problems)
   exclude:
-  # model for sleuth differential expression analysis
   models:
     model_X:
       full: ~condition
       reduced: ~1
-      # Covariate / sample sheet column that shall be used for fold
-      # change/effect size based downstream analyses.
       primary_variable: condition
-      # base level of the primary variable (this should be one of the entries
-      # in the primary_variable sample sheet column and will be considered as
-      # denominator in the fold change/effect size estimation).
       base_level: Control
-  # significance level to use for volcano, ma- and qq-plots
   sig-level:
     volcano-plot: 0.05
     ma-plot: 0.05
     qq-plot: 0.05
-  # Optional (comment in to use): provide a list of genes that shall be shown in a heatmap
-  # and for which bootstrap plots (see below) shall be created.
   genes_of_interest:
     activate: false
     genelist: "resources/gene_list.tsv"
 
 diffsplice:
   activate: false
-  # codingCutoff parameter of isoformSwitchAnalyzer, see
-  # https://rdrr.io/bioc/IsoformSwitchAnalyzeR/man/analyzeCPAT.html
   coding_cutoff: 0.725
-  # Should be set to true when using de-novo assembled transcripts.
   remove_noncoding_orfs: false
-  # False discovery rate to control for.
   fdr: 1.0
-  # Minimum size of differential isoform usage effect
-  # (see dIFcutoff, https://rdrr.io/github/kvittingseerup/IsoformSwitchAnalyzeR/man/IsoformSwitchTestDEXSeq.html)
   min_effect_size: 0.0
 
 enrichment:
   goatools:
-    # tool is only run if set to `true`
     activate: true
     fdr_genes: 0.05
     fdr_go_terms: 0.05
   fgsea:
-    # If activated, you need to provide a GMT file with gene sets of interest.
-    # A GMT file can contain multiple gene sets. So if you want to test for
-    # enrichment in multiple sets of gene sets, please merge your gene sets
-    # into one GMT file and provide this here. Don't forget to document your
-    # gene set sources and to cite them upon publication.
-    gene_sets_file: "config/gene_sets.gmt"
-    # tool is only run if set to `true`
     activate: true
-    # if activated, you need to provide a GMT file with gene sets of interest
+    gene_sets_file: "config/gene_sets.gmt"
     fdr_gene_set: 0.05
     eps: 0.0001
   spia:
-    # tool is only run if set to `true`
     activate: true
-    # pathway databases to use in SPIA
-    # The database needs to be available for the species specified by
-    # resources -> ref -> species above, which you can check via the graphite
-    # package function pathwayDatabases():
-    # https://rdrr.io/bioc/graphite/man/pathwayDatabases.html
-    # The actual list is maintained in the code (make sure to select the
-    # correct commit/version of the repository):
-    # https://github.com/sales-lab/graphite/blob/1fa6ebfb41c1cd01f8b6e7f76f45ee76a15a3450/R/fetch.R#L43
-    # Or you can look it up in the graphite documentation PDF (page 6):
-    # https://bioconductor.org/packages/release/bioc/vignettes/graphite/inst/doc/graphite.pdf
     pathway_databases:
       - panther
 
 meta_comparisons:
-  # comparison is only run if set to `true`
   activate: false
-  # Define here the comparisons under interest
   comparisons:
-    # Define any name for comparison. You can add as many comparisions as you want
-    model_X_vs_model_Y:
+    treated_vs_untreated_against_other_model:
       items:
-        # Define the two underlying models for the comparison. The models must be defined in the diffexp/models in the config
-        # items must be of form <arbitrary label>: <existing diffexp model from config>
-        X: model_X
+        treatead_vs_untreated: treated_vs_untreated
         Y: model_Y
-      # Define label for datavzrd report
-      label: model X vs. model Y
+      label: treated_vs_untreated compared to other_model
 
 bootstrap_plots:
-  # desired false discovery rate for bootstrap plots, i.e. a lower FDR will result in fewer boxplots generated
   FDR: 0.01
-  # maximum number of bootstrap plots to generate, i.e. top n discoveries to plot
   top_n: 3
   color_by: condition
-  # for now, this will plot the sleuth-normalised kallisto count estimations with kallisto
-  # for all the transcripts of the respective genes
 
 plot_vars:
-  # significance level used for plot_vars() plots
   sig_level: 0.1
 
 params:
-  # for kallisto parameters, see the kallisto manual:
-  # https://pachterlab.github.io/kallisto/manual
-  # reasoning behind parameters:
-  # * `-b 100`: Doing 100 bootstrap samples was used by the tool authors
-  # [when originally introducing the feature](https://github.com/pachterlab/kallisto/issues/11#issuecomment-74346385).
-  # If you want to decrease this for larger datasets, there paper and
-  # [a reply on GitHub suggest a value of `-b 30`](https://github.com/pachterlab/kallisto/issues/353#issuecomment-1215742328).
   kallisto: "-b 30"
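
As a rough illustration of the `meta_comparisons` entry touched in both test configs: the removed comments state that the underlying models must be defined under `diffexp: models:` and that each item takes the form `<arbitrary label>: <existing diffexp model from config>`. The sketch below assumes two hypothetical models, `model_X` and `model_Y`. The `treatment` column, the `control` base level and the exact nesting are assumptions made for illustration and are not copied from the workflow's main `config/config.yaml`.

```yaml
diffexp:
  models:
    # hypothetical model names; meta_comparisons refers to them below
    model_X:
      full: ~condition + batch_effect
      reduced: ~batch_effect
      primary_variable: condition
      base_level: untreated
    model_Y:
      full: ~treatment            # hypothetical sample sheet column
      reduced: ~1
      primary_variable: treatment
      base_level: control         # hypothetical level of that column

meta_comparisons:
  # the comparison is only run if set to `true`
  activate: true
  comparisons:
    # arbitrary name for this comparison
    model_X_vs_model_Y:
      items:
        # <arbitrary label>: <existing diffexp model from config>
        X: model_X
        Y: model_Y
      # label shown in the datavzrd report
      label: model X vs. model Y
```

Because the item labels are arbitrary, only the values on the right-hand side need to match names that exist under `diffexp: models:`.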

config/README.md

Lines changed: 4 additions & 37 deletions
@@ -5,6 +5,9 @@ To configure this workflow, modify the following files to reflect your dataset a
 * `config/units.tsv`: (sequencing) units sheet with raw data paths
 * `config/config.yaml`: general workflow configuration and differential expression model setup
 
+For the `samples.tsv` and `units.tsv`, we explain the expected columns right here, in this `README.md` file.
+For the `config.yaml` file, all entries are explained in detail in its comments.
+
 
 ## samples sheet
 
@@ -83,41 +86,5 @@ The `fastp` equivalents, including minimal deviations from the recommendations,
 ## config.yaml
 
 This file contains the general workflow configuration and the setup for the differential expression analysis performed by sleuth.
-Configurable options should be explained in the comments above the respective entry or right here in this `config/README.md` section.
+Configurable options should be explained in the comments above the respective entry, so the easiest way to set it up for your workflow is to carefully read through the `config/config.yaml` file and adjust it to your needs.
 If something is unclear, don't hesitate to [file an issue in the `rna-seq-kallisto-sleuth` GitHub repository](https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/issues/new/choose).
-
-### differential expression model setup
-
-The core functionality of this workflow is provided by the software [`sleuth`](https://pachterlab.github.io/sleuth/about).
-You can use it to test for differential expression of genes or transcripts between two or more subgroups of samples.
-
-#### main sleuth model
-
-The main idea of sleuth's internal model, is to test a `full:` model (containing (a) variable(s) of interest AND batch effects) against a `reduced:` model (containing ONLY the batch effects).
-So these are the most important entries to set up under any model that you specify via `diffexp: models:`.
-If you don't know any batch effects, the `reduced:` model will have to be `~1`.
-Otherwise it will be the tilde followed by an addition of the names of any columns that contain batch effects, for example: `reduced: ~batch_effect_1 + batch_effect_2`.
-The full model than additionally includes variables of interest, so fore example: `full: ~variable_of_interest + batch_effect_1 + batch_effect_2`.
-
-#### sleuth effect sizes
-
-Effect size estimates are calculated as so-called beta-values by `sleuth`.
-For binary comparisons (your variable of interest has two factor levels), they resemble a log2 fold change.
-To know which variable of interest to use for the effect size calculation, you need to provide its column name as the `primary_variable:`.
-And for sleuth to know what level of that variable of interest to use as the base level, specify the respective entry as the `base_level:`.
-
-### preprocessing `params`
-
-For **transcript quantification**, `kallisto` is used.
-For details regarding its command line arguments, see the [`kallisto` documentation](https://pachterlab.github.io/kallisto/manual).
-
-#### Lexogen 3' QuantSeq data analysis
-
-For Lexogen 3' QuantSeq data analysis, please set `experiment: 3-prime-rna-seq: activate: true` in the `config/config.yaml` file.
-For more information information on Lexogen QuantSeq 3' sequencing, see: https://www.lexogen.com/quantseq-3mrna-sequencing/
-
-### meta comparisons
-Meta comparisons allow for comparing two full models against each other.
-The axes represent the log2-fold changes (beta-scores) for the two models, with each point representing a gene.
-Points on the diagonal indicate no difference between the comparisons, while deviations from the diagonal suggest differences in gene expression between the treatments.
-For more details see the comments in the `config.yaml`.
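
The model setup described in the README text removed by this commit can be sketched as a single `diffexp: models:` entry. This is only a minimal illustration: the model name `my_model`, the column names `variable_of_interest`, `batch_effect_1` and `batch_effect_2`, and the base level `untreated` are hypothetical placeholders, and the comments paraphrase the removed README text rather than quoting the new `config/config.yaml`.

```yaml
diffexp:
  models:
    my_model:  # hypothetical model name
      # full model: variable(s) of interest AND batch effects
      full: ~variable_of_interest + batch_effect_1 + batch_effect_2
      # reduced model: ONLY the batch effects; use `~1` if no batch effects are known
      reduced: ~batch_effect_1 + batch_effect_2
      # sample sheet column used for the effect size (beta value) calculation
      primary_variable: variable_of_interest
      # level of the primary variable that serves as the base level
      base_level: untreated
```

With a two-level variable of interest, the resulting beta values resemble a log2 fold change relative to the `base_level`.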
