Commit f624e84

Updated for NAAM v2.0.0
1 parent 307e29f commit f624e84

2 files changed: +106 -60 lines changed

posts/NAAM-02-preparation.qmd

Lines changed: 2 additions & 2 deletions
@@ -19,10 +19,10 @@ This workflow is distributed as a self-contained Singularity container image, wh
1919
The singularity container needs an image file to activate the precompiled work environment. You can download the required workflow image file (naam_workflow.sif) directly through the terminal via:
2020

2121
``` bash
22-
wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v1.1.2/naam_workflow.sif
22+
wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v2.0.0/naam_workflow.sif
2323
```
2424

25-
Or go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v1.1.2){target="_blank"} and manually download it there, then transfer it to your HPC system.
25+
Or go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v2.0.0){target="_blank"} and manually download it there, then transfer it to your HPC system.
2626

2727
### 1.2 Verify container {.unnumbered}
2828

posts/NAAM-07-nanopore_hpc.qmd

Lines changed: 104 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ We can make use of a tool called [Snakemake](https://snakemake.readthedocs.io/en
1212

1313
To run the automated workflow, you'll need to make sure that your project directory is set up correctly.
1414

15-
To make the project setup process even easier, we've created a simple command-line tool called `amplicon_project.py`. This tool automates the creation of the project directory, the sample configuration file (`sample.tsv`), and the general settings configuration file (`config.yaml`), guiding you through each step with clear prompts and error checking.
15+
To make the project setup process even easier, we've created a simple command-line tool called `amplicon_project.py`. This tool automates the creation of the project directory and the sample configuration file (`sample.tsv`), guiding you through each step with clear prompts and error checking.
1616

1717
The amplicon_project.py tool is built into the singularity container image. Instead of using `singularity shell`, we can use `singularity exec` to directly execute commands. Try accessing amplicon_project.py:
1818

@@ -21,10 +21,9 @@ singularity exec naam_workflow.sif python /amplicon_project.py --help
2121
```
2222

2323
```default
24-
usage: amplicon_project.py [-h] [-p PROJECT_DIR] -n STUDY_NAME -d RAW_FASTQ_DIR -P PRIMER -r PRIMER_REFERENCE -R REFERENCE_GENOME [-m MIN_LENGTH] [-c COVERAGE] [-t THREADS] [--use_sars_cov_2_workflow]
25-
[--nextclade_dataset NEXTCLADE_DATASET]
24+
usage: amplicon_project.py [-h] [-p PROJECT_DIR] -n STUDY_NAME -d RAW_FASTQ_DIR --virus-config VIRUS_CONFIG --sample-map SAMPLE_MAP
2625
27-
Interactive tool for setting up a Snakemake project.
26+
Interactive tool for setting up a multi-virus amplicon analysis project.
2827
2928
options:
3029
-h, --help show this help message and exit
@@ -33,88 +32,133 @@ options:
3332
-n STUDY_NAME, --study_name STUDY_NAME
3433
Name of the study
3534
-d RAW_FASTQ_DIR, --raw_fastq_dir RAW_FASTQ_DIR
36-
Directory containing raw FASTQ files
37-
-P PRIMER, --primer PRIMER
38-
Fasta file containing primer sequences
39-
-r PRIMER_REFERENCE, --primer_reference PRIMER_REFERENCE
40-
Fasta file containing primer reference sequence
41-
-R REFERENCE_GENOME, --reference_genome REFERENCE_GENOME
42-
Fasta file containing reference genome
43-
-m MIN_LENGTH, --min_length MIN_LENGTH
44-
Minimum read length (default: 1000)
45-
-c COVERAGE, --coverage COVERAGE
46-
Minimum coverage required for consensus (default: 30)
47-
-t THREADS, --threads THREADS
48-
Maximum number of threads for the Snakefile (default: 8)
49-
--use_sars_cov_2_workflow
50-
Add this parameter if you want to analyze SARS-CoV-2 data
51-
--nextclade_dataset NEXTCLADE_DATASET
52-
Path to a custom Nextclade dataset directory, OR an official Nextclade dataset name (e.g., 'nextstrain/sars-cov-2/wuhan-hu-1/orfs'). Check official nextclade datasets with `nextclade
53-
dataset list`.
35+
Directory containing raw FASTQ barcode subdirectories
36+
--virus-config VIRUS_CONFIG
37+
Path to the virus configuration YAML file.
38+
--sample-map SAMPLE_MAP
39+
Path to the sample map CSV file.
40+
(CSV must have 'barcode_dir' and 'virus_id' columns)
5441
```
5542

56-
Now prepare your project directory with **amplicon_project.py** as follows:
43+
You can prepare your project directory with **amplicon_project.py** as follows:
5744

5845
``` bash
5946
singularity exec \
60-
--bind /mnt/viro0002-data:/mnt/viro0002-data \
47+
--bind /mnt/viro0002-data:/mnt/viro0002-data \ # Use whatever --bind is required for your context
6148
--bind $HOME:$HOME \
6249
--bind $PWD:$PWD \
6350
naam_workflow.sif \
6451
python /amplicon_project.py \
6552
-p {project.folder} \
6653
-n {name} \
6754
-d {reads} \
68-
-m {min_length} \
69-
-c {coverage} \
70-
-P {primer} \
71-
-r {primer.reference} \
72-
-R {reference} \
73-
-t {threads} \
74-
--nextclade_dataset {nextclade.dataset}
55+
--virus-config {virus_config.yaml} \
56+
--sample-map {sample_map.csv}
7557
```
7658

77-
REQUIRED ARGUMENTS:
59+
::: callout-important
60+
Please use absolute paths for the **reads**, **virus_config.yaml** and **sample_map.csv** so that they can always be located.
61+
:::
7862

7963
- `{project.folder}` is your project folder. This is where you run your workflow and store results.
8064
- `{name}` is the name of your study, no spaces allowed.
8165
- `{reads}` is the folder that contains your barcode directories (e.g. barcode01, barcode02).
82-
- `{min_length}` is the minimum length required for the reads to be accepted. This must be below the expected size of the amplicon, for example, for the 2500nt mpox amplicon we use a threshold of 1000
83-
- `{coverage}` is the minimum coverage required, anything lower than 30 is not recommended, for low accuracy basecalling, higher coverage is recommended.
84-
- `{primer}` is the file containing the primer sequences
85-
- `{primer.reference}` is the reference sequence .fasta file used for primer trimming.
86-
- `{reference}` is the reference sequence .fasta file used for the consensus generation.
66+
- `{virus_config.yaml}` is a yaml file specifying run parameters for your viruses.
67+
- `{sample_map.csv}` is a **comma separated** file, specifying sample + virus combination.
8768

88-
OPTIONAL ARGUMENTS:
89-
90-
- `{nextclade.dataset}` the path to an official or custom nextclade dataset. A list official nextclade datasets can be checked with the following command: `singularity exec naam_workflow.sif nextclade dataset list`. If you are using a self made custom nextclade dataset, then please provide the absolute path to the dataset.
91-
92-
::: callout-important
93-
Please use absolute paths for the **reads**, **primers** and **references** so that they can always be located.
94-
:::
69+
### Binding directories
9570

9671
The `--bind` arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container. The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted.
9772

9873
As a default, Singularity often automatically binds your home directory (`$HOME`) and the current directory (`$PWD`). We also explicitly bind `/mnt/viro0002-data` in this example. If your input files (reads, reference, databases) or output project directory reside outside these locations, you MUST add specific `--bind /host/path:/container/path` options for those locations, otherwise the container won't be able to find them.
9974

100-
Once the setup is completed, move to your newly created project directory with `cd`, check where you are with `pwd`.
75+
### Virus config and sample map
76+
77+
The **virus config** file is the central "database" of all supported viruses and their specific parameters. This file contains information about file paths for references and primers, as well as analysis parameters for each virus.
78+
79+
Example of a virus config:
80+
```yaml
81+
sars-cov-2:
82+
# Paths to reference and primer files
83+
reference_genome: /path/to/reference.fasta
84+
primer: /path/to/primer.fasta
85+
primer_reference: /path/to/primer_reference.fasta
86+
87+
# Required analysis parameters
88+
min_length: 250
89+
coverage: 30
90+
91+
# Optional workflow steps
92+
run_nextclade: true
93+
nextclade_dataset: 'nextstrain/sars-cov-2/wuhan-hu-1/orfs' # Official Nextclade maintained dataset
94+
95+
measles:
96+
reference_genome: /path/to/reference.fasta
97+
primer: /path/to/primer.fasta
98+
primer_reference: /path/to/primer_reference.fasta
99+
100+
min_length: 100
101+
coverage: 30
102+
103+
run_nextclade: true
104+
nextclade_dataset: '/path/to/custom/measles/dataset' # Custom user created dataset
105+
106+
mpox:
107+
reference_genome: /path/to/reference.fasta
108+
primer: /path/to/primer.fasta
109+
primer_reference: /path/to/primer_reference.fasta
110+
111+
min_length: 1000
112+
coverage: 30
113+
114+
run_nextclade: false # Nextclade will not run for this virus
115+
nextclade_dataset: null
116+
117+
# ... add other viruses if needed.
118+
```
119+
Key parameters within each virus entry include:
101120

102-
Next, use the `ls` command to list the files in the project directory and check if the following files are present: `sample.tsv`, `config.yaml` and `Snakefile`.
121+
- `reference_genome, primer, primer_reference`: Absolute paths to the respective FASTA files.
122+
- `min_length`: The minimum read length to keep after QC. Must be below the expected amplicon size.
123+
- `coverage`: The minimum read depth required for consensus calling. 30x is a common minimum.
124+
- `run_nextclade`: Set to true to enable the Nextclade analysis for this virus, false otherwise.
125+
- `nextclade_dataset`: Path to a Nextclade dataset. This can be an official dataset name (the workflow will download it) or an absolute path to a custom dataset you have locally.
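
Before running the setup, it can help to sanity-check a virus config entry against the parameters listed above. The following is a minimal sketch, not part of `amplicon_project.py`: the helper name `check_virus_entry` is hypothetical, and the rules it applies (all three paths plus `min_length` and `coverage` are required, file paths must be absolute, `run_nextclade: true` needs a dataset) are assumptions drawn from the description above. The config is written as a plain Python dictionary so the example does not depend on a YAML library.

```python
import posixpath

# Assumption: these keys are required for every virus entry.
REQUIRED_KEYS = ("reference_genome", "primer", "primer_reference",
                 "min_length", "coverage")
PATH_KEYS = ("reference_genome", "primer", "primer_reference")


def check_virus_entry(virus_id, entry):
    """Return a list of human-readable problems found in one virus entry."""
    problems = []
    # Every required key must be present and not null.
    for key in REQUIRED_KEYS:
        if entry.get(key) is None:
            problems.append(f"{virus_id}: missing required key '{key}'")
    # FASTA paths should be absolute so they can always be located.
    for key in PATH_KEYS:
        path = entry.get(key)
        if isinstance(path, str) and not posixpath.isabs(path):
            problems.append(f"{virus_id}: '{key}' is not an absolute path")
    # Assumption: enabling Nextclade without a dataset is an error.
    if entry.get("run_nextclade") and not entry.get("nextclade_dataset"):
        problems.append(f"{virus_id}: run_nextclade is true but no dataset is set")
    return problems


# Illustrative entries only; the measles entry is deliberately incomplete.
example_config = {
    "mpox": {
        "reference_genome": "/path/to/reference.fasta",
        "primer": "/path/to/primer.fasta",
        "primer_reference": "/path/to/primer_reference.fasta",
        "min_length": 1000,
        "coverage": 30,
        "run_nextclade": False,
        "nextclade_dataset": None,
    },
    "measles": {
        "reference_genome": "refs/measles.fasta",  # relative path
        "primer": "/path/to/primer.fasta",
        "primer_reference": "/path/to/primer_reference.fasta",
        "min_length": 100,                          # coverage is missing
        "run_nextclade": True,
        "nextclade_dataset": None,                  # no dataset given
    },
}

for virus_id, entry in example_config.items():
    for problem in check_virus_entry(virus_id, entry):
        print(problem)
```

The check only inspects the structure of the entry; it does not verify that the files exist inside the container.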
103126

104-
- The **sample.tsv** should have 9 columns:
105-
- `unique_id`: the unique sample name that's generated based on the barcode directories.
106-
- `sequence_name`: the name given to the consensus sequence at the end of the pipeline. It's generated with the following template: {study_name}_{unique_id}.
107-
- `fastq_path`: the location of the raw .fastq.gz files per sample.
108-
- `reference`: the location of the reference sequence for the consensus generation.
109-
- `primers`: the location of the file containing the primer sequences.
110-
- `primer_reference`: the location of the reference sequence for primer trimming.
111-
- `coverage`: minimum coverage required.
112-
- `min_length`: minimum length required.
113-
- `nextclade_db`: path to the nextclade dataset. This column can be empty if you're not including Nextclade.
127+
The **sample map** is a simple file to link each barcode/sample to a virus from the config. For each sample, amplicon_project.py uses its virus_id to pull the correct parameters (paths, min_length, etc.) from the virus config and place them into a sample.tsv.
114128

115-
- The **config.yaml** determines if the SARS_CoV_2 section of the workflow is enabled and the amount of default threads to use.
129+
Example of a sample map:
130+
```bash
131+
barcode_dir,virus_id
132+
barcode01,sars-cov-2
133+
barcode02,sars-cov-2
134+
barcode03,measles
135+
barcode04,measles
136+
barcode05,mpox
137+
```
138+
139+
- `barcode_dir`: name of barcode directory containing raw fastq.gz files.
140+
- `virus_id`: virus name, must match with names in the virus config.
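
The lookup described above, where each sample's `virus_id` selects a block of parameters from the virus config, can be sketched roughly as follows. This is an illustration of the idea only, not the actual implementation in `amplicon_project.py`: the function name `build_sample_rows` is hypothetical, the config is shown as a plain dictionary, and the generation of `unique_id`, `sequence_name` and `fastq_path` is left out because the manual does not spell out how they are derived.

```python
import csv
import io


def build_sample_rows(sample_map_text, virus_config):
    """Join each sample map record with the parameters of its virus."""
    rows = []
    reader = csv.DictReader(io.StringIO(sample_map_text))
    for record in reader:
        barcode_dir = record["barcode_dir"].strip()
        virus_id = record["virus_id"].strip()
        # The virus_id must match a top-level name in the virus config.
        if virus_id not in virus_config:
            raise KeyError(
                f"virus_id '{virus_id}' (for {barcode_dir}) "
                "is not defined in the virus config"
            )
        # Copy the per-virus parameters into the per-sample row.
        row = {"barcode_dir": barcode_dir, "virus_id": virus_id}
        row.update(virus_config[virus_id])
        rows.append(row)
    return rows


# Illustrative inputs, reduced to two parameters per virus.
VIRUS_CONFIG = {
    "sars-cov-2": {"min_length": 250, "coverage": 30},
    "measles": {"min_length": 100, "coverage": 30},
    "mpox": {"min_length": 1000, "coverage": 30},
}
SAMPLE_MAP = (
    "barcode_dir,virus_id\n"
    "barcode01,sars-cov-2\n"
    "barcode03,measles\n"
    "barcode05,mpox\n"
)

for row in build_sample_rows(SAMPLE_MAP, VIRUS_CONFIG):
    print(row)
```

A misspelled `virus_id` in the sample map raises an error in this sketch instead of silently producing a sample without parameters.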
141+
142+
After running amplicon_project.py, move to your newly created project directory with `cd`, check where you are with `pwd`.
143+
144+
Next, use the `ls` command to list the files in the project directory and check if the following files are present: `sample.tsv` and `Snakefile`.
116145

117-
- The **Snakefile** is the "recipe" for the workflow, describing all the steps we have done by hand, and it is most commonly placed in the root directory of your project (you can open the Snakefile with a text editor and have a look).
146+
The **sample.tsv** file is automatically generated by the setup script. It contains all the per-sample information the workflow needs to run. Each row represents one sample, with the following columns:
147+
148+
- `unique_id`: a unique identifier for the sample, generated from the barcode directory name (e.g., BC01)
149+
- `sequence_name`: the final name for the consensus sequence, formatted as {study_name}_{unique_id}.
150+
- `fastq_path`: the location of the raw *.fastq.gz files.
151+
- `virus_id`: name of the virus.
152+
- `reference_genome`: the location of the reference sequence for the consensus generation.
153+
- `primer`: the location of the file containing the primer sequences.
154+
- `primer_reference`: the location of the reference sequence for primer trimming.
155+
- `min_length`: minimum length required.
156+
- `coverage`: minimum coverage required.
157+
- `run_nextclade`: true or false, determines if nextclade should be run.
158+
- `nextclade_dataset`: path to the nextclade dataset. This field can be empty if you're not running Nextclade.
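
If you want to confirm that a generated `sample.tsv` has the columns listed above, a small check along these lines may be useful. It is a sketch under two assumptions that the manual does not state explicitly: that the file starts with a header row, and that the header names match the column names above exactly. The helper name `missing_columns` is hypothetical.

```python
import csv
import io

# Column names as listed in this section.
EXPECTED_COLUMNS = [
    "unique_id", "sequence_name", "fastq_path", "virus_id",
    "reference_genome", "primer", "primer_reference",
    "min_length", "coverage", "run_nextclade", "nextclade_dataset",
]


def missing_columns(tsv_text):
    """Return the expected columns that are absent from the TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # empty input yields an empty header
    return [column for column in EXPECTED_COLUMNS if column not in header]


# Illustrative header that lacks the two Nextclade columns.
example_header = "\t".join(EXPECTED_COLUMNS[:-2]) + "\n"
print(missing_columns(example_header))
```

To check a real file, pass its contents instead of the example text, for instance `missing_columns(open("sample.tsv").read())`.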
159+
160+
161+
The **Snakefile** is the "recipe" for the workflow, describing all the steps we have done by hand, and it is most commonly placed in the root directory of your project (you can open the Snakefile with a text editor and have a look).
118162

119163
## 6.2 Running the workflow {.unnumbered}
120164

@@ -133,4 +177,6 @@ singularity exec \
133177
--dryrun
134178
```
135179

180+
- `{threads}`: sets the maximum number of cores Snakemake can use in parallel. If you want a specific step of the workflow to utilize more threads, then you can manually edit the Snakefile.
181+
136182
If no errors appear, then remove the `--dryrun` argument and run it again to fully execute the workflow.
