Commit f542d41

committed
Added Nanopore amplicon manual
1 parent 959e31f commit f542d41

8 files changed: +454 −0 lines changed

_quarto.yml

Lines changed: 6 additions & 0 deletions
@@ -44,6 +44,12 @@ website:
       style: "docked"
       contents:
         - auto: posts/IMAM*.qmd
+
+    - id: id-naam
+      title: "Nanopore amplicon analysis manual"
+      style: "docked"
+      contents:
+        - auto: posts/NAAM*.qmd
 
 format:
   html:

posts/NAAM-01-main-page-index.qmd

Lines changed: 15 additions & 0 deletions

---
title: "Nanopore amplicon analysis manual"
author: "David Nieuwenhuijse, Luc van Zon"
date: 2025-04-30
Reading Time: 20 min
sidebar: id-naam
---

# Introduction {.unnumbered}

Welcome to the Nanopore amplicon analysis manual. This manual contains a step-by-step guide for performing quality control, generating a consensus sequence and comparing sequences to reference sequences. In the final chapter of the manual we show how to automate all of these steps in a single pipeline for speed and convenience.

::: callout-tip
If you are only interested in running the automated workflow, you just have to read the chapters 'Preparation' and 'Automating data analysis'.
:::

posts/NAAM-02-preparation.qmd

Lines changed: 48 additions & 0 deletions

# 1. Preparation {.unnumbered}

::: callout-warning
# Important!

In the following sections, whenever a **"parameter"** in curly brackets `{}` is shown, you are expected to fill in your own filename or value. Each parameter is explained in detail in its section.
:::
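
As a quick illustration of this convention, a templated command such as `mkdir -p {directory}` would be filled in and run as follows (the directory name here is an assumption for illustration):

``` bash
# Template from the manual:  mkdir -p {directory}
# Filled in with your own directory name:
mkdir -p my_project
```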

::: callout-tip
Notice the small *"Copy to Clipboard"* button on the right-hand side of each code chunk; it can be used to copy the code.
:::

## 1.1 Singularity container {.unnumbered}

This workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts. This simplifies setup considerably. [Singularity](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html){target="_blank"} version 3.x or later must be available on your system. If you are working on a high performance computing (HPC) system, it will likely already be installed and available for use. Run `singularity --help` in a terminal connected to the HPC system and see if the command is recognized.

### Download pre-built image

The Singularity container needs an image file that provides the precompiled work environment. You can download the required workflow image file (naam_workflow.sif) directly through the terminal via:

``` bash
wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v1.0.0/naam_workflow.sif
```

Alternatively, go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v1.0.0){target="_blank"}, download it manually, and transfer it to your HPC system.

## 1.2 Verify container {.unnumbered}

You can test basic execution:

``` bash
singularity --version
singularity exec naam_workflow.sif echo "Container is accessible!"
```

For a more in-depth check, you can start an interactive shell inside the built container and run some checks. `singularity shell naam_workflow.sif` will drop you into a shell running inside the container. The conda environment needed for this workflow is activated automatically on start-up of the interactive shell, so all of its tools are ready to use.

Please note that you do not have to run `conda activate {environment}` to activate the environment – everything is inside naam_workflow.sif. If you are curious about the conda environment we are using, you can check it out [here](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/blob/main/envs/environment.yml){target="_blank"}.

``` bash
singularity shell naam_workflow.sif # Start interactive shell
minimap2 --help # Check one of the tools from the conda environment
which python # Check the python location of the conda environment
```

::: callout-note
We are now ready to start performing quality control of our raw Nanopore sequencing data in the next chapter.
:::

posts/NAAM-03-quality_control.qmd

Lines changed: 131 additions & 0 deletions

# 2. Quality control {.unnumbered}

::: callout-warning
# Important!

In the next steps we are going to copy-paste code, adjust it to our needs, and execute it on the command line.

**Please open a plain text editor to paste the code from the next steps into, to keep track of your progress!**
:::

For simplicity's sake, most steps are geared towards the analysis of a single sample. It is recommended to follow a basic file structure like the one below:

```
my_project/
├── raw_data/          # Contains barcode directories
│   ├── barcode01/     # Contains the raw, gzipped FASTQ files
│   │   ├── file1.fastq.gz
│   │   └── file2.fastq.gz
│   ├── barcode02/
│   │   └── file1.fastq.gz
│   └── barcode03/
│       ├── file1.fastq.gz
│       ├── file2.fastq.gz
│       └── file3.fastq.gz
├── results/           # This is where the output files will be stored
└── log/               # This is where log files will be stored
```
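
The skeleton above can be created in one go with `mkdir -p` (a sketch using the example names from the tree; substitute your own project and barcode names):

``` bash
# -p creates parent directories as needed and is silent if they already exist
mkdir -p my_project/raw_data/barcode01 \
         my_project/raw_data/barcode02 \
         my_project/raw_data/barcode03 \
         my_project/results \
         my_project/log
```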

When running any command that generates output files, it's essential to ensure that the output directory exists *before* executing the command. While some tools automatically create the output directory if it is not present, this behavior is not guaranteed. If the output directory doesn't exist and the tool doesn't create it, the command will likely fail with an error message (or, worse, it might fail silently, leading to unexpected results). This is not required if you are running a Snakemake workflow.

To prevent a lot of future frustration, create your output directories beforehand with the `mkdir` command as such:

``` bash
mkdir -p results
mkdir -p log
# Create a subdirectory
mkdir -p results/assembly
```

To use the required tools, activate the Singularity container as follows:

``` bash
singularity shell naam_workflow.sif
```

## 2.1 Merging and decompressing FASTQ {.unnumbered}

Any file on Linux can be concatenated to another file using the `cat` command. `zcat` additionally unzips gzipped files (e.g. with a .fastq.gz extension). If your files are already unzipped, use `cat` instead.

**Modify and run:**

``` bash
zcat {input.folder}/*.fastq.gz > {output}
```

- `{input.folder}` should contain all your .fastq.gz files for a single barcode.
- `{output}` should be the name of the combined unzipped fastq file (e.g. all_barcode01.fastq).
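
Here is a self-contained toy run of this step, using two tiny gzipped FASTQ files created on the spot (in a real analysis these come from your sequencing run):

``` bash
# Demo setup: a toy barcode directory with two gzipped FASTQ files
mkdir -p raw_data/barcode01 results
printf '@read1\nACGT\n+\nIIII\n' | gzip > raw_data/barcode01/file1.fastq.gz
printf '@read2\nGGCC\n+\nIIII\n' | gzip > raw_data/barcode01/file2.fastq.gz

# Filled-in template: zcat {input.folder}/*.fastq.gz > {output}
zcat raw_data/barcode01/*.fastq.gz > results/all_barcode01.fastq

wc -l results/all_barcode01.fastq   # 8 lines = 2 reads x 4 fastq lines each
```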

## 2.2 Running fastp quality control software {.unnumbered}

[fastp](https://github.com/OpenGene/fastp){target="_blank"} is a very fast, multipurpose quality control tool that performs quality and sequence adapter trimming for Illumina short-read and Nanopore long-read data.

Because we are processing Nanopore data, several quality control options have to be disabled. The only requirements we set are a minimum median phred quality score of 10 per read and a minimum length of around the size of the amplicon (e.g. 400 nucleotides).

``` bash
fastp -i {input} -o {output} -j /dev/null -h {report} \
--disable_trim_poly_g \
--disable_adapter_trimming \
--qualified_quality_phred 10 \
--unqualified_percent_limit 50 \
--length_required {min_length} \
-w {threads}
```

- `{input}` is the merged file from step 2.1.
- `{output}` is the quality-controlled `.fastq` filename (e.g. `all_barcode01_QC.fastq`).
- `{report}` is the QC report filename, containing various details about the quality of the data before and after processing.
- `{min_length}` is the expected size of your amplicons, used to remove very short "rubbish" reads. The general advice is to set it a bit lower than the expected size. Based on the QC report, which lists the number of removed reads, you may adjust this setting if too many reads are removed.

::: callout-note
`{threads}` is a recurring setting for the number of CPUs to use for processing. On a laptop this will be low (e.g. 8); on an HPC you may be able to use 64 or more CPUs. However, how much of a performance increase you get depends on the software.
:::
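
One way to pick `{min_length}` is to start from the expected amplicon size and go roughly 10% below it (the amplicon size of 450 nt here is an assumption for illustration):

``` bash
# Assumed amplicon size; replace with your own
AMPLICON_SIZE=450
# Length cutoff ~10% below the expected amplicon size
MIN_LENGTH=$((AMPLICON_SIZE * 90 / 100))
echo "$MIN_LENGTH"   # 405
```

The resulting value would then be passed to fastp as `--length_required $MIN_LENGTH`.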
82+
83+
## 2.3 Mapping reads to primer reference {.unnumbered}
84+
85+
To precisely trim the primers we map the reads to a reference sequence based on which the primers were designed. This is to make sure, when looking for the primer locations, all primer location can be found. To map the reads we use [minimap2](https://github.com/lh3/minimap2) with the `-x map-ont` option for ONT reads. `-Y` ensures reads are not hardclipped. Afterwards we use [samtools](https://www.htslib.org/) to reduce the `.bam` (mapping) file to only those reads that mapped to the reference and sort the reads in mapping file based on mapping position, which is necessary to continue working with the file.
86+
87+
```bash
88+
minimap2 -Y -t {threads} -x map-ont -a {reference} {input} | \
89+
samtools view -bF 4 - | samtools sort -@ {threads} - > {output}
90+
```
91+
92+
- `{reference}` is the fasta file containing the reference that your primers should be able to map to.
93+
- `{input}` is the QC fastq file from step 2.2.
94+
- `{output}` is the mapping file, it could be named something like `barcode01_QCmapped.bam`
95+
96+

## 2.4 Trimming primers using Ampliclip {.unnumbered}

[Ampliclip](https://github.com/dnieuw/Ampliclip) is a tool written by [David](https://github.com/dnieuw/) to remove the primer sequences from Nanopore amplicon reads. It works by mapping the primer sequences to a reference genome to find their locations. It then clips the reads mapped to the same reference (which we did in the previous step) by finding overlap between the primer locations and the read ends. It allows for some "junk" in front of the primer location with `--padding` and for mismatches between primer and reference with `--mismatch`. After clipping it trims the reads and outputs a clipped `.bam` file and a trimmed `.fastq` file. `--minlength` can be set to remove any reads that have become shorter than this length after trimming. Set this to the value that was used in the QC section (e.g. 400).

After the trimming, the clipped mapping file has to be sorted again.

``` bash
samtools index {input.mapped}

ampliclip \
--infile {input.mapped} \
--outfile {output.clipped}_ \
--outfastq {output.trimmed} \
--primerfile {primers} \
--referencefile {reference} \
-fwd LEFT -rev RIGHT \
--padding 20 --mismatch 2 --minlength {min_length} > {log} 2>&1

samtools sort {output.clipped}_ > {output.clipped}
rm {output.clipped}_
```

- `{input.mapped}` is the mapping file created in step 2.3.
- `{output.clipped}` is the mapping file with the primer sequences clipped off (e.g. `barcode01_clipped.bam`).
- `{output.trimmed}` is the trimmed fastq file; it contains all reads mapped to the reference with the primer sequences trimmed off (e.g. `barcode01_trimmed.fastq`).
- `{primers}` is the name of the primer sequence fasta file. Make sure the primer names contain either 'LEFT' or 'RIGHT' to specify whether it is a left- or right-side primer.
- `{reference}` is the name of the reference file; this must be the same file as was used for mapping in section 2.3.
- `{min_length}` is the minimum required length of the trimmed reads; set it to the same value as used with `fastp`.
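
A primer fasta following this naming convention could look like this (hypothetical primer names and sequences, for illustration only):

```
>amplicon1_LEFT
ACGTACGTACGTACGTACGT
>amplicon1_RIGHT
TTGCAATTGCAATTGCAATT
```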

To see what happened during the trimming process, we can open the `.bam` mapping files from before and after primer trimming in the visualization tool [UGENE](https://ugene.net/), a free and open-source alternative to the software [geneious](https://www.geneious.com/).

In UGENE you can open a `.bam` file via the "open file" option.

::: {.callout-note}
We now have quality-controlled sequence reads which we can use to create a consensus sequence in the next chapter.
:::
Lines changed: 53 additions & 0 deletions

# 3. Generating a Consensus Sequence {.unnumbered}

## 3.1 Mapping trimmed reads to reference {.unnumbered}

::: {.callout-note}
This step is only needed if the reference used for primer design and the preferred reference for consensus generation are different. Otherwise, simply use the clipped mapping file from the previous step.
:::

Similar to what we did before, we now map the trimmed reads to our preferred reference genome.

``` bash
minimap2 -Y -t {threads} -x map-ont -a {reference} {input} | \
samtools view -bF 4 - | samtools sort -@ {threads} - > {output}
```

- `{reference}` is the fasta file containing the preferred reference.
- `{input}` is the trimmed fastq file from step 2.4.
- `{output}` is the mapping file; it could be named something like `barcode01_mapped.bam`.

## 3.2 Creating consensus from filtered mutations {.unnumbered}

[Virconsens](https://github.com/dnieuw/Virconsens) is a tool written by [David](https://github.com/dnieuw/) to create a consensus sequence from Nanopore amplicon reads mapped to a reference in a `.bam` mapping file.

It works by reading the mapping file position by position and counting the mutations, insertions and deletions. The mutation or deletion (or original nucleotide from the reference) with the highest count is considered the "consensus" at that position.

In the next step it filters out positions with too low coverage, based on the `mindepth` threshold, or too low frequency, based on the `minAF` threshold.

The last step of the tool is to iterate over the reference genome and replace the reference nucleotide with the mutation, insertion or deletion, replace filtered positions with "N", or keep the original reference nucleotide. (1 and 2 nucleotide indels are ignored, as they are very often erroneous.)

Before running Virconsens we have to index the `.bam` mapping file.

``` bash
samtools index {input}

virconsens \
-b {input} \
-o {output} \
-n {name} \
-r {reference} \
-d {coverage} \
-af 0.1 \
-c {threads}
```

- `{input}` is the mapping bam file from step 3.1.
- `{output}` is the fasta file containing the consensus sequence (e.g. `barcode01_consensus.fasta`).
- `{name}` is the custom name of your sequence that will be used in the fasta file (e.g. `barcode01_consensus`).
- `{reference}` is the fasta file containing the preferred reference, the same as in the previous step.
- `{coverage}` is the minimal depth required at a position to consider any alternative alleles.
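
To make the `-af` threshold concrete: with `-af 0.1`, an alternative allele seen 10 times at a position covered by 120 reads has an allele frequency of 10/120 ≈ 0.083, which is below 0.1, so it would not replace the reference nucleotide (the numbers are made up for illustration):

``` bash
DEPTH=120   # reads covering the position
ALT=10      # reads supporting the alternative allele
awk -v a="$ALT" -v d="$DEPTH" 'BEGIN { printf "%.3f\n", a / d }'   # prints 0.083
```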

::: {.callout-note}
We now have a consensus sequence of our sequencing result. This is the "raw" result we can continue using for multiple sequence alignment and phylogeny in the next chapter.
:::
Lines changed: 51 additions & 0 deletions

# 4. Comparing sequences to reference sequences {.unnumbered}

## 4.1 Creating a multiple sequence alignment {.unnumbered}

We can now create a multiple sequence alignment (MSA) (not to be confused with a read alignment `.bam` file).

We can use a reference-based multiple sequence alignment approach with minimap2 and gofasta. This is very fast and works well even for large genomes (e.g. 200 kb+) or many sequences (10,000+). However, gofasta does not perform a "real" multiple alignment, because it ignores insertions in the sequences compared to the reference and removes them. Therefore, if insertions are expected and present in the sequences, they will have to be added manually. On the positive side, phylogenetic analysis tools such as [IQTREE2](https://github.com/iqtree/iqtree2) also ignore any insertions, so for phylogenetic analysis the removal of insertions does not matter.

``` bash
minimap2 -t {threads} -a \
-x asm20 \
--sam-hit-only \
--secondary=no \
--score-N=0 \
{reference} \
{input} \
-o tmp.sam

gofasta sam toMultiAlign \
-s tmp.sam \
-o {output}
```

- `{input}` here is the `.fasta` file containing all consensus sequences and references you would like to align.
- `{output}` is the name of the aligned fasta file (e.g. `consensus_with_ref_aligned.fasta`).
- `{reference}` is the reference used to perform the reference-based multiple alignment; use the same reference as we used for read mapping before.

(`tmp.sam` can be deleted afterwards.)

## 4.2 Generate useful stats {.unnumbered}

You can generate read stats with `seqkit stats -T` for the raw, QC'd, trimmed and mapped reads.

``` bash
seqkit stats -T {input} > {output}
```

If you want to check the read stats for a mapping file, you can use the following:

``` bash
for file in {input}; do
  samtools fastq $file | seqkit stats -T --stdin-label $file | tail -1
done > {output}
```

- `{input}` is your fastq or bam file.
- `{output}` is a tab-separated (.tsv) file with the read stats.

::: {.callout-note}
If you are not analyzing SARS-CoV-2 data, you can skip the next chapter and go to the final chapter to automate all of the steps we've previously discussed.
:::

posts/NAAM-06-sars_cov_2.qmd

Lines changed: 28 additions & 0 deletions

# 5. SARS-CoV-2 analysis {.unnumbered}

If you are dealing with SARS-CoV-2 data, you can run the [pangolin software](https://github.com/cov-lineages/pangolin) on your SARS-CoV-2 genome sequences; they are compared with other genome sequences and assigned the most likely lineage.

Execute the following:

``` bash
pangolin {input} --outfile {output}
```

- `{input}` is your aggregated consensus fasta file from step X.X.
- `{output}` is a .csv file that contains the taxon name and the lineage assigned per fasta sequence. Read more about the output format: [https://cov-lineages.org/resources/pangolin/output.html](https://cov-lineages.org/resources/pangolin/output.html)

## To be added...

Here are some of the snakemake rules that are currently excluded:

- create_depth_file
- create_vcf
- annotate_vcf
- filter_vcf
- create_filtered_vcf_tables

These rules are exclusively for analysis of SARS-CoV-2 data and will be implemented into the container workflow in the near future.

::: {.callout-note}
You can now move to the final chapter to automate all of the steps we've previously discussed.
:::
