Commit f542d41

committed
Added Nanopore amplicon manual
1 parent 959e31f commit f542d41

8 files changed: +454 −0 lines changed

_quarto.yml

Lines changed: 6 additions & 0 deletions
@@ -44,6 +44,12 @@ website:
       style: "docked"
       contents:
         - auto: posts/IMAM*.qmd
+
+    - id: id-naam
+      title: "Nanopore amplicon analysis manual"
+      style: "docked"
+      contents:
+        - auto: posts/NAAM*.qmd
 
 format:
   html:

posts/NAAM-01-main-page-index.qmd

Lines changed: 15 additions & 0 deletions

---
title: "Nanopore amplicon analysis manual"
author: "David Nieuwenhuijse, Luc van Zon"
date: 2025-04-30
Reading Time: 20 min
sidebar: id-naam
---

# Introduction {.unnumbered}

Welcome to the Nanopore amplicon analysis manual. This manual contains a step-by-step guide for performing quality control, generating a consensus sequence and comparing sequences to reference sequences. In the final chapter of the manual we show how to automate all of these steps in a single pipeline for speed and convenience.

::: callout-tip
If you are only interested in running the automated workflow, you just have to read the chapters 'Preparation' and 'Automating data analysis'.
:::

posts/NAAM-02-preparation.qmd

Lines changed: 48 additions & 0 deletions

# 1. Preparation {.unnumbered}

::: callout-warning
# Important!

In the following sections, whenever a **"parameter"** in curly brackets `{}` is shown, you are expected to fill in your own filename or value. Each parameter is explained in detail in its section.
:::
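
As a quick illustration of this convention, a templated command such as `mkdir -p {directory}` would be filled in and run as follows (the directory name here is an assumption for illustration):

``` bash
# Template from the manual:  mkdir -p {directory}
# Filled in with your own directory name:
mkdir -p my_project
```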

::: callout-tip
Notice the small *"Copy to Clipboard"* button on the right-hand side of each code chunk; it can be used to copy the code.
:::

## 1.1 Singularity container {.unnumbered}

This workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts. This simplifies setup considerably. [Singularity](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html){target="_blank"} version 3.x or later must be available on your system. If you are working on a high performance computing (HPC) system, it will likely already be installed and available for use. Run `singularity --help` in a terminal connected to the HPC system and see if the command is recognized.

### Download pre-built image

The Singularity container needs an image file that provides the precompiled work environment. You can download the required workflow image file (naam_workflow.sif) directly through the terminal via:

``` bash
wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v1.0.0/naam_workflow.sif
```

Alternatively, go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v1.0.0){target="_blank"}, download it manually, and transfer it to your HPC system.

## 1.2 Verify container {.unnumbered}

You can test basic execution:

``` bash
singularity --version
singularity exec naam_workflow.sif echo "Container is accessible!"
```

For a more in-depth check, you can start an interactive shell inside the built container and run some checks. `singularity shell naam_workflow.sif` will drop you into a shell running inside the container. The conda environment needed for this workflow is activated automatically on start-up of the interactive shell, so all of its tools are ready to use.

Please note that you do not have to run `conda activate {environment}` to activate the environment – everything is inside naam_workflow.sif. If you are curious about the conda environment we are using, you can check it out [here](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/blob/main/envs/environment.yml){target="_blank"}.

``` bash
singularity shell naam_workflow.sif # Start interactive shell
minimap2 --help # Check one of the tools from the conda environment
which python # Check the python location of the conda environment
```

::: callout-note
We are now ready to start performing quality control of our raw Nanopore sequencing data in the next chapter.
:::

posts/NAAM-03-quality_control.qmd

Lines changed: 131 additions & 0 deletions

# 2. Quality control {.unnumbered}

::: callout-warning
# Important!

In the next steps we are going to copy-paste code, adjust it to our needs, and execute it on the command line.

**Please open a plain text editor to paste the code from the next steps into, to keep track of your progress!**
:::

For simplicity's sake, most steps are geared towards the analysis of a single sample. It is recommended to follow a basic file structure like the one below:

```
my_project/
├── raw_data/          # Contains barcode directories
│   ├── barcode01/     # Contains the raw, gzipped FASTQ files
│   │   ├── file1.fastq.gz
│   │   └── file2.fastq.gz
│   ├── barcode02/
│   │   └── file1.fastq.gz
│   └── barcode03/
│       ├── file1.fastq.gz
│       ├── file2.fastq.gz
│       └── file3.fastq.gz
├── results/           # This is where the output files will be stored
└── log/               # This is where log files will be stored
```
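
The skeleton above can be created in one go with `mkdir -p` (a sketch using the example names from the tree; substitute your own project and barcode names):

``` bash
# -p creates parent directories as needed and is silent if they already exist
mkdir -p my_project/raw_data/barcode01 \
         my_project/raw_data/barcode02 \
         my_project/raw_data/barcode03 \
         my_project/results \
         my_project/log
```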

When running any command that generates output files, it's essential to ensure that the output directory exists *before* executing the command. While some tools automatically create the output directory if it is not present, this behavior is not guaranteed. If the output directory doesn't exist and the tool doesn't create it, the command will likely fail with an error message (or, worse, it might fail silently, leading to unexpected results). This is not required if you are running a Snakemake workflow.

To prevent a lot of future frustration, create your output directories beforehand with the `mkdir` command as such:

``` bash
mkdir -p results
mkdir -p log
# Create a subdirectory
mkdir -p results/assembly
```

To use the required tools, activate the Singularity container as follows:

``` bash
singularity shell naam_workflow.sif
```

## 2.1 Merging and decompressing FASTQ {.unnumbered}

Any file on Linux can be concatenated to another file using the `cat` command. `zcat` additionally unzips gzipped files (e.g. with a .fastq.gz extension). If your files are already unzipped, use `cat` instead.

**Modify and run:**

``` bash
zcat {input.folder}/*.fastq.gz > {output}
```

- `{input.folder}` should contain all your .fastq.gz files for a single barcode.
- `{output}` should be the name of the combined unzipped fastq file (e.g. all_barcode01.fastq).
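
Here is a self-contained toy run of this step, using two tiny gzipped FASTQ files created on the spot (in a real analysis these come from your sequencing run):

``` bash
# Demo setup: a toy barcode directory with two gzipped FASTQ files
mkdir -p raw_data/barcode01 results
printf '@read1\nACGT\n+\nIIII\n' | gzip > raw_data/barcode01/file1.fastq.gz
printf '@read2\nGGCC\n+\nIIII\n' | gzip > raw_data/barcode01/file2.fastq.gz

# Filled-in template: zcat {input.folder}/*.fastq.gz > {output}
zcat raw_data/barcode01/*.fastq.gz > results/all_barcode01.fastq

wc -l results/all_barcode01.fastq   # 8 lines = 2 reads x 4 fastq lines each
```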

## 2.2 Running fastp quality control software {.unnumbered}

[fastp](https://github.com/OpenGene/fastp){target="_blank"} is a very fast, multipurpose quality control tool that performs quality and sequence adapter trimming for Illumina short-read and Nanopore long-read data.

Because we are processing Nanopore data, several quality control options have to be disabled. The only requirements we set are a minimum median phred quality score of 10 per read and a minimum length of around the size of the amplicon (e.g. 400 nucleotides).

``` bash
fastp -i {input} -o {output} -j /dev/null -h {report} \
--disable_trim_poly_g \
--disable_adapter_trimming \
--qualified_quality_phred 10 \
--unqualified_percent_limit 50 \
--length_required {min_length} \
-w {threads}
```

- `{input}` is the merged file from step 2.1.
- `{output}` is the quality-controlled `.fastq` filename (e.g. `all_barcode01_QC.fastq`).
- `{report}` is the QC report filename, containing various details about the quality of the data before and after processing.
- `{min_length}` is the expected size of your amplicons, used to remove very short "rubbish" reads. The general advice is to set it a bit lower than the expected size. Based on the QC report, which lists the number of removed reads, you may adjust this setting if too many reads are removed.

::: callout-note
`{threads}` is a recurring setting for the number of CPUs to use for processing. On a laptop this will be low (e.g. 8); on an HPC you may be able to use 64 or more CPUs. However, how much of a performance increase you get depends on the software.
:::
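
One way to pick `{min_length}` is to start from the expected amplicon size and go roughly 10% below it (the amplicon size of 450 nt here is an assumption for illustration):

``` bash
# Assumed amplicon size; replace with your own
AMPLICON_SIZE=450
# Length cutoff ~10% below the expected amplicon size
MIN_LENGTH=$((AMPLICON_SIZE * 90 / 100))
echo "$MIN_LENGTH"   # 405
```

The resulting value would then be passed to fastp as `--length_required $MIN_LENGTH`.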
82+
83+
## 2.3 Mapping reads to primer reference {.unnumbered}
84+
85+
To precisely trim the primers we map the reads to a reference sequence based on which the primers were designed. This is to make sure, when looking for the primer locations, all primer location can be found. To map the reads we use [minimap2](https://github.com/lh3/minimap2) with the `-x map-ont` option for ONT reads. `-Y` ensures reads are not hardclipped. Afterwards we use [samtools](https://www.htslib.org/) to reduce the `.bam` (mapping) file to only those reads that mapped to the reference and sort the reads in mapping file based on mapping position, which is necessary to continue working with the file.
86+
87+
```bash
88+
minimap2 -Y -t {threads} -x map-ont -a {reference} {input} | \
89+
samtools view -bF 4 - | samtools sort -@ {threads} - > {output}
90+
```
91+
92+
- `{reference}` is the fasta file containing the reference that your primers should be able to map to.
93+
- `{input}` is the QC fastq file from step 2.2.
94+
- `{output}` is the mapping file, it could be named something like `barcode01_QCmapped.bam`
95+
96+

## 2.4 Trimming primers using Ampliclip {.unnumbered}

[Ampliclip](https://github.com/dnieuw/Ampliclip) is a tool written by [David](https://github.com/dnieuw/) to remove the primer sequences from Nanopore amplicon reads. It works by mapping the primer sequences to a reference genome to find their locations. It then clips the reads mapped to the same reference (which we did in the previous step) by finding overlap between the primer locations and the read ends. It allows for some "junk" in front of the primer location with `--padding` and for mismatches between primer and reference with `--mismatch`. After clipping it trims the reads and outputs a clipped `.bam` file and a trimmed `.fastq` file. `--minlength` can be set to remove any reads that have become shorter than this length after trimming. Set this to the value that was used in the QC section (e.g. 400).

After the trimming, the clipped mapping file has to be sorted again.

``` bash
samtools index {input.mapped}

ampliclip \
--infile {input.mapped} \
--outfile {output.clipped}_ \
--outfastq {output.trimmed} \
--primerfile {primers} \
--referencefile {reference} \
-fwd LEFT -rev RIGHT \
--padding 20 --mismatch 2 --minlength {min_length} > {log} 2>&1

samtools sort {output.clipped}_ > {output.clipped}
rm {output.clipped}_
```

- `{input.mapped}` is the mapping file created in step 2.3.
- `{output.clipped}` is the mapping file with the primer sequences clipped off (e.g. `barcode01_clipped.bam`).
- `{output.trimmed}` is the trimmed fastq file; it contains all reads mapped to the reference with the primer sequences trimmed off (e.g. `barcode01_trimmed.fastq`).
- `{primers}` is the name of the primer sequence fasta file. Make sure the primer names contain either 'LEFT' or 'RIGHT' to specify whether it is a left- or right-side primer.
- `{reference}` is the name of the reference file; this must be the same file as was used for mapping in section 2.3.
- `{min_length}` is the minimum required length of the trimmed reads; set it to the same value as used with `fastp`.
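
A primer fasta following this naming convention could look like this (hypothetical primer names and sequences, for illustration only):

```
>amplicon1_LEFT
ACGTACGTACGTACGTACGT
>amplicon1_RIGHT
TTGCAATTGCAATTGCAATT
```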

To see what happened during the trimming process, we can open the `.bam` mapping files from before and after primer trimming in the visualization tool [UGENE](https://ugene.net/), a free and open-source alternative to the software [geneious](https://www.geneious.com/).

In UGENE you can open a `.bam` file via the "open file" option.

::: {.callout-note}
We now have quality-controlled sequence reads which we can use to create a consensus sequence in the next chapter.
:::
Lines changed: 53 additions & 0 deletions

# 3. Generating a Consensus Sequence {.unnumbered}

## 3.1 Mapping trimmed reads to reference {.unnumbered}

::: {.callout-note}
This step is only needed if the reference used for primer design and the preferred reference for consensus generation are different. Otherwise, simply use the clipped mapping file from the previous step.
:::

Similar to what we did before, we now map the trimmed reads to our preferred reference genome.

``` bash
minimap2 -Y -t {threads} -x map-ont -a {reference} {input} | \
samtools view -bF 4 - | samtools sort -@ {threads} - > {output}
```

- `{reference}` is the fasta file containing the preferred reference.
- `{input}` is the trimmed fastq file from step 2.4.
- `{output}` is the mapping file; it could be named something like `barcode01_mapped.bam`.

## 3.2 Creating consensus from filtered mutations {.unnumbered}

[Virconsens](https://github.com/dnieuw/Virconsens) is a tool written by [David](https://github.com/dnieuw/) to create a consensus sequence from Nanopore amplicon reads mapped to a reference in a `.bam` mapping file.

It works by reading the mapping file position by position and counting the mutations, insertions and deletions. The mutation or deletion (or original nucleotide from the reference) with the highest count is considered the "consensus" at that position.

In the next step it filters out positions with too low coverage, based on the `mindepth` threshold, or too low frequency, based on the `minAF` threshold.

The last step of the tool is to iterate over the reference genome and replace the reference nucleotide with the mutation, insertion or deletion, replace filtered positions with "N", or keep the original reference nucleotide. (1 and 2 nucleotide indels are ignored, as they are very often erroneous.)

Before running Virconsens we have to index the `.bam` mapping file.

``` bash
samtools index {input}

virconsens \
-b {input} \
-o {output} \
-n {name} \
-r {reference} \
-d {coverage} \
-af 0.1 \
-c {threads}
```

- `{input}` is the mapping bam file from step 3.1.
- `{output}` is the fasta file containing the consensus sequence (e.g. `barcode01_consensus.fasta`).
- `{name}` is the custom name of your sequence that will be used in the fasta file (e.g. `barcode01_consensus`).
- `{reference}` is the fasta file containing the preferred reference, the same as in the previous step.
- `{coverage}` is the minimal depth required at a position to consider any alternative alleles.
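
To make the `-af` threshold concrete: with `-af 0.1`, an alternative allele seen 10 times at a position covered by 120 reads has an allele frequency of 10/120 ≈ 0.083, which is below 0.1, so it would not replace the reference nucleotide (the numbers are made up for illustration):

``` bash
DEPTH=120   # reads covering the position
ALT=10      # reads supporting the alternative allele
awk -v a="$ALT" -v d="$DEPTH" 'BEGIN { printf "%.3f\n", a / d }'   # prints 0.083
```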

::: {.callout-note}
We now have a consensus sequence of our sequencing result. This is the "raw" result we can continue using for multiple sequence alignment and phylogeny in the next chapter.
:::
Lines changed: 51 additions & 0 deletions

# 4. Comparing sequences to reference sequences {.unnumbered}

## 4.1 Creating a multiple sequence alignment {.unnumbered}

We can now create a multiple sequence alignment (MSA) (not to be confused with a read alignment `.bam` file).

We can use a reference-based multiple sequence alignment approach with minimap2 and gofasta. This is very fast and works well even for large genomes (e.g. 200 kb+) or many sequences (10,000+). However, gofasta does not perform a "real" multiple alignment, because it ignores insertions in the sequences compared to the reference and removes them. Therefore, if insertions are expected and present in the sequences, they will have to be added manually. On the positive side, phylogenetic analysis tools such as [IQTREE2](https://github.com/iqtree/iqtree2) also ignore any insertions, so for phylogenetic analysis the removal of insertions does not matter.

``` bash
minimap2 -t {threads} -a \
-x asm20 \
--sam-hit-only \
--secondary=no \
--score-N=0 \
{reference} \
{input} \
-o tmp.sam

gofasta sam toMultiAlign \
-s tmp.sam \
-o {output}
```

- `{input}` here is the `.fasta` file containing all consensus sequences and references you would like to align.
- `{output}` is the name of the aligned fasta file (e.g. `consensus_with_ref_aligned.fasta`).
- `{reference}` is the reference used to perform the reference-based multiple alignment; use the same reference as we used for read mapping before.

(`tmp.sam` can be deleted afterwards.)

## 4.2 Generate useful stats {.unnumbered}

You can generate read stats with `seqkit stats -T` for the raw, QC'd, trimmed and mapped reads.

``` bash
seqkit stats -T {input} > {output}
```

If you want to check the read stats for a mapping file, you can use the following:

``` bash
for file in {input}; do
  samtools fastq $file | seqkit stats -T --stdin-label $file | tail -1
done > {output}
```

- `{input}` is your fastq or bam file.
- `{output}` is a tab-separated (.tsv) file with the read stats.

::: {.callout-note}
If you are not analyzing SARS-CoV-2 data, you can skip the next chapter and go to the final chapter to automate all of the steps we've previously discussed.
:::

posts/NAAM-06-sars_cov_2.qmd

Lines changed: 28 additions & 0 deletions

# 5. SARS-CoV-2 analysis {.unnumbered}

If you are dealing with SARS-CoV-2 data, you can run the [pangolin software](https://github.com/cov-lineages/pangolin) on your SARS-CoV-2 genome sequences; they are compared with other genome sequences and assigned the most likely lineage.

Execute the following:

``` bash
pangolin {input} --outfile {output}
```

- `{input}` is your aggregated consensus fasta file from step X.X.
- `{output}` is a .csv file that contains the taxon name and the lineage assigned per fasta sequence. Read more about the output format: [https://cov-lineages.org/resources/pangolin/output.html](https://cov-lineages.org/resources/pangolin/output.html)

## To be added...

Here are some of the snakemake rules that are currently excluded:

- create_depth_file
- create_vcf
- annotate_vcf
- filter_vcf
- create_filtered_vcf_tables

These rules are exclusively for analysis of SARS-CoV-2 data and will be implemented into the container workflow in the near future.

::: {.callout-note}
You can now move to the final chapter to automate all of the steps we've previously discussed.
:::
