Chapter 5 Dereplication and mapping

The next step of the pipeline is to dereplicate bins into a MAG catalogue, and to map the pre-processed sequencing reads to the MAG catalogue to obtain read-count data per sample and MAG.

5.1 MAG dereplication

Dereplication is the reduction of a set of MAGs based on high sequence similarity between them. Although this step is neither essential nor meaningful in certain cases (e.g., when studying strain-level variation or pangenomes), in most cases it helps to overcome issues such as excessive computational demands, inflated diversity estimates or non-specific read mapping. If the MAG catalogue to which sequencing reads are mapped contains many similar genomes, many reads will have multiple high-quality alignments. Depending on the software used and the parameters chosen, such reads are either randomly distributed across the redundant genomes or reported at all redundant locations. Either outcome can bias quantitative estimates of the relative representation of each MAG in a given metagenomic sample.

Dereplication is based on pairwise comparisons of average nucleotide identity (ANI) between MAGs. The number of comparisons therefore scales quadratically with the number of MAGs, which calls for efficient strategies to keep dereplication computationally affordable. A popular tool for dereplicating MAGs is dRep, which combines the fast yet inaccurate Mash algorithm with the slow but accurate gANI computation to yield a fast and accurate estimation of ANI between MAGs. A threshold of 98% ANI has been found to offer a good balance between retaining genome diversity and minimising cross-mapping issues. Note that dRep expects this threshold as a fraction (e.g., 0.98) in the -sa argument.

dRep dereplicate \
      {config[workdir]}/drep \
      -p {threads} \
      -comp 50 \
      -sa {config[ani]} \
      -g {config[magdir]}/*.fa.gz \
      --genomeInfo mags_formatted.csv
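
To make the quadratic scaling of pairwise comparisons mentioned above more concrete, the short shell sketch below counts the unique genome pairs, n(n-1)/2, for a given number of MAGs. The function name count_pairs is hypothetical and is not part of dRep; it only illustrates the arithmetic that motivates a fast pre-clustering step before the accurate ANI computation.

```shell
# Hypothetical helper (not part of dRep): number of unique pairwise
# comparisons among n genomes, i.e. n * (n - 1) / 2.
count_pairs() {
    n="$1"
    echo $(( n * (n - 1) / 2 ))
}

# Compare how the workload grows with catalogue size.
for n in 100 1000 10000; do
    echo "$n MAGs: $(count_pairs "$n") pairwise comparisons"
done
```

Going from 1,000 to 10,000 MAGs raises the number of pairs from 499,500 to 49,995,000, roughly a hundredfold increase for a tenfold larger catalogue.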

5.2 Mapping to MAG catalogue

When the objective of a genome-resolved metagenomic analysis is to reconstruct and analyse a microbiome, researchers usually require relative abundance information to measure how abundant or rare each bacterium was in the analysed sample. To achieve this, the reads of each sample must be mapped back to the MAG catalogue and the mapping statistics retrieved. The procedure is identical to that explained in the assembly read-mapping section, except that the MAG catalogue, rather than the metagenomic assembly, is used as the reference database.

This procedure usually happens in two steps. In the first step, reads are mapped to the MAG catalogue to generate BAM or CRAM mapping files. In the second step, these mapping files are used to extract quantitative read-abundance information in the form of a table that shows the number of reads mapped to each MAG in each sample.

First, all MAGs need to be concatenated into a single file, which will become the reference MAG catalogue or database.

# Concatenate MAGs
cat {config[workdir]}/drep/dereplicated_genomes/*.fa.gz > {config[workdir]}/mag_catalogue/{config[dmb]}_mags.fasta.gz
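
The CoverM commands used later in this section pass -s ^, which tells CoverM to treat the text before the ^ character in each contig name as the genome name. Plain concatenation does not add such a prefix, so if the contig headers of the MAG files do not already follow that pattern, a renaming step along the following lines could be applied before concatenating. This is a minimal sketch, not necessarily how the pipeline handles it: the function name prefix_contigs is hypothetical, and it assumes gzip-compressed FASTA files whose names end in .fa.gz.

```shell
# Hypothetical helper: prefix every contig header of a gzipped MAG with the
# MAG file name and "^", e.g. ">contig1" in magA.fa.gz becomes ">magA^contig1".
# Sequence lines are passed through unchanged. Output is uncompressed FASTA.
prefix_contigs() {
    mag_file="$1"
    mag_name="$(basename "$mag_file" .fa.gz)"
    gzip -cd "$mag_file" \
        | awk -v mag="$mag_name" '
            /^>/ { print ">" mag "^" substr($0, 2); next }
                 { print }'
}

# Possible usage (paths are placeholders):
# for f in drep/dereplicated_genomes/*.fa.gz; do
#     prefix_contigs "$f"
# done | gzip > mag_catalogue/mags.fasta.gz
```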

The MAG catalogue needs to be indexed before the mapping.

# Index the MAG catalogue
bowtie2-build \
      --large-index \
      --threads {threads} \
      {output.mags} {output.mags}

The following step is then repeated for every sample, yielding one sorted BAM mapping file per sample.

bowtie2 \
    --time \
    --threads {threads} \
    -x {input.contigs} \
    -1 {input.r1} \
    -2 {input.r2} \
| samtools sort -@ {threads} -o {output}
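
Outside a workflow manager, the per-sample iteration described above could be sketched as a plain shell loop. The file layout (reads/<sample>_1.fq.gz and reads/<sample>_2.fq.gz), the function names list_samples and map_sample, and all paths are assumptions made for illustration; only the bowtie2 and samtools options already shown above are used.

```shell
# Assumed layout: <reads_dir>/<sample>_1.fq.gz and <reads_dir>/<sample>_2.fq.gz.

# Print one sample ID per line, derived from the forward-read file names.
list_samples() {
    reads_dir="$1"
    for r1 in "$reads_dir"/*_1.fq.gz; do
        [ -e "$r1" ] || continue    # skip if the glob matched nothing
        basename "$r1" _1.fq.gz
    done
}

# Map one sample to the indexed MAG catalogue and write a sorted BAM file.
# Arguments: <index_prefix> <reads_dir> <sample> <out_dir> <threads>
map_sample() {
    bowtie2 --time --threads "$5" -x "$1" \
        -1 "$2/${3}_1.fq.gz" -2 "$2/${3}_2.fq.gz" \
    | samtools sort -@ "$5" -o "$4/${3}.bam"
}

# Possible usage (requires bowtie2 and samtools; paths are placeholders):
# list_samples reads | while read -r sample; do
#     map_sample mag_catalogue/mags.fasta.gz reads "$sample" bam 8
# done
```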

Finally, CoverM can be used to extract the required statistics, such as relative abundance, read counts and covered fraction per MAG per sample.

# Relative abundance
coverm genome \
    -b {input} \
    -s ^ \
    -m relative_abundance \
    -t {threads} \
    --min-covered-fraction 0 \
    > {output.mapping_rate}

# Read counts and covered fraction per MAG
coverm genome \
    -b {input} \
    -s ^ \
    -m count covered_fraction \
    -t {threads} \
    --min-covered-fraction 0 \
    > {output.count_table}
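
As a small follow-up, the overall mapping rate of a sample can be derived from the relative abundance table. The sketch below assumes a tab-separated table with one header line, genome names in the first column, percentages in the second column, and a row labelled unmapped holding the percentage of reads that did not map to any MAG; this layout should be checked against the actual CoverM output before relying on it. The function name mapped_percent is hypothetical.

```shell
# Hypothetical helper: percentage of reads mapped to the MAG catalogue,
# computed as 100 minus the value in the "unmapped" row.
# Assumes a tab-separated table with a header and a single sample column.
mapped_percent() {
    awk -F '\t' 'NR > 1 && $1 == "unmapped" { printf "%.2f\n", 100 - $2 }' "$1"
}

# Possible usage (file name is a placeholder):
# mapped_percent sample1_relative_abundance.tsv
```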