Chapter 3 Assembly & Binning

The next step is to assemble pre-processed reads into contigs, and to cluster these contigs into bins to reconstruct draft bacterial genomes known as metagenome-assembled genomes or MAGs.

3.1 Metagenomic assembly

Metagenomic assembly is a computational process used to reconstruct the genomes of microorganisms present in complex environmental samples, such as faecal samples. Unlike traditional genome sequencing, where a single organism’s genome is targeted, metagenomics focuses on the genetic material of all microorganisms present in a given sample, allowing for the simultaneous study of entire microbial communities.

Two of the most popular metagenome assemblers are Megahit and MetaSpades. Metaspades is considered superior in terms of assembly quality, yet memory requirements are much larger than those of Megahit. Thus, one of the most relevant criteria to choose the assembler to be employed is the balance between amount of data and available memory. Another minor, yet relevant difference between both assemblers is that Megahit allows removing contings below a certain size, while MetaSpades needs to be piped with another software (e.g. bbmap) to get rid off barely informative yet often abundant short contigs. Due to the large computational requirements in the EHI, the default metagenomic assembler we use is Megahit.

megahit \
    -t {threads} \
    --verbose \
    --min-contig-len 1500 \
    -1 {input.r1} -2 {input.r2} \
    -f \
    -o {config[workdir]}/{wildcards.PRB}_{wildcards.EHI}_assembly

The metagenome assemblies can have very different properties depending on the amount of data used for the assembly, the complexity of the microbial community, and other biological and technical aspects. It is therefore convenient to obtain some general statistics of the assemblies to decide whether they look meaningful to continue with downstream analyses. This can be easily done using the software Quast.

quast \
    -o {output.report} \
    --threads {threads} \
    {input.assembly}

3.2 Assembly mapping

In order to group contigs into MAGs, binners (software that group contigs into bins) require contig coverage information to decide which contig belong to which bin. This type of data is generated by mapping pre-processed metagenomic reads agains the metagenomic assembly.

Fist of all, the assembly needs to be indexed.

bowtie2-build \
    --large-index \
    --threads {threads} \
    {input.contigs} {input.contigs}

Once indexed, the metagenomics reads are mapped against the assembly.

# Map reads to assembly using Bowtie2
bowtie2 \
    --time \
    --threads {threads} \
    -x {input.contigs} \
    -1 {input.r1} \
    -2 {input.r2} \
| samtools sort -@ {threads} -o {output}

Using coverM, we can generate the assembly coverage data required for downstream binning steps.

coverm genome \
      -b {input.bam} \
      --genome-fasta-files {input.contigs} \
      -m relative_abundance \
      -t {threads} \
      --min-covered-fraction 0 \
      > {output.coverm}

#Run coverm for the eukaryotic assessment pipeline
coverm genome \
      -s - \
      -b {input.bam} \
      -m relative_abundance count mean covered_fraction \
      -t {threads} \
      --min-covered-fraction 0 \
      > {output.euk}

3.3 Ensemble binning

Metagenomic binning is the bioinformatic process that attempts to group metagenomic sequences by their organism of origin. In practice, what binning does is to cluster contigs of a metagenomic assembly into putative bacterial genomes. In the last decade over a dozen of binning algorithms have been released, each relying on different structural and mathematical properties of the input data. Two of the most relevant structural properties to group contigs into bins are oligonucleotide composition of contigs and present of universally conserved genes in contigs.

The performance of the binning algorithms is largely dependent on the specific properties of each sample. A software that performs very well with a given sample can be easily outcompeted by another one in the next sample. In consequence, many researchers opt for ensemble approaches whereby assemblies are binned using multiple algorithms, followed by a refinement step that merges all generated information to yield consensus bins.

3.3.1 Maxbin

MaxBin2 is a metagenome binning software designed to partition assembled metagenomic contigs into individual genome bins based on sequence composition and abundance. It utilises a combination of tetranucleotide frequency analysis and abundance information obtained from read mapping to identify distinct genomic bins representing different microorganisms within a metagenomic sample. MaxBin2 is known for its efficiency and accuracy in binning metagenomic contigs, making it a valuable tool for characterising microbial communities in various environments.

# summarise contig depths
jgi_summarize_bam_contig_depths \
    --outputDepth {params.outdir}/mb2_master_depth.txt \
    --noIntraDepthVariance \
    {input.bam}

#calculate total numper of columns
A=($(head -n 1 {params.outdir}/mb2_master_depth.txt))
N=${{#A[*]}}

# split the contig depth file into multiple files
if [ -f {params.outdir}/mb2_abund_list.txt ]; then rm {params.outdir}/mb2_abund_list.txt; fi
for i in $(seq 4 $N); do
    sample=$(head -n 1 {params.outdir}/mb2_master_depth.txt | cut -f $i)
    grep -v totalAvgDepth {params.outdir}/mb2_master_depth.txt | cut -f 1,$i > {params.outdir}/mb2_${{sample%.*}}.txt
    if [[ {params.outdir} == /* ]]; then
        echo {params.outdir}/mb2_${{sample%.*}}.txt >> {params.outdir}/mb2_abund_list.txt
    else
        echo {params.outdir}/mb2_${{sample%.*}}.txt >> {params.outdir}/mb2_abund_list.txt
    fi
done

# Run maxbin2
run_MaxBin.pl \
    -contig {input.contigs} \
    -markerset 107 \
    -thread {threads} \
    -min_contig_length 1500 \
  -out {params.outdir}/maxbin2_out/bin \
  -abund_list {params.outdir}/mb2_abund_list.txt

mkdir -p {params.outdir}/maxbin2_bins
for i in {params.outdir}/maxbin2_out/*.fasta;
    do mv $i {params.outdir}/maxbin2_bins/$(basename ${{i/.fasta/.fa}});
done

3.3.2 Metabat

MetaBAT2 is another metagenomic binning tool that specialises in clustering metagenomic contigs into genome bins by considering both sequence composition and differential coverage across samples. It employs an innovative algorithm that uses a likelihood-based approach to estimate the probability of contigs belonging to the same genome, allowing for the identification of distinct bins for individual microorganisms. MetaBAT2 is well-regarded for its ability to handle complex microbial communities and is widely used in metagenomic research for its binning accuracy.

# summarise contig depths
jgi_summarize_bam_contig_depths --outputDepth {params.outdir}/metabat_depth.txt {input.bam}

# Run metabat2
metabat2 \
     -i {input.contigs} \
     -a {params.outdir}/metabat_depth.txt \
     -o {params.outdir}/metabat2_bins/bin \
     -m 1500 \
     -t {threads} \
     --unbinned

3.3.3 CONCOCT

CONCOCT is a metagenomic binning software that utilises differential coverage and sequence composition to cluster contigs into genome bins. It employs a two-dimensional binning strategy, considering both coverage across multiple samples and sequence composition, allowing for robust binning of metagenomic data. One notable feature of CONCOCT is its capacity to separate closely related strains, which can be challenging for other binning tools. It is particularly valuable for exploring the finer-scale diversity within microbial communities.

cut_up_fasta.py {input.contigs} -c 10000 --merge_last -b {params.outdir}/assembly_10K.bed -o 0 > {params.outdir}/assembly_10K.fa

concoct_coverage_table.py {params.outdir}/assembly_10K.bed {input.bam} > {params.outdir}/concoct_depth.txt

concoct \
    -l 1500 \
    -t {threads} \
    --coverage_file {params.outdir}/concoct_depth.txt \
    --composition_file {params.outdir}/assembly_10K.fa \
    -b {params.outdir}

merge_cutup_clustering.py {params.outdir}/clustering_gt1500.csv > {params.outdir}/clustering_gt1500_merged.csv
mkdir -p {params.outdir}/concoct_bins
python {config[codedir]}/scripts/metawrap_split_concoct_bins.py {params.outdir}/clustering_gt1500_merged.csv {input.contigs} {params.outdir}/concoct_bins

3.4 Bin refinement

# setup checkm2 db path (in case of first run)
checkm2 database --setdblocation {config[checkmdb]}

bash {config[codedir]}/scripts/metawrap_refinement.sh \
    -t {threads} \
    -o {params.outdir} \
    -A {params.binning}/concoct_bins/ \
    -B {params.binning}/maxbin2_bins/ \
    -C {params.binning}/metabat2_bins/ \
    -c 50 \
    -x 10

# Rename output files, and sort metawrap by bin name
head -1 {params.outdir}/metawrap_50_10_bins.stats > {params.outdir}/mw_colnames.tsv
sed '1d;' {params.outdir}/metawrap_50_10_bins.stats | sort -k1,1 -t$'\t' > {params.outdir}/mw_sorted.tsv
cat {params.outdir}/mw_colnames.tsv {params.outdir}/mw_sorted.tsv > {params.outdir}/mw_sorted_col.tsv
mv {params.outdir}/mw_sorted_col.tsv {output.stats}
mv {params.outdir}/metawrap_50_10_bins.contigs {output.contigmap}
sed -i'' '2,$s/bin/{wildcards.EHA}_bin/g' {output.stats}
sed -i'' 's/bin/{wildcards.EHA}_bin/g' {output.contigmap}

# Rename metawrap bins to match coassembly group:
for bin in {params.outdir}/metawrap_50_10_bins/*.fa;
    do mv $bin ${{bin/bin./{wildcards.EHA}_bin.}};
done

# Compress output bins
pigz -p {threads} {params.outdir}/metawrap_50_10_bins/*.fa

#Print the number of MAGs to a file for combining with the assembly report
mkdir -p {params.stats_dir}
ls -l {params.outdir}/metawrap_50_10_bins/*.fa.gz | wc -l > {params.stats_dir}/{wildcards.EHA}_bins.tsv

# Reformat MAG headers for CoverM
for mag in {params.outdir}/metawrap_50_10_bins/*.fa.gz;
    do rename.sh \
        in=$mag \
        out={params.outdir}/$(basename ${{mag/.fa.gz/_renamed.fa.gz}}) \
        zl=9 \
        prefix=$(basename ${{mag/.fa.gz/^}});
done

rm {params.outdir}/metawrap_50_10_bins/*.fa.gz
for mag in {params.outdir}/*.fa.gz;
    do mv $mag {params.outdir}/metawrap_50_10_bins/$(basename ${{mag/_renamed/}});
done