Chapter 2 Data summary

2.1 Load the data

The EHI pipeline outputs six data files in analysis batches. Each batch is assigned a batch code in the form of DMB1234. DMB stands for Dereplication & Mapping Batch, which is the final step of the EHI pipeline. The six output files are:

Microbiome count table ([BATCH]_counts.tsv.gz): this is a contingency table containing the number of sequencing reads from each sample mapped against each MAG, with MAG identifiers in rows and sample identifiers in columns.

Microbiome coverage table ([BATCH]_coverage.tsv.gz): it has the same structure as the count table, but contains breadth-of-coverage information of each MAG in each sample. In other words, it contains the fraction of the genomes covered by mapped reads in each sample.

Sample metadata table ([BATCH]_metadata.tsv.gz): it contains relevant metadata of the samples, including geographic origin, host species, sample type and statistics of sample preprocessing.

Microbiome metadata table ([BATCH]_mag_info.tsv.gz): it contains relevant metadata of the MAGs, including taxonomy, genome completeness, contamination/redundancy and other quality metrics.

Microbiome phylogenetic tree ([BATCH].tree.gz): this is the phylogenetic tree of the MAGs derived from the GTDB master tree after pruning all reference genomes. This file is used for phylogenetic analyses.

Microbiome functional attribute table ([BATCH]_merged_kegg.tsv.gz): this is a contingency table containing the fullness levels of hundreds of KEGG modules, with MAG identifiers in rows and KEGG module identifiers in columns. This table is used for functional analyses.

You first need to download these documents from the server, using the links in the EHI data report. In this workflow, we assume the files are downloaded to the folder “data” in your working environment.

# Batch
batch="DMB0038"

# Microbiome count table
count_table <- read.table(
      gunzip(paste0("data/",batch,"_counts.tsv.gz"), remove=FALSE, overwrite=TRUE),
      sep="\t",row.names=1,header=T)

# Microbiome coverage table
coverage_table <- read.table(
      gunzip(paste0("data/",batch,"_coverage.tsv.gz"), remove=FALSE, overwrite=TRUE),
      sep="\t",row.names=1,header=T)

# Sample metadata table
sample_table <- read.table(
      gunzip(paste0("data/",batch,"_metadata.tsv.gz"), remove=FALSE, overwrite=TRUE),
      sep="\t",header=T) %>%
      rename(sample=EHI_plaintext) # rename column

# Microbiome metadata table
mags_table <- read.table(
      gunzip(paste0("data/",batch,"_mag_info.tsv.gz"), remove=FALSE, overwrite=TRUE),
      sep="\t",header=T)
rownames(mags_table) <- mags_table[,1] # add row names

# Microbiome phylogenetic tree
tree <- read.tree(
      gunzip(paste0("data/",batch,".tree.gz"), remove=FALSE, overwrite=TRUE))

# Microbiome functional attribute table
kegg_table <- read.table(
      gunzip(paste0("data/",batch,"_merged_kegg.tsv.gz"), remove=FALSE, overwrite=TRUE),
      sep="\t",header=T, row.names=1)

2.2 General data statistics

You can then generate some general statistics to obtain an overview of your data.

Number of samples

ncol(count_table)
[1] 38

Amount of discarded data

The value is in GBs (gigabases)

sum(round(((sample_table$metagenomic_bases+sample_table$host_bases)/(1-sample_table$bases_lost_fastp_percent))-(sample_table$metagenomic_bases+sample_table$host_bases)))/1000000000
[1] 6.410973

Amount of host data

The value is in GBs (gigabases)

sum(sample_table$host_bases)/1000000000
[1] 7.046948

Amount of metagenomic data

The value is in GBs (gigabases)

sum(sample_table$metagenomic_bases)/1000000000
[1] 133.1334

Amount of estimated prokaryotic

sum(sample_table$metagenomic_bases * sample_table$singlem_fraction)/1000000000
[1] 88.89506

2.3 General MAG statistics

Number of MAGs

nrow(count_table)
[1] 581

Number of MAGs without species-level annotation

These are the MAGs that could be considered “new” species

mags_table %>%
    filter(species == "s__") %>%
    nrow()
[1] 521

2.3.1 Number of phylums

mags_table %>%
    select(phylum) %>%
  unique() %>%
  pull() %>%
    length()
[1] 17

2.4 Geographic distribution

You can visualise the origin of your samples using the geographic information available in the sample metadata table. First you need to generate summary information containing unique sampling sites and the number of samples per site.

#Summarise for generating map
options(dplyr.summarise.inform = FALSE)
sample_table_summary <- sample_table %>%
  #Group by geography and count samples
  select(sample, latitude, longitude, country) %>%
  group_by(latitude, longitude) %>%
  summarize(count = n()) %>%
  ungroup()

Then, this new table can be used to generate the map with location shape sizes indicating the number of samples.

sample_table_summary %>%
ggplot(.) +
    #render map
    geom_map(
      data=map_data("world"),
      map = map_data("world"),
      aes(long, lat, map_id=region),
      color = "white", fill = "#cccccc", linewidth = 0.2
    ) +
    #render points
    geom_point(
      aes(x=longitude,y=latitude, size=count),
      alpha=0.5, shape=16) +
    #add general plot layout
    theme_minimal() +
    theme(legend.position = "none",
        axis.title.x=element_blank(),
        axis.title.y=element_blank()
      )

You can also plot the summary table.

sample_table %>%
  select(sample,sample_type,region,country,latitude,longitude) %>%
  kable()
sample sample_type region country latitude longitude
EHI00678 Faecal San Salvador Bahamas 24.05862 -74.46739
EHI00682 Faecal Chub Cay Bahamas 25.40986 -77.87054
EHI00743 Faecal Cockpit Jamaica 18.38248 -77.51726
EHI00679 Faecal Chub Cay Bahamas 25.40986 -77.87054
EHI00674 Faecal Cockpit Jamaica 18.38248 -77.51726
EHI00689 Faecal Andros Bahamas 24.66547 -77.80168
EHI00688 Faecal Acklins Bahamas 22.65397 -73.93493
EHI00690 Faecal Andros Bahamas 24.66547 -77.80168
EHI00692 Faecal Andros Bahamas 24.66547 -77.80168
EHI00753 Faecal Chub Cay Bahamas 25.40986 -77.87054
EHI00696 Faecal Grand Cayman Cayman Islands 19.32504 -81.20698
EHI00751 Faecal Cockpit Jamaica 18.38248 -77.51726
EHI00680 Faecal San Salvador Bahamas 24.05862 -74.46739
EHI00683 Faecal San Salvador Bahamas 24.05862 -74.46739
EHI00732 Faecal Soroa Cuba 22.77456 -83.03659
EHI00695 Faecal Grand Cayman Cayman Islands 19.32504 -81.20698
EHI00693 Faecal Andros Bahamas 24.66547 -77.80168
EHI00731 Faecal Acklins Bahamas 22.65397 -73.93493
EHI00675 Faecal Cockpit Jamaica 18.38248 -77.51726
EHI00686 Faecal Acklins Bahamas 22.65397 -73.93493
EHI00681 Faecal San Salvador Bahamas 24.05862 -74.46739
EHI00758 Faecal Soroa Cuba 22.77456 -83.03659
EHI00685 Faecal Acklins Bahamas 22.65397 -73.93493
EHI00752 Faecal Chub Cay Bahamas 25.40986 -77.87054
EHI00697 Faecal Grand Cayman Cayman Islands 19.32504 -81.20698
EHI00742 Faecal Cockpit Jamaica 18.38248 -77.51726
EHI00698 Faecal Grand Cayman Cayman Islands 19.32504 -81.20698
EHI00687 Faecal Acklins Bahamas 22.65397 -73.93493
EHI00684 Faecal Acklins Bahamas 22.65397 -73.93493
EHI00733 Faecal Soroa Cuba 22.77456 -83.03659
EHI00691 Faecal Andros Bahamas 24.66547 -77.80168
EHI00757 Faecal Chub Cay Bahamas 25.40986 -77.87054
EHI00699 Faecal Grand Cayman Cayman Islands 19.32504 -81.20698
EHI00677 Faecal San Salvador Bahamas 24.05862 -74.46739
EHI00700 Faecal Grand Cayman Cayman Islands 19.32504 -81.20698
EHI00730 Faecal Chub Cay Bahamas 25.40986 -77.87054
EHI00694 Faecal Andros Bahamas 24.66547 -77.80168
EHI00676 Faecal Cockpit Jamaica 18.38248 -77.51726

2.5 Prepare the colour layout

For the sake of consistency through EHI projects, we have created a unified colour profile for bacterial and archaeal taxa. The palette assign similar colours to closely related phyla, and more distinct hues to distantly related phyla. The colour profile can be downloaded from Github, and store as an object in R for downstream analyses.

https://github.com/earthhologenome/EHI_taxonomy_colour/edit/main/README.md

# Download and load the phylum colour table
colours_URL="https://raw.githubusercontent.com/earthhologenome/EHI_taxonomy_colour/main/ehi_phylum_colors.tsv"
download.file(colours_URL, "data/ehi_phylum_colors.tsv")
ehi_phylum_colors <- read.table("data/ehi_phylum_colors.tsv",sep="\t",header=T,comment.char = "")

# Arrange colors alphabetically
colors_alphabetic <- ehi_phylum_colors %>%
  right_join(mags_table, by=join_by(phylum == phylum)) %>%
    arrange(match(genome, tree$tip.label)) %>%
  select(phylum, colors) %>%
    unique() %>%
    arrange(phylum) %>%
    select(colors) %>%
    pull()
Coloured GTDB 214 phylum tree
Coloured GTDB 214 phylum tree