Necessary packages

Please install the following packages that will be useful during this practical:

  • GEOquery
  • readxl
  • dplyr
  • pvca
  • flashClust
  • edgeR
  • limma
  • UpSetR

WARNING: some of those are from Bioconductor

Context

This practicals will be based on the public data provided alongside the (Singhania et al. 2018) article on the Gene Expression Omnibus website data repository under the superserie accession number GSE107995 and that studied the transcriptional signature of active human tuberculosis (TB) infection.

Data Management

Public data import

0.a) first download both the raw and edgeR-normalized xlsx files from GEO for all 3 subseries:

  • GSE107991
  • GSE107992
  • GSE107993

0.b) read through the code below and try to understand the operation performed. In particular, have a look at the GSE107991_metadata object (tip: it is a S4 object, so use the @ operator)

# DATA ----
rm(list = ls())

library(GEOquery)
library(readxl)

London_norm <- read_excel("GSE107991_edgeR_normalized_Berry_London.xlsx")  # edgeR preprocessed
SouthAfrica_norm <- read_excel("GSE107992_edgeR_normalized_Berry_SouthAfrica.xlsx")
Leicester_norm <- read_excel("GSE107993_edgeR_normalized_Leicester_non_progressor_longitudnal_only.xlsx")  # raw counts
London_raw <- read_excel("GSE107991_Raw_counts_Berry_London.xlsx")  # raw counts
SouthAfrica_raw <- read_excel("GSE107992_Raw_counts_Berry_SouthAfrica.xlsx")  # raw counts
Leicester_raw <- read_excel("GSE107993_Raw_counts_Leicester_non_progressor_longitudnal_only.xlsx")  # raw counts

# metadata
GSE107991_metadata <- GEOquery::getGEO("GSE107991", GSEMatrix = FALSE)  #London
GSE107992_metadata <- GEOquery::getGEO("GSE107992", GSEMatrix = FALSE)  # SouthAfrica
GSE107993_metadata <- GEOquery::getGEO("GSE107993", GSEMatrix = FALSE)  #Leicester

get_info <- function(i, meta) {
    name <- meta@gsms[[i]]@header$source_name_ch1
    name <- gsub("Active_TB", "ActiveTB", name)
    name <- gsub("Test_set", "TestSet", name)
    name <- gsub("Validation_set", "ValidationSet", name)
    name <- gsub("Non_progressor", "NonProgressor", name)
    res <- unlist(strsplit(name, split = "_"))
    res <- c(res, meta@gsms[[i]]@header$title)
    res
}
infos_london <- sapply(1:length(GSE107991_metadata@gsms), FUN = get_info, meta = GSE107991_metadata)
infos_df_london <- cbind.data.frame(GSMID = names(GSE107991_metadata@gsms), 
    t(infos_london))
rownames(infos_df_london) <- names(GSE107991_metadata@gsms)
colnames(infos_df_london)[-1] <- c("Cohort", "Location", "Set", "Status", "SampleName")

infos_sf <- sapply(1:length(GSE107992_metadata@gsms), FUN = get_info, meta = GSE107992_metadata)
infos_df_sf <- cbind.data.frame(GSMID = names(GSE107992_metadata@gsms), t(infos_sf))
rownames(infos_df_sf) <- names(GSE107992_metadata@gsms)
colnames(infos_df_sf)[-1] <- c("Cohort", "Location", "Set", "Status", "SampleName")

infos_leicester <- sapply(1:length(GSE107993_metadata@gsms), FUN = get_info, 
    meta = GSE107993_metadata)
infos_df_leicester <- cbind.data.frame(GSMID = names(GSE107993_metadata@gsms), 
    t(infos_leicester))
rownames(infos_df_leicester) <- names(GSE107993_metadata@gsms)
colnames(infos_df_leicester)[-1] <- c("Cohort", "Location", "Status", "Set", 
    "SampleName")
infos_df_leicester <- infos_df_leicester[, c(1, 2, 3, 5, 4, 6)]

infos_singhania <- rbind.data.frame(infos_df_london, infos_df_sf, infos_df_leicester)
expr_singhania_raw <- merge(x = merge(x = London_raw, y = SouthAfrica_raw, by = c(1:3)), 
    y = Leicester_raw, by = c(1:3))
expr_singhania_norm <- merge(x = merge(x = London_norm, y = SouthAfrica_norm, 
    by = c(1:3)), y = Leicester_norm, by = c(1:3))

save(infos_singhania, expr_singhania_raw, expr_singhania_norm, London_norm, 
    SouthAfrica_norm, Leicester_norm, file = "SinghaniaTB_data.Rdata")

Filtering

The counts matrix outputted from the alignement usually contains many genes that are never observed.

1.a) remove the genes that are never observed, i.e. with 0 counts in every samples

Low counts are still badly understood by the community and their measure is less reliable than higher counts, therefore it is common to remove genes with very low counts. In particular, Singhania et al considered only genes “expressed with counts per million (CPM) > 2 in at least five samples” (such filters can seem a bit ad hoc).

1.b) compute the counts per million and keep only the genes that have a cpm>2 in at least 5 samples

1.c) Finally keep only the protein coding genes (i.e. for which Gene_biotype is protein_coding)

Normalization

2.a) Display the boxplots of each samples \(log2\) raw counts alongside each other and comment the need for normalization.

The sequencing depth problem is commonly treated with the TMM normalization. This normalization divides the raw count by the library size.

2.b) use the edgeR package to compute the TMM normalization after converting the filtered count matrix to a DGEList object

2.c) use the edgeR::cpm() function to return the filtered, TMM normalized, log2-Counts per million.

2.d) Compare the obtained values to the one shared publicly.

2.e) Display the boxplots of each samples \(log2\)-cpm normalized expressions alongside each other and comment.

2.f) Display the dendrogram of the hierarchical clusterings using Wards method and the EUclidean distance comparing the raw counts and the normalized \(log2\)-cpm.

Differential expression

Research question & design matrix

3.a) Formulate the research question of interest here.

3.b) With the function model.matrix() generate a model matrix for the (generalized) linear model that you will use to answer that question.

edgeR pipeline

(Robinson, McCarthy, and Smyth 2010)

4.a) Using edgeR::estimateCommonDisp() add a (common) dispertion estimate to the DGE object

4.b) Using edgeR::glmFit() and edgeR::glmLRT() functions, identify the differentially expressed genes (adjusted p-value < 5% and | log-FoldChange | > 1 )

voom-limma pipeline

(Smyth et al. 2002)

5.a) compute the precision weights with the limma::voom() function.

5.b) now apply the limma pipeline with the functions limma::lmFit(), limma::eBayes() and limma::decideTests().

Discussion

6.a) Using an Up-Set diagramm (Conway, Lex, and Gehlenborg 2017) with the UpSetR package, compare the results between edgeR and voom-limma

6.b) Compare to the results from the Singhania et al. original article (in particular to the 373 gene signature). Can you list the elements that differed in their analysis compared to what you just did here ?

Bibliography

Conway, Jake R, Alexander Lex, and Nils Gehlenborg. 2017. “UpSetR: An R Package for the Visualization of Intersecting Sets and Their Properties.” Bioinformatics 33 (18): 2938–40.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.

Singhania, Akul, Raman Verma, Christine M Graham, Jo Lee, Trang Tran, Matthew Richardson, Patrick Lecine, et al. 2018. “A Modular Transcriptional Signature Identifies Phenotypic Heterogeneity of Human Tuberculosis Infection.” Nature Communications 9 (1): 2308.

Smyth, Gordon K, Matthew Ritchie, Natalie Thorne, James Wettenhall, Wei Shi, and Yifang Hu. 2002. “Limma: Linear Models for Microarray and Rna-Seq Data User’s Guide.”