Please install the following packages that will be useful during this practical:
GEOquery
readxl
dplyr
pvca
flashClust
edgeR
limma
UpSetR
WARNING: some of those are from Bioconductor
This practicals will be based on the public data provided alongside the (Singhania et al. 2018) article on the Gene Expression Omnibus website data repository under the superserie accession number GSE107995
and that studied the transcriptional signature of active human tuberculosis (TB) infection.
0.a) first download both the raw and
edgeR
-normalizedxlsx
files from GEO for all 3 subseries:
GSE107991
GSE107992
GSE107993
0.b) read through the code below and try to understand the operation performed. In particular, have a look at the
GSE107991_metadata
object (tip: it is aS4 object
, so use the@
operator)
# DATA ----
rm(list = ls())
library(GEOquery)
library(readxl)
London_norm <- read_excel("GSE107991_edgeR_normalized_Berry_London.xlsx") # edgeR preprocessed
SouthAfrica_norm <- read_excel("GSE107992_edgeR_normalized_Berry_SouthAfrica.xlsx")
Leicester_norm <- read_excel("GSE107993_edgeR_normalized_Leicester_non_progressor_longitudnal_only.xlsx") # raw counts
London_raw <- read_excel("GSE107991_Raw_counts_Berry_London.xlsx") # raw counts
SouthAfrica_raw <- read_excel("GSE107992_Raw_counts_Berry_SouthAfrica.xlsx") # raw counts
Leicester_raw <- read_excel("GSE107993_Raw_counts_Leicester_non_progressor_longitudnal_only.xlsx") # raw counts
# metadata
GSE107991_metadata <- GEOquery::getGEO("GSE107991", GSEMatrix = FALSE) #London
GSE107992_metadata <- GEOquery::getGEO("GSE107992", GSEMatrix = FALSE) # SouthAfrica
GSE107993_metadata <- GEOquery::getGEO("GSE107993", GSEMatrix = FALSE) #Leicester
get_info <- function(i, meta) {
name <- meta@gsms[[i]]@header$source_name_ch1
name <- gsub("Active_TB", "ActiveTB", name)
name <- gsub("Test_set", "TestSet", name)
name <- gsub("Validation_set", "ValidationSet", name)
name <- gsub("Non_progressor", "NonProgressor", name)
res <- unlist(strsplit(name, split = "_"))
res <- c(res, meta@gsms[[i]]@header$title)
res
}
infos_london <- sapply(1:length(GSE107991_metadata@gsms), FUN = get_info, meta = GSE107991_metadata)
infos_df_london <- cbind.data.frame(GSMID = names(GSE107991_metadata@gsms),
t(infos_london))
rownames(infos_df_london) <- names(GSE107991_metadata@gsms)
colnames(infos_df_london)[-1] <- c("Cohort", "Location", "Set", "Status", "SampleName")
infos_sf <- sapply(1:length(GSE107992_metadata@gsms), FUN = get_info, meta = GSE107992_metadata)
infos_df_sf <- cbind.data.frame(GSMID = names(GSE107992_metadata@gsms), t(infos_sf))
rownames(infos_df_sf) <- names(GSE107992_metadata@gsms)
colnames(infos_df_sf)[-1] <- c("Cohort", "Location", "Set", "Status", "SampleName")
infos_leicester <- sapply(1:length(GSE107993_metadata@gsms), FUN = get_info,
meta = GSE107993_metadata)
infos_df_leicester <- cbind.data.frame(GSMID = names(GSE107993_metadata@gsms),
t(infos_leicester))
rownames(infos_df_leicester) <- names(GSE107993_metadata@gsms)
colnames(infos_df_leicester)[-1] <- c("Cohort", "Location", "Status", "Set",
"SampleName")
infos_df_leicester <- infos_df_leicester[, c(1, 2, 3, 5, 4, 6)]
infos_singhania <- rbind.data.frame(infos_df_london, infos_df_sf, infos_df_leicester)
expr_singhania_raw <- merge(x = merge(x = London_raw, y = SouthAfrica_raw, by = c(1:3)),
y = Leicester_raw, by = c(1:3))
expr_singhania_norm <- merge(x = merge(x = London_norm, y = SouthAfrica_norm,
by = c(1:3)), y = Leicester_norm, by = c(1:3))
save(infos_singhania, expr_singhania_raw, expr_singhania_norm, London_norm,
SouthAfrica_norm, Leicester_norm, file = "SinghaniaTB_data.Rdata")
The counts matrix outputted from the alignement usually contains many genes that are never observed.
1.a) remove the genes that are never observed, i.e. with 0 counts in every samples
Low counts are still badly understood by the community and their measure is less reliable than higher counts, therefore it is common to remove genes with very low counts. In particular, Singhania et al considered only genes “expressed with counts per million (CPM) > 2 in at least five samples” (such filters can seem a bit ad hoc).
1.b) compute the counts per million and keep only the genes that have a cpm>2 in at least 5 samples
1.c) Finally keep only the protein coding genes (i.e. for which
Gene_biotype
isprotein_coding
)
2.a) Display the boxplots of each samples \(log2\) raw counts alongside each other and comment the need for normalization.
The sequencing depth problem is commonly treated with the TMM normalization. This normalization divides the raw count by the library size.
2.b) use the
edgeR
package to compute the TMM normalization after converting the filtered count matrix to aDGEList
object
2.c) use the
edgeR::cpm()
function to return the filtered, TMM normalized, log2-Counts per million.
2.d) Compare the obtained values to the one shared publicly.
2.e) Display the boxplots of each samples \(log2\)-cpm normalized expressions alongside each other and comment.
2.f) Display the dendrogram of the hierarchical clusterings using Wards method and the EUclidean distance comparing the raw counts and the normalized \(log2\)-cpm.
3.a) Formulate the research question of interest here.
3.b) With the function
model.matrix()
generate a model matrix for the (generalized) linear model that you will use to answer that question.
edgeR
pipeline(Robinson, McCarthy, and Smyth 2010)
4.a) Using
edgeR::estimateCommonDisp()
add a (common) dispertion estimate to the DGE object
4.b) Using
edgeR::glmFit()
andedgeR::glmLRT()
functions, identify the differentially expressed genes (adjusted p-value < 5% and | log-FoldChange | > 1 )
voom
-limma
pipeline(Smyth et al. 2002)
5.a) compute the precision weights with the
limma::voom()
function.
5.b) now apply the
limma
pipeline with the functionslimma::lmFit()
,limma::eBayes()
andlimma::decideTests()
.
6.a) Using an Up-Set diagramm (Conway, Lex, and Gehlenborg 2017) with the
UpSetR
package, compare the results betweenedgeR
andvoom
-limma
6.b) Compare to the results from the Singhania et al. original article (in particular to the 373 gene signature). Can you list the elements that differed in their analysis compared to what you just did here ?
Conway, Jake R, Alexander Lex, and Nils Gehlenborg. 2017. “UpSetR: An R Package for the Visualization of Intersecting Sets and Their Properties.” Bioinformatics 33 (18): 2938–40.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.
Singhania, Akul, Raman Verma, Christine M Graham, Jo Lee, Trang Tran, Matthew Richardson, Patrick Lecine, et al. 2018. “A Modular Transcriptional Signature Identifies Phenotypic Heterogeneity of Human Tuberculosis Infection.” Nature Communications 9 (1): 2308.
Smyth, Gordon K, Matthew Ritchie, Natalie Thorne, James Wettenhall, Wei Shi, and Yifang Hu. 2002. “Limma: Linear Models for Microarray and Rna-Seq Data User’s Guide.”