Source: https://github.com/markziemann/bioinformatics_intro_workshop
Previously in the series, we have been learning about the basics of RStudio, R Markdown and version control with git.
In this session, we will be using workshop time to write a R Markdown script to analyse some RNA-seq data.
This is gene expression data from 39 lung cancer patients that includes both tumor and normal adjacent tissue, which is available from NCBI GEO under accession number GSE158420.
The general goals of today’s session is to conduct the following:
Data import, checking and cleaning.
Basic quality control: count the number of reads per sample.
Principal component analysis: quantify within and between group variation.
Statistical analysis of differential expression.
Visualisation of key findings.
For all of these steps, I will be providing you helpful tips for conducting the analysis, but it is up to you to put the pieces into a complete workflow.
As you are growing the workflow, be sure to knit it regularly to ensure the parts are working well together.
For best results, work in small groups of 2-3.
The RNA-seq counts are available from the GEO Accession. In your script, include a command to download the file from GEO. Use read.table()
to import the data. The row names (gene symbols) might not be unique, so use the row.names=NULL
option to read it in. Use duplicated()
to find out how many gene symbols are duplicated. Non-unique row names can be dealt with using aggregate()
.
xa <- aggregate(. ~ row.names,x,sum)
Next, use the dim()
, str()
and head()
commands to do some basic checks of the data structure. Do you notice anything strange?
There is an alternative download link in the bioinfo_workshop folder here.
For RNA-seq, there are several types of QC that can be done. Let’s just focus on quantifying the number of reads per sample. Present the data as a bargraph, using the barplot()
or ggplot2()
functions.
The size of the chart can be adjusted using fig.width
and fig.height
options in the chunk header.
Also check the number of genes that are included in this dataset. If there are ~60,000 genes annotated in the human genome, do you think this dataset has already undergone some filtering of lowly expressed genes?
Does the dataset need any additional filtering of lowly expressed genes?
You can use the princomp()
or cmdscale()
functions. Here is an example with cmdscale, that saves an object called mds
that has the coordinates of the first three dimensions.
mds <- cmdscale(dist(t(scale(x))),k=3)
Next, use plot()
to plot the first two dimensions.
Use text()
to overlay the sample names.
Now can you use different colours for tumor and normal tissues? The colours of each point can be provided as a vector in the plot()
function. The vector needs to be the same length as the number of columns in the dataset.
Make another barplot that shows the magnitude of eigenvalues for the first 10 dimensions.
In this step we will conduct differential analysis. We will use DESeq2 because it has superior accuracy compared to other similar packages.
Here is a general workflow of how DESeq2 can be applied, which does a paired
analysis.
library("DESeq2") #bioconductor package
dds <- DESeqDataSetFromMatrix(countData = x , colData=ss, design = ~ patient + tumor)
res <- DESeq(dds)
z <- results(res)
vsd <- vst(dds, blind=FALSE)
zz <- cbind(as.data.frame(z),assay(vsd))
dge <- as.data.frame(zz[order(zz$pvalue),])
The missing element from the above code chunk is the samplesheet (ss). We will need to set up a sample sheet that has the patient ID and whether the sample is normal or tumour.
The samplesheet should be included in the report in a nice kable table.
library("kable") # cran package
ss %>%
kbl(caption = "Sample sheet") %>%
kable_paper("hover", full_width = F)
Once DESeq2 is done, count the number of genes with FDR<0.05 that are up- and down-regulated. The subset()
function can be used for this.
Make another two kable tables showing the top 20 up annd down regulated genes.
MA plot: use plot()
with log10(baseMean) on the x-axis and log2 fold change on the y-axis. Mark significant genes in a different colour. This can be achieved by first running plot()
and then points()
. The points()
function adds additional data series to the plot. Bonus points for adding a subtitle that includes the number of genes in total, the number with FDR<0.05 that are up- and down-regulated respectively. The mtext()
function can add a subheading.
Volcano plot: Similar to the above, except that log2 fold change is on the x-axis and -log10 p-value is on the y-axis.
Heatmap: Use the heatmap.2
function of the gplots()
CRAN package to make a heatmap with the top 50 genes.
Here is an example of plotting Pearson correlation coefficient with the default red and yellow palette.
heatmap.2(cor(xx),trace="n",scale="none", main="Pearson correlation heatmap")
Here is how you make a blue, white and red palette which is popular.
colfunc <- colorRampPalette(c("blue", "white", "red"))
Here is an example of making a heatmap from data in a matrix (mx). The matrix needs to have normalised values. Normalisation by total library size is simplest.
heatmap.2( mx, col=colfunc(25),
scale="row" , trace="none",
margins = c(6,10), cexRow=.6, cexCol=.5, main="Top 50 genes")
Normalising by library size can be done using apply()
, dividing counts by the total number for that sample and multiplying by 1e6.
rpm <- apply(x,2,function(y) {y/sum(y)*1e6} )
Alternatively, use the normalised values provided by DESeq2.
Bonus points for using ColSideColors
to specify which samples are tumour and which are normal.
I’m not sure whether we will have enough time to go through this in the workshop, so I have included a simple enrichment analysis using the fgsea
package.
The results can be shown as a volcano plot, or as a barplot of the sets with the biggest magnitude enrichment scores that need the preset significance threshold (FDR<0.01).
library("fgsea") #bioconductor
dge$stat[which(is.na(dge$stat))] <- 0 # substitute 0 for NA vals
stat <- dge$stat #use DESeq2 test stat for ranking genes
names(stat) <- rownames(dge) #attach gene names to values
fres <- fgsea(pathways=gs,stats=stat,minSize=5, nproc=4) #run fgsea
fsig <- subset(fres,padj<0.05) # subset significant sets
nrow(fsig)
fres <- fres[order(-abs(fres$ES)),]
head(subset(fres,padj<0.05 & ES>0),20) %>%
kbl(caption="mRNA upregulated pathways") %>%
kable_styling("hover",full_width=FALSE)
head(subset(fres,padj<0.05 & ES<0),20) %>%
kbl(caption="mRNA downregulated pathways") %>%
kable_styling("hover",full_width=FALSE)
NRES=nrow(fres)
NSIG=nrow(fsig)
NUP=nrow(subset(fsig,ES>0))
NDN=nrow(subset(fsig,ES<0))
HEADER=paste(NRES,"pathways,",NSIG,"with FDR<0.05,",NUP,"up and",NDN,"down")
plot(fres$ES,-log10(fres$pval),pch=19,cex=0.6,col="gray",
xlab="ES",ylab="-log10 p-value",main="FGSEA")
points(fsig$ES,-log10(fsig$pval),cex=0.6,col="red",pch=19)
mtext(HEADER)
If you have gotten this far - awesome work!
Automatically add a bibliography to your report using R Markdown. Download the desired style from GitHub here. Specify this file in the YAML header, along with the BIB file.
bibliography: references.bib csl: bmc-bioinformatics.csl
Then use BibGuru or CiteDrive to get the .bib format of your articles. The BIB entry for the DESeq2 paper looks like this:
@ARTICLE{Love2014-wg,
title = "Moderated estimation of fold change and dispersion for {RNA-seq}
data with {DESeq2}",
author = "Love, Michael I and Huber, Wolfgang and Anders, Simon",
journal = "Genome Biol.",
volume = 15,
number = 12,
year = 2014,
language = "en"
}
In the text, the article can be cited like this:[@Love2014-wg]
Make sure you have a Bibliography sub-heading at the foot of the report.
For reproducibility.
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] beeswarm_0.4.0 tictoc_1.2.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.36 R6_2.5.1 fastmap_1.2.0 xfun_0.45
## [5] cachem_1.1.0 knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.27
## [9] lifecycle_1.0.4 cli_3.6.3 sass_0.4.9 jquerylib_0.1.4
## [13] compiler_4.4.1 tools_4.4.1 evaluate_0.24.0 bslib_0.7.0
## [17] yaml_2.3.9 rlang_1.1.4 jsonlite_1.8.8