Let’s QC this data.
library("gplots")
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library("reshape2")
Please have a look at the multiQC report. Here are a few key points:
Skewer trimming resulted in loss of only a tiny number of bases. This indicates the sequence quality is very high.
Fastqc results showing the number of unique and duplicate reads indicates a few samples with <10M unique reads.
Per seqence GC content showed an unusual profile for two samples. PG1423-EOS R1 and R2 had GC profile max at 40% compared to the mean. PG2090-EOS also showed an unusual pattern with underrepresented low GC%.
Sequence duplication levels were elevated for some fastq files. Here are the files of concern, with <20% unique reads: PG3627-POD1_S86_R1_001 PG3627-POD1_S86_R2_001 PG3609-T0_S317_R1_001 PG2090-EOS_S134_R1_001 PG2090-EOS_S134_R2_001
There were two files with overrepresented sequences: PG2090-EOS R1 and R2. Others are okay.
Adapter content was very low which is good.
The fastq files were also checked with validatefastq-assembly which looks for signs of file corruption which can occur in large data transfers. No problematic files were detected.
Ribosomal RNA carryover can be a source of noise. The proportion should be <10% and there were a few samples in excess of this including PG2020-EOS, PG815-EOS, PG1452-EOS and PG702-POD1.
rrna <- read.table("rrna_stats.txt")
rrna <- rrna[,c(1,5)]
rrna$V1 <- sapply(strsplit(rrna$V1,"\\."),"[[",1)
rrna$V5 <- gsub("\\(","",rrna$V5)
rrna$V5 <- gsub("%","",rrna$V5)
rrna$V5 <- as.numeric(rrna$V5)
str(rrna)
## 'data.frame': 319 obs. of 2 variables:
## $ V1: chr "3166-POD1_S266_R1_001" "3166-T0_S265_R1_001" "3167-POD1_S268_R1_001" "3167-T0_S267_R1_001" ...
## $ V5: num 0.57 1.11 0.61 0.93 0.96 0.79 0.7 5.2 1.14 2.83 ...
rrna2 <- rrna[,2]
names(rrna2) <- rrna[,1]
par(mar=c(5,8,3,1))
barplot(rrna2,horiz=TRUE,las=1,cex.names=0.5,main="rRNA carryover")
rrna2 <- rrna2[order(-rrna2)]
barplot(head(rrna2,20),horiz=TRUE,las=1,cex.names=0.6,main="rRNA carryover")
tmp <- read.table("3col.tsv.gz",header=FALSE)
x <- as.matrix(acast(tmp, V2~V1, value.var="V3", fun.aggregate = sum))
x <- as.data.frame(x)
accession <- sapply((strsplit(rownames(x),"\\|")),"[[",2)
symbol<-sapply((strsplit(rownames(x),"\\|")),"[[",6)
x$geneid <- paste(accession,symbol)
xx <- aggregate(. ~ geneid,x,sum)
rownames(xx) <- xx$geneid
colnames <- gsub("T0R","T0",colnames(xx))
xx$geneid = NULL
xx <- round(xx)
xx[1:10,1:6]
## 3166-POD1 3166-T0 3167-POD1 3167-T0 3171-POD1
## ENSG00000000003.15 TSPAN6 3 1 5 5 23
## ENSG00000000005.6 TNMD 0 0 0 0 0
## ENSG00000000419.14 DPM1 685 577 521 735 811
## ENSG00000000457.14 SCYL3 622 611 550 777 789
## ENSG00000000460.17 C1orf112 181 171 232 263 215
## ENSG00000000938.13 FGR 33797 44344 31524 38959 26402
## ENSG00000000971.16 CFH 106 40 98 183 195
## ENSG00000001036.14 FUCA2 1229 769 1150 868 978
## ENSG00000001084.13 GCLC 944 1085 577 961 908
## ENSG00000001167.15 NFYA 1243 1277 1295 1605 1166
## 3171-T0
## ENSG00000000003.15 TSPAN6 4
## ENSG00000000005.6 TNMD 1
## ENSG00000000419.14 DPM1 494
## ENSG00000000457.14 SCYL3 575
## ENSG00000000460.17 C1orf112 196
## ENSG00000000938.13 FGR 33751
## ENSG00000000971.16 CFH 130
## ENSG00000001036.14 FUCA2 805
## ENSG00000001084.13 GCLC 798
## ENSG00000001167.15 NFYA 1251
Let’s look at the number of reads per sample
Most samples were in the range of 25-30 million assigned reads. Just 2 samples had less than 20 million reads: PG1452-EOS and PG1423-EOS. The maximum read count was about 40 million for PG7072-EOS.
xxcs <- colSums(xx)
par(mar=c(5,8,3,1))
barplot(xxcs,horiz=TRUE,las=1,main="no. reads per sample")
barplot(head(xxcs[order(xxcs)],20),horiz=TRUE,las=1,main="lowest no. reads per sample")
barplot(head(xxcs[order(-xxcs)],20),horiz=TRUE,las=1,main="highest no. reads per sample")
Some outliers are apparent.
PG2090-EOS to the left of the chart - this is clearly the effect of rRNA carryover. Other samples over to the left of the chart include PG815-EOS, PG145-EOS and PG702-POD1 which all have elevated rRNA.
mds <- cmdscale(dist(t(xx)))
par(mar=c(5,5,3,1))
minx <- min(mds[,1])
maxx <- max(mds[,1])
miny <- min(mds[,2])
maxy <- max(mds[,2])
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) ,
type = "p", col="gray", pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
text(mds, labels=rownames(mds), cex=0.8)
col <- rownames(mds)
col <- sapply(strsplit(col,"-"),"[[",2)
col <- gsub("T0","lightblue",col)
col <- gsub("POD1","orange",col)
col <- gsub("EOS","pink",col)
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) , cex=1.5 ,
type = "p", col=col, pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
#text(mds, labels=rownames(mds), cex=0.8)
mtext("blue=T0, orange=POD1, pink=EOS")
Exclude PG2090-EOS and repeat the analysis.
xx <- xx[,grep("PG2090-EOS",colnames(xx),invert=TRUE)]
mds <- cmdscale(dist(t(xx)))
par(mar=c(5,5,3,1))
minx <- min(mds[,1])
maxx <- max(mds[,1])
miny <- min(mds[,2])
maxy <- max(mds[,2])
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) ,
type = "p", col="gray", pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
text(mds, labels=rownames(mds), cex=0.8)
col <- rownames(mds)
col <- sapply(strsplit(col,"-"),"[[",2)
col <- gsub("T0","lightblue",col)
col <- gsub("POD1","orange",col)
col <- gsub("EOS","pink",col)
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) , cex=1.5 ,
type = "p", col=col, pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
#text(mds, labels=rownames(mds), cex=0.8)
mtext("blue=T0, orange=POD1, pink=EOS")
In the MDS plot with PG2090-EOS removed, there appears to be some separation of T0, POD1 and EOS samples. POD1 (orange) are more towards the upper side of the chart and T0 (blue) are toward the bottom right. EOS (pink) are quite spread out.
PG2090-EOS suffered rRNA carryover and needs to be re-prepared. The other samples with slightly higher rRNA are not a problem as the rRNA can be corrected for statistically. not sure what to do about samples with low numbers of unique reads.
For reproducibility
sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS/LAPACK: /ceph-g/opt/miniconda3/envs/R40/lib/libopenblasp-r0.3.9.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.4 gplots_3.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 gtools_3.9.2 digest_0.6.25 bitops_1.0-7
## [5] plyr_1.8.6 magrittr_1.5 evaluate_0.14 KernSmooth_2.23-17
## [9] rlang_0.4.11 stringi_1.4.6 rmarkdown_2.3 tools_4.0.1
## [13] stringr_1.4.0 xfun_0.15 yaml_2.2.1 compiler_4.0.1
## [17] caTools_1.18.2 htmltools_0.5.0 knitr_1.28