Introduction

Let’s QC this data.

library("gplots")
library("reshape2")

Multi-qc results

Please have a look at the multiQC report. Here are a few key points:

  • Skewer trimming resulted in loss of only a tiny number of bases. This indicates the sequence quality is very high.

  • Fastqc results showing the number of unique and duplicate reads indicates a few samples with <10M unique reads.

  • Per seqence GC content showed an unusual profile for two samples. PG1423-EOS R1 and R2 had GC profile max at 40% compared to the mean. PG2090-EOS also showed an unusual pattern with underrepresented low GC%.

  • Sequence duplication levels were elevated for two fastq files.

  • There were two files with overrepresented sequences: PG2090-EOS R1 and R2. Others are okay.

  • Adapter content was very low which is good.

The fastq files were also checked with validatefastq-assembly which looks for signs of file corruption which can occur in large data transfers. No problematic files were detected.

rRNA amount

Ribosomal RNA carryover can be a source of noise. The proportion should be <10% and there were a few samples in excess of this including PG2020-EOS, PG815-EOS, PG1452-EOS and PG702-POD1.

rrna <- read.table("rrna_stats.txt")
rrna <- rrna[,c(1,5)]
rrna$V1 <- sapply(strsplit(rrna$V1,"\\."),"[[",1)
rrna$V5 <- gsub("\\(","",rrna$V5)
rrna$V5 <- gsub("%","",rrna$V5)
rrna$V5 <- as.numeric(rrna$V5)
str(rrna)
## 'data.frame':    319 obs. of  2 variables:
##  $ V1: chr  "3166-POD1_S266_R1_001" "3166-T0_S265_R1_001" "3167-POD1_S268_R1_001" "3167-T0_S267_R1_001" ...
##  $ V5: num  0.57 1.11 0.61 0.93 0.96 0.79 0.7 5.2 1.14 2.83 ...
rrna2 <- rrna[,2]
names(rrna2) <- rrna[,1]

par(mar=c(5,8,3,1))
barplot(rrna2,horiz=TRUE,las=1,cex.names=0.5,main="rRNA carryover")

rrna2 <- rrna2[order(-rrna2)]
barplot(head(rrna2,20),horiz=TRUE,las=1,cex.names=0.6,main="rRNA carryover")

Load the data

tmp <- read.table("3col.tsv.gz",header=FALSE)
x <- as.matrix(acast(tmp, V2~V1, value.var="V3", fun.aggregate = sum))
x <- as.data.frame(x)
accession <- sapply((strsplit(rownames(x),"\\|")),"[[",2)
symbol<-sapply((strsplit(rownames(x),"\\|")),"[[",6)
x$geneid <- paste(accession,symbol)
xx <- aggregate(. ~ geneid,x,sum)
rownames(xx) <- xx$geneid
colnames <- gsub("T0R","T0",colnames(xx))
xx$geneid = NULL
xx <- round(xx)
xx[1:10,1:6]
##                             3166-POD1 3166-T0 3167-POD1 3167-T0 3171-POD1
## ENSG00000000003.15 TSPAN6           3       1         5       5        23
## ENSG00000000005.6 TNMD              0       0         0       0         0
## ENSG00000000419.14 DPM1           685     577       521     735       811
## ENSG00000000457.14 SCYL3          622     611       550     777       789
## ENSG00000000460.17 C1orf112       181     171       232     263       215
## ENSG00000000938.13 FGR          33797   44344     31524   38959     26402
## ENSG00000000971.16 CFH            106      40        98     183       195
## ENSG00000001036.14 FUCA2         1229     769      1150     868       978
## ENSG00000001084.13 GCLC           944    1085       577     961       908
## ENSG00000001167.15 NFYA          1243    1277      1295    1605      1166
##                             3171-T0
## ENSG00000000003.15 TSPAN6         4
## ENSG00000000005.6 TNMD            1
## ENSG00000000419.14 DPM1         494
## ENSG00000000457.14 SCYL3        575
## ENSG00000000460.17 C1orf112     196
## ENSG00000000938.13 FGR        33751
## ENSG00000000971.16 CFH          130
## ENSG00000001036.14 FUCA2        805
## ENSG00000001084.13 GCLC         798
## ENSG00000001167.15 NFYA        1251

Number of reads per sample

Let’s look at the number of reads per sample

Most samples were in the range of 25-30 million assigned reads. Just 2 samples had less than 20 million reads: PG1452-EOS and PG1423-EOS. The maximum read count was about 40 million for PG7072-EOS.

xxcs <- colSums(xx)
par(mar=c(5,8,3,1))
barplot(xxcs,horiz=TRUE,las=1,main="no. reads per sample")

barplot(head(xxcs[order(xxcs)],20),horiz=TRUE,las=1,main="lowest no. reads per sample")

barplot(head(xxcs[order(-xxcs)],20),horiz=TRUE,las=1,main="highest no. reads per sample")

MDS

Some outliers are apparent.

PG2090-EOS to the left of the chart - this is clearly the effect of rRNA carryover. Other samples over to the left of the chart include PG815-EOS, PG145-EOS and PG702-POD1 which all have elevated rRNA.

mds <- cmdscale(dist(t(xx)))

par(mar=c(5,5,3,1))
minx <- min(mds[,1])
maxx <- max(mds[,1])
miny <- min(mds[,2])
maxy <- max(mds[,2])

plot(mds, xlab="Coordinate 1", ylab="Coordinate 2", 
  xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) ,
  type = "p", col="gray", pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
text(mds, labels=rownames(mds), cex=0.8) 

col <- rownames(mds)
col <- sapply(strsplit(col,"-"),"[[",2)
col <- gsub("T0","lightblue",col)
col <- gsub("POD1","orange",col)
col <- gsub("EOS","pink",col)

plot(mds, xlab="Coordinate 1", ylab="Coordinate 2", 
  xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) , cex=1.5 , 
  type = "p", col=col, pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
#text(mds, labels=rownames(mds), cex=0.8) 
mtext("blue=T0, orange=POD1, pink=EOS")

Exclude PG2090-EOS and repeat the analysis.

xx <- xx[,grep("PG2090-EOS",colnames(xx),invert=TRUE)]

mds <- cmdscale(dist(t(xx)))

par(mar=c(5,5,3,1))
minx <- min(mds[,1])
maxx <- max(mds[,1])
miny <- min(mds[,2])
maxy <- max(mds[,2])

plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
  xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) ,
  type = "p", col="gray", pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
text(mds, labels=rownames(mds), cex=0.8)

col <- rownames(mds)
col <- sapply(strsplit(col,"-"),"[[",2)
col <- gsub("T0","lightblue",col)
col <- gsub("POD1","orange",col)
col <- gsub("EOS","pink",col)

plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
  xlim=c(minx*1.1,maxx*1.1), ylim = c(miny*1.1,maxy*1.1) , cex=1.5 ,
  type = "p", col=col, pch=19, cex.axis=1.3,cex.lab=1.3, bty='n')
#text(mds, labels=rownames(mds), cex=0.8) 
mtext("blue=T0, orange=POD1, pink=EOS")

Conclusion

PG2090-EOS suffered rRNA carryover and needs to be re-prepared. The other samples with slightly higher rRNA are not a problem as the rRNA can be corrected for statistically.

Session information

For reproducibility

sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /ceph-g/opt/miniconda3/envs/R40/lib/libopenblasp-r0.3.9.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.4 gplots_3.1.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6       gtools_3.8.2       digest_0.6.25      bitops_1.0-6      
##  [5] plyr_1.8.6         magrittr_1.5       evaluate_0.14      KernSmooth_2.23-17
##  [9] rlang_0.4.6        stringi_1.4.6      rmarkdown_2.3      tools_4.0.1       
## [13] stringr_1.4.0      xfun_0.15          yaml_2.2.1         compiler_4.0.1    
## [17] caTools_1.18.0     htmltools_0.5.0    knitr_1.28