Bioinformatics data skills workshop - Session 5: Intermediate R - functions, vectorization, parallel programming

Source: https://github.com/markziemann/bioinformatics_intro_workshop

library("tictoc")
library("parallel")
library("beeswarm")

Functional programming

Functions are a key concept in computing. By declaring functions, we can execute complex routines with a single command. Functions are the backbone of most computer programs and they can help us with our data analysis projects by providing us reusable modules which we can transfer between projects. For example, I have customised functions that I use for all my RNA-seq studies. Re-using these functions allows me to complete an analysis of RNA-seq datasets in only a few hours of hands-on time, and prevents me from having to type out these long routines again and again.

If you have a bunch of these functions you use regularly, then you can consider turning them into a package. A package doesn’t necessarily need to be in CRAN or Bioconductor, you can keep it private or just within your organisation.

Functions in R

A function is a stored command which can be rerun with different inputs. We define the function using the function() command where the function itself is defined in the {} curly brackets as shown below.

Unlike this example, functions are normally more than one line. Your function can specify more than one input but has to return just one output. return is used to select what data should be returned to the user. If a function provides a lot of output data, then that data typically gets converted to a list format before being returned to the user.

cubed <- function(x) {
    x^3
}

cubed(42)

## [1] 74088

cubed(c(1,2,4))

## [1]  1  8 64

So you can consider functions to be “shortcuts”, to help you execute custom codes that you will be reusing again and again.

Functions are a bit like lego blocks in that they are versatile. For example you can use functions inside other functions.

cubedsum <- function(x,y) {
    z <- cubed(x) + cubed(y)
    return(z)
}

cubedsum(3,4)

## [1] 91

Using this approach, you can make incredibly sophisticated programs out of relatively simple building blocks.

The inputs don’t need to be single numbers, they can be vectors or dataframes. In this example we create a function to automate linear model analysis.

str(cars)

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

mylm <- lm(cars[,2] ~ cars[,1])

INT=signif(mylm$coefficients[1],3)
INT

## (Intercept) 
##       -17.6

SLOPE=signif(mylm$coefficients[2],3)
SLOPE

## cars[, 1] 
##      3.93

HEADER=paste("Slope:",SLOPE,"Int:",INT)
HEADER

## [1] "Slope: 3.93 Int: -17.6"

myfunc <- function(x,y) {
  mylm <- lm(y ~ x)
  INT=signif(mylm$coefficients[1],3)
  SLOPE=signif(mylm$coefficients[2],3)
  HEADER=paste("Slope:",SLOPE,"Int:",INT)
  plot(x,y)
  abline(mylm,lty=2,lwd=2,col="red")
  mtext(HEADER)
}

myfunc(cars[,1],cars[,2])

R functions can be reused. Let’s try with some built-in datasets.

myfunc(women$height, women$weight)

myfunc(ChickWeight$Time, ChickWeight$weight)

myfunc(Loblolly$age, Loblolly$height)

Let’s try to make our own functions with the next 30 mins. Make a function that accepts 2 numerical vectors and does the following:

Student t-test with the syntax t.test(vec2,vec1).
Make a boxplot of the two groups with a subheading that contains the t.test p-value using the mtext() command.

In order to generate two numeric vectors you can use the following:

vec1 <- rnorm(n=10, mean=20, sd=10)

vec2 <- rnorm(n=10, mean=30, sd=15)

Apply this function to calculate the difference in weight between control and treatment groups in the built-in PlantGrowth dataset.

Vectorisation with sapply and lapply

The preferred solution to repeat a task on many inputs is to use vectorisation. Let’s say we want to find the median of three matrices. We create a new object, either a vector or list and then run a function on each element.

Note below the difference in the output type given by vapply, lapply and sapply.

mx1 <- matrix( rbinom(n=100, size=50, prob=0.8), nrow = 10 )
mx2 <- matrix( rbinom(n=100, size=100, prob=0.8), nrow = 10 )
mx3 <- matrix( rbinom(n=100, size=300, prob=0.8), nrow = 10 )

mxs <- list(mx1,mx2,mx3)

str(mxs)

## List of 3
##  $ : int [1:10, 1:10] 38 33 33 38 43 41 43 40 39 40 ...
##  $ : int [1:10, 1:10] 75 80 75 81 90 78 74 86 74 77 ...
##  $ : int [1:10, 1:10] 235 235 233 243 237 237 247 235 250 244 ...

# sapply output is either a vector or matrix
sapply(X = mxs, FUN = mean)

## [1]  40.20  79.82 240.20

# lapply output is always a list
lapply(X = mxs, FUN = mean)

## [[1]]
## [1] 40.2
## 
## [[2]]
## [1] 79.82
## 
## [[3]]
## [1] 240.2

Take a few minutes now to visit the help page for these three commands.

?sapply

As you can see from the code above, vectorisation leads to shorter and more readable code. Naturally, you can use your custom functions with sapply and other apply family functions.

Here I’ve created a function to calculate the mean correlation in the columns of a matrix.

meancor <- function(x) {
    mycor <- cor(x)
    mean(mycor)
}

meancor(mx1)

## [1] 0.06474712

sapply(mxs,meancor)

## [1] 0.06474712 0.13107528 0.08570490

Your turn. Use sapply to find the mean and standard deviation of each matrix (mx1,mx2 and mx3).

Apply - for two dimensional objects

Sometimes you have a large data table and you want to run the same analysis on each row or column. To do this, use apply. The general syntax is:

apply( x, MARGIN, FUN )

Where x is the object (df or mx) to work on.

MARGIN is either 1 for rows and 2 for columns.

FUN is the function to ron on the data.

For example, here is how you can calculate median of rows or columns in a matrix or dataframe.

# row medians
apply(mx1, 1, median)

##  [1] 40.0 41.5 39.5 40.0 38.0 41.0 40.0 40.0 40.5 39.5

# column medians
apply(mx2, 2, median)

##  [1] 77.5 83.0 80.5 80.5 80.5 83.0 76.5 80.5 77.5 79.0

In this next example I will create a new function that will calculate the coefficient of variation (CV), and then run this for each of the columns in the mtcars data.

cv <- function(x) {
  mymean <- mean(x)
  mysd <- sd(x)
  cv <- mysd/mymean
  signif(cv,3)
}

# run on one vector
cv(mtcars$mpg)

## [1] 0.3

# now run on all columns in the df
apply(X = mtcars, MARGIN = 2  ,FUN = cv)

##   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
## 0.300 0.289 0.537 0.467 0.149 0.304 0.100 1.150 1.230 0.200 0.574

Your turn. Create a function that reports the difference between the median and mean of a numerical vector, then use this function inside an apply command to report interquartile ranges of rows in the mx1 matrix.

Parallel processing

Sometimes we need to repeat some computationally intensive work many times. This can be sped up using the many CPU threads on modern processors.

For example we have some proteomics data with 2000 proteins and want to run a t-test.

First let’s make a synthetic dataset.

nreps=5 # number of replicates per group
nanalytes=20000 # number of proteins detected
nup=100 # number of overexpressed proteins.
ndn=300 # number of downexpressed proteins.
foldchange=2 # fold change of proteins
setseed=42 # set seed so that results are repeatable
set.seed(setseed)
mx1 <- matrix(rnorm(n=nanalytes*10,mean=1000,sd=100),ncol=nreps*2)
colnames(mx1) <- c(paste("ctl",1:nreps,sep=""), paste("trt",1:nreps,sep=""))
rownames(mx1) <- paste("prot",sprintf('%0.5d', 1:nanalytes),sep="")
ups <- sample(x=1:nrow(mx1),size=nup,replace=FALSE)
dns <- sample(x=setdiff(1:nrow(mx1),ups),size=ndn,replace=FALSE)
trt <- mx1[,(nreps+1):(nreps*2)]
trt[ups,] <- trt[ups,] * foldchange
trt[dns,] <- trt[dns,] * 1 / foldchange
ctl <- mx1[,1:nreps]
mx2 <- cbind(ctl,trt)

str(mx2)

##  num [1:20000, 1:10] 1137 944 1036 1063 1040 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:20000] "prot00001" "prot00002" "prot00003" "prot00004" ...
##   ..$ : chr [1:10] "ctl1" "ctl2" "ctl3" "ctl4" ...

head(mx2)

##                ctl1      ctl2      ctl3      ctl4     ctl5      trt1      trt2
## prot00001 1137.0958  940.5540 1026.4115 1038.8986 909.4732  958.5465  951.6831
## prot00002  943.5302  978.4160  893.5450 1044.1635 953.7112  999.2725 1007.3538
## prot00003 1036.3128 1008.3821 1059.7282  946.7259 892.1259  818.6022 1002.4445
## prot00004 1063.2863 1053.8968 1036.9734  921.9177 734.6694  918.4936 1106.6502
## prot00005 1040.4268 1076.4908 1018.6671  987.5584 924.1994  967.1660  946.6561
## prot00006  989.3875  992.1326  906.2417  919.8680 952.0671 1066.9797 1057.3063
##                trt3      trt4      trt5
## prot00001  815.9130  997.0441 1014.3964
## prot00002 1110.3072  953.1701 1048.0025
## prot00003 1123.8082 1221.0411  994.8300
## prot00004 1196.7389  999.7897  956.1159
## prot00005 1062.4115  676.1765 1104.1016
## prot00006  998.1027 1168.9585  890.6375

str(which(mx2<0))

##  int(0)

Run a t-test.

myttest <- function(mx,nctl,ntrt, paired=FALSE) {
  tres <- lapply(1:nrow(mx), function(i) {
    x <- mx[i,(nctl+1):(nctl+ntrt)]
    y <- mx[i,1:nctl]
    tres <- t.test(x=x,y=y,alternative = "two.sided",paired=paired)
    log2FC <- log2(tres$estimate[[1]]/tres$estimate[[2]])
    out=c("pval"=tres$p.value,"log2FC"=log2FC)
    return(out)
  })
  df <- as.data.frame(do.call(rbind,tres))
  rownames(df) <- rownames(mx)
  df <- df[order(df$pval),]
  df$fdr <- p.adjust(df$pval,method="fdr")
  return(df)
}

tic()
mytres <- myttest(mx=mx2,nctl=5,ntrt=5)
head(mytres,20)

##                   pval     log2FC          fdr
## prot16182 2.549739e-09 -1.1078016 5.099479e-05
## prot13850 1.606928e-08 -0.9355722 1.151590e-04
## prot04301 1.727385e-08 -1.0348036 1.151590e-04
## prot15788 4.257621e-08 -0.9892406 1.384606e-04
## prot02250 4.346404e-08 -1.0474924 1.384606e-04
## prot03761 5.213388e-08 -0.9766247 1.384606e-04
## prot15670 6.049718e-08 -0.9159408 1.384606e-04
## prot10009 6.056453e-08  0.8588148 1.384606e-04
## prot09198 6.413504e-08 -1.0849706 1.384606e-04
## prot07215 7.791593e-08 -1.0254758 1.384606e-04
## prot00963 8.443870e-08 -0.9822571 1.384606e-04
## prot11095 9.278293e-08 -1.0426860 1.384606e-04
## prot08257 9.279491e-08 -0.9768546 1.384606e-04
## prot12873 9.692240e-08  1.0860962 1.384606e-04
## prot05105 1.248597e-07 -0.9720499 1.664796e-04
## prot06454 1.485651e-07 -1.0736652 1.814681e-04
## prot18915 1.624660e-07  1.0033733 1.814681e-04
## prot05104 1.633213e-07 -1.0437723 1.814681e-04
## prot18212 1.782395e-07 -0.9700308 1.876206e-04
## prot12223 2.096930e-07 -1.0044631 1.923677e-04

toc()

## 1.377 sec elapsed

That was actually very fast, parallel processing won’t help us that much.

Now test the number of up and downregulated proteins.

nrow(subset(mytres,fdr<0.05 & log2FC>0))

## [1] 102

nrow(subset(mytres,fdr<0.05 & log2FC<0))

## [1] 270

Calculate precision and recall.

Precision = the accuracy of positive instances. AKA: specificity.

Recall = the proportion of positive instances called. AKA: sensitivity.

F1 = harmonic mean of Precision and Recall.

myups <- which(rownames(mx2) %in% rownames(subset(mytres,fdr<0.05 & log2FC>0)))
mydns <- which(rownames(mx2) %in% rownames(subset(mytres,fdr<0.05 & log2FC<0)))
myprec <- c(myups %in% ups, mydns %in% dns)
myprec <- length(which(myprec==TRUE))/length(myprec)
myrec <- c(ups %in% myups, dns %in% mydns)
myrec <- length(which(myrec==TRUE))/length(myrec)

Test 100 simulations with different set seeds.

This is a more realistic application of parallel processing because each simulation takes a few seconds and we can execute them in parallel.

simdat <- function(nreps=5, nanalytes=20000, nup=100, ndn=300, mymean=1000,
  mysd=100, foldchange=2, setseed=42){
  set.seed(setseed)
  mx1 <- matrix(rnorm(n=nanalytes*10,mean=mymean,sd=mysd),ncol=nreps*2)
  colnames(mx1) <- c(paste("ctl",1:nreps,sep=""), paste("trt",1:nreps,sep=""))
  rownames(mx1) <- paste("prot",sprintf('%0.5d', 1:nanalytes),sep="")
  ups <- sample(x=1:nrow(mx1),size=nup,replace=FALSE)
  dns <- sample(x=setdiff(1:nrow(mx1),ups),size=ndn,replace=FALSE)
  trt <- mx1[,(nreps+1):(nreps*2)]
  trt[ups,] <- trt[ups,] * foldchange
  trt[dns,] <- trt[dns,] * 1 / foldchange
  ctl <- mx1[,1:nreps]
  mx2 <- cbind(ctl,trt)
  out <- list(ups,dns,mx2)
  return(out)
}

tic()
res1 <- lapply(1:3,function(i) {
  seed=i*100
  dat <- simdat(setseed=seed)
  ups <- dat[[1]]
  dns <- dat[[2]]
  mx <- dat[[3]]
  mytres <- myttest(mx=mx,nctl=5,ntrt=5)
  myups <- which(rownames(mx2) %in% rownames(subset(mytres,fdr<0.05 & log2FC>0)))
  mydns <- which(rownames(mx2) %in% rownames(subset(mytres,fdr<0.05 & log2FC<0)))
  myprec <- c(myups %in% ups, mydns %in% dns)
  myprec <- length(which(myprec==TRUE))/length(myprec)
  myrec <- c(ups %in% myups, dns %in% mydns)
  myrec <- length(which(myrec==TRUE))/length(myrec)
  res = c("seed"=seed,"precision"=myprec,"recall"=myrec)
  return(res)
})
toc()

## 4.228 sec elapsed

res1

## [[1]]
##        seed   precision      recall 
## 100.0000000   0.9502618   0.9075000 
## 
## [[2]]
##        seed   precision      recall 
## 200.0000000   0.9761905   0.9225000 
## 
## [[3]]
##        seed   precision      recall 
## 300.0000000   0.9708223   0.9150000

Now execute 16 runs on 4 parallel threads

Use the parallel package.

tic()
res2 <- mclapply(1:16,function(i) {
  seed=i*100
  dat <- simdat(setseed=seed)
  ups <- dat[[1]]
  dns <- dat[[2]]
  mx <- dat[[3]]
  mytres <- myttest(mx=mx,nctl=5,ntrt=5)
  myups <- which(rownames(mx2) %in% rownames(subset(mytres,fdr<0.05 & log2FC>0)))
  mydns <- which(rownames(mx2) %in% rownames(subset(mytres,fdr<0.05 & log2FC<0)))
  myprec <- c(myups %in% ups, mydns %in% dns)
  myprec <- length(which(myprec==TRUE))/length(myprec)
  myrec <- c(ups %in% myups, dns %in% mydns)
  myrec <- length(which(myrec==TRUE))/length(myrec)
  res = c("seed"=seed,"precision"=myprec,"recall"=myrec)
  return(res)
},mc.cores=4)
toc()

## 6.17 sec elapsed

res2df <- data.frame(do.call(rbind,res2))
res2df

##    seed precision recall
## 1   100 0.9502618 0.9075
## 2   200 0.9761905 0.9225
## 3   300 0.9708223 0.9150
## 4   400 0.9893333 0.9275
## 5   500 0.9704301 0.9025
## 6   600 0.9711286 0.9250
## 7   700 0.9736148 0.9225
## 8   800 0.9840426 0.9250
## 9   900 0.9687500 0.9300
## 10 1000 0.9683377 0.9175
## 11 1100 0.9557292 0.9175
## 12 1200 0.9606299 0.9150
## 13 1300 0.9662338 0.9300
## 14 1400 0.9708223 0.9150
## 15 1500 0.9680000 0.9075
## 16 1600 0.9765625 0.9375

res2df$seed=NULL

Now make a chart of the results.

res2df$seed=NULL
boxplot(res2df,col="white",cex=0,ylab="Index")
beeswarm(res2df,add=TRUE,pch=19,cex=1.8,col="black")

Now repeat with more repeats and higher number of available CPU threads. Can you find out the optimum number of parallel threads for your system?

Test your understanding - take home questions

Question 1

Create a function that calculates the cubed root for a number

Solution

a <- c(8,27,64)
cubedroot <- function(x) { x^(1/3) }
cubedroot(a)

Question 2

Use sapply to calculate the cubed root for the numbers 44,77,99,123

Solution

sapply(a,cubedroot)

Question 3

Use apply to get the maximum value in each column of mtcars.

Solution

apply(mtcars,2,max)

Question 5

Use lapply or sapply to draw two numbers from 0 to 9. How frequent it is to draw the same number twice? Run 10,000 replications.

Solution


res <- sapply(1:10000, function(i) {
  x<- sample(x=0:9,2,replace=TRUE) ; x[1] == x[2]
})

table(res)

Question 6

How frequent it would be to flip 10 coins simultaneously and get all heads? Use mclapply to simulate 10 million replications. How much time can be saved with parallel execution

Solution

tic()
res <- lapply(1:1e7, function(i) {
  sum(sample(x=c(0,1),size=10,replace=TRUE))==10
}  )
length(which(res==TRUE))
toc()

tic()
res <- mclapply(1:1e7, function(i) {
  sum(sample(x=c(0,1),size=10,replace=TRUE))==10
} , mc.cores=16 )
length(which(res==TRUE))
toc()

## 39.3 seconds saved

out <- lapply(1:32,function(threads){
  tic()
  res <- mclapply(1:1e7, function(i) {
    sum(sample(x=c(0,1),size=10,replace=TRUE))==10
  } , mc.cores=threads )
  out <- toc()
  return(out$toc - out$tic)
})
out

Session information

For reproducibility.

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] beeswarm_0.4.0 tictoc_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.36     R6_2.5.1          fastmap_1.2.0     xfun_0.45        
##  [5] cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.27   
##  [9] lifecycle_1.0.4   cli_3.6.3         sass_0.4.9        jquerylib_0.1.4  
## [13] compiler_4.4.1    highr_0.11        tools_4.4.1       evaluate_0.24.0  
## [17] bslib_0.7.0       yaml_2.3.9        rlang_1.1.4       jsonlite_1.8.8

Bioinformatics data skills workshop - Session 5: Intermediate R - functions, vectorization, parallel programming

Burnet Bioinformatics Group

2024-08-14

Functional programming

Functions in R

Vectorisation with sapply and lapply

Apply - for two dimensional objects

Parallel processing

Test your understanding - take home questions

Question 1

Question 2

Question 3

Question 5

Question 6

Session information