Source: https://github.com/markziemann/SLE712_files/BioinfoPrac4.Rmd

Introduction

In this lesson we’ll be taking your R skills to the next level with functions, conditionals, loops and vectorisation.

List data

But first we need to become familiar with lists in R, as this will make the following section easier.

Lists are simply R objects which are grouped together in order. Unlike vectors, they don’t need to be the same data type. Therefore they are very flexible and commonly used to return results to the user.

Here we learn how to construct a list, as well as subset and modify the names.

a <- sample(x=1:10,size=5,replace=TRUE)
b <- 10:30
c <- rnorm(n=50,mean=1,sd=3)
d <- mtcars

mylist <- list(a,b,c,d)

str(mylist)
## List of 4
##  $ : int [1:5] 2 2 10 8 4
##  $ : int [1:21] 10 11 12 13 14 15 16 17 18 19 ...
##  $ : num [1:50] 1.67 1.52 4.4 1.02 -2.23 ...
##  $ :'data.frame':    32 obs. of  11 variables:
##   ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
##   ..$ disp: num [1:32] 160 160 108 258 360 ...
##   ..$ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##   ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##   ..$ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##   ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
##   ..$ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
##   ..$ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
##   ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
##   ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
mylist[2]
## [[1]]
##  [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
mylist[[2]]
##  [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
mylist[c(3,1)]
## [[1]]
##  [1]  1.67297341  1.52391087  4.39566042  1.01699536 -2.22857905 -4.22164325  1.60284652 -1.01637725
##  [9] -3.18456713 -2.71747291  0.37205816 -1.17450142 -1.72514033  1.04184223 -0.30841557  1.53257306
## [17] -0.44837096  7.38584192  7.55806721  0.36888921  2.19655579  1.80407081  0.34795338 -0.23599046
## [25] -0.85382447 -5.27753598  2.60138065 -2.79201115  5.50491048 -0.27872053 -0.30658101  8.14768056
## [33]  3.55155296  2.85695260 -2.48524496  6.07750496 -1.42296405  0.08388423 -1.49405288  5.14889687
## [41]  0.40906980  1.25755749 -1.92960829  3.58277372  4.48756412 -3.11514104 -0.11604674  0.09952740
## [49] -3.04173777 -3.15545076
## 
## [[2]]
## [1]  2  2 10  8  4
listnames <- c("random integers", "range", "norm dist", "dataframe")

names(mylist) <- listnames

mylist2 <- list("random integers"=a, "range"=b, "norm dist"=c, "dataframe"=d)

str(mylist2)
## List of 4
##  $ random integers: int [1:5] 2 2 10 8 4
##  $ range          : int [1:21] 10 11 12 13 14 15 16 17 18 19 ...
##  $ norm dist      : num [1:50] 1.67 1.52 4.4 1.02 -2.23 ...
##  $ dataframe      :'data.frame': 32 obs. of  11 variables:
##   ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
##   ..$ disp: num [1:32] 160 160 108 258 360 ...
##   ..$ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##   ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##   ..$ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##   ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
##   ..$ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
##   ..$ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
##   ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
##   ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...

Functions

A function is a stored command which can be rerun with different inputs. We define the function using the function() command where the function itself is defined in the {} curly brackets as shown below.

Unlinke this example, functions are normally more than one line. Your function can specify more than one input but has to return just one output. return is used to select what data should be returned to the user.

cubed <- function(x) {
    x^3
}

cubed(42)
## [1] 74088
cubed(c(1,2,4))
## [1]  1  8 64

So you can consider functions to be “shortcuts”, to help you execute custom codes that you will be reusing again and again.

Functions are a bit like lego blocks in that they are versatile. For example you can use functions inside other functions.

cubedsum <- function(x,y) {
    z <- cubed(x) + cubed(y)
    return(z)
}

cubedsum(3,4)
## [1] 91

Using this approach, you can make incredibly sophisticated programs out of relatively simple building blocks.

The inputs don’t need to be single numbers, they can be vectors or dataframes. In this example we create a function to automate linear model analysis.

str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
mylm <- lm(cars[,2] ~ cars[,1])

INT=signif(mylm$coefficients[1],3)
INT
## (Intercept) 
##       -17.6
SLOPE=signif(mylm$coefficients[2],3)
SLOPE
## cars[, 1] 
##      3.93
HEADER=paste("Slope:",SLOPE,"Int:",INT)
HEADER
## [1] "Slope: 3.93 Int: -17.6"
myfunc <- function(x,y) {
  mylm <- lm(y ~ x)
  INT=signif(mylm$coefficients[1],3)
  SLOPE=signif(mylm$coefficients[2],3)
  HEADER=paste("Slope:",SLOPE,"Int:",INT)
  plot(x,y)
  abline(mylm,lty=2,lwd=2,col="red")
  mtext(HEADER)
}

myfunc(cars[,1],cars[,2])

R functions can be reused. Let’s try with some built-in datasets.

myfunc(women$height, women$weight)

myfunc(ChickWeight$Time, ChickWeight$weight)

myfunc(Loblolly$age, Loblolly$height)

Let’s try to make our own functions with the next 30 mins. Make a function that accepts 2 numerical vectors and does the following:

  1. Student t-test with the syntax t.test(vec2,vec1).

  2. Make a boxplot of the two groups with a subheading that contains the t.test pvalue using the mtext() command.

In order to generate two numeric vectors you can use the following:

vec1 <- rnorm(n=10, mean=20, sd=10)

vec2 <- rnorm(n=10, mean=30, sd=15)

Apply this function to calculate the difference in weight between control and treatment groups in the built-in PlantGrowth dataset.

Conditionals

Commonly, your workflow will need to adapt to different inputs. For example if a file already exists, you may not want to over-write it. This is the general format of an if statement in R:

if (condition) {command to run if TRUE} 

Most of the time the command to run is written on a new line. Many times we need to run more than one command.

if (condition) {
    command1 to run if TRUE
    command2 to run if TRUE
} 

If we only want to execute the command if FALSE, then we can add ! to toggle the conditional.

2>3
## [1] FALSE
!2>3
## [1] TRUE

Here is how it looks in an actual if statement. Note the beow is only an example, the code won’t work when run because command1 and command2 are not defined.

if (!condition) {
    command1 to run if FALSE
    command2 to run if FALSE
} 

We can add else to select from two options

if (condition) {
    command1 to run if TRUE
} else {
    command2 to run if FALSE
}

Here is a real life example.

if ( !file.exists( "mtcars.tsv" ) ) {

  write.table( mtcars, file="mtcars.tsv", quote=FALSE, sep="\t")

  message("File written")

}
## File written

Here is another (better) way to do it.

if ( file.exists( "mtcars.tsv" ) ) {

  warning( "File already exists!" )

} else {

  write.table( mtcars, file="mtcars.tsv", quote=FALSE, sep="\t" )

  message( "File written" )

}
## Warning: File already exists!
# delete the file
unlink("mtcars.tsv")

We can use Boolean operators to come of with more sophisticated conditional. For example & means AND, while | means OR.

# & = AND
if( 3>2 & 5>4 ) {
  print( "ok" )
}
## [1] "ok"
if( 3>2 & 5>10 ) {
  print( "ok" )
}

# | = OR
if( 3>2 | 5>10 ) {
  print( "ok" )
}
## [1] "ok"

Here is another example. Let’s check to see when this Rmd was last updated. If it was more than a month then it’s too old.

# record current time in seconds
CURRENT_TIME <- as.numeric( Sys.time() )

# get the last modified time in seconds
MOD_TIME <- as.numeric( file.info( "BioinfoPrac4.Rmd" )$mtime )

# difference in time
DIFF_TIME <- signif( CURRENT_TIME - MOD_TIME, 3 )

MY_MESSAGE <- paste( "This prac file is", DIFF_TIME, "seconds old" ) 
message( MY_MESSAGE )
## This prac file is 17100000 seconds old
# Define a month in seconds
MONTH <- 60 * 60 * 24 * 30
message( paste( "One month is", MONTH, "seconds" ) )
## One month is 2592000 seconds
# now run the comparison
if ( DIFF_TIME > MONTH ) {

  message( "This learning material more than 1 month old. Time to update!" )

} else {

  message( "This learning material is less than 1 month old :)" )

}
## This learning material more than 1 month old. Time to update!

Now it’s your turn. Test to see if the mean petal length of setosa, versicolor an virginica are greater than 3cm. If this condition is true, print a message including the species name and measurement.

Loops

In software engineering and data analysis it is common to repeat a task with different inputs. In most computer languages a for loop would be used.

This can be done in R as demonstrated below:

for ( i in c("a","b","c") ) {

  print(i)

}
## [1] "a"
## [1] "b"
## [1] "c"
for ( i in 1:5 ) {

  print(i^2)

}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25

While this works in the above example, you’ll find that it is difficult to work downstream of the loop. In particular it is difficult to access the results of a looped command and use it in subsequent steps.

In this example we should in theory be able to save the squares into the object x but it doesn’t work as you might think.

# try storing the results
x <- for (i in 1:5) {

  i^2

}

x
## NULL

There is one way to make it work, but still it isn’t recommended.

x <- vector()

# try another way of storing the results
for (i in 1:5) {

  res <- i^2

  x[i] <- res

}

x
## [1]  1  4  9 16 25

For loops are also really slow and inefficient in R when dealing with big datasets. Nevertheless, there is also a while loop.

i=0

while( i < 5 ) {

    print(i)

    i=i+1

}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Vectorisation with sapply and lapply

The preferred solution to repeat a task on many inputs is to use vectorisation. Let’s say we want to find the median of three matrices. We create a new object, either a vector or list and then run a function on each element.

Note below the difference in the output type given by vapply, lapply and sapply.

mx1 <- matrix( rbinom(n=100, size=50, prob=0.8), nrow = 10 )
mx2 <- matrix( rbinom(n=100, size=100, prob=0.8), nrow = 10 )
mx3 <- matrix( rbinom(n=100, size=300, prob=0.8), nrow = 10 )

mxs <- list(mx1,mx2,mx3) 

str(mxs)
## List of 3
##  $ : int [1:10, 1:10] 36 39 44 40 37 44 39 41 39 40 ...
##  $ : int [1:10, 1:10] 82 77 80 80 86 75 82 81 80 74 ...
##  $ : int [1:10, 1:10] 229 248 248 231 237 238 243 249 241 248 ...
# sapply output is either a vector or matrix
sapply(X = mxs, FUN = mean)
## [1]  39.93  80.18 241.10
# lapply output is always a list
lapply(X = mxs, FUN = mean)
## [[1]]
## [1] 39.93
## 
## [[2]]
## [1] 80.18
## 
## [[3]]
## [1] 241.1

Take a few minutes now to visit the help page for these three commands.

?sapply

As you can see from the code above, vectorisation leads to shorter and more readable code. Naturally, you can use your custom functions with sapply and other apply family functions.

Here I’ve created a function to calculate the mean correlation in the columns of a matrix.

meancor <- function(x) {
    mycor <- cor(x)
    mean(mycor)
}

meancor(mx1)
## [1] 0.06263797
sapply(mxs,meancor)
## [1] 0.06263797 0.11415551 0.17629755

Your turn. Use sapply to find the mean and standard deviation of each matrix (mx1,mx2 and mx3).

Apply - for two dimensional objects

Sometimes you have a large data table and you want to run the same analysis on each row or column. To do this, use apply. The general syntax is:

apply( x, MARGIN, FUN )

Where x is the object (df or mx) to work on.

MARGIN is either 1 for rows and 2 for columns.

FUN is the function to ron on the data.

For example, here is how you can calculate median of rows or columns in a matrix or dataframe.

# row medians
apply(mx1, 1, median)
##  [1] 40.5 39.5 41.5 40.5 39.5 40.5 40.0 40.0 40.0 40.0
# column medians
apply(mx2, 2, median)
##  [1] 80.0 81.0 79.0 80.0 78.5 79.0 80.0 80.5 82.0 81.5

In this next example I will create a new function that will calculate the coefficient of variation (CV), and then run this for each of the columns in the mtcars data.

cv <- function(x) {
  mymean <- mean(x)
  mysd <- sd(x)
  cv <- mysd/mymean
  signif(cv,3)
}

# run on one vector
cv(mtcars$mpg)
## [1] 0.3
# now run on all columns in the df
apply(X = mtcars, MARGIN = 2  ,FUN = cv)
##   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
## 0.300 0.289 0.537 0.467 0.149 0.304 0.100 1.150 1.230 0.200 0.574

Your turn. Create a function that reports the difference between the median and mean of a numerical vector, then use this function inside an apply command to report interquartile ranges of rows in the mx1 matrix.

Replicate

This is a special case of sapply which repeats a task a set number of times, which could be useful for simulations.

In the example below, we create a function which simulates a coin toss where tails=0 and heads=1.

Calculate the mean of n=10 tosses. Is the result consistent?

cointoss<-function(){
    sample(c(0,1),1,replace=TRUE)
}

res <- replicate(10,cointoss())
res
##  [1] 0 1 0 0 1 1 1 1 0 0
mean(res)
## [1] 0.5
mean(replicate(10,cointoss()))
## [1] 0.4
mean(replicate(100,cointoss()))
## [1] 0.44
mean(replicate(1000,cointoss()))
## [1] 0.499
mean(replicate(10000,cointoss()))
## [1] 0.4923

See how the result converges near 50% as the number of replications increases.

Test your understanding - take home questions

Question 1

Create a function that calculates the cubed root for a number

Question 2

Use sapply to calculate the cubed root for the numbers 44,77,99,123

Question 3

Use if to test whether object mtcars.tsv exists and run head if it does exist.

Question 4

Use apply to get the maximum value in each column of mtcars.

Question 5

Use replicate to quantify how frequent it is to draw the same number twice where the pool of numbers consists of 0 to 9.

Session

sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
##  [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readxl_1.4.1
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.8.1        Biobase_2.56.0              httr_1.4.4                 
##  [4] sass_0.4.2                  jsonlite_1.8.3              bit64_4.0.5                
##  [7] splines_4.2.2               bslib_0.4.0                 assertthat_0.2.1           
## [10] highr_0.9                   stats4_4.2.2                blob_1.2.3                 
## [13] cellranger_1.1.0            GenomeInfoDbData_1.2.8      yaml_2.3.6                 
## [16] pillar_1.8.1                RSQLite_2.2.18              lattice_0.20-45            
## [19] glue_1.6.2                  digest_0.6.30               GenomicRanges_1.48.0       
## [22] RColorBrewer_1.1-3          XVector_0.36.0              colorspace_2.0-3           
## [25] htmltools_0.5.3             Matrix_1.5-1                DESeq2_1.36.0              
## [28] XML_3.99-0.12               pkgconfig_2.0.3             genefilter_1.78.0          
## [31] zlibbioc_1.42.0             xtable_1.8-4                scales_1.2.1               
## [34] downlit_0.4.2               BiocParallel_1.30.3         tibble_3.1.8               
## [37] annotate_1.74.0             KEGGREST_1.36.3             generics_0.1.3             
## [40] IRanges_2.30.0              ggplot2_3.3.6               withr_2.5.0                
## [43] cachem_1.0.6                SummarizedExperiment_1.26.1 BiocGenerics_0.42.0        
## [46] cli_3.4.1                   survival_3.4-0              magrittr_2.0.3             
## [49] crayon_1.5.2                memoise_2.0.1               evaluate_0.17              
## [52] fansi_1.0.3                 MASS_7.3-58.1               tools_4.2.2                
## [55] lifecycle_1.0.3             matrixStats_0.62.0          stringr_1.4.1              
## [58] S4Vectors_0.34.0            locfit_1.5-9.6              munsell_0.5.0              
## [61] DelayedArray_0.22.0         AnnotationDbi_1.58.0        Biostrings_2.64.0          
## [64] jquerylib_0.1.4             ade4_1.7-19                 compiler_4.2.2             
## [67] GenomeInfoDb_1.32.2         rlang_1.0.6                 grid_4.2.2                 
## [70] RCurl_1.98-1.9              rstudioapi_0.14             bitops_1.0-7               
## [73] rmarkdown_2.17              gtable_0.3.1                codetools_0.2-18           
## [76] DBI_1.1.3                   R6_2.5.1                    knitr_1.40                 
## [79] dplyr_1.0.10                fastmap_1.1.0               seqinr_4.2-16              
## [82] bit_4.0.4                   utf8_1.2.2                  stringi_1.7.8              
## [85] parallel_4.2.2              Rcpp_1.0.9                  vctrs_0.5.0                
## [88] geneplotter_1.74.0          png_0.1-7                   tidyselect_1.2.0           
## [91] xfun_0.34