Source: https://github.com/markziemann/SLE712_files/BioinfoPrac4.Rmd
In this lesson we’ll be taking your R skills to the next level with functions, conditionals, loops and vectorisation.
But first we need to become familiar with lists in R, as this will make the following section easier.
Lists are simply R objects which are grouped together in order. Unlike vectors, they don’t need to be the same data type. Therefore they are very flexible and commonly used to return results to the user.
Here we learn how to construct a list, as well as subset and modify the names.
a <- sample(x=1:10,size=5,replace=TRUE)
b <- 10:30
c <- rnorm(n=50,mean=1,sd=3)
d <- mtcars
mylist <- list(a,b,c,d)
str(mylist)
## List of 4
## $ : int [1:5] 2 2 10 8 4
## $ : int [1:21] 10 11 12 13 14 15 16 17 18 19 ...
## $ : num [1:50] 1.67 1.52 4.4 1.02 -2.23 ...
## $ :'data.frame': 32 obs. of 11 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
mylist[2]
## [[1]]
## [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
mylist[[2]]
## [1] 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
mylist[c(3,1)]
## [[1]]
## [1] 1.67297341 1.52391087 4.39566042 1.01699536 -2.22857905 -4.22164325 1.60284652 -1.01637725
## [9] -3.18456713 -2.71747291 0.37205816 -1.17450142 -1.72514033 1.04184223 -0.30841557 1.53257306
## [17] -0.44837096 7.38584192 7.55806721 0.36888921 2.19655579 1.80407081 0.34795338 -0.23599046
## [25] -0.85382447 -5.27753598 2.60138065 -2.79201115 5.50491048 -0.27872053 -0.30658101 8.14768056
## [33] 3.55155296 2.85695260 -2.48524496 6.07750496 -1.42296405 0.08388423 -1.49405288 5.14889687
## [41] 0.40906980 1.25755749 -1.92960829 3.58277372 4.48756412 -3.11514104 -0.11604674 0.09952740
## [49] -3.04173777 -3.15545076
##
## [[2]]
## [1] 2 2 10 8 4
listnames <- c("random integers", "range", "norm dist", "dataframe")
names(mylist) <- listnames
mylist2 <- list("random integers"=a, "range"=b, "norm dist"=c, "dataframe"=d)
str(mylist2)
## List of 4
## $ random integers: int [1:5] 2 2 10 8 4
## $ range : int [1:21] 10 11 12 13 14 15 16 17 18 19 ...
## $ norm dist : num [1:50] 1.67 1.52 4.4 1.02 -2.23 ...
## $ dataframe :'data.frame': 32 obs. of 11 variables:
## ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ disp: num [1:32] 160 160 108 258 360 ...
## ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
A function is a stored command which can be rerun with different
inputs. We define the function using the function()
command
where the function itself is defined in the {}
curly
brackets as shown below.
Unlinke this example, functions are normally more than one line. Your
function can specify more than one input but has to return just one
output. return
is used to select what data should be
returned to the user.
cubed <- function(x) {
x^3
}
cubed(42)
## [1] 74088
cubed(c(1,2,4))
## [1] 1 8 64
So you can consider functions to be “shortcuts”, to help you execute custom codes that you will be reusing again and again.
Functions are a bit like lego blocks in that they are versatile. For example you can use functions inside other functions.
cubedsum <- function(x,y) {
z <- cubed(x) + cubed(y)
return(z)
}
cubedsum(3,4)
## [1] 91
Using this approach, you can make incredibly sophisticated programs out of relatively simple building blocks.
The inputs don’t need to be single numbers, they can be vectors or dataframes. In this example we create a function to automate linear model analysis.
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
mylm <- lm(cars[,2] ~ cars[,1])
INT=signif(mylm$coefficients[1],3)
INT
## (Intercept)
## -17.6
SLOPE=signif(mylm$coefficients[2],3)
SLOPE
## cars[, 1]
## 3.93
HEADER=paste("Slope:",SLOPE,"Int:",INT)
HEADER
## [1] "Slope: 3.93 Int: -17.6"
myfunc <- function(x,y) {
mylm <- lm(y ~ x)
INT=signif(mylm$coefficients[1],3)
SLOPE=signif(mylm$coefficients[2],3)
HEADER=paste("Slope:",SLOPE,"Int:",INT)
plot(x,y)
abline(mylm,lty=2,lwd=2,col="red")
mtext(HEADER)
}
myfunc(cars[,1],cars[,2])
R functions can be reused. Let’s try with some built-in datasets.
myfunc(women$height, women$weight)
myfunc(ChickWeight$Time, ChickWeight$weight)
myfunc(Loblolly$age, Loblolly$height)
Let’s try to make our own functions with the next 30 mins. Make a function that accepts 2 numerical vectors and does the following:
Student t-test with the syntax
t.test(vec2,vec1)
.
Make a boxplot of the two groups with a subheading that contains
the t.test pvalue using the mtext()
command.
In order to generate two numeric vectors you can use the following:
vec1 <- rnorm(n=10, mean=20, sd=10)
vec2 <- rnorm(n=10, mean=30, sd=15)
Apply this function to calculate the difference in weight between
control and treatment groups in the built-in PlantGrowth
dataset.
Commonly, your workflow will need to adapt to different inputs. For
example if a file already exists, you may not want to over-write it.
This is the general format of an if
statement in R:
if (condition) {command to run if TRUE}
Most of the time the command to run is written on a new line. Many times we need to run more than one command.
if (condition) {
command1 to run if TRUE
command2 to run if TRUE
}
If we only want to execute the command if FALSE, then we can add
!
to toggle the conditional.
2>3
## [1] FALSE
!2>3
## [1] TRUE
Here is how it looks in an actual if
statement. Note the
beow is only an example, the code won’t work when run because command1
and command2 are not defined.
if (!condition) {
command1 to run if FALSE
command2 to run if FALSE
}
We can add else
to select from two options
if (condition) {
command1 to run if TRUE
} else {
command2 to run if FALSE
}
Here is a real life example.
if ( !file.exists( "mtcars.tsv" ) ) {
write.table( mtcars, file="mtcars.tsv", quote=FALSE, sep="\t")
message("File written")
}
## File written
Here is another (better) way to do it.
if ( file.exists( "mtcars.tsv" ) ) {
warning( "File already exists!" )
} else {
write.table( mtcars, file="mtcars.tsv", quote=FALSE, sep="\t" )
message( "File written" )
}
## Warning: File already exists!
# delete the file
unlink("mtcars.tsv")
We can use Boolean operators to come of with more sophisticated
conditional. For example &
means AND, while
|
means OR.
# & = AND
if( 3>2 & 5>4 ) {
print( "ok" )
}
## [1] "ok"
if( 3>2 & 5>10 ) {
print( "ok" )
}
# | = OR
if( 3>2 | 5>10 ) {
print( "ok" )
}
## [1] "ok"
Here is another example. Let’s check to see when this Rmd was last updated. If it was more than a month then it’s too old.
# record current time in seconds
CURRENT_TIME <- as.numeric( Sys.time() )
# get the last modified time in seconds
MOD_TIME <- as.numeric( file.info( "BioinfoPrac4.Rmd" )$mtime )
# difference in time
DIFF_TIME <- signif( CURRENT_TIME - MOD_TIME, 3 )
MY_MESSAGE <- paste( "This prac file is", DIFF_TIME, "seconds old" )
message( MY_MESSAGE )
## This prac file is 17100000 seconds old
# Define a month in seconds
MONTH <- 60 * 60 * 24 * 30
message( paste( "One month is", MONTH, "seconds" ) )
## One month is 2592000 seconds
# now run the comparison
if ( DIFF_TIME > MONTH ) {
message( "This learning material more than 1 month old. Time to update!" )
} else {
message( "This learning material is less than 1 month old :)" )
}
## This learning material more than 1 month old. Time to update!
Now it’s your turn. Test to see if the mean petal length of setosa, versicolor an virginica are greater than 3cm. If this condition is true, print a message including the species name and measurement.
In software engineering and data analysis it is common to repeat a
task with different inputs. In most computer languages a
for
loop would be used.
This can be done in R as demonstrated below:
for ( i in c("a","b","c") ) {
print(i)
}
## [1] "a"
## [1] "b"
## [1] "c"
for ( i in 1:5 ) {
print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
While this works in the above example, you’ll find that it is difficult to work downstream of the loop. In particular it is difficult to access the results of a looped command and use it in subsequent steps.
In this example we should in theory be able to save the squares into
the object x
but it doesn’t work as you might think.
# try storing the results
x <- for (i in 1:5) {
i^2
}
x
## NULL
There is one way to make it work, but still it isn’t recommended.
x <- vector()
# try another way of storing the results
for (i in 1:5) {
res <- i^2
x[i] <- res
}
x
## [1] 1 4 9 16 25
For loops are also really slow and inefficient in R when dealing with
big datasets. Nevertheless, there is also a while
loop.
i=0
while( i < 5 ) {
print(i)
i=i+1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
The preferred solution to repeat a task on many inputs is to use vectorisation. Let’s say we want to find the median of three matrices. We create a new object, either a vector or list and then run a function on each element.
Note below the difference in the output type given by
vapply
, lapply
and sapply
.
mx1 <- matrix( rbinom(n=100, size=50, prob=0.8), nrow = 10 )
mx2 <- matrix( rbinom(n=100, size=100, prob=0.8), nrow = 10 )
mx3 <- matrix( rbinom(n=100, size=300, prob=0.8), nrow = 10 )
mxs <- list(mx1,mx2,mx3)
str(mxs)
## List of 3
## $ : int [1:10, 1:10] 36 39 44 40 37 44 39 41 39 40 ...
## $ : int [1:10, 1:10] 82 77 80 80 86 75 82 81 80 74 ...
## $ : int [1:10, 1:10] 229 248 248 231 237 238 243 249 241 248 ...
# sapply output is either a vector or matrix
sapply(X = mxs, FUN = mean)
## [1] 39.93 80.18 241.10
# lapply output is always a list
lapply(X = mxs, FUN = mean)
## [[1]]
## [1] 39.93
##
## [[2]]
## [1] 80.18
##
## [[3]]
## [1] 241.1
Take a few minutes now to visit the help page for these three commands.
?sapply
As you can see from the code above, vectorisation leads to shorter
and more readable code. Naturally, you can use your custom functions
with sapply
and other apply
family
functions.
Here I’ve created a function to calculate the mean correlation in the columns of a matrix.
meancor <- function(x) {
mycor <- cor(x)
mean(mycor)
}
meancor(mx1)
## [1] 0.06263797
sapply(mxs,meancor)
## [1] 0.06263797 0.11415551 0.17629755
Your turn. Use sapply
to find the mean and standard
deviation of each matrix (mx1,mx2 and mx3).
Sometimes you have a large data table and you want to run the same
analysis on each row or column. To do this, use apply
. The
general syntax is:
apply( x, MARGIN, FUN )
Where x
is the object (df or mx) to work on.
MARGIN
is either 1 for rows and 2 for columns.
FUN
is the function to ron on the data.
For example, here is how you can calculate median of rows or columns in a matrix or dataframe.
# row medians
apply(mx1, 1, median)
## [1] 40.5 39.5 41.5 40.5 39.5 40.5 40.0 40.0 40.0 40.0
# column medians
apply(mx2, 2, median)
## [1] 80.0 81.0 79.0 80.0 78.5 79.0 80.0 80.5 82.0 81.5
In this next example I will create a new function that will calculate
the coefficient of variation (CV), and then run this for each of the
columns in the mtcars
data.
cv <- function(x) {
mymean <- mean(x)
mysd <- sd(x)
cv <- mysd/mymean
signif(cv,3)
}
# run on one vector
cv(mtcars$mpg)
## [1] 0.3
# now run on all columns in the df
apply(X = mtcars, MARGIN = 2 ,FUN = cv)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0.300 0.289 0.537 0.467 0.149 0.304 0.100 1.150 1.230 0.200 0.574
Your turn. Create a function that reports the difference between the
median and mean of a numerical vector, then use this function inside an
apply
command to report interquartile ranges of rows in the
mx1
matrix.
This is a special case of sapply which repeats a task a set number of times, which could be useful for simulations.
In the example below, we create a function which simulates a coin toss where tails=0 and heads=1.
Calculate the mean of n=10 tosses. Is the result consistent?
cointoss<-function(){
sample(c(0,1),1,replace=TRUE)
}
res <- replicate(10,cointoss())
res
## [1] 0 1 0 0 1 1 1 1 0 0
mean(res)
## [1] 0.5
mean(replicate(10,cointoss()))
## [1] 0.4
mean(replicate(100,cointoss()))
## [1] 0.44
mean(replicate(1000,cointoss()))
## [1] 0.499
mean(replicate(10000,cointoss()))
## [1] 0.4923
See how the result converges near 50% as the number of replications increases.
Create a function that calculates the cubed root for a number
Use sapply
to calculate the cubed root for the numbers
44,77,99,123
Use if
to test whether object mtcars.tsv
exists and run head
if it does exist.
Use apply
to get the maximum value in each column of
mtcars.
Use replicate
to quantify how frequent it is to draw the
same number twice where the pool of numbers consists of 0 to 9.
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
## [4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.4.1
##
## loaded via a namespace (and not attached):
## [1] MatrixGenerics_1.8.1 Biobase_2.56.0 httr_1.4.4
## [4] sass_0.4.2 jsonlite_1.8.3 bit64_4.0.5
## [7] splines_4.2.2 bslib_0.4.0 assertthat_0.2.1
## [10] highr_0.9 stats4_4.2.2 blob_1.2.3
## [13] cellranger_1.1.0 GenomeInfoDbData_1.2.8 yaml_2.3.6
## [16] pillar_1.8.1 RSQLite_2.2.18 lattice_0.20-45
## [19] glue_1.6.2 digest_0.6.30 GenomicRanges_1.48.0
## [22] RColorBrewer_1.1-3 XVector_0.36.0 colorspace_2.0-3
## [25] htmltools_0.5.3 Matrix_1.5-1 DESeq2_1.36.0
## [28] XML_3.99-0.12 pkgconfig_2.0.3 genefilter_1.78.0
## [31] zlibbioc_1.42.0 xtable_1.8-4 scales_1.2.1
## [34] downlit_0.4.2 BiocParallel_1.30.3 tibble_3.1.8
## [37] annotate_1.74.0 KEGGREST_1.36.3 generics_0.1.3
## [40] IRanges_2.30.0 ggplot2_3.3.6 withr_2.5.0
## [43] cachem_1.0.6 SummarizedExperiment_1.26.1 BiocGenerics_0.42.0
## [46] cli_3.4.1 survival_3.4-0 magrittr_2.0.3
## [49] crayon_1.5.2 memoise_2.0.1 evaluate_0.17
## [52] fansi_1.0.3 MASS_7.3-58.1 tools_4.2.2
## [55] lifecycle_1.0.3 matrixStats_0.62.0 stringr_1.4.1
## [58] S4Vectors_0.34.0 locfit_1.5-9.6 munsell_0.5.0
## [61] DelayedArray_0.22.0 AnnotationDbi_1.58.0 Biostrings_2.64.0
## [64] jquerylib_0.1.4 ade4_1.7-19 compiler_4.2.2
## [67] GenomeInfoDb_1.32.2 rlang_1.0.6 grid_4.2.2
## [70] RCurl_1.98-1.9 rstudioapi_0.14 bitops_1.0-7
## [73] rmarkdown_2.17 gtable_0.3.1 codetools_0.2-18
## [76] DBI_1.1.3 R6_2.5.1 knitr_1.40
## [79] dplyr_1.0.10 fastmap_1.1.0 seqinr_4.2-16
## [82] bit_4.0.4 utf8_1.2.2 stringi_1.7.8
## [85] parallel_4.2.2 Rcpp_1.0.9 vctrs_0.5.0
## [88] geneplotter_1.74.0 png_0.1-7 tidyselect_1.2.0
## [91] xfun_0.34