Source: https://github.com/markziemann/SLE712_files/blob/master/BioinfoPrac1.Rmd

Introduction

What is bioinformatics?

Bioinformatics is the application of data science techniques to biological and biomedical problems.

This is important because biology and biomedicine are becoming more data-driven. Being able to process these huge datasets, analyse them and create meaningful charts and tables is becoming an essential skill.

Bioinformatics is done in many languages including Perl, Python, Java, Matlab and more, so why R?

  1. It is free and open source

  2. It works in Linux, Mac and Windows (and on the cloud)

  3. It is a premier language for numerical analysis and statistics

  4. It has some of the best data visualisation options

  5. It has a vibrant and noob-friendly base of science users throughout the world

  6. There is a huge array of packages available on CRAN and Bioconductor.

  7. Nearly everything you learn in R here can be applied to other fields of endeavour such as business, engineering and medicine.

Therefore, all of the bioinformatics content covered in this unit will use R. We will be using the Rstudio application to help us manage the development process. It is a graphical interface with different panels and tools that make code development easier and more efficient. You can log on to our Rstudio server here: http://118.138.234.73:8787/ The username and password were provided to you by email. If you missed out

This week, our goal is to learn:

The Rstudio interface

Let’s get to know the Rstudio interface a bit better. It consists of a menu bar at the top of the browser window and four panels below:

Menu Bar
Top left: Script (type and save your scripts here)
Top right: Global Environment (R data objects you can work with)
Bottom left: Console (commands are executed here)
Bottom right: Files, Plots and help pages

If you cannot see the Script panel, click “New Script” on the menu bar and it should appear.

Let’s watch a video together to learn the basics of the Rstudio environment.


Let’s get started with using R!

Arithmetic

The first thing we will cover in R is the different types of data structures. Let’s go through the different data types together: https://www.statmethods.net/input/datatypes.html

Open up your Rstudio and try these commands by typing in the script panel and hitting the “Run” button or Ctrl+Enter, observing the output.

The hash character # is used to include comments. Any characters that come after the # on a line are ignored by R. Comments help the reader understand the purpose of the commands, so it is a good idea to add them to your code.

We’ll start with some arithmetic.

1+2
## [1] 3
10000^2
## [1] 1e+08
sqrt(64)
## [1] 8

Notice that as you start typing a command like sqrt, Rstudio will offer suggestions. You can use the tab key to autocomplete.
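
A few more arithmetic operators are worth knowing early on. The short sketch below is an illustration added for this section (the numbers are arbitrary); it shows operator precedence, integer division and the remainder operator.

```r
# Operator precedence: ^ is applied before * and /, which are applied before + and -
2 + 3 * 4      # 14
(2 + 3) * 4    # 20, brackets force the addition to happen first
-2^2           # -4, because ^ is applied before the unary minus
(-2)^2         # 4

# Integer division and remainder (modulo)
7 %/% 2        # 3
7 %% 2         # 1
```

When in doubt about the order of operations, adding brackets makes the intent explicit.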

Working with variables

Use the <- back arrow or = to define variables.

s <- sqrt(64)
s
## [1] 8
a <- 2
b <- 5
c <- a*b
c
## [1] 10
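
One point that often surprises newcomers is that a variable stores a value, not a formula. The sketch below uses hypothetical names (radius, area) to show that changing one variable does not update another that was calculated from it.

```r
radius <- 3
area <- pi * radius^2   # area is calculated once, using radius = 3
area

radius <- 4             # changing radius does NOT change area
area                    # still the value calculated with radius = 3

area <- pi * radius^2   # re-run the calculation to update area
area
```

This is why it helps to keep commands in a script and re-run them in order after changing an input.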

Vectors

We can also do operations on vectors of numbers.

Vectors are created with the c() function (short for "combine"), like this: c(1,2,3)

c(1,2,3)
## [1] 1 2 3
sum(c(1,10,100))
## [1] 111
mean(c(1,10,100))
## [1] 37
median(c(1,10,100))
## [1] 10
max(c(1,10,100))
## [1] 100
min(c(1,10,100))
## [1] 1

Saving variables

To make things easier to read and write, we can save the vector as a variable a. Then we can work on a.

We use the <- to save objects in R, but = also works.

length tells us the number of elements in the vector which is very useful.

a <- c(1,10,100)
a
## [1]   1  10 100
2*a
## [1]   2  20 200
a+1
## [1]   2  11 101
sum(a)
## [1] 111
mean(a)
## [1] 37
median(a)
## [1] 10
sd(a)
## [1] 54.74486
var(a)
## [1] 2997
length(a)
## [1] 3
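
Arithmetic also works element by element between two vectors. The following sketch uses made-up data (heights_cm and weights_kg are hypothetical names and values) to calculate a value for each pair of elements, and then shows how R recycles a shorter vector.

```r
# Hypothetical measurements for three people
heights_cm <- c(150, 160, 170)
weights_kg <- c(50, 64, 85)

# Element-wise calculation: one body mass index value per person
bmi <- weights_kg / (heights_cm / 100)^2
round(bmi, 1)           # 22.2 25.0 29.4

# Recycling: the shorter vector is repeated to match the longer one
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24
```

Recycling is convenient, but it can hide mistakes when two vectors were meant to be the same length, so checking with length() is worthwhile.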

R object types

Whenever we are using an R object we should check exactly what sort of object it is. Using an object of the wrong type is one of the most common causes of errors you will encounter.

We use the str() command to check the structure. Other commands we can use to investigate the data structure include class() and typeof()

The colon : can be used to specify series. For example 1:5 for numbers 1 to 5.

Note the different object types here:

A is a numerical vector

B is a numerical vector (integer series)

C is a named numerical vector

D is a character string

E is a character vector, which becomes a named character vector once names are assigned to it

A <- c(1,10,100)
A
## [1]   1  10 100
str(A) 
##  num [1:3] 1 10 100
class(A)
## [1] "numeric"
typeof(A)
## [1] "double"
B <- 1:10 
B
##  [1]  1  2  3  4  5  6  7  8  9 10
str(B)
##  int [1:10] 1 2 3 4 5 6 7 8 9 10
class(B)
## [1] "integer"
typeof(B)
## [1] "integer"
C <- c("prime1"=2, "prime2"=3, "prime3"=5, "prime4"=7)
C
## prime1 prime2 prime3 prime4 
##      2      3      5      7
str(C)
##  Named num [1:4] 2 3 5 7
##  - attr(*, "names")= chr [1:4] "prime1" "prime2" "prime3" "prime4"
class(C)
## [1] "numeric"
typeof(C)
## [1] "double"
D <- "x1" 
D
## [1] "x1"
str(D)
##  chr "x1"
class(D)
## [1] "character"
typeof(D)
## [1] "character"
E <- c("x1", "y2", "z3")
E
## [1] "x1" "y2" "z3"
str(E)
##  chr [1:3] "x1" "y2" "z3"
names(E) <- c("code1","code2","code3")
E
## code1 code2 code3 
##  "x1"  "y2"  "z3"
str(E)
##  Named chr [1:3] "x1" "y2" "z3"
##  - attr(*, "names")= chr [1:3] "code1" "code2" "code3"
class(E)
## [1] "character"
typeof(E)
## [1] "character"

Note how the names can be added during or after the vector is defined.
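
Names are useful because elements can then be looked up by name rather than by position. The sketch below is an added illustration (the vector name primes is hypothetical, the values match C above); it also shows the is.*() family of functions, which test for a type and return TRUE or FALSE.

```r
primes <- c("prime1" = 2, "prime2" = 3, "prime3" = 5, "prime4" = 7)

primes["prime3"]           # look up an element by name (the name is kept)
unname(primes["prime3"])   # 5, with the name dropped
names(primes)              # the names as a character vector

is.numeric(primes)         # TRUE
is.character(primes)       # FALSE
```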

Factors

Factors are an entirely different type of data in R. They are displayed as non-numeric categories (called levels), but are stored internally as integer codes. Typically factors are used in R for categorical data like biological sex. The example here, marital status, is an unordered category.

x <- factor(c("single", "married", "married", "single","defacto","widowed"))
x
## [1] single  married married single  defacto widowed
## Levels: defacto married single widowed
str(x)
##  Factor w/ 4 levels "defacto","married",..: 3 2 2 3 1 4
levels(x)
## [1] "defacto" "married" "single"  "widowed"

But it is possible to have ordered factors as well.

Take this survey result as an example, where respondents rate the food at a restaurant on a scale from very poor to very good. These responses have a natural order, so it makes sense to treat them as ordered factors.

y <- c("good", "very good", "fair", "poor","good")
str(y)
##  chr [1:5] "good" "very good" "fair" "poor" "good"
yy <- factor(y, levels = c("very poor","poor","fair","good","very good"),ordered = TRUE)
yy
## [1] good      very good fair      poor      good     
## Levels: very poor < poor < fair < good < very good
str(yy)
##  Ord.factor w/ 5 levels "very poor"<"poor"<..: 4 5 3 2 4
levels(yy)
## [1] "very poor" "poor"      "fair"      "good"      "very good"
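
The benefit of an ordered factor is that comparisons respect the level order. The sketch below rebuilds the same survey data under a hypothetical name (ratings) and shows counting and comparison.

```r
ratings <- factor(c("good", "very good", "fair", "poor", "good"),
                  levels = c("very poor", "poor", "fair", "good", "very good"),
                  ordered = TRUE)

table(ratings)            # counts per level; "very poor" appears with a count of 0
ratings >= "good"         # TRUE TRUE FALSE FALSE TRUE
sum(ratings >= "good")    # 3 responses were "good" or better
max(ratings)              # the highest rating given: very good
```

With an unordered factor, comparisons such as >= are not meaningful and R returns NA with a warning.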

Logicals

R also uses TRUE/FALSE a lot, as a data type as well as an option when executing commands.

It is also possible to have vectors of logical values.

myvariable1 <- 0
as.logical(myvariable1)
## [1] FALSE
myvariable2 <- 1
as.logical(myvariable2)
## [1] TRUE
vals <- c(0,1,0,1,0,0,0,1,1,0,1,0,1,0)
vals
##  [1] 0 1 0 1 0 0 0 1 1 0 1 0 1 0
str(vals) 
##  num [1:14] 0 1 0 1 0 0 0 1 1 0 ...
as.logical(vals)
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE
## [13]  TRUE FALSE
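
In practice, logical vectors are most often created by comparisons. The sketch below uses made-up measurements (expr is a hypothetical name); because TRUE counts as 1 and FALSE as 0, sum() and mean() give the number and the proportion of TRUE values.

```r
# Hypothetical measurements
expr <- c(2.1, 8.4, 0.3, 5.6, 9.9)

high <- expr > 5   # comparison gives one TRUE/FALSE per element
high               # FALSE TRUE FALSE TRUE TRUE

sum(high)          # 3 values are greater than 5
mean(high)         # 0.6, the proportion greater than 5
any(expr > 9)      # TRUE, at least one value is greater than 9
all(expr > 0)      # TRUE, every value is greater than 0
```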

Subsetting vectors

Next, we would like to subset vectors. To do this, we use square brackets and inside the square bracket we indicate which values we want, using 1 for the first and 2 for second and so on.

See how it is possible to get the last and second last elements.

We can subset vectors, run arithmetic operations and save the results to a new variable in a single line.

a <- c(2:22,98,124,3002)
length(a)
## [1] 24
a[2]
## [1] 3
a[3:4]
## [1] 4 5
a[c(1,3)] 
## [1] 2 4
a[length(a)]
## [1] 3002
a[(length(a)-1)]
## [1] 124
x <- a[10:(length(a)-1)] * 2
x
##  [1]  22  24  26  28  30  32  34  36  38  40  42  44 196 248
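
Square brackets accept more than positive positions. The sketch below uses the same numbers under a hypothetical name (v) to show negative indices, logical conditions, and the head() and tail() helpers.

```r
v <- c(2:22, 98, 124, 3002)

head(v, 3)       # first three elements: 2 3 4
tail(v, 2)       # last two elements: 124 3002
v[-1]            # a negative index drops that element (here, the first)
v[v > 50]        # a logical condition keeps matching elements: 98 124 3002
which(v > 50)    # positions of the matching elements: 22 23 24
```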

Coercing R data into different types

As shown above with commands like factor and as.logical, it is possible to convert objects into different types.

Here are some further examples.

Note that some conversions don’t make sense and produce NA values, sometimes without a warning, which can cause errors in your analysis.

a <- c(1.9,2.7,3.3,5.1,9.9,0)
a
## [1] 1.9 2.7 3.3 5.1 9.9 0.0
as.integer(a)
## [1] 1 2 3 5 9 0
as.character(a)
## [1] "1.9" "2.7" "3.3" "5.1" "9.9" "0"
as.logical(a)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
as.factor(a)
## [1] 1.9 2.7 3.3 5.1 9.9 0  
## Levels: 0 1.9 2.7 3.3 5.1 9.9
b <- c("abc","def","ghi","jkl")
b
## [1] "abc" "def" "ghi" "jkl"
as.numeric(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(b)
## [1] NA NA NA NA
as.integer(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
my_factor <- as.factor(b)
as.numeric(my_factor)
## [1] 1 2 3 4
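
The last result shows that as.numeric() on a factor returns the internal level codes, not the labels. This becomes a trap when the labels themselves look like numbers. A minimal sketch with made-up values (f is a hypothetical name):

```r
f <- factor(c("10", "5", "20"))
levels(f)                     # "10" "20" "5" (sorted as text, not as numbers)

as.numeric(f)                 # 1 3 2, the level codes, NOT the original numbers
as.numeric(as.character(f))   # 10 5 20, convert to character first
```

Converting to character first is a common way to recover the original numbers from a factor.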

Creating random and semi random data

This is used a lot in statistics and probability as well as simulation analysis.

nums <- 1:5
sample(x = nums, size = 3)
## [1] 2 5 3
sample(x = nums, size = 5)
## [1] 5 3 1 2 4
sample(x = nums ,size = 10, replace = TRUE)
##  [1] 5 1 4 2 2 1 1 4 5 4

We can also sample from distributions. Here we are sampling 5 numbers from a normal distribution with a mean of 10 and a standard deviation of 2.

d <- rnorm(n = 5, mean = 10, sd = 2)
d
## [1] 10.198622  9.841321  9.395114  9.942377 11.010320

Here we are sampling 20 numbers from a binomial distribution with a size of 50 and probability of 0.5

b <- rbinom(n = 20, size = 50, prob = 0.5)
b
##  [1] 23 25 22 22 24 26 28 30 26 29 25 22 23 28 26 28 20 27 32 27
mean(b)
## [1] 25.65
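
Because these values are random, you will get different numbers from the ones shown here each time you run the commands. If you need repeatable results, one option is to set the random seed with set.seed() beforehand. In the sketch below the seed value 42 and the names first and second are arbitrary choices for illustration.

```r
set.seed(42)                                  # fix the random number generator state
first <- rnorm(n = 5, mean = 10, sd = 2)

set.seed(42)                                  # reset to the same state
second <- rnorm(n = 5, mean = 10, sd = 2)

identical(first, second)                      # TRUE, the same "random" numbers
```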

Basic plots

Creating basic plots in R isn’t very difficult. Here are some simple ones.

Dot plot and line plot.

Adding extra lines.

Changing line colour and adding a subheading

a <- (1:10)^2
a
##  [1]   1   4   9  16  25  36  49  64  81 100
plot(a)

plot(a,type="l")

plot(a,type="b")

plot(a,type="b")
lines(a/2, type="b",col="red")
mtext("Black:Growth of A. Red: growth of A/2")

Now for scatterplots.

We can change the point type (pch) and size (cex), as well as add a main heading (main).

We can also add additional series of points to the chart and adjust the axis limits with xlim and ylim.

x_vals <- rnorm(n = 1000, mean = 10, sd = 2)

d_error <- rnorm(n = 1000, mean = 1, sd = 0.1)

y_vals <- x_vals * d_error

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values")

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, 
     cex=0.5, main="Plot of X and Y values",
     ylim=c(0,17))
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")

Let’s now run a linear regression on the relationship of y with x.

linear_regression_model <- lm(y_vals ~ x_vals)

summary(linear_regression_model)
## 
## Call:
## lm(formula = y_vals ~ x_vals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1952 -0.6916 -0.0102  0.6805  3.4723 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.15744    0.16337  -0.964    0.335    
## x_vals       1.01405    0.01611  62.945   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.035 on 998 degrees of freedom
## Multiple R-squared:  0.7988, Adjusted R-squared:  0.7986 
## F-statistic:  3962 on 1 and 998 DF,  p-value: < 2.2e-16
# The first coefficient is the intercept and the second is the slope for x_vals
INTERCEPT <- linear_regression_model$coefficients[1]
SLOPE <- linear_regression_model$coefficients[2]

HEADER <- paste("Slope:",signif(SLOPE,4),"Intercept:",signif(INTERCEPT,4))

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5)
abline(linear_regression_model,col="red",lty=2,lwd=3)
mtext(HEADER)
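
Picking coefficients by position makes it easy to mix up the intercept and the slope. A safer alternative is to look them up by name with coef(). The sketch below is self-contained and uses simulated data with hypothetical names (x, y, fit) and a known slope of 2 and intercept of 3.

```r
set.seed(1)                                   # arbitrary seed so the example is repeatable
x <- rnorm(n = 100, mean = 10, sd = 2)
y <- 3 + 2 * x + rnorm(n = 100, mean = 0, sd = 0.5)   # true intercept 3, true slope 2

fit <- lm(y ~ x)
coefs <- coef(fit)

names(coefs)                 # "(Intercept)" "x"
coefs[["(Intercept)"]]       # estimate of the intercept, close to 3
coefs[["x"]]                 # estimate of the slope, close to 2
```

The name of the slope coefficient matches the predictor variable used in the formula.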

Barplots are also really useful.

Giving the elements of the vector names provides a label for each bar.

names(a) <- 1:length(a)
barplot(a)

barplot(a,horiz = TRUE,las=1, xlab = "Measurements")

Boxplots are relatively easy to create.

boxplot(x_vals, y_vals/2)

boxplot(x_vals, y_vals/2, names=c("X values", "Y values"),ylab="Measurement (cm)")

Histograms are easily made.

And it is possible to place multiple charts on a single image.

hist(x_vals)

par(mfrow = c(2, 1))
hist(x_vals,main="")
hist(y_vals,main="")
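
The mfrow setting stays in effect for later plots until it is changed back. One way to handle this, sketched below with hypothetical names (vals, old_par), is to save the previous settings that par() returns and restore them afterwards.

```r
vals <- rnorm(n = 1000, mean = 10, sd = 2)   # made-up data for illustration

old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns; par() returns the previous settings
hist(vals, main = "Histogram")
boxplot(vals, main = "Boxplot")
par(old_par)                      # restore the previous layout (one plot per image)
```

In Rstudio you can also reset the layout by clearing the plots in the Plots panel.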

Homework questions for group study

  1. Calculate the sum of all integers between 500 and 600

  2. Calculate the sum of the square roots of all integers between 900 and 1000

  3. Create the following datasets and plot a boxplot:

     a. Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 5

     b. Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 10

  4. Plot a and b above as a scatterplot, and plot the trend line.

  5. Plot a and b above as histograms on the same chart.

Next week we will

Learn:

Session information

For reproducibility.

sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.30   R6_2.5.1        jsonlite_1.8.3  magrittr_2.0.3 
##  [5] evaluate_0.17   highr_0.9       stringi_1.7.8   cachem_1.0.6   
##  [9] rlang_1.0.6     cli_3.4.1       rstudioapi_0.14 jquerylib_0.1.4
## [13] bslib_0.4.0     rmarkdown_2.17  tools_4.2.2     stringr_1.4.1  
## [17] xfun_0.34       yaml_2.3.6      fastmap_1.1.0   compiler_4.2.2 
## [21] htmltools_0.5.3 knitr_1.40      sass_0.4.2