Source: https://github.com/markziemann/SLE712_files/blob/master/BioinfoPrac1.Rmd
What is bioinformatics?
It is data science techniques applied to biological or biomedical domains.
This is important because biology and biomedicine are becoming more data-driven. Being able to process huge datasets, analyse them and create meaningful charts and tables is therefore becoming an essential skill.
Bioinformatics is done in many languages including Perl, Python, Java, Matlab and more, so why R?
It is free and open source
It works in Linux, Mac and Windows (and on the cloud)
It is one of the leading languages for numerical analysis and statistics
It has excellent data visualisation options
It has a vibrant and noob-friendly base of science users throughout the world
There is a huge array of packages available on CRAN and Bioconductor.
Nearly everything you learn in R here can be applied to other fields of endeavour such as business, engineering and medicine.
Therefore all of the bioinformatics content covered in this unit will be related to using R. We will be using the Rstudio application to help us manage the development process. It is a graphical interface with different panels and tools, which makes code development easier and more efficient. You can log on to our Rstudio server here: http://118.138.234.73:8787/ The username and password were provided to you by email. If you missed out
This week, our goal is to learn:
how to use the Rstudio interface
about different data objects in R
how to do basic math in R
Let’s get to know the Rstudio interface a bit better. It consists of a menu bar at the top of the browser window and four panels below:
Menu Bar | |
---|---|
Top left: Script (type and save your scripts here) | Top right: Global Environment (R data objects you can work with) |
Bottom left: Console (commands are executed here) | Bottom right: Files, Plots and help pages |
If you cannot see the Script panel, click “New Script” on the menu bar and it should appear.
Let’s watch a video together to learn the basics of the Rstudio environment.
The first thing we will cover in R is the different types of data structures. Let’s go through the different data types together: https://www.statmethods.net/input/datatypes.html
Open up your Rstudio and try these commands by typing in the script panel and hitting the “Run” button or Ctrl+Enter, observing the output.
The hash character # is used to include comments. Including comments helps the user to understand the purpose of the commands. It is a good idea to add comments to your code. Any characters that come after the # are ignored.
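As a small sketch of how comments look in practice (the variable names radius and area are made up for this illustration), a comment can take up a whole line or follow a command on the same line:

```r
# This whole line is a comment, so R skips it
radius <- 3               # hypothetical value: the radius of a circle
area <- pi * radius^2     # the code before the hash still runs
area
## [1] 28.27433
```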
We’ll start with some arithmetic.
1+2
## [1] 3
10000^2
## [1] 1e+08
sqrt(64)
## [1] 8
Notice that as you start typing commands like sqrt, Rstudio will give you suggestions. You can use the tab key to autocomplete.
Use the <- back arrow or = to define variables.
s <- sqrt(64)
s
## [1] 8
a <- 2
b <- 5
c <- a*b
c
## [1] 10
We can also do operations on vectors of numbers.
Vectors are specified with c-brackets, like this:
c(1,2,3)
## [1] 1 2 3
sum(c(1,10,100))
## [1] 111
mean(c(1,10,100))
## [1] 37
median(c(1,10,100))
## [1] 10
max(c(1,10,100))
## [1] 100
min(c(1,10,100))
## [1] 1
To make things easier to read and write, we can save the vector as a variable a. Then we can work on a.
We use the <- to save objects in R, but = also works.
length tells us the number of elements in the vector, which is very useful.
a <- c(1,10,100)
a
## [1] 1 10 100
2*a
## [1] 2 20 200
a+1
## [1] 2 11 101
sum(a)
## [1] 111
mean(a)
## [1] 37
median(a)
## [1] 10
sd(a)
## [1] 54.74486
var(a)
## [1] 2997
length(a)
## [1] 3
Whenever we are using an R object we should check exactly what sort of object it is. Using an object of the wrong type is one of the most common sources of error you will encounter.
We use the str() command to check the structure. Other commands we can use to investigate the data structure include class() and typeof().
The colon : can be used to specify series. For example, 1:5 gives the numbers 1 to 5.
Note the different object types here:
A is a numerical vector
B is a numerical vector (series), stored as integers
C is a named numerical vector
D is a character string
E is a character vector, which is then given names to become a named character vector
A <- c(1,10,100)
A
## [1] 1 10 100
str(A)
## num [1:3] 1 10 100
class(A)
## [1] "numeric"
typeof(A)
## [1] "double"
B <- 1:10
B
## [1] 1 2 3 4 5 6 7 8 9 10
str(B)
## int [1:10] 1 2 3 4 5 6 7 8 9 10
class(B)
## [1] "integer"
typeof(B)
## [1] "integer"
C <- c("prime1"=2, "prime2"=3, "prime3"=5, "prime4"=7)
C
## prime1 prime2 prime3 prime4
## 2 3 5 7
str(C)
## Named num [1:4] 2 3 5 7
## - attr(*, "names")= chr [1:4] "prime1" "prime2" "prime3" "prime4"
class(C)
## [1] "numeric"
typeof(C)
## [1] "double"
D <- "x1"
D
## [1] "x1"
str(D)
## chr "x1"
class(D)
## [1] "character"
typeof(D)
## [1] "character"
E <- c("x1", "y2", "z3")
E
## [1] "x1" "y2" "z3"
str(E)
## chr [1:3] "x1" "y2" "z3"
names(E) <- c("code1","code2","code3")
E
## code1 code2 code3
## "x1" "y2" "z3"
str(E)
## Named chr [1:3] "x1" "y2" "z3"
## - attr(*, "names")= chr [1:3] "code1" "code2" "code3"
class(E)
## [1] "character"
typeof(E)
## [1] "character"
Note how the names can be added during or after the vector is defined.
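One reason to name the elements is that they can then be looked up by name instead of by position. Here is a minimal sketch that rebuilds the vector C from above so that it runs on its own:

```r
C <- c("prime1"=2, "prime2"=3, "prime3"=5, "prime4"=7)
names(C)               # list the names
## [1] "prime1" "prime2" "prime3" "prime4"
C["prime3"]            # pick an element by its name (the name is kept)
unname(C["prime3"])    # drop the name to get just the value
## [1] 5
```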
Factors are an entirely different type of data in R. They are displayed as non-numeric categories, but are stored internally as integer codes. Typically factors are used in R for categorical data like biological sex. Here the example is an unordered category.
x <- factor(c("single", "married", "married", "single","defacto","widowed"))
x
## [1] single married married single defacto widowed
## Levels: defacto married single widowed
str(x)
## Factor w/ 4 levels "defacto","married",..: 3 2 2 3 1 4
levels(x)
## [1] "defacto" "married" "single" "widowed"
But it is possible to have ordered factors as well.
Take this survey as an example, where respondents rate the food at a restaurant on a scale from very poor to very good. These responses have a natural order, so it makes sense to treat them as ordered factors.
y <- c("good", "very good", "fair", "poor","good")
str(y)
## chr [1:5] "good" "very good" "fair" "poor" "good"
yy <- factor(y, levels = c("very poor","poor","fair","good","very good"),ordered = TRUE)
yy
## [1] good very good fair poor good
## Levels: very poor < poor < fair < good < very good
str(yy)
## Ord.factor w/ 5 levels "very poor"<"poor"<..: 4 5 3 2 4
levels(yy)
## [1] "very poor" "poor" "fair" "good" "very good"
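To see what the ordering gives us, here is a short sketch that rebuilds the same survey responses (the helper name rating_levels is made up for this illustration). With an ordered factor, comparisons and max() respect the level order, which would not work on an unordered factor:

```r
y <- c("good", "very good", "fair", "poor", "good")
rating_levels <- c("very poor", "poor", "fair", "good", "very good")
yy <- factor(y, levels = rating_levels, ordered = TRUE)

yy >= "good"     # which responses are "good" or better?
## [1]  TRUE  TRUE FALSE FALSE  TRUE
max(yy)          # the best rating that was given
## [1] very good
## Levels: very poor < poor < fair < good < very good
table(yy)        # counts follow the level order; unused levels are counted as 0
```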
R also uses TRUE/FALSE a lot, as a data type as well as an option when executing commands.
It is also possible to have vectors of logical values.
myvariable1 <- 0
as.logical(myvariable1)
## [1] FALSE
myvariable2 <- 1
as.logical(myvariable2)
## [1] TRUE
vals <- c(0,1,0,1,0,0,0,1,1,0,1,0,1,0)
vals
## [1] 0 1 0 1 0 0 0 1 1 0 1 0 1 0
str(vals)
## num [1:14] 0 1 0 1 0 0 0 1 1 0 ...
as.logical(vals)
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE
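Logical vectors often come from comparisons rather than from as.logical. The sketch below reuses the vals vector from above (the name is_one is made up for this illustration), and also shows TRUE being used as an option to a command:

```r
vals <- c(0,1,0,1,0,0,0,1,1,0,1,0,1,0)
is_one <- vals == 1             # a comparison returns a logical vector
sum(is_one)                     # TRUE counts as 1, so this counts the ones
## [1] 6
mean(is_one)                    # the proportion of TRUE values
sort(vals, decreasing = TRUE)   # TRUE used as an option
```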
Next, we would like to subset vectors. To do this, we use square brackets and inside the square bracket we indicate which values we want, using 1 for the first and 2 for second and so on.
See how it is possible to get the last and second last elements.
We can subset vectors, run arithmetic operations and save the results to a new variable in a single line.
a <- c(2:22,98,124,3002)
length(a)
## [1] 24
a[2]
## [1] 3
a[3:4]
## [1] 4 5
a[c(1,3)]
## [1] 2 4
a[length(a)]
## [1] 3002
a[(length(a)-1)]
## [1] 124
x <- a[10:(length(a)-1)] * 2
x
## [1] 22 24 26 28 30 32 34 36 38 40 42 44 196 248
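There are a few other ways to subset that go slightly beyond the examples above, sketched here on the same vector: tail() for the last elements, negative numbers to drop elements, and a logical test inside the square brackets.

```r
a <- c(2:22,98,124,3002)
tail(a, 2)       # the last two elements
## [1]  124 3002
a[-(1:20)]       # negative indices drop elements; here the first 20
## [1]   22   98  124 3002
a[a > 20]        # keep only the values greater than 20
## [1]   21   22   98  124 3002
```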
As shown above with commands like factor and as.logical, it is possible to convert objects into different types.
Here are some further examples.
Note that some conversions don’t make sense, which can cause errors in your analysis.
a <- c(1.9,2.7,3.3,5.1,9.9,0)
a
## [1] 1.9 2.7 3.3 5.1 9.9 0.0
as.integer(a)
## [1] 1 2 3 5 9 0
as.character(a)
## [1] "1.9" "2.7" "3.3" "5.1" "9.9" "0"
as.logical(a)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE
as.factor(a)
## [1] 1.9 2.7 3.3 5.1 9.9 0
## Levels: 0 1.9 2.7 3.3 5.1 9.9
b <- c("abc","def","ghi","jkl")
b
## [1] "abc" "def" "ghi" "jkl"
as.numeric(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(b)
## [1] NA NA NA NA
as.integer(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
my_factor <- as.factor(b)
as.numeric(my_factor)
## [1] 1 2 3 4
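The last result shows why some conversions need care: as.numeric on a factor returns the internal level codes, not the values you see printed. This sketch uses the numeric vector from above (the name f is made up for this illustration) to show the problem and one common workaround, which is to convert to character first:

```r
a <- c(1.9,2.7,3.3,5.1,9.9,0)
f <- as.factor(a)
as.numeric(f)                  # level codes, not the original values
## [1] 2 3 4 5 6 1
as.numeric(as.character(f))    # going via character recovers the values
## [1] 1.9 2.7 3.3 5.1 9.9 0.0
```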
Random sampling is used a lot in statistics and probability as well as simulation analysis.
nums <- 1:5
sample(x = nums, size = 3)
## [1] 2 5 3
sample(x = nums, size = 5)
## [1] 5 3 1 2 4
sample(x = nums ,size = 10, replace = TRUE)
## [1] 5 1 4 2 2 1 1 4 5 4
We can also sample from distributions. Here we are sampling 5 numbers from a normal distribution with a mean of 10 and standard deviation of 2.
d <- rnorm(n = 5, mean = 10, sd = 2)
d
## [1] 10.198622 9.841321 9.395114 9.942377 11.010320
Here we are sampling 20 numbers from a binomial distribution with a size of 50 and probability of 0.5
b <- rbinom(n = 20, size = 50, prob = 0.5)
b
## [1] 23 25 22 22 24 26 28 30 26 29 25 22 23 28 26 28 20 27 32 27
mean(b)
## [1] 25.65
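Because these commands are random, you will get different numbers from the ones shown here each time you run them. If you need the same result every time, one option is to set the random seed first with set.seed. This is a minimal sketch; the seed value 42 and the variable names are arbitrary choices for illustration.

```r
set.seed(42)                                # any fixed number will do
first_draw <- sample(x = 1:5, size = 3)
set.seed(42)                                # resetting the seed replays the same draw
second_draw <- sample(x = 1:5, size = 3)
identical(first_draw, second_draw)
## [1] TRUE
```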
Creating basic plots in R isn’t very difficult. Here are some simple ones.
Dot plot and line plot.
Adding extra lines.
Changing line colour and adding a subheading
a <- (1:10)^2
a
## [1] 1 4 9 16 25 36 49 64 81 100
plot(a)
plot(a,type="l")
plot(a,type="b")
plot(a,type="b")
lines(a/2, type="b",col="red")
mtext("Black:Growth of A. Red: growth of A/2")
Now for scatterplots.
We can change the point type (pch) and size (cex), as well as add a main heading (main).
We can also add additional series of points to the chart and adjust the axis limits with xlim and ylim.
x_vals <- rnorm(n = 1000, mean = 10, sd = 2)
d_error <- rnorm(n = 1000, mean = 1, sd = 0.1)
y_vals <- x_vals * d_error
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values")
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19,
cex=0.5, main="Plot of X and Y values",
ylim=c(0,17))
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")
Let’s now run a linear regression on the relationship of y with x.
linear_regression_model <- lm(y_vals ~ x_vals)
summary(linear_regression_model)
##
## Call:
## lm(formula = y_vals ~ x_vals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1952 -0.6916 -0.0102 0.6805 3.4723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.15744 0.16337 -0.964 0.335
## x_vals 1.01405 0.01611 62.945 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.035 on 998 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.7986
## F-statistic: 3962 on 1 and 998 DF, p-value: < 2.2e-16
# The first coefficient is the intercept and the second is the slope for x_vals
INTERCEPT <- linear_regression_model$coefficients[1]
SLOPE <- linear_regression_model$coefficients[2]
HEADER <- paste("Slope:",signif(SLOPE,4),"Intercept:",signif(INTERCEPT,4))
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5)
abline(linear_regression_model,col="red",lty=2,lwd=3)
mtext(HEADER)
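In the coefficients of a model like this, the first value is the intercept and the second is the slope. Looking them up by name rather than by position helps avoid mixing the two up. The sketch below uses made-up demo data (x_demo, y_demo and demo_model are hypothetical names, and the true intercept of 3 and slope of 2 are chosen for illustration only):

```r
set.seed(1)                                    # arbitrary seed so the sketch is repeatable
x_demo <- rnorm(n = 100, mean = 10, sd = 2)    # hypothetical demo data
y_demo <- 3 + 2 * x_demo + rnorm(n = 100, mean = 0, sd = 0.5)
demo_model <- lm(y_demo ~ x_demo)

coefs <- coef(demo_model)    # same values as demo_model$coefficients
names(coefs)
## [1] "(Intercept)" "x_demo"
coefs["(Intercept)"]         # should be close to 3
coefs["x_demo"]              # should be close to 2
```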
Barplots are also really useful.
Giving the quantities in the vector names provides a label for each bar.
names(a) <- 1:length(a)
barplot(a)
barplot(a,horiz = TRUE,las=1, xlab = "Measurements")
Boxplots are relatively easy to create.
boxplot(x_vals, y_vals/2)
boxplot(x_vals, y_vals/2, names=c("X values", "Y values"),ylab="Measurement (cm)")
Histograms are easily made.
And it is possible to place multiple charts on a single image.
hist(x_vals)
par(mfrow = c(2, 1))
hist(x_vals,main="")
hist(y_vals,main="")
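Note that the mfrow setting stays in place for later plots until it is changed back. One way to handle this, sketched here with made-up random data, is to keep the old settings that par returns and restore them afterwards (old_par is a hypothetical name):

```r
old_par <- par(mfrow = c(2, 1))    # par() hands back the previous settings
hist(rnorm(n = 100), main = "")    # top chart
hist(rnorm(n = 100), main = "")    # bottom chart
par(old_par)                       # back to one chart per image
```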
Calculate the sum of all integers between 500 and 600
Calculate the sum of the square roots of all integers between 900 and 1000
Create the following datasets and plot a boxplot:
A: Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 5
B: Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 10
Plot A and B above as a scatterplot, and plot the trend line.
Plot A and B above as histograms on the same chart.
Learn:
how to work with data frames and matrices
how to read in files properly with R
how to filter, subset, search and sort data
more types of analysis like correlations
how to generate more sophisticated data types
For reproducibility.
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.30 R6_2.5.1 jsonlite_1.8.3 magrittr_2.0.3
## [5] evaluate_0.17 highr_0.9 stringi_1.7.8 cachem_1.0.6
## [9] rlang_1.0.6 cli_3.4.1 rstudioapi_0.14 jquerylib_0.1.4
## [13] bslib_0.4.0 rmarkdown_2.17 tools_4.2.2 stringr_1.4.1
## [17] xfun_0.34 yaml_2.3.6 fastmap_1.1.0 compiler_4.2.2
## [21] htmltools_0.5.3 knitr_1.40 sass_0.4.2