Source: https://github.com/markziemann/bioinformatics_intro_workshop

Image credit: Steven Lokwan.

What is R and why use it?

R is a statistical computing and data visualisation language, derived from an earlier language called S.

What a typical workflow looks like

  • Comments

  • Load libraries

  • Read data

  • Cleaning

  • Analysis

  • Data visualisation

  • Save files

  • Execute
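The steps above can be sketched as one short script. This is only an illustration under stated assumptions: the table, the column names (sample, value) and the temporary file paths are made up for the example, and only base R packages are loaded.

```r
# Comments: say what the script does and who wrote it
# Illustrative workflow on a small made-up table (all names are hypothetical)

# Load libraries (base R packages only, so nothing needs installing)
library("stats")
library("graphics")

# Read data: write a tiny CSV to a temporary file, then read it back in
infile <- tempfile(fileext = ".csv")
writeLines(c("sample,value", "s1,1.5", "s2,NA", "s3,4.5", "s4,3.0"), infile)
dat <- read.csv(infile)

# Cleaning: drop rows with missing values
dat <- dat[!is.na(dat$value), ]

# Analysis: a simple summary statistic
mean_value <- mean(dat$value)
mean_value

# Data visualisation: draw a bar plot into a PDF file
plotfile <- tempfile(fileext = ".pdf")
pdf(plotfile)
barplot(dat$value, names.arg = dat$sample, ylab = "value")
dev.off()

# Save files: write the cleaned table to disk
outfile <- tempfile(fileext = ".csv")
write.csv(dat, outfile, row.names = FALSE)
```

To execute a script like this outside RStudio, save it to a file (for example myscript.R, a hypothetical name) and run it with Rscript myscript.R.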

This week, our goal is to learn:

  • how to work with R on the command line and in RStudio on the HPC

  • about different data structures in R

  • how to do basic math in R

Setting up an interactive R session on the HPC

If you are connected to the Burnet network, visit the HPC documentation here.

Create a new tmux session which will be persistent. Then make an interactive SLURM session where you can run some analysis. You can customise the threads, memory and time to your needs.

tmux new -s mysession

srun -p interactive --pty --time=180 --threads=2 --mem=8G bash -i

Then run an example command. Here we are grabbing some random data and compressing it with pigz, which is a parallel compression tool. This will use all available CPU threads.

head -1000000 /dev/random | pigz > /dev/null

If you want to run some scripts, you will likely need to load modules. Modules are a good way to maintain parallel versions of languages like R. Use the following commands to show the available modules.

module avail

module spider

Try loading R version 4.3.2.

module load R/4.3.2

Now run R and check the version and exit.

R
sessionInfo()
q()

Exit interactive sessions with exit.

Non-interactive jobs can be scheduled with SLURM, using the HPC docs as a guide.

RStudio session

RStudio offers some benefits due to its interactive and graphical nature. Start a session by submitting the RStudio job script:

sbatch --time=180 --threads=2 --mem=8G  /software/jobs/rstudio.job

It will say “Submitted batch job 1234” (the number will be different), and will create a file in the current working directory called rstudio.job.1234. List the contents with ls to confirm that the file was created, then type cat rstudio.job.1234 to read the instructions to connect.

Let’s get to know the RStudio interface a bit better. It consists of a menu bar at the top of the browser window and four panels below:

  • Top left: Script (type and save your scripts here)

  • Top right: Global Environment (R data objects you can work with)

  • Bottom left: Console (commands are executed here)

  • Bottom right: Files, Plots and help pages

If you cannot see the Script panel, click “New Script” on the menu bar and it should appear.

Let’s watch a video together to learn the basics of the RStudio environment.


Intro to R part 1

Here we will begin with simpler data structures and commands.

Arithmetic

Open RStudio and try these commands by typing them in the script panel and pressing the “Run” button or Ctrl+Enter, then observe the output.

We’ll start with some arithmetic.

1+2
## [1] 3
10000^2
## [1] 1e+08
sqrt(64)
## [1] 8

Notice that as you start typing commands like sqrt, RStudio will give you suggestions. You can use the Tab key to autocomplete.

Working with variables

Use the <- back arrow or = to define variables.

s <- sqrt(64)
s
## [1] 8
a <- 2
b <- 5
c <- a*b
c
## [1] 10

Vectors

We can also do operations on vectors of numbers.

Vectors are created with the c() function (short for combine), like this: c(1,2,3)

c(1,2,3)
## [1] 1 2 3
sum(c(1,10,100))
## [1] 111
mean(c(1,10,100))
## [1] 37
median(c(1,10,100))
## [1] 10
max(c(1,10,100))
## [1] 100
min(c(1,10,100))
## [1] 1

Saving variables

To make things easier to read and write, we can save the vector as a variable a. Then we can work on a.

We use the <- to save objects in R, but = also works.

length() tells us the number of elements in the vector, which is very useful.

a <- c(1,10,100)
a
## [1]   1  10 100
2*a
## [1]   2  20 200
a+1
## [1]   2  11 101
sum(a)
## [1] 111
mean(a)
## [1] 37
median(a)
## [1] 10
sd(a)
## [1] 54.74486
var(a)
## [1] 2997
length(a)
## [1] 3
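To see what var() and sd() are doing, the sample variance can be worked out by hand from the vector operations above. This is a small sketch; the names n, deviations, manual_var and manual_sd are only illustrative.

```r
a <- c(1, 10, 100)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
n <- length(a)
deviations <- a - mean(a)
manual_var <- sum(deviations^2) / (n - 1)
manual_var

# The standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)
manual_sd

# Both should agree with the built-in functions
all.equal(manual_var, var(a))
all.equal(manual_sd, sd(a))
```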

R object types

Whenever we are using an R object we should check exactly what sort of object it is. Using a command on the wrong type of object is the most common type of error you will encounter.

We use the str() command to check the structure. Other commands we can use to investigate the data structure include class() and typeof()

The colon : can be used to specify a series or range. For example 1:5 for numbers 1 to 5.

Note the different object types here:

A is a numeric vector

B is an integer vector (series)

C is a named numeric vector

D is a character string

E is a character vector, which becomes a named character vector once names are added

A <- c(1,10,100)
A
## [1]   1  10 100
str(A)
##  num [1:3] 1 10 100
class(A)
## [1] "numeric"
typeof(A)
## [1] "double"
B <- 1:10
B
##  [1]  1  2  3  4  5  6  7  8  9 10
str(B)
##  int [1:10] 1 2 3 4 5 6 7 8 9 10
class(B)
## [1] "integer"
typeof(B)
## [1] "integer"
C <- c("prime1"=2, "prime2"=3, "prime3"=5, "prime4"=7)
C
## prime1 prime2 prime3 prime4 
##      2      3      5      7
str(C)
##  Named num [1:4] 2 3 5 7
##  - attr(*, "names")= chr [1:4] "prime1" "prime2" "prime3" "prime4"
class(C)
## [1] "numeric"
typeof(C)
## [1] "double"
D <- "x1"
D
## [1] "x1"
str(D)
##  chr "x1"
class(D)
## [1] "character"
typeof(D)
## [1] "character"
E <- c("x1", "y2", "z3")
E
## [1] "x1" "y2" "z3"
str(E)
##  chr [1:3] "x1" "y2" "z3"
names(E) <- c("code1","code2","code3")
E
## code1 code2 code3 
##  "x1"  "y2"  "z3"
str(E)
##  Named chr [1:3] "x1" "y2" "z3"
##  - attr(*, "names")= chr [1:3] "code1" "code2" "code3"
class(E)
## [1] "character"
typeof(E)
## [1] "character"

Note how the names can be added during or after the vector is defined.
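Here is a small example of why checking the object type matters. It assumes a hypothetical vector called counts where numbers have been stored as text, which often happens when data are imported. tryCatch() is used so that the error message is captured instead of stopping the script.

```r
# Numbers stored as text (a made-up example)
counts <- c("5", "12", "7")
str(counts)

# sum() fails on character data; capture the error message instead of stopping
result <- tryCatch(sum(counts), error = function(e) conditionMessage(e))
result

# Convert to numeric first, then the calculation works
counts_num <- as.numeric(counts)
str(counts_num)
sum(counts_num)
```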

Factors

Factors are a different type of data in R. They are displayed as non-numeric categories, but are stored internally as integer codes. Factors are typically used in R for categorical data like biological sex. The example here, marital status, is an unordered category.

x <- factor(c("single", "married", "married", "single","defacto","widowed"))
x
## [1] single  married married single  defacto widowed
## Levels: defacto married single widowed
str(x)
##  Factor w/ 4 levels "defacto","married",..: 3 2 2 3 1 4
levels(x)
## [1] "defacto" "married" "single"  "widowed"

But it is possible to have ordered factors as well.

Take this survey result as an example, where respondents rate the food at a restaurant between very poor and very good. These responses have a natural order, so it makes sense to treat them as ordered factors.

y <- c("good", "very good", "fair", "poor","good")
str(y)
##  chr [1:5] "good" "very good" "fair" "poor" "good"
yy <- factor(y, levels = c("very poor","poor","fair","good","very good"),ordered = TRUE)
yy
## [1] good      very good fair      poor      good     
## Levels: very poor < poor < fair < good < very good
str(yy)
##  Ord.factor w/ 5 levels "very poor"<"poor"<..: 4 5 3 2 4
levels(yy)
## [1] "very poor" "poor"      "fair"      "good"      "very good"
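One practical benefit of ordered factors is that comparisons respect the level order. The sketch below rebuilds the same ratings; the name rating_levels is just an illustrative helper.

```r
y <- c("good", "very good", "fair", "poor", "good")
rating_levels <- c("very poor", "poor", "fair", "good", "very good")
yy <- factor(y, levels = rating_levels, ordered = TRUE)

# Count responses per level, including levels that nobody selected
table(yy)

# Which responses are "good" or better?
yy >= "good"

# The highest rating that was given
max(yy)

# The integer codes stored behind the labels
as.integer(yy)
```

These comparisons only work for ordered factors. With an unordered factor, R returns NA with a warning.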

Logicals

R also uses TRUE/FALSE a lot, as a data type as well as an option when executing commands.

It is also possible to have vectors of logical values.

myvariable1 <- 0
as.logical(myvariable1)
## [1] FALSE
myvariable2 <- 1
as.logical(myvariable2)
## [1] TRUE
vals <- c(0,1,0,1,0,0,0,1,1,0,1,0,1,0)
vals
##  [1] 0 1 0 1 0 0 0 1 1 0 1 0 1 0
str(vals)
##  num [1:14] 0 1 0 1 0 0 0 1 1 0 ...
as.logical(vals)
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE
## [13]  TRUE FALSE
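Logical vectors are often created by comparisons rather than by conversion. As a brief sketch using the same vals vector (is_one is an illustrative name), note that TRUE counts as 1 and FALSE as 0 in arithmetic:

```r
vals <- c(0,1,0,1,0,0,0,1,1,0,1,0,1,0)

# A comparison returns a logical vector of the same length
is_one <- vals == 1
is_one

# sum() counts the TRUE values and mean() gives the proportion of TRUE values
sum(is_one)
mean(is_one)

# which() gives the positions of the TRUE values
which(is_one)
```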

Subsetting vectors

Next, we would like to subset vectors. To do this, we use square brackets and inside the square bracket we indicate which values we want, using 1 for the first and 2 for second and so on.

See how it is possible to get the last and second last elements.

We can subset vectors, run arithmetic operations and save the results to a new variable in a single line.

a <- c(2:22,98,124,3002)
length(a)
## [1] 24
a[2]
## [1] 3
a[3:4]
## [1] 4 5
a[c(1,3)]
## [1] 2 4
a[length(a)]
## [1] 3002
a[(length(a)-1)]
## [1] 124
x <- a[10:(length(a)-1)] * 2
x
##  [1]  22  24  26  28  30  32  34  36  38  40  42  44 196 248
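A few other ways of subsetting are worth knowing. This sketch uses the same vector and shows negative indexes, head() and tail(), and subsetting with a logical condition.

```r
a <- c(2:22, 98, 124, 3002)

# A negative index removes that element instead of selecting it
a[-1]

# head() and tail() return the first or last n elements
head(a, 3)
tail(a, 2)

# A logical condition inside the brackets keeps elements where it is TRUE
a[a > 100]
```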

Coercing R data into different types

As shown above with commands like factor and as.logical, it is possible to convert objects into different types.

Here are some further examples.

Note that some conversions don’t make sense, which can cause errors in your analysis.

a <- c(1.9,2.7,3.3,5.1,9.9,0)
a
## [1] 1.9 2.7 3.3 5.1 9.9 0.0
as.integer(a)
## [1] 1 2 3 5 9 0
as.character(a)
## [1] "1.9" "2.7" "3.3" "5.1" "9.9" "0"
as.logical(a)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
as.factor(a)
## [1] 1.9 2.7 3.3 5.1 9.9 0  
## Levels: 0 1.9 2.7 3.3 5.1 9.9
b <- c("abc","def","ghi","jkl")
b
## [1] "abc" "def" "ghi" "jkl"
as.numeric(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(b)
## [1] NA NA NA NA
as.integer(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
my_factor <- as.factor(b)
as.numeric(my_factor)
## [1] 1 2 3 4
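The last result shows that as.numeric() on a factor returns the internal level codes. This matters when a factor holds numbers as labels, as in this made-up example where f is a hypothetical factor:

```r
# A factor whose labels look like numbers
f <- factor(c("10", "20", "5", "20"))

# This returns the integer level codes, not the values 10, 20, 5, 20
codes <- as.numeric(f)
codes

# Convert to character first to recover the original numbers
values <- as.numeric(as.character(f))
values
```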

Creating random and semi random data

This is used a lot in statistics and probability as well as simulation analysis.

nums <- 1:5
sample(x = nums, size = 3)
## [1] 3 4 5
sample(x = nums, size = 5)
## [1] 4 3 1 5 2
sample(x = nums ,size = 10, replace = TRUE)
##  [1] 5 3 1 5 3 1 2 3 4 1

We can also sample from distributions. Here we are sampling 5 numbers from a normal distribution with a mean of 10 and standard deviation of 2.

d <- rnorm(n = 5, mean = 10, sd = 2)
d
## [1] 12.219169  7.641795 11.501564 10.305560 10.606350

Here we are sampling 20 numbers from a binomial distribution with a size of 50 and probability of 0.5

b <- rbinom(n = 20, size = 50, prob = 0.5)
b
##  [1] 21 27 24 31 28 32 22 26 23 27 25 26 24 28 18 17 22 25 20 28
mean(b)
## [1] 24.7
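Because these values are random, you will get different numbers each time you run the commands. If you need reproducible results, set.seed() can be used before sampling. In this sketch the seed value 42 and the names first, second and third are arbitrary choices.

```r
# Setting the same seed before sampling gives the same "random" numbers
set.seed(42)
first <- rnorm(n = 5, mean = 10, sd = 2)

set.seed(42)
second <- rnorm(n = 5, mean = 10, sd = 2)

identical(first, second)

# Without resetting the seed, the next draw is different
third <- rnorm(n = 5, mean = 10, sd = 2)
identical(first, third)
```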

Basic plots

Creating basic plots in R isn’t very difficult. Here are some simple ones.

Dot plot and line plot.

Adding extra lines.

Changing line colour and adding a subheading

a <- (1:10)^2
a
##  [1]   1   4   9  16  25  36  49  64  81 100
plot(a)

plot(a,type="l")

plot(a,type="b")

plot(a,type="b")
lines(a/2, type="b",col="red")
mtext("Black:Growth of A. Red: growth of A/2")

Now for scatterplots.

We can change the point type (pch) and size (cex), as well as add a main heading (main).

We can also add additional series of points to the chart and adjust the axis limits with xlim and ylim.

x_vals <- rnorm(n = 1000, mean = 10, sd = 2)

d_error <- rnorm(n = 1000, mean = 1, sd = 0.1)

y_vals <- x_vals * d_error

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values")

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, 
     cex=0.5, main="Plot of X and Y values",
     ylim=c(0,17))
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")

Let’s now run a linear regression on the relationship of y with x.

linear_regression_model <- lm(y_vals ~ x_vals)

summary(linear_regression_model)
## 
## Call:
## lm(formula = y_vals ~ x_vals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5715 -0.6726 -0.0191  0.6932  4.2832 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.007809   0.172762  -0.045    0.964    
## x_vals       0.999330   0.016953  58.946   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.083 on 998 degrees of freedom
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7766 
## F-statistic:  3475 on 1 and 998 DF,  p-value: < 2.2e-16
# coefficients[1] is the intercept and coefficients[2] is the slope for x_vals
SLOPE <- linear_regression_model$coefficients[2]
INTERCEPT <- linear_regression_model$coefficients[1]

HEADER <- paste("Slope:",signif(SLOPE,4),"Intercept:",signif(INTERCEPT,4))

plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5)
abline(linear_regression_model,col="red",lty=2,lwd=3)
mtext(HEADER)
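The coefficients of a fitted model are named, so they can also be extracted by name with coef() instead of by position, which avoids mixing up the intercept and the slope. The following is a sketch on simulated data where the true slope is 2 and the true intercept is 1. The variable names and the seed are assumptions for the example.

```r
set.seed(1)

# Simulate y = 2x + 1 plus some noise
x <- rnorm(n = 1000, mean = 10, sd = 2)
y <- 2 * x + 1 + rnorm(n = 1000, mean = 0, sd = 0.5)

fit <- lm(y ~ x)

# A named numeric vector: "(Intercept)" first, then one entry per predictor
coef(fit)

# Extract by name rather than by position
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["x"]]

signif(slope, 4)
signif(intercept, 4)
```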

Barplots are also really useful.

Give the quantities in the vector names so that each bar gets a label.

names(a) <- 1:length(a)
barplot(a)

barplot(a,horiz = TRUE,las=1, xlab = "Measurements")

Boxplots are relatively easy to create.

boxplot(x_vals, y_vals/2)

boxplot(x_vals, y_vals/2, names=c("X values", "Y values"),ylab="Measurement (cm)")

Enhanced boxplot with beeswarm.

library("beeswarm")
mylist <- list("X vals"=x_vals, "Y vals"=y_vals/2)
boxplot(mylist, cex=0, ylab="Measurement (cm)",col="white",main="Main title") #cex=0 means no outliers shown
beeswarm(mylist,pch=1,add=TRUE,cex=0.5)

Histograms are easily made.

And it is possible to place multiple charts on a single image.

hist(x_vals)

par(mfrow = c(2, 1))
hist(x_vals,main="")
hist(y_vals,main="")
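On the HPC command line there is no plot window, so it is handy to write charts to a file. This sketch uses pdf() and dev.off() from base R, and also restores the previous par() settings so later plots are not split into two rows. The data and the file name histograms.pdf are made up for the example.

```r
# Made-up data similar to the scatterplot example
x_demo <- rnorm(n = 1000, mean = 10, sd = 2)
y_demo <- x_demo * rnorm(n = 1000, mean = 1, sd = 0.1)

# Open a PDF file; everything plotted until dev.off() goes into the file
plot_file <- file.path(tempdir(), "histograms.pdf")
pdf(plot_file, width = 7, height = 7)

# par() returns the old settings, which we keep so they can be restored
old_par <- par(mfrow = c(2, 1))
hist(x_demo, main = "x values")
hist(y_demo, main = "y values")
par(old_par)

# Close the file
dev.off()

file.exists(plot_file)
```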

Part 1: Homework questions for group study

  1. Calculate the sum of all integers between 500 and 600

  2. Calculate the sum of the square roots of all integers between 900 and 1000

  3. Create the following datasets and plot a boxplot:

     a. Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 5

     b. Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 10

  4. Plot a and b above as a scatterplot, and plot the trend line.

  5. Plot a and b above as histograms on the same chart.

Intro to R part 2

Data tables

So far we have only worked with 1 dimensional data, so let’s get to know 2D tables.

The two most common types are data frames and matrices. We can use str() to distinguish them.

head(freeny.x)
##      lag quarterly revenue price index income level market potential
## [1,]               8.79636     4.70997      5.82110          12.9699
## [2,]               8.79236     4.70217      5.82558          12.9733
## [3,]               8.79137     4.68944      5.83112          12.9774
## [4,]               8.81486     4.68558      5.84046          12.9806
## [5,]               8.81301     4.64019      5.85036          12.9831
## [6,]               8.90751     4.62553      5.86464          12.9854
str(freeny.x)
##  num [1:39, 1:4] 8.8 8.79 8.79 8.81 8.81 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "lag quarterly revenue" "price index" "income level" "market potential"
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Some commands can be used for matrices but not for data frames. As an example, try mean() on freeny.x and mtcars.
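A short sketch of what to expect: mean() treats a numeric matrix as one long vector of numbers, but a data frame is a list of columns, so mean() returns NA with a warning. The name res is illustrative, and suppressWarnings() is only used here to keep the output tidy.

```r
# Works: the mean of all values in the matrix
mean(freeny.x)

# Does not work on a data frame: returns NA (with a warning)
res <- suppressWarnings(mean(mtcars))
res

# Alternatives for data frames: per-column means, or convert to a matrix first
colMeans(mtcars)
mean(as.matrix(mtcars))
```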

Basic commands that can be used for data frames and matrices

Find out the number of rows, columns, and simple operations on those rows and columns.

# number of columns
ncol(mtcars)
## [1] 11
# number of rows
nrow(mtcars)
## [1] 32
# dimensions
dim(mtcars)
## [1] 32 11
# analysing rows and columns
colMeans(freeny.x)
## lag quarterly revenue           price index          income level 
##              9.280718              4.496182              6.038596 
##      market potential 
##             13.066831
colSums(freeny.x)
## lag quarterly revenue           price index          income level 
##              361.9480              175.3511              235.5052 
##      market potential 
##              509.6064
rowMeans(freeny.x)
##  [1] 8.074333 8.073353 8.072332 8.080375 8.071665 8.095770 8.106082 8.117520
##  [9] 8.124863 8.140490 8.149078 8.158940 8.158470 8.180965 8.192552 8.205092
## [17] 8.209112 8.223212 8.230868 8.231193 8.240507 8.250258 8.249220 8.260175
## [25] 8.267057 8.269210 8.282000 8.293537 8.297823 8.310002 8.315660 8.317607
## [33] 8.317818 8.330682 8.330505 8.333490 8.339897 8.345975 8.354988
rowSums(freeny.x)
##  [1] 32.29733 32.29341 32.28933 32.32150 32.28666 32.38308 32.42433 32.47008
##  [9] 32.49945 32.56196 32.59631 32.63576 32.63388 32.72386 32.77021 32.82037
## [17] 32.83645 32.89285 32.92347 32.92477 32.96203 33.00103 32.99688 33.04070
## [25] 33.06823 33.07684 33.12800 33.17415 33.19129 33.24001 33.26264 33.27043
## [33] 33.27127 33.32273 33.32202 33.33396 33.35959 33.38390 33.41995
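For summaries that do not have a dedicated function like colMeans(), the base R apply() function can run any function over rows (1) or columns (2). A brief sketch, where col_medians and row_max are illustrative names:

```r
# Median of each column (2 = columns)
col_medians <- apply(freeny.x, 2, median)
col_medians

# Maximum of each row (1 = rows)
row_max <- apply(freeny.x, 1, max)
head(row_max)

# apply() with mean gives the same answer as colMeans()
all.equal(apply(freeny.x, 2, mean), colMeans(freeny.x))
```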

You can also transpose a matrix or data frame. But be careful, transposing a data frame will automatically convert it to a matrix which could cause downstream errors.

freeny_flip <- t(freeny.x)
head(freeny_flip)
##                           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## lag quarterly revenue  8.79636  8.79236  8.79137  8.81486  8.81301  8.90751
## price index            4.70997  4.70217  4.68944  4.68558  4.64019  4.62553
## income level           5.82110  5.82558  5.83112  5.84046  5.85036  5.86464
## market potential      12.96990 12.97330 12.97740 12.98060 12.98310 12.98540
##                           [,7]     [,8]     [,9]    [,10]    [,11]    [,12]
## lag quarterly revenue  8.93673  8.96161  8.96044  9.00868  9.03049  9.06906
## price index            4.61991  4.61654  4.61407  4.60766  4.60227  4.58960
## income level           5.87769  5.89763  5.92574  5.94232  5.95365  5.96120
## market potential      12.99000 12.99430 12.99920 13.00330 13.00990 13.01590
##                          [,13]    [,14]    [,15]    [,16]    [,17]    [,18]
## lag quarterly revenue  9.05871  9.10698  9.12685  9.17096  9.18665  9.23823
## price index            4.57592  4.58661  4.57997  4.57176  4.56104  4.54906
## income level           5.97805  6.00377  6.02829  6.03475  6.03906  6.05046
## market potential      13.02120 13.02650 13.03510 13.04290 13.04970 13.05510
##                          [,19]    [,20]    [,21]    [,22]    [,23]    [,24]
## lag quarterly revenue  9.26487  9.28436  9.31378  9.35025  9.35835  9.39767
## price index            4.53957  4.51018  4.50352  4.49360  4.46505  4.44924
## income level           6.05563  6.06093  6.07103  6.08018  6.08858  6.10199
## market potential      13.06340 13.06930 13.07370 13.07700 13.08490 13.09180
##                          [,25]    [,26]    [,27]    [,28]    [,29]    [,30]
## lag quarterly revenue  9.42150  9.44223  9.48721  9.52374  9.53980  9.58123
## price index            4.43966  4.42025  4.41060  4.41151  4.39810  4.38513
## income level           6.11207  6.11596  6.12129  6.12200  6.13119  6.14705
## market potential      13.09500 13.09840 13.10890 13.11690 13.12220 13.12660
##                          [,31]    [,32]    [,33]    [,34]    [,35]    [,36]
## lag quarterly revenue  9.60048  9.64496  9.64390  9.69405  9.69958  9.68683
## price index            4.37320  4.32770  4.32023  4.30909  4.30909  4.30552
## income level           6.15336  6.15627  6.16274  6.17369  6.16135  6.18231
## market potential      13.13560 13.14150 13.14440 13.14590 13.15200 13.15930
##                          [,37]    [,38]    [,39]
## lag quarterly revenue  9.71774  9.74924  9.77536
## price index            4.29627  4.27839  4.27789
## income level           6.18768  6.19377  6.20030
## market potential      13.15790 13.16250 13.16640
mtcars_flip <- t(mtcars)
head(mtcars_flip)
##      Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
## mpg      21.00        21.000      22.80         21.400             18.70
## cyl       6.00         6.000       4.00          6.000              8.00
## disp    160.00       160.000     108.00        258.000            360.00
## hp      110.00       110.000      93.00        110.000            175.00
## drat      3.90         3.900       3.85          3.080              3.15
## wt        2.62         2.875       2.32          3.215              3.44
##      Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
## mpg    18.10      14.30     24.40    22.80    19.20     17.80      16.40
## cyl     6.00       8.00      4.00     4.00     6.00      6.00       8.00
## disp  225.00     360.00    146.70   140.80   167.60    167.60     275.80
## hp    105.00     245.00     62.00    95.00   123.00    123.00     180.00
## drat    2.76       3.21      3.69     3.92     3.92      3.92       3.07
## wt      3.46       3.57      3.19     3.15     3.44      3.44       4.07
##      Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## mpg       17.30       15.20              10.40              10.400
## cyl        8.00        8.00               8.00               8.000
## disp     275.80      275.80             472.00             460.000
## hp       180.00      180.00             205.00             215.000
## drat       3.07        3.07               2.93               3.000
## wt         3.73        3.78               5.25               5.424
##      Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
## mpg             14.700    32.40      30.400         33.900        21.500
## cyl              8.000     4.00       4.000          4.000         4.000
## disp           440.000    78.70      75.700         71.100       120.100
## hp             230.000    66.00      52.000         65.000        97.000
## drat             3.230     4.08       4.930          4.220         3.700
## wt               5.345     2.20       1.615          1.835         2.465
##      Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
## mpg             15.50      15.200      13.30           19.200    27.300
## cyl              8.00       8.000       8.00            8.000     4.000
## disp           318.00     304.000     350.00          400.000    79.000
## hp             150.00     150.000     245.00          175.000    66.000
## drat             2.76       3.150       3.73            3.080     4.080
## wt               3.52       3.435       3.84            3.845     1.935
##      Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
## mpg          26.00       30.400          15.80        19.70         15.00
## cyl           4.00        4.000           8.00         6.00          8.00
## disp        120.30       95.100         351.00       145.00        301.00
## hp           91.00      113.000         264.00       175.00        335.00
## drat          4.43        3.770           4.22         3.62          3.54
## wt            2.14        1.513           3.17         2.77          3.57
##      Volvo 142E
## mpg       21.40
## cyl        4.00
## disp     121.00
## hp       109.00
## drat       4.11
## wt         2.78
str(mtcars_flip)
##  num [1:11, 1:32] 21 6 160 110 3.9 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
##   ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
mtcars_flip <- as.data.frame(mtcars_flip)
str(mtcars_flip)
## 'data.frame':    11 obs. of  32 variables:
##  $ Mazda RX4          : num  21 6 160 110 3.9 ...
##  $ Mazda RX4 Wag      : num  21 6 160 110 3.9 ...
##  $ Datsun 710         : num  22.8 4 108 93 3.85 ...
##  $ Hornet 4 Drive     : num  21.4 6 258 110 3.08 ...
##  $ Hornet Sportabout  : num  18.7 8 360 175 3.15 ...
##  $ Valiant            : num  18.1 6 225 105 2.76 ...
##  $ Duster 360         : num  14.3 8 360 245 3.21 ...
##  $ Merc 240D          : num  24.4 4 146.7 62 3.69 ...
##  $ Merc 230           : num  22.8 4 140.8 95 3.92 ...
##  $ Merc 280           : num  19.2 6 167.6 123 3.92 ...
##  $ Merc 280C          : num  17.8 6 167.6 123 3.92 ...
##  $ Merc 450SE         : num  16.4 8 275.8 180 3.07 ...
##  $ Merc 450SL         : num  17.3 8 275.8 180 3.07 ...
##  $ Merc 450SLC        : num  15.2 8 275.8 180 3.07 ...
##  $ Cadillac Fleetwood : num  10.4 8 472 205 2.93 ...
##  $ Lincoln Continental: num  10.4 8 460 215 3 ...
##  $ Chrysler Imperial  : num  14.7 8 440 230 3.23 ...
##  $ Fiat 128           : num  32.4 4 78.7 66 4.08 ...
##  $ Honda Civic        : num  30.4 4 75.7 52 4.93 ...
##  $ Toyota Corolla     : num  33.9 4 71.1 65 4.22 ...
##  $ Toyota Corona      : num  21.5 4 120.1 97 3.7 ...
##  $ Dodge Challenger   : num  15.5 8 318 150 2.76 ...
##  $ AMC Javelin        : num  15.2 8 304 150 3.15 ...
##  $ Camaro Z28         : num  13.3 8 350 245 3.73 ...
##  $ Pontiac Firebird   : num  19.2 8 400 175 3.08 ...
##  $ Fiat X1-9          : num  27.3 4 79 66 4.08 ...
##  $ Porsche 914-2      : num  26 4 120.3 91 4.43 ...
##  $ Lotus Europa       : num  30.4 4 95.1 113 3.77 ...
##  $ Ford Pantera L     : num  15.8 8 351 264 4.22 3.17 14.5 0 1 5 ...
##  $ Ferrari Dino       : num  19.7 6 145 175 3.62 2.77 15.5 0 1 5 ...
##  $ Maserati Bora      : num  15 8 301 335 3.54 3.57 14.6 0 1 5 ...
##  $ Volvo 142E         : num  21.4 4 121 109 4.11 2.78 18.6 1 1 4 ...

Subsetting a data frame

One of the most common tasks in data analysis is to perform filtering. Last prac, we found out how to do this with vectors using the square bracket notation, e.g. x[3] will retrieve the 3rd element of x. Square brackets can also be used for two dimensional objects, but we need to provide two indexes. The syntax is df[rows,cols].

# get rows 1-10 of column 2
freeny.x[1:10,2]
##  [1] 4.70997 4.70217 4.68944 4.68558 4.64019 4.62553 4.61991 4.61654 4.61407
## [10] 4.60766
# get rows 1-6 of columns 1-3
freeny.x[1:6,1:3]
##      lag quarterly revenue price index income level
## [1,]               8.79636     4.70997      5.82110
## [2,]               8.79236     4.70217      5.82558
## [3,]               8.79137     4.68944      5.83112
## [4,]               8.81486     4.68558      5.84046
## [5,]               8.81301     4.64019      5.85036
## [6,]               8.90751     4.62553      5.86464
# get rows 1-6 of all columns
freeny.x[1:6,]
##      lag quarterly revenue price index income level market potential
## [1,]               8.79636     4.70997      5.82110          12.9699
## [2,]               8.79236     4.70217      5.82558          12.9733
## [3,]               8.79137     4.68944      5.83112          12.9774
## [4,]               8.81486     4.68558      5.84046          12.9806
## [5,]               8.81301     4.64019      5.85036          12.9831
## [6,]               8.90751     4.62553      5.86464          12.9854
# get all rows for columns 1 and 2 
freeny.x[,1:2]
##       lag quarterly revenue price index
##  [1,]               8.79636     4.70997
##  [2,]               8.79236     4.70217
##  [3,]               8.79137     4.68944
##  [4,]               8.81486     4.68558
##  [5,]               8.81301     4.64019
##  [6,]               8.90751     4.62553
##  [7,]               8.93673     4.61991
##  [8,]               8.96161     4.61654
##  [9,]               8.96044     4.61407
## [10,]               9.00868     4.60766
## [11,]               9.03049     4.60227
## [12,]               9.06906     4.58960
## [13,]               9.05871     4.57592
## [14,]               9.10698     4.58661
## [15,]               9.12685     4.57997
## [16,]               9.17096     4.57176
## [17,]               9.18665     4.56104
## [18,]               9.23823     4.54906
## [19,]               9.26487     4.53957
## [20,]               9.28436     4.51018
## [21,]               9.31378     4.50352
## [22,]               9.35025     4.49360
## [23,]               9.35835     4.46505
## [24,]               9.39767     4.44924
## [25,]               9.42150     4.43966
## [26,]               9.44223     4.42025
## [27,]               9.48721     4.41060
## [28,]               9.52374     4.41151
## [29,]               9.53980     4.39810
## [30,]               9.58123     4.38513
## [31,]               9.60048     4.37320
## [32,]               9.64496     4.32770
## [33,]               9.64390     4.32023
## [34,]               9.69405     4.30909
## [35,]               9.69958     4.30909
## [36,]               9.68683     4.30552
## [37,]               9.71774     4.29627
## [38,]               9.74924     4.27839
## [39,]               9.77536     4.27789

Now we need to see what happens when we subset just one column or row. You can see the default behaviour is to convert the data from matrix format to a vector. We can modify this using drop=FALSE to keep it in matrix format.

# get all rows of column 1
freeny.x[,1]
##  [1] 8.79636 8.79236 8.79137 8.81486 8.81301 8.90751 8.93673 8.96161 8.96044
## [10] 9.00868 9.03049 9.06906 9.05871 9.10698 9.12685 9.17096 9.18665 9.23823
## [19] 9.26487 9.28436 9.31378 9.35025 9.35835 9.39767 9.42150 9.44223 9.48721
## [28] 9.52374 9.53980 9.58123 9.60048 9.64496 9.64390 9.69405 9.69958 9.68683
## [37] 9.71774 9.74924 9.77536
# to prevent conversion to vector, use drop=FALSE
freeny.x[,1,drop=FALSE]
##       lag quarterly revenue
##  [1,]               8.79636
##  [2,]               8.79236
##  [3,]               8.79137
##  [4,]               8.81486
##  [5,]               8.81301
##  [6,]               8.90751
##  [7,]               8.93673
##  [8,]               8.96161
##  [9,]               8.96044
## [10,]               9.00868
## [11,]               9.03049
## [12,]               9.06906
## [13,]               9.05871
## [14,]               9.10698
## [15,]               9.12685
## [16,]               9.17096
## [17,]               9.18665
## [18,]               9.23823
## [19,]               9.26487
## [20,]               9.28436
## [21,]               9.31378
## [22,]               9.35025
## [23,]               9.35835
## [24,]               9.39767
## [25,]               9.42150
## [26,]               9.44223
## [27,]               9.48721
## [28,]               9.52374
## [29,]               9.53980
## [30,]               9.58123
## [31,]               9.60048
## [32,]               9.64496
## [33,]               9.64390
## [34,]               9.69405
## [35,]               9.69958
## [36,]               9.68683
## [37,]               9.71774
## [38,]               9.74924
## [39,]               9.77536
# the same concept for rows
freeny.x[1,]
## lag quarterly revenue           price index          income level 
##               8.79636               4.70997               5.82110 
##      market potential 
##              12.96990
# use drop=FALSE to keep a single row in matrix format
freeny.x[1,,drop=FALSE]
##      lag quarterly revenue price index income level market potential
## [1,]               8.79636     4.70997       5.8211          12.9699

Square brackets also work for data frames.

mtcars[1:10,1:6]
##                    mpg cyl  disp  hp drat    wt
## Mazda RX4         21.0   6 160.0 110 3.90 2.620
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875
## Datsun 710        22.8   4 108.0  93 3.85 2.320
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440
## Valiant           18.1   6 225.0 105 2.76 3.460
## Duster 360        14.3   8 360.0 245 3.21 3.570
## Merc 240D         24.4   4 146.7  62 3.69 3.190
## Merc 230          22.8   4 140.8  95 3.92 3.150
## Merc 280          19.2   6 167.6 123 3.92 3.440
mtcars[,1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars[,1,drop=FALSE]
##                      mpg
## Mazda RX4           21.0
## Mazda RX4 Wag       21.0
## Datsun 710          22.8
## Hornet 4 Drive      21.4
## Hornet Sportabout   18.7
## Valiant             18.1
## Duster 360          14.3
## Merc 240D           24.4
## Merc 230            22.8
## Merc 280            19.2
## Merc 280C           17.8
## Merc 450SE          16.4
## Merc 450SL          17.3
## Merc 450SLC         15.2
## Cadillac Fleetwood  10.4
## Lincoln Continental 10.4
## Chrysler Imperial   14.7
## Fiat 128            32.4
## Honda Civic         30.4
## Toyota Corolla      33.9
## Toyota Corona       21.5
## Dodge Challenger    15.5
## AMC Javelin         15.2
## Camaro Z28          13.3
## Pontiac Firebird    19.2
## Fiat X1-9           27.3
## Porsche 914-2       26.0
## Lotus Europa        30.4
## Ford Pantera L      15.8
## Ferrari Dino        19.7
## Maserati Bora       15.0
## Volvo 142E          21.4

Data frames offer more ways to subset. For example, we can select columns or rows by name.

mtcars[,"cyl"]
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[,c("mpg","wt")]
##                      mpg    wt
## Mazda RX4           21.0 2.620
## Mazda RX4 Wag       21.0 2.875
## Datsun 710          22.8 2.320
## Hornet 4 Drive      21.4 3.215
## Hornet Sportabout   18.7 3.440
## Valiant             18.1 3.460
## Duster 360          14.3 3.570
## Merc 240D           24.4 3.190
## Merc 230            22.8 3.150
## Merc 280            19.2 3.440
## Merc 280C           17.8 3.440
## Merc 450SE          16.4 4.070
## Merc 450SL          17.3 3.730
## Merc 450SLC         15.2 3.780
## Cadillac Fleetwood  10.4 5.250
## Lincoln Continental 10.4 5.424
## Chrysler Imperial   14.7 5.345
## Fiat 128            32.4 2.200
## Honda Civic         30.4 1.615
## Toyota Corolla      33.9 1.835
## Toyota Corona       21.5 2.465
## Dodge Challenger    15.5 3.520
## AMC Javelin         15.2 3.435
## Camaro Z28          13.3 3.840
## Pontiac Firebird    19.2 3.845
## Fiat X1-9           27.3 1.935
## Porsche 914-2       26.0 2.140
## Lotus Europa        30.4 1.513
## Ford Pantera L      15.8 3.170
## Ferrari Dino        19.7 2.770
## Maserati Bora       15.0 3.570
## Volvo 142E          21.4 2.780
mtcars["Camaro Z28",c("mpg","wt")]
##             mpg   wt
## Camaro Z28 13.3 3.84

Data frame columns can also be extracted using the $ notation. The syntax is df$col.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

This notation can also be used to create new columns. In the example below, we convert miles per gallon to litres per 100 km by dividing 235.215 by the mpg value (this constant assumes US gallons). We then repeat the calculation and round the result to three significant figures with signif().

mtcars$lper100km <- 235.215 / mtcars$mpg

mtcars[,c(1,ncol(mtcars))]
##                      mpg lper100km
## Mazda RX4           21.0 11.200714
## Mazda RX4 Wag       21.0 11.200714
## Datsun 710          22.8 10.316447
## Hornet 4 Drive      21.4 10.991355
## Hornet Sportabout   18.7 12.578342
## Valiant             18.1 12.995304
## Duster 360          14.3 16.448601
## Merc 240D           24.4  9.639959
## Merc 230            22.8 10.316447
## Merc 280            19.2 12.250781
## Merc 280C           17.8 13.214326
## Merc 450SE          16.4 14.342378
## Merc 450SL          17.3 13.596243
## Merc 450SLC         15.2 15.474671
## Cadillac Fleetwood  10.4 22.616827
## Lincoln Continental 10.4 22.616827
## Chrysler Imperial   14.7 16.001020
## Fiat 128            32.4  7.259722
## Honda Civic         30.4  7.737336
## Toyota Corolla      33.9  6.938496
## Toyota Corona       21.5 10.940233
## Dodge Challenger    15.5 15.175161
## AMC Javelin         15.2 15.474671
## Camaro Z28          13.3 17.685338
## Pontiac Firebird    19.2 12.250781
## Fiat X1-9           27.3  8.615934
## Porsche 914-2       26.0  9.046731
## Lotus Europa        30.4  7.737336
## Ford Pantera L      15.8 14.887025
## Ferrari Dino        19.7 11.939848
## Maserati Bora       15.0 15.681000
## Volvo 142E          21.4 10.991355
mtcars$lper100km <- signif(235.215 / mtcars$mpg ,3)

mtcars[,c(1,ncol(mtcars))]
##                      mpg lper100km
## Mazda RX4           21.0     11.20
## Mazda RX4 Wag       21.0     11.20
## Datsun 710          22.8     10.30
## Hornet 4 Drive      21.4     11.00
## Hornet Sportabout   18.7     12.60
## Valiant             18.1     13.00
## Duster 360          14.3     16.40
## Merc 240D           24.4      9.64
## Merc 230            22.8     10.30
## Merc 280            19.2     12.30
## Merc 280C           17.8     13.20
## Merc 450SE          16.4     14.30
## Merc 450SL          17.3     13.60
## Merc 450SLC         15.2     15.50
## Cadillac Fleetwood  10.4     22.60
## Lincoln Continental 10.4     22.60
## Chrysler Imperial   14.7     16.00
## Fiat 128            32.4      7.26
## Honda Civic         30.4      7.74
## Toyota Corolla      33.9      6.94
## Toyota Corona       21.5     10.90
## Dodge Challenger    15.5     15.20
## AMC Javelin         15.2     15.50
## Camaro Z28          13.3     17.70
## Pontiac Firebird    19.2     12.30
## Fiat X1-9           27.3      8.62
## Porsche 914-2       26.0      9.05
## Lotus Europa        30.4      7.74
## Ford Pantera L      15.8     14.90
## Ferrari Dino        19.7     11.90
## Maserati Bora       15.0     15.70
## Volvo 142E          21.4     11.00
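
Note that signif() counts significant figures, not decimal places, which is why the values above are shown as 11.20 and 9.64 rather than with three decimals. As a small sketch on three of the mpg values from the table (the vector names are made up for illustration), compare it with round():

```r
# fuel consumption in L/100km for cars with 21.0, 24.4 and 10.4 mpg
x <- 235.215 / c(21.0, 24.4, 10.4)

# three significant figures
sig3 <- signif(x, 3)
sig3   # 11.20  9.64 22.60

# three decimal places
dec3 <- round(x, 3)
dec3   # 11.201  9.640 22.617
```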

You may also want to subset a data frame based on the values. Let’s say you want a car with fuel consumption less than 10 L/100km. Let’s do it the hard way first.

mtcars$lper100km 
##  [1] 11.20 11.20 10.30 11.00 12.60 13.00 16.40  9.64 10.30 12.30 13.20 14.30
## [13] 13.60 15.50 22.60 22.60 16.00  7.26  7.74  6.94 10.90 15.20 15.50 17.70
## [25] 12.30  8.62  9.05  7.74 14.90 11.90 15.70 11.00
mtcars$lper100km < 10
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
which(mtcars$lper100km < 10)
## [1]  8 18 19 20 26 27 28
mtcars[which(mtcars$lper100km < 10),]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb lper100km
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      9.64
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1      7.26
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2      7.74
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1      6.94
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1      8.62
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2      9.05
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2      7.74
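
The which() step is optional here, because a logical vector can be placed directly inside the square brackets. The sketch below works on a fresh copy of the dataset so it runs on its own; the names cars, keep and economical are made up for illustration.

```r
# start from an unmodified copy of mtcars and rebuild the new column
cars <- datasets::mtcars
cars$lper100km <- signif(235.215 / cars$mpg, 3)

# TRUE for each car that uses less than 10 L/100km
keep <- cars$lper100km < 10

# TRUE counts as 1, so sum() gives the number of matching cars
sum(keep)   # 7

# a logical vector selects the rows where it is TRUE
economical <- cars[keep, ]
rownames(economical)
```

One difference worth knowing: if the column contained missing values, the logical version would return rows filled with NA for them, while which() silently drops them.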

You can see that this is quite complicated. There is an easier way using subset(). It is also well suited to filtering on more than one criterion using the & (and) and | (or) operators.

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
##                     lper100km
## Mazda RX4               11.20
## Mazda RX4 Wag           11.20
## Datsun 710              10.30
## Hornet 4 Drive          11.00
## Hornet Sportabout       12.60
## Valiant                 13.00
## Duster 360              16.40
## Merc 240D                9.64
## Merc 230                10.30
## Merc 280                12.30
## Merc 280C               13.20
## Merc 450SE              14.30
## Merc 450SL              13.60
## Merc 450SLC             15.50
## Cadillac Fleetwood      22.60
## Lincoln Continental     22.60
## Chrysler Imperial       16.00
## Fiat 128                 7.26
## Honda Civic              7.74
## Toyota Corolla           6.94
## Toyota Corona           10.90
## Dodge Challenger        15.20
## AMC Javelin             15.50
## Camaro Z28              17.70
## Pontiac Firebird        12.30
## Fiat X1-9                8.62
## Porsche 914-2            9.05
## Lotus Europa             7.74
## Ford Pantera L          14.90
## Ferrari Dino            11.90
## Maserati Bora           15.70
## Volvo 142E              11.00
subset(mtcars,lper100km < 10)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb lper100km
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      9.64
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1      7.26
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2      7.74
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1      6.94
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1      8.62
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2      9.05
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2      7.74
# you want an economical AND quick car
subset(mtcars,lper100km < 10 & qsec < 18)
##                mpg cyl  disp  hp drat    wt qsec vs am gear carb lper100km
## Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2      9.05
## Lotus Europa  30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2      7.74
# you want an economical OR quick car
subset(mtcars,lper100km < 10 | qsec < 18)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
##                     lper100km
## Mazda RX4               11.20
## Mazda RX4 Wag           11.20
## Hornet Sportabout       12.60
## Duster 360              16.40
## Merc 240D                9.64
## Merc 450SE              14.30
## Merc 450SL              13.60
## Cadillac Fleetwood      22.60
## Lincoln Continental     22.60
## Chrysler Imperial       16.00
## Fiat 128                 7.26
## Honda Civic              7.74
## Toyota Corolla           6.94
## Dodge Challenger        15.20
## AMC Javelin             15.50
## Camaro Z28              17.70
## Pontiac Firebird        12.30
## Fiat X1-9                8.62
## Porsche 914-2            9.05
## Lotus Europa             7.74
## Ford Pantera L          14.90
## Ferrari Dino            11.90
## Maserati Bora           15.70
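
When the result is wide, subset() can also limit the columns through its select argument. This is a minimal sketch on a fresh copy of the dataset; the names cars and quick_economical are made up for illustration.

```r
# start from an unmodified copy of mtcars and rebuild the new column
cars <- datasets::mtcars
cars$lper100km <- signif(235.215 / cars$mpg, 3)

# filter rows with a condition and keep only three columns
quick_economical <- subset(cars,
                           lper100km < 10 & qsec < 18,
                           select = c(mpg, qsec, lper100km))

quick_economical
rownames(quick_economical)   # "Porsche 914-2" "Lotus Europa"
```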

Subset also works for strings and factors. To see this, let's look at the iris dataset.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
setosa <- subset(iris,Species == "setosa")
head(setosa)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
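
To keep more than one category, the %in% operator can be used in place of ==. The sketch below also shows that a factor remembers its unused levels after subsetting, which droplevels() removes. The name two_species is made up for illustration.

```r
# keep rows where Species is one of two values
two_species <- subset(datasets::iris, Species %in% c("setosa", "virginica"))
nrow(two_species)   # 100

# "versicolor" is still listed as a level, with a count of zero
table(two_species$Species)

# remove factor levels that no longer occur
two_species <- droplevels(two_species)
nlevels(two_species$Species)   # 2
```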

Row and column names

You can use colnames() and rownames() to get the column or row names and even modify them.

colnames(mtcars)
##  [1] "mpg"       "cyl"       "disp"      "hp"        "drat"      "wt"       
##  [7] "qsec"      "vs"        "am"        "gear"      "carb"      "lper100km"
rownames(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
colnames(mtcars) <- c("miles per gallon",
              "number of cylinders",
              "displacement in cubic inches",
              "gross horsepower",
              "rear axle ratio",
              "weight (pounds/1000)",
              "quarter mile time in seconds",
              "V or straight cylinder configuration",
              "transmission type: auto (0) or manual (1)",
              "number of forward gears",
              "number of carburetors",
              "litres per 100km")

head(mtcars)
##                   miles per gallon number of cylinders
## Mazda RX4                     21.0                   6
## Mazda RX4 Wag                 21.0                   6
## Datsun 710                    22.8                   4
## Hornet 4 Drive                21.4                   6
## Hornet Sportabout             18.7                   8
## Valiant                       18.1                   6
##                   displacement in cubic inches gross horsepower rear axle ratio
## Mazda RX4                                  160              110            3.90
## Mazda RX4 Wag                              160              110            3.90
## Datsun 710                                 108               93            3.85
## Hornet 4 Drive                             258              110            3.08
## Hornet Sportabout                          360              175            3.15
## Valiant                                    225              105            2.76
##                   weight (pounds/1000) quarter mile time in seconds
## Mazda RX4                        2.620                        16.46
## Mazda RX4 Wag                    2.875                        17.02
## Datsun 710                       2.320                        18.61
## Hornet 4 Drive                   3.215                        19.44
## Hornet Sportabout                3.440                        17.02
## Valiant                          3.460                        20.22
##                   V or straight cylinder configuration
## Mazda RX4                                            0
## Mazda RX4 Wag                                        0
## Datsun 710                                           1
## Hornet 4 Drive                                       1
## Hornet Sportabout                                    0
## Valiant                                              1
##                   transmission type: auto (0) or manual (1)
## Mazda RX4                                                 1
## Mazda RX4 Wag                                             1
## Datsun 710                                                1
## Hornet 4 Drive                                            0
## Hornet Sportabout                                         0
## Valiant                                                   0
##                   number of forward gears number of carburetors
## Mazda RX4                               4                     4
## Mazda RX4 Wag                           4                     4
## Datsun 710                              4                     1
## Hornet 4 Drive                          3                     1
## Hornet Sportabout                       3                     2
## Valiant                                 3                     1
##                   litres per 100km
## Mazda RX4                     11.2
## Mazda RX4 Wag                 11.2
## Datsun 710                    10.3
## Hornet 4 Drive                11.0
## Hornet Sportabout             12.6
## Valiant                       13.0
colnames(mtcars)[1] <- "miles per US gallon"

head(mtcars)
##                   miles per US gallon number of cylinders
## Mazda RX4                        21.0                   6
## Mazda RX4 Wag                    21.0                   6
## Datsun 710                       22.8                   4
## Hornet 4 Drive                   21.4                   6
## Hornet Sportabout                18.7                   8
## Valiant                          18.1                   6
##                   displacement in cubic inches gross horsepower rear axle ratio
## Mazda RX4                                  160              110            3.90
## Mazda RX4 Wag                              160              110            3.90
## Datsun 710                                 108               93            3.85
## Hornet 4 Drive                             258              110            3.08
## Hornet Sportabout                          360              175            3.15
## Valiant                                    225              105            2.76
##                   weight (pounds/1000) quarter mile time in seconds
## Mazda RX4                        2.620                        16.46
## Mazda RX4 Wag                    2.875                        17.02
## Datsun 710                       2.320                        18.61
## Hornet 4 Drive                   3.215                        19.44
## Hornet Sportabout                3.440                        17.02
## Valiant                          3.460                        20.22
##                   V or straight cylinder configuration
## Mazda RX4                                            0
## Mazda RX4 Wag                                        0
## Datsun 710                                           1
## Hornet 4 Drive                                       1
## Hornet Sportabout                                    0
## Valiant                                              1
##                   transmission type: auto (0) or manual (1)
## Mazda RX4                                                 1
## Mazda RX4 Wag                                             1
## Datsun 710                                                1
## Hornet 4 Drive                                            0
## Hornet Sportabout                                         0
## Valiant                                                   0
##                   number of forward gears number of carburetors
## Mazda RX4                               4                     4
## Mazda RX4 Wag                           4                     4
## Datsun 710                              4                     1
## Hornet 4 Drive                          3                     1
## Hornet Sportabout                       3                     2
## Valiant                                 3                     1
##                   litres per 100km
## Mazda RX4                     11.2
## Mazda RX4 Wag                 11.2
## Datsun 710                    10.3
## Hornet 4 Drive                11.0
## Hornet Sportabout             12.6
## Valiant                       13.0

If a column or row name contains whitespace, it can cause problems later with subsetting. In that case, the name needs to be wrapped in backticks, like this.

economical_cars <- subset(mtcars,`litres per 100km` < 10)

economical_cars
##                miles per US gallon number of cylinders
## Merc 240D                     24.4                   4
## Fiat 128                      32.4                   4
## Honda Civic                   30.4                   4
## Toyota Corolla                33.9                   4
## Fiat X1-9                     27.3                   4
## Porsche 914-2                 26.0                   4
## Lotus Europa                  30.4                   4
##                displacement in cubic inches gross horsepower rear axle ratio
## Merc 240D                             146.7               62            3.69
## Fiat 128                               78.7               66            4.08
## Honda Civic                            75.7               52            4.93
## Toyota Corolla                         71.1               65            4.22
## Fiat X1-9                              79.0               66            4.08
## Porsche 914-2                         120.3               91            4.43
## Lotus Europa                           95.1              113            3.77
##                weight (pounds/1000) quarter mile time in seconds
## Merc 240D                     3.190                        20.00
## Fiat 128                      2.200                        19.47
## Honda Civic                   1.615                        18.52
## Toyota Corolla                1.835                        19.90
## Fiat X1-9                     1.935                        18.90
## Porsche 914-2                 2.140                        16.70
## Lotus Europa                  1.513                        16.90
##                V or straight cylinder configuration
## Merc 240D                                         1
## Fiat 128                                          1
## Honda Civic                                       1
## Toyota Corolla                                    1
## Fiat X1-9                                         1
## Porsche 914-2                                     0
## Lotus Europa                                      1
##                transmission type: auto (0) or manual (1)
## Merc 240D                                              0
## Fiat 128                                               1
## Honda Civic                                            1
## Toyota Corolla                                         1
## Fiat X1-9                                              1
## Porsche 914-2                                          1
## Lotus Europa                                           1
##                number of forward gears number of carburetors litres per 100km
## Merc 240D                            4                     2             9.64
## Fiat 128                             4                     1             7.26
## Honda Civic                          4                     2             7.74
## Toyota Corolla                       4                     1             6.94
## Fiat X1-9                            4                     1             8.62
## Porsche 914-2                        5                     2             9.05
## Lotus Europa                         5                     2             7.74
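
Backticks are only needed where the name is written as bare code, such as inside subset() or after $. The sketch below, on a fresh copy of the dataset, shows two alternatives: passing the name as a quoted string, or converting the names to syntactic ones with make.names(). The names cars and safe are made up for illustration.

```r
# fresh copy of mtcars with one column name containing spaces
cars <- datasets::mtcars
cars$`litres per 100km` <- signif(235.215 / cars$mpg, 3)

# backticks are required with $
cars$`litres per 100km`[1:3]       # 11.2 11.2 10.3

# a quoted string inside brackets needs no backticks
cars[1:3, "litres per 100km"]      # 11.2 11.2 10.3

# make.names() replaces the spaces with dots
safe <- cars
colnames(safe) <- make.names(colnames(safe))
colnames(safe)[ncol(safe)]         # "litres.per.100km"
```

Descriptive names with spaces are convenient for final tables, while short syntactic names are usually easier to work with during analysis.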

It is also useful to be able to subset a data frame based on the row names. Let’s get all the Mercedes models. To do this, we need to introduce the grep() function, which finds the elements of a character vector that match a pattern.

# let's look again at the car names
rownames(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
# let's find all the ones with "Merc" in the name
grep("Merc",rownames(mtcars))
## [1]  8  9 10 11 12 13 14
# now let's extract all those rows
mercs <- mtcars[grep("Merc",rownames(mtcars)),]

mercs
##             miles per US gallon number of cylinders
## Merc 240D                  24.4                   4
## Merc 230                   22.8                   4
## Merc 280                   19.2                   6
## Merc 280C                  17.8                   6
## Merc 450SE                 16.4                   8
## Merc 450SL                 17.3                   8
## Merc 450SLC                15.2                   8
##             displacement in cubic inches gross horsepower rear axle ratio
## Merc 240D                          146.7               62            3.69
## Merc 230                           140.8               95            3.92
## Merc 280                           167.6              123            3.92
## Merc 280C                          167.6              123            3.92
## Merc 450SE                         275.8              180            3.07
## Merc 450SL                         275.8              180            3.07
## Merc 450SLC                        275.8              180            3.07
##             weight (pounds/1000) quarter mile time in seconds
## Merc 240D                   3.19                         20.0
## Merc 230                    3.15                         22.9
## Merc 280                    3.44                         18.3
## Merc 280C                   3.44                         18.9
## Merc 450SE                  4.07                         17.4
## Merc 450SL                  3.73                         17.6
## Merc 450SLC                 3.78                         18.0
##             V or straight cylinder configuration
## Merc 240D                                      1
## Merc 230                                       1
## Merc 280                                       1
## Merc 280C                                      1
## Merc 450SE                                     0
## Merc 450SL                                     0
## Merc 450SLC                                    0
##             transmission type: auto (0) or manual (1) number of forward gears
## Merc 240D                                           0                       4
## Merc 230                                            0                       4
## Merc 280                                            0                       4
## Merc 280C                                           0                       4
## Merc 450SE                                          0                       3
## Merc 450SL                                          0                       3
## Merc 450SLC                                         0                       3
##             number of carburetors litres per 100km
## Merc 240D                       2             9.64
## Merc 230                        2            10.30
## Merc 280                        4            12.30
## Merc 280C                       4            13.20
## Merc 450SE                      3            14.30
## Merc 450SL                      3            13.60
## Merc 450SLC                     3            15.50
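
grep() has a few close relatives that are handy for this kind of filtering. The following is a small sketch using the original car names; the variable names are made up for illustration.

```r
car_names <- rownames(datasets::mtcars)

# positions of the matching names (the default)
merc_idx <- grep("Merc", car_names)
merc_idx   # 8 9 10 11 12 13 14

# value=TRUE returns the matching names themselves
merc_names <- grep("Merc", car_names, value = TRUE)
merc_names

# grepl() returns TRUE or FALSE for every name
merc_flag <- grepl("Merc", car_names)
sum(merc_flag)   # 7

# patterns are regular expressions, so ^ anchors the match to the start
toyotas <- grep("^Toyota", car_names, value = TRUE)
toyotas    # "Toyota Corolla" "Toyota Corona"
```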

Sorting data tables

We are going to sort our subset of economical cars by their speed, based on their quarter mile time. To do this, we need to use the order() function together with the square brackets. order() only returns the positions of the values in sorted order; it doesn’t do the sorting itself. Note that the default behaviour of order() is to put the smallest values first. That can be reversed by putting a - before the vector being ordered.
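
Before applying this to the data frame, here is a minimal sketch of what order() returns for a short vector. The vector qsec below simply holds four of the quarter mile times from the table.

```r
# four quarter mile times in seconds
qsec <- c(20.00, 19.47, 16.70, 18.52)

# positions of the values from smallest to largest
asc <- order(qsec)
asc          # 3 4 2 1

# using those positions in square brackets does the actual sorting
qsec[asc]    # 16.70 18.52 19.47 20.00

# a minus sign reverses the order, largest first
desc <- order(-qsec)
desc         # 1 2 4 3
qsec[desc]   # 20.00 19.47 18.52 16.70
```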

economical_cars
##                miles per US gallon number of cylinders
## Merc 240D                     24.4                   4
## Fiat 128                      32.4                   4
## Honda Civic                   30.4                   4
## Toyota Corolla                33.9                   4
## Fiat X1-9                     27.3                   4
## Porsche 914-2                 26.0                   4
## Lotus Europa                  30.4                   4
##                displacement in cubic inches gross horsepower rear axle ratio
## Merc 240D                             146.7               62            3.69
## Fiat 128                               78.7               66            4.08
## Honda Civic                            75.7               52            4.93
## Toyota Corolla                         71.1               65            4.22
## Fiat X1-9                              79.0               66            4.08
## Porsche 914-2                         120.3               91            4.43
## Lotus Europa                           95.1              113            3.77
##                weight (pounds/1000) quarter mile time in seconds
## Merc 240D                     3.190                        20.00
## Fiat 128                      2.200                        19.47
## Honda Civic                   1.615                        18.52
## Toyota Corolla                1.835                        19.90
## Fiat X1-9                     1.935                        18.90
## Porsche 914-2                 2.140                        16.70
## Lotus Europa                  1.513                        16.90
##                V or straight cylinder configuration
## Merc 240D                                         1
## Fiat 128                                          1
## Honda Civic                                       1
## Toyota Corolla                                    1
## Fiat X1-9                                         1
## Porsche 914-2                                     0
## Lotus Europa                                      1
##                transmission type: auto (0) or manual (1)
## Merc 240D                                              0
## Fiat 128                                               1
## Honda Civic                                            1
## Toyota Corolla                                         1
## Fiat X1-9                                              1
## Porsche 914-2                                          1
## Lotus Europa                                           1
##                number of forward gears number of carburetors litres per 100km
## Merc 240D                            4                     2             9.64
## Fiat 128                             4                     1             7.26
## Honda Civic                          4                     2             7.74
## Toyota Corolla                       4                     1             6.94
## Fiat X1-9                            4                     1             8.62
## Porsche 914-2                        5                     2             9.05
## Lotus Europa                         5                     2             7.74
order(economical_cars$`quarter mile time in seconds`)
## [1] 6 7 3 5 2 4 1
sorted <- economical_cars[order(economical_cars$`quarter mile time in seconds`),]

sorted[,c(7,ncol(sorted))]
##                quarter mile time in seconds litres per 100km
## Porsche 914-2                         16.70             9.05
## Lotus Europa                          16.90             7.74
## Honda Civic                           18.52             7.74
## Fiat X1-9                             18.90             8.62
## Fiat 128                              19.47             7.26
## Toyota Corolla                        19.90             6.94
## Merc 240D                             20.00             9.64
reverse_sorted <- economical_cars[order(-economical_cars$`quarter mile time in seconds`),]

reverse_sorted[,c(7,ncol(reverse_sorted))]
##                quarter mile time in seconds litres per 100km
## Merc 240D                             20.00             9.64
## Toyota Corolla                        19.90             6.94
## Fiat 128                              19.47             7.26
## Fiat X1-9                             18.90             8.62
## Honda Civic                           18.52             7.74
## Lotus Europa                          16.90             7.74
## Porsche 914-2                         16.70             9.05
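To make the behaviour of order() easier to see, here is a minimal sketch on a toy vector. The vector name qsec_demo is made up for illustration, and its values are three of the quarter mile times from the table above. The decreasing = TRUE argument is shown as an alternative to negating the vector.

```r
# three quarter mile times taken from the table above
qsec_demo <- c(20.00, 19.47, 18.52)

# order() returns positions, not sorted values
idx <- order(qsec_demo)
idx

# using the positions inside square brackets does the actual sorting
qsec_demo[idx]

# decreasing = TRUE reverses the order without negating the vector
idx_rev <- order(qsec_demo, decreasing = TRUE)
qsec_demo[idx_rev]
```

Negating with - only works for numeric vectors, whereas decreasing = TRUE also works for character vectors.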

Creating data frames and matrices

First we will create a data frame for some people who completed a survey about their height and weight. You should always run str() to check that the resulting data frame has the intended structure. In R versions before 4.0 you may need to include stringsAsFactors=FALSE to prevent character strings from being converted to factors; since R 4.0 this is the default, which is why both calls below give the same result.

pnames <- c("Jill", "Matt", "Sam", "Amy", "Bob", "Raj")

pnames
## [1] "Jill" "Matt" "Sam"  "Amy"  "Bob"  "Raj"
pgender <- as.factor(c("F", "M", "F", "F", "M", "M"))

pgender
## [1] F M F F M M
## Levels: F M
pheight <- c(164, 186, 170, 175, 178, 191)

pheight
## [1] 164 186 170 175 178 191
pweight <- c(54.1, 90.3, 64.8, 66.7, 80.4, 86.9)

pweight
## [1] 54.1 90.3 64.8 66.7 80.4 86.9
df <- data.frame(pnames,pgender,pheight,pweight)

str(df)
## 'data.frame':    6 obs. of  4 variables:
##  $ pnames : chr  "Jill" "Matt" "Sam" "Amy" ...
##  $ pgender: Factor w/ 2 levels "F","M": 1 2 1 1 2 2
##  $ pheight: num  164 186 170 175 178 191
##  $ pweight: num  54.1 90.3 64.8 66.7 80.4 86.9
df <- data.frame(pnames,pgender,pheight,pweight,stringsAsFactors = FALSE)

str(df)
## 'data.frame':    6 obs. of  4 variables:
##  $ pnames : chr  "Jill" "Matt" "Sam" "Amy" ...
##  $ pgender: Factor w/ 2 levels "F","M": 1 2 1 1 2 2
##  $ pheight: num  164 186 170 175 178 191
##  $ pweight: num  54.1 90.3 64.8 66.7 80.4 86.9
df
##   pnames pgender pheight pweight
## 1   Jill       F     164    54.1
## 2   Matt       M     186    90.3
## 3    Sam       F     170    64.8
## 4    Amy       F     175    66.7
## 5    Bob       M     178    80.4
## 6    Raj       M     191    86.9

Now we might want to make the row names the name of the person. This makes the data tidier, but it won’t work if there is more than one entry with the same name. You can assign NULL to a column to delete it. Rows can be deleted with square brackets and a negative index.

rownames(df) <- df$pnames

df
##      pnames pgender pheight pweight
## Jill   Jill       F     164    54.1
## Matt   Matt       M     186    90.3
## Sam     Sam       F     170    64.8
## Amy     Amy       F     175    66.7
## Bob     Bob       M     178    80.4
## Raj     Raj       M     191    86.9
df$pnames=NULL

df
##      pgender pheight pweight
## Jill       F     164    54.1
## Matt       M     186    90.3
## Sam        F     170    64.8
## Amy        F     175    66.7
## Bob        M     178    80.4
## Raj        M     191    86.9
# delete rows 2 and 4
df <- df[-c(2,4),]
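Because row names must be unique, it can help to check for repeated names before assigning them. The following is a small sketch using a made-up data frame called people (not part of the survey data above); anyDuplicated() and make.unique() are base R functions.

```r
# hypothetical example data with a repeated name
people <- data.frame(pname = c("Jill", "Matt", "Jill"),
                     pheight = c(164, 186, 170))

# returns 0 if all values are unique, otherwise the position of the first repeat
dup_pos <- anyDuplicated(people$pname)
dup_pos

# make.unique() adds a numeric suffix to repeated values so they can be used as row names
rownames(people) <- make.unique(people$pname)
people$pname <- NULL
people
```

Whether adding a suffix is appropriate depends on the data; in a real survey it may be better to find out why a name appears twice.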

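Now we will convert df into a matrix. A matrix can hold only one data type, so while df still contains the pgender factor, every value is converted to a character string. Converting pgender to numeric first (F becomes 1 and M becomes 2) gives a numeric matrix.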

as.matrix(df)
##      pgender pheight pweight
## Jill "F"     "164"   "54.1" 
## Sam  "F"     "170"   "64.8" 
## Bob  "M"     "178"   "80.4" 
## Raj  "M"     "191"   "86.9"
df$pgender <- as.numeric(df$pgender)

mymat <- as.matrix(df)

mymat
##      pgender pheight pweight
## Jill       1     164    54.1
## Sam        1     170    64.8
## Bob        2     178    80.4
## Raj        2     191    86.9
str(mymat)
##  num [1:4, 1:3] 1 1 2 2 164 170 178 191 54.1 64.8 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4] "Jill" "Sam" "Bob" "Raj"
##   ..$ : chr [1:3] "pgender" "pheight" "pweight"
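The as.numeric() step above returns the internal level codes of the factor, not the labels. The sketch below, using made-up vectors gender_demo and dose_demo, shows why that matters when the labels themselves look like numbers.

```r
# level codes follow the sorted order of the levels: F = 1, M = 2
gender_demo <- factor(c("F", "M", "F"))
levels(gender_demo)
gender_codes <- as.numeric(gender_demo)
gender_codes

# pitfall: labels that look like numbers are NOT recovered by as.numeric()
dose_demo <- factor(c("10", "5", "10"))
dose_codes <- as.numeric(dose_demo)                 # level codes
dose_values <- as.numeric(as.character(dose_demo))  # the original numbers
dose_codes
dose_values
```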

We can also convert other types of data into a matrix. To demonstrate this, I’ll create some random data with rnorm and convert it into a matrix.

mydata <- rnorm(n = 100, mean = 10, sd = 20)

mymatrix <- matrix(data = mydata, nrow = 20, ncol = 5)

mymatrix
##             [,1]       [,2]       [,3]        [,4]       [,5]
##  [1,]  -5.362961  16.813039 -19.886142  15.6748462  19.756765
##  [2,]  22.459162  25.874603  -5.080414  22.8898737   0.838587
##  [3,]  11.705377  10.508057 -15.861401  15.0272497  23.816758
##  [4,]  27.744727  12.607702   9.414938  -5.4591222  21.627774
##  [5,]  16.224883 -10.964592  16.466131  20.8764582  16.610439
##  [6,]   1.181048  23.203298   9.708798  30.9594701  12.142181
##  [7,]  -6.258320 -23.269052  44.366705 -13.1396102 -12.784759
##  [8,]   9.891521  23.368556  -5.800771   6.8270441  -6.385512
##  [9,]   6.776381   6.288497  12.656693   9.8018106   8.477256
## [10,]  27.832346 -11.458067  -3.920119  13.1407818   7.077689
## [11,]  26.053923  19.624235  18.224092  13.1742742  30.674810
## [12,] -18.761566  14.498805   6.885688  30.4425231  -4.005946
## [13,]  21.210878  18.098923 -15.046685  25.9500029  -6.533689
## [14,]  31.375715  32.239798   1.278403  -0.2864208  25.013434
## [15,]  17.098985 -17.500661 -16.159211 -34.9421872  24.783650
## [16,]  37.271382  23.659692  26.064014  20.2886377  19.749735
## [17,]  51.750068  10.656921  48.482437  12.6853048  39.457832
## [18,]  19.495228  48.107814  25.528925  -8.8210692   2.254276
## [19,]  31.726173 -13.789659  19.175284  42.7355994  -3.262027
## [20,]   6.184396   9.908607  -1.698200   1.1724725  -4.662633
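Since rnorm() draws new random numbers every time, your values will differ from those printed above. A minimal sketch of how to make the result repeatable with set.seed() is shown here; the seed value 42 and the names demo_data and demo_mat are arbitrary choices for illustration. It also confirms that matrix() fills the matrix column by column.

```r
# fixing the seed makes the random numbers repeatable
set.seed(42)
demo_data <- rnorm(n = 100, mean = 10, sd = 20)

demo_mat <- matrix(data = demo_data, nrow = 20, ncol = 5)
dim(demo_mat)

# matrix() fills column by column, so column 1 holds the first 20 values
identical(demo_mat[, 1], demo_data[1:20])

# running with the same seed again gives the same numbers
set.seed(42)
demo_data2 <- rnorm(n = 100, mean = 10, sd = 20)
identical(demo_data, demo_data2)
```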

Creating charts from data frame and matrix objects

Making charts from data frames and matrix objects is really similar to what we did above. Here are some examples.

# histogram of cylinders
data(mtcars)
hist(mtcars$cyl,xlab="number of cylinders", main="number of cylinders in mtcars data")

# boxplot of qsec values for Toyota and Mercedes
toyota <- mtcars[grep("Toyota",rownames(mtcars)),]
merc <- mtcars[grep("Merc",rownames(mtcars)),]
boxplot(toyota$qsec, merc$qsec,
        ylab="quarter mile time in seconds",
        main="mtcars",
        names = c("Toyota","Mercedes"))

# scatterplot of petal length vs sepal length for setosa irises
# include a trend line using the lm function
setosa <- subset(iris,Species=="setosa")
head(setosa)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
mylm <- lm(setosa$Sepal.Length ~ setosa$Petal.Length)
plot(setosa$Petal.Length,setosa$Sepal.Length,
     xlab="Petal length (cm)",
     ylab="Sepal length (cm)",
     main="Setosa petal and sepal length",
     pch=19)
abline(mylm,col="red",lwd=2,lty=2)

# a pairs plot is a special type of scatterplot
pairs(mymatrix)

# let's make a line diagram of freeny revenues
# first we need to normalise each column to its initial value
head(freeny.x)
##      lag quarterly revenue price index income level market potential
## [1,]               8.79636     4.70997      5.82110          12.9699
## [2,]               8.79236     4.70217      5.82558          12.9733
## [3,]               8.79137     4.68944      5.83112          12.9774
## [4,]               8.81486     4.68558      5.84046          12.9806
## [5,]               8.81301     4.64019      5.85036          12.9831
## [6,]               8.90751     4.62553      5.86464          12.9854
freeny_norm <- t(t(freeny.x) / freeny.x[1,])
head(freeny_norm)
##      lag quarterly revenue price index income level market potential
## [1,]             1.0000000   1.0000000     1.000000         1.000000
## [2,]             0.9995453   0.9983439     1.000770         1.000262
## [3,]             0.9994327   0.9956412     1.001721         1.000578
## [4,]             1.0021031   0.9948216     1.003326         1.000825
## [5,]             1.0018928   0.9851846     1.005027         1.001018
## [6,]             1.0126359   0.9820721     1.007480         1.001195
plot(freeny_norm[,1], type="b", ylim=c(0.9,1.12), col="blue",
     xlab="Quarters beginning 1962 Q2",
     ylab="Change in values over time")
lines(freeny_norm[,2], type="b", col="red")
lines(freeny_norm[,3], type="b", col="black")
lines(freeny_norm[,4], type="b", col="darkgreen")
legend("topleft", legend = colnames(freeny.x), lty=1, col = c("blue","red","black","darkgreen"))
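The double transpose used to normalise freeny.x works because R recycles a vector down the columns of a matrix: transposing first lines each original column up with its own divisor. As a sketch of an alternative, the base R function sweep() expresses the same idea more directly. The names norm_t and norm_sweep are introduced here only for comparison.

```r
# approach used above: transpose, divide by the first row, transpose back
norm_t <- t(t(freeny.x) / freeny.x[1, ])

# alternative: divide each column (MARGIN = 2) by its first-row value
norm_sweep <- sweep(freeny.x, MARGIN = 2, STATS = freeny.x[1, ], FUN = "/")

# largest difference between the two results
max(abs(norm_t - norm_sweep))

# after normalisation the first row is all ones
norm_sweep[1, ]
```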

In R, sometimes we need to load a particular package in order to make a special type of chart. In the example below we are making a heatmap, where the colour indicates the numerical value.

library("gplots")
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(mymatrix,trace="none",scale="none",main="heatmap")

heatmap.2(cor(mymatrix),trace="none",scale="none", main="correlation")

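This is also a good time to let you know that you can save R charts as files. Several file formats are available, but PNG and PDF are the most commonly used. Open a graphics device with pdf() or png(), draw the plot, then close the device with dev.off() to write the file. You will see these new files appear in your files menu.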

pdf("myplot.pdf")
plot(1:10)
dev.off()
## png 
##   2
png("myplot.png")
plot(1:10)
dev.off()
## png 
##   2
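The size of the saved chart can be controlled when the device is opened. The sketch below writes a wide PDF to a temporary folder; the file name myplot_wide.pdf is a made-up example. For pdf() the width and height are given in inches, whereas png() uses pixels by default.

```r
# hypothetical output file in a temporary folder
out_file <- file.path(tempdir(), "myplot_wide.pdf")

# open the device: width and height are in inches for pdf()
pdf(out_file, width = 8, height = 4)
plot(1:10, main = "A wide plot")

# closing the device writes the file; invisible() hides the printed device number
invisible(dev.off())

# confirm that the file was created
file.exists(out_file)
```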

Part 2: Homework questions for group study

  1. Create a scatterplot of mtcars weight (x axis) versus mpg (y axis). Include x and y axis labels and a main heading.

  2. Sort mtcars by weight (wt) and create a horizontal barplot of wt values so that the heaviest cars are shown at the top of the bar plot. The plot should be labelled so it is clear which bar belongs to which car. Include an axis label and main title.

  3. Create a box plot of iris petal lengths. Each species should be a different category. Chart needs axis labels and a main title.

Intro to R part 3

Dealing with real data

So far we have been working with datasets that are built into R, but in the real world you will be working with data files. These could be in different formats like text (.txt), comma separated values (.csv), tab separated values (.tsv) and perhaps Excel files too (.xls and .xlsx).

We will be working with a TSV file of pipetting measurements. URL: https://raw.githubusercontent.com/markziemann/SLE712_files/master/pipette_test.tsv

We will use the read.table() command and show you some really important options.

URL="https://raw.githubusercontent.com/markziemann/SLE712_files/master/pipette_test.tsv"

download.file(URL,"my.tsv")
# look closely at the structure of pip
# can you see what is wrong?
pip <- read.table("my.tsv")
pip
##         V1     V2     V3     V4
## 1 RefValue R100uL R150uL R200uL
## 2       M1  99.62 149.56 200.16
## 3       M2  98.48 147.06 199.88
## 4       M3 100.26 151.34 199.92
## 5       M4 101.12 150.12 200.62
## 6       M5  99.89 149.94 201.37
str(pip)
## 'data.frame':    6 obs. of  4 variables:
##  $ V1: chr  "RefValue" "M1" "M2" "M3" ...
##  $ V2: chr  "R100uL" "99.62" "98.48" "100.26" ...
##  $ V3: chr  "R150uL" "149.56" "147.06" "151.34" ...
##  $ V4: chr  "R200uL" "200.16" "199.88" "199.92" ...
# try again
pip <- read.table(URL,stringsAsFactors = FALSE)
pip
##         V1     V2     V3     V4
## 1 RefValue R100uL R150uL R200uL
## 2       M1  99.62 149.56 200.16
## 3       M2  98.48 147.06 199.88
## 4       M3 100.26 151.34 199.92
## 5       M4 101.12 150.12 200.62
## 6       M5  99.89 149.94 201.37
str(pip)
## 'data.frame':    6 obs. of  4 variables:
##  $ V1: chr  "RefValue" "M1" "M2" "M3" ...
##  $ V2: chr  "R100uL" "99.62" "98.48" "100.26" ...
##  $ V3: chr  "R150uL" "149.56" "147.06" "151.34" ...
##  $ V4: chr  "R200uL" "200.16" "199.88" "199.92" ...
# try again
# looking better
pip <- read.table(URL,stringsAsFactors = FALSE, header=TRUE)
pip
##   RefValue R100uL R150uL R200uL
## 1       M1  99.62 149.56 200.16
## 2       M2  98.48 147.06 199.88
## 3       M3 100.26 151.34 199.92
## 4       M4 101.12 150.12 200.62
## 5       M5  99.89 149.94 201.37
str(pip)
## 'data.frame':    5 obs. of  4 variables:
##  $ RefValue: chr  "M1" "M2" "M3" "M4" ...
##  $ R100uL  : num  99.6 98.5 100.3 101.1 99.9
##  $ R150uL  : num  150 147 151 150 150
##  $ R200uL  : num  200 200 200 201 201
# got it now
pip <- read.table(URL,stringsAsFactors = FALSE, header=TRUE, row.names=1)
pip
##    R100uL R150uL R200uL
## M1  99.62 149.56 200.16
## M2  98.48 147.06 199.88
## M3 100.26 151.34 199.92
## M4 101.12 150.12 200.62
## M5  99.89 149.94 201.37
str(pip)
## 'data.frame':    5 obs. of  3 variables:
##  $ R100uL: num  99.6 98.5 100.3 101.1 99.9
##  $ R150uL: num  150 147 151 150 150
##  $ R200uL: num  200 200 200 201 201
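If you want to experiment with these options without an internet connection, the following sketch writes a small tab separated file to a temporary location and reads it back. The two data rows are copied from the output above; the names tsv_lines, tmp_tsv and pip_demo are made up for this example. Setting sep="\t" explicitly is optional here, but it protects against fields that contain spaces.

```r
# a tiny TSV file with a header line and two measurements
tsv_lines <- c("RefValue\tR100uL\tR150uL\tR200uL",
               "M1\t99.62\t149.56\t200.16",
               "M2\t98.48\t147.06\t199.88")
tmp_tsv <- tempfile(fileext = ".tsv")
writeLines(tsv_lines, tmp_tsv)

# header=TRUE uses the first line as column names
# row.names=1 uses the first column as row names
pip_demo <- read.table(tmp_tsv, header = TRUE, row.names = 1, sep = "\t")
str(pip_demo)

# with numeric columns we can now do calculations
colMeans(pip_demo)
```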

Now let’s try a csv file containing some travel records.

URL="https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"

trav <- read.table(URL,sep=",")
trav
##       V1   V2   V3   V4
## 1  Month 1958 1959 1960
## 2    JAN  340  360  417
## 3    FEB  318  342  391
## 4    MAR  362  406  419
## 5    APR  348  396  461
## 6    MAY  363  420  472
## 7    JUN  435  472  535
## 8    JUL  491  548  622
## 9    AUG  505  559  606
## 10   SEP  404  463  508
## 11   OCT  359  407  461
## 12   NOV  310  362  390
## 13   DEC  337  405  432
str(trav)
## 'data.frame':    13 obs. of  4 variables:
##  $ V1: chr  "Month" "JAN" "FEB" "MAR" ...
##  $ V2: int  1958 340 318 362 348 363 435 491 505 404 ...
##  $ V3: int  1959 360 342 406 396 420 472 548 559 463 ...
##  $ V4: int  1960 417 391 419 461 472 535 622 606 508 ...
trav <- read.table(URL,sep=",",header=TRUE)
trav
##    Month X1958 X1959 X1960
## 1    JAN   340   360   417
## 2    FEB   318   342   391
## 3    MAR   362   406   419
## 4    APR   348   396   461
## 5    MAY   363   420   472
## 6    JUN   435   472   535
## 7    JUL   491   548   622
## 8    AUG   505   559   606
## 9    SEP   404   463   508
## 10   OCT   359   407   461
## 11   NOV   310   362   390
## 12   DEC   337   405   432
str(trav)
## 'data.frame':    12 obs. of  4 variables:
##  $ Month: chr  "JAN" "FEB" "MAR" "APR" ...
##  $ X1958: int  340 318 362 348 363 435 491 505 404 359 ...
##  $ X1959: int  360 342 406 396 420 472 548 559 463 407 ...
##  $ X1960: int  417 391 419 461 472 535 622 606 508 461 ...
trav <- read.csv(URL)
trav
##    Month X1958 X1959 X1960
## 1    JAN   340   360   417
## 2    FEB   318   342   391
## 3    MAR   362   406   419
## 4    APR   348   396   461
## 5    MAY   363   420   472
## 6    JUN   435   472   535
## 7    JUL   491   548   622
## 8    AUG   505   559   606
## 9    SEP   404   463   508
## 10   OCT   359   407   461
## 11   NOV   310   362   390
## 12   DEC   337   405   432
str(trav)
## 'data.frame':    12 obs. of  4 variables:
##  $ Month: chr  "JAN" "FEB" "MAR" "APR" ...
##  $ X1958: int  340 318 362 348 363 435 491 505 404 359 ...
##  $ X1959: int  360 342 406 396 420 472 548 559 463 407 ...
##  $ X1960: int  417 391 419 461 472 535 622 606 508 461 ...
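Notice that the year columns were renamed to X1958, X1959 and X1960. By default R converts column names into syntactically valid variable names, and a name cannot start with a digit. The sketch below uses a small inline CSV (two rows copied from the output above, held in a made-up variable csv_text) to show how check.names=FALSE keeps the original names.

```r
# a small CSV supplied as text instead of a file
csv_text <- "Month,1958,1959\nJAN,340,360\nFEB,318,342"

# default behaviour: names are made syntactically valid
trav_default <- read.csv(text = csv_text)
colnames(trav_default)

# check.names=FALSE keeps the names exactly as they are in the file
trav_raw <- read.csv(text = csv_text, check.names = FALSE)
colnames(trav_raw)

# non-standard names need backticks when used with $
trav_raw$`1958`
```

Keeping the original names is convenient for display, but the X-prefixed names are easier to type in later analysis, so either choice can be reasonable.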

Now let’s try a Microsoft Excel file.

URL="https://github.com/markziemann/SLE712_files/blob/master/misc/file_example_XLS_10.xls?raw=true"
NAME="file_example_XLS_10.xls"
download.file(URL,destfile=NAME)
library("readxl")

mydata <- read_xls(NAME)
mydata
## # A tibble: 9 × 8
##     `0` `First Name` `Last Name` Gender Country         Age Date          Id
##   <dbl> <chr>        <chr>       <chr>  <chr>         <dbl> <chr>      <dbl>
## 1     1 Dulce        Abril       Female United States    32 15/10/2017  1562
## 2     2 Mara         Hashimoto   Female Great Britain    25 16/08/2016  1582
## 3     3 Philip       Gent        Male   France           36 21/05/2015  2587
## 4     4 Kathleen     Hanner      Female United States    25 15/10/2017  3549
## 5     5 Nereida      Magwood     Female United States    58 16/08/2016  2468
## 6     6 Gaston       Brumm       Male   United States    24 21/05/2015  2554
## 7     7 Etta         Hurn        Female Great Britain    56 15/10/2017  3598
## 8     8 Earlean      Melgar      Female United States    27 16/08/2016  2456
## 9     9 Vincenza     Weiland     Female United States    40 21/05/2015  6548
str(mydata)
## tibble [9 × 8] (S3: tbl_df/tbl/data.frame)
##  $ 0         : num [1:9] 1 2 3 4 5 6 7 8 9
##  $ First Name: chr [1:9] "Dulce" "Mara" "Philip" "Kathleen" ...
##  $ Last Name : chr [1:9] "Abril" "Hashimoto" "Gent" "Hanner" ...
##  $ Gender    : chr [1:9] "Female" "Female" "Male" "Female" ...
##  $ Country   : chr [1:9] "United States" "Great Britain" "France" "United States" ...
##  $ Age       : num [1:9] 32 25 36 25 58 24 56 27 40
##  $ Date      : chr [1:9] "15/10/2017" "16/08/2016" "21/05/2015" "15/10/2017" ...
##  $ Id        : num [1:9] 1562 1582 2587 3549 2468 ...
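In the output above the Date column was read as character strings (chr), so R cannot yet sort or subtract these dates. A minimal sketch of converting such strings with the base R function as.Date() follows; the vector dates_chr holds three values copied from the table, and the format string assumes the day/month/year layout seen there.

```r
# date strings in day/month/year layout, copied from the table above
dates_chr <- c("15/10/2017", "16/08/2016", "21/05/2015")

# tell R how the text is laid out: %d = day, %m = month, %Y = four digit year
dates <- as.Date(dates_chr, format = "%d/%m/%Y")
class(dates)

# Date objects can be sorted and subtracted
sort(dates)
day_gap <- as.numeric(dates[1] - dates[2])
day_gap
```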

Save and load session and single datasets

When working in R, it is convenient to save the session with save.image(). This results in an Rdata file which contains all the data objects in your current environment. To test that it’s actually working, clear your environment with the sweep/broom icon and then load the Rdata file with load().

save.image("mysession.Rdata")

rm(list=ls())

load("mysession.Rdata")

That is really cool, but sometimes we want to save an individual object, such as a large data frame, as an Rds file with saveRDS() and read it back with readRDS().

saveRDS(object = mymatrix , file = "mymatrix.Rds")

rm(list=ls())

x <- readRDS("mymatrix.Rds")

head(x)
##           [,1]      [,2]       [,3]      [,4]      [,5]
## [1,] -5.362961  16.81304 -19.886142 15.674846 19.756765
## [2,] 22.459162  25.87460  -5.080414 22.889874  0.838587
## [3,] 11.705377  10.50806 -15.861401 15.027250 23.816758
## [4,] 27.744727  12.60770   9.414938 -5.459122 21.627774
## [5,] 16.224883 -10.96459  16.466131 20.876458 16.610439
## [6,]  1.181048  23.20330   9.708798 30.959470 12.142181
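The two approaches differ in how the object comes back. As a sketch using temporary files and a made-up matrix called small_mat: readRDS() returns the object so you can give it any name, while load() restores objects under the names they were saved with.

```r
small_mat <- matrix(1:6, nrow = 2)

# saveRDS/readRDS: one object per file, you choose the name when reading
rds_file <- tempfile(fileext = ".Rds")
saveRDS(small_mat, file = rds_file)
restored <- readRDS(rds_file)
identical(small_mat, restored)

# save/load: objects return under their original names
rdata_file <- tempfile(fileext = ".Rdata")
save(small_mat, file = rdata_file)
rm(small_mat)
loaded_names <- load(rdata_file)  # load() returns the names of the restored objects
loaded_names
exists("small_mat")
```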

Part 3: Check your skills

For the TSV file located here: https://raw.githubusercontent.com/markziemann/SLE712_files/master/misc/mydata.tsv

  1. Read it in and show the first 6 rows of data.

  2. Calculate the column and row means.

  3. Use the cor() command to find the correlation coefficients between the 3 data sets. Which two datasets are the most similar?

  4. Make a pairs plot of the three datasets.

  5. Save the script. Clear the environment and execute it to confirm that it is working.

Session information

For reproducibility.

sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readxl_1.4.3   gplots_3.1.3.1 beeswarm_0.4.0
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5        cli_3.6.2          knitr_1.45         rlang_1.1.3       
##  [5] xfun_0.42          highr_0.10         KernSmooth_2.23-22 jsonlite_1.8.8    
##  [9] gtools_3.9.5       glue_1.7.0         htmltools_0.5.7    sass_0.4.8        
## [13] fansi_1.0.6        rmarkdown_2.25     cellranger_1.1.0   evaluate_0.23     
## [17] jquerylib_0.1.4    caTools_1.18.2     tibble_3.2.1       bitops_1.0-7      
## [21] fastmap_1.1.1      yaml_2.3.8         lifecycle_1.0.4    compiler_4.3.3    
## [25] pkgconfig_2.0.3    digest_0.6.34      R6_2.5.1           utf8_1.2.4        
## [29] pillar_1.9.0       magrittr_2.0.3     bslib_0.6.1        tools_4.3.3       
## [33] cachem_1.0.8