Source: https://github.com/markziemann/bioinformatics_intro_workshop
A statistical computing and data visualisation language, derived from an earlier language called S.
Comments
Load libraries
Read data
Cleaning
Analysis
Data visualisation
Save files
Execute
This week, our goal is to learn:
how to work with R on the command line and in RStudio on the HPC
about different data structures in R
how to do basic math in R
If you are connected to the Burnet network, visit the HPC documentation here.
Create a new tmux session which will be persistent. Then make an interactive SLURM session where you can run some analysis. You can customise the threads, memory and time to your needs.
tmux new -s mysession
srun -p interactive --pty --time=180 --threads=2 --mem=8G bash -i
Then run an example command. Here we are grabbing some random data and compressing it with pigz, which is a parallel compression tool. This will use all available CPU threads.
head -1000000 /dev/random | pigz > /dev/null
If you want to run some scripts, you will likely need to load modules. Modules are a good way to maintain parallel versions of languages like R. Use the following commands to show the available modules.
module avail
module spider
Try loading R version 4.3.2.
module load R/4.3.2
Now run R and check the version and exit.
R
sessionInfo()
q()
Exit interactive sessions with exit.
Non interactive jobs can be scheduled using SLURM, using the HPC docs as a guide.
Rstudio offers some benefits due to its interactive and graphical nature.
sbatch --time=180 --threads=2 --mem=8G /software/jobs/rstudio.job
It will say “Submitted batch job 1234” (the number will be different), and will create a file in the current working directory called rstudio.job.1234. List the contents with ls to confirm that the file was created, then type cat rstudio.job.1234 to read the instructions to connect.
Let’s get to know the Rstudio interface a bit better. It consists of a menu bar at the top of the browser window and four panels below:
Menu Bar | |
---|---|
Top left: Script (type and save your scripts here) | Top right: Global Environment (R data objects you can work with) |
Bottom left: Console (commands are executed here) | Bottom right: Files, Plots and help pages |
If you cannot see the Script panel, click “New Script” on the menu bar and it should appear.
Let’s watch a video together to learn the basics of the Rstudio environment.
Here we will begin with simpler data structures and commands.
Open up your Rstudio and try these commands by typing in the script panel and hitting the “Run” button or Ctrl+Enter, observing the output.
We’ll start with some arithmetic.
1+2
## [1] 3
10000^2
## [1] 1e+08
sqrt(64)
## [1] 8
Notice that as you start typing commands like sqrt, Rstudio will give you suggestions. You can use the tab key to autocomplete.
Use the <- back arrow or = to define variables.
s <- sqrt(64)
s
## [1] 8
a <- 2
b <- 5
c <- a*b
c
## [1] 10
We can also do operations on vectors of numbers.
Vectors are specified with c-brackets like this c(1,2,3)
c(1,2,3)
## [1] 1 2 3
sum(c(1,10,100))
## [1] 111
mean(c(1,10,100))
## [1] 37
median(c(1,10,100))
## [1] 10
max(c(1,10,100))
## [1] 100
min(c(1,10,100))
## [1] 1
To make things easier to read and write, we can save the vector as a variable a. Then we can work on a.
We use the <- to save objects in R, but = also works.
length tells us the number of elements in the vector, which is very useful.
a <- c(1,10,100)
a
## [1] 1 10 100
2*a
## [1] 2 20 200
a+1
## [1] 2 11 101
sum(a)
## [1] 111
mean(a)
## [1] 37
median(a)
## [1] 10
sd(a)
## [1] 54.74486
var(a)
## [1] 2997
length(a)
## [1] 3
Whenever we use an R object we should check exactly what sort of object it is. Passing an object of the wrong type to a command is one of the most common sources of error you will encounter.
We use the str() command to check the structure. Other commands we can use to investigate the data structure include class() and typeof().
The colon : can be used to specify a series or range. For example 1:5 for numbers 1 to 5.
Note the different object types here:
A is a numerical vector
B is an integer vector (a series created with the colon)
C is a named numerical vector
D is a character string
E is a character vector, which becomes a named character vector once names are added
A <- c(1,10,100)
A
## [1] 1 10 100
str(A)
## num [1:3] 1 10 100
class(A)
## [1] "numeric"
typeof(A)
## [1] "double"
B <- 1:10
B
## [1] 1 2 3 4 5 6 7 8 9 10
str(B)
## int [1:10] 1 2 3 4 5 6 7 8 9 10
class(B)
## [1] "integer"
typeof(B)
## [1] "integer"
C <- c("prime1"=2, "prime2"=3, "prime3"=5, "prime4"=7)
C
## prime1 prime2 prime3 prime4
## 2 3 5 7
str(C)
## Named num [1:4] 2 3 5 7
## - attr(*, "names")= chr [1:4] "prime1" "prime2" "prime3" "prime4"
class(C)
## [1] "numeric"
typeof(C)
## [1] "double"
D <- "x1"
D
## [1] "x1"
str(D)
## chr "x1"
class(D)
## [1] "character"
typeof(D)
## [1] "character"
E <- c("x1", "y2", "z3")
E
## [1] "x1" "y2" "z3"
str(E)
## chr [1:3] "x1" "y2" "z3"
names(E) <- c("code1","code2","code3")
E
## code1 code2 code3
## "x1" "y2" "z3"
str(E)
## Named chr [1:3] "x1" "y2" "z3"
## - attr(*, "names")= chr [1:3] "code1" "code2" "code3"
class(E)
## [1] "character"
typeof(E)
## [1] "character"
Note how the names can be added during or after the vector is defined.
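The difference between the "numeric" and "integer" results above can be checked directly. The following is a minimal sketch; the variable names ints and dbls are only illustrative and do not appear elsewhere in this workshop.

```r
# Compare a vector built with ':' against one built with c().
ints <- 1:3          # the colon operator produces integers
dbls <- c(1, 2, 3)   # numbers typed directly are stored as doubles

typeof(ints)         # "integer"
typeof(dbls)         # "double"

# The values are equal, but the objects are not identical
# because their storage types differ.
all(ints == dbls)
identical(ints, dbls)

# Appending L to a number creates an integer explicitly.
typeof(5L)
```

For most arithmetic the distinction does not matter, but it can matter when comparing objects with identical().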
Factors are an entirely different type of data in R. They are represented as non numeric categories, but are stored internally as numerical data. Typically factors are used in R for categorical data like biological sex. Here the example is an unordered category.
x <- factor(c("single", "married", "married", "single","defacto","widowed"))
x
## [1] single married married single defacto widowed
## Levels: defacto married single widowed
str(x)
## Factor w/ 4 levels "defacto","married",..: 3 2 2 3 1 4
levels(x)
## [1] "defacto" "married" "single" "widowed"
But it is possible to have ordered factors as well.
Take this survey as an example, where respondents rate the food at a restaurant on a scale from very poor to very good. These responses have a natural order, so it makes sense to treat these as ordered factors.
y <- c("good", "very good", "fair", "poor","good")
str(y)
## chr [1:5] "good" "very good" "fair" "poor" "good"
yy <- factor(y, levels = c("very poor","poor","fair","good","very good"),ordered = TRUE)
yy
## [1] good very good fair poor good
## Levels: very poor < poor < fair < good < very good
str(yy)
## Ord.factor w/ 5 levels "very poor"<"poor"<..: 4 5 3 2 4
levels(yy)
## [1] "very poor" "poor" "fair" "good" "very good"
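One reason to declare the order is that R can then count and compare the categories sensibly. The sketch below rebuilds the survey factor under the illustrative name ratings (not used elsewhere in this workshop) and shows table() and a comparison against a level.

```r
# Rebuild the ordered factor from the survey example.
ratings <- factor(c("good", "very good", "fair", "poor", "good"),
                  levels = c("very poor", "poor", "fair", "good", "very good"),
                  ordered = TRUE)

# Count responses per level; levels with no responses are reported as 0.
counts <- table(ratings)
counts

# Ordered factors support comparisons against a level.
at_least_good <- ratings >= "good"
at_least_good        # TRUE TRUE FALSE FALSE TRUE

# max() respects the level order rather than alphabetical order.
best <- max(ratings)
best
```

The same comparison on an unordered factor would give NA with a warning, because R has no order to compare against.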
R also uses TRUE/FALSE a lot, as a data type as well as an option when executing commands.
It is also possible to have vectors of logical values.
myvariable1 <- 0
as.logical(myvariable1)
## [1] FALSE
myvariable2 <- 1
as.logical(myvariable2)
## [1] TRUE
vals <- c(0,1,0,1,0,0,0,1,1,0,1,0,1,0)
vals
## [1] 0 1 0 1 0 0 0 1 1 0 1 0 1 0
str(vals)
## num [1:14] 0 1 0 1 0 0 0 1 1 0 ...
as.logical(vals)
## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE
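Logical vectors are handy because TRUE counts as 1 and FALSE as 0 in arithmetic. The following sketch reuses the vals vector from above; the names flags, n_true and prop_true are only illustrative.

```r
vals <- c(0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0)
flags <- as.logical(vals)

# sum() counts the TRUE values, mean() gives the proportion of TRUE values.
n_true <- sum(flags)
prop_true <- mean(flags)
n_true               # 6

# Comparisons also produce logical vectors.
positive <- vals > 0
identical(positive, flags)

# any() and all() summarise a logical vector into a single value.
any(flags)
all(flags)
```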
Next, we would like to subset vectors. To do this, we use square brackets and inside the square bracket we indicate which values we want, using 1 for the first and 2 for second and so on.
See how it is possible to get the last and second last elements.
We can subset vectors, run arithmetic operations and save the results to a new variable in a single line.
a <- c(2:22,98,124,3002)
length(a)
## [1] 24
a[2]
## [1] 3
a[3:4]
## [1] 4 5
a[c(1,3)]
## [1] 2 4
a[length(a)]
## [1] 3002
a[(length(a)-1)]
## [1] 124
x <- a[10:(length(a)-1)] * 2
x
## [1] 22 24 26 28 30 32 34 36 38 40 42 44 196 248
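Square brackets accept more than positions. As a brief sketch (the result names large, no_first and last_two are illustrative), a logical condition keeps the elements where it is TRUE, and a negative index drops elements.

```r
a <- c(2:22, 98, 124, 3002)

# Keep only the elements that satisfy a condition.
large <- a[a > 20]
large                # 21 22 98 124 3002

# A negative index removes that position instead of selecting it.
no_first <- a[-1]
length(no_first)     # 23

# tail() is a shortcut for the last n elements.
last_two <- tail(a, 2)
last_two             # 124 3002
```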
As shown above with commands like factor and as.logical, it is possible to convert objects into different types.
Here are some further examples.
Note that some conversions don’t make sense, which can cause errors in your analysis.
a <- c(1.9,2.7,3.3,5.1,9.9,0)
a
## [1] 1.9 2.7 3.3 5.1 9.9 0.0
as.integer(a)
## [1] 1 2 3 5 9 0
as.character(a)
## [1] "1.9" "2.7" "3.3" "5.1" "9.9" "0"
as.logical(a)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE
as.factor(a)
## [1] 1.9 2.7 3.3 5.1 9.9 0
## Levels: 0 1.9 2.7 3.3 5.1 9.9
b <- c("abc","def","ghi","jkl")
b
## [1] "abc" "def" "ghi" "jkl"
as.numeric(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(b)
## [1] NA NA NA NA
as.integer(b)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
my_factor <- as.factor(b)
as.numeric(my_factor)
## [1] 1 2 3 4
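The last result shows that converting a factor to numeric returns the internal level codes. This becomes a trap when the factor labels are themselves numbers. The sketch below uses a made-up factor f to illustrate; converting through as.character() first recovers the original values.

```r
# A factor whose labels look like numbers.
f <- factor(c("5", "10", "20"))

# Levels are sorted as text, so "10" and "20" come before "5".
levels(f)

# as.numeric() returns the level codes, not the labels.
codes <- as.numeric(f)
codes                # 3 1 2

# Convert to character first to recover the actual numbers.
values <- as.numeric(as.character(f))
values               # 5 10 20
```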
Random sampling is used a lot in statistics and probability as well as simulation analysis.
nums <- 1:5
sample(x = nums, size = 3)
## [1] 3 4 5
sample(x = nums, size = 5)
## [1] 4 3 1 5 2
sample(x = nums ,size = 10, replace = TRUE)
## [1] 5 3 1 5 3 1 2 3 4 1
We can also sample from distributions. Here we are sampling 5 numbers from a normal distribution with a mean of 10 and standard deviation of 2.
d <- rnorm(n = 5, mean = 10, sd = 2)
d
## [1] 12.219169 7.641795 11.501564 10.305560 10.606350
Here we are sampling 20 numbers from a binomial distribution with a size of 50 and probability of 0.5
b <- rbinom(n = 20, size = 50, prob = 0.5)
b
## [1] 21 27 24 31 28 32 22 26 23 27 25 26 24 28 18 17 22 25 20 28
mean(b)
## [1] 24.7
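Because these commands draw random numbers, your output will differ from the values shown here and will change each time you run them. If you need repeatable results, one option is to call set.seed() before sampling. This is a small sketch; the seed value 42 is arbitrary.

```r
# Setting the same seed before sampling gives the same draw.
set.seed(42)
first_draw <- sample(x = 1:5, size = 3)

set.seed(42)
second_draw <- sample(x = 1:5, size = 3)

identical(first_draw, second_draw)   # TRUE

# The same applies to draws from distributions such as rnorm().
set.seed(42)
d1 <- rnorm(n = 5, mean = 10, sd = 2)
set.seed(42)
d2 <- rnorm(n = 5, mean = 10, sd = 2)
identical(d1, d2)                    # TRUE
```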
Creating basic plots in R isn’t very difficult. Here are some simple ones.
Dot plot and line plot.
Adding extra lines.
Changing line colour and adding a subheading
a <- (1:10)^2
a
## [1] 1 4 9 16 25 36 49 64 81 100
plot(a)
plot(a,type="l")
plot(a,type="b")
plot(a,type="b")
lines(a/2, type="b",col="red")
mtext("Black:Growth of A. Red: growth of A/2")
Now for scatterplots.
We can change the point type (pch) and size (cex), as well as add a main heading (main).
We can also add additional series of points to the chart and adjust the axis limits with xlim and ylim.
x_vals <- rnorm(n = 1000, mean = 10, sd = 2)
d_error <- rnorm(n = 1000, mean = 1, sd = 0.1)
y_vals <- x_vals * d_error
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values")
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19,
cex=0.5, main="Plot of X and Y values",
ylim=c(0,17))
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")
Let’s now run a linear regression on the relationship of y with x.
linear_regression_model <- lm(y_vals ~ x_vals)
summary(linear_regression_model)
##
## Call:
## lm(formula = y_vals ~ x_vals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5715 -0.6726 -0.0191 0.6932 4.2832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.007809 0.172762 -0.045 0.964
## x_vals 0.999330 0.016953 58.946 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 998 degrees of freedom
## Multiple R-squared: 0.7769, Adjusted R-squared: 0.7766
## F-statistic: 3475 on 1 and 998 DF, p-value: < 2.2e-16
# coefficients[1] is the intercept, coefficients[2] is the slope for x_vals
SLOPE <- linear_regression_model$coefficients[2]
INTERCEPT <- linear_regression_model$coefficients[1]
HEADER <- paste("Slope:",signif(SLOPE,4),"Intercept:",signif(INTERCEPT,4))
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5)
abline(linear_regression_model,col="red",lty=2,lwd=3)
mtext(HEADER)
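In a fitted lm object the first coefficient is the intercept and the second is the slope of the predictor. To avoid relying on position, the coefficients can be extracted by name with coef(). The sketch below uses simulated data with illustrative names (x_demo, y_demo, fit) and a true slope of 2 and intercept of 3.

```r
# Simulate data with a known slope and intercept.
set.seed(1)
x_demo <- rnorm(n = 100, mean = 10, sd = 2)
y_demo <- 3 + 2 * x_demo + rnorm(n = 100, mean = 0, sd = 0.5)

fit <- lm(y_demo ~ x_demo)

# coef() returns a named vector of the fitted coefficients.
cf <- coef(fit)
names(cf)            # "(Intercept)" "x_demo"

# Extract by name rather than by position.
intercept <- cf[["(Intercept)"]]
slope <- cf[["x_demo"]]
signif(c(slope = slope, intercept = intercept), 4)
```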
Barplots are also really useful.
The quantities in the vector need to have names.
names(a) <- 1:length(a)
barplot(a)
barplot(a,horiz = TRUE,las=1, xlab = "Measurements")
Boxplots are relatively easy to create.
boxplot(x_vals, y_vals/2)
boxplot(x_vals, y_vals/2, names=c("X values", "Y values"),ylab="Measurement (cm)")
Enhanced boxplot with beeswarm.
library("beeswarm")
mylist <- list("X vals"=x_vals, "Y vals"=y_vals/2)
boxplot(mylist, cex=0, ylab="Measurement (cm)",col="white",main="Main title") #cex=0 means no outliers shown
beeswarm(mylist,pch=1,add=TRUE,cex=0.5)
Histograms are easily made.
And it is possible to place multiple charts on a single image.
hist(x_vals)
par(mfrow = c(2, 1))
hist(x_vals,main="")
hist(y_vals,main="")
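The mfrow setting stays in effect for later plots until it is changed back. One common pattern, sketched here with simulated data under the illustrative name demo_vals, is to save the old settings returned by par() and restore them afterwards.

```r
set.seed(1)
demo_vals <- rnorm(n = 1000, mean = 10, sd = 2)

# par() returns the previous values of the settings it changes.
old_par <- par(mfrow = c(2, 1))
hist(demo_vals, main = "")
hist(demo_vals / 2, main = "")

# Restore the previous layout so later plots use a single panel again.
par(old_par)
par("mfrow")
```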
Calculate the sum of all integers between 500 and 600.
Calculate the sum of the square roots of all integers between 900 and 1000.
Create the following datasets and plot a boxplot:
A: Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 5
B: Sample 10000 datapoints from a normal distribution with mean of 50 and SD of 10
Plot A and B above as a scatterplot, and plot the trend line.
Plot A and B above as histograms on the same chart.
So far we have only worked with 1 dimensional data, so let’s get to know 2D tables.
The two most common types are data frames and matrices. We can use str() to distinguish them.
head(freeny.x)
## lag quarterly revenue price index income level market potential
## [1,] 8.79636 4.70997 5.82110 12.9699
## [2,] 8.79236 4.70217 5.82558 12.9733
## [3,] 8.79137 4.68944 5.83112 12.9774
## [4,] 8.81486 4.68558 5.84046 12.9806
## [5,] 8.81301 4.64019 5.85036 12.9831
## [6,] 8.90751 4.62553 5.86464 12.9854
str(freeny.x)
## num [1:39, 1:4] 8.8 8.79 8.79 8.81 8.81 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:4] "lag quarterly revenue" "price index" "income level" "market potential"
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
As an example of commands that can be used for matrices and not data frames, try mean() for freeny.x and mtcars.
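As a quick sketch of what to expect: mean() on the matrix returns one number, while mean() on the data frame returns NA with a warning, because a data frame is a list of columns rather than a single numeric object. The result names below are illustrative.

```r
# Works: a numeric matrix is treated as one long numeric vector.
matrix_mean <- mean(freeny.x)
matrix_mean

# Returns NA with a warning for a data frame (warning suppressed here).
df_mean <- suppressWarnings(mean(mtcars))
df_mean

# Alternatives for a data frame: per-column means, or convert to a matrix first.
col_means <- colMeans(mtcars)
overall_mean <- mean(as.matrix(mtcars))
```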
Find out the number of rows, columns, and simple operations on those rows and columns.
# number of columns
ncol(mtcars)
## [1] 11
# number of rows
nrow(mtcars)
## [1] 32
# dimensions
dim(mtcars)
## [1] 32 11
# analysing rows and columns
colMeans(freeny.x)
## lag quarterly revenue price index income level
## 9.280718 4.496182 6.038596
## market potential
## 13.066831
colSums(freeny.x)
## lag quarterly revenue price index income level
## 361.9480 175.3511 235.5052
## market potential
## 509.6064
rowMeans(freeny.x)
## [1] 8.074333 8.073353 8.072332 8.080375 8.071665 8.095770 8.106082 8.117520
## [9] 8.124863 8.140490 8.149078 8.158940 8.158470 8.180965 8.192552 8.205092
## [17] 8.209112 8.223212 8.230868 8.231193 8.240507 8.250258 8.249220 8.260175
## [25] 8.267057 8.269210 8.282000 8.293537 8.297823 8.310002 8.315660 8.317607
## [33] 8.317818 8.330682 8.330505 8.333490 8.339897 8.345975 8.354988
rowSums(freeny.x)
## [1] 32.29733 32.29341 32.28933 32.32150 32.28666 32.38308 32.42433 32.47008
## [9] 32.49945 32.56196 32.59631 32.63576 32.63388 32.72386 32.77021 32.82037
## [17] 32.83645 32.89285 32.92347 32.92477 32.96203 33.00103 32.99688 33.04070
## [25] 33.06823 33.07684 33.12800 33.17415 33.19129 33.24001 33.26264 33.27043
## [33] 33.27127 33.32273 33.32202 33.33396 33.35959 33.38390 33.41995
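colMeans() and rowSums() cover the most common cases. For other summaries, base R's apply() runs any function over rows (margin 1) or columns (margin 2). A short sketch with illustrative result names:

```r
# Median of each column (margin 2 = columns).
col_medians <- apply(freeny.x, 2, median)
col_medians

# Maximum of each row (margin 1 = rows).
row_max <- apply(freeny.x, 1, max)
length(row_max)      # 39, one value per row

# apply() with mean gives the same result as colMeans().
all.equal(apply(freeny.x, 2, mean), colMeans(freeny.x))
```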
You can also transpose a matrix or data frame. But be careful, transposing a data frame will automatically convert it to a matrix which could cause downstream errors.
freeny_flip <- t(freeny.x)
head(freeny_flip)
## [,1] [,2] [,3] [,4] [,5] [,6]
## lag quarterly revenue 8.79636 8.79236 8.79137 8.81486 8.81301 8.90751
## price index 4.70997 4.70217 4.68944 4.68558 4.64019 4.62553
## income level 5.82110 5.82558 5.83112 5.84046 5.85036 5.86464
## market potential 12.96990 12.97330 12.97740 12.98060 12.98310 12.98540
## [,7] [,8] [,9] [,10] [,11] [,12]
## lag quarterly revenue 8.93673 8.96161 8.96044 9.00868 9.03049 9.06906
## price index 4.61991 4.61654 4.61407 4.60766 4.60227 4.58960
## income level 5.87769 5.89763 5.92574 5.94232 5.95365 5.96120
## market potential 12.99000 12.99430 12.99920 13.00330 13.00990 13.01590
## [,13] [,14] [,15] [,16] [,17] [,18]
## lag quarterly revenue 9.05871 9.10698 9.12685 9.17096 9.18665 9.23823
## price index 4.57592 4.58661 4.57997 4.57176 4.56104 4.54906
## income level 5.97805 6.00377 6.02829 6.03475 6.03906 6.05046
## market potential 13.02120 13.02650 13.03510 13.04290 13.04970 13.05510
## [,19] [,20] [,21] [,22] [,23] [,24]
## lag quarterly revenue 9.26487 9.28436 9.31378 9.35025 9.35835 9.39767
## price index 4.53957 4.51018 4.50352 4.49360 4.46505 4.44924
## income level 6.05563 6.06093 6.07103 6.08018 6.08858 6.10199
## market potential 13.06340 13.06930 13.07370 13.07700 13.08490 13.09180
## [,25] [,26] [,27] [,28] [,29] [,30]
## lag quarterly revenue 9.42150 9.44223 9.48721 9.52374 9.53980 9.58123
## price index 4.43966 4.42025 4.41060 4.41151 4.39810 4.38513
## income level 6.11207 6.11596 6.12129 6.12200 6.13119 6.14705
## market potential 13.09500 13.09840 13.10890 13.11690 13.12220 13.12660
## [,31] [,32] [,33] [,34] [,35] [,36]
## lag quarterly revenue 9.60048 9.64496 9.64390 9.69405 9.69958 9.68683
## price index 4.37320 4.32770 4.32023 4.30909 4.30909 4.30552
## income level 6.15336 6.15627 6.16274 6.17369 6.16135 6.18231
## market potential 13.13560 13.14150 13.14440 13.14590 13.15200 13.15930
## [,37] [,38] [,39]
## lag quarterly revenue 9.71774 9.74924 9.77536
## price index 4.29627 4.27839 4.27789
## income level 6.18768 6.19377 6.20030
## market potential 13.15790 13.16250 13.16640
mtcars_flip <- t(mtcars)
head(mtcars_flip)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
## mpg 21.00 21.000 22.80 21.400 18.70
## cyl 6.00 6.000 4.00 6.000 8.00
## disp 160.00 160.000 108.00 258.000 360.00
## hp 110.00 110.000 93.00 110.000 175.00
## drat 3.90 3.900 3.85 3.080 3.15
## wt 2.62 2.875 2.32 3.215 3.44
## Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
## mpg 18.10 14.30 24.40 22.80 19.20 17.80 16.40
## cyl 6.00 8.00 4.00 4.00 6.00 6.00 8.00
## disp 225.00 360.00 146.70 140.80 167.60 167.60 275.80
## hp 105.00 245.00 62.00 95.00 123.00 123.00 180.00
## drat 2.76 3.21 3.69 3.92 3.92 3.92 3.07
## wt 3.46 3.57 3.19 3.15 3.44 3.44 4.07
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## mpg 17.30 15.20 10.40 10.400
## cyl 8.00 8.00 8.00 8.000
## disp 275.80 275.80 472.00 460.000
## hp 180.00 180.00 205.00 215.000
## drat 3.07 3.07 2.93 3.000
## wt 3.73 3.78 5.25 5.424
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
## mpg 14.700 32.40 30.400 33.900 21.500
## cyl 8.000 4.00 4.000 4.000 4.000
## disp 440.000 78.70 75.700 71.100 120.100
## hp 230.000 66.00 52.000 65.000 97.000
## drat 3.230 4.08 4.930 4.220 3.700
## wt 5.345 2.20 1.615 1.835 2.465
## Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
## mpg 15.50 15.200 13.30 19.200 27.300
## cyl 8.00 8.000 8.00 8.000 4.000
## disp 318.00 304.000 350.00 400.000 79.000
## hp 150.00 150.000 245.00 175.000 66.000
## drat 2.76 3.150 3.73 3.080 4.080
## wt 3.52 3.435 3.84 3.845 1.935
## Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora
## mpg 26.00 30.400 15.80 19.70 15.00
## cyl 4.00 4.000 8.00 6.00 8.00
## disp 120.30 95.100 351.00 145.00 301.00
## hp 91.00 113.000 264.00 175.00 335.00
## drat 4.43 3.770 4.22 3.62 3.54
## wt 2.14 1.513 3.17 2.77 3.57
## Volvo 142E
## mpg 21.40
## cyl 4.00
## disp 121.00
## hp 109.00
## drat 4.11
## wt 2.78
str(mtcars_flip)
## num [1:11, 1:32] 21 6 160 110 3.9 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
## ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
mtcars_flip <- as.data.frame(mtcars_flip)
str(mtcars_flip)
## 'data.frame': 11 obs. of 32 variables:
## $ Mazda RX4 : num 21 6 160 110 3.9 ...
## $ Mazda RX4 Wag : num 21 6 160 110 3.9 ...
## $ Datsun 710 : num 22.8 4 108 93 3.85 ...
## $ Hornet 4 Drive : num 21.4 6 258 110 3.08 ...
## $ Hornet Sportabout : num 18.7 8 360 175 3.15 ...
## $ Valiant : num 18.1 6 225 105 2.76 ...
## $ Duster 360 : num 14.3 8 360 245 3.21 ...
## $ Merc 240D : num 24.4 4 146.7 62 3.69 ...
## $ Merc 230 : num 22.8 4 140.8 95 3.92 ...
## $ Merc 280 : num 19.2 6 167.6 123 3.92 ...
## $ Merc 280C : num 17.8 6 167.6 123 3.92 ...
## $ Merc 450SE : num 16.4 8 275.8 180 3.07 ...
## $ Merc 450SL : num 17.3 8 275.8 180 3.07 ...
## $ Merc 450SLC : num 15.2 8 275.8 180 3.07 ...
## $ Cadillac Fleetwood : num 10.4 8 472 205 2.93 ...
## $ Lincoln Continental: num 10.4 8 460 215 3 ...
## $ Chrysler Imperial : num 14.7 8 440 230 3.23 ...
## $ Fiat 128 : num 32.4 4 78.7 66 4.08 ...
## $ Honda Civic : num 30.4 4 75.7 52 4.93 ...
## $ Toyota Corolla : num 33.9 4 71.1 65 4.22 ...
## $ Toyota Corona : num 21.5 4 120.1 97 3.7 ...
## $ Dodge Challenger : num 15.5 8 318 150 2.76 ...
## $ AMC Javelin : num 15.2 8 304 150 3.15 ...
## $ Camaro Z28 : num 13.3 8 350 245 3.73 ...
## $ Pontiac Firebird : num 19.2 8 400 175 3.08 ...
## $ Fiat X1-9 : num 27.3 4 79 66 4.08 ...
## $ Porsche 914-2 : num 26 4 120.3 91 4.43 ...
## $ Lotus Europa : num 30.4 4 95.1 113 3.77 ...
## $ Ford Pantera L : num 15.8 8 351 264 4.22 3.17 14.5 0 1 5 ...
## $ Ferrari Dino : num 19.7 6 145 175 3.62 2.77 15.5 0 1 5 ...
## $ Maserati Bora : num 15 8 301 335 3.54 3.57 14.6 0 1 5 ...
## $ Volvo 142E : num 21.4 4 121 109 4.11 2.78 18.6 1 1 4 ...
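The mtcars example transposes cleanly because every column is numeric. The downstream errors mentioned above are more likely when a data frame mixes text and numbers, since a matrix can hold only one type. The following sketch uses a small made-up data frame called mixed to show the effect.

```r
# A data frame with one character column and one numeric column.
mixed <- data.frame(sample_id = c("s1", "s2", "s3"),
                    reads = c(100, 250, 75))

flipped <- t(mixed)

# The result is a matrix, and every value has been converted to text.
is.matrix(flipped)
typeof(flipped)      # "character"

# Numbers must be converted back before doing arithmetic.
reads_back <- as.numeric(flipped["reads", ])
sum(reads_back)      # 425
```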
One of the most common tasks in data analysis is to perform filtering. Last prac, we found out how to do this with vectors using the square bracket notation. E.g. x[3] will retrieve the 3rd element of x. Square brackets can also be used for two dimensional objects, but we need to provide two indexes. The syntax is df[rows,cols].
# get rows 1-10 of column 2
freeny.x[1:10,2]
## [1] 4.70997 4.70217 4.68944 4.68558 4.64019 4.62553 4.61991 4.61654 4.61407
## [10] 4.60766
# get rows 1-6 of columns 1-3
freeny.x[1:6,1:3]
## lag quarterly revenue price index income level
## [1,] 8.79636 4.70997 5.82110
## [2,] 8.79236 4.70217 5.82558
## [3,] 8.79137 4.68944 5.83112
## [4,] 8.81486 4.68558 5.84046
## [5,] 8.81301 4.64019 5.85036
## [6,] 8.90751 4.62553 5.86464
# get rows 1-6 of all columns
freeny.x[1:6,]
## lag quarterly revenue price index income level market potential
## [1,] 8.79636 4.70997 5.82110 12.9699
## [2,] 8.79236 4.70217 5.82558 12.9733
## [3,] 8.79137 4.68944 5.83112 12.9774
## [4,] 8.81486 4.68558 5.84046 12.9806
## [5,] 8.81301 4.64019 5.85036 12.9831
## [6,] 8.90751 4.62553 5.86464 12.9854
# get all rows for columns 1 and 2
freeny.x[,1:2]
## lag quarterly revenue price index
## [1,] 8.79636 4.70997
## [2,] 8.79236 4.70217
## [3,] 8.79137 4.68944
## [4,] 8.81486 4.68558
## [5,] 8.81301 4.64019
## [6,] 8.90751 4.62553
## [7,] 8.93673 4.61991
## [8,] 8.96161 4.61654
## [9,] 8.96044 4.61407
## [10,] 9.00868 4.60766
## [11,] 9.03049 4.60227
## [12,] 9.06906 4.58960
## [13,] 9.05871 4.57592
## [14,] 9.10698 4.58661
## [15,] 9.12685 4.57997
## [16,] 9.17096 4.57176
## [17,] 9.18665 4.56104
## [18,] 9.23823 4.54906
## [19,] 9.26487 4.53957
## [20,] 9.28436 4.51018
## [21,] 9.31378 4.50352
## [22,] 9.35025 4.49360
## [23,] 9.35835 4.46505
## [24,] 9.39767 4.44924
## [25,] 9.42150 4.43966
## [26,] 9.44223 4.42025
## [27,] 9.48721 4.41060
## [28,] 9.52374 4.41151
## [29,] 9.53980 4.39810
## [30,] 9.58123 4.38513
## [31,] 9.60048 4.37320
## [32,] 9.64496 4.32770
## [33,] 9.64390 4.32023
## [34,] 9.69405 4.30909
## [35,] 9.69958 4.30909
## [36,] 9.68683 4.30552
## [37,] 9.71774 4.29627
## [38,] 9.74924 4.27839
## [39,] 9.77536 4.27789
Now we need to see what happens when we subset just one column or row. You can see the default behaviour is to convert the data from matrix format to a vector. We can modify this using drop=FALSE to keep it in matrix format.
# get all rows of column 1
freeny.x[,1]
## [1] 8.79636 8.79236 8.79137 8.81486 8.81301 8.90751 8.93673 8.96161 8.96044
## [10] 9.00868 9.03049 9.06906 9.05871 9.10698 9.12685 9.17096 9.18665 9.23823
## [19] 9.26487 9.28436 9.31378 9.35025 9.35835 9.39767 9.42150 9.44223 9.48721
## [28] 9.52374 9.53980 9.58123 9.60048 9.64496 9.64390 9.69405 9.69958 9.68683
## [37] 9.71774 9.74924 9.77536
# to prevent conversion to vector, use drop=FALSE
freeny.x[,1,drop=FALSE]
## lag quarterly revenue
## [1,] 8.79636
## [2,] 8.79236
## [3,] 8.79137
## [4,] 8.81486
## [5,] 8.81301
## [6,] 8.90751
## [7,] 8.93673
## [8,] 8.96161
## [9,] 8.96044
## [10,] 9.00868
## [11,] 9.03049
## [12,] 9.06906
## [13,] 9.05871
## [14,] 9.10698
## [15,] 9.12685
## [16,] 9.17096
## [17,] 9.18665
## [18,] 9.23823
## [19,] 9.26487
## [20,] 9.28436
## [21,] 9.31378
## [22,] 9.35025
## [23,] 9.35835
## [24,] 9.39767
## [25,] 9.42150
## [26,] 9.44223
## [27,] 9.48721
## [28,] 9.52374
## [29,] 9.53980
## [30,] 9.58123
## [31,] 9.60048
## [32,] 9.64496
## [33,] 9.64390
## [34,] 9.69405
## [35,] 9.69958
## [36,] 9.68683
## [37,] 9.71774
## [38,] 9.74924
## [39,] 9.77536
# the same concept for rows
freeny.x[1,]
## lag quarterly revenue price index income level
## 8.79636 4.70997 5.82110
## market potential
## 12.96990
# to keep the row in matrix format, use drop=FALSE
freeny.x[1,,drop=FALSE]
## lag quarterly revenue price index income level market potential
## [1,] 8.79636 4.70997 5.8211 12.9699
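A quick way to confirm what drop=FALSE changes is to check the dimensions of each result. In this sketch the names col_vec and col_mat are illustrative.

```r
col_vec <- freeny.x[, 1]
col_mat <- freeny.x[, 1, drop = FALSE]

# A plain vector has no dimensions, so dim() returns NULL.
dim(col_vec)
is.matrix(col_vec)   # FALSE

# With drop=FALSE the result is still a matrix with 39 rows and 1 column.
dim(col_mat)
is.matrix(col_mat)   # TRUE
```

This matters when later code calls functions such as ncol() or colMeans(), which expect a two dimensional object.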
Square brackets also work for data frames.
mtcars[1:10,1:6]
## mpg cyl disp hp drat wt
## Mazda RX4 21.0 6 160.0 110 3.90 2.620
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875
## Datsun 710 22.8 4 108.0 93 3.85 2.320
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440
## Valiant 18.1 6 225.0 105 2.76 3.460
## Duster 360 14.3 8 360.0 245 3.21 3.570
## Merc 240D 24.4 4 146.7 62 3.69 3.190
## Merc 230 22.8 4 140.8 95 3.92 3.150
## Merc 280 19.2 6 167.6 123 3.92 3.440
mtcars[,1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars[,1,drop=FALSE]
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
## Lincoln Continental 10.4
## Chrysler Imperial 14.7
## Fiat 128 32.4
## Honda Civic 30.4
## Toyota Corolla 33.9
## Toyota Corona 21.5
## Dodge Challenger 15.5
## AMC Javelin 15.2
## Camaro Z28 13.3
## Pontiac Firebird 19.2
## Fiat X1-9 27.3
## Porsche 914-2 26.0
## Lotus Europa 30.4
## Ford Pantera L 15.8
## Ferrari Dino 19.7
## Maserati Bora 15.0
## Volvo 142E 21.4
Data frames also have more options around subsetting columns. For example we can subset based on the name of the column or row.
mtcars[,"cyl"]
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[,c("mpg","wt")]
## mpg wt
## Mazda RX4 21.0 2.620
## Mazda RX4 Wag 21.0 2.875
## Datsun 710 22.8 2.320
## Hornet 4 Drive 21.4 3.215
## Hornet Sportabout 18.7 3.440
## Valiant 18.1 3.460
## Duster 360 14.3 3.570
## Merc 240D 24.4 3.190
## Merc 230 22.8 3.150
## Merc 280 19.2 3.440
## Merc 280C 17.8 3.440
## Merc 450SE 16.4 4.070
## Merc 450SL 17.3 3.730
## Merc 450SLC 15.2 3.780
## Cadillac Fleetwood 10.4 5.250
## Lincoln Continental 10.4 5.424
## Chrysler Imperial 14.7 5.345
## Fiat 128 32.4 2.200
## Honda Civic 30.4 1.615
## Toyota Corolla 33.9 1.835
## Toyota Corona 21.5 2.465
## Dodge Challenger 15.5 3.520
## AMC Javelin 15.2 3.435
## Camaro Z28 13.3 3.840
## Pontiac Firebird 19.2 3.845
## Fiat X1-9 27.3 1.935
## Porsche 914-2 26.0 2.140
## Lotus Europa 30.4 1.513
## Ford Pantera L 15.8 3.170
## Ferrari Dino 19.7 2.770
## Maserati Bora 15.0 3.570
## Volvo 142E 21.4 2.780
mtcars["Camaro Z28",c("mpg","wt")]
## mpg wt
## Camaro Z28 13.3 3.84
Data frame columns can also be subset using the $ notation. The syntax is df$col.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
This type of notation can even be used to create new columns. In the example below, we convert the miles per gallon value to litres per 100 km. We then round this value to three significant figures.
mtcars$lper100km <- 235.215 / mtcars$mpg
mtcars[,c(1,ncol(mtcars))]
## mpg lper100km
## Mazda RX4 21.0 11.200714
## Mazda RX4 Wag 21.0 11.200714
## Datsun 710 22.8 10.316447
## Hornet 4 Drive 21.4 10.991355
## Hornet Sportabout 18.7 12.578342
## Valiant 18.1 12.995304
## Duster 360 14.3 16.448601
## Merc 240D 24.4 9.639959
## Merc 230 22.8 10.316447
## Merc 280 19.2 12.250781
## Merc 280C 17.8 13.214326
## Merc 450SE 16.4 14.342378
## Merc 450SL 17.3 13.596243
## Merc 450SLC 15.2 15.474671
## Cadillac Fleetwood 10.4 22.616827
## Lincoln Continental 10.4 22.616827
## Chrysler Imperial 14.7 16.001020
## Fiat 128 32.4 7.259722
## Honda Civic 30.4 7.737336
## Toyota Corolla 33.9 6.938496
## Toyota Corona 21.5 10.940233
## Dodge Challenger 15.5 15.175161
## AMC Javelin 15.2 15.474671
## Camaro Z28 13.3 17.685338
## Pontiac Firebird 19.2 12.250781
## Fiat X1-9 27.3 8.615934
## Porsche 914-2 26.0 9.046731
## Lotus Europa 30.4 7.737336
## Ford Pantera L 15.8 14.887025
## Ferrari Dino 19.7 11.939848
## Maserati Bora 15.0 15.681000
## Volvo 142E 21.4 10.991355
mtcars$lper100km <- signif(235.215 / mtcars$mpg ,3)
mtcars[,c(1,ncol(mtcars))]
## mpg lper100km
## Mazda RX4 21.0 11.20
## Mazda RX4 Wag 21.0 11.20
## Datsun 710 22.8 10.30
## Hornet 4 Drive 21.4 11.00
## Hornet Sportabout 18.7 12.60
## Valiant 18.1 13.00
## Duster 360 14.3 16.40
## Merc 240D 24.4 9.64
## Merc 230 22.8 10.30
## Merc 280 19.2 12.30
## Merc 280C 17.8 13.20
## Merc 450SE 16.4 14.30
## Merc 450SL 17.3 13.60
## Merc 450SLC 15.2 15.50
## Cadillac Fleetwood 10.4 22.60
## Lincoln Continental 10.4 22.60
## Chrysler Imperial 14.7 16.00
## Fiat 128 32.4 7.26
## Honda Civic 30.4 7.74
## Toyota Corolla 33.9 6.94
## Toyota Corona 21.5 10.90
## Dodge Challenger 15.5 15.20
## AMC Javelin 15.2 15.50
## Camaro Z28 13.3 17.70
## Pontiac Firebird 19.2 12.30
## Fiat X1-9 27.3 8.62
## Porsche 914-2 26.0 9.05
## Lotus Europa 30.4 7.74
## Ford Pantera L 15.8 14.90
## Ferrari Dino 19.7 11.90
## Maserati Bora 15.0 15.70
## Volvo 142E 21.4 11.00
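Note that signif() keeps a number of significant figures, whereas round() keeps a number of decimal places. The sketch below compares the two on three of the mpg values from the table above (21.0, 24.4 and 33.9); the variable names are illustrative.

```r
# Fuel consumption for mpg values of 21.0, 24.4 and 33.9.
consumption <- 235.215 / c(21.0, 24.4, 33.9)

# Three significant figures: the number of decimals depends on magnitude.
sig3 <- signif(consumption, 3)
sig3                 # 11.20 9.64 6.94

# One decimal place regardless of magnitude.
rnd1 <- round(consumption, 1)
rnd1                 # 11.2 9.6 6.9
```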
You may also want to subset a data frame based on the values. Let’s say you want a car with fuel consumption less than 10 L/100km. Let’s do it the hard way first.
mtcars$lper100km
## [1] 11.20 11.20 10.30 11.00 12.60 13.00 16.40 9.64 10.30 12.30 13.20 14.30
## [13] 13.60 15.50 22.60 22.60 16.00 7.26 7.74 6.94 10.90 15.20 15.50 17.70
## [25] 12.30 8.62 9.05 7.74 14.90 11.90 15.70 11.00
mtcars$lper100km < 10
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
which(mtcars$lper100km < 10)
## [1] 8 18 19 20 26 27 28
mtcars[which(mtcars$lper100km < 10),]
## mpg cyl disp hp drat wt qsec vs am gear carb lper100km
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.64
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.26
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.74
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.94
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.62
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.05
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.74
You can see that this is quite complicated. There is an easier way using subset(). Subset is also perfect for filtering based on more than one criterion using the & and | operators.
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## lper100km
## Mazda RX4 11.20
## Mazda RX4 Wag 11.20
## Datsun 710 10.30
## Hornet 4 Drive 11.00
## Hornet Sportabout 12.60
## Valiant 13.00
## Duster 360 16.40
## Merc 240D 9.64
## Merc 230 10.30
## Merc 280 12.30
## Merc 280C 13.20
## Merc 450SE 14.30
## Merc 450SL 13.60
## Merc 450SLC 15.50
## Cadillac Fleetwood 22.60
## Lincoln Continental 22.60
## Chrysler Imperial 16.00
## Fiat 128 7.26
## Honda Civic 7.74
## Toyota Corolla 6.94
## Toyota Corona 10.90
## Dodge Challenger 15.20
## AMC Javelin 15.50
## Camaro Z28 17.70
## Pontiac Firebird 12.30
## Fiat X1-9 8.62
## Porsche 914-2 9.05
## Lotus Europa 7.74
## Ford Pantera L 14.90
## Ferrari Dino 11.90
## Maserati Bora 15.70
## Volvo 142E 11.00
subset(mtcars,lper100km < 10)
## mpg cyl disp hp drat wt qsec vs am gear carb lper100km
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9.64
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 7.26
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 7.74
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 6.94
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 8.62
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 9.05
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 7.74
# you want an economical AND quick car
subset(mtcars,lper100km < 10 & qsec < 18)
## mpg cyl disp hp drat wt qsec vs am gear carb lper100km
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 9.05
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 7.74
# you want an economical OR quick car
subset(mtcars,lper100km < 10 | qsec < 18)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## lper100km
## Mazda RX4 11.20
## Mazda RX4 Wag 11.20
## Hornet Sportabout 12.60
## Duster 360 16.40
## Merc 240D 9.64
## Merc 450SE 14.30
## Merc 450SL 13.60
## Cadillac Fleetwood 22.60
## Lincoln Continental 22.60
## Chrysler Imperial 16.00
## Fiat 128 7.26
## Honda Civic 7.74
## Toyota Corolla 6.94
## Dodge Challenger 15.20
## AMC Javelin 15.50
## Camaro Z28 17.70
## Pontiac Firebird 12.30
## Fiat X1-9 8.62
## Porsche 914-2 9.05
## Lotus Europa 7.74
## Ford Pantera L 14.90
## Ferrari Dino 11.90
## Maserati Bora 15.70
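Wide data frames wrap awkwardly when printed, as in the output above. One option is the select argument of subset(), which keeps only the columns you name. This is a minimal sketch using the original mpg and qsec columns; the thresholds and the object names are just for illustration.

```r
# Fresh copy of the built-in data with its original column names
cars <- datasets::mtcars

# Keep rows that are both economical (high mpg) and quick (low qsec),
# and return only the two columns of interest
quick_and_frugal <- subset(cars, mpg > 25 & qsec < 18, select = c(mpg, qsec))
quick_and_frugal
```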
subset() also works for strings and factors. To see this, we will use the iris dataset.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
setosa <- subset(iris,Species == "setosa")
head(setosa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
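To keep more than one species, the %in% operator can be used inside subset(). One thing to be aware of is that a factor remembers all of its levels even after filtering; droplevels() removes the unused ones. The object name two_species is just for this sketch.

```r
# Keep two of the three species
two_species <- subset(iris, Species %in% c("setosa", "virginica"))

# versicolor is still listed as a level, with a count of zero
table(two_species$Species)

# Remove levels that no longer occur in the data
two_species <- droplevels(two_species)
levels(two_species$Species)
```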
You can use colnames() and rownames() to get the column or row names, and even modify them.
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt"
## [7] "qsec" "vs" "am" "gear" "carb" "lper100km"
rownames(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
colnames(mtcars) <- c("miles per gallon",
"number of cylinders",
"displacement in cubic inches",
"gross horsepower",
"rear axle ratio",
"weight (pounds/1000)",
"quarter mile time in seconds",
"V or straight cylinder configuration",
"transmission type: auto (0) or manual (1)",
"number of forward gears",
"number of carburetors",
"litres per 100km")
head(mtcars)
## miles per gallon number of cylinders
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
## displacement in cubic inches gross horsepower rear axle ratio
## Mazda RX4 160 110 3.90
## Mazda RX4 Wag 160 110 3.90
## Datsun 710 108 93 3.85
## Hornet 4 Drive 258 110 3.08
## Hornet Sportabout 360 175 3.15
## Valiant 225 105 2.76
## weight (pounds/1000) quarter mile time in seconds
## Mazda RX4 2.620 16.46
## Mazda RX4 Wag 2.875 17.02
## Datsun 710 2.320 18.61
## Hornet 4 Drive 3.215 19.44
## Hornet Sportabout 3.440 17.02
## Valiant 3.460 20.22
## V or straight cylinder configuration
## Mazda RX4 0
## Mazda RX4 Wag 0
## Datsun 710 1
## Hornet 4 Drive 1
## Hornet Sportabout 0
## Valiant 1
## transmission type: auto (0) or manual (1)
## Mazda RX4 1
## Mazda RX4 Wag 1
## Datsun 710 1
## Hornet 4 Drive 0
## Hornet Sportabout 0
## Valiant 0
## number of forward gears number of carburetors
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Datsun 710 4 1
## Hornet 4 Drive 3 1
## Hornet Sportabout 3 2
## Valiant 3 1
## litres per 100km
## Mazda RX4 11.2
## Mazda RX4 Wag 11.2
## Datsun 710 10.3
## Hornet 4 Drive 11.0
## Hornet Sportabout 12.6
## Valiant 13.0
colnames(mtcars)[1] <- "miles per US gallon"
head(mtcars)
## miles per US gallon number of cylinders
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
## displacement in cubic inches gross horsepower rear axle ratio
## Mazda RX4 160 110 3.90
## Mazda RX4 Wag 160 110 3.90
## Datsun 710 108 93 3.85
## Hornet 4 Drive 258 110 3.08
## Hornet Sportabout 360 175 3.15
## Valiant 225 105 2.76
## weight (pounds/1000) quarter mile time in seconds
## Mazda RX4 2.620 16.46
## Mazda RX4 Wag 2.875 17.02
## Datsun 710 2.320 18.61
## Hornet 4 Drive 3.215 19.44
## Hornet Sportabout 3.440 17.02
## Valiant 3.460 20.22
## V or straight cylinder configuration
## Mazda RX4 0
## Mazda RX4 Wag 0
## Datsun 710 1
## Hornet 4 Drive 1
## Hornet Sportabout 0
## Valiant 1
## transmission type: auto (0) or manual (1)
## Mazda RX4 1
## Mazda RX4 Wag 1
## Datsun 710 1
## Hornet 4 Drive 0
## Hornet Sportabout 0
## Valiant 0
## number of forward gears number of carburetors
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Datsun 710 4 1
## Hornet 4 Drive 3 1
## Hornet Sportabout 3 2
## Valiant 3 1
## litres per 100km
## Mazda RX4 11.2
## Mazda RX4 Wag 11.2
## Datsun 710 10.3
## Hornet 4 Drive 11.0
## Hornet Sportabout 12.6
## Valiant 13.0
If a column or row name contains whitespace, it can cause problems later on with subsetting. In that case the column name needs to be wrapped in backticks like this.
economical_cars <- subset(mtcars,`litres per 100km` < 10)
economical_cars
## miles per US gallon number of cylinders
## Merc 240D 24.4 4
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Fiat X1-9 27.3 4
## Porsche 914-2 26.0 4
## Lotus Europa 30.4 4
## displacement in cubic inches gross horsepower rear axle ratio
## Merc 240D 146.7 62 3.69
## Fiat 128 78.7 66 4.08
## Honda Civic 75.7 52 4.93
## Toyota Corolla 71.1 65 4.22
## Fiat X1-9 79.0 66 4.08
## Porsche 914-2 120.3 91 4.43
## Lotus Europa 95.1 113 3.77
## weight (pounds/1000) quarter mile time in seconds
## Merc 240D 3.190 20.00
## Fiat 128 2.200 19.47
## Honda Civic 1.615 18.52
## Toyota Corolla 1.835 19.90
## Fiat X1-9 1.935 18.90
## Porsche 914-2 2.140 16.70
## Lotus Europa 1.513 16.90
## V or straight cylinder configuration
## Merc 240D 1
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Fiat X1-9 1
## Porsche 914-2 0
## Lotus Europa 1
## transmission type: auto (0) or manual (1)
## Merc 240D 0
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Fiat X1-9 1
## Porsche 914-2 1
## Lotus Europa 1
## number of forward gears number of carburetors litres per 100km
## Merc 240D 4 2 9.64
## Fiat 128 4 1 7.26
## Honda Civic 4 2 7.74
## Toyota Corolla 4 1 6.94
## Fiat X1-9 4 1 8.62
## Porsche 914-2 5 2 9.05
## Lotus Europa 5 2 7.74
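Backticks are not the only way to reach a column whose name contains spaces. The sketch below shows the double square bracket form, which takes the name as an ordinary string, and make.names(), which converts names into ones that need no quoting. The object names are chosen for this example only.

```r
# Fresh copy with one renamed column
cars <- datasets::mtcars
colnames(cars)[1] <- "miles per US gallon"

with_backticks <- cars$`miles per US gallon`
with_brackets <- cars[["miles per US gallon"]]
identical(with_backticks, with_brackets)

# Convert every column name to a syntactically valid one
safe_names <- make.names(colnames(cars))
safe_names[1]
```

Whether to keep descriptive names or short syntactic ones is a matter of taste; descriptive names are handy for plot labels, short ones are easier to type.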
It is also useful to be able to subset a data frame based on the row names. Let’s get all the Mercedes models. To do this, we need to introduce the grep() command, which matches strings against a pattern and returns the positions of the matches.
# let's look again at the car names
rownames(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
# let's find all the ones with "Merc" in the name
grep("Merc",rownames(mtcars))
## [1] 8 9 10 11 12 13 14
# now let's extract all those rows
mercs <- mtcars[grep("Merc",rownames(mtcars)),]
mercs
## miles per US gallon number of cylinders
## Merc 240D 24.4 4
## Merc 230 22.8 4
## Merc 280 19.2 6
## Merc 280C 17.8 6
## Merc 450SE 16.4 8
## Merc 450SL 17.3 8
## Merc 450SLC 15.2 8
## displacement in cubic inches gross horsepower rear axle ratio
## Merc 240D 146.7 62 3.69
## Merc 230 140.8 95 3.92
## Merc 280 167.6 123 3.92
## Merc 280C 167.6 123 3.92
## Merc 450SE 275.8 180 3.07
## Merc 450SL 275.8 180 3.07
## Merc 450SLC 275.8 180 3.07
## weight (pounds/1000) quarter mile time in seconds
## Merc 240D 3.19 20.0
## Merc 230 3.15 22.9
## Merc 280 3.44 18.3
## Merc 280C 3.44 18.9
## Merc 450SE 4.07 17.4
## Merc 450SL 3.73 17.6
## Merc 450SLC 3.78 18.0
## V or straight cylinder configuration
## Merc 240D 1
## Merc 230 1
## Merc 280 1
## Merc 280C 1
## Merc 450SE 0
## Merc 450SL 0
## Merc 450SLC 0
## transmission type: auto (0) or manual (1) number of forward gears
## Merc 240D 0 4
## Merc 230 0 4
## Merc 280 0 4
## Merc 280C 0 4
## Merc 450SE 0 3
## Merc 450SL 0 3
## Merc 450SLC 0 3
## number of carburetors litres per 100km
## Merc 240D 2 9.64
## Merc 230 2 10.30
## Merc 280 4 12.30
## Merc 280C 4 13.20
## Merc 450SE 3 14.30
## Merc 450SL 3 13.60
## Merc 450SLC 3 15.50
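Two relatives of grep() are worth knowing. With value = TRUE, grep() returns the matching strings instead of their positions, and grepl() returns a TRUE/FALSE vector that is easy to negate. The pattern "^Merc" below is an assumption for this sketch: the ^ anchors the match to the start of the name.

```r
car_names <- rownames(datasets::mtcars)

# Return the matching names rather than their positions
grep("^Merc", car_names, value = TRUE)

# Logical vector: TRUE where the name starts with "Merc"
is_merc <- grepl("^Merc", car_names)
sum(is_merc)

# Everything except the Mercedes models
not_mercs <- datasets::mtcars[!is_merc, ]
nrow(not_mercs)
```

Negating with !grepl() is safer than using -grep(), because when nothing matches, the negative-index form returns zero rows instead of all of them.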
We are going to sort our subset of economical cars by their speed, based on their quarter mile time. To do this, we need to use the order() command together with square brackets. order() only returns the indices of the values in sorted order; it doesn’t actually do the sorting. Note that the default behaviour of order() is to put the smallest values at the top. That can be reversed by putting a - before the vector being ordered.
economical_cars
## miles per US gallon number of cylinders
## Merc 240D 24.4 4
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Fiat X1-9 27.3 4
## Porsche 914-2 26.0 4
## Lotus Europa 30.4 4
## displacement in cubic inches gross horsepower rear axle ratio
## Merc 240D 146.7 62 3.69
## Fiat 128 78.7 66 4.08
## Honda Civic 75.7 52 4.93
## Toyota Corolla 71.1 65 4.22
## Fiat X1-9 79.0 66 4.08
## Porsche 914-2 120.3 91 4.43
## Lotus Europa 95.1 113 3.77
## weight (pounds/1000) quarter mile time in seconds
## Merc 240D 3.190 20.00
## Fiat 128 2.200 19.47
## Honda Civic 1.615 18.52
## Toyota Corolla 1.835 19.90
## Fiat X1-9 1.935 18.90
## Porsche 914-2 2.140 16.70
## Lotus Europa 1.513 16.90
## V or straight cylinder configuration
## Merc 240D 1
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Fiat X1-9 1
## Porsche 914-2 0
## Lotus Europa 1
## transmission type: auto (0) or manual (1)
## Merc 240D 0
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Fiat X1-9 1
## Porsche 914-2 1
## Lotus Europa 1
## number of forward gears number of carburetors litres per 100km
## Merc 240D 4 2 9.64
## Fiat 128 4 1 7.26
## Honda Civic 4 2 7.74
## Toyota Corolla 4 1 6.94
## Fiat X1-9 4 1 8.62
## Porsche 914-2 5 2 9.05
## Lotus Europa 5 2 7.74
order(economical_cars$`quarter mile time in seconds`)
## [1] 6 7 3 5 2 4 1
sorted <- economical_cars[order(economical_cars$`quarter mile time in seconds`),]
sorted[,c(7,ncol(sorted))]
## quarter mile time in seconds litres per 100km
## Porsche 914-2 16.70 9.05
## Lotus Europa 16.90 7.74
## Honda Civic 18.52 7.74
## Fiat X1-9 18.90 8.62
## Fiat 128 19.47 7.26
## Toyota Corolla 19.90 6.94
## Merc 240D 20.00 9.64
reverse_sorted <- economical_cars[order(-economical_cars$`quarter mile time in seconds`),]
reverse_sorted[,c(7,ncol(reverse_sorted))]
## quarter mile time in seconds litres per 100km
## Merc 240D 20.00 9.64
## Toyota Corolla 19.90 6.94
## Fiat 128 19.47 7.26
## Fiat X1-9 18.90 8.62
## Honda Civic 18.52 7.74
## Lotus Europa 16.90 7.74
## Porsche 914-2 16.70 9.05
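order() can also take several vectors, in which case later ones break ties in earlier ones, and it has a decreasing argument as an alternative to the minus sign. A small sketch on the original cyl and qsec columns follows; the object names are chosen for illustration.

```r
# Fresh copy with only the two columns we need
cars <- datasets::mtcars[, c("cyl", "qsec")]

# Sort by number of cylinders first, then by quarter mile time within each group
by_cyl_then_qsec <- cars[order(cars$cyl, cars$qsec), ]
head(by_cyl_then_qsec, 3)

# Largest quarter mile time (slowest car) at the top
slowest_first <- cars[order(cars$qsec, decreasing = TRUE), ]
head(slowest_first, 3)
```

The decreasing argument also works for character vectors, where a minus sign would give an error.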
First we will create a data frame for some people who completed a survey about their height and weight. You should always run str() to check that the resulting data frame has the intended structure. In R versions before 4.0.0 you may need to include stringsAsFactors=FALSE to prevent character strings from being converted to factors; from R 4.0.0 onwards this is the default, which is why both calls below give the same result.
pnames <- c("Jill", "Matt", "Sam", "Amy", "Bob", "Raj")
pnames
## [1] "Jill" "Matt" "Sam" "Amy" "Bob" "Raj"
pgender <- as.factor(c("F", "M", "F", "F", "M", "M"))
pgender
## [1] F M F F M M
## Levels: F M
pheight <- c(164, 186, 170, 175, 178, 191)
pheight
## [1] 164 186 170 175 178 191
pweight <- c(54.1, 90.3, 64.8, 66.7, 80.4, 86.9)
pweight
## [1] 54.1 90.3 64.8 66.7 80.4 86.9
df <- data.frame(pnames,pgender,pheight,pweight)
str(df)
## 'data.frame': 6 obs. of 4 variables:
## $ pnames : chr "Jill" "Matt" "Sam" "Amy" ...
## $ pgender: Factor w/ 2 levels "F","M": 1 2 1 1 2 2
## $ pheight: num 164 186 170 175 178 191
## $ pweight: num 54.1 90.3 64.8 66.7 80.4 86.9
df <- data.frame(pnames,pgender,pheight,pweight,stringsAsFactors = FALSE)
str(df)
## 'data.frame': 6 obs. of 4 variables:
## $ pnames : chr "Jill" "Matt" "Sam" "Amy" ...
## $ pgender: Factor w/ 2 levels "F","M": 1 2 1 1 2 2
## $ pheight: num 164 186 170 175 178 191
## $ pweight: num 54.1 90.3 64.8 66.7 80.4 86.9
df
## pnames pgender pheight pweight
## 1 Jill F 164 54.1
## 2 Matt M 186 90.3
## 3 Sam F 170 64.8
## 4 Amy F 175 66.7
## 5 Bob M 178 80.4
## 6 Raj M 191 86.9
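To see what stringsAsFactors actually changes, it helps to set it both ways explicitly. This is a small sketch with made-up values; the names as_factor and as_text are only for this example.

```r
# Explicitly request conversion of strings to factors
as_factor <- data.frame(name = c("Jill", "Matt"), stringsAsFactors = TRUE)
class(as_factor$name)

# Explicitly keep strings as character vectors
as_text <- data.frame(name = c("Jill", "Matt"), stringsAsFactors = FALSE)
class(as_text$name)
```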
Now we might want to make the row names the name of the person. This makes the data tidier, but it won’t work if there is more than one entry with the same name. You can assign NULL to a column to delete it. Deleting rows can be done with square brackets and negative indices.
rownames(df) <- df$pnames
df
## pnames pgender pheight pweight
## Jill Jill F 164 54.1
## Matt Matt M 186 90.3
## Sam Sam F 170 64.8
## Amy Amy F 175 66.7
## Bob Bob M 178 80.4
## Raj Raj M 191 86.9
df$pnames <- NULL
df
## pgender pheight pweight
## Jill F 164 54.1
## Matt M 186 90.3
## Sam F 170 64.8
## Amy F 175 66.7
## Bob M 178 80.4
## Raj M 191 86.9
# delete row 2 and 4
df <- df[-c(2,4),]
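Deleting rows by position is fragile if the row order changes. Once a data frame has meaningful row names, rows can be removed by name instead. The sketch below rebuilds a small version of the survey data so that it runs on its own; the names people, to_drop and kept are assumptions for this example.

```r
# Small stand-alone version of the survey data, with names as row names
people <- data.frame(
  height = c(164, 186, 170, 175, 178, 191),
  weight = c(54.1, 90.3, 64.8, 66.7, 80.4, 86.9),
  row.names = c("Jill", "Matt", "Sam", "Amy", "Bob", "Raj")
)

# Keep every row whose name is NOT in the list to drop
to_drop <- c("Matt", "Amy")
kept <- people[!(rownames(people) %in% to_drop), ]
rownames(kept)
```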
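Now we will convert df into a matrix. A matrix can hold only one data type, so while the pgender factor is present every value is converted to a character string. Converting pgender to numeric first (F becomes 1 and M becomes 2) gives a numeric matrix.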
as.matrix(df)
## pgender pheight pweight
## Jill "F" "164" "54.1"
## Sam "F" "170" "64.8"
## Bob "M" "178" "80.4"
## Raj "M" "191" "86.9"
df$pgender <- as.numeric(df$pgender)
mymat <- as.matrix(df)
mymat
## pgender pheight pweight
## Jill 1 164 54.1
## Sam 1 170 64.8
## Bob 2 178 80.4
## Raj 2 191 86.9
str(mymat)
## num [1:4, 1:3] 1 1 2 2 164 170 178 191 54.1 64.8 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:4] "Jill" "Sam" "Bob" "Raj"
## ..$ : chr [1:3] "pgender" "pheight" "pweight"
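A word of caution about as.numeric() on factors: it returns the internal level codes, not the labels. That is what we wanted for pgender above, but for a factor whose labels look like numbers it can give a surprising result. A minimal sketch with made-up values:

```r
# A factor whose labels happen to be numbers
f <- factor(c("10", "20", "10", "30"))

# Level codes: position of each value among the sorted levels
codes <- as.numeric(f)
codes

# Going via character recovers the numbers written in the labels
values <- as.numeric(as.character(f))
values
```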
We can also convert other types of data into a matrix. To demonstrate this, I’ll create some random data with rnorm() and convert it into a matrix. Because the data are random, your values will differ from those shown here.
mydata <- rnorm(n = 100, mean = 10, sd = 20)
mymatrix <- matrix(data = mydata, nrow = 20, ncol = 5)
mymatrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] -5.362961 16.813039 -19.886142 15.6748462 19.756765
## [2,] 22.459162 25.874603 -5.080414 22.8898737 0.838587
## [3,] 11.705377 10.508057 -15.861401 15.0272497 23.816758
## [4,] 27.744727 12.607702 9.414938 -5.4591222 21.627774
## [5,] 16.224883 -10.964592 16.466131 20.8764582 16.610439
## [6,] 1.181048 23.203298 9.708798 30.9594701 12.142181
## [7,] -6.258320 -23.269052 44.366705 -13.1396102 -12.784759
## [8,] 9.891521 23.368556 -5.800771 6.8270441 -6.385512
## [9,] 6.776381 6.288497 12.656693 9.8018106 8.477256
## [10,] 27.832346 -11.458067 -3.920119 13.1407818 7.077689
## [11,] 26.053923 19.624235 18.224092 13.1742742 30.674810
## [12,] -18.761566 14.498805 6.885688 30.4425231 -4.005946
## [13,] 21.210878 18.098923 -15.046685 25.9500029 -6.533689
## [14,] 31.375715 32.239798 1.278403 -0.2864208 25.013434
## [15,] 17.098985 -17.500661 -16.159211 -34.9421872 24.783650
## [16,] 37.271382 23.659692 26.064014 20.2886377 19.749735
## [17,] 51.750068 10.656921 48.482437 12.6853048 39.457832
## [18,] 19.495228 48.107814 25.528925 -8.8210692 2.254276
## [19,] 31.726173 -13.789659 19.175284 42.7355994 -3.262027
## [20,] 6.184396 9.908607 -1.698200 1.1724725 -4.662633
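If you want the same "random" numbers every time the script runs, set.seed() can be called before rnorm(). The seed value 42 and the object names below are arbitrary choices for this sketch.

```r
# Fix the random number generator so the result is repeatable
set.seed(42)
m1 <- matrix(rnorm(n = 100, mean = 10, sd = 20), nrow = 20, ncol = 5)

# Same seed, same numbers
set.seed(42)
m2 <- matrix(rnorm(n = 100, mean = 10, sd = 20), nrow = 20, ncol = 5)

identical(m1, m2)
dim(m1)
colMeans(m1)
```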
Making charts from data frames and matrix objects is really similar to what we did above. Here are some examples.
# histogram of cylinders
data(mtcars)
hist(mtcars$cyl,xlab="number of cylinders", main="number of cylinders in mtcars data")
# boxplot of qsec values for Toyota and Mercedes
toyota <- mtcars[grep("Toyota",rownames(mtcars)),]
merc <- mtcars[grep("Merc",rownames(mtcars)),]
boxplot(toyota$qsec, merc$qsec,
ylab="quarter mile time in seconds",
main="mtcars",
names = c("Toyota","Mercedes"))
# scatterplot of petal length vs sepal length for setosa irises
# include a trend line using the lm function
setosa <- subset(iris,Species=="setosa")
head(setosa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
mylm <- lm(setosa$Sepal.Length ~ setosa$Petal.Length)
plot(setosa$Petal.Length,setosa$Sepal.Length,
xlab="Petal length (cm)",
ylab="Sepal length (cm)",
main="Setosa petal and sepal length",
pch=19)
abline(mylm,col="red",lwd=2,lty=2)
# a pairs plot is a special type of scatterplot
pairs(mymatrix)
# lets make a line diagram of freeny revenues
# first need to normalise each column to the initial value
head(freeny.x)
## lag quarterly revenue price index income level market potential
## [1,] 8.79636 4.70997 5.82110 12.9699
## [2,] 8.79236 4.70217 5.82558 12.9733
## [3,] 8.79137 4.68944 5.83112 12.9774
## [4,] 8.81486 4.68558 5.84046 12.9806
## [5,] 8.81301 4.64019 5.85036 12.9831
## [6,] 8.90751 4.62553 5.86464 12.9854
freeny_norm <-t( t(freeny.x) / freeny.x[1,] )
head(freeny_norm)
## lag quarterly revenue price index income level market potential
## [1,] 1.0000000 1.0000000 1.000000 1.000000
## [2,] 0.9995453 0.9983439 1.000770 1.000262
## [3,] 0.9994327 0.9956412 1.001721 1.000578
## [4,] 1.0021031 0.9948216 1.003326 1.000825
## [5,] 1.0018928 0.9851846 1.005027 1.001018
## [6,] 1.0126359 0.9820721 1.007480 1.001195
plot(freeny_norm[,1], type="b", ylim=c(0.9,1.12), col="blue",
xlab="Quarters beginning 1962 Q2",
ylab="Change in values over time")
lines(freeny_norm[,2], type="b", col="red")
lines(freeny_norm[,3], type="b", col="black")
lines(freeny_norm[,4], type="b", col="darkgreen")
legend("topleft", legend = colnames(freeny.x), lty=1 , col = c("blue","red", "black", "darkgreen"))
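The normalisation step above, t( t(freeny.x) / freeny.x[1,] ), deserves a short explanation. R recycles a vector down the columns of a matrix, so dividing freeny.x directly by its first row would pair up the wrong elements. Transposing first puts each variable in its own row, the division then lines up correctly, and the second t() restores the original orientation. The sweep() function is an alternative that states the intent more directly; the sketch below checks that both give the same numbers.

```r
# Approach used above: transpose, divide, transpose back
norm_t <- t(t(freeny.x) / freeny.x[1, ])

# Alternative: divide each column (margin 2) by its first value
norm_sweep <- sweep(freeny.x, 2, freeny.x[1, ], "/")

# Compare the numbers, ignoring any differences in attributes
all.equal(norm_t, norm_sweep, check.attributes = FALSE)

# After normalising, the first row is all ones
norm_t[1, ]
```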
In R, sometimes we need to load a particular package in order to make a special type of chart. In the example below we are making a heatmap, where the colour indicates the numerical value.
library("gplots")
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
heatmap.2(mymatrix,trace="none",scale="none",main="heatmap")
heatmap.2(cor(mymatrix),trace="none",scale="none", main="correlation")
This is also a good time to let you know that you can save R charts as files. There are different types of file formats, but PNG and PDF are the most used types. You will see these new files appear in your files menu.
pdf("myplot.pdf")
plot(1:10)
dev.off()
## png
## 2
png("myplot.png")
plot(1:10)
dev.off()
## png
## 2
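The size of the saved chart can be controlled when the file is opened. For pdf(), width and height are given in inches; png() takes width and height too, in pixels by default. In this sketch the file is written to a temporary folder so that it does not clutter your working directory; the file name is an arbitrary choice.

```r
# Write to a temporary folder
out_file <- file.path(tempdir(), "myplot_wide.pdf")

# Open a PDF device that is 8 inches wide and 4 inches tall
pdf(out_file, width = 8, height = 4)
plot(1:10, main = "A wide plot")

# Close the device so the file is written; invisible() hides the printed device name
invisible(dev.off())

file.exists(out_file)
```

Remember that nothing is written until dev.off() is called, so a missing or empty file usually means the device was left open.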
Create a scatterplot of mtcars weight (x axis) versus mpg (y axis). Include x and y axis labels and a main heading.
Sort mtcars by weight (wt) and create a horizontal barplot of wt values so that the heaviest ones are shown at the top of the bar plot. The plot should be labeled so it is clear which bar belongs to which car. Include an axis label and main title.
Create a box plot of iris petal lengths. Each species should be a different category. The chart needs axis labels and a main title.
So far we have been working with datasets that are built into R, but in the real world you will be working with data files. These could be in different formats like text (.txt), comma separated values (.csv), tab separated values (.tsv) and perhaps Excel files too (.xls and .xlsx).
We will be working with a TSV file of pipetting measurements. URL: https://raw.githubusercontent.com/markziemann/SLE712_files/master/pipette_test.tsv
We will use the read.table() command and show you some really important options.
URL="https://raw.githubusercontent.com/markziemann/SLE712_files/master/pipette_test.tsv"
download.file(URL,"my.tsv")
# look closely at the structure of pip
# can you see what is wrong?
pip <- read.table("my.tsv")
pip
## V1 V2 V3 V4
## 1 RefValue R100uL R150uL R200uL
## 2 M1 99.62 149.56 200.16
## 3 M2 98.48 147.06 199.88
## 4 M3 100.26 151.34 199.92
## 5 M4 101.12 150.12 200.62
## 6 M5 99.89 149.94 201.37
str(pip)
## 'data.frame': 6 obs. of 4 variables:
## $ V1: chr "RefValue" "M1" "M2" "M3" ...
## $ V2: chr "R100uL" "99.62" "98.48" "100.26" ...
## $ V3: chr "R150uL" "149.56" "147.06" "151.34" ...
## $ V4: chr "R200uL" "200.16" "199.88" "199.92" ...
# try again
pip <- read.table(URL,stringsAsFactors = FALSE)
pip
## V1 V2 V3 V4
## 1 RefValue R100uL R150uL R200uL
## 2 M1 99.62 149.56 200.16
## 3 M2 98.48 147.06 199.88
## 4 M3 100.26 151.34 199.92
## 5 M4 101.12 150.12 200.62
## 6 M5 99.89 149.94 201.37
str(pip)
## 'data.frame': 6 obs. of 4 variables:
## $ V1: chr "RefValue" "M1" "M2" "M3" ...
## $ V2: chr "R100uL" "99.62" "98.48" "100.26" ...
## $ V3: chr "R150uL" "149.56" "147.06" "151.34" ...
## $ V4: chr "R200uL" "200.16" "199.88" "199.92" ...
# try again
# looking better
pip <- read.table(URL,stringsAsFactors = FALSE, header=TRUE)
pip
## RefValue R100uL R150uL R200uL
## 1 M1 99.62 149.56 200.16
## 2 M2 98.48 147.06 199.88
## 3 M3 100.26 151.34 199.92
## 4 M4 101.12 150.12 200.62
## 5 M5 99.89 149.94 201.37
str(pip)
## 'data.frame': 5 obs. of 4 variables:
## $ RefValue: chr "M1" "M2" "M3" "M4" ...
## $ R100uL : num 99.6 98.5 100.3 101.1 99.9
## $ R150uL : num 150 147 151 150 150
## $ R200uL : num 200 200 200 201 201
# got it now
pip <- read.table(URL,stringsAsFactors = FALSE, header=TRUE, row.names=1)
pip
## R100uL R150uL R200uL
## M1 99.62 149.56 200.16
## M2 98.48 147.06 199.88
## M3 100.26 151.34 199.92
## M4 101.12 150.12 200.62
## M5 99.89 149.94 201.37
str(pip)
## 'data.frame': 5 obs. of 3 variables:
## $ R100uL: num 99.6 98.5 100.3 101.1 99.9
## $ R150uL: num 150 147 151 150 150
## $ R200uL: num 200 200 200 201 201
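The calls above rely on the default separator of read.table(), which splits on any whitespace. That works here, but it would break on a file where a field itself contains a space, so for TSV files it can be safer to state sep="\t" explicitly. The sketch below writes a tiny TSV file to a temporary folder, using a few of the values shown above, so that it runs without an internet connection. The file and object names are assumptions for this example.

```r
# Create a small tab separated file in a temporary folder
tsv_file <- file.path(tempdir(), "example.tsv")
writeLines(c("RefValue\tR100uL\tR150uL",
             "M1\t99.62\t149.56",
             "M2\t98.48\t147.06"), tsv_file)

# Read it back: first line is the header, first column holds the row names
pip_small <- read.table(tsv_file, header = TRUE, sep = "\t", row.names = 1)
pip_small
str(pip_small)
```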
Now let’s try a CSV file containing some travel records.
URL="https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
trav <- read.table(URL,sep=",")
trav
## V1 V2 V3 V4
## 1 Month 1958 1959 1960
## 2 JAN 340 360 417
## 3 FEB 318 342 391
## 4 MAR 362 406 419
## 5 APR 348 396 461
## 6 MAY 363 420 472
## 7 JUN 435 472 535
## 8 JUL 491 548 622
## 9 AUG 505 559 606
## 10 SEP 404 463 508
## 11 OCT 359 407 461
## 12 NOV 310 362 390
## 13 DEC 337 405 432
str(trav)
## 'data.frame': 13 obs. of 4 variables:
## $ V1: chr "Month" "JAN" "FEB" "MAR" ...
## $ V2: int 1958 340 318 362 348 363 435 491 505 404 ...
## $ V3: int 1959 360 342 406 396 420 472 548 559 463 ...
## $ V4: int 1960 417 391 419 461 472 535 622 606 508 ...
trav <- read.table(URL,sep=",",header=TRUE)
trav
## Month X1958 X1959 X1960
## 1 JAN 340 360 417
## 2 FEB 318 342 391
## 3 MAR 362 406 419
## 4 APR 348 396 461
## 5 MAY 363 420 472
## 6 JUN 435 472 535
## 7 JUL 491 548 622
## 8 AUG 505 559 606
## 9 SEP 404 463 508
## 10 OCT 359 407 461
## 11 NOV 310 362 390
## 12 DEC 337 405 432
str(trav)
## 'data.frame': 12 obs. of 4 variables:
## $ Month: chr "JAN" "FEB" "MAR" "APR" ...
## $ X1958: int 340 318 362 348 363 435 491 505 404 359 ...
## $ X1959: int 360 342 406 396 420 472 548 559 463 407 ...
## $ X1960: int 417 391 419 461 472 535 622 606 508 461 ...
trav <- read.csv(URL)
trav
## Month X1958 X1959 X1960
## 1 JAN 340 360 417
## 2 FEB 318 342 391
## 3 MAR 362 406 419
## 4 APR 348 396 461
## 5 MAY 363 420 472
## 6 JUN 435 472 535
## 7 JUL 491 548 622
## 8 AUG 505 559 606
## 9 SEP 404 463 508
## 10 OCT 359 407 461
## 11 NOV 310 362 390
## 12 DEC 337 405 432
str(trav)
## 'data.frame': 12 obs. of 4 variables:
## $ Month: chr "JAN" "FEB" "MAR" "APR" ...
## $ X1958: int 340 318 362 348 363 435 491 505 404 359 ...
## $ X1959: int 360 342 406 396 420 472 548 559 463 407 ...
## $ X1960: int 417 391 419 461 472 535 622 606 508 461 ...
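You may have noticed that the year columns came out as X1958, X1959 and X1960. R does this because a column name that starts with a digit is not a valid variable name. If you would rather keep the names exactly as they appear in the file, check.names = FALSE turns this off. The sketch uses a few lines of inline text in place of the real file, so the data here is a simplified stand-in.

```r
# A simplified stand-in for the file contents
csv_text <- "Month,1958,1959\nJAN,340,360\nFEB,318,342"

# Default behaviour: names are made syntactically valid
default_names <- read.csv(text = csv_text)
colnames(default_names)

# Keep the names exactly as written in the header
kept_names <- read.csv(text = csv_text, check.names = FALSE)
colnames(kept_names)
```

With the original names kept, the columns need backticks or double square brackets to be accessed, for example kept_names[["1958"]].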
Now let’s try a Microsoft Excel file.
URL="https://github.com/markziemann/SLE712_files/blob/master/misc/file_example_XLS_10.xls?raw=true"
NAME="file_example_XLS_10.xls"
download.file(URL,destfile=NAME)
library("readxl")
mydata <- read_xls(NAME)
mydata
## # A tibble: 9 × 8
## `0` `First Name` `Last Name` Gender Country Age Date Id
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Dulce Abril Female United States 32 15/10/2017 1562
## 2 2 Mara Hashimoto Female Great Britain 25 16/08/2016 1582
## 3 3 Philip Gent Male France 36 21/05/2015 2587
## 4 4 Kathleen Hanner Female United States 25 15/10/2017 3549
## 5 5 Nereida Magwood Female United States 58 16/08/2016 2468
## 6 6 Gaston Brumm Male United States 24 21/05/2015 2554
## 7 7 Etta Hurn Female Great Britain 56 15/10/2017 3598
## 8 8 Earlean Melgar Female United States 27 16/08/2016 2456
## 9 9 Vincenza Weiland Female United States 40 21/05/2015 6548
str(mydata)
## tibble [9 × 8] (S3: tbl_df/tbl/data.frame)
## $ 0 : num [1:9] 1 2 3 4 5 6 7 8 9
## $ First Name: chr [1:9] "Dulce" "Mara" "Philip" "Kathleen" ...
## $ Last Name : chr [1:9] "Abril" "Hashimoto" "Gent" "Hanner" ...
## $ Gender : chr [1:9] "Female" "Female" "Male" "Female" ...
## $ Country : chr [1:9] "United States" "Great Britain" "France" "United States" ...
## $ Age : num [1:9] 32 25 36 25 58 24 56 27 40
## $ Date : chr [1:9] "15/10/2017" "16/08/2016" "21/05/2015" "15/10/2017" ...
## $ Id : num [1:9] 1562 1582 2587 3549 2468 ...
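In the output above the Date column has been read as text (chr), so R cannot yet sort it or do date arithmetic with it. as.Date() can convert it, provided we describe the layout with a format string. The sketch below uses three of the dates shown above in a plain character vector, so it runs without the Excel file; the object names are chosen for this example.

```r
# Dates written as day/month/year, as in the spreadsheet
date_text <- c("15/10/2017", "16/08/2016", "21/05/2015")

# %d = day, %m = month, %Y = four digit year
parsed <- as.Date(date_text, format = "%d/%m/%Y")
parsed
class(parsed)

# Date objects can be sorted and subtracted
sort(parsed)
parsed[1] - parsed[3]
```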
When working in R, it is convenient to save the session with save.image(). This results in an Rdata file which contains all the data objects in your current environment. To test that it’s actually working, clear your environment with the sweep/broom icon (or with rm(list=ls()), as below) and then load the Rdata file with load().
save.image("mysession.Rdata")
rm(list=ls())
load("mysession.Rdata")
That is really cool, but sometimes we want to save individual objects, such as a large data frame, as Rds files.
saveRDS(object = mymatrix , file = "mymatrix.Rds")
rm(list=ls())
x <- readRDS("mymatrix.Rds")
head(x)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -5.362961 16.81304 -19.886142 15.674846 19.756765
## [2,] 22.459162 25.87460 -5.080414 22.889874 0.838587
## [3,] 11.705377 10.50806 -15.861401 15.027250 23.816758
## [4,] 27.744727 12.60770 9.414938 -5.459122 21.627774
## [5,] 16.224883 -10.96459 16.466131 20.876458 16.610439
## [6,] 1.181048 23.20330 9.708798 30.959470 12.142181
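One practical difference between the two approaches: load() restores objects under the names they had when saved, whereas readRDS() returns the object so you choose the name yourself. The sketch below does a round trip through a temporary file and checks that nothing was lost; the matrix and the file name are made up for this example.

```r
# A small matrix with row and column names
original <- matrix(1:6, nrow = 2,
                   dimnames = list(c("a", "b"), c("x", "y", "z")))

# Save to a temporary file and read it back under a new name
rds_file <- file.path(tempdir(), "example_matrix.Rds")
saveRDS(original, file = rds_file)
restored <- readRDS(rds_file)

# The restored object is an exact copy, including the names
identical(original, restored)
```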
For the TSV file located here: https://raw.githubusercontent.com/markziemann/SLE712_files/master/misc/mydata.tsv
Read it in and show the first 6 rows of data.
Calculate the column and row means.
Use the cor() command to find the correlation coefficients between the 3 data sets. Which two datasets are the most similar?
Make a pairs plot of the three datasets.
Save the script. Clear the environment and execute it to confirm that it is working.
For reproducibility.
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.4.3 gplots_3.1.3.1 beeswarm_0.4.0
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.2 knitr_1.45 rlang_1.1.3
## [5] xfun_0.42 highr_0.10 KernSmooth_2.23-22 jsonlite_1.8.8
## [9] gtools_3.9.5 glue_1.7.0 htmltools_0.5.7 sass_0.4.8
## [13] fansi_1.0.6 rmarkdown_2.25 cellranger_1.1.0 evaluate_0.23
## [17] jquerylib_0.1.4 caTools_1.18.2 tibble_3.2.1 bitops_1.0-7
## [21] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4 compiler_4.3.3
## [25] pkgconfig_2.0.3 digest_0.6.34 R6_2.5.1 utf8_1.2.4
## [29] pillar_1.9.0 magrittr_2.0.3 bslib_0.6.1 tools_4.3.3
## [33] cachem_1.0.8