Source: https://github.com/markziemann/SLE712_files/blob/master/BioinfoPrac5.Rmd
In this practical we will use all the foundational skills we have been developing over the past four sessions to solve bioinformatics based problems.
Specifically, we will:
Download sequence files, unzip them and read them into R.
Perform simple manipulations and analyses with sequence data.
Use provided functions to calculate amino acid and nucleotide frequency.
Use provided functions to calculate codon usage bias.
Use provided tools to calculate protein sequence k-mer frequency.
suppressPackageStartupMessages({
library("seqinr") # is a package designed to process and analyse sequence data.
library("kableExtra") # for making html tables
library("R.utils") # general utilities like zip and unzip
})
We will be downloading E. coli genomic sequence data from Ensembl. Ensembl is one of the biggest databases of genomic data. There are many E. coli strains available in Ensembl, but we will focus today on “Escherichia coli str. K-12 substr. MG1655 str. K12 (GCA_000005845), (TaxID 511145)”. The sequence data for this strain is available here.
Begin by downloading the coding sequences of this bacterium, which can be found in the “cds” folder.
Right click the file and “copy link address”, paste it so you save
the location as URL
.
We will read the sequence into R with the seqinr
package.
Use ?seqinr
to look at the documentation for this
package.
library("R.utils")
URL="http://ftp.ensemblgenomes.org/pub/bacteria/release-53/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/cds/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.cds.all.fa.gz"
unlink("ecoli_cds.fa.gz")
download.file(URL,destfile="ecoli_cds.fa.gz")
unlink("ecoli_cds.fa")
gunzip("ecoli_cds.fa.gz")
list.files()
## [1] "assessment_task3" "BioinfoPrac1.html"
## [3] "BioinfoPrac1.Rmd" "BioinfoPrac2.html"
## [5] "BioinfoPrac2.Rmd" "BioinfoPrac3_takehomeQustions.Rmd"
## [7] "BioinfoPrac3.html" "BioinfoPrac3.Rmd"
## [9] "BioinfoPrac4.html" "BioinfoPrac4.Rmd"
## [11] "BioinfoPrac5.html" "BioinfoPrac5.Rmd"
## [13] "ecoli_cds.fa" "example1_files"
## [15] "example1.html" "example1.Rmd"
## [17] "file_example_XLS_10.xls" "gene_expression"
## [19] "growth_data.csv" "images"
## [21] "misc" "my.tsv"
## [23] "mymatrix.Rds" "myplot.pdf"
## [25] "myplot.png" "mysession.Rdata"
## [27] "pipette_test.tsv" "README.md"
## [29] "SLE712_files.Rproj" "trash"
library("seqinr")
cds <- seqinr::read.fasta("ecoli_cds.fa")
str(head(cds))
## List of 6
## $ AAC73112: 'SeqFastadna' chr [1:66] "a" "t" "g" "a" ...
## ..- attr(*, "name")= chr "AAC73112"
## ..- attr(*, "Annot")= chr ">AAC73112 cds chromosome:ASM584v2:Chromosome:190:255:1 gene:b0001 gene_biotype:protein_coding transcript_biotyp"| __truncated__
## $ AAC73113: 'SeqFastadna' chr [1:2463] "a" "t" "g" "c" ...
## ..- attr(*, "name")= chr "AAC73113"
## ..- attr(*, "Annot")= chr ">AAC73113 cds chromosome:ASM584v2:Chromosome:337:2799:1 gene:b0002 gene_biotype:protein_coding transcript_bioty"| __truncated__
## $ AAC73114: 'SeqFastadna' chr [1:933] "a" "t" "g" "g" ...
## ..- attr(*, "name")= chr "AAC73114"
## ..- attr(*, "Annot")= chr ">AAC73114 cds chromosome:ASM584v2:Chromosome:2801:3733:1 gene:b0003 gene_biotype:protein_coding transcript_biot"| __truncated__
## $ AAC73115: 'SeqFastadna' chr [1:1287] "a" "t" "g" "a" ...
## ..- attr(*, "name")= chr "AAC73115"
## ..- attr(*, "Annot")= chr ">AAC73115 cds chromosome:ASM584v2:Chromosome:3734:5020:1 gene:b0004 gene_biotype:protein_coding transcript_biot"| __truncated__
## $ AAC73116: 'SeqFastadna' chr [1:297] "g" "t" "g" "a" ...
## ..- attr(*, "name")= chr "AAC73116"
## ..- attr(*, "Annot")= chr ">AAC73116 cds chromosome:ASM584v2:Chromosome:5234:5530:1 gene:b0005 gene_biotype:protein_coding transcript_biot"| __truncated__
## $ AAC73117: 'SeqFastadna' chr [1:777] "a" "t" "g" "c" ...
## ..- attr(*, "name")= chr "AAC73117"
## ..- attr(*, "Annot")= chr ">AAC73117 cds chromosome:ASM584v2:Chromosome:5683:6459:-1 gene:b0006 gene_biotype:protein_coding transcript_bio"| __truncated__
Now we can figure out the number of coding sequences.
length(cds)
## [1] 4239
Then we can determine the total sequence length.
Here are two approaches.
head(summary(cds))
## Length Class Mode
## AAC73112 66 SeqFastadna character
## AAC73113 2463 SeqFastadna character
## AAC73114 933 SeqFastadna character
## AAC73115 1287 SeqFastadna character
## AAC73116 297 SeqFastadna character
## AAC73117 777 SeqFastadna character
str(summary(cds))
## 'summaryDefault' chr [1:4239, 1:3] " 66" "2463" " 933" "1287" " 297" " 777" "1431" " 954" " 588" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:4239] "AAC73112" "AAC73113" "AAC73114" "AAC73115" ...
## ..$ : chr [1:3] "Length" "Class" "Mode"
head(summary(cds)[,1])
## AAC73112 AAC73113 AAC73114 AAC73115 AAC73116 AAC73117
## " 66" "2463" " 933" "1287" " 297" " 777"
len <- as.numeric(summary(cds)[,1])
sum(len)
## [1] 3978528
len2 <- sapply(X=cds,FUN=length)
sum(len2)
## [1] 3978528
We can look at the distribution of coding sequence length with different types of plots.
mean(len)
## [1] 938.5534
median(len)
## [1] 831
hist(len)
hist(len,xlab="sequence length (bp)",ylab="frequency",main="histogram of coding sequence length")
boxplot(len)
boxplot(len,ylab="sequence length (bp)")
Composition of DNA sequences.
First with one sequence, then with the whole set of coding sequences.
use ?GC
?count
and ?summary
to
see the documentation for those seqinr
commands.
GC(cds[[1]])
## [1] 0.5151515
count(cds[[1]],1)
##
## a c g t
## 21 22 12 11
count(cds[[1]],2)
##
## aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
## 3 10 2 5 11 7 3 1 2 4 4 2 4 1 3 3
count(cds[[1]],3)
##
## aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg cct cga cgc
## 1 2 0 0 1 7 2 0 0 1 1 0 0 1 1 3 0 6 1 4 7 0 0 0 0 1
## cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat
## 2 0 0 0 1 0 1 0 0 0 2 0 1 1 0 1 1 2 1 0 1 0 1 2 1 0
## tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt
## 1 0 0 0 2 1 0 0 3 0 0 0
summary(cds[1:3])
## Length Class Mode
## AAC73112 66 SeqFastadna character
## AAC73113 2463 SeqFastadna character
## AAC73114 933 SeqFastadna character
sum(sapply(cds[1:3],length))
## [1] 3462
length(unlist(cds[1:3]))
## [1] 3462
dna <- unlist(cds)
GC(dna)
## [1] 0.5193114
dna_composition <- count(dna,1)
barplot(dna_composition,xlab="nucleotides",ylab="frequency", main="E coli CDS composition")
dna_composition/sum(dna_composition)
##
## a c g t
## 0.2402316 0.2457175 0.2735939 0.2404570
dna_proportion <- dna_composition/sum(dna_composition)
barplot(dna_proportion,xlab="nucleotides",ylab="proportion", main="E coli CDS composition")
grid()
Now perform similar analysis for protein sequences. First we need to translate the sequences.
translate(cds[[1]])
## [1] "M" "K" "R" "I" "S" "T" "T" "I" "T" "T" "T" "I" "T" "I" "T" "T" "G" "N" "G" "A" "G" "*"
prot <- lapply(cds,translate)
# define the amino acid alphabet
aa <- unique(prot[[2]])
aa <- aa[aa != "*"]
length(aa)
## [1] 20
count(prot[[1]],wordsize=1,alphabet=aa)
##
## A C D E F G H I K L M N P Q R S T V W Y
## 1 0 0 0 0 3 0 4 1 0 1 1 0 0 1 1 8 0 0 0
count(prot[[2]],wordsize=1,alphabet=aa)
##
## A C D E F G H I K L M N P Q R S T V W Y
## 91 12 44 53 30 63 14 46 34 89 23 38 29 30 46 51 34 69 4 20
Can you figure out how to get the total number of each amino acid coded by E. coli? Work in groups of 2-3 and make a barplot of the results.
A codon is a three base triplet that codes for a particular amino acid. As there are 64 possible codons and just 20 amino acids (and 3 termination codons), some amino acids are coded by multiple codons. For example, leucine is coded by six codons.
seqinr
has a command called uco
which
calculates some codon usage information.
Let’s have a look at how it can calculate codon usage for a single protein sequence.
uco(cds[[2]])
##
## aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg cct cga cgc
## 22 16 12 22 2 19 8 5 0 12 2 3 1 15 23 30 11 6 19 8 2 6 18 3 3 19
## cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat
## 4 18 2 13 43 8 40 14 13 30 14 36 26 15 9 22 10 22 5 18 27 19 0 8 0 12
## tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt
## 6 10 9 11 1 9 4 3 10 19 13 11
uco(cds[[2]],index="rscu")
## aaa aac aag aat aca acc acg act aga
## 1.29411765 0.84210526 0.70588235 1.15789474 0.23529412 2.23529412 0.94117647 0.58823529 0.00000000
## agc agg agt ata atc atg att caa cac
## 1.41176471 0.26086957 0.35294118 0.06521739 0.97826087 1.00000000 1.95652174 0.73333333 0.85714286
## cag cat cca ccc ccg cct cga cgc cgg
## 1.26666667 1.14285714 0.27586207 0.82758621 2.48275862 0.41379310 0.39130435 2.47826087 0.52173913
## cgt cta ctc ctg ctt gaa gac gag gat
## 2.34782609 0.13483146 0.87640449 2.89887640 0.53932584 1.50943396 0.63636364 0.49056604 1.36363636
## gca gcc gcg gct gga ggc ggg ggt gta
## 0.61538462 1.58241758 1.14285714 0.65934066 0.57142857 1.39682540 0.63492063 1.39682540 0.28985507
## gtc gtg gtt taa tac tag tat tca tcc
## 1.04347826 1.56521739 1.10144928 0.00000000 0.80000000 0.00000000 1.20000000 0.70588235 1.17647059
## tcg tct tga tgc tgg tgt tta ttc ttg
## 1.05882353 1.29411765 3.00000000 1.50000000 1.00000000 0.50000000 0.67415730 1.26666667 0.87640449
## ttt
## 0.73333333
uco(cds[[2]],index="rscu",as.data.frame=TRUE)
## AA codon eff freq RSCU
## aaa Lys aaa 22 0.026796590 1.29411765
## aac Asn aac 16 0.019488429 0.84210526
## aag Lys aag 12 0.014616322 0.70588235
## aat Asn aat 22 0.026796590 1.15789474
## aca Thr aca 2 0.002436054 0.23529412
## acc Thr acc 19 0.023142509 2.23529412
## acg Thr acg 8 0.009744214 0.94117647
## act Thr act 5 0.006090134 0.58823529
## aga Arg aga 0 0.000000000 0.00000000
## agc Ser agc 12 0.014616322 1.41176471
## agg Arg agg 2 0.002436054 0.26086957
## agt Ser agt 3 0.003654080 0.35294118
## ata Ile ata 1 0.001218027 0.06521739
## atc Ile atc 15 0.018270402 0.97826087
## atg Met atg 23 0.028014616 1.00000000
## att Ile att 30 0.036540804 1.95652174
## caa Gln caa 11 0.013398295 0.73333333
## cac His cac 6 0.007308161 0.85714286
## cag Gln cag 19 0.023142509 1.26666667
## cat His cat 8 0.009744214 1.14285714
## cca Pro cca 2 0.002436054 0.27586207
## ccc Pro ccc 6 0.007308161 0.82758621
## ccg Pro ccg 18 0.021924482 2.48275862
## cct Pro cct 3 0.003654080 0.41379310
## cga Arg cga 3 0.003654080 0.39130435
## cgc Arg cgc 19 0.023142509 2.47826087
## cgg Arg cgg 4 0.004872107 0.52173913
## cgt Arg cgt 18 0.021924482 2.34782609
## cta Leu cta 2 0.002436054 0.13483146
## ctc Leu ctc 13 0.015834348 0.87640449
## ctg Leu ctg 43 0.052375152 2.89887640
## ctt Leu ctt 8 0.009744214 0.53932584
## gaa Glu gaa 40 0.048721072 1.50943396
## gac Asp gac 14 0.017052375 0.63636364
## gag Glu gag 13 0.015834348 0.49056604
## gat Asp gat 30 0.036540804 1.36363636
## gca Ala gca 14 0.017052375 0.61538462
## gcc Ala gcc 36 0.043848965 1.58241758
## gcg Ala gcg 26 0.031668697 1.14285714
## gct Ala gct 15 0.018270402 0.65934066
## gga Gly gga 9 0.010962241 0.57142857
## ggc Gly ggc 22 0.026796590 1.39682540
## ggg Gly ggg 10 0.012180268 0.63492063
## ggt Gly ggt 22 0.026796590 1.39682540
## gta Val gta 5 0.006090134 0.28985507
## gtc Val gtc 18 0.021924482 1.04347826
## gtg Val gtg 27 0.032886724 1.56521739
## gtt Val gtt 19 0.023142509 1.10144928
## taa Stp taa 0 0.000000000 0.00000000
## tac Tyr tac 8 0.009744214 0.80000000
## tag Stp tag 0 0.000000000 0.00000000
## tat Tyr tat 12 0.014616322 1.20000000
## tca Ser tca 6 0.007308161 0.70588235
## tcc Ser tcc 10 0.012180268 1.17647059
## tcg Ser tcg 9 0.010962241 1.05882353
## tct Ser tct 11 0.013398295 1.29411765
## tga Stp tga 1 0.001218027 3.00000000
## tgc Cys tgc 9 0.010962241 1.50000000
## tgg Trp tgg 4 0.004872107 1.00000000
## tgt Cys tgt 3 0.003654080 0.50000000
## tta Leu tta 10 0.012180268 0.67415730
## ttc Phe ttc 19 0.023142509 1.26666667
## ttg Leu ttg 13 0.015834348 0.87640449
## ttt Phe ttt 11 0.013398295 0.73333333
The RSCU stands for relative synonymous codon usage, which is a measure as to how preferred a codon of a particular amino acid is compared to the other codons which also encode it.
Can you think of an approach to calculate codon usage bias for the total coding sequence of E. coli?
K-mer is simply a short sequence string of length k
. We
can use k-mer analysis to identify particular sequences that are over-
or under-represented in an organism.
prots <- unlist(prot)
mycounts <- count(prots,wordsize=3,alphabet=aa)
str(mycounts)
## 'table' int [1:8000(1d)] 1338 182 568 749 481 1075 222 863 580 1594 ...
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:8000] "AAA" "AAC" "AAD" "AAE" ...
head(mycounts)
##
## AAA AAC AAD AAE AAF AAG
## 1338 182 568 749 481 1075
myfreq <- count(prots,wordsize=3,alphabet=aa,freq=TRUE)
Using code, can you identify the top 6 most over- and under-represented triplet sequences?
Read the instructions and the rubric carefully.
Don’t forget the README. The README should communicate what the repo is for, what the contents are and how the programs should be run. Check out some good READMEs:
Work as a team. There needs to be commits to the code and documentation from each group member. Each member also needs to be involved with the issues; opening them, discussing them and making the necessary code changes.
At the start of the coding session, you should “pull” changes that may have been made by your group-mates. If you don’t pull, then you will be working on the older version of the code and it will cause problems when you want to push your new changes.
When preparing the answers to the questions, ensure that
knit
works properly before you commit and push the changes
to the repo. It can take a long time to troubleshoot errors if you leave
this step to the end.
Remove any unnecessary code. Keeping redundant bits of code is a sign that you don’t understand the purpose of each line of code.
Minimise output data. If you are working with a large dataset
(like 10,000 rows), you don’t want to show every data point in the
report as it would occupy many pages and the reader is not interested in
that. Use commands like str()
and head()
to
reduce the amount of data shown. In short, make the reading process easy
for the audience.
Give the context. Include the question that you are answering, as well as a description of how you went about solving the problem. This will ensure that you (or someone else) will be able to read and understand the assignment even after many years after you finish the course.
Customise the theme to your preferences. Here are some setting which I think could be good defaults. More themes are found here.
---
title: "My Title"
author: "Bob and April"
date: "2022-11-24"
output:
html_document:
toc: yes
fig_width: 7
fig_height: 7
theme: spacelab
---
head(mtcars) %>%
kbl(caption = "mtcars data table") %>%
kable_paper(full_width = F, html_font = "Arial")
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
You can include a caption for each figure according to this guide.
If you used any specific sources in your report, you need to provide a bibliography and cite those sources in the appropriate sections. You are free to use any of the common referencing styles.
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
## [4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] R.utils_2.12.0 R.oo_1.25.0 R.methodsS3_1.8.2 kableExtra_1.3.4 seqinr_4.2-16
## [6] readxl_1.4.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 matrixStats_0.62.0 bit64_4.0.5
## [4] webshot_0.5.4 RColorBrewer_1.1-3 httr_1.4.4
## [7] GenomeInfoDb_1.32.2 tools_4.2.2 bslib_0.4.0
## [10] utf8_1.2.2 R6_2.5.1 DBI_1.1.3
## [13] BiocGenerics_0.42.0 colorspace_2.0-3 ade4_1.7-19
## [16] withr_2.5.0 tidyselect_1.2.0 DESeq2_1.36.0
## [19] downlit_0.4.2 bit_4.0.4 compiler_4.2.2
## [22] cli_3.4.1 rvest_1.0.3 Biobase_2.56.0
## [25] xml2_1.3.3 DelayedArray_0.22.0 sass_0.4.2
## [28] scales_1.2.1 genefilter_1.78.0 systemfonts_1.0.4
## [31] stringr_1.4.1 digest_0.6.30 rmarkdown_2.17
## [34] svglite_2.1.0 XVector_0.36.0 pkgconfig_2.0.3
## [37] htmltools_0.5.3 MatrixGenerics_1.8.1 fastmap_1.1.0
## [40] highr_0.9 rlang_1.0.6 rstudioapi_0.14
## [43] RSQLite_2.2.18 jquerylib_0.1.4 generics_0.1.3
## [46] jsonlite_1.8.3 BiocParallel_1.30.3 dplyr_1.0.10
## [49] RCurl_1.98-1.9 magrittr_2.0.3 GenomeInfoDbData_1.2.8
## [52] Matrix_1.5-1 Rcpp_1.0.9 munsell_0.5.0
## [55] S4Vectors_0.34.0 fansi_1.0.3 lifecycle_1.0.3
## [58] stringi_1.7.8 yaml_2.3.6 MASS_7.3-58.1
## [61] SummarizedExperiment_1.26.1 zlibbioc_1.42.0 grid_4.2.2
## [64] blob_1.2.3 parallel_4.2.2 crayon_1.5.2
## [67] lattice_0.20-45 Biostrings_2.64.0 splines_4.2.2
## [70] annotate_1.74.0 KEGGREST_1.36.3 locfit_1.5-9.6
## [73] knitr_1.40 pillar_1.8.1 GenomicRanges_1.48.0
## [76] geneplotter_1.74.0 codetools_0.2-18 stats4_4.2.2
## [79] XML_3.99-0.12 glue_1.6.2 evaluate_0.17
## [82] png_0.1-7 vctrs_0.5.0 cellranger_1.1.0
## [85] gtable_0.3.1 assertthat_0.2.1 cachem_1.0.6
## [88] ggplot2_3.3.6 xfun_0.34 xtable_1.8-4
## [91] survival_3.4-0 viridisLite_0.4.1 tibble_3.1.8
## [94] AnnotationDbi_1.58.0 memoise_2.0.1 IRanges_2.30.0