Bioinformatics prac 5: working with sequence data in R

Introduction
Loading E. coli sequence data
Basic sequence analysis
Sequence composition
Codon usage
K-mer profiling
Additional tips for putting together a winning assignment
Session info

Source: https://github.com/markziemann/SLE712_files/blob/master/BioinfoPrac5.Rmd

Introduction

In this practical we will use all the foundational skills we have been developing over the past four sessions to solve bioinformatics based problems.

Specifically, we will:

Download sequence files, unzip them and read them into R.
Perform simple manipulations and analyses with sequence data.
Use provided functions to calculate amino acid and nucleotide frequency.
Use provided functions to calculate codon usage bias.
Use provided tools to calculate protein sequence k-mer frequency.

suppressPackageStartupMessages({
  library("seqinr") # is a package designed to process and analyse sequence data.
  library("kableExtra") # for making html tables
  library("R.utils") # general utilities like zip and unzip
})

Loading E. coli sequence data

We will be downloading E. coli genomic sequence data from Ensembl. Ensembl is one of the biggest databases of genomic data. There are many E. coli strains available in Ensembl, but we will focus today on “Escherichia coli str. K-12 substr. MG1655 str. K12 (GCA_000005845), (TaxID 511145)”. The sequence data for this strain is available here.

Begin by downloading the coding sequences of this bacterium, which can be found in the “cds” folder.

Right click the file and “copy link address”, paste it so you save the location as URL.

We will read the sequence into R with the seqinr package.

Use ?seqinr to look at the documentation for this package.

library("R.utils")


URL="http://ftp.ensemblgenomes.org/pub/bacteria/release-53/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/cds/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.cds.all.fa.gz"
unlink("ecoli_cds.fa.gz")
download.file(URL,destfile="ecoli_cds.fa.gz")
unlink("ecoli_cds.fa")
gunzip("ecoli_cds.fa.gz")
list.files()

##  [1] "assessment_task3"                  "BioinfoPrac1.html"                
##  [3] "BioinfoPrac1.Rmd"                  "BioinfoPrac2.html"                
##  [5] "BioinfoPrac2.Rmd"                  "BioinfoPrac3_takehomeQustions.Rmd"
##  [7] "BioinfoPrac3.html"                 "BioinfoPrac3.Rmd"                 
##  [9] "BioinfoPrac4.html"                 "BioinfoPrac4.Rmd"                 
## [11] "BioinfoPrac5.html"                 "BioinfoPrac5.Rmd"                 
## [13] "ecoli_cds.fa"                      "example1_files"                   
## [15] "example1.html"                     "example1.Rmd"                     
## [17] "file_example_XLS_10.xls"           "gene_expression"                  
## [19] "growth_data.csv"                   "images"                           
## [21] "misc"                              "my.tsv"                           
## [23] "mymatrix.Rds"                      "myplot.pdf"                       
## [25] "myplot.png"                        "mysession.Rdata"                  
## [27] "pipette_test.tsv"                  "README.md"                        
## [29] "SLE712_files.Rproj"                "trash"

library("seqinr")
cds <- seqinr::read.fasta("ecoli_cds.fa")
str(head(cds))

## List of 6
##  $ AAC73112: 'SeqFastadna' chr [1:66] "a" "t" "g" "a" ...
##   ..- attr(*, "name")= chr "AAC73112"
##   ..- attr(*, "Annot")= chr ">AAC73112 cds chromosome:ASM584v2:Chromosome:190:255:1 gene:b0001 gene_biotype:protein_coding transcript_biotyp"| __truncated__
##  $ AAC73113: 'SeqFastadna' chr [1:2463] "a" "t" "g" "c" ...
##   ..- attr(*, "name")= chr "AAC73113"
##   ..- attr(*, "Annot")= chr ">AAC73113 cds chromosome:ASM584v2:Chromosome:337:2799:1 gene:b0002 gene_biotype:protein_coding transcript_bioty"| __truncated__
##  $ AAC73114: 'SeqFastadna' chr [1:933] "a" "t" "g" "g" ...
##   ..- attr(*, "name")= chr "AAC73114"
##   ..- attr(*, "Annot")= chr ">AAC73114 cds chromosome:ASM584v2:Chromosome:2801:3733:1 gene:b0003 gene_biotype:protein_coding transcript_biot"| __truncated__
##  $ AAC73115: 'SeqFastadna' chr [1:1287] "a" "t" "g" "a" ...
##   ..- attr(*, "name")= chr "AAC73115"
##   ..- attr(*, "Annot")= chr ">AAC73115 cds chromosome:ASM584v2:Chromosome:3734:5020:1 gene:b0004 gene_biotype:protein_coding transcript_biot"| __truncated__
##  $ AAC73116: 'SeqFastadna' chr [1:297] "g" "t" "g" "a" ...
##   ..- attr(*, "name")= chr "AAC73116"
##   ..- attr(*, "Annot")= chr ">AAC73116 cds chromosome:ASM584v2:Chromosome:5234:5530:1 gene:b0005 gene_biotype:protein_coding transcript_biot"| __truncated__
##  $ AAC73117: 'SeqFastadna' chr [1:777] "a" "t" "g" "c" ...
##   ..- attr(*, "name")= chr "AAC73117"
##   ..- attr(*, "Annot")= chr ">AAC73117 cds chromosome:ASM584v2:Chromosome:5683:6459:-1 gene:b0006 gene_biotype:protein_coding transcript_bio"| __truncated__

Basic sequence analysis

Now we can figure out the number of coding sequences.

length(cds)

## [1] 4239

Then we can determine the total sequence length.

Here are two approaches.

head(summary(cds))

##          Length Class       Mode     
## AAC73112   66   SeqFastadna character
## AAC73113 2463   SeqFastadna character
## AAC73114  933   SeqFastadna character
## AAC73115 1287   SeqFastadna character
## AAC73116  297   SeqFastadna character
## AAC73117  777   SeqFastadna character

str(summary(cds))

##  'summaryDefault' chr [1:4239, 1:3] "  66" "2463" " 933" "1287" " 297" " 777" "1431" " 954" " 588" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:4239] "AAC73112" "AAC73113" "AAC73114" "AAC73115" ...
##   ..$ : chr [1:3] "Length" "Class" "Mode"

head(summary(cds)[,1])

## AAC73112 AAC73113 AAC73114 AAC73115 AAC73116 AAC73117 
##   "  66"   "2463"   " 933"   "1287"   " 297"   " 777"

len <- as.numeric(summary(cds)[,1])

sum(len)

## [1] 3978528

len2 <- sapply(X=cds,FUN=length)

sum(len2)

## [1] 3978528

We can look at the distribution of coding sequence length with different types of plots.

mean(len)

## [1] 938.5534

median(len)

## [1] 831

hist(len)

hist(len,xlab="sequence length (bp)",ylab="frequency",main="histogram of coding sequence length")

boxplot(len)

boxplot(len,ylab="sequence length (bp)")

Sequence composition

Composition of DNA sequences.

First with one sequence, then with the whole set of coding sequences.

use ?GC ?count and ?summary to see the documentation for those seqinr commands.

GC(cds[[1]])

## [1] 0.5151515

count(cds[[1]],1)

## 
##  a  c  g  t 
## 21 22 12 11

count(cds[[1]],2)

## 
## aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
##  3 10  2  5 11  7  3  1  2  4  4  2  4  1  3  3

count(cds[[1]],3)

## 
## aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg cct cga cgc 
##   1   2   0   0   1   7   2   0   0   1   1   0   0   1   1   3   0   6   1   4   7   0   0   0   0   1 
## cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat 
##   2   0   0   0   1   0   1   0   0   0   2   0   1   1   0   1   1   2   1   0   1   0   1   2   1   0 
## tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt 
##   1   0   0   0   2   1   0   0   3   0   0   0

summary(cds[1:3])

##          Length Class       Mode     
## AAC73112   66   SeqFastadna character
## AAC73113 2463   SeqFastadna character
## AAC73114  933   SeqFastadna character

sum(sapply(cds[1:3],length))

## [1] 3462

length(unlist(cds[1:3]))

## [1] 3462

dna <- unlist(cds)

GC(dna)

## [1] 0.5193114

dna_composition <- count(dna,1)

barplot(dna_composition,xlab="nucleotides",ylab="frequency", main="E coli CDS composition")

dna_composition/sum(dna_composition)

## 
##         a         c         g         t 
## 0.2402316 0.2457175 0.2735939 0.2404570

dna_proportion <- dna_composition/sum(dna_composition)

barplot(dna_proportion,xlab="nucleotides",ylab="proportion", main="E coli CDS composition")

grid()

Now perform similar analysis for protein sequences. First we need to translate the sequences.

translate(cds[[1]])

##  [1] "M" "K" "R" "I" "S" "T" "T" "I" "T" "T" "T" "I" "T" "I" "T" "T" "G" "N" "G" "A" "G" "*"

prot <- lapply(cds,translate)

# define the amino acid alphabet
aa <- unique(prot[[2]])
aa <- aa[aa != "*"]
length(aa)

## [1] 20

count(prot[[1]],wordsize=1,alphabet=aa)

## 
## A C D E F G H I K L M N P Q R S T V W Y 
## 1 0 0 0 0 3 0 4 1 0 1 1 0 0 1 1 8 0 0 0

count(prot[[2]],wordsize=1,alphabet=aa)

## 
##  A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y 
## 91 12 44 53 30 63 14 46 34 89 23 38 29 30 46 51 34 69  4 20

Can you figure out how to get the total number of each amino acid coded by E. coli? Work in groups of 2-3 and make a barplot of the results.

Codon usage

A codon is a three base triplet that codes for a particular amino acid. As there are 64 possible codons and just 20 amino acids (and 3 termination codons), some amino acids are coded by multiple codons. For example, leucine is coded by six codons.

seqinr has a command called uco which calculates some codon usage information.

Let’s have a look at how it can calculate codon usage for a single protein sequence.

uco(cds[[2]])

## 
## aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat cca ccc ccg cct cga cgc 
##  22  16  12  22   2  19   8   5   0  12   2   3   1  15  23  30  11   6  19   8   2   6  18   3   3  19 
## cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat 
##   4  18   2  13  43   8  40  14  13  30  14  36  26  15   9  22  10  22   5  18  27  19   0   8   0  12 
## tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt 
##   6  10   9  11   1   9   4   3  10  19  13  11

uco(cds[[2]],index="rscu")

##        aaa        aac        aag        aat        aca        acc        acg        act        aga 
## 1.29411765 0.84210526 0.70588235 1.15789474 0.23529412 2.23529412 0.94117647 0.58823529 0.00000000 
##        agc        agg        agt        ata        atc        atg        att        caa        cac 
## 1.41176471 0.26086957 0.35294118 0.06521739 0.97826087 1.00000000 1.95652174 0.73333333 0.85714286 
##        cag        cat        cca        ccc        ccg        cct        cga        cgc        cgg 
## 1.26666667 1.14285714 0.27586207 0.82758621 2.48275862 0.41379310 0.39130435 2.47826087 0.52173913 
##        cgt        cta        ctc        ctg        ctt        gaa        gac        gag        gat 
## 2.34782609 0.13483146 0.87640449 2.89887640 0.53932584 1.50943396 0.63636364 0.49056604 1.36363636 
##        gca        gcc        gcg        gct        gga        ggc        ggg        ggt        gta 
## 0.61538462 1.58241758 1.14285714 0.65934066 0.57142857 1.39682540 0.63492063 1.39682540 0.28985507 
##        gtc        gtg        gtt        taa        tac        tag        tat        tca        tcc 
## 1.04347826 1.56521739 1.10144928 0.00000000 0.80000000 0.00000000 1.20000000 0.70588235 1.17647059 
##        tcg        tct        tga        tgc        tgg        tgt        tta        ttc        ttg 
## 1.05882353 1.29411765 3.00000000 1.50000000 1.00000000 0.50000000 0.67415730 1.26666667 0.87640449 
##        ttt 
## 0.73333333

uco(cds[[2]],index="rscu",as.data.frame=TRUE)

##      AA codon eff        freq       RSCU
## aaa Lys   aaa  22 0.026796590 1.29411765
## aac Asn   aac  16 0.019488429 0.84210526
## aag Lys   aag  12 0.014616322 0.70588235
## aat Asn   aat  22 0.026796590 1.15789474
## aca Thr   aca   2 0.002436054 0.23529412
## acc Thr   acc  19 0.023142509 2.23529412
## acg Thr   acg   8 0.009744214 0.94117647
## act Thr   act   5 0.006090134 0.58823529
## aga Arg   aga   0 0.000000000 0.00000000
## agc Ser   agc  12 0.014616322 1.41176471
## agg Arg   agg   2 0.002436054 0.26086957
## agt Ser   agt   3 0.003654080 0.35294118
## ata Ile   ata   1 0.001218027 0.06521739
## atc Ile   atc  15 0.018270402 0.97826087
## atg Met   atg  23 0.028014616 1.00000000
## att Ile   att  30 0.036540804 1.95652174
## caa Gln   caa  11 0.013398295 0.73333333
## cac His   cac   6 0.007308161 0.85714286
## cag Gln   cag  19 0.023142509 1.26666667
## cat His   cat   8 0.009744214 1.14285714
## cca Pro   cca   2 0.002436054 0.27586207
## ccc Pro   ccc   6 0.007308161 0.82758621
## ccg Pro   ccg  18 0.021924482 2.48275862
## cct Pro   cct   3 0.003654080 0.41379310
## cga Arg   cga   3 0.003654080 0.39130435
## cgc Arg   cgc  19 0.023142509 2.47826087
## cgg Arg   cgg   4 0.004872107 0.52173913
## cgt Arg   cgt  18 0.021924482 2.34782609
## cta Leu   cta   2 0.002436054 0.13483146
## ctc Leu   ctc  13 0.015834348 0.87640449
## ctg Leu   ctg  43 0.052375152 2.89887640
## ctt Leu   ctt   8 0.009744214 0.53932584
## gaa Glu   gaa  40 0.048721072 1.50943396
## gac Asp   gac  14 0.017052375 0.63636364
## gag Glu   gag  13 0.015834348 0.49056604
## gat Asp   gat  30 0.036540804 1.36363636
## gca Ala   gca  14 0.017052375 0.61538462
## gcc Ala   gcc  36 0.043848965 1.58241758
## gcg Ala   gcg  26 0.031668697 1.14285714
## gct Ala   gct  15 0.018270402 0.65934066
## gga Gly   gga   9 0.010962241 0.57142857
## ggc Gly   ggc  22 0.026796590 1.39682540
## ggg Gly   ggg  10 0.012180268 0.63492063
## ggt Gly   ggt  22 0.026796590 1.39682540
## gta Val   gta   5 0.006090134 0.28985507
## gtc Val   gtc  18 0.021924482 1.04347826
## gtg Val   gtg  27 0.032886724 1.56521739
## gtt Val   gtt  19 0.023142509 1.10144928
## taa Stp   taa   0 0.000000000 0.00000000
## tac Tyr   tac   8 0.009744214 0.80000000
## tag Stp   tag   0 0.000000000 0.00000000
## tat Tyr   tat  12 0.014616322 1.20000000
## tca Ser   tca   6 0.007308161 0.70588235
## tcc Ser   tcc  10 0.012180268 1.17647059
## tcg Ser   tcg   9 0.010962241 1.05882353
## tct Ser   tct  11 0.013398295 1.29411765
## tga Stp   tga   1 0.001218027 3.00000000
## tgc Cys   tgc   9 0.010962241 1.50000000
## tgg Trp   tgg   4 0.004872107 1.00000000
## tgt Cys   tgt   3 0.003654080 0.50000000
## tta Leu   tta  10 0.012180268 0.67415730
## ttc Phe   ttc  19 0.023142509 1.26666667
## ttg Leu   ttg  13 0.015834348 0.87640449
## ttt Phe   ttt  11 0.013398295 0.73333333

The RSCU stands for relative synonymous codon usage, which is a measure as to how preferred a codon of a particular amino acid is compared to the other codons which also encode it.

Can you think of an approach to calculate codon usage bias for the total coding sequence of E. coli?

K-mer profiling

K-mer is simply a short sequence string of length k. We can use k-mer analysis to identify particular sequences that are over- or under-represented in an organism.

prots <- unlist(prot)

mycounts <- count(prots,wordsize=3,alphabet=aa)

str(mycounts)

##  'table' int [1:8000(1d)] 1338 182 568 749 481 1075 222 863 580 1594 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:8000] "AAA" "AAC" "AAD" "AAE" ...

head(mycounts)

## 
##  AAA  AAC  AAD  AAE  AAF  AAG 
## 1338  182  568  749  481 1075

myfreq <- count(prots,wordsize=3,alphabet=aa,freq=TRUE)

Using code, can you identify the top 6 most over- and under-represented triplet sequences?

Additional tips for putting together a winning assignment

Read the instructions and the rubric carefully.
Don’t forget the README. The README should communicate what the repo is for, what the contents are and how the programs should be run. Check out some good READMEs:

Work as a team. There needs to be commits to the code and documentation from each group member. Each member also needs to be involved with the issues; opening them, discussing them and making the necessary code changes.
At the start of the coding session, you should “pull” changes that may have been made by your group-mates. If you don’t pull, then you will be working on the older version of the code and it will cause problems when you want to push your new changes.
When preparing the answers to the questions, ensure that knit works properly before you commit and push the changes to the repo. It can take a long time to troubleshoot errors if you leave this step to the end.
Remove any unnecessary code. Keeping redundant bits of code is a sign that you don’t understand the purpose of each line of code.
Minimise output data. If you are working with a large dataset (like 10,000 rows), you don’t want to show every data point in the report as it would occupy many pages and the reader is not interested in that. Use commands like str() and head() to reduce the amount of data shown. In short, make the reading process easy for the audience.
Give the context. Include the question that you are answering, as well as a description of how you went about solving the problem. This will ensure that you (or someone else) will be able to read and understand the assignment even after many years after you finish the course.
Customise the theme to your preferences. Here are some setting which I think could be good defaults. More themes are found here.

---
title: "My Title"
author: "Bob and April"
date: "2022-11-24"
output:
  html_document:
    toc: yes
    fig_width: 7
    fig_height: 7
theme: spacelab
---

Tables should be presented using kableExtra. There is a guide for it here. Here is an example which includes a caption for the table.

head(mtcars) %>%
  kbl(caption = "mtcars data table") %>%
  kable_paper(full_width = F, html_font = "Arial")

mtcars data table
	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

You can include a caption for each figure according to this guide.
If you used any specific sources in your report, you need to provide a bibliography and cite those sources in the appropriate sections. You are free to use any of the common referencing styles.

Session info

sessionInfo()

## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
##  [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] R.utils_2.12.0    R.oo_1.25.0       R.methodsS3_1.8.2 kableExtra_1.3.4  seqinr_4.2-16    
## [6] readxl_1.4.1     
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7                matrixStats_0.62.0          bit64_4.0.5                
##  [4] webshot_0.5.4               RColorBrewer_1.1-3          httr_1.4.4                 
##  [7] GenomeInfoDb_1.32.2         tools_4.2.2                 bslib_0.4.0                
## [10] utf8_1.2.2                  R6_2.5.1                    DBI_1.1.3                  
## [13] BiocGenerics_0.42.0         colorspace_2.0-3            ade4_1.7-19                
## [16] withr_2.5.0                 tidyselect_1.2.0            DESeq2_1.36.0              
## [19] downlit_0.4.2               bit_4.0.4                   compiler_4.2.2             
## [22] cli_3.4.1                   rvest_1.0.3                 Biobase_2.56.0             
## [25] xml2_1.3.3                  DelayedArray_0.22.0         sass_0.4.2                 
## [28] scales_1.2.1                genefilter_1.78.0           systemfonts_1.0.4          
## [31] stringr_1.4.1               digest_0.6.30               rmarkdown_2.17             
## [34] svglite_2.1.0               XVector_0.36.0              pkgconfig_2.0.3            
## [37] htmltools_0.5.3             MatrixGenerics_1.8.1        fastmap_1.1.0              
## [40] highr_0.9                   rlang_1.0.6                 rstudioapi_0.14            
## [43] RSQLite_2.2.18              jquerylib_0.1.4             generics_0.1.3             
## [46] jsonlite_1.8.3              BiocParallel_1.30.3         dplyr_1.0.10               
## [49] RCurl_1.98-1.9              magrittr_2.0.3              GenomeInfoDbData_1.2.8     
## [52] Matrix_1.5-1                Rcpp_1.0.9                  munsell_0.5.0              
## [55] S4Vectors_0.34.0            fansi_1.0.3                 lifecycle_1.0.3            
## [58] stringi_1.7.8               yaml_2.3.6                  MASS_7.3-58.1              
## [61] SummarizedExperiment_1.26.1 zlibbioc_1.42.0             grid_4.2.2                 
## [64] blob_1.2.3                  parallel_4.2.2              crayon_1.5.2               
## [67] lattice_0.20-45             Biostrings_2.64.0           splines_4.2.2              
## [70] annotate_1.74.0             KEGGREST_1.36.3             locfit_1.5-9.6             
## [73] knitr_1.40                  pillar_1.8.1                GenomicRanges_1.48.0       
## [76] geneplotter_1.74.0          codetools_0.2-18            stats4_4.2.2               
## [79] XML_3.99-0.12               glue_1.6.2                  evaluate_0.17              
## [82] png_0.1-7                   vctrs_0.5.0                 cellranger_1.1.0           
## [85] gtable_0.3.1                assertthat_0.2.1            cachem_1.0.6               
## [88] ggplot2_3.3.6               xfun_0.34                   xtable_1.8-4               
## [91] survival_3.4-0              viridisLite_0.4.1           tibble_3.1.8               
## [94] AnnotationDbi_1.58.0        memoise_2.0.1               IRanges_2.30.0