It Takes 2 Lines of R Code to Discover Interesting Biology

The following biological phenomenon demonstrates just how elegant R code can be.

In vertebrate genomes, a methyl group (-CH3) can be added to nucleotides. Such process of methylation is commonly associated with gene suppression. Most of the cytosines in the system are methylated (apart from structures called CpG islands). If a spontaneous event of deamination (removal of amine group) occurs on cytosine, different outcomes happen depending on the methylation state. If cytosine was not methylated, it is then converted to uracil nucleotide, which is not a constituent member of DNA sequences and is quickly repaired by the system. However, if the cytosine is methylated, conversion to another nucleotide occurs – thymine. Since thymine is natural part of each DNA sequence, system cannot recognise this change. Thus in vertebrate genomes, we see much lower proportion of CpG dinucleotides than expected.

It takes 2 lines of R code to discover this phenomenon.

## input is the control sample in ChIP-seq experiment (sequences of un-chipped DNA)
inpt <- readAligned(dirPath="/YourPathToInputFiles/", pattern="Input.bam", type="BAM")

## divide sequence data into dinucleotides and calculate their frequency
ntab <- table(unlist(lapply(sread(inpt)[sample(1:length(inpt), 10000)], function(i) strsplit(gsub("(..)", "\\1_", i), split="_")[[1]])))

You may have also some unexpected nucleotides (>16 permutations) that contain “N”. These are the bases that could not be assigned by sequencing machine.
Finally, let’s visualise CpG dinucleotide depletion.

barplot(ntab[!grepl("N", names(ntab))], ylab="Frequency")

## readAligned is part of ShortRead package, to download visit
## table calculates string (dinucleotide) frequency
## we iterate over 10,000 randomly chosen sequences with lapply function
## sread can access sequence reads from the (binary) sequence alignment map
## strsplit splits the strings (sequences) based on pattern defined by split parameter
## gsub unlike sub replaces the pattern over all matches
## \\1 is a backreference to the grouping regular expression (..); it returns all two-character elements

You may wonder why CpG and not GpC? Cytosine in GpC is rarely methylated (

Edit: You can find the dataset used in this example at
Original research is presented at



ChIP-seq Analysis with Bioconductor

Often scientists are interested in finding genome-wide binding site of their protein of interest. R offers easy way to load and process the sequence files coming from ChIP-seq experiment. During the next weeks I’m going to present a pipeline that uses several Bioconductor packages which contain necessary functions for working on NGS datasets. To keep blog organised, posts are going to be divided into several parts. As content is uploaded, links will be made in this summary post for easier access.

1. Loading and quality control for sequence alignment maps
2. Filtering and coverage calculation
3. Peak finding using local lambda fit for Poisson distribution
4. Motif discovery
5. Motif scanning

So let’s start with the first topic – loading and QA of sequence reads.

## packages for loading BAM files</pre>
library(ShortRead); library(chipseq)

## loading alignment map
input <- readAligned(dirPath = "/Directory/", pattern = "Input.bam", type = "BAM")
chip <- readAligned(dirPath = "/Directory/", pattern = "Foxa2.bam", type = "BAM")

## content of AlignedRead class object

## base calling quality of the input and chip dataset
input_qa <- quality(input)
chip_qa <- quality(chip)


## convert quality encoding to integer matrix and plot the scores
input_matrix_qa <- as(input_qa[sample(1:length(input_qa), 10000)], "matrix")
chip_matrix_qa <- as(chip_qa[sample(1:length(chip_qa), 10000)], "matrix")

plot(apply(input_matrix_qa, 2, mean), pch = 16, col = "red", ylim = c(0, 40), ylab = "Phred Quality Score", xlab = "Base Position")
lines(apply(chip_matrix_qa, 2, mean), pch = 16, col = "black", type = "p")
legend("topright", c("chip", "input"), pch = 16, col = c("black", "red"), cex = 2, bty = "n")

## nucleotide frequency
sread_chip <- as.character(sread(chip)[sample(1:length(chip), 10000)])
sread_chip <-"rbind", lapply(sread_chip, function(i) strsplit(i, "")[[1]]))
sread_chip <- apply(sread_chip, 2, function(i) table(i)[c("A", "C", "G", "T")]) / 10000

barplot(sread_chip, xlim=c(0,50), ylab = "Nucleotide Distribution")
legend("right", fill=gray(1:4/4), c("A", "C", "G", "T"), bty = "n", cex = 1.5)