ggplot2 multiple boxplots with metadata

Recently I was asked for an advice of how to plot values with an additional attached condition separating the boxplots. This turns out to be ugly in base graphics, but amazingly simple in ggplot2.


# create fake dataset with additional attributes - sex, sample, and temperature
x <- data.frame(
 values = c(runif(100, min = -2), runif(100), runif(100, max = 2), runif(100)),
 sex = rep(c('M', 'F'), each = 100),
 sample = rep(c('sample_a', 'sample_b'), each = 200),
 temperature = sample(c('15C', '25C', '30C', '42C'), 400, replace = TRUE)

# compare different sample populations across various temperatures
ggplot(x, aes(x = sample, y = values, fill = sex)) +
geom_boxplot() +
facet_wrap(~ temperature)


It Takes 2 Lines of R Code to Discover Interesting Biology

The following biological phenomenon demonstrates just how elegant R code can be.

In vertebrate genomes, a methyl group (-CH3) can be added to nucleotides. Such process of methylation is commonly associated with gene suppression. Most of the cytosines in the system are methylated (apart from structures called CpG islands). If a spontaneous event of deamination (removal of amine group) occurs on cytosine, different outcomes happen depending on the methylation state. If cytosine was not methylated, it is then converted to uracil nucleotide, which is not a constituent member of DNA sequences and is quickly repaired by the system. However, if the cytosine is methylated, conversion to another nucleotide occurs – thymine. Since thymine is natural part of each DNA sequence, system cannot recognise this change. Thus in vertebrate genomes, we see much lower proportion of CpG dinucleotides than expected.

It takes 2 lines of R code to discover this phenomenon.

## input is the control sample in ChIP-seq experiment (sequences of un-chipped DNA)
inpt <- readAligned(dirPath="/YourPathToInputFiles/", pattern="Input.bam", type="BAM")

## divide sequence data into dinucleotides and calculate their frequency
ntab <- table(unlist(lapply(sread(inpt)[sample(1:length(inpt), 10000)], function(i) strsplit(gsub("(..)", "\\1_", i), split="_")[[1]])))

You may have also some unexpected nucleotides (>16 permutations) that contain “N”. These are the bases that could not be assigned by sequencing machine.
Finally, let’s visualise CpG dinucleotide depletion.

barplot(ntab[!grepl("N", names(ntab))], ylab="Frequency")

## readAligned is part of ShortRead package, to download visit
## table calculates string (dinucleotide) frequency
## we iterate over 10,000 randomly chosen sequences with lapply function
## sread can access sequence reads from the (binary) sequence alignment map
## strsplit splits the strings (sequences) based on pattern defined by split parameter
## gsub unlike sub replaces the pattern over all matches
## \\1 is a backreference to the grouping regular expression (..); it returns all two-character elements

You may wonder why CpG and not GpC? Cytosine in GpC is rarely methylated (

Edit: You can find the dataset used in this example at
Original research is presented at