Reconstructing Principal Component Analysis Matrix


Edit: I’ve updated afterPCA function to work with input of any dimensions. Previously it only generated correct output for square matrices.

PCA is widely used method for finding patterns in high-dimensional data. Whether you use it to compress large matrix or to remove one of the principal components in biological datasets, you’ll end up with the task of performing series of equations from linear algebra to reconstruct the matrix of original dimensions.

To find principal components, we first need to center the input matrix, and then calculate the eigenvalues and eigenvectors of its covariance matrix. To illustrate matrix reconstruction I’ll use 32×32 faceData matrix from Jeff Leek’s coursera course on data analysis.

# get the dataset from
# you probably want to use stats::prcomp for PCA on big matrices
runPCA <- function(mat = 'Unadjusted matrix') eigen(cov(apply(mat, 2, function(i) i - mean(i))))
pca <- runPCA(faceData)

# > str(pca)
# List of 2
# $ values : num [1:32] 18915 10067 4724 2506 1366 ...
# $ vectors: num [1:32, 1:32] -0.236 -0.261 -0.192 -0.164 -0.204 ...

First thing after doing PCA is to check the contributions of each PC in explaining the variance.

varExplained <- function(eigenList) {

par(mfrow = c(1,2))

 eigenList$value / sum(eigenList$value), pch = 21, col = 'black',
 bg = '#549cc4', ylim = c(0, 1), xlab = 'Principal Component',
 ylab = 'Variance Explained'
 ) + abline(h = 0.9)

 cumsum(eigenList$value) / sum(eigenList$value), pch = 21,
 col = 'black', bg = '#549cc4', ylim = c(0, 1), xlab = 'Principal Component',
 ylab = 'Cumulative Variance Explained'
 ) + abline(h = 0.9)



From these plots you can see that faceData has ~5 PC’s that cumulatively explain ~90% of total variance. Lets use this information to reconstruct the matrix, and compare it to the original one.

afterPCA <- function(

 matAdjust = 'Centered matrix',
 meanList = 'List of column means of original (unadjusted) matrix',
 sdList = 'List of column standard deviations of original (unadjusted) matrix',
 eigenList = 'List of eigenvalues and eigenvectors of adjust matrix covariance matrix',
 n = 'selected PC\'s',
 specific_select = 'If True: n == 1:n, if False: just n\'th columns') {

if (length(n) > ncol(matAdjust)) stop('N is higher than the number of PC\'s')
 if (!specific_select & length(n) > 1) stop('Use a single number when selecting up to n\'th PC')
 if (!specific_select) n <- 1:n

tmp_return <- (t(eigenList$vectors[,n] %*% (t(eigenList$vectors[,n]) %*% t(matAdjust))) * matrix(rep(sdList, each = nrow(matAdjust)), ncol = ncol(matAdjust))) + matrix(rep(meanList, each = nrow(matAdjust)), ncol = ncol(matAdjust))
 colnames(tmp_return) <- colnames(matAdjust)

# ColorBrewer palette
showMatrix <- function(x, ...) image(t(x[nrow(x):1,]), xaxt = 'none', yaxt = 'none', col = rev(colorRampPalette(brewer.pal(7, 'Blues'))(100)), ...)

reconstMatrix <- afterPCA(
 matAdjust = apply(faceData, 2, function(i) i - mean(i)),
 meanList = apply(faceData, 2, mean),
 eigenList = pca,
 n = 5,
 specific_select = FALSE

par(mfrow = c(1,2), mar = c(0, 0, 1, 0), bty = 'n')
showMatrix(faceData, main = 'Original Matrix')
showMatrix(reconstMatrix, main = 'First 5 PC\'s')


As seen from eigenvalues (variances), taking only 5/32 PC’s is enough to recreate face that has almost all of the features of the original matrix.

ggplot2 multiple boxplots with metadata

Recently I was asked for an advice of how to plot values with an additional attached condition separating the boxplots. This turns out to be ugly in base graphics, but amazingly simple in ggplot2.


# create fake dataset with additional attributes - sex, sample, and temperature
x <- data.frame(
 values = c(runif(100, min = -2), runif(100), runif(100, max = 2), runif(100)),
 sex = rep(c('M', 'F'), each = 100),
 sample = rep(c('sample_a', 'sample_b'), each = 200),
 temperature = sample(c('15C', '25C', '30C', '42C'), 400, replace = TRUE)

# compare different sample populations across various temperatures
ggplot(x, aes(x = sample, y = values, fill = sex)) +
geom_boxplot() +
facet_wrap(~ temperature)


Controlling heatmap colors with ggplot2

One of the most popular posts on this blog is the very first one, solving the issue of mapping certain ranges of values to particular colors in heatmaps. Given the abundance of ggplot2 usage in R plotting, I thought I’d give it a try and do similar job within the context of graphics grammar.

## required packages (plot, melt data frame, and rolling function)

## repeat random selection

## create 50x10 matrix of random values from [-1, +1]
random_matrix <- matrix(runif(500, min = -1, max = 1), nrow = 50)

## set color representation for specific values of the data distribution
quantile_range <- quantile(random_matrix, probs = seq(0, 1, 0.2))

## use to find optimal divergent color palette (or set own)
color_palette <- colorRampPalette(c("#3794bf", "#FFFFFF", "#df8640"))(length(quantile_range) - 1)

## prepare label text (use two adjacent values for range text)
label_text <- rollapply(round(quantile_range, 2), width = 2, by = 1, FUN = function(i) paste(i, collapse = " : "))

## discretize matrix; this is the most important step, where for each value we find category of predefined ranges (modify probs argument of quantile to detail the colors)
mod_mat <- matrix(findInterval(random_matrix, quantile_range, all.inside = TRUE), nrow = nrow(random_matrix))

## remove background and axis from plot
theme_change <- theme(
 plot.background = element_blank(),
 panel.grid.minor = element_blank(),
 panel.grid.major = element_blank(),
 panel.background = element_blank(),
 panel.border = element_blank(),
 axis.line = element_blank(),
 axis.ticks = element_blank(),
 axis.text.x = element_blank(),
 axis.text.y = element_blank(),
 axis.title.x = element_blank(),
 axis.title.y = element_blank()

## output the graphics
ggplot(melt(mod_mat), aes(x = X1, y = X2, fill = factor(value))) +
geom_tile(color = "black") +
scale_fill_manual(values = color_palette, name = "", labels = label_text) +

Result of the quantile color representation:

We can predefine ranges, and create skewed colorsets:
Trick was to discretize the matrix of continuous values. Alternatively, you can use “breaks” argument in functions such as scale_fill_gradientn, but such method will assign continuous list of colors within set range.


It Takes 2 Lines of R Code to Discover Interesting Biology

The following biological phenomenon demonstrates just how elegant R code can be.

In vertebrate genomes, a methyl group (-CH3) can be added to nucleotides. Such process of methylation is commonly associated with gene suppression. Most of the cytosines in the system are methylated (apart from structures called CpG islands). If a spontaneous event of deamination (removal of amine group) occurs on cytosine, different outcomes happen depending on the methylation state. If cytosine was not methylated, it is then converted to uracil nucleotide, which is not a constituent member of DNA sequences and is quickly repaired by the system. However, if the cytosine is methylated, conversion to another nucleotide occurs – thymine. Since thymine is natural part of each DNA sequence, system cannot recognise this change. Thus in vertebrate genomes, we see much lower proportion of CpG dinucleotides than expected.

It takes 2 lines of R code to discover this phenomenon.

## input is the control sample in ChIP-seq experiment (sequences of un-chipped DNA)
inpt <- readAligned(dirPath="/YourPathToInputFiles/", pattern="Input.bam", type="BAM")

## divide sequence data into dinucleotides and calculate their frequency
ntab <- table(unlist(lapply(sread(inpt)[sample(1:length(inpt), 10000)], function(i) strsplit(gsub("(..)", "\\1_", i), split="_")[[1]])))

You may have also some unexpected nucleotides (>16 permutations) that contain “N”. These are the bases that could not be assigned by sequencing machine.
Finally, let’s visualise CpG dinucleotide depletion.

barplot(ntab[!grepl("N", names(ntab))], ylab="Frequency")

## readAligned is part of ShortRead package, to download visit
## table calculates string (dinucleotide) frequency
## we iterate over 10,000 randomly chosen sequences with lapply function
## sread can access sequence reads from the (binary) sequence alignment map
## strsplit splits the strings (sequences) based on pattern defined by split parameter
## gsub unlike sub replaces the pattern over all matches
## \\1 is a backreference to the grouping regular expression (..); it returns all two-character elements

You may wonder why CpG and not GpC? Cytosine in GpC is rarely methylated (

Edit: You can find the dataset used in this example at
Original research is presented at


ChIP-seq Analysis with Bioconductor

Often scientists are interested in finding genome-wide binding site of their protein of interest. R offers easy way to load and process the sequence files coming from ChIP-seq experiment. During the next weeks I’m going to present a pipeline that uses several Bioconductor packages which contain necessary functions for working on NGS datasets. To keep blog organised, posts are going to be divided into several parts. As content is uploaded, links will be made in this summary post for easier access.

1. Loading and quality control for sequence alignment maps
2. Filtering and coverage calculation
3. Peak finding using local lambda fit for Poisson distribution
4. Motif discovery
5. Motif scanning

So let’s start with the first topic – loading and QA of sequence reads.

## packages for loading BAM files</pre>
library(ShortRead); library(chipseq)

## loading alignment map
input <- readAligned(dirPath = "/Directory/", pattern = "Input.bam", type = "BAM")
chip <- readAligned(dirPath = "/Directory/", pattern = "Foxa2.bam", type = "BAM")

## content of AlignedRead class object

## base calling quality of the input and chip dataset
input_qa <- quality(input)
chip_qa <- quality(chip)


## convert quality encoding to integer matrix and plot the scores
input_matrix_qa <- as(input_qa[sample(1:length(input_qa), 10000)], "matrix")
chip_matrix_qa <- as(chip_qa[sample(1:length(chip_qa), 10000)], "matrix")

plot(apply(input_matrix_qa, 2, mean), pch = 16, col = "red", ylim = c(0, 40), ylab = "Phred Quality Score", xlab = "Base Position")
lines(apply(chip_matrix_qa, 2, mean), pch = 16, col = "black", type = "p")
legend("topright", c("chip", "input"), pch = 16, col = c("black", "red"), cex = 2, bty = "n")

## nucleotide frequency
sread_chip <- as.character(sread(chip)[sample(1:length(chip), 10000)])
sread_chip <-"rbind", lapply(sread_chip, function(i) strsplit(i, "")[[1]]))
sread_chip <- apply(sread_chip, 2, function(i) table(i)[c("A", "C", "G", "T")]) / 10000

barplot(sread_chip, xlim=c(0,50), ylab = "Nucleotide Distribution")
legend("right", fill=gray(1:4/4), c("A", "C", "G", "T"), bty = "n", cex = 1.5)

discrimination between CpG islands and random sequences using Markov chains

Major part of modern research is trying to find patterns in the given dataset using learning methods. One of the methods that can use a priori information for such purpose is Markov chains, in which the probability of symbol occurrence depends only on previous symbol (appropriate for the dinucleotide modeling).

The question I am addressing here is – given the random sequence from mouse genome, can you predict that this is a CpG island? In the next few lines of code you will find many useful approaches to MC modeling, splitting of odd and even elements of R vector using “substring” function, querying the UCSC database solely via R, and informative visualization with transparent histograms.

Note that I got sequences from BSgenome.Mmusculus.UCSC.mm9 library, which contains complete mouse genome. Since the download of the same might be tedious for some (~750mb), I uploaded the cpg and random regions to pastebin.

CpG Islands –
Random -

# load libraries

# ignore following if you're taking sequences from pastebin

# models derived by ML estimator (a_st = c_st / sum(c_st')) from CpG and non-CpG islands
# in each row, nucleotide has a probability of being followed by respective base in the column,
# thus rows should sum up to 1

# notice the drop of G being followed by C in negative model (due to C being target of spontaneous deamination upon methylation),
# same is preserved in CpG-islands rich regions due to the lack of methylation supression on promoter regions

positive.model <- t(matrix(
    data  = c(0.180, 0.274, 0.426, 0.120, 0.171, 0.368, 0.274, 0.188, 0.161, 0.339, 0.375, 0.125, 0.079, 0.355, 0.384, 0.182),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))

negative.model <- t(matrix(
    data  = c(0.300, 0.205, 0.285, 0.210, 0.322, 0.298, 0.078, 0.302, 0.248, 0.246, 0.298, 0.208, 0.177, 0.239, 0.292, 0.292),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))

# log-likelihood table derived from log-odds ratio of two models
# used for discrimination of sequences as CpG or non-CpG islands

likelihood.values <- round(log2(positive.model/negative.model), 3)

# score (in bits) of the sequence normalized by its length
bits.score <- function(sequence = DNA_region, model = likelihood.values) {

    # sequence length
    seq.length <- nchar(sequence)

    # if sequence is odd, trim the last element
    if (seq.length %% 2 != 0) {
        sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
        seq.length <- seq.length - 1

    # extract dinucleotides
    idx   <- seq(1, seq.length)
    odds  <- idx[(idx %% 2) == 1]
    evens <- idx[(idx %% 2) == 0]

    tmp.split <- substring(sequence, odds, evens)

    # return normalized sum of the log-likelihood scores
    return(sum(unlist(lapply(tmp.split, function(tmp.dinucleotide) {
        lookup <- strsplit(tmp.dinucleotide, split = "")[[1]]
        return(model[lookup[1], lookup[2]])
    }))) / seq.length)

bits.score <- cmpfun(bits.score)

# setting up the discrimination model
model.score <- function(sequences = DNA_region) {
    all.dinucleotides <-"c", lapply(sequences, function(sequence) {

        # sequence length
        seq.length <- nchar(sequence)

        # if sequence is odd, trim the last element
        if (seq.length %% 2 != 0) {
            sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
            seq.length <- seq.length - 1

        # extract dinucleotides
        idx   <- seq(1, seq.length)
        odds  <- idx[(idx %% 2) == 1]
        evens <- idx[(idx %% 2) == 0]

        tmp.split <- substring(sequence, odds, evens)

    all.counts <- t(matrix(table(all.dinucleotides), ncol = 4, nrow = 4, dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))))
    all.counts <- t(apply(all.counts, 1, function(tmp.row) { tmp.row / sum(tmp.row) } ))


model.score <- cmpfun(model.score)

# following code extracts sequences from annotated and random parts of the mm9 genome
# ignore it if you're using pastebin text
# querying the UCSC database for CpG islands track
session               <- browserSession()
genome(session)       <- "mm9"
query                 <- ucscTableQuery(session, "CpG Islands")
cpg.islands           <- getTable(query)
cpg.islands           <- subset(cpg.islands, cpg.islands$chrom == "chr1", select = c("chrom", "chromStart", "chromEnd", "length"))
colnames(cpg.islands) <- c("chr", "region.start", "region.end", "region.width")

# generate random positions via "sample" function, but not exceeding largest width of CpG island
random.regions            <- data.frame(chr = "chr1", region.start = sample(seq(0, seqlengths(Mmusculus)[["chr1"]] - max(cpg.islands$region.width)), nrow(cpg.islands)))
random.regions$region.end <- random.regions$region.start + cpg.islands$region.width

markov.chain.seq <- lapply(c("cpg.islands", "random.regions"),  FUN = function(tmp.sample) {
    tmp.dataset <- get(tmp.sample)

    if (nrow(tmp.dataset) > 0) {
        tmp.sview        <- as.character(DNAStringSet(Views(unmasked(Mmusculus[["chr1"]]), start = tmp.dataset$region.start, end = tmp.dataset$region.end)))


# only keep sequences from non-repetitive parts of genome
markov.chain.seq        <- lapply(markov.chain.seq, function(tmp.sample) {"c", lapply(tmp.sample, function(tmp.seq) { tmp.seq[!grepl("N", tmp.seq)] })) })
names(markov.chain.seq) <- c("cpg.islands", "random.regions")

# you can substitute "lapply" with "mclapply" for parallel processing with multicore package at this point
# output of the discrimination plot
png("~/MarkovChain.png", 1000, 600)
    hist(unlist(lapply(markov.chain.seq[[1]], bits.score)), breaks = 100, xlim = c(-0.5, 0.5), ylim = c(0, 50), col = "#78B90050", main = "Markov Chain model discriminates CpG from non-CpG sequences", xlab = "bits score")
    hist(unlist(lapply(markov.chain.seq[[2]], bits.score)), breaks = 100, col = "#444D6150", add = TRUE)
legend("topright", c("CpG-rich sequences", "random sequences"), col = c("#78B900", "#444D61"), pch = 15, cex = 2, bty = "n")

Similar to the HMM example from previous post, the Markov chain is also based on Durbin (1998.) book.

Cheers, mintgene.

quick tips: within function assignment and specific object removal

If you’re familiar with the faster iterations on objects such as lapply, sapply, or apply for matrices, you might get surprised that the function call saves new assignments only locally.

# to assign global variable within function use "<<-" operator
lapply(1:10, function(i) {
    if (i != 1) {
        global.var <<- global.var + i; return(NULL)
    } else {
        global.var <<- i; return(NULL)

# 55

# another neat trick is the multiple assignment to objects
global.var -> global.var.A -> global.var.B
global.var == global.var.B


One of my favorite lines in R comes from the fact the language environment devours the memory. To focus on particular objects (such as when you need to save it as *.rda), replace the search term within grepl function for the one that you wish to keep.

# "global.var"   "global.var.A" "global.var.B"

rm(list = ls()[!grepl("global.var.A", ls())])

# refresh memory

#  "global.var.A"

Cheers, mintgene.

hmm: implementation of viterbi algorithm (Durbin, 1998) Part 2

Previous post presented the problem of dishonest casino that ocassionally uses loaded die. Sequence of the real states is hidden, and we are trying to figure it out just by looking at the observations (symbols).

# backtracking algorithm
for (i in 2:length(symbol.sequence)) {
    # probability vector stores the current emission with respect to (i-1) observation of selected state and transition probability
    # state vector (pointer) on the other hand is only storing the most probable state in (i-1), which we will later use for backtracking

    tmp.path.probability <- lapply(states, function(l) {
        max.k <- unlist(lapply(states, function(k) {
            prob.history[i-1, k] + transition.matrix[k, l]

        return(c(states[which(max.k == max(max.k))], max(max.k) + emission.matrix[symbol.sequence[i], l]))

    prob.history <- rbind(prob.history, data.frame(F = as.numeric(tmp.path.probability[[1]][2]), L = as.numeric(tmp.path.probability[[2]][2])))

    state.history <- data.frame(F = c(as.character(state.history[,tmp.path.probability[[1]][1]]), "F"), L = c(as.character(state.history[,tmp.path.probability[[2]][1]]), "L"))

# selecting the most probable path
viterbi.path <- as.character(state.history[,c("F", "L")[which(max(prob.history[length(symbol.sequence), ]) == prob.history[length(symbol.sequence), ])]])

If we apply our implementation to the data in the previous post, we can get the idea how well can HMM reconstruct the real history.

viterbi.table <- table(viterbi.path == real.path)
cat(paste(round(viterbi.table["TRUE"] / sum(viterbi.table) * 100, 2), "% accuracy\n", sep = ""))
# 71.33% accuracy

Cheers, mintgene.

hmm: implementation of viterbi algorithm (Durbin, 1998) Part 1

Example in the mentioned book goes as following – dishonest casino uses two types of dice. Fair one that has equal probability of landing on either side (1/6), and the loaded one with 50% chance for getting 6. Your task is to figure out which die has been used (states) just based on the sequence of the outcomes (symbols).

Notice that we convert all probabilities in log scale. Viterbi algorithm selects the most probable path that can have very low value by the end of the run. Thus, conversion to appropriate scale helps avoid calculation issues.

data -

# we'll represent loaded die as "L", and the fair one as "F"
states <- c("F", "L")

# following matrix defines the probability of switching the die
transition.matrix <- t(matrix(data = log2(c(0.95, 0.05, 0.1, 0.9)), nrow = 2, ncol = 2, dimnames = list(c("F", "L"), c("F", "L"))))

# emission probabilities tell you what is the change of landing on each side given that the particular die is selected
emission.matrix <- matrix(data = log2(c(rep(1/6, 6), c(rep(1/10, 5), 1/2))), nrow = 6, ncol = 2, dimnames = list(seq(1,6), c("F", "L")))

# initial probabilities define the chance of starting outcome (in our case we are equally likely to start with either states)
initial.matrix <- matrix(data = log2(c(0.5, 0.5)), nrow = 2, ncol = 1, dimnames = list(c("F", "L"), ""))

After we defined the model, we need to initialize two object that will keep the track of probability history and state path (Pi) during the recursion process.

prob.history  <- data.frame()
state.history <- data.frame()

# we start by calculating the probability of being in particular state given the first symbol and initial matrix
# notice a change in log space - every multiplication is converted to summation
prob.history  <- rbind(prob.history, unlist(lapply(states, function(k) {
    initial.matrix[k, 1] + emission.matrix[symbol.sequence[1], k]

state.history <- rbind(state.history, states)

colnames(prob.history)  <- c("F", "L")
colnames(state.history) <- c("F", "L")

Second part coming soon! mintgene

heatmaps: controlling the color representation with set data range

Often you want to set the fixed colors for particular range of your dataset to be sure that the visual output is correctly represented. This is particularly useful for time series data, where the range or your dataset might drastically change during the course of the simulation.

To do that in R, we need to set the “breaks” parameter in plotting functions such as image or heatmap.2.

# gplots contains the heatmap.2 function

# create 50x10 matrix of random values from [-1, +1]
random.matrix  <- matrix(runif(500, min = -1, max = 1), nrow = 50)

# following code limits the lowest and highest color to 5%, and 95% of your range, respectively
quantile.range <- quantile(random.matrix, probs = seq(0, 1, 0.01))
palette.breaks <- seq(quantile.range["5%"], quantile.range["95%"], 0.1)

# use to find optimal divergent color palette (or set own)
color.palette  <- colorRampPalette(c("#FC8D59", "#FFFFBF", "#91CF60"))(length(palette.breaks) - 1)



    dendrogram = "row",
    scale      = "none",
    trace      = "none",
    key        = FALSE,
    labRow     = NA,
    labCol     = NA,

    col    = color.palette,
    breaks = palette.breaks

Enjoy plotting! mintgene