Reconstructing Principal Component Analysis Matrix

faceMatrix

Edit: I’ve updated afterPCA function to work with input of any dimensions. Previously it only generated correct output for square matrices.

PCA is widely used method for finding patterns in high-dimensional data. Whether you use it to compress large matrix or to remove one of the principal components in biological datasets, you’ll end up with the task of performing series of equations from linear algebra to reconstruct the matrix of original dimensions.

To find principal components, we first need to center the input matrix, and then calculate the eigenvalues and eigenvectors of its covariance matrix. To illustrate matrix reconstruction I’ll use 32×32 faceData matrix from Jeff Leek’s coursera course on data analysis.


# get the dataset from https://spark-public.s3.amazonaws.com/dataanalysis/face.rda
# you probably want to use stats::prcomp for PCA on big matrices
load('~/face.rda')
runPCA <- function(mat = 'Unadjusted matrix') eigen(cov(apply(mat, 2, function(i) i - mean(i))))
pca <- runPCA(faceData)

# > str(pca)
# List of 2
# $ values : num [1:32] 18915 10067 4724 2506 1366 ...
# $ vectors: num [1:32, 1:32] -0.236 -0.261 -0.192 -0.164 -0.204 ...

First thing after doing PCA is to check the contributions of each PC in explaining the variance.

varExplained <- function(eigenList) {

par(mfrow = c(1,2))

plot(
 eigenList$value / sum(eigenList$value), pch = 21, col = 'black',
 bg = '#549cc4', ylim = c(0, 1), xlab = 'Principal Component',
 ylab = 'Variance Explained'
 ) + abline(h = 0.9)

plot(
 cumsum(eigenList$value) / sum(eigenList$value), pch = 21,
 col = 'black', bg = '#549cc4', ylim = c(0, 1), xlab = 'Principal Component',
 ylab = 'Cumulative Variance Explained'
 ) + abline(h = 0.9)
}

varExplained(pca)

varExplained

From these plots you can see that faceData has ~5 PC’s that cumulatively explain ~90% of total variance. Lets use this information to reconstruct the matrix, and compare it to the original one.

afterPCA <- function(

 matAdjust = 'Centered matrix',
 meanList = 'List of column means of original (unadjusted) matrix',
 sdList = 'List of column standard deviations of original (unadjusted) matrix',
 eigenList = 'List of eigenvalues and eigenvectors of adjust matrix covariance matrix',
 n = 'selected PC\'s',
 specific_select = 'If True: n == 1:n, if False: just n\'th columns') {

if (length(n) > ncol(matAdjust)) stop('N is higher than the number of PC\'s')
 if (!specific_select & length(n) > 1) stop('Use a single number when selecting up to n\'th PC')
 if (!specific_select) n <- 1:n

tmp_return <- (t(eigenList$vectors[,n] %*% (t(eigenList$vectors[,n]) %*% t(matAdjust))) * matrix(rep(sdList, each = nrow(matAdjust)), ncol = ncol(matAdjust))) + matrix(rep(meanList, each = nrow(matAdjust)), ncol = ncol(matAdjust))
 colnames(tmp_return) <- colnames(matAdjust)

 tmp_return
}
# ColorBrewer palette
library(RColorBrewer)
showMatrix <- function(x, ...) image(t(x[nrow(x):1,]), xaxt = 'none', yaxt = 'none', col = rev(colorRampPalette(brewer.pal(7, 'Blues'))(100)), ...)

reconstMatrix <- afterPCA(
 matAdjust = apply(faceData, 2, function(i) i - mean(i)),
 meanList = apply(faceData, 2, mean),
 eigenList = pca,
 n = 5,
 specific_select = FALSE
)

par(mfrow = c(1,2), mar = c(0, 0, 1, 0), bty = 'n')
showMatrix(faceData, main = 'Original Matrix')
showMatrix(reconstMatrix, main = 'First 5 PC\'s')

firstFivePCs

As seen from eigenvalues (variances), taking only 5/32 PC’s is enough to recreate face that has almost all of the features of the original matrix.

Controlling heatmap colors with ggplot2

One of the most popular posts on this blog is the very first one, solving the issue of mapping certain ranges of values to particular colors in heatmaps. Given the abundance of ggplot2 usage in R plotting, I thought I’d give it a try and do similar job within the context of graphics grammar.

## required packages (plot, melt data frame, and rolling function)
library(ggplot2)
library(reshape)
library(zoo)

## repeat random selection
set.seed(1)

## create 50x10 matrix of random values from [-1, +1]
random_matrix <- matrix(runif(500, min = -1, max = 1), nrow = 50)

## set color representation for specific values of the data distribution
quantile_range <- quantile(random_matrix, probs = seq(0, 1, 0.2))

## use http://colorbrewer2.org/ to find optimal divergent color palette (or set own)
color_palette <- colorRampPalette(c("#3794bf", "#FFFFFF", "#df8640"))(length(quantile_range) - 1)

## prepare label text (use two adjacent values for range text)
label_text <- rollapply(round(quantile_range, 2), width = 2, by = 1, FUN = function(i) paste(i, collapse = " : "))

## discretize matrix; this is the most important step, where for each value we find category of predefined ranges (modify probs argument of quantile to detail the colors)
mod_mat <- matrix(findInterval(random_matrix, quantile_range, all.inside = TRUE), nrow = nrow(random_matrix))

## remove background and axis from plot
theme_change <- theme(
 plot.background = element_blank(),
 panel.grid.minor = element_blank(),
 panel.grid.major = element_blank(),
 panel.background = element_blank(),
 panel.border = element_blank(),
 axis.line = element_blank(),
 axis.ticks = element_blank(),
 axis.text.x = element_blank(),
 axis.text.y = element_blank(),
 axis.title.x = element_blank(),
 axis.title.y = element_blank()
)

## output the graphics
ggplot(melt(mod_mat), aes(x = X1, y = X2, fill = factor(value))) +
geom_tile(color = "black") +
scale_fill_manual(values = color_palette, name = "", labels = label_text) +
theme_change

Result of the quantile color representation:

We can predefine ranges, and create skewed colorsets:
Trick was to discretize the matrix of continuous values. Alternatively, you can use “breaks” argument in functions such as scale_fill_gradientn, but such method will assign continuous list of colors within set range.

Cheers.

ChIP-seq Analysis with Bioconductor

Often scientists are interested in finding genome-wide binding site of their protein of interest. R offers easy way to load and process the sequence files coming from ChIP-seq experiment. During the next weeks I’m going to present a pipeline that uses several Bioconductor packages which contain necessary functions for working on NGS datasets. To keep blog organised, posts are going to be divided into several parts. As content is uploaded, links will be made in this summary post for easier access.

1. Loading and quality control for sequence alignment maps
2. Filtering and coverage calculation
3. Peak finding using local lambda fit for Poisson distribution
4. Motif discovery
5. Motif scanning

So let’s start with the first topic – loading and QA of sequence reads.

## packages for loading BAM files</pre>
library(ShortRead); library(chipseq)

## loading alignment map
input <- readAligned(dirPath = "/Directory/", pattern = "Input.bam", type = "BAM")
chip <- readAligned(dirPath = "/Directory/", pattern = "Foxa2.bam", type = "BAM")

## content of AlignedRead class object
input

## base calling quality of the input and chip dataset
input_qa <- quality(input)
chip_qa <- quality(chip)

input_qa

## convert quality encoding to integer matrix and plot the scores
input_matrix_qa <- as(input_qa[sample(1:length(input_qa), 10000)], "matrix")
chip_matrix_qa <- as(chip_qa[sample(1:length(chip_qa), 10000)], "matrix")

plot(apply(input_matrix_qa, 2, mean), pch = 16, col = "red", ylim = c(0, 40), ylab = "Phred Quality Score", xlab = "Base Position")
lines(apply(chip_matrix_qa, 2, mean), pch = 16, col = "black", type = "p")
legend("topright", c("chip", "input"), pch = 16, col = c("black", "red"), cex = 2, bty = "n")

## nucleotide frequency
sread_chip <- as.character(sread(chip)[sample(1:length(chip), 10000)])
sread_chip <- do.call("rbind", lapply(sread_chip, function(i) strsplit(i, "")[[1]]))
sread_chip <- apply(sread_chip, 2, function(i) table(i)[c("A", "C", "G", "T")]) / 10000

barplot(sread_chip, xlim=c(0,50), ylab = "Nucleotide Distribution")
legend("right", fill=gray(1:4/4), c("A", "C", "G", "T"), bty = "n", cex = 1.5)