Edit: I’ve updated afterPCA function to work with input of any dimensions. Previously it only generated correct output for square matrices.

PCA is widely used method for finding patterns in high-dimensional data. Whether you use it to compress large matrix or to remove one of the principal components in biological datasets, you’ll end up with the task of performing series of equations from linear algebra to reconstruct the matrix of original dimensions.

To find principal components, we first need to center the input matrix, and then calculate the eigenvalues and eigenvectors of its covariance matrix. To illustrate matrix reconstruction I’ll use 32×32 faceData matrix from Jeff Leek’s coursera course on data analysis.

# get the dataset from https://spark-public.s3.amazonaws.com/dataanalysis/face.rda # you probably want to use stats::prcomp for PCA on big matrices load('~/face.rda') runPCA <- function(mat = 'Unadjusted matrix') eigen(cov(apply(mat, 2, function(i) i - mean(i)))) pca <- runPCA(faceData) # > str(pca) # List of 2 # $ values : num [1:32] 18915 10067 4724 2506 1366 ... # $ vectors: num [1:32, 1:32] -0.236 -0.261 -0.192 -0.164 -0.204 ...

First thing after doing PCA is to check the contributions of each PC in explaining the variance.

varExplained <- function(eigenList) { par(mfrow = c(1,2)) plot( eigenList$value / sum(eigenList$value), pch = 21, col = 'black', bg = '#549cc4', ylim = c(0, 1), xlab = 'Principal Component', ylab = 'Variance Explained' ) + abline(h = 0.9) plot( cumsum(eigenList$value) / sum(eigenList$value), pch = 21, col = 'black', bg = '#549cc4', ylim = c(0, 1), xlab = 'Principal Component', ylab = 'Cumulative Variance Explained' ) + abline(h = 0.9) } varExplained(pca)

From these plots you can see that faceData has ~5 PC’s that cumulatively explain ~90% of total variance. Lets use this information to reconstruct the matrix, and compare it to the original one.

afterPCA <- function( matAdjust = 'Centered matrix', meanList = 'List of column means of original (unadjusted) matrix', sdList = 'List of column standard deviations of original (unadjusted) matrix', eigenList = 'List of eigenvalues and eigenvectors of adjust matrix covariance matrix', n = 'selected PC\'s', specific_select = 'If True: n == 1:n, if False: just n\'th columns') { if (length(n) > ncol(matAdjust)) stop('N is higher than the number of PC\'s') if (!specific_select & length(n) > 1) stop('Use a single number when selecting up to n\'th PC') if (!specific_select) n <- 1:n tmp_return <- (t(eigenList$vectors[,n] %*% (t(eigenList$vectors[,n]) %*% t(matAdjust))) * matrix(rep(sdList, each = nrow(matAdjust)), ncol = ncol(matAdjust))) + matrix(rep(meanList, each = nrow(matAdjust)), ncol = ncol(matAdjust)) colnames(tmp_return) <- colnames(matAdjust) tmp_return } # ColorBrewer palette library(RColorBrewer) showMatrix <- function(x, ...) image(t(x[nrow(x):1,]), xaxt = 'none', yaxt = 'none', col = rev(colorRampPalette(brewer.pal(7, 'Blues'))(100)), ...) reconstMatrix <- afterPCA( matAdjust = apply(faceData, 2, function(i) i - mean(i)), meanList = apply(faceData, 2, mean), eigenList = pca, n = 5, specific_select = FALSE ) par(mfrow = c(1,2), mar = c(0, 0, 1, 0), bty = 'n') showMatrix(faceData, main = 'Original Matrix') showMatrix(reconstMatrix, main = 'First 5 PC\'s')

As seen from eigenvalues (variances), taking only 5/32 PC’s is enough to recreate face that has almost all of the features of the original matrix.

]]>library(ggplot2) # create fake dataset with additional attributes - sex, sample, and temperature x <- data.frame( values = c(runif(100, min = -2), runif(100), runif(100, max = 2), runif(100)), sex = rep(c('M', 'F'), each = 100), sample = rep(c('sample_a', 'sample_b'), each = 200), temperature = sample(c('15C', '25C', '30C', '42C'), 400, replace = TRUE) ) # compare different sample populations across various temperatures ggplot(x, aes(x = sample, y = values, fill = sex)) + geom_boxplot() + facet_wrap(~ temperature)]]>

## required packages (plot, melt data frame, and rolling function) library(ggplot2) library(reshape) library(zoo) ## repeat random selection set.seed(1) ## create 50x10 matrix of random values from [-1, +1] random_matrix <- matrix(runif(500, min = -1, max = 1), nrow = 50) ## set color representation for specific values of the data distribution quantile_range <- quantile(random_matrix, probs = seq(0, 1, 0.2)) ## use http://colorbrewer2.org/ to find optimal divergent color palette (or set own) color_palette <- colorRampPalette(c("#3794bf", "#FFFFFF", "#df8640"))(length(quantile_range) - 1) ## prepare label text (use two adjacent values for range text) label_text <- rollapply(round(quantile_range, 2), width = 2, by = 1, FUN = function(i) paste(i, collapse = " : ")) ## discretize matrix; this is the most important step, where for each value we find category of predefined ranges (modify probs argument of quantile to detail the colors) mod_mat <- matrix(findInterval(random_matrix, quantile_range, all.inside = TRUE), nrow = nrow(random_matrix)) ## remove background and axis from plot theme_change <- theme( plot.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.background = element_blank(), panel.border = element_blank(), axis.line = element_blank(), axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank() ) ## output the graphics ggplot(melt(mod_mat), aes(x = X1, y = X2, fill = factor(value))) + geom_tile(color = "black") + scale_fill_manual(values = color_palette, name = "", labels = label_text) + theme_change

Result of the quantile color representation:

We can predefine ranges, and create skewed colorsets:

Trick was to discretize the matrix of continuous values. Alternatively, you can use “breaks” argument in functions such as scale_fill_gradientn, but such method will assign continuous list of colors within set range.

Cheers.

]]>In vertebrate genomes, a methyl group (-CH3) can be added to nucleotides. Such process of methylation is commonly associated with gene suppression. Most of the cytosines in the system are methylated (apart from structures called CpG islands). If a spontaneous event of deamination (removal of amine group) occurs on cytosine, different outcomes happen depending on the methylation state. If cytosine was not methylated, it is then converted to uracil nucleotide, which is not a constituent member of DNA sequences and is quickly repaired by the system. However, if the cytosine is methylated, conversion to another nucleotide occurs – thymine. Since thymine is natural part of each DNA sequence, system cannot recognise this change. Thus in vertebrate genomes, we see much lower proportion of CpG dinucleotides than expected.

It takes 2 lines of R code to discover this phenomenon.

## input is the control sample in ChIP-seq experiment (sequences of un-chipped DNA) inpt <- readAligned(dirPath="/YourPathToInputFiles/", pattern="Input.bam", type="BAM") ## divide sequence data into dinucleotides and calculate their frequency ntab <- table(unlist(lapply(sread(inpt)[sample(1:length(inpt), 10000)], function(i) strsplit(gsub("(..)", "\\1_", i), split="_")[[1]])))

You may have also some unexpected nucleotides (>16 permutations) that contain “N”. These are the bases that could not be assigned by sequencing machine.

Finally, let’s visualise CpG dinucleotide depletion.

barplot(ntab[!grepl("N", names(ntab))], ylab="Frequency")

## readAligned is part of ShortRead package, to download visit bioconductor.org ## table calculates string (dinucleotide) frequency ## we iterate over 10,000 randomly chosen sequences with lapply function ## sread can access sequence reads from the (binary) sequence alignment map ## strsplit splits the strings (sequences) based on pattern defined by split parameter ## gsub unlike sub replaces the pattern over all matches ## \\1 is a backreference to the grouping regular expression (..); it returns all two-character elements

You may wonder why CpG and not GpC? Cytosine in GpC is rarely methylated (http://www.ncbi.nlm.nih.gov/pubmed/6254144).

Edit: You can find the dataset used in this example at http://goo.gl/qrL48.

Original research is presented at http://dev.biologists.org/content/139/14/2625.

Cheers.

]]>1. Loading and quality control for sequence alignment maps

2. Filtering and coverage calculation

3. Peak finding using local lambda fit for Poisson distribution

4. Motif discovery

5. Motif scanning

So let’s start with the first topic – loading and QA of sequence reads.

## packages for loading BAM files</pre> library(ShortRead); library(chipseq) ## loading alignment map input <- readAligned(dirPath = "/Directory/", pattern = "Input.bam", type = "BAM") chip <- readAligned(dirPath = "/Directory/", pattern = "Foxa2.bam", type = "BAM") ## content of AlignedRead class object input ## base calling quality of the input and chip dataset input_qa <- quality(input) chip_qa <- quality(chip) input_qa ## convert quality encoding to integer matrix and plot the scores input_matrix_qa <- as(input_qa[sample(1:length(input_qa), 10000)], "matrix") chip_matrix_qa <- as(chip_qa[sample(1:length(chip_qa), 10000)], "matrix") plot(apply(input_matrix_qa, 2, mean), pch = 16, col = "red", ylim = c(0, 40), ylab = "Phred Quality Score", xlab = "Base Position") lines(apply(chip_matrix_qa, 2, mean), pch = 16, col = "black", type = "p") legend("topright", c("chip", "input"), pch = 16, col = c("black", "red"), cex = 2, bty = "n") ## nucleotide frequency sread_chip <- as.character(sread(chip)[sample(1:length(chip), 10000)]) sread_chip <- do.call("rbind", lapply(sread_chip, function(i) strsplit(i, "")[[1]])) sread_chip <- apply(sread_chip, 2, function(i) table(i)[c("A", "C", "G", "T")]) / 10000 barplot(sread_chip, xlim=c(0,50), ylab = "Nucleotide Distribution") legend("right", fill=gray(1:4/4), c("A", "C", "G", "T"), bty = "n", cex = 1.5)]]>

The question I am addressing here is – given the random sequence from mouse genome, can you predict that this is a CpG island? In the next few lines of code you will find many useful approaches to MC modeling, splitting of odd and even elements of R vector using “substring” function, querying the UCSC database solely via R, and informative visualization with transparent histograms.

Note that I got sequences from BSgenome.Mmusculus.UCSC.mm9 library, which contains complete mouse genome. Since the download of the same might be tedious for some (~750mb), I uploaded the cpg and random regions to pastebin.

CpG Islands – http://pastebin.com/372cfs6U

Random – http://pastebin.com/wZyKFFkx

# load libraries library(compiler) library(rtracklayer) library(GenomicFeatures) # ignore following if you're taking sequences from pastebin library(BSgenome.Mmusculus.UCSC.mm9) # models derived by ML estimator (a_st = c_st / sum(c_st')) from CpG and non-CpG islands # in each row, nucleotide has a probability of being followed by respective base in the column, # thus rows should sum up to 1 # notice the drop of G being followed by C in negative model (due to C being target of spontaneous deamination upon methylation), # same is preserved in CpG-islands rich regions due to the lack of methylation supression on promoter regions positive.model <- t(matrix( data = c(0.180, 0.274, 0.426, 0.120, 0.171, 0.368, 0.274, 0.188, 0.161, 0.339, 0.375, 0.125, 0.079, 0.355, 0.384, 0.182), nrow = 4, ncol = 4, dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T")) )) negative.model <- t(matrix( data = c(0.300, 0.205, 0.285, 0.210, 0.322, 0.298, 0.078, 0.302, 0.248, 0.246, 0.298, 0.208, 0.177, 0.239, 0.292, 0.292), nrow = 4, ncol = 4, dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T")) )) # log-likelihood table derived from log-odds ratio of two models # used for discrimination of sequences as CpG or non-CpG islands likelihood.values <- round(log2(positive.model/negative.model), 3) # score (in bits) of the sequence normalized by its length bits.score <- function(sequence = DNA_region, model = likelihood.values) { # sequence length seq.length <- nchar(sequence) # if sequence is odd, trim the last element if (seq.length %% 2 != 0) { sequence <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "") seq.length <- seq.length - 1 } # extract dinucleotides idx <- seq(1, seq.length) odds <- idx[(idx %% 2) == 1] evens <- idx[(idx %% 2) == 0] tmp.split <- substring(sequence, odds, evens) # return normalized sum of the log-likelihood scores return(sum(unlist(lapply(tmp.split, function(tmp.dinucleotide) { lookup <- strsplit(tmp.dinucleotide, split = "")[[1]] return(model[lookup[1], lookup[2]]) }))) / seq.length) } bits.score <- cmpfun(bits.score) # setting up the discrimination model model.score <- function(sequences = DNA_region) { all.dinucleotides <- do.call("c", lapply(sequences, function(sequence) { # sequence length seq.length <- nchar(sequence) # if sequence is odd, trim the last element if (seq.length %% 2 != 0) { sequence <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "") seq.length <- seq.length - 1 } # extract dinucleotides idx <- seq(1, seq.length) odds <- idx[(idx %% 2) == 1] evens <- idx[(idx %% 2) == 0] tmp.split <- substring(sequence, odds, evens) })) all.counts <- t(matrix(table(all.dinucleotides), ncol = 4, nrow = 4, dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T")))) all.counts <- t(apply(all.counts, 1, function(tmp.row) { tmp.row / sum(tmp.row) } )) return(all.counts) } model.score <- cmpfun(model.score) # following code extracts sequences from annotated and random parts of the mm9 genome # ignore it if you're using pastebin text # querying the UCSC database for CpG islands track session <- browserSession() genome(session) <- "mm9" query <- ucscTableQuery(session, "CpG Islands") cpg.islands <- getTable(query) cpg.islands <- subset(cpg.islands, cpg.islands$chrom == "chr1", select = c("chrom", "chromStart", "chromEnd", "length")) colnames(cpg.islands) <- c("chr", "region.start", "region.end", "region.width") # generate random positions via "sample" function, but not exceeding largest width of CpG island random.regions <- data.frame(chr = "chr1", region.start = sample(seq(0, seqlengths(Mmusculus)[["chr1"]] - max(cpg.islands$region.width)), nrow(cpg.islands))) random.regions$region.end <- random.regions$region.start + cpg.islands$region.width markov.chain.seq <- lapply(c("cpg.islands", "random.regions"), FUN = function(tmp.sample) { tmp.dataset <- get(tmp.sample) if (nrow(tmp.dataset) > 0) { tmp.sview <- as.character(DNAStringSet(Views(unmasked(Mmusculus[["chr1"]]), start = tmp.dataset$region.start, end = tmp.dataset$region.end))) } return(tmp.sview) }) # only keep sequences from non-repetitive parts of genome markov.chain.seq <- lapply(markov.chain.seq, function(tmp.sample) { do.call("c", lapply(tmp.sample, function(tmp.seq) { tmp.seq[!grepl("N", tmp.seq)] })) }) names(markov.chain.seq) <- c("cpg.islands", "random.regions") # you can substitute "lapply" with "mclapply" for parallel processing with multicore package at this point # output of the discrimination plot png("~/MarkovChain.png", 1000, 600) hist(unlist(lapply(markov.chain.seq[[1]], bits.score)), breaks = 100, xlim = c(-0.5, 0.5), ylim = c(0, 50), col = "#78B90050", main = "Markov Chain model discriminates CpG from non-CpG sequences", xlab = "bits score") hist(unlist(lapply(markov.chain.seq[[2]], bits.score)), breaks = 100, col = "#444D6150", add = TRUE) legend("topright", c("CpG-rich sequences", "random sequences"), col = c("#78B900", "#444D61"), pch = 15, cex = 2, bty = "n") dev.off()

Similar to the HMM example from previous post, the Markov chain is also based on Durbin (1998.) book.

Cheers, mintgene.

]]># to assign global variable within function use "<<-" operator lapply(1:10, function(i) { if (i != 1) { global.var <<- global.var + i; return(NULL) } else { global.var <<- i; return(NULL) } }) cat(global.var) # 55 # another neat trick is the multiple assignment to objects global.var -> global.var.A -> global.var.B global.var == global.var.B # TRUE

One of my favorite lines in R comes from the fact the language environment devours the memory. To focus on particular objects (such as when you need to save it as *.rda), replace the search term within grepl function for the one that you wish to keep.

ls() # "global.var" "global.var.A" "global.var.B" rm(list = ls()[!grepl("global.var.A", ls())]) # refresh memory gc() ls() # "global.var.A"

Cheers, mintgene.

]]># backtracking algorithm for (i in 2:length(symbol.sequence)) { # probability vector stores the current emission with respect to (i-1) observation of selected state and transition probability # state vector (pointer) on the other hand is only storing the most probable state in (i-1), which we will later use for backtracking tmp.path.probability <- lapply(states, function(l) { max.k <- unlist(lapply(states, function(k) { prob.history[i-1, k] + transition.matrix[k, l] })) return(c(states[which(max.k == max(max.k))], max(max.k) + emission.matrix[symbol.sequence[i], l])) }) prob.history <- rbind(prob.history, data.frame(F = as.numeric(tmp.path.probability[[1]][2]), L = as.numeric(tmp.path.probability[[2]][2]))) state.history <- data.frame(F = c(as.character(state.history[,tmp.path.probability[[1]][1]]), "F"), L = c(as.character(state.history[,tmp.path.probability[[2]][1]]), "L")) } # selecting the most probable path viterbi.path <- as.character(state.history[,c("F", "L")[which(max(prob.history[length(symbol.sequence), ]) == prob.history[length(symbol.sequence), ])]])

If we apply our implementation to the data in the previous post, we can get the idea how well can HMM reconstruct the real history.

viterbi.table <- table(viterbi.path == real.path) cat(paste(round(viterbi.table["TRUE"] / sum(viterbi.table) * 100, 2), "% accuracy\n", sep = "")) # 71.33% accuracy

Cheers, mintgene.

]]>Notice that we convert all probabilities in log scale. Viterbi algorithm selects the most probable path that can have very low value by the end of the run. Thus, conversion to appropriate scale helps avoid calculation issues.

data – http://pastebin.com/NzBE4Fm1

# we'll represent loaded die as "L", and the fair one as "F" states <- c("F", "L") # following matrix defines the probability of switching the die transition.matrix <- t(matrix(data = log2(c(0.95, 0.05, 0.1, 0.9)), nrow = 2, ncol = 2, dimnames = list(c("F", "L"), c("F", "L")))) # emission probabilities tell you what is the change of landing on each side given that the particular die is selected emission.matrix <- matrix(data = log2(c(rep(1/6, 6), c(rep(1/10, 5), 1/2))), nrow = 6, ncol = 2, dimnames = list(seq(1,6), c("F", "L"))) # initial probabilities define the chance of starting outcome (in our case we are equally likely to start with either states) initial.matrix <- matrix(data = log2(c(0.5, 0.5)), nrow = 2, ncol = 1, dimnames = list(c("F", "L"), ""))

After we defined the model, we need to initialize two object that will keep the track of probability history and state path (Pi) during the recursion process.

prob.history <- data.frame() state.history <- data.frame() # we start by calculating the probability of being in particular state given the first symbol and initial matrix # notice a change in log space - every multiplication is converted to summation prob.history <- rbind(prob.history, unlist(lapply(states, function(k) { initial.matrix[k, 1] + emission.matrix[symbol.sequence[1], k] }))) state.history <- rbind(state.history, states) colnames(prob.history) <- c("F", "L") colnames(state.history) <- c("F", "L")

Second part coming soon! mintgene

]]>To do that in R, we need to set the “breaks” parameter in plotting functions such as *image* or *heatmap.2*.

# gplots contains the heatmap.2 function library(gplots) # create 50x10 matrix of random values from [-1, +1] random.matrix <- matrix(runif(500, min = -1, max = 1), nrow = 50) # following code limits the lowest and highest color to 5%, and 95% of your range, respectively quantile.range <- quantile(random.matrix, probs = seq(0, 1, 0.01)) palette.breaks <- seq(quantile.range["5%"], quantile.range["95%"], 0.1) # use http://colorbrewer2.org/ to find optimal divergent color palette (or set own) color.palette <- colorRampPalette(c("#FC8D59", "#FFFFBF", "#91CF60"))(length(palette.breaks) - 1) heatmap.2( random.matrix, dendrogram = "row", scale = "none", trace = "none", key = FALSE, labRow = NA, labCol = NA, col = color.palette, breaks = palette.breaks )

Enjoy plotting! mintgene

]]>