discrimination between CpG islands and random sequences using Markov chains

Major part of modern research is trying to find patterns in the given dataset using learning methods. One of the methods that can use a priori information for such purpose is Markov chains, in which the probability of symbol occurrence depends only on previous symbol (appropriate for the dinucleotide modeling).

The question I am addressing here is – given the random sequence from mouse genome, can you predict that this is a CpG island? In the next few lines of code you will find many useful approaches to MC modeling, splitting of odd and even elements of R vector using “substring” function, querying the UCSC database solely via R, and informative visualization with transparent histograms.

Note that I got sequences from BSgenome.Mmusculus.UCSC.mm9 library, which contains complete mouse genome. Since the download of the same might be tedious for some (~750mb), I uploaded the cpg and random regions to pastebin.

CpG Islands – http://pastebin.com/372cfs6U
Random – http://pastebin.com/wZyKFFkx

# load libraries
library(compiler)
library(rtracklayer)
library(GenomicFeatures)

# ignore following if you're taking sequences from pastebin
library(BSgenome.Mmusculus.UCSC.mm9)

# models derived by ML estimator (a_st = c_st / sum(c_st')) from CpG and non-CpG islands
# in each row, nucleotide has a probability of being followed by respective base in the column,
# thus rows should sum up to 1

# notice the drop of G being followed by C in negative model (due to C being target of spontaneous deamination upon methylation),
# same is preserved in CpG-islands rich regions due to the lack of methylation supression on promoter regions

positive.model <- t(matrix(
    data  = c(0.180, 0.274, 0.426, 0.120, 0.171, 0.368, 0.274, 0.188, 0.161, 0.339, 0.375, 0.125, 0.079, 0.355, 0.384, 0.182),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
))

negative.model <- t(matrix(
    data  = c(0.300, 0.205, 0.285, 0.210, 0.322, 0.298, 0.078, 0.302, 0.248, 0.246, 0.298, 0.208, 0.177, 0.239, 0.292, 0.292),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
))

# log-likelihood table derived from log-odds ratio of two models
# used for discrimination of sequences as CpG or non-CpG islands

likelihood.values <- round(log2(positive.model/negative.model), 3)

# score (in bits) of the sequence normalized by its length
bits.score <- function(sequence = DNA_region, model = likelihood.values) {

    # sequence length
    seq.length <- nchar(sequence)

    # if sequence is odd, trim the last element
    if (seq.length %% 2 != 0) {
        sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
        seq.length <- seq.length - 1
    }

    # extract dinucleotides
    idx   <- seq(1, seq.length)
    odds  <- idx[(idx %% 2) == 1]
    evens <- idx[(idx %% 2) == 0]

    tmp.split <- substring(sequence, odds, evens)

    # return normalized sum of the log-likelihood scores
    return(sum(unlist(lapply(tmp.split, function(tmp.dinucleotide) {
        lookup <- strsplit(tmp.dinucleotide, split = "")[[1]]
        return(model[lookup[1], lookup[2]])
    }))) / seq.length)
}

bits.score <- cmpfun(bits.score)

# setting up the discrimination model
model.score <- function(sequences = DNA_region) {
    all.dinucleotides <- do.call("c", lapply(sequences, function(sequence) {

        # sequence length
        seq.length <- nchar(sequence)

        # if sequence is odd, trim the last element
        if (seq.length %% 2 != 0) {
            sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
            seq.length <- seq.length - 1
        }

        # extract dinucleotides
        idx   <- seq(1, seq.length)
        odds  <- idx[(idx %% 2) == 1]
        evens <- idx[(idx %% 2) == 0]

        tmp.split <- substring(sequence, odds, evens)
    }))

    all.counts <- t(matrix(table(all.dinucleotides), ncol = 4, nrow = 4, dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))))
    all.counts <- t(apply(all.counts, 1, function(tmp.row) { tmp.row / sum(tmp.row) } ))

return(all.counts)
}

model.score <- cmpfun(model.score)

# following code extracts sequences from annotated and random parts of the mm9 genome
# ignore it if you're using pastebin text
# querying the UCSC database for CpG islands track
session               <- browserSession()
genome(session)       <- "mm9"
query                 <- ucscTableQuery(session, "CpG Islands")
cpg.islands           <- getTable(query)
cpg.islands           <- subset(cpg.islands, cpg.islands$chrom == "chr1", select = c("chrom", "chromStart", "chromEnd", "length"))
colnames(cpg.islands) <- c("chr", "region.start", "region.end", "region.width")

# generate random positions via "sample" function, but not exceeding largest width of CpG island
random.regions            <- data.frame(chr = "chr1", region.start = sample(seq(0, seqlengths(Mmusculus)[["chr1"]] - max(cpg.islands$region.width)), nrow(cpg.islands)))
random.regions$region.end <- random.regions$region.start + cpg.islands$region.width

markov.chain.seq <- lapply(c("cpg.islands", "random.regions"),  FUN = function(tmp.sample) {
    tmp.dataset <- get(tmp.sample)

    if (nrow(tmp.dataset) > 0) {
        tmp.sview        <- as.character(DNAStringSet(Views(unmasked(Mmusculus[["chr1"]]), start = tmp.dataset$region.start, end = tmp.dataset$region.end)))
    }

    return(tmp.sview)
})

# only keep sequences from non-repetitive parts of genome
markov.chain.seq        <- lapply(markov.chain.seq, function(tmp.sample) { do.call("c", lapply(tmp.sample, function(tmp.seq) { tmp.seq[!grepl("N", tmp.seq)] })) })
names(markov.chain.seq) <- c("cpg.islands", "random.regions")

# you can substitute "lapply" with "mclapply" for parallel processing with multicore package at this point
# output of the discrimination plot
png("~/MarkovChain.png", 1000, 600)
    hist(unlist(lapply(markov.chain.seq[[1]], bits.score)), breaks = 100, xlim = c(-0.5, 0.5), ylim = c(0, 50), col = "#78B90050", main = "Markov Chain model discriminates CpG from non-CpG sequences", xlab = "bits score")
    hist(unlist(lapply(markov.chain.seq[[2]], bits.score)), breaks = 100, col = "#444D6150", add = TRUE)
legend("topright", c("CpG-rich sequences", "random sequences"), col = c("#78B900", "#444D61"), pch = 15, cex = 2, bty = "n")
dev.off()

Similar to the HMM example from previous post, the Markov chain is also based on Durbin (1998.) book.

Cheers, mintgene.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s