quick tips: within function assignment and specific object removal

Posted on 31/01/2012 by tpakozdi

If you’re familiar with the faster iterations on objects such as lapply, sapply, or apply for matrices, you might get surprised that the function call saves new assignments only locally.

# to assign global variable within function use "<<-" operator
lapply(1:10, function(i) {
    if (i != 1) {
        global.var <<- global.var + i; return(NULL)
    } else {
        global.var <<- i; return(NULL)
    }
})

cat(global.var)
# 55

# another neat trick is the multiple assignment to objects
global.var -> global.var.A -> global.var.B
global.var == global.var.B

# TRUE

One of my favorite lines in R comes from the fact the language environment devours the memory. To focus on particular objects (such as when you need to save it as *.rda), replace the search term within grepl function for the one that you wish to keep.

ls()
# "global.var"   "global.var.A" "global.var.B"

rm(list = ls()[!grepl("global.var.A", ls())])

# refresh memory
gc()

ls()
#  "global.var.A"

Cheers, mintgene.

hmm: implementation of viterbi algorithm (Durbin, 1998) Part 2

Posted on 29/01/2012 by tpakozdi

Previous post presented the problem of dishonest casino that ocassionally uses loaded die. Sequence of the real states is hidden, and we are trying to figure it out just by looking at the observations (symbols).

# backtracking algorithm
for (i in 2:length(symbol.sequence)) {
    # probability vector stores the current emission with respect to (i-1) observation of selected state and transition probability
    # state vector (pointer) on the other hand is only storing the most probable state in (i-1), which we will later use for backtracking

    tmp.path.probability <- lapply(states, function(l) {
        max.k <- unlist(lapply(states, function(k) {
            prob.history[i-1, k] + transition.matrix[k, l]
        }))

        return(c(states[which(max.k == max(max.k))], max(max.k) + emission.matrix[symbol.sequence[i], l]))
    })

    prob.history <- rbind(prob.history, data.frame(F = as.numeric(tmp.path.probability[[1]][2]), L = as.numeric(tmp.path.probability[[2]][2])))

    state.history <- data.frame(F = c(as.character(state.history[,tmp.path.probability[[1]][1]]), "F"), L = c(as.character(state.history[,tmp.path.probability[[2]][1]]), "L"))
}

# selecting the most probable path
viterbi.path <- as.character(state.history[,c("F", "L")[which(max(prob.history[length(symbol.sequence), ]) == prob.history[length(symbol.sequence), ])]])

If we apply our implementation to the data in the previous post, we can get the idea how well can HMM reconstruct the real history.

viterbi.table <- table(viterbi.path == real.path)
cat(paste(round(viterbi.table["TRUE"] / sum(viterbi.table) * 100, 2), "% accuracy\n", sep = ""))
# 71.33% accuracy

Cheers, mintgene.

hmm: implementation of viterbi algorithm (Durbin, 1998) Part 1

Posted on 28/01/2012 by tpakozdi

Example in the mentioned book goes as following – dishonest casino uses two types of dice. Fair one that has equal probability of landing on either side (1/6), and the loaded one with 50% chance for getting 6. Your task is to figure out which die has been used (states) just based on the sequence of the outcomes (symbols).

Notice that we convert all probabilities in log scale. Viterbi algorithm selects the most probable path that can have very low value by the end of the run. Thus, conversion to appropriate scale helps avoid calculation issues.

data – http://pastebin.com/NzBE4Fm1


# we'll represent loaded die as "L", and the fair one as "F"
states <- c("F", "L")

# following matrix defines the probability of switching the die
transition.matrix <- t(matrix(data = log2(c(0.95, 0.05, 0.1, 0.9)), nrow = 2, ncol = 2, dimnames = list(c("F", "L"), c("F", "L"))))

# emission probabilities tell you what is the change of landing on each side given that the particular die is selected
emission.matrix <- matrix(data = log2(c(rep(1/6, 6), c(rep(1/10, 5), 1/2))), nrow = 6, ncol = 2, dimnames = list(seq(1,6), c("F", "L")))

# initial probabilities define the chance of starting outcome (in our case we are equally likely to start with either states)
initial.matrix <- matrix(data = log2(c(0.5, 0.5)), nrow = 2, ncol = 1, dimnames = list(c("F", "L"), ""))

After we defined the model, we need to initialize two object that will keep the track of probability history and state path (Pi) during the recursion process.


prob.history  <- data.frame()
state.history <- data.frame()

# we start by calculating the probability of being in particular state given the first symbol and initial matrix
# notice a change in log space - every multiplication is converted to summation
prob.history  <- rbind(prob.history, unlist(lapply(states, function(k) {
    initial.matrix[k, 1] + emission.matrix[symbol.sequence[1], k]
})))

state.history <- rbind(state.history, states)

colnames(prob.history)  <- c("F", "L")
colnames(state.history) <- c("F", "L")

Second part coming soon! mintgene

heatmaps: controlling the color representation with set data range

Posted on 27/01/2012 by tpakozdi

Often you want to set the fixed colors for particular range of your dataset to be sure that the visual output is correctly represented. This is particularly useful for time series data, where the range or your dataset might drastically change during the course of the simulation.

To do that in R, we need to set the “breaks” parameter in plotting functions such as image or heatmap.2.

# gplots contains the heatmap.2 function
library(gplots)

# create 50x10 matrix of random values from [-1, +1]
random.matrix  <- matrix(runif(500, min = -1, max = 1), nrow = 50)

# following code limits the lowest and highest color to 5%, and 95% of your range, respectively
quantile.range <- quantile(random.matrix, probs = seq(0, 1, 0.01))
palette.breaks <- seq(quantile.range["5%"], quantile.range["95%"], 0.1)

# use http://colorbrewer2.org/ to find optimal divergent color palette (or set own)
color.palette  <- colorRampPalette(c("#FC8D59", "#FFFFBF", "#91CF60"))(length(palette.breaks) - 1)

heatmap.2(

    random.matrix,

    dendrogram = "row",
    scale      = "none",
    trace      = "none",
    key        = FALSE,
    labRow     = NA,
    labCol     = NA,

    col    = color.palette,
    breaks = palette.breaks
)

Enjoy plotting! mintgene