When I first started dating my wife, one of the little things that impressed me about her was that she would always know when it was about to rain. We were living in Pittsburgh at the time, where sudden downpours and thunderstorms are common. This ability to predict the weather was a valuable skill to have. Finally, I asked her how she was so good at predicting a storm. “The leaves on trees,” she replied. “They turn upside down right before the storm hits. It’s something I learned in Girl Scouts.”

Humans excel at determining the probability of underlying events based on observation. In this article, we’ll examine how to introduce this type of behavior into our JavaScript web applications using Hidden Markov Models (HMMs).

In part one of my Markov Models in JavaScript series, I explained how to use Observable Markov Models (OMMs) to calculate the probability of a chain of observable events occurring in the future. But what do we do if the event chain we’re interested in isn’t directly observable?

Building upon my wife’s storm-predicting prowess, let’s assume there are three possible impending weather states: sun, clouds, and rain. We can’t directly observe impending weather, because it’s always in the future. The best we can do is observe other events and infer a *correlation* with the impending weather.

Let’s begin with a state chart similar to the one used in part one of the series. Each grid block represents the probability of transitioning from one underlying state to another. For instance, the probability of transitioning from impending sunny weather to impending rain is .1, or 10%. Unfortunately, this is a relationship we can’t directly observe. (Note that only the rows must add up to one, because we’re expressing the transition of row states to column states. We’re making no such statement about column states transitioning to row states.)

| CURRENT STATE v \ NEXT STATE > | Impending Sun | Impending Clouds | Impending Rain |
|---|---|---|---|
| Impending Sun | .7 | .2 | .1 |
| Impending Clouds | .3 | .4 | .3 |
| Impending Rain | .1 | .4 | .5 |

Now let’s look at another chart. This one represents the probability of something we *can* observe, namely whether leaves on trees are upturned, dependent on the underlying impending weather state. Based on our data, the underlying state ‘impending rain’ occurs in tandem with upturned leaves .9, or 90%, of the time.

| CURRENT STATE v \ OBSERVATION > | Leaves Normal | Leaves Upturned |
|---|---|---|
| Impending Sun | .8 | .2 |
| Impending Clouds | .7 | .3 |
| Impending Rain | .1 | .9 |

This is the essential distinguishing feature of HMMs. The state transition matrix from an OMM (the first chart above) becomes the underlying state transition matrix. These states are then linked to observable events through a second matrix, which expresses how likely an observation is to occur dependent on the current underlying state. It’s a fairly simple concept, but it allows for enormous adaptability compared with an OMM.
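To make this concrete, here’s a sketch of how the weather model’s two matrices (plus a starting distribution) might be represented in JavaScript. The array layout and the uniform starting probabilities are assumptions of mine, not something the charts specify:

```javascript
// A sketch of the weather HMM as three nested arrays.
// States: 0 = Impending Sun, 1 = Impending Clouds, 2 = Impending Rain
// Observations: 0 = Leaves Normal, 1 = Leaves Upturned
const model = [
  // beginning underlying state probabilities (assumed uniform here)
  [1 / 3, 1 / 3, 1 / 3],
  // underlying state transition probabilities (each row sums to 1)
  [
    [0.7, 0.2, 0.1], // from Impending Sun
    [0.3, 0.4, 0.3], // from Impending Clouds
    [0.1, 0.4, 0.5]  // from Impending Rain
  ],
  // observation probabilities dependent on the underlying state
  [
    [0.8, 0.2], // Impending Sun: leaves normal / leaves upturned
    [0.7, 0.3], // Impending Clouds
    [0.1, 0.9]  // Impending Rain
  ]
];

// Sanity check: every probability row should sum to 1.
const rowsOk = model[1].concat(model[2], [model[0]])
  .every(row => Math.abs(row.reduce((a, b) => a + b, 0) - 1) < 1e-9);
console.log(rowsOk); // true
```

Note that only the rows are constrained to sum to 1, just as in the charts.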

With that out of the way, let’s examine the implementation of an HMM for something more applicable to front-end JavaScript web development.

Let’s say we’re developing a streaming video web application with a user interface (UI) similar to Netflix’s current desktop web incarnation. Each row of video choices relates to a specific category, and the row extends off the screen with additional video choices. To navigate a category, a user must mouse over the category row, at which point scroller arrows appear on the left and right ends. Clicking/pressing those arrows scrolls the list. If someone is using this UI for the first time, it may not be immediately obvious how to scroll through a list. If after 10 seconds they are still confused about how the navigation works, they may decide that our app isn’t worth the hassle. It would be helpful to provide a friendly tip if it seems like their intent is to scroll through a category, but they’re struggling to figure out how to do so.

In this situation, ‘is the user struggling to figure out how to scroll through a category list?’ is our underlying, unobserved event. We can’t get inside the user’s mind to determine the answer with 100% certainty. When you think about it, though, people deal with this problem all the time. For instance, if I am having a conversation with you, I have no way of definitively knowing what your mood is. I rely on inference based on what I *can* actually observe, like your facial expression or body language. We’re going to use a HMM to give our application similar inference abilities.

What could be a useful observation to help us determine if a user is struggling to figure out category scrolling? Preferably, it would be one that we have easy access to as JavaScript developers. Let’s try ‘change in mouse position per second’. This will be our observable state.
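The HMM Trainer handles this for you, but as a rough sketch of what such an observable state could look like in code, here is one hypothetical way to turn per-second mouse travel into the small set of discrete symbols an HMM consumes. The function name and bin edges are my own inventions, not taken from the trainer:

```javascript
// Hypothetical quantizer: map per-second mouse travel (in pixels) to a
// discrete observation symbol. The bin edges are purely illustrative.
function quantizeMovement(pixelsPerSecond) {
  if (pixelsPerSecond < 5) return 0;    // essentially idle
  if (pixelsPerSecond < 100) return 1;  // slow, searching movement
  if (pixelsPerSecond < 400) return 2;  // moderate movement
  return 3;                             // rapid, possibly frustrated movement
}

// Ten one-second samples become a ten-symbol observation sequence.
const samples = [2, 350, 80, 500, 410, 90, 3, 600, 450, 120];
const observations = samples.map(quantizeMovement);
console.log(observations); // [0, 2, 1, 3, 3, 1, 0, 3, 3, 2]
```

In the browser, the raw per-second distances would come from accumulating `mousemove` deltas on a one-second timer.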

There’s a catch, though. Put yourself in the shoes of a confused user. If we directly divide our underlying state into confused/not confused, and try to determine this state every second, is that really an accurate model of a person? If I were a confused user, I wouldn’t be switching my internal confusion state every second. Even if I did, it wouldn’t be measurable over just a single second of mouse movement. I’d probably have a general, persistent feeling of confusion which would manifest itself in my collective mouse movement over the entire 10-second period we’re measuring. All I care about is how closely the user’s mouse behavior matches our model of a persistently confused user *over the entire period*. Fortunately, HMMs are very good at solving this problem when used in conjunction with the forward algorithm described below.

If our HMM had an internal monologue, it doesn’t really matter what story it’s telling itself in order to create output that simulates a confused user. Perhaps it believes the output is a brilliant piece of performance art, or a love ballad in mouse position form. I am fine with this, so long as it satisfies our purposes. Just keep in mind that for other situations, more closely managing the internal states can be very important.

An important concept when working with HMMs is imprinting, more properly referred to in computer science as re-estimation. The basic idea is to take an observation data set and use a re-estimation algorithm to adjust our model matrices so they are strongly attuned to that data set, as well as similar ones. In other words, we’re training the HMM to recognize certain data patterns. The method I use to accomplish this is based on the Baum-Welch expectation-maximization algorithm, which iteratively adjusts the values in our model matrices toward a local maximum of the likelihood that the model would produce the training observations.

I created a demonstration of how HMM training works based on our streaming video app concept. It allows you to record 10-second sets of mouse movement data, imprint the data onto your HMM, and record new sets of mouse movement to compare with the imprinted HMM. You’ll be able to see how an HMM model scores the ‘closeness’ of data.

I recommend using your browser’s debugger to insert a breakpoint at the end of onConditionBtnClick() in trainer.js. Take a look at the multidimensional array named ‘model’. Indices 0, 1, and 2 store ‘Beginning underlying state probabilities’, ‘underlying state transition probabilities’, and ‘observation probabilities dependent on state’, respectively. Study how the values change depending on the data you imprint on your HMM.

We can define the certainty that a user is confused by how well our HMM, which has been conditioned to recognize a confused user’s mouse movement, recognizes the user data we are sending it. The higher the value, the more likely that a user is confused. Our probability value is calculated using the forward algorithm.

**Forward Algorithm**: Answers the question “Given a set of observations, what’s the likelihood we’ll end up in state X at time t in the observation sequence given our model?”

This allows us to acquire the probability of an observation sequence being produced by our model. The forward algorithm is closely related to the backward algorithm.
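A minimal, unscaled version of the forward algorithm can be sketched in a few lines. The model layout here follows the three-matrix array structure described for the trainer’s ‘model’ array, and the weather numbers from the charts are used for the check at the end (the uniform starting distribution is my own assumption):

```javascript
// Unscaled forward algorithm: returns P(observation sequence | model).
// model is assumed to be [initialProbs, transitionMatrix, emissionMatrix].
function forward(model, observations) {
  const [initial, transition, emission] = model;
  // alpha[i]: probability of the observations seen so far AND being in state i
  let alpha = initial.map((p, i) => p * emission[i][observations[0]]);
  for (let t = 1; t < observations.length; t++) {
    alpha = alpha.map((_, j) => {
      // probability of arriving in state j from any previous state...
      const arrive = alpha.reduce((sum, a, i) => sum + a * transition[i][j], 0);
      // ...times the probability that state j emits the observation at time t
      return arrive * emission[j][observations[t]];
    });
  }
  // Sum over all possible ending states for the total sequence probability.
  return alpha.reduce((a, b) => a + b, 0);
}

// The weather model from the charts (uniform starting distribution assumed):
const weather = [
  [1 / 3, 1 / 3, 1 / 3],
  [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.4, 0.5]],
  [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
];

// Probability of seeing upturned leaves (symbol 1) two observations in a row:
const p = forward(weather, [1, 1]);
console.log(p); // ≈ 0.241
```

Production implementations typically scale or work in log space at each step to avoid underflow; this sketch skips that for clarity.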

**Backward Algorithm**: If we are in state X at time t in an observation sequence, what is the probability that we will observe the remaining observations given our model?

I use the forward algorithm in the HMM Trainer example to calculate the closeness of user input to our model. However, the backward algorithm would work just as well for this purpose. Both are also used in the Baum-Welch re-estimation algorithm. When combined, they allow us to ‘bracket’ an individual state at any time t, and discover its probability at that specific point in the observation sequence.
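For completeness, an unscaled backward pass can be sketched the same way, using the same assumed [initial, transition, emission] model layout. Folding the base case back to time zero and combining with the starting distribution and first emission recovers the same total sequence probability the forward algorithm computes:

```javascript
// Unscaled backward algorithm.
// beta[i]: probability of the *remaining* observations, given state i at time t.
function backward(model, observations) {
  const [initial, transition, emission] = model;
  // Base case: at the final time step there is nothing left to observe.
  let beta = initial.map(() => 1);
  for (let t = observations.length - 2; t >= 0; t--) {
    beta = beta.map((_, i) =>
      // sum over every state j we could transition to next
      beta.reduce(
        (sum, b, j) =>
          sum + transition[i][j] * emission[j][observations[t + 1]] * b,
        0
      )
    );
  }
  // Fold in the starting distribution and the first emission to get
  // P(observation sequence | model), matching the forward result.
  return initial.reduce(
    (sum, p, i) => sum + p * emission[i][observations[0]] * beta[i],
    0
  );
}
```

Run against the weather model with the same observation sequence, this yields the same probability as the forward pass.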

For the purpose of our example, any score above 50 could be considered a high probability of user confusion. Remember that this threshold is subjective, and should be tweaked based on additional data and user feedback. It’s also worth examining your combined data to determine if what we’re measuring is even a strong indicator of user confusion. Perhaps additional behaviors, such as number of mouse clicks/touches per second, are needed to make a better inference of user confusion. If so, multiple HMMs can be used, with the results combined as you see fit. This is all very much an art, albeit one backed by numbers.

For a ‘real’ app, your data set should be much larger. Herein lies the beauty of the web. By recording user behavior on your app/site, you can have a big data set even if you’re a one-person team.

There is one **major** caveat: recording user behavior beyond the norm without first verifying that a person consents to it can be considered a violation of privacy. One only needs to browse the front page of a news website to understand that this is an issue people get very passionate about. If you’re going to record in-depth user behavior, I recommend an easy-to-understand acceptance/rejection modal. Explain exactly what information will be stored, whether said information will be linked to the user in any way, and why. Beyond satisfying ethical issues, this builds trust. Trust is always a good thing.

If you imprint a data set which is very evenly distributed, your model won’t be able to determine an observation match with a high probability no matter what. For example, let’s say you add 20 recordings to the conditioning stack. Recording one is [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], recording two is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] and so on, through [19, 19, 19, 19, 19, 19, 19, 19, 19, 19]. By imprinting this data set, your HMM will attempt to create a model which will return an equally low match probability no matter the observation set when used in the forward algorithm.

Note that as our observation sets get bigger, the probability that the entire set will be generated by our HMM becomes exponentially remote. Even results that we would consider “close” to the imprint data could be something like 1.8435830716181838e-7.

To account for this, I’m normalizing (read: making understandable for humans) the results from our forward algorithm by using their associated logarithms. Working in log space turns astronomically small probabilities that differ by many orders of magnitude into modest, easily compared numbers. In short, this step makes the forward algorithm’s results much easier to understand and work with.
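As a quick sketch of why this helps, compare two raw forward probabilities to their logarithms (the specific numbers are illustrative):

```javascript
// Two hypothetical forward-algorithm results: one reasonably close to the
// imprinted data, one a very poor match. On a raw scale both look like zero.
const closeMatch = 1.8435830716181838e-7;
const poorMatch = 1e-40;

// On a log scale, the gap between them becomes obvious at a glance.
console.log(Math.log(closeMatch)); // ≈ -15.5
console.log(Math.log(poorMatch));  // ≈ -92.1
```

From there it’s easy to map log scores onto whatever human-friendly range (such as 0 to 100) suits your UI.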

In addition to the HMM Trainer, I’ve included a short JSFiddle example of how to initialize an HMM.

As you may have determined, I’ve glossed over much of the underlying math for HMMs and their associated algorithms. There are many good resources more focused on this side of things. In particular, Lawrence Rabiner’s classic paper ‘A Tutorial on Hidden Markov Models’ was an immense help in writing this article.

Finally, I didn’t even touch on the issue of finding the most likely sequence of underlying states that would produce a given observation sequence. This was outside the scope of our problem, but can be solved using the Viterbi algorithm. You can find the associated method in the HMM JavaScript module. I mention it because there are a ton of applications across various scientific fields. Say we wanted to link the observation of sound waves to an underlying state matrix of words, or solve quantum communication…

**References:**

Baum, L. E.; Petrie, T. 1966. *Statistical Inference for Probabilistic Functions of Finite State Markov Chains*. The Annals of Mathematical Statistics 37 (6): 1554–1563.

Fosler-Lussier, Eric. December 1998. *Markov Models and Hidden Markov Models: A Brief Tutorial*. International Computer Science Institute.

Rabiner, Lawrence R. February 1989. *A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition*. Proceedings of the IEEE, Vol. 77, No. 2.

Wunsch, Holger. *Hidden Markov Model Implementation in Java*. http://pastebin.com/6prc53ZE

…and Wikipedia, of course.


Ideas have a funny way of spreading. Sometimes a concept that has become ubiquitous in a particular field of study can take years to be widely understood in another field that could have equal use for it.

Variations of Markov Models have found many uses in computer programming recently, from stock trading algorithms to speech recognition systems. However, there is a dearth of solid explanation for client-side JavaScript implementation. I’d imagine this has to do with the fact that JavaScript in web browsers was historically too slow, and lacked the low-level access to audio/video input, for many of the common uses of these probability models. Fortunately, with massive cross-browser performance improvements, and new multimedia APIs such as WebRTC in the pipeline, these limitations are becoming problems of the past.

We’ll begin this series with the simplest Markov Model data structure, which all others are built upon: the Observable Markov Model (OMM). Also known as a Markov Chain, an OMM is a way of estimating the probability of a future series of events, where each event depends only on the event immediately preceding it. If a data structure exhibits this behavior, we can say it possesses the Markov Property.

To understand why this is useful, let’s consider a practical example relating to client-side web development. Say we have a bandwidth-heavy web application. It would be nice to lazy load content so the initial wait time is manageable. However, we need to balance this with a responsive user interface. When a user attempts to access a certain functionality, they shouldn’t have to wait an inordinate amount of time for it to load for the first time.

One solution is to try to predict what functionality the user will access next. This is where an OMM comes into play. If we have an accurate model of the likelihood that a user will transition from the current state to a new state, we will have much better odds of correctly guessing what data our app will need to load ahead of time.

Let’s assume that QA has been running usability tests on our app, and has compiled statistics showcasing the chance of transitioning from one app state to another. The following is a chart based on their findings:

| CURRENT STATE v \ NEXT STATE > | 2 Megabyte Animated GIF | 500 Photos | Video |
|---|---|---|---|
| 2 Megabyte Animated GIF | .2 | .65 | .15 |
| 500 Photos | .1 | .8 | .1 |
| Video | .7 | .25 | .05 |

Think of each decimal number in the chart as the percentage probability of transitioning from our current state to the next state. For example, if we are in state ‘2 Megabyte Animated GIF’, there is a .65, or 65%, chance of transitioning to state ‘500 Photos’. Notice how the content of each row adds up to 1. This is a way of saying that we’re examining an entire mutually exclusive set of outcomes. In other words, there is a 100% chance of the current state transitioning to one of these three states.
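In JavaScript, the chart is just a two-dimensional array, and looking up a single transition probability is one pair of indices. (The state ordering and variable names here are my own choices, not part of the QA data.)

```javascript
// Row = current state, column = next state.
// 0 = 2 Megabyte Animated GIF, 1 = 500 Photos, 2 = Video
const transitions = [
  [0.2, 0.65, 0.15], // from Animated GIF
  [0.1, 0.8, 0.1],   // from 500 Photos
  [0.7, 0.25, 0.05]  // from Video
];

const GIF = 0, PHOTOS = 1, VIDEO = 2;

// Chance of moving from the GIF view to the photos view:
console.log(transitions[GIF][PHOTOS]); // 0.65

// Each row covers a mutually exclusive set of outcomes, so it sums to 1.
const rowSum = transitions[GIF].reduce((a, b) => a + b, 0);
console.log(Math.abs(rowSum - 1) < 1e-9); // true
```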

Now let’s examine what the ‘chain’ in Markov Chain is all about. We’re going to use an OMM to calculate the likelihood of events occurring several states in the future, instead of just one.

Say we are in state ‘500 Photos’. State ‘Video’ is an especially large download. We’d like to begin loading it even if the user is still two states away from navigating to it, provided there is greater than a 15% probability they will end up in that section at that time.

Let’s break the problem down. We begin in state ‘500 Photos’. It doesn’t matter what the state after ‘500 Photos’ is; it can be any of the three. The final state will always be ‘Video’. For any one chain of probabilities, we multiply the individual likelihoods together. Then we add the three separate chains together to get our total probability. In pseudo-equation form:

(500 Photos -> Video) * (Video -> Video) +
(500 Photos -> Animated GIF) * (Animated GIF -> Video) +
(500 Photos -> 500 Photos) * (500 Photos -> Video)

= (.1)(.05) + (.1)(.15) + (.8)(.1)

= .1

Based on our model, we have a 10% chance of navigating to ‘Video’ two navigation steps from now. This is below our threshold for lazy loading.
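The helper library does this for you, but the arithmetic above can also be sketched directly. A small function multiplies one chain’s transitions together, then the three possible chains are summed (state indices and names here are my own, not the library’s API):

```javascript
// Row = current state, column = next state.
// 0 = 2 Megabyte Animated GIF, 1 = 500 Photos, 2 = Video
const transitions = [
  [0.2, 0.65, 0.15],
  [0.1, 0.8, 0.1],
  [0.7, 0.25, 0.05]
];

// Probability of one specific chain of states occurring in order.
function chainProbability(matrix, chain) {
  let p = 1;
  for (let i = 0; i < chain.length - 1; i++) {
    p *= matrix[chain[i]][chain[i + 1]];
  }
  return p;
}

const PHOTOS = 1, VIDEO = 2;

// Sum over every possible middle state: 500 Photos -> any -> Video.
let total = 0;
for (let mid = 0; mid < transitions.length; mid++) {
  total += chainProbability(transitions, [PHOTOS, mid, VIDEO]);
}
console.log(total.toFixed(2)); // "0.10"
```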

We can infer another key advantage of Markov Models from this example. To express the probabilities of all possible states for any length of Markov Chain, we only need s^2 numbers stored, with ‘s’ representing the number of possible states. In contrast, a statistical model relying on knowledge of every possible past state would require s^n space, where ‘n’ represents the chain length. Obviously, the non-Markov method becomes very large very quickly as chain length increases.

I’ve created a helper library (the GitHub repository is located here) which contains a method that returns the probability of a single Markov Chain occurring. Just pass in a chain and probability model as parameters. Let’s look at how to calculate the above solution using the library:

It’s important to understand that the accuracy of our prediction is only as good as the statistics collected, and the relevance of our underlying Markov Assumption (for exhibit A of what can happen if this isn’t the case, see this excellent Wired article on the probability algorithm that contributed to the 2008 financial crisis). If, in fact, there are much more important underlying trends that affect state transition than just the previous state, we’re in trouble. Regarding statistics collected, we can always augment our QA stats with locally collected and stored user navigation information to create a more accurate model.

*Check back soon for part 2 of the series. We’ll examine how to increase the usefulness of the core OMM data structure by incorporating the aforementioned underlying trends to influence state transitions.*