When I first started dating my wife, one of the little things that impressed me about her was that she would always know when it was about to rain. We were living in Pittsburgh at the time, where sudden downpours and thunderstorms are common. This ability to predict the weather was a valuable skill to have. Finally, I asked her how she was so good at predicting a storm. “The leaves on trees,” she replied. “They turn upside down right before the storm hits. It’s something I learned in Girl Scouts.”
Building upon my wife’s storm predicting prowess, let’s assume there are three possible impending weather states: sun, clouds, and rain. We can’t directly observe impending weather, because it’s always in the future. The best we can do is observe other events and infer a correlation with the impending weather.
Let’s begin with a state chart similar to the one used in part one of the series. Each grid block represents the probability of transitioning from one underlying state to another. For instance, the probability of transitioning from impending sunny weather to impending rain is .1, or 10%. Unfortunately, this is a relationship we can’t directly observe. (Note that only the rows must add up to one, because we’re expressing the transition of row states to column states. We’re making no such statement about column states transitioning to row states.)
|NEXT STATE >
CURRENT STATE v
|Impending Sun||Impending Clouds||Impending Rain|
Now let’s look at another chart. This one represents the probability of something we can observe, namely whether leaves on trees are upturned, as dependent on the behavior of an impending weather state. Based on our data, the underlying state ‘impending rain’ occurs in tandem with leaves being upturned .9, or 90% of the time.
CURRENT STATE v
|Leaves Normal||Leaves Upturned|
This is the essential distinguishing feature of HMMs. The state transition matrix (based on the first graph above) from an OMM becomes the underlying state transition matrix. These states are then linked to observable events through a second matrix, which expresses how likely an observation is to occur dependent on the current underlying state. It’s a fairly simple concept, but allows for enormous adaptability when compared with an OMM.
Let’s say we’re developing a streaming video web application that has a user interface (UI) similar to Netflix’s current desktop web incarnation. Each row of video choices relates to a specific category. The category row extends off the screen with additional video choices. To navigate the category, a user must mouse over the category row, and scroller arrows appear on the left and right ends. Clicking/pressing on those arrows scrolls the list. If someone is using this UI for the first time, it may not be immediately obvious how to scroll through a list. If after 10 seconds they are still confused about how the navigation works, they may decide that our app isn’t worth the hassle. It would be helpful to provide a friendly tip if it seems like their intent is to scroll though a category, but they’re struggling to figure out how to do so.
In this situation, ‘is the user struggling to figure out how to scroll through a category list?’ is our underlying, unobserved event. We can’t get inside the user’s mind to determine the answer with 100% certainty. When you think about it, though, people deal with this problem all the time. For instance, if I am having a conversation with you, I have no way of definitively knowing what your mood is. I rely on inference based on what I can actually observe, like your facial expression or body language. We’re going to use a HMM to give our application similar inference abilities.
There’s a catch, though. Put yourself in the shoes of a confused user. If we directly divide our underlying state into confused/not confused, and try to determine this state every second, is that really an accurate model of a person? If I was a confused user, I wouldn’t be switching my internal confusion state every second. Even if I did, it wouldn’t be measurable over just a single second of mouse movement. I’d probably have a general, persistent feeling of confusion which would manifest itself in my collective mouse movement over the entire 10 second period we’re measuring. All I care about is how closely the user’s mouse behavior matches our model of a persistently confused user over the entire period. Fortunately, HMMs are very good at solving this problem when used in conjunction with the forward algorithm described below.
If our HMM had an internal monologue, it doesn’t really matter what story it’s telling itself in order to create output that simulates a confused user. Perhaps it believes the output is a brilliant piece of performance art, or a love ballad in mouse position form. I am fine with this, so long as it satisfies our purposes. Just keep in mind that for other situations, more closely managing the internal states can be very important.
An important concept when working with HMMs is imprinting, which is more properly referred to in computer science as re-estimation. The basic idea is to take an observation data set and use a re-estimation algorithm to adjust our model matrices so they are strongly attuned to the data set, as well as similar ones. In other words, we’re training the HMM to recognize certain data patterns. The method I use to accomplish this is based on the Baum-Welch expectation maximization algorithm. It works by finding the local maximum value and associated argument for each index in our model matrices.
I created a demonstration of how HMM training works based on our streaming video app concept. It allows you to record 10 second sets of mouse movement data, imprint the data onto your HMM, and record new sets of mouse movement to compare with the imprinted HMM. You’ll be able to see how an HMM model scores the ‘closeness’ of data.
I recommend using your browser’s debugger to insert a breakpoint at the end of onConditionBtnClick() in trainer.js. Take a look at the multidimensional array named ‘model’. Indices 0, 1, and 2 store ‘Beginning underlying state probabilities’, ‘underlying state transition probabilities’, and ‘observation probabilities dependent on state’, respectively. Study how the values change depending on the data you imprint on your HMM.
We can define the certainty that a user is confused by how well our HMM, which has been conditioned to recognize a confused user’s mouse movement, recognizes the user data we are sending it. The higher the value, the more likely that a user is confused. Our probability value is calculated using the forward algorithm.
Forward Algorithm: Answers the question “Given a set of observations, what’s the likelihood we’ll end up in state X at time t in the observation sequence given our model?”
This allows us to acquire the probability of an observation sequence being produced by our model. The forward algorithm is closely related to the backward algorithm.
Backward Algorithm: If we are in state X at time t in an observation sequence, what is the probability that we will observe the remaining observations given our model?
I use the forward algorithm in the HMM Trainer example to calculate the closeness of user input to our model. However, the backward algorithm would work just a well for this purpose. Both are also used in the Baum-Welch re-estimation algorithm. When combined together, they allow us to ‘bracket’ an individual state at any time t, and discover its probability at that specific point in the observation sequence.
For the purpose of our example, any score above 50 could be considered a high probability of user confusion. Remember that this threshold is subjective, and should be tweaked based on additional data and user feedback. It’s also worth examining your combined data to determine if what we’re measuring is even a strong indicator of user confusion. Perhaps additional behaviors, such as number of mouse clicks/touches per second, are needed to make a better inference of user confusion. If so, multiple HMMs can be used, with the results combined as you see fit. This is all very much an art, albeit one backed by numbers.
For a ‘real’ app, your data set should be much larger. Herein lies the beauty of the web. By recording user behavior on your app/site, you can have a big data set even if you’re a one-person team.
There is one major caveat: Recording user behavior beyond the norm without first verifying that a person is consenting to it can be considered a violation of privacy. One only needs to browse the front page of a news website to understand that this is an issue people get very passionate about. If you’re going to record in-depth user behavior, I recommend an easy to understand acceptance/rejection modal. Explain exactly what information will be stored, if said information will be linked to the user in any way, and why. Beyond satisfying ethical issues, this builds trust. Trust is always a good thing.
If you imprint a data set which is very evenly distributed, your model won’t be able to determine an observation match with a high probability no matter what. For example, let’s say you add 20 recordings to the conditioning stack. Recording one is [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], recording two is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] and so on, through [19, 19, 19, 19, 19, 19, 19, 19, 19, 19]. By imprinting this data set, your HMM will attempt to create a model which will return an equally low match probability no matter the observation set when used in the forward algorithm.
Note that as our observation sets get bigger, the probability that the entire set will be generated by our HMM becomes exponentially remote. Even results that we would consider “close” to the imprint data could be something like 1.8435830716181838e-7.
To account for this, I’m normalizing (read: making understandable for humans) the results from our forward algorithm by using their associated logarithms. This has the effect of greatly compressing the difference in probability value of less likely results, and greatly increasing said difference between very likely results. In short, this step makes the forward-backward algorithm’s results much easier to understand and work with.
In addition to the HMM Trainer I’ve included a short example of how to initialize an HMM in JSFiddle:
As you may have determined, I’ve glossed over much of the underlying math for HMMs and their associated algorithms. There are many good resources more focused on this side of things. In particular, Lawrence Rabiner’s classic paper ‘A Tutorial on Hidden Markov Models’ was an immense help in writing this article.
Baum, L. E.; Petrie, T. 1966. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics 37 (6): 1554–1563.
Fosler-Lussier, Eric. December 1998. Markov Models and Hidden Markov Models: A Brief Tutorial. International Computer Science Institute.
Rabiner, Lawrence R. February 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2.
Wunsch, Holger. Hidden Markov Model Implementation in Java. http://pastebin.com/6prc53ZE
…and Wikipedia, of course.