Discovering Clusters in Motion
Our primary goal is to find clusters within a sequence database, while at the same time estimating HMMs for each cluster. Stated differently, we want to assign sequences to those clusters whose HMMs best explain them. The problem of assigning a particular sequence to each cluster can be approached in two ways: via the k-means approach, or via the Expectation Maximization (EM) approach. In the k-means approach, a particular sequence is assigned to a single cluster, and then only those sequences that are assigned to a particular cluster are used for re-estimating its HMM parameters. These two steps are repeated until convergence. In the EM approach, a particular sequence can be partially assigned to all clusters, with the degree of cluster membership determined by the a posteriori probability, which depends via Bayes rule on the cluster a priori probability and the data likelihood. The cluster HMM parameters are then re-estimated, but in contrast to k-means, the parameter estimates of a single HMM cluster are influenced by all the observation sequences with the corresponding a posteriori probabilities. These two steps are repeated until convergence.
In the previous paragraph, we have assumed that the number of mixture components M is known. The problem of estimating the "correct'' number of clusters is a difficult one: a full Bayesian solution for obtaining the posterior probability on M, requires a complex integration over the HMM parameter space, as well as knowledge about the priors on the mixture parameters and about the priors on M itself. Often this integration cannot be solved in closed form, and Monte-Carlo methods and other approximation methods are used to evaluate it. These methods however are computationally intensive. Other methods that sacrifice some accuracy for efficiency, are the penalized likelihood approaches, where the log-likelihood term is penalized by subtraction of a complexity term. We use such a method that tries to find a model with Minimum Description Length (MDL).
Last Modified: January 14, 2003