Next: Estimating the means of Up: Bayesian Learning 2: EM Previous: The EM Algorithm

## Derivation of the Baum-Welch algorithm for HMMs

Consider our formulation of HMMs as before, i.e. is the alphabet, is the output sequence, st is the state at time t, A is the transition probability aij = P(st=j|st-1=i) and B is the state-symbol probability for observing a given output symbol bjk = P(Ot = vk|st=s).

Let Z = f(A,B) be a random variable such that xijk = f(A,B) denotes the joint event consisting of the transition from state i to j and the observation of the vk from state j. So the observed output is completely specified by a sequence . Let pijk = P(xijk).

Then the parameter of interest, , is exactly Z since it represents the totality of transition and state-symbol probabilities.

We wish to find a such that

The expression on the left is called Baum's auxiliary function. We have already seen that if the above inequality holds, then P'(O) > P(O). So maximizing the LHS wrt. is equivalent to maximizing P(O).

To do this, we first find a stationary point of the LHS subject to the constraints . Using Lagrange multipliers, the function to maximize with respect to the variables pijk' is thus . Then

 = (16)

Since X completely specifies O (recall that xijk also specifies that symbol vk is observed in state j),

If cijk denotes the number of times xijk occurs in X, then

Hence,

We can also satisfy ourselves that

and so the stationary point obtained by setting the first derivative to zero is indeed a maximum.

Substituting in (17) and setting to zero,

If we now write , then

where Ki can be interpreted as a normalizing constant. The RHS can now be summed over each of the times by grouping the sequences such that in each group xitjtkt = xijk.

The RHS above is just the reestimated probability of xijk because prior to normalization, the sum yields the expected frequency of the xijk computed using the current values of the pijk.

We can now proceed to calculate each pijk' as above. The set of all such pijk's constitutes our which we can use in place of during the next iteration.

The EM theorem tells us that if we continue in this fashion, then after each iteration

1.
either the reestimated is more likely than the original in the sense that or

2.
we have reached a stationary point of the likelihood function w at which

This is an algorithm, because it is iterative. There is an expectation step (aka estimation step) and a maximization step (aka modification step) in each iteration.

Note that the algorithm is not guaranteed to converge at the global maximum. But it has been found to yield good results in practice.

Next: Estimating the means of Up: Bayesian Learning 2: EM Previous: The EM Algorithm
Anand Venkataraman
1999-09-16