
Consider our formulation of HMMs as before, i.e.
*V* = {*v*_{1}, ..., *v*_{M}} is the alphabet,
*O* = *O*_{1}*O*_{2}...*O*_{T} is the output sequence,
*s*_{t} is the state at time *t*, *A* is the transition probability
matrix with *a*_{ij} = *P*(*s*_{t}=*j*|*s*_{t-1}=*i*), and *B* is the state-symbol probability
matrix giving the probability of observing a given output symbol,
*b*_{jk} = *P*(*O*_{t} = *v*_{k}|*s*_{t}=*j*).

Let
*Z* = *f*(*A*,*B*) be a random variable whose value
*x*_{ijk}
denotes the joint event consisting of the transition from state *i* to
state *j* and the observation of the symbol *v*_{k} from state *j*. So the observed
output is completely specified by a sequence
*X* = *x*_{i_1 j_1 k_1} *x*_{i_2 j_2 k_2} ... *x*_{i_T j_T k_T}.
Let
*p*_{ijk} = *P*(*x*_{ijk}).
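As a concrete illustration, the joint-event probabilities *p*_{ijk} = *a*_{ij} *b*_{jk} can be assembled for a small HMM. The two-state, two-symbol matrices below are made-up values, not from the text; the sketch checks that, for each source state *i*, the joint events partition the next step.

```python
# Toy two-state, two-symbol HMM (made-up numbers for illustration only).
A = [[0.7, 0.3],   # a_ij = P(s_t = j | s_{t-1} = i)
     [0.4, 0.6]]
B = [[0.9, 0.1],   # b_jk = P(O_t = v_k | s_t = j)
     [0.2, 0.8]]

# p[i][j][k] = P(x_ijk) = a_ij * b_jk: probability of moving from state i
# to state j AND emitting symbol v_k from state j.
p = [[[A[i][j] * B[j][k] for k in range(2)] for j in range(2)]
     for i in range(2)]

# For each source state i, the events x_ijk exhaust what can happen next,
# so their probabilities sum to one.
for i in range(2):
    total = sum(p[i][j][k] for j in range(2) for k in range(2))
    print(f"sum_jk p_{i}jk = {total:.3f}")
```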

Then the parameter of interest,
*w* = {*p*_{ijk}}, is exactly the distribution of *Z*, since it
represents the totality of transition and state-symbol probabilities.

We wish to find a *w*' such that

$$Q(w, w') = \sum_X P(O, X \mid w)\,\log P(O, X \mid w') \;\geq\; \sum_X P(O, X \mid w)\,\log P(O, X \mid w) = Q(w, w).$$

The expression on the left is called Baum's auxiliary function. We have already seen that if the above inequality holds, then

$$P(O \mid w') \;\geq\; P(O \mid w).$$
To do this, we first find a stationary point of the LHS subject to the
constraints

$$\sum_{j,k} p'_{ijk} = 1 \quad \text{for each } i.$$

Using
Lagrange multipliers, the function to maximize with respect to the
variables *p*_{ijk}' is thus

$$F = \sum_X P(O, X \mid w)\,\log P(O, X \mid w') \;+\; \sum_i \lambda_i \Bigl(1 - \sum_{j,k} p'_{ijk}\Bigr).$$
Then

$$\frac{\partial F}{\partial p'_{ijk}} = \sum_X P(O, X \mid w)\,\frac{\partial}{\partial p'_{ijk}} \log P(O, X \mid w') \;-\; \lambda_i. \tag{17}$$

Since *X* completely specifies *O* (recall that *x*_{ijk} also specifies
that symbol *v*_{k} is observed in state *j*),

$$P(O, X \mid w') = P(X \mid w').$$
If *c*_{ijk} denotes the number of times *x*_{ijk} occurs in *X*, then

$$P(X \mid w') = \prod_{i,j,k} \bigl(p'_{ijk}\bigr)^{c_{ijk}(X)}.$$

Hence,

$$\frac{\partial}{\partial p'_{ijk}} \log P(X \mid w') = \frac{c_{ijk}(X)}{p'_{ijk}}.$$
We can also satisfy ourselves that

$$\frac{\partial^2}{\partial p'^{\,2}_{ijk}} \log P(X \mid w') = -\frac{c_{ijk}(X)}{p'^{\,2}_{ijk}} < 0,$$

and so the stationary point obtained by setting the first derivative to zero is indeed a maximum.

Substituting in (17) and setting to zero,

$$\sum_X P(O, X \mid w)\,\frac{c_{ijk}(X)}{p'_{ijk}} = \lambda_i, \qquad \text{i.e.} \qquad p'_{ijk} = \frac{1}{\lambda_i} \sum_X P(O, X \mid w)\,c_{ijk}(X).$$

If we now write

$$\lambda_i = \sum_{j,k} \sum_X P(O, X \mid w)\,c_{ijk}(X)$$

(which follows from the constraint $\sum_{j,k} p'_{ijk} = 1$), then

$$p'_{ijk} = \frac{\sum_X P(X \mid O, w)\,c_{ijk}(X)}{\sum_{j,k} \sum_X P(X \mid O, w)\,c_{ijk}(X)},$$

where

$$P(X \mid O, w) = \frac{P(O, X \mid w)}{P(O \mid w)},$$

the factor $P(O \mid w)$ cancelling between numerator and denominator.

The RHS above is just the reestimated probability of *x*_{ijk} because,
prior to normalization, the sum yields the expected frequency of the
*x*_{ijk} computed using the current values of the *p*_{ijk}.
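The reestimation formula can be sketched by brute force on a toy two-state, two-symbol HMM (made-up parameters, a fixed known start state, and exhaustive enumeration of all state sequences in place of the efficient forward-backward recurrences): accumulate the expected counts of each *x*_{ijk} over all sequences *X* consistent with *O*, then normalize over (*j*, *k*) for each *i*.

```python
from itertools import product

# Toy two-state, two-symbol HMM with made-up parameters.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
# p[i][j][k] = P(x_ijk): transition i -> j with symbol v_k emitted from j.
p = [[[A[i][j] * B[j][k] for k in range(2)] for j in range(2)]
     for i in range(2)]

O = [0, 1, 1, 0]      # observed symbol indices
start = 0             # assume a fixed known start state for simplicity

# Accumulate sum_X P(O, X | w) c_ijk(X) by enumerating every state sequence.
counts = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
P_O = 0.0
for states in product(range(2), repeat=len(O)):
    prob, prev = 1.0, start
    for s, o in zip(states, O):
        prob *= p[prev][s][o]   # P(O, X | w) is a product of p_ijk terms
        prev = s
    P_O += prob
    prev = start
    for s, o in zip(states, O):
        counts[prev][s][o] += prob  # un-normalized expected count of x_ijk
        prev = s

# p'_ijk = expected count / sum over (j, k) of expected counts for state i.
# The P(O | w) factor cancels between numerator and denominator.
p_new = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
for i in range(2):
    norm = sum(counts[i][j][k] for j in range(2) for k in range(2))
    for j in range(2):
        for k in range(2):
            p_new[i][j][k] = counts[i][j][k] / norm if norm else p[i][j][k]

print("P(O|w) =", P_O)
```

By construction each reestimated block satisfies the constraint: the *p*'_{ijk} sum to one over (*j*, *k*) for each *i*.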

We can now proceed to calculate each *p*_{ijk}' as above. The set of
all such *p*_{ijk}'s constitutes our reestimate *w*', which we can use in place
of *w* during the next iteration.

The EM theorem tells us that if we continue in this fashion, then after each iteration

- 1.
- either the reestimated *w*' is more likely than the
original *w*, in the sense that *P*(*O*|*w*') > *P*(*O*|*w*), or
- 2.
- we have reached a stationary point of the likelihood function
*P*(*O*|*w*), at which *w*' = *w*.
This is an EM algorithm: it is iterative, with an expectation step (aka estimation step) and a maximization step (aka modification step) in each iteration.

Note that the algorithm is not guaranteed to converge to the global maximum, but it has been found to yield good results in practice.
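The monotonicity guarantee can be checked numerically on a toy example (made-up two-state, two-symbol parameters, a fixed start state, and exhaustive enumeration of state sequences instead of forward-backward): repeating the reestimation step, the likelihood *P*(*O*|*w*) should never decrease from one iteration to the next.

```python
from itertools import product

def likelihood_and_counts(p, O, start=0):
    """Return P(O|w) and the un-normalized expected counts
    sum_X P(O, X | w) c_ijk(X), by enumerating all state sequences X."""
    counts = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    P_O = 0.0
    for states in product(range(2), repeat=len(O)):
        prob, prev = 1.0, start
        for s, o in zip(states, O):
            prob *= p[prev][s][o]
            prev = s
        P_O += prob
        prev = start
        for s, o in zip(states, O):
            counts[prev][s][o] += prob
            prev = s
    return P_O, counts

def reestimate(p, O):
    """One EM iteration: p'_ijk proportional to the expected count of x_ijk."""
    P_O, counts = likelihood_and_counts(p, O)
    p_new = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    for i in range(2):
        norm = sum(counts[i][j][k] for j in range(2) for k in range(2))
        for j in range(2):
            for k in range(2):
                p_new[i][j][k] = counts[i][j][k] / norm if norm else p[i][j][k]
    return P_O, p_new

# Made-up starting parameters and observation sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = [[[A[i][j] * B[j][k] for k in range(2)] for j in range(2)] for i in range(2)]
O = [0, 1, 1, 0, 0, 1]

likelihoods = []
for _ in range(10):
    P_O, p = reestimate(p, O)   # record P(O|w), then replace w with w'
    likelihoods.append(P_O)

print(likelihoods[0], "->", likelihoods[-1])
```

The recorded likelihoods form a non-decreasing sequence, in agreement with the EM theorem; the value it settles at is a local (not necessarily global) maximum.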