Consider our formulation of HMMs as before, i.e.
is the alphabet,
is the output sequence,
st is the state at time t, A is the transition probability
aij = P(st=j|st-1=i) and B is the state-symbol probability
for observing a given output symbol
bjk = P(Ot = vk|st=s).
Let
Z = f(A,B) be a random variable such that
xijk = f(A,B)
denotes the joint event consisting of the transition from state i to
j and the observation of the vk from state j. So the observed
output is completely specified by a sequence
.
Let
pijk = P(xijk).
Then the parameter of interest,
,
is exactly Z since it
represents the totality of transition and state-symbol probabilities.
We wish to find a
such that
To do this, we first find a stationary point of the LHS subject to the
constraints
.
Using
Lagrange multipliers, the function to maximize with respect to the
variables pijk' is thus
.
Then
Since X completely specifies O (recall that xijk also specifies
that symbol vk is observed in state j),
If cijk denotes the number of times xijk occurs in X, then
We can also satisfy ourselves that
Substituting in (17) and setting to zero,
If we now write
,
then
The RHS above is just the reestimated probability of xijk because prior to normalization, the sum yields the expected frequency of the xijk computed using the current values of the pijk.
We can now proceed to calculate each pijk' as above. The set of
all such pijk's constitutes our
which we can use in place
of
during the next iteration.
The EM theorem tells us that if we continue in this fashion, then after each iteration
This is an algorithm, because it is iterative. There is an expectation step (aka estimation step) and a maximization step (aka modification step) in each iteration.
Note that the algorithm is not guaranteed to converge at the global maximum. But it has been found to yield good results in practice.