next up previous
Next: Estimating the means of Up: Bayesian Learning 2: EM Previous: The EM Algorithm

Derivation of the Baum-Welch algorithm for HMMs

Consider our formulation of HMMs as before, i.e. $\{v_1, \cdots,
v_m\}$ is the alphabet, $O_1, \cdots, O_T$ is the output sequence, st is the state at time t, A is the transition probability aij = P(st=j|st-1=i) and B is the state-symbol probability for observing a given output symbol bjk = P(Ot = vk|st=s).

Let Z = f(A,B) be a random variable such that xijk = f(A,B) denotes the joint event consisting of the transition from state i to j and the observation of the vk from state j. So the observed output is completely specified by a sequence $X = x_{i_1j_1k_1} \cdots
x_{i_Tj_Tk_T}$. Let pijk = P(xijk).

Then the parameter of interest, $\theta$, is exactly Z since it represents the totality of transition and state-symbol probabilities.

We wish to find a $\theta'$ such that

\begin{displaymath}\sum_X P(X\vert O) \log P'(X,O) > \sum_X P(X\vert O) \log P(X,O)
\end{displaymath}

The expression on the left is called Baum's auxiliary function. We have already seen that if the above inequality holds, then P'(O) > P(O). So maximizing the LHS wrt. $\theta$ is equivalent to maximizing P(O).

To do this, we first find a stationary point of the LHS subject to the constraints $\forall i: \sum_j \sum_k p_{ijk}' - 1 = 0$. Using Lagrange multipliers, the function to maximize with respect to the variables pijk' is thus $w = \sum_X P(X\vert O) \log P'(X,O) - \sum_N
\lambda_i \left(\sum_j \sum_k p_{ijk}'-1\right)$. Then $\frac{\partial w}{\partial p_{ijk}'} = $

 
    $\displaystyle \frac{\partial }{\partial p_{ijk}'} \left[ \sum_X P(X\vert O) \log P'(X,O) -
\sum_N \lambda_i \left(\sum_j \sum_k p_{ijk}'-1\right) \right]$  
  = $\displaystyle \sum_X P(X\vert O) \frac{\partial/\partial p_{ijk}' P'(X,O)}
{P'(X,O)}-\lambda_i$ (16)

Since X completely specifies O (recall that xijk also specifies that symbol vk is observed in state j),

\begin{eqnarray*}P'(X,O) = P'(X) &=& \prod_{t=0}^T p_{i_tj_tk_t}'\\
\frac{\part...
...rac{\partial }{\partial p_{ijk}'}\prod_{t=1}^T p_{i_tj_tk_t}'\\
\end{eqnarray*}


If cijk denotes the number of times xijk occurs in X, then

\begin{eqnarray*}\frac{\partial }{\partial p_{ijk}'}P'(X,O)
&=& \frac{\partial ...
..._tj_tk_t}}}{p_{ijk}'}\\
&=& \frac{c_{ijk} P'(X,O)}{p_{ijk}'}\\
\end{eqnarray*}


Hence,

\begin{displaymath}\frac{\partial/\partial p_{ijk}' P'(X,O)}{P'(X,O)} = \frac{c_{ijk}}{p_{ijk}'}
\end{displaymath}

We can also satisfy ourselves that

\begin{displaymath}\partial^2 P'(X,O)/\partial
(p_{ijk}')^2 = -c_{ijk}/(p_{ijk}')^2 <0 \end{displaymath}

and so the stationary point obtained by setting the first derivative to zero is indeed a maximum.

Substituting in (17) and setting to zero,


\begin{eqnarray*}\sum_X P(X\vert O) \frac{c_{ijk}}{p_{ijk}'} -\lambda_i = 0 \qqu...
...}\\
p_{ijk}' = \frac{1}{\lambda_i}\sum_X P(X\vert O) c_{ijk}\\
\end{eqnarray*}


If we now write $P(X\vert O) = \frac{P(X,O)}{P(O)}$, then

\begin{eqnarray*}p_{ijk}' &=& \frac{1}{\lambda_i P(O)}\sum_X c_{ijk}P(X,O) =
\f...
...i} \sum_X c_{ijk} P(X,O)\\
K_ip_{ijk}' &=& \sum_X c_{ijk}P(X,O)
\end{eqnarray*}


where Ki can be interpreted as a normalizing constant. The RHS can now be summed over each of the times $ 1, \cdots, T$ by grouping the sequences such that in each group xitjtkt = xijk.

\begin{eqnarray*}&=& \sum_{t=1}^T P(O_1,\cdots,O_{t-1}, S_{t-1}=i) p_{ijk}
P(O_...
...\frac{\sum_{t=1}^T \alpha_t(i)a_{ij}b_j(v_k)\beta_{t+1}(j)}{K_i}
\end{eqnarray*}


The RHS above is just the reestimated probability of xijk because prior to normalization, the sum yields the expected frequency of the xijk computed using the current values of the pijk.

We can now proceed to calculate each pijk' as above. The set of all such pijk's constitutes our $\theta'$ which we can use in place of $\theta$ during the next iteration.

The EM theorem tells us that if we continue in this fashion, then after each iteration

1.
either the reestimated $\theta'$ is more likely than the original $\theta$ in the sense that $P(O\vert\theta') > P(O\vert\theta)$ or

2.
we have reached a stationary point of the likelihood function w at which $\theta' = \theta$

This is an algorithm, because it is iterative. There is an expectation step (aka estimation step) and a maximization step (aka modification step) in each iteration.

Note that the algorithm is not guaranteed to converge at the global maximum. But it has been found to yield good results in practice.


next up previous
Next: Estimating the means of Up: Bayesian Learning 2: EM Previous: The EM Algorithm
Anand Venkataraman
1999-09-16