
Stochastic approximation to gradient descent

What are the problems with standard gradient descent?

1.
Slow convergence: every weight update requires summing the error gradient over the entire training set. (As an exercise, estimate the number of arithmetic operations needed for 1000 iterations with 100 training examples and 100 weights.)
2.
It may settle into a local minimum of the error surface rather than the global minimum.

A stochastic approximation to gradient descent was proposed by Widrow and Hoff (1960).

Instead of computing $\Delta w_i$ by summing over all training examples, update the weights after considering each individual training example. As we iterate through the training examples, we update each weight $w_i$ according to:

\begin{displaymath}w_i \leftarrow w_i + \eta(t-o)x_i
\end{displaymath}

where $t$, $o$ and $x_i$ are the target value, the unit output and the $i$th input for this training example alone. This is called the delta rule.
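As a sketch, one stochastic pass of this update might look as follows (the function name and data layout are illustrative, not from the text):

```python
def delta_rule_epoch(weights, examples, eta=0.1):
    """One stochastic pass over the data: the weights are updated after
    each individual training example rather than after a full sweep.

    examples: list of (x, t) pairs, where x is the input vector and t
    the target value; the unit is linear, so o = sum_i w_i * x_i.
    """
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(weights, x))
        # Delta rule: w_i <- w_i + eta * (t - o) * x_i
        weights = [wi + eta * (t - o) * xi for wi, xi in zip(weights, x)]
    return weights
```

Repeating such passes with a suitably small $\eta$ approximates true gradient descent while updating the weights far more frequently.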

This can be viewed as an attempt to minimize the training error $E_d$ on each individual training example $d$, where

\begin{displaymath}E_d(\vec{w}) = \frac{1}{2}(t_d - o_d)^2
\end{displaymath}
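For a linear (unthresholded) unit with output $o_d = \vec{w} \cdot \vec{x}_d$, a short derivation confirms that the delta rule is exactly one gradient-descent step on $E_d$:

\begin{displaymath}\frac{\partial E_d}{\partial w_i} = \frac{\partial}{\partial w_i}\,\frac{1}{2}(t_d - o_d)^2 = (t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - \vec{w}\cdot\vec{x}_d) = -(t_d - o_d)\,x_{id}
\end{displaymath}

so the gradient-descent step $w_i \leftarrow w_i - \eta\,\partial E_d/\partial w_i$ is precisely the delta rule $w_i \leftarrow w_i + \eta(t_d - o_d)x_{id}$.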

Important observation: The training examples for a perceptron (thresholded unit) have target values of $\pm 1$. So if we use the delta rule to train the pre-threshold stage of the perceptron (the unthresholded linear unit) to fit these target values, then the perceptron has been trained to fit them as well, since thresholding an output near $+1$ or $-1$ yields exactly $+1$ or $-1$.
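This observation can be illustrated with a small sketch: train the unthresholded stage with the delta rule on $\pm 1$ targets, then classify by thresholding. The dataset (logical AND with a constant bias input and $\pm 1$ encoding) and all names here are illustrative, not from the text.

```python
def train_linear_unit(examples, n_weights, eta=0.05, epochs=200):
    """Stochastic delta-rule training of an unthresholded linear unit."""
    w = [0.0] * n_weights
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

def perceptron_output(w, x):
    """Thresholded unit: +1 if the linear output is positive, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

# AND function with +/-1 encoding; the first input is a constant bias of 1.
and_examples = [
    ([1.0, -1.0, -1.0], -1.0),
    ([1.0, -1.0,  1.0], -1.0),
    ([1.0,  1.0, -1.0], -1.0),
    ([1.0,  1.0,  1.0],  1.0),
]
weights = train_linear_unit(and_examples, n_weights=3)
```

Because the linear unit is fitted toward the $\pm 1$ targets, the thresholded perceptron classifies every training example correctly.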


Anand Venkataraman
1999-09-16