What are the problems with standard gradient descent?

1. Slow convergence (work out a measure for 1000 iterations, 100 training examples, and 100 weights).
2. May settle into local minima.
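
One plausible cost measure for the parenthetical exercise above, assuming one multiply-accumulate per weight per example per iteration (the unit of "operation" is an illustrative assumption):

```python
iterations = 1000   # gradient-descent iterations
examples = 100      # training examples summed over per iteration
weights = 100       # weights updated per iteration

# Standard (batch) gradient descent touches every weight for every
# example on every iteration before making a single weight update.
total_operations = iterations * examples * weights
print(total_operations)  # 10,000,000 multiply-accumulates
```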

A stochastic approximation to gradient descent, the delta rule, was proposed by Widrow and Hoff (1960).

Instead of computing the batch update

Δ*w*_{i} = η Σ_{d∈D} (*t*_{d} − *o*_{d}) *x*_{id}

over all training examples, update *w*_{i} after considering each
training example. So as we iterate through the training examples, we
update each weight *w*_{i} according to:

*w*_{i} ← *w*_{i} + Δ*w*_{i}

where

Δ*w*_{i} = η(*t* − *o*)*x*_{i}

with *t* the target value, *o* the unthresholded output, and *x*_{i}
the *i*-th input of the current training example.

This can be imagined to be an attempt to minimize the training error
*E*_{d} for each individual training example *d*, where

*E*_{d}(*w*) = ½(*t*_{d} − *o*_{d})²

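The per-example update can be sketched for a linear unit as follows (a minimal sketch; the function name, learning rate, and toy data are illustrative assumptions):

```python
def stochastic_epoch(w, examples, eta=0.05):
    """One pass of the delta rule: update w after *each* example.

    w        -- list of weights
    examples -- list of (x, t) pairs: input list x, target t
    eta      -- learning rate
    """
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))   # unthresholded output
        for i in range(len(w)):
            w[i] += eta * (t - o) * x[i]           # Δw_i = η(t − o)x_i
    return w

# Toy target o = 2·x: the single weight should approach 2.0.
w = [0.0]
data = [([x], 2.0 * x) for x in (0.5, 1.0, -1.0, 0.25)]
for _ in range(200):
    stochastic_epoch(w, data)
```
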
- Stochastic gradient descent updates weights more frequently (*m* times more frequently, where *m* is the number of training examples) than standard gradient descent.
- If there are multiple local minima on the error surface, stochastic gradient descent can sometimes pull itself out of them due to the effects of the various different gradients ∇*E*_{d}, rather than following the single summed gradient.
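
For contrast, a standard (batch) gradient-descent epoch accumulates the summed gradient over all *m* examples and applies a single weight update per pass (a sketch; names and toy data are illustrative assumptions):

```python
def batch_epoch(w, examples, eta=0.05):
    """One pass of standard gradient descent: a single update
    built from the gradient summed over all training examples."""
    deltas = [0.0] * len(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i in range(len(w)):
            deltas[i] += eta * (t - o) * x[i]
    for i in range(len(w)):
        w[i] += deltas[i]   # one update per epoch, not one per example
    return w

# Toy target o = 2·x: the weight still approaches 2.0,
# but via one weight update per pass instead of one per example.
w = [0.0]
data = [([x], 2.0 * x) for x in (0.5, 1.0, -1.0, 0.25)]
for _ in range(200):
    batch_epoch(w, data)
```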

**Important observation:** The training examples for a perceptron
(thresholded unit) have target values of ±1.
So if we train the
pre-threshold stage of the perceptron (which is the unthresholded
unit) to fit these target values using the delta rule, then clearly
the perceptron has been trained to fit the given target values as
well.
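
The observation above can be sketched end to end: train the unthresholded (linear) stage on ±1 targets with the delta rule, then threshold its output (a toy, linearly separable AND task with ±1 encoding; all names and parameters are illustrative assumptions):

```python
def train_linear_unit(examples, n_inputs, eta=0.05, epochs=500):
    """Fit the pre-threshold (linear) stage to ±1 targets with the
    per-example delta rule. w[0] is a bias weight with input x0 = 1."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)                      # bias input x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(len(w)):
                w[i] += eta * (t - o) * xs[i]         # delta rule update
    return w

def perceptron_output(w, x):
    """Thresholded unit: sign of the linear stage's output."""
    s = sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))
    return 1 if s > 0 else -1

# Boolean AND with ±1 encoding: only (1, 1) maps to +1.
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w = train_linear_unit(data, n_inputs=2)
```

Because the linear stage is fitted to the ±1 targets themselves, thresholding its output reproduces those targets on this separable task.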