What are the problems with standard gradient descent?
1.
Slow convergence (Work out a measure for 1000
iterations, 100 training examples and 100 weights).
2.
May settle into local minima
A stochastic approximation to gradient descent has been proposed by
Widrow and Hoff (1960).
Instead of computing
over all training examples, update
wi after considering each training example. So as we iterate
through the training examples, we update each weight wi according
to:
where t, o and xi are the target value, unit output and the
ith input for this training example alone. This is called the
delta rule.
This can be imagined to be an attempt to minimize the training error
Ed for each individual training example d, where
Stochastic gradient descent updates weights more frequently (m
times more frequently) than standard gradient descent.
If there are multiple local minima on the surface
,
stochastic gradient descent can sometimes pull itself out of them due
to the effects of the various different
.
Important observation: The training examples for a perceptron
(thresholded unit) have target values of .
So if we train the
pre-threshold stage of the perceptron (which is the unthresholded
unit) to fit these target values using the delta rule, then clearly
the perceptron has been trained to fit the given target values as
well.