Read Section 6.4 of Mitchell. Under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis.
Consider a hypothesis space H consisting of some real-valued functions $h : X \to \mathbb{R}$. Suppose we are to learn a target function $f : X \to \mathbb{R}$ from H, given a set of m training examples $(x_i, d_i)$. Assume that the data is corrupted by noise that is Normally distributed about the target values, i.e. $d_i = f(x_i) + e_i$, where $e_i$ is the noise term. What is $h_{ML}$?
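The noise model above can be simulated in a few lines. This is only a minimal sketch: the linear `target` function, the input grid `xs`, the noise level `sigma`, and the helper name `make_training_data` are all assumptions chosen for illustration and do not come from Mitchell.

```python
import random


def make_training_data(f, xs, sigma, seed=0):
    """Return a list of (x_i, d_i) pairs with d_i = f(x_i) + e_i.

    Each e_i is drawn independently from a Normal distribution with
    mean 0 and standard deviation sigma (an assumed noise level).
    """
    rng = random.Random(seed)
    return [(x, f(x) + rng.gauss(0.0, sigma)) for x in xs]


def target(x):
    # Hypothetical target function f, picked only for illustration.
    return 2.0 * x + 1.0


xs = [i / 10 for i in range(50)]
data = make_training_data(target, xs, sigma=0.5)
```

Each observed value `d_i` scatters around `target(x_i)`; setting `sigma=0` recovers the noise-free target values.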
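From Eqn 3, the maximum likelihood hypothesis is:
$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$
If we assume the training examples are independent of each other given h, we can write the above as:
$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} P(d_i \mid h)$$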
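But $P(d_i \mid h) = 0$ for any particular data item, since the error is Normally distributed and therefore continuous: a continuous variable takes any single exact value with probability zero.
However, note that the probability density $p$ is defined by
$$p(x_0) = \lim_{\epsilon \to 0} \frac{1}{\epsilon} P(x_0 \le x < x_0 + \epsilon)$$
So, using densities in place of probabilities,
$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)$$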
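Now, since we assumed that the $d_i$ are Normally distributed around $f(x_i)$, the density of $d_i$ under hypothesis $h$ is Normal with mean $h(x_i)$ and some fixed variance $\sigma^2$, so we can write
$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right)$$
Taking the logarithm, which is monotonic and so does not change the maximizing hypothesis,
$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(d_i - h(x_i))^2}{2\sigma^2} \right]$$
The first term and the factor $1/(2\sigma^2)$ do not depend on $h$, and maximizing a negated quantity is the same as minimizing it, so
$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$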
So hML is the hypothesis that minimizes the sum of squared errors between the training examples and the hypothesis predictions.
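This conclusion can be checked numerically on a small, finite hypothesis space. The sketch below is illustrative only: the hypothesis family (lines with a fixed intercept and a grid of candidate slopes), the target slope, and the noise level are assumptions, not part of the original derivation. The log-likelihood is computed point by point from the Normal log-density, independently of the squared-error routine.

```python
import math
import random


def sse(h, data):
    """Sum of squared errors between predictions h(x_i) and targets d_i."""
    return sum((d - h(x)) ** 2 for x, d in data)


def log_likelihood(h, data, sigma):
    """Sum over examples of the Normal log-density of d_i with mean h(x_i)."""
    total = 0.0
    for x, d in data:
        total += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                  - (d - h(x)) ** 2 / (2 * sigma ** 2))
    return total


def make_hypothesis(w):
    # Hypothetical hypothesis family: h_w(x) = w * x + 1.
    return lambda x: w * x + 1.0


# Assumed setup: target f(x) = 2x + 1, Normal noise with sigma = 0.5.
rng = random.Random(0)
data = [(i / 10, 2.0 * (i / 10) + 1.0 + rng.gauss(0.0, 0.5)) for i in range(50)]

slopes = [k / 10 for k in range(41)]  # candidate slopes 0.0, 0.1, ..., 4.0

w_ml = max(slopes, key=lambda w: log_likelihood(make_hypothesis(w), data, 0.5))
w_sse = min(slopes, key=lambda w: sse(make_hypothesis(w), data))

print("max-likelihood slope:", w_ml)
print("min-squared-error slope:", w_sse)
```

Both searches should select the same slope. The value of `sigma` only shifts and rescales the log-likelihood, so it does not affect which hypothesis wins.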