Read Section 6.9 of Mitchell. Suppose that each instance
is
a sequence of attribute values
and that the discrete valued target function we are trying to
learn is
where
.
A set of m training instances of the form
is presented followed by a new instance x. What is the most likely
value of f(x)?
The prior probability of vj is easy to estimate as the relative frequency of vj in the training examples.
But what about P(x|vj)? To estimate this we need to count the
number of times the n-gram
occurs in the subset of the training data for which
f(x) =
vj. Technically this is infeasible.
So we make a simplifying assumption (like in Markov chains) that the ai are conditionally independent of each other.
That is,
If we substitute the above in Eqn 6, we get the expression for the classification given by the Naive Bayes classifier:
Note that P(ai|vj) can be easily estimated as the ratio of the frequency of the attribute ai in the subset of the training examples for which f(x) = vj to the size of this subset.
How many different values must we estimate now? (n |V|). How many different values must we estimate if we didn't make the simplifying assumption? (n!|V|).