Read Sections 3.6.2 and 6.6 of Mitchell, esp. the last paragraph of Section 6.6.
Also read the papers given: Wallace and Boulton (1968), Wallace and Georgeff (1984) and Rissanen (1978).
MDL follows directly from Eqn 2:
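Assuming Eqn 2 is the MAP definition as given in Mitchell, Section 6.6 (with H the hypothesis space and D the data), taking log base 2 and negating gives the following sketch of the derivation:

```latex
\begin{align*}
h_{MAP} &= \arg\max_{h \in H} P(D \mid h)\, P(h) \\
        &= \arg\max_{h \in H} \left[ \log_2 P(D \mid h) + \log_2 P(h) \right] \\
        &= \arg\min_{h \in H} \left[ -\log_2 P(h) - \log_2 P(D \mid h) \right]
\end{align*}
```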
From the results we derived in information theory, we can interpret the RHS as a total code length: the length of the hypothesis under an optimal code, plus the length of the data under an optimal code given that hypothesis.
Note that MDL is only guaranteed to produce the MAP hypothesis if the encodings chosen for h and D|h are optimal, i.e. their code lengths are -log2 P(h) and -log2 P(D|h). The equivalence does not hold for an arbitrary encoding strategy.
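A small numerical sketch may make this concrete. Everything in it is an assumption made for illustration: the three candidate coin biases, their priors, the data string, and the function names. It compares the MAP choice with the MDL choice under optimal code lengths, and then with a valid but mismatched prefix code for the hypotheses.

```python
import math

# Hypothetical toy problem: three candidate coin biases with assumed priors.
priors = {0.5: 0.6, 0.7: 0.3, 0.9: 0.1}
data = "HHTHHHTHHH"  # assumed observations: 8 heads, 2 tails


def likelihood(p_heads, data):
    """P(D|h) for independent coin flips with P(heads) = p_heads."""
    heads = data.count("H")
    tails = len(data) - heads
    return p_heads ** heads * (1 - p_heads) ** tails


def optimal_bits(prob):
    """Length in bits of an optimal code for an event of probability prob."""
    return -math.log2(prob)


def map_hypothesis(priors, data):
    """argmax over h of P(D|h) P(h)."""
    return max(priors, key=lambda h: likelihood(h, data) * priors[h])


def mdl_hypothesis(hypothesis_bits, data):
    """argmin over h of L(h) + L(D|h), with L(D|h) taken as optimal."""
    return min(
        hypothesis_bits,
        key=lambda h: hypothesis_bits[h] + optimal_bits(likelihood(h, data)),
    )


# Optimal hypothesis code: L(h) = -log2 P(h).
optimal_code = {h: optimal_bits(p) for h, p in priors.items()}

# A valid prefix code (it satisfies the Kraft inequality) that does NOT
# match the assumed priors: it makes h = 0.9 cheap and h = 0.7 expensive.
mismatched_code = {0.5: 2, 0.7: 6, 0.9: 1}

print("MAP:", map_hypothesis(priors, data))
print("MDL, optimal code:", mdl_hypothesis(optimal_code, data))
print("MDL, mismatched code:", mdl_hypothesis(mismatched_code, data))
```

With these assumed numbers the first two choices agree (h = 0.7), while the mismatched code selects h = 0.9. That is the MAP hypothesis under the prior the code implicitly assumes, not under the stated one.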
Why is MDL useful? Sometimes it is easier to design a code that captures the knowledge in the system than to quantify the probabilities precisely. In all other cases, use a strictly probabilistic framework; whenever it is feasible, the probabilistic framework is the one to strive for.
Points to ponder/discuss:
We will now read and discuss the distributed papers and look at an example of learning a probabilistic finite-state automaton (PFSA) using MDL.