What is the difference between Markov chain models and hidden Markov models? I've read the Wikipedia articles, but couldn't understand the difference.
To explain by example, I'll use one from natural language processing. Imagine you want to know the probability of this sentence:
I enjoy coffee
In a Markov model, you could estimate its probability by calculating:
P(WORD = I) x P(WORD = enjoy | PREVIOUS_WORD = I) x P(WORD = coffee | PREVIOUS_WORD = enjoy)
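As a minimal sketch of this calculation in code (the probabilities below are made up for illustration; in a real model they would be estimated from bigram counts in a corpus):

```python
# A toy bigram (first-order) Markov model over words.
# All probabilities are invented for illustration.
start_prob = {"I": 0.2}
bigram_prob = {
    ("I", "enjoy"): 0.05,
    ("enjoy", "coffee"): 0.1,
}

def sentence_probability(words):
    """P(w1) * P(w2 | w1) * ... * P(wn | wn-1) under the bigram model."""
    p = start_prob.get(words[0], 0.0)
    for prev, curr in zip(words, words[1:]):
        p *= bigram_prob.get((prev, curr), 0.0)
    return p

print(sentence_probability(["I", "enjoy", "coffee"]))  # 0.2 * 0.05 * 0.1 = 0.001
```

Note that everything here is observed: the states of the chain are the words themselves.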
Now, imagine we wanted to know the part-of-speech tags of this sentence, that is, whether a word is a past-tense verb, a noun, etc.
We did not observe any part-of-speech tags in that sentence, but we assume they are there. Thus, we calculate the probability of the part-of-speech tag sequence. In our case, the actual sequence is:
PRP-VBP-NN
(where PRP=“Personal Pronoun”, VBP=“Verb, non-3rd person singular present”, NN=“Noun, singular or mass”. See https://cs.nyu.edu/grishman/jet/guide/PennPOS.html for the complete Penn POS tag set)
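To be explicit about what is being scored: an HMM assigns a joint probability to the hidden tag sequence and the observed words, multiplying tag-to-tag transition probabilities by tag-to-word emission probabilities. For our example:

P(PRP) x P(I | PRP) x P(VBP | PRP) x P(enjoy | VBP) x P(NN | VBP) x P(coffee | NN)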
But wait! This tag sequence is just another sequence that we can apply a Markov model to. We call it hidden because the part-of-speech sequence is never directly observed. Of course, in practice we will score many candidate sequences, and we'd like to find the hidden sequence that best explains our observation (e.g. we are more likely to see words such as 'the' and 'this' generated from the determiner (DET) tag). A sketch of that search follows below.
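To make the "find the best hidden sequence" step concrete, here is a toy sketch of an HMM tagger decoded with the Viterbi algorithm. The states, transition probabilities, and emission probabilities are all invented for illustration, not taken from any real tagger:

```python
# Toy HMM: hidden states are POS tags, observations are words.
# All probabilities are made up for illustration.
states = ["PRP", "VBP", "NN"]

start_p = {"PRP": 0.6, "VBP": 0.1, "NN": 0.3}   # P(first tag)
trans_p = {                                      # P(next tag | current tag)
    "PRP": {"PRP": 0.1, "VBP": 0.7, "NN": 0.2},
    "VBP": {"PRP": 0.2, "VBP": 0.1, "NN": 0.7},
    "NN":  {"PRP": 0.3, "VBP": 0.4, "NN": 0.3},
}
emit_p = {                                       # P(word | tag)
    "PRP": {"I": 0.5},
    "VBP": {"enjoy": 0.3, "coffee": 0.1},
    "NN":  {"enjoy": 0.1, "coffee": 0.4},
}

def viterbi(words):
    """Return the most probable hidden tag sequence for the observed words."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               backpointer to the previous state on that path)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        step = {}
        for s in states:
            prob, prev = max(
                (best[-1][ps][0] * trans_p[ps][s] * emit_p[s].get(word, 0.0), ps)
                for ps in states
            )
            step[s] = (prob, prev)
        best.append(step)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for step in reversed(best[1:]):
        path.append(step[path[-1]][1])
    return list(reversed(path))

print(viterbi(["I", "enjoy", "coffee"]))  # ['PRP', 'VBP', 'NN']
```

On this toy model the decoder recovers PRP-VBP-NN, the sequence above. The key contrast with the plain Markov chain: there, we computed the probability of an observed sequence directly; here, we infer an unobserved state sequence from the words it is assumed to have generated.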
The best explanation I have ever encountered is Lawrence R. Rabiner's 1989 tutorial, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition": http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf