I saw this question, and was curious as to what the pumping lemma was (Wikipedia didn't help much).
I understand that it's basically a theoretical proof that must be true in order for a language to be in a certain class, but beyond that I don't really get it.
Anyone care to try to explain it at a fairly granular level in a way understandable by non mathematicians/comp sci doctorates?
The pumping lemma is a simple proof to show that a language is not regular, meaning that a Finite State Machine cannot be built for it. The canonical example is the language (a^n)(b^n)
. This is the simple language which is just any number of a
s, followed by the same number of b
s. So the strings
ab
aabb
aaabbb
aaaabbbb
etc. are in the language, but
aab
bab
aaabbbbbb
etc. are not.
It's simple enough to build a FSM for these examples:
This one will work all the way up to n=4. The problem is that our language didn't put any constraint on n, and Finite State Machines have to be, well, finite. No matter how many states I add to this machine, someone can give me an input where n equals the number of states plus one and my machine will fail. So if there can be a machine built to read this language, there must be a loop somewhere in there to keep the number of states finite. With these loops added:
all of the strings in our language will be accepted, but there is a problem. After the first four a
s, the machine loses count of how many a
s have been input because it stays in the same state. That means that after four, I can add as many a
s as I want to the string, without adding any b
s, and still get the same return value. This means that the strings:
aaaa(a*)bbbb
with (a*)
representing any number of a
s, will all be accepted by the machine even though they obviously aren't all in the language. In this context, we would say that the part of the string (a*)
can be pumped. The fact that the Finite State Machine is finite and n is not bounded, guarantees that any machine which accepts all strings in the language MUST have this property. The machine must loop at some point, and at the point that it loops the language can be pumped. Therefore no Finite State Machine can be built for this language, and the language is not regular.
Remember that Regular Expressions and Finite State Machines are equivalent, then replace a
and b
with opening and closing Html tags which can be embedded within each other, and you can see why it is not possible to use regular expressions to parse Html