Steps to creating an NFA from a regular expression

kiliki picture kiliki · Aug 5, 2012 · Viewed 48k times · Source

I'm having issues 'describing each step' when creating an NFA from a regular expression. The question is as follows:

Convert the following regular expression to a non-deterministic finite-state automaton (NFA), clearly describing the steps of the algorithm that you use: (b|a)*b(a|b)

I've made a simple 3-state machine but it's very much from intuition. This is a question from a past exam written by my lecturer, who also wrote the following explanation of Thompson's algorithm: http://www.cs.may.ie/staff/jpower/Courses/Previous/parsing/node5.html

Can anyone clear up how to 'describe each step clearly'? It just seems like a set of basic rules rather than an algorithm with steps to follow.

Maybe there's an algorithm I've glossed over somewhere but so far I've just created them with an educated guess.

Answer

Wayland Smith picture Wayland Smith · Aug 6, 2012

Short version for general approach.
There's an algo out there called the Thompson-McNaughton-Yamada Construction Algorithm or sometimes just "Thompson Construction." One builds intermediate NFAs, filling in the pieces along the way, while respecting operator precedence: first parentheses, then Kleene Star (e.g., a*), then concatenation (e.g., ab), followed by alternation (e.g., a|b).

Here's an in-depth walkthrough for building (b|a)*b(a|b)'s NFA

Building the top level

  1. Handle parentheses. Note: In actual implementation, it can make sense to handling parentheses via a recursive call on their contents. For the sake of clarity, I'll defer evaluation of anything inside of parens.

  2. Kleene Stars: only one * there, so we build a placeholder Kleene Star machine called P (which will later contain b|a). Intermediate result:
    Non-Deterministic Finite Automata for P*

  3. Concatenation: Attach P to b, and attach b to a placeholder machine called Q (which will contain (a|b). Intermediate result:
    Non-Deterministic Finite Automata for P*bQ

  4. There's no alternation outside of parentheses, so we skip it.

Now we're sitting on a P*bQ machine. (Note that our placeholders P and Q are just concatenation machines.) We replace the P edge with the NFA for b|a, and replace the Q edge with the NFA for a|b via recursive application of the above steps.


Building P

  1. Skip. No parens.

  2. Skip. No Kleene stars.

  3. Skip. No contatenation.

  4. Build the alternation machine for b|a. Intermediate result:
    NFA for b or a


Integrating P

Next, we go back to that P*bQ machine and we tear out the P edge. We have the source of the P edge serve as the starting state for the P machine, and the destination of the P edge serve as the destination state for the P machine. We also make that state reject (take away its property of being an accept state). The result looks like this:
enter image description here


Building Q

  1. Skip. No parens.

  2. Skip. No Kleene stars.

  3. Skip. No contatenation.

  4. Build the alternation machine for a|b. Incidentally, alternation is commutative, so a|b is logically equivalent to b|a. (Read: skipping this minor footnote diagram out of laziness.)


Integrating Q

We do what we did with P above, except replacing the Q edge with the intermedtae b|a machine we constructed. This is the result:
enter image description here

Tada! Er, I mean, QED.


Want to know more?

All the images above were generated using an online tool for automatically converting regular expressions to non-deterministic finite automata. You can find its source code for the Thompson-McNaughton-Yamada Construction algorithm online.

The algorithm is also addressed in Aho's Compilers: Principles, Techniques, and Tools, though its explanation is sparse on implementation details. You can also learn from an implementation of the Thompson Construction in C by the excellent Russ Cox, who described it some detail in a popular article about regular expression matching.