Today's Code Golf challenge is to create a regex parser in as few characters as possible.
No, I'm not asking you to match Perl-style regular expressions. There's already a very reliable interpreter for those, after all! :-)
Here's all you need to know about regex syntax for this challenge:
()
.*
(asterisk) character represents a Kleene star operation on the previous TERM. This means zero or more of the previous term, concatenated together.+
(plus) character represents a convenient shortcut: a+
is equivalent to aa*
, meaning one or more of the previous term.?
(question mark) character represents zero or one of the previous term.|
(pipe) character represents an alternation, meaning that the REGULAR EXPRESSIONS on either side can be used in the match.[0-9A-Za-z]
(i.e., all English alphanumerics).Or, put another way: *
/+
/?
have highest precedence, then concatenation, then alternation. Since alternation has lower precedence than concatenation, its use within a regex without parentheses causes it to be bound to the full regex on each side. *
and +
and ?
, on the other hand, would just apply to the immediately preceding term.
Your challenge is to write a program that will compile or interpret a regular expression (as defined above) and then test a number of strings against it.
I'm leaving input up to you. My recommendation would be that the regex should probably come first, and then any number of strings to be tested against it; but if you want to make it last, that's fine. If you want to put everything in command-line arguments or into stdin, or the regex in command-line and the strings in stdin, or whatever, that's fine. Just show a usage example or two.
Output should be true
or false
, one per line, to reflect whether or not the regex matches.
Notes:
()*+?|
are not expected to occur literally. If one comes up in the input, it is safe to assume that no pattern can match the string in question.For the examples, I'm assuming everything is done in command-line arguments, regex first. (As I said above, input is up to you.) myregex
here represents your invocation of the program.
> myregex easy easy Easy hard
true
false
false
> myregex ab*a aa abba abab b
true
true
false
false
> myregex 0*1|10 1 10 0110 00001
true
true
false
true
> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true
> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true
NOTE: Sorry, forgot to make community wiki! :-(
This understands parentheses, and works on the regexp live (i.e. not parsed first)
#define C char
#define M m(s,o
m(C*s,C*o,C*S,C*p,C*P,C T){C*n=P-1,*q=s,h=*P==41,c=1;for(;h*c;c-=*n--==40)c+=*n==41;
c=*P-42;c=p-P?c-82?T&&c&~1&&c-21?h?2:*S==*P&s<S?M,S-1,p,n,2)||(T&4&&M,S-1,p,P,T|1)):
4:M,T?S:o,p,P-1,T?c&~1?3:7-c:0):T&&s==S||M,o,p,P-1,2):T&&s==S;if(c==2)for(c=4;q<=S;q
++)c|=m(q,S,S,n+h,P-h,2)?M,q,p,n,2)||q<S&T/4&&M,q,p,P,T&6):0;return
c&4?c&1?:T&1&&M,S,p,n,2)||M,o,p,n,0):c;}main(C*w,C**v){C*u;for(w=*++v;*++v;)puts(m(*v
-1,u,u=index(*v,0)-1,w-1,index(w,0)-1,2)?"true":"false");}
compiling with -Wall complains about argc being char. Will crash on illegal patterns.
This parses regexp and string from right to left, saving a few chars.
input on argv, output in reverse normal order:
$ ./a.out ab*a aa abba abab b
true
true
false
false
$ ./a.out "0*1|10" 1 10 0110 00001
true
true
false
true
$ ./a.out "0*(1|1+0)" 1 10 0110 00001
true
true
true
true
$ ./a.out "a?b+|(a+b|b+a?)+" abb babab aaa aabba a b
true
true
false
true
false
true
$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbb
false
$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbbb
true