Regular expressions - Matching whitespace

mrjasmin · Nov 10, 2012

I am having trouble writing a regexp that will match all the whitespace in my input.

I have tried \s+ and [ \t\t\r]+ but neither works.

I need this because I am writing a scanner using flex, and I am stuck on matching whitespace. The whitespace should only be matched, not removed.

Example input:

program 
3.3 5 7 
{ comment }
string
panic: cant happen

Answer

Matthew Slattery · Nov 11, 2012
  1. flex uses (approximately) the POSIX "Extended Regular Expression" syntax -- \s doesn't work, because it's a Perl extension.

  2. Is [ \t\t\r]+ a typo? I think you'll want a \n in there.

Something like [ \n\t\r]+ certainly should work. For example, this lexer (which I've saved as lexer.l):

%{

#include <stdio.h>

%}

%option noyywrap

%%

[ \n\t\r]+  { printf("Whitespace: '%s'\n", yytext); }
[^ \n\t\r]+ { printf("Non-whitespace: '%s'\n", yytext); }

%%

int main(void)
{
    yylex();
    return 0;
}

...successfully matches the whitespace in your example input (which I've saved as input.txt):

$ flex lexer.l
$ gcc -o test lex.yy.c
$ ./test < input.txt
Non-whitespace: 'program'
Whitespace: ' 
'
Non-whitespace: '3.3'
Whitespace: ' '
Non-whitespace: '5'
Whitespace: ' '
Non-whitespace: '7'
Whitespace: ' 
'
Non-whitespace: '{'
Whitespace: ' '
Non-whitespace: 'comment'
Whitespace: ' '
Non-whitespace: '}'
Whitespace: '
'
Non-whitespace: 'string'
Whitespace: '
'
Non-whitespace: 'panic:'
Whitespace: ' '
Non-whitespace: 'cant'
Whitespace: ' '
Non-whitespace: 'happen'
Whitespace: '
'
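
If you would rather not list the whitespace characters by hand, flex also accepts POSIX character class expressions inside brackets, so [[:space:]] is probably the closest portable stand-in for \s. The following is only a sketch of the same lexer with that substitution, not something taken from the original question. Note that [[:space:]] follows C's isspace(), so it also matches vertical tab and form feed, which is slightly broader than [ \n\t\r].

```lex
%{
/* Same test lexer as above, but using the POSIX [:space:] class
   instead of an explicit character list. [:space:] corresponds to
   isspace(), so it also covers \v and \f. */
#include <stdio.h>
%}

%option noyywrap

%%

[[:space:]]+   { printf("Whitespace: '%s'\n", yytext); }
[^[:space:]]+  { printf("Non-whitespace: '%s'\n", yytext); }

%%

int main(void)
{
    /* Scan stdin until end of input; each rule prints what it matched. */
    yylex();
    return 0;
}
```

It builds and runs the same way as the lexer above (flex, then gcc on lex.yy.c). For input that contains only spaces, tabs, carriage returns and newlines, it should split the text at the same places as the [ \n\t\r]+ version.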