Parsing Strings with JavaCC

425nesp · Aug 9, 2012

I'm trying to think of a good way to parse strings using JavaCC without mistakenly matching them as another token. These strings should be able to contain spaces, letters, and numbers.

My identifier and number token are as follows:

<IDENTIFIER: (["a"-"z", "A"-"Z"])+>
<NUMBER: (["0"-"9"])+>

My current string token is:

<STRING: "\"" (<IDENTIFIER> | <NUMBER> | " ")+ "\"">

Ideally, I want to only save the stuff that's inside of the quotes. I have a separate file in which I do the actual saving of variables and values. Should I remove the quotes in there?

I originally had a method in the parser file like this:

variable=<IDENTIFIER> <ASSIGN> <QUOTE> message=<IDENTIFIER> <QUOTE>
{File.saveVariable(variable.image, message.image);}

But, as you might guess, this didn't allow for spaces—or numbers for that matter. For identifiers such as variable names, I only want to allow letters.

So, I'd just like to get some advice on how I could go about capturing string literals. In particular, I'd like to make strings such as:

" hello", "hello ", " hello " and "\nhello", "hello\n", "\nhello\n"

valid in my syntax.

Answer

DerMike · Aug 17, 2012

On reading the first ", your lexer should switch into a STRING_STATE and leave it again at the next unescaped " (bonus: this also handles escaped quotes inside the string).

Like:

TOKEN:
{
  <QUOTE:"\""> : STRING_STATE
}

<STRING_STATE> MORE:
{
  "\\" : ESC_STATE
}

<STRING_STATE> TOKEN:
{
  <ENDQUOTE:<QUOTE>> : DEFAULT
| <CHAR:~["\"","\\"]>
}

<ESC_STATE> TOKEN:
{
  <CNTRL_ESC:["\"","\\","/","b","f","n","r","t"]> : STRING_STATE
}
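
To make the state switching easier to follow, here is a minimal hand-written sketch in plain Java that walks through the same three states. It is not what JavaCC generates; the class name LexStateSketch and the textual token descriptions are made up for illustration, and the DEFAULT state is simplified to accept only an opening quote.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written illustration of the three lexical states; not JavaCC output. */
final class LexStateSketch {
    enum State { DEFAULT, STRING_STATE, ESC_STATE }

    /** Returns token descriptions such as "QUOTE", "CHAR(a)" or "ENDQUOTE". */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case DEFAULT:
                    // Simplification: a real grammar has other DEFAULT tokens too.
                    if (c != '"') {
                        throw new IllegalArgumentException("unexpected '" + c + "' at index " + i);
                    }
                    tokens.add("QUOTE");
                    state = State.STRING_STATE;
                    break;
                case STRING_STATE:
                    if (c == '"') {
                        tokens.add("ENDQUOTE");
                        state = State.DEFAULT;
                    } else if (c == '\\') {
                        // Like MORE: emit nothing yet, the backslash becomes
                        // the prefix of the next token.
                        state = State.ESC_STATE;
                    } else {
                        tokens.add("CHAR(" + c + ")");
                    }
                    break;
                case ESC_STATE:
                    if ("\"\\/bfnrt".indexOf(c) < 0) {
                        throw new IllegalArgumentException("bad escape '\\" + c + "' at index " + i);
                    }
                    tokens.add("CNTRL_ESC(\\" + c + ")");
                    state = State.STRING_STATE;
                    break;
            }
        }
        if (state != State.DEFAULT) {
            throw new IllegalArgumentException("unterminated string");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input is the five characters: " a \ n "
        System.out.println(tokenize("\"a\\n\""));
        // prints [QUOTE, CHAR(a), CNTRL_ESC(\n), ENDQUOTE]
    }
}
```

Because spaces are matched by CHAR inside STRING_STATE, any whitespace-skipping rule you have in DEFAULT does not touch them, so " hello " keeps both of its spaces.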

You can use this like:

/**
 * Match a quoted string.
 */
String string() :
{
  StringBuilder builder = new StringBuilder();
}
{
  <QUOTE> ( getChar(builder) )* <ENDQUOTE>
  {
    return builder.toString();
  }
}

/**
 * Match char inside quoted string.
 */
void getChar(StringBuilder builder):
{
  Token t;
}
{
  ( t = <CHAR> | t = <CNTRL_ESC> )
  {
    if (t.image.length() < 2)
    {
      // CHAR
      builder.append(t.image.charAt(0));
    }
    else if (t.image.length() < 6)
    {
      // ESC: image is the backslash (kept by MORE) plus the escape character
      char c = t.image.charAt(1);
      switch (c)
      {
        case 'b': builder.append((char) 8); break;
        case 'f': builder.append((char) 12); break;
        case 'n': builder.append((char) 10); break;
        case 'r': builder.append((char) 13); break;
        case 't': builder.append((char) 9); break;
        default: builder.append(c);
      }
    }
  }
}
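
If you want to try the decoding step without generating a parser, the following plain-Java sketch mirrors the switch in getChar. The class name EscapeDecoder and the decode method are hypothetical helpers, not part of JavaCC. It assumes a token image is either a single character (CHAR) or a backslash followed by one escape character (CNTRL_ESC, two characters long because MORE keeps the backslash as a prefix).

```java
/** Standalone mirror of the decoding step in getChar; names are illustrative. */
final class EscapeDecoder {
    /** Decodes one token image: a single character, or a backslash plus one escape character. */
    static char decode(String image) {
        if (image.length() == 1) {
            return image.charAt(0); // plain CHAR
        }
        if (image.length() == 2 && image.charAt(0) == '\\') {
            switch (image.charAt(1)) {
                case 'b': return '\b';
                case 'f': return '\f';
                case 'n': return '\n';
                case 'r': return '\r';
                case 't': return '\t';
                default:  return image.charAt(1); // quote, backslash and slash stand for themselves
            }
        }
        throw new IllegalArgumentException("not a CHAR or CNTRL_ESC image: " + image);
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        // Token images for the source text "\nhi\n" without its surrounding quotes.
        for (String image : new String[] {"\\n", "h", "i", "\\n"}) {
            builder.append(decode(image));
        }
        System.out.println(builder.toString().equals("\nhi\n")); // prints true
    }
}
```

Since QUOTE and ENDQUOTE are separate tokens that never get appended, the returned string already excludes the quotes, so nothing has to be stripped in the file that saves variables.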

HTH.