Tokenize problem in Java with separator ". "

poiuytrez · Jun 4, 2010

I need to split a text using the separator ". ". For example, I want this string:

Washington is the U.S Capital. Barack is living there.

To be cut into two parts:

Washington is the U.S Capital. 
Barack is living there.

Here is my code:

// Initialize the tokenizer
StringTokenizer tokenizer = new StringTokenizer("Washington is the U.S Capital. Barack is living there.", ". ");
while (tokenizer.hasMoreTokens()) {
    System.out.println(tokenizer.nextToken());
}

And the output is, unfortunately:

Washington
is
the
U
S
Capital
Barack
is
living
there

Can someone explain what's going on?

Answer

polygenelubricants · Jun 4, 2010

Don't use StringTokenizer; it's a legacy class. Use java.util.Scanner or simply String.split instead.

    String text = "Washington is the U.S Capital. Barack is living there.";
    String[] tokens = text.split("\\. ");
    for (String token : tokens) {
        System.out.println("[" + token + "]");
    }

This prints:

[Washington is the U.S Capital]
[Barack is living there.]
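
The answer also mentions java.util.Scanner; a minimal sketch of the same split using Scanner with a custom delimiter (same escaped pattern, variable names are mine) would look like this:

    import java.util.Scanner;

    String text = "Washington is the U.S Capital. Barack is living there.";
    // Scanner delimiters are regex patterns as well, so the dot is escaped the same way
    Scanner scanner = new Scanner(text).useDelimiter("\\. ");
    while (scanner.hasNext()) {
        System.out.println("[" + scanner.next() + "]");
    }
    scanner.close();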

Note that split and Scanner are "regex"-based (regular expressions), and since . is a special regex "meta-character", it needs to be escaped with \. In turn, since \ is itself an escape character for Java string literals, you need to write "\\. " as the delimiter.

This may sound complicated, but it really isn't. split and Scanner are much superior to StringTokenizer, and regex isn't that hard to pick up.
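
If the manual escaping feels error-prone, one alternative (an aside, not part of the original answer) is to let java.util.regex.Pattern build the literal pattern for you:

    import java.util.regex.Pattern;

    // Pattern.quote(". ") yields a pattern that matches the literal ". ",
    // so no hand-written backslash escaping of the dot is needed
    String[] tokens = text.split(Pattern.quote(". "));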


But what went wrong?

The problem is that StringTokenizer treats each character of the delimiter string as an individual delimiter; it does NOT use the entire string as one delimiter.

From the API:

StringTokenizer(String str, String delim): Constructs a string tokenizer for the specified string. The characters in the delim argument are the delimiters for separating tokens. Delimiter characters themselves will not be treated as tokens.
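
In other words, StringTokenizer with ". " splits on either '.' or ' ' and skips runs of them. A rough split-based equivalent of the questioner's call (my sketch, for illustration only) is:

    String text = "Washington is the U.S Capital. Barack is living there.";
    // Each character of ". " acts as its own delimiter, so the behavior is roughly
    // that of splitting on the character class [. ] (one or more in a row)
    String[] tokens = text.split("[. ]+");
    for (String token : tokens) {
        System.out.println(token);   // prints the same word-by-word output shown in the question
    }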