Remove all whitespace from a String but keep ONE newline

friesoft · Mar 19, 2013 · Viewed 13.8k times

I have this input String (containing tabs, spaces, and line breaks):


        That      is a test.              
    seems to work       pretty good? working.








    Another test  again.

[Edit]: I should have provided the String itself for better testing, since Stack Overflow strips special characters (tabs, ...):

String testContent = "\n\t\n\t\t\t\n\t\t\tDas      ist ein Test.\t\t\t  \n\tsoweit scheint das \t\tganze zu? funktionieren.\n\n\n\n\t\t\n\t\t\n\t\t\t      \n\t\t\t      \n    \t\t\t\n    \tNoch ein  Test.\n    \t\n    \t\n    \t";

And I want to reach this state:


That is a test.
seems to work pretty good? working.
Another test again.

String expectedOutput = "Das ist ein Test.\nsoweit scheint das ganze zu? funktionieren.\nNoch ein Test.\n";

Any ideas? Can this be achieved using regexes?

replaceAll("\\s+", " ") is NOT what I'm looking for, because it also collapses the newlines. If this regex preserved exactly one newline from each run of whitespace that contains one, it would be perfect.

I have tried this, but it seems suboptimal to me:

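import java.io.BufferedReader;
import java.io.StringReader;

// readLine() throws IOException, so the enclosing method must declare or handle it.
BufferedReader bufReader = new BufferedReader(new StringReader(testContent));
String line;
StringBuilder newString = new StringBuilder();
while ((line = bufReader.readLine()) != null) {
    // Collapse each run of whitespace to a single space, then trim both ends.
    String temp = line.replaceAll("\\s+", " ").trim();
    // Skip lines that contained only whitespace.
    if (!temp.isEmpty()) {
        newString.append(temp).append("\n");
    }
}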

Answer

Marko Topolnik · Mar 19, 2013

In a single regex (plus a small patch for tabs):

input.replaceAll("^\\s+|\\s+$|\\s*(\n)\\s*|(\\s)\\s*", "$1$2")
     .replace("\t"," ");

The regex looks daunting, but it decomposes nicely into four alternatives that are OR-ed together:

  • ^\s+ – match whitespace at the beginning;
  • \s+$ – match whitespace at the end;
  • \s*(\n)\s* – match whitespace containing a newline, and capture that newline;
  • (\s)\s* – match whitespace, capturing the first whitespace character.
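
To see what each alternative does on its own, here is a small sketch that applies the four sub-patterns separately. The class name `AlternativesDemo`, the method names, and the sample strings are made up for illustration; only the patterns come from the answer.

```java
class AlternativesDemo {
    // Each method applies exactly one of the four alternatives.
    static String stripLeading(String s) {
        return s.replaceAll("^\\s+", "");
    }

    static String stripTrailing(String s) {
        return s.replaceAll("\\s+$", "");
    }

    // Replaces a whitespace run that contains a newline with the captured newline.
    static String keepOneNewline(String s) {
        return s.replaceAll("\\s*(\n)\\s*", "$1");
    }

    // Replaces any whitespace run with its first character.
    static String keepFirstWhitespace(String s) {
        return s.replaceAll("(\\s)\\s*", "$1");
    }

    // Makes newlines visible when printing.
    static String show(String s) {
        return "[" + s.replace("\n", "\\n") + "]";
    }

    public static void main(String[] args) {
        System.out.println(show(stripLeading("  x")));
        System.out.println(show(stripTrailing("x  ")));
        System.out.println(show(keepOneNewline("a \n\n b")));
        System.out.println(show(keepFirstWhitespace("a \t b")));
        // The last alternative alone would lose the newline here,
        // which is presumably why it is tried after the newline alternative.
        System.out.println(show(keepFirstWhitespace("a \n b")));
    }
}
```

The last call suggests why the order of the alternatives matters: tried on its own, `(\s)\s*` keeps the leading space of `" \n "` and drops the newline, so the newline-capturing alternative has to come first.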

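Each match has two capture groups, but at most one of them is non-empty at a time; when one of the first two alternatives matches, both are empty. This allows me to replace the match with "$1$2", which means "concatenate the two capture groups": whichever group took part in the match supplies the replacement, and the other contributes nothing.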
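
As a rough illustration of the capture-group behaviour, the sketch below walks over the matches with a `Matcher` and records what "$1$2" would insert for each one. The class name `CaptureGroupDemo`, the `describe` helper, and the sample input are assumptions of mine; in Java, `Matcher.group(n)` returns `null` for a group that did not take part in the match, and such a group expands to nothing in the replacement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Same pattern as in the answer.
    static final Pattern WS = Pattern.compile("^\\s+|\\s+$|\\s*(\n)\\s*|(\\s)\\s*");

    // For every match, returns the text that "$1$2" would put in its place.
    static List<String> describe(String input) {
        List<String> replacements = new ArrayList<>();
        Matcher m = WS.matcher(input);
        while (m.find()) {
            String g1 = m.group(1); // null unless the newline alternative matched
            String g2 = m.group(2); // null unless the last alternative matched
            replacements.add((g1 == null ? "" : g1) + (g2 == null ? "" : g2));
        }
        return replacements;
    }

    public static void main(String[] args) {
        for (String r : describe("  a \t b \n\n c  ")) {
            // Print each replacement with newlines made visible.
            System.out.println("[" + r.replace("\n", "\\n") + "]");
        }
    }
}
```

For this sample input there are four matches: the leading run and the trailing run are replaced by nothing, the run between `a` and `b` by a single space, and the run containing the newlines by one newline.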

The only remaining problem is that the captured whitespace character may itself be a tab, and I can't turn a tab into a space using this approach, so I fix that up with a simple non-regex character replacement.
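
Putting it together, here is a minimal sketch that runs the expression against the test string from the question. The class name `WhitespaceNormalizer` and the `normalize` helper are hypothetical wrappers, not part of the original answer. One caveat, as far as I can trace it: `\s+$` also consumes the final newline, so the result lacks the trailing "\n" that expectedOutput ends with, and the sketch appends one before comparing.

```java
class WhitespaceNormalizer {
    // The answer's expression wrapped in a helper method (name is an assumption).
    static String normalize(String input) {
        return input
                .replaceAll("^\\s+|\\s+$|\\s*(\n)\\s*|(\\s)\\s*", "$1$2")
                .replace("\t", " "); // a captured tab becomes a plain space
    }

    public static void main(String[] args) {
        String testContent = "\n\t\n\t\t\t\n\t\t\tDas      ist ein Test.\t\t\t  \n"
                + "\tsoweit scheint das \t\tganze zu? funktionieren.\n\n\n\n\t\t\n\t\t\n"
                + "\t\t\t      \n\t\t\t      \n    \t\t\t\n    \tNoch ein  Test.\n"
                + "    \t\n    \t\n    \t";
        String expectedOutput = "Das ist ein Test.\n"
                + "soweit scheint das ganze zu? funktionieren.\n"
                + "Noch ein Test.\n";

        String result = normalize(testContent);
        System.out.println(result);

        // The trailing whitespace, including the last newline, was stripped,
        // so a newline is appended here before comparing.
        System.out.println((result + "\n").equals(expectedOutput));
    }
}
```

If the trailing newline is required, appending it after the call (as in `main`) seems the simplest option; changing the regex itself would be a different approach from the one in the answer.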