Tokenizing a String with tab delimiter in Java while skipping some tokens

Michael · Oct 13, 2012 · Viewed 10.4k times

I have a huge file of data (~8 GB / ~80 million records). Every record has 6-8 attributes separated by a single tab. For starters, I would like to copy some given attributes into another file. I'd like something more elegant than the following, for example if I want only the second and the last token out of a total of four:

StringTokenizer st = new StringTokenizer(line, "\t");
st.nextToken(); //get rid of the first token
System.out.println(st.nextToken()); //show me the second token
st.nextToken(); //get rid of the third token
System.out.println(st.nextToken()); //show me the fourth token

As a reminder, it's a huge file, so I have to avoid any redundant if checks.
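For context, the loop that wraps that snippet looks roughly like this (input.txt and output.txt are just placeholder names, and it assumes every record really has at least four tab-separated attributes):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.StringTokenizer;

public class ColumnCopy {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("output.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(line, "\t");
                st.nextToken();                  // skip the first attribute
                String second = st.nextToken();  // keep the second attribute
                st.nextToken();                  // skip the third attribute
                String fourth = st.nextToken();  // keep the fourth attribute
                out.write(second + "\t" + fourth);
                out.newLine();
            }
        }
    }
}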

Answer

DSK · Oct 14, 2012

Your question got me wondering about performance. Lately I've been using Guava's Splitter where possible, just because I dig the syntax, but I'd never measured its performance. So I put together a quick test of four parsing styles. These were written quickly, so pardon mistakes in style and edge-case correctness. They assume we're only interested in the second and fourth items.

What I found interesting is that the "homeGrown" (really crude) solution is the fastest when parsing a 350MB tab-delimited text file with four columns, e.g.:

head test.txt 
0   0   0   0
1   2   3   4
2   4   6   8
3   6   9   12
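If you want to reproduce a file in the same shape, a quick sketch for generating one would be something like the following (the row count is only a rough guess at what it takes to reach ~350MB):

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MakeTestFile {
    public static void main(String[] args) throws Exception {
        long rows = 12000000L; // adjust up or down to hit the file size you want
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
            for (long i = 0; i < rows; i++) {
                // each row is i, 2i, 3i, 4i separated by tabs, matching the sample above
                out.write(i + "\t" + (2 * i) + "\t" + (3 * i) + "\t" + (4 * i));
                out.newLine();
            }
        }
    }
}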

When operating over 350MB of data on my laptop, I got the following results:

  • homegrown: 2271ms
  • guavaSplit: 3367ms
  • regex: 7302ms
  • tokenize: 3466ms

Given that, I think I'll stick with Guava's splitter for most work and consider custom code for larger data sets.

import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

  public static List<String> tokenize(String line){
    List<String> result = Lists.newArrayList();
    StringTokenizer st = new StringTokenizer(line, "\t");
    st.nextToken(); //skip the first token
    result.add(st.nextToken()); //keep the second token
    st.nextToken(); //skip the third token
    result.add(st.nextToken()); //keep the fourth token
    return result;
  }

  static final Splitter splitter = Splitter.on('\t');
  public static List<String> guavaSplit(String line){
    List<String> result = Lists.newArrayList();
    int i=0;
    for(String str : splitter.split(line)){
      if(i==1 || i==3){ //keep only the second and fourth columns
        result.add(str);
      }
      i++;
    }
    return result;
  }

  static final Pattern p = Pattern.compile("^(.*?)\\t(.*?)\\t(.*?)\\t(.*)$");
  public static List<String> regex(String line){
    List<String> result = null;
    Matcher m = p.matcher(line);
    if(m.find()){
      if(m.groupCount()>=4){
        result= Lists.newArrayList(m.group(2),m.group(4));
      }
    }
    return result;
  }

  public static List<String> homeGrown(String line){
    List<String> result = Lists.newArrayList();
    String subStr = line;
    int cnt = -1;
    int indx = subStr.indexOf('\t');
    while(++cnt < 4 && indx != -1){
      if(cnt==1||cnt==3){ //zero-based count: the second and fourth columns
        result.add(subStr.substring(0,indx));
      }
      subStr = subStr.substring(indx+1);
      indx = subStr.indexOf('\t');
    }
    if(cnt==1||cnt==3){ //the remainder after the last tab, i.e. the fourth column here
      result.add(subStr);
    }
    return result;
  }

Note that all of these would likely be slower with proper bounds checking and a more elegant implementation.
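A driver along the following lines can reproduce the comparison; it assumes a test.txt like the sample above, plus the extra imports java.io.BufferedReader, java.nio.file.Files, java.nio.file.Paths, and java.nio.charset.StandardCharsets:

  public static void main(String[] args) throws Exception {
    long start = System.nanoTime();
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        homeGrown(line); //swap in tokenize, guavaSplit or regex to time the other styles
      }
    }
    System.out.println("homegrown: " + ((System.nanoTime() - start) / 1000000) + "ms");
  }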