I have a huge file of data (~8GB, ~80 million records). Every record has 6-8 attributes separated by a single tab. For starters, I would like to copy some given attributes into another file. I am looking for more elegant code than the following, for example when I want only the second and the last token out of a total of 4:
StringTokenizer st = new StringTokenizer(line, "\t");
st.nextToken(); //get rid of the first token
System.out.println(st.nextToken()); //show me the second token
st.nextToken(); //get rid of the third token
System.out.println(st.nextToken()); //show me the fourth token
Keep in mind that it's a huge file, so I have to avoid any redundant if checks.
Your question got me wondering about performance. Lately I've been using Guava's Splitter where possible, just because I dig the syntax. I've never measured performance, so I put together a quick test of four parsing styles. I put these together really quickly, so pardon mistakes in style and edge-case correctness. They're based on the understanding that we're only interested in the second and fourth items.
What I found interesting is that the "homeGrown" solution (really crude code) was the fastest when parsing a 350MB tab-delimited text file with four columns, for example:
head test.txt
0 0 0 0
1 2 3 4
2 4 6 8
3 6 9 12
When operating over 350MB of data on my laptop, I got the following results:
Given that, I think I'll stick with Guava's splitter for most work and consider custom code for larger data sets.
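For anyone who wants to repeat the comparison, a timing loop along the following lines should be enough. This is a minimal sketch, not the harness behind the numbers above: the names ParseTimer, Result and time are made up for illustration, and a single pass like this ignores JIT warm-up, so a benchmarking tool such as JMH would give more trustworthy figures.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: times one pass of a line parser over a character stream.
final class ParseTimer {
    /** Outcome of one timed pass: lines read, fields kept, elapsed nanoseconds. */
    static final class Result {
        final long lines;
        final long fields;
        final long nanos;

        Result(long lines, long fields, long nanos) {
            this.lines = lines;
            this.fields = fields;
            this.nanos = nanos;
        }
    }

    static Result time(Reader source, Function<String, List<String>> parser) {
        long lines = 0;
        long fields = 0;
        long start = System.nanoTime();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                List<String> kept = parser.apply(line);
                // Use the result so the parsing work cannot be skipped;
                // a parser may return null for malformed lines.
                if (kept != null) {
                    fields += kept.size();
                }
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Result(lines, fields, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in parser for the demo; swap in any of the four methods under test.
        Function<String, List<String>> parser = l -> {
            String[] parts = l.split("\t");
            return Arrays.asList(parts[1], parts[3]);
        };
        Result r = time(new StringReader("0\t0\t0\t0\n1\t2\t3\t4\n"), parser);
        System.out.println(r.lines + " lines, " + r.fields + " fields, "
                + (r.nanos / 1_000_000) + " ms");
    }
}
```

For a real run, pass a FileReader (or Files.newBufferedReader) over the test file instead of the StringReader, and repeat each measurement a few times. The four methods I compared follow.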
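// Imports used by the four methods below
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
// Style 1: StringTokenizer. Note that it skips empty fields (consecutive tabs count as one delimiter).
public static List<String> tokenize(String line){
    List<String> result = Lists.newArrayList();
    StringTokenizer st = new StringTokenizer(line, "\t");
    st.nextToken(); // skip the first token
    result.add(st.nextToken()); // keep the second token
    st.nextToken(); // skip the third token
    result.add(st.nextToken()); // keep the fourth token
    return result;
}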
// Style 2: Guava Splitter, created once and reused for every line
static final Splitter splitter = Splitter.on('\t');
public static List<String> guavaSplit(String line){
    List<String> result = Lists.newArrayList();
    int i = 0; // zero-based column index
    for(String str : splitter.split(line)){
        if(i == 1 || i == 3){
            result.add(str);
        }
        i++;
    }
    return result;
}
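// Style 3: precompiled regex with one capturing group per column
static final Pattern p = Pattern.compile("^(.*?)\\t(.*?)\\t(.*?)\\t(.*)$");
public static List<String> regex(String line){
    Matcher m = p.matcher(line);
    // The pattern always has four groups, so a successful match is the only check needed.
    // Group 4 runs to the end of the line, so any extra columns end up inside it.
    if(m.find()){
        return Lists.newArrayList(m.group(2), m.group(4));
    }
    return null; // the line has fewer than four tab-separated columns
}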
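// Style 4: hand-rolled indexOf/substring loop
public static List<String> homeGrown(String line){
    List<String> result = Lists.newArrayList();
    String subStr = line; // the part of the line not yet consumed
    int cnt = -1; // zero-based column index, incremented in the loop condition
    int indx = subStr.indexOf('\t');
    while(++cnt < 4 && indx != -1){
        if(cnt == 1 || cnt == 3){
            result.add(subStr.substring(0, indx));
        }
        subStr = subStr.substring(indx + 1);
        indx = subStr.indexOf('\t');
    }
    // The last column has no trailing tab, so it is handled after the loop
    if(cnt == 1 || cnt == 3){
        result.add(subStr);
    }
    return result;
}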
Note that all of these would likely be slower with proper bounds checking and a more elegant implementation.
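To give an idea of what a bounds-checked version might look like, here is a rough sketch that walks the line with indexOf(char, fromIndex), so it only allocates the substrings it actually keeps. The class and method names (ColumnPicker, pick) are invented for this example, it assumes the wanted column indexes are zero-based and given in ascending order, and I have not benchmarked it against the four methods above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the benchmark above.
final class ColumnPicker {
    /**
     * Returns the fields at the given zero-based column indexes.
     * Assumes the indexes are in ascending order; columns the line
     * does not have are simply left out of the result.
     */
    static List<String> pick(String line, int... wanted) {
        List<String> result = new ArrayList<>(wanted.length);
        int start = 0; // offset where the current field begins
        int col = 0;   // zero-based index of the current field
        int w = 0;     // position in the wanted array
        while (w < wanted.length && start <= line.length()) {
            int end = line.indexOf('\t', start);
            if (end == -1) {
                end = line.length(); // last field runs to the end of the line
            }
            if (col == wanted[w]) {
                result.add(line.substring(start, end));
                w++;
            }
            start = end + 1;
            col++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pick("1\t2\t3\t4", 1, 3)); // prints [2, 4]
    }
}
```

Unlike StringTokenizer, this keeps empty fields, so column positions stay correct when a record has a blank attribute. The loop also stops as soon as the last wanted column has been found, which matters when only early columns of a 6-8 column record are needed.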