Encouraged by this, and by the fact that I have billions of strings to parse, I tried to modify my code to accept a StringTokenizer instead of a String[].
The only thing standing between me and that delicious 2x performance boost is that when you do

"dog,,cat".split(",")
// output: ["dog", "", "cat"]

new StringTokenizer("dog,,cat", ",")
// nextToken() = "dog"
// nextToken() = "cat"
How can I achieve similar results with the StringTokenizer? Are there faster ways to do this?
Are you only ever tokenizing on commas? If so, I'd write my own tokenizer - it may well end up being even more efficient than the more general-purpose StringTokenizer, which can look for multiple delimiter characters, and you can make it behave however you'd like. For such a simple use case, it can be a simple implementation.
If it would be useful, you could even implement Iterable&lt;String&gt; and get enhanced-for-loop support with strong typing, instead of the Enumeration support provided by StringTokenizer. Let me know if you want any help coding such a beast up - it really shouldn't be too hard.
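To make the idea concrete, here's a minimal sketch of what such a tokenizer could look like. The class name SimpleSplitter and its shape are my own invention for illustration, not an existing API: it splits on a single delimiter character, preserves empty tokens, and implements Iterable&lt;String&gt;.

import java.util.Iterator;
import java.util.NoSuchElementException;

// Minimal single-delimiter tokenizer. Unlike StringTokenizer it preserves
// empty tokens, so "dog,,cat" yields "dog", "", "cat". (One deliberate
// difference from String.split: a trailing delimiter produces a trailing
// empty token here rather than being discarded.)
public final class SimpleSplitter implements Iterable<String> {
    private final String text;
    private final char delimiter;

    public SimpleSplitter(String text, char delimiter) {
        this.text = text;
        this.delimiter = delimiter;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            // Start index of the next token; once it passes text.length(),
            // the final token has already been returned.
            private int start = 0;

            @Override
            public boolean hasNext() {
                return start <= text.length();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int end = text.indexOf(delimiter, start);
                if (end == -1) {
                    end = text.length();
                }
                String token = text.substring(start, end);
                start = end + 1; // move past the delimiter
                return token;
            }
        };
    }
}

That then supports the enhanced for loop directly:

for (String token : new SimpleSplitter("dog,,cat", ',')) {
    System.out.println(token); // prints "dog", "", "cat"
}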
Additionally, I'd try running performance tests on your actual data before leaping too far from an existing solution. Do you have any idea how much of your execution time is actually spent in String.split? I know you have a lot of strings to parse, but if you're doing anything significant with them afterwards, I'd expect that to be much more significant than the splitting.
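If it helps, here's a rough timing sketch to get a first-order number; lines is a stand-in for a representative sample of your real input. For trustworthy results you'd want a proper harness such as JMH rather than hand-rolled timing, since JIT warm-up and dead-code elimination can easily skew a loop like this.

import java.util.Arrays;
import java.util.List;

// Crude timing sketch: measures how long String.split takes over many
// iterations of sample data. Swap in your own data, and add a second
// loop for whichever alternative tokenizer you're comparing against.
public final class SplitTiming {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("dog,,cat", "a,b,c", "x,,,y"); // stand-in data

        long fields = 0; // accumulate a result so the JIT can't discard the work
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            for (String line : lines) {
                fields += line.split(",").length;
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("String.split: " + fields + " fields in " + elapsedMs + " ms");
    }
}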