Replicating String.split with StringTokenizer

Dani picture Dani · Jun 12, 2009 · Viewed 7.3k times · Source

Encouraged by this, and the fact I have billions of string to parse, I tried to modify my code to accept StringTokenizer instead of String[]

The only thing left between me and getting that delicious x2 performance boost is the fact that when you're doing

"dog,,cat".split(",")
//output: ["dog","","cat"]

StringTokenizer("dog,,cat")
// nextToken() = "dog"
// nextToken() = "cat"

How can I achieve similar results with the StringTokenizer? Are there faster ways to do this?

Answer

Jon Skeet picture Jon Skeet · Jun 12, 2009

Are you only actually tokenizing on commas? If so, I'd write my own tokenizer - it may well end up being even more efficient than the more general purpose StringTokenizer which can look for multiple tokens, and you can make it behave however you'd like. For such a simple use case, it can be a simple implementation.

If it would be useful, you could even implement Iterable<String> and get enhanced-for-loop support with strong typing instead of the Enumeration support provided by StringTokenizer. Let me know if you want any help coding such a beast up - it really shouldn't be too hard.

Additionally, I'd try running performance tests on your actual data before leaping too far from an existing solution. Do you have any idea how much of your execution time is actually spent in String.split? I know you have a lot of strings to parse, but if you're doing anything significant with them afterwards, I'd expect that to be much more significant than the splitting.