I've got a pipe-delimited file that I would like to split (I'm using C#). For example:
This|is|a|test
However, some of the data can contain a pipe in it. If it does, it will be escaped with a backslash:
This|is|a|pip\|ed|test (this is a pip|ed test)
I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant and I can't help but think there's a better way. Thanks for any help.
Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.
I do a lot of parsing like this, and it is really pretty straightforward. Done correctly, this approach will also tend to run faster.
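To make the String.IndexOfAny() idea concrete, here is a rough sketch. The class and method names (PipeSplitter, SplitEscaped) are made up for illustration, and it assumes that only the two-character sequence \| is an escape; a backslash followed by anything else is kept literally, which is a guess, since the question doesn't say how a literal backslash is written.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PipeSplitter
{
    // Hypothetical helper: splits on unescaped pipes and turns \| into |.
    public static List<string> SplitEscaped(string s)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        char[] special = { '|', '\\' };
        int pos = 0;
        while (true)
        {
            int next = s.IndexOfAny(special, pos);
            if (next < 0)
            {
                // No more pipes or backslashes: the rest belongs to the last field
                current.Append(s, pos, s.Length - pos);
                break;
            }
            // Copy the plain text in front of the special character
            current.Append(s, pos, next - pos);
            if (s[next] == '\\' && next + 1 < s.Length && s[next + 1] == '|')
            {
                // Escaped pipe: keep the pipe, drop the backslash
                current.Append('|');
                pos = next + 2;
            }
            else if (s[next] == '|')
            {
                // Unescaped pipe: the current field is complete
                fields.Add(current.ToString());
                current.Clear();
                pos = next + 1;
            }
            else
            {
                // Backslash that does not escape a pipe: keep it as-is (assumption)
                current.Append('\\');
                pos = next + 1;
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

Because the fields are built in a StringBuilder, the escaping backslash is removed in the same pass, and a trailing pipe yields a final empty field.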
EDIT
In fact, maybe something like this. It would be interesting to see how this compares performance-wise to a RegEx solution.
public List<string> ParseWords(string s)
{
    List<string> words = new List<string>();
    int pos = 0;
    while (pos < s.Length)
    {
        // Get word start
        int start = pos;
        // Get word end
        pos = s.IndexOf('|', pos);
        while (pos > 0 && s[pos - 1] == '\\')
        {
            pos++;
            pos = s.IndexOf('|', pos);
        }
        // Adjust for pipe not found
        if (pos < 0)
            pos = s.Length;
        // Extract this word
        words.Add(s.Substring(start, pos - start));
        // Skip over pipe
        if (pos < s.Length)
            pos++;
    }
    return words;
}
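For the comparison, a RegEx version might look something like the sketch below. It splits on any pipe that is not immediately preceded by a backslash (a negative lookbehind) and then turns each \| back into a plain pipe, since ParseWords above leaves the backslash in place. The names RegexPipeSplitter and SplitEscaped are just placeholders, and like the loop above it does not try to handle an escaped backslash sitting right before a pipe.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexPipeSplitter
{
    // Hypothetical helper: split on pipes that have no backslash in front of them.
    public static string[] SplitEscaped(string s)
    {
        return Regex.Split(s, @"(?<!\\)\|")                    // (?<!\\) = not preceded by a backslash
                    .Select(field => field.Replace(@"\|", "|")) // unescape the remaining pipes
                    .ToArray();
    }
}
```

Calling RegexPipeSplitter.SplitEscaped(@"This|is|a|pip\|ed|test") gives five fields, with the fourth being pip|ed.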