Regex to split a CSV

Code Ninja picture Code Ninja · Aug 9, 2013 · Viewed 98.4k times · Source

I know this (or similar) has been asked many times but having tried out numerous possibilities I've not been able to find a a regex that works 100%.

I've got a CSV file and I'm trying to split it into an array, but encountering two problems: quoted commas and empty elements.

The CSV looks like:

123,2.99,AMO024,Title,"Description, more info",,123987564

The regex I've tried to use is:

thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

The only problem is that in my output array the 5th element comes out as 123987564 and not an empty string.

Answer

Ro Yo Mi picture Ro Yo Mi · Aug 9, 2013

Description

Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.

This expression will:

  • divide your sample text on the comma delimits
  • will process empty values
  • will ignore double quoted commas, providing double quotes are not nested
  • trims the delimiting comma from the returned value
  • trims surrounding quotes from the returned value

Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

enter image description here

Example

Sample Text

123,2.99,AMO024,Title,"Description, more info",,123987564

ASP example using the non-java expression

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
  For z = 0 to Matches.Count-1
    results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
    For zz = 0 to Matches(z).SubMatches.Count-1
      results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
    next
    results=Left(results,Len(results)-1) & chr(13)
  next
Response.Write "<pre>" & results

Matches using the non-java expression

Group 0 gets the entire substring which includes the comma
Group 1 gets the quote if it's used
Group 2 gets the value not including the comma

[0][0] = 123
[0][1] = 
[0][2] = 123

[1][0] = ,2.99
[1][1] = 
[1][2] = 2.99

[2][0] = ,AMO024
[2][1] = 
[2][2] = AMO024

[3][0] = ,Title
[3][1] = 
[3][2] = Title

[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info

[5][0] = ,
[5][1] = 
[5][2] = 

[6][0] = ,123987564
[6][1] = 
[6][2] = 123987564