I am using awk to parse my data with "," as separator as the input is a csv file. However, there are "," within the data which is escaped by double quotes ("...").
Example
filed1,filed2,field3,"field4,FOO,BAR",field5
How can i ignore the comma "," within the the double quote so that I can parse the output correctly using awk? I know we can do this in excel, but how do we do it in awk?
It's easy, with GNU awk 4:
zsh-4.3.12[t]% awk '{
for (i = 0; ++i <= NF;)
printf "field %d => %s\n", i, $i
}' FPAT='([^,]+)|("[^"]+")' infile
field 1 => filed1
field 2 => filed2
field 3 => field3
field 4 => "field4,FOO,BAR"
field 5 => field5
Adding some comments as per OP requirement.
From the GNU awk manual on "Defining fields by content:
The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field. In the case of CSV data as presented above, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant, we would have
/([^,]+)|("[^"]+")/
. Writing this as a string requires us to escape the double quotes, leading to:
FPAT = "([^,]+)|(\"[^\"]+\")"
Using +
twice, this does not work properly for empty fields, but it can be fixed as well:
As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘
+
’ to ‘*
’) allows fields to be empty:
FPAT = "([^,]*)|(\"[^\"]+\")"