Awk & Sort-Output as Comma Delimited?

pooh80133 picture pooh80133 · Feb 9, 2015 · Viewed 13.8k times · Source

I am trying to get this to output as comma delimited. The current version doesn't work at all (I get a blank file as an output), and previous versions (where I keep the awk BEGIN statements but don't have the sort delimiter) will just output as tab delimited, not comma delimited. In the previous versions, without attempting to get the comma delimiters, I do get the expected answer (with the complicated filters, etc), so I'm not asking for help with that portion of it. I realize this is a very ugly way to filter and the numbers are also ugly/very large.

The background of the question: Find the regions in the file lamina.bed that overlap with the region chr12:5000000-6000000, and to sort descending by column 4, output as comma delimited. Chromosome is the first column, start position of the region is column 2, end position is column 3, value is column 4. We are supposed to use awk (in Unix bash shell). Thank you in advance for your help!

awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000)' /vol1/opt/data/lamina.bed | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn > ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
cat ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv

sample lines of input (tab delimited, including the lines on chr12 that should work):

#chrom  start   end value
chr1    11323785    11617177    0.86217008797654
chr1    12645605    13926923    0.934891485809683
chr1    14750216    15119039    0.945945945945946
chr12   3306736 5048326 0.913561847988077
chr12   5294045 5393088 0.923076923076923
chr12   5505370 6006665 0.791318864774624
chr12   7214638 7827375 0.8562874251497
chr12   8139885 10173149    0.884353741496599

Answer

John1024 picture John1024 · Feb 9, 2015

To get comma-separated output, use the following:

$ awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1;print}' file | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn 
chr12,5294045,5393088,0.923076923076923
chr12,3306736,5048326,0.913561847988077
chr12,5505370,6006665,0.791318864774624

The only change above is the addition on the action:

{$1=$1;print}

awk will only reformat a line with a new field separator if the one or more of the fields on the line have been changed in some way. $1=$1 is sufficient to indicate that field 1 has been changed. Consequently, the new field separators are inserted.

Also, the two calls to awk can be combined into a single call:

awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1; if($1 == "chr12") print}' file | sort -t$"," -k4rn

Simpler Example

In the following, the input is tab-separated and the output field separator, OFS, is set to a comma. In this first example, the awk command print is used:

$ echo $'a\tb\tc' | awk -v OFS=, '{print}'
a       b       c

Despite OFS=,, the output retains the tab-separator.

Now, we add the simple statement $1=$1 and observe the output:

$ echo $'a\tb\tc' | awk -v OFS=, '{$1=$1;print}'
a,b,c

The output is now comma-separated. Again, that is because awk only reformats a line with the new OFS if it thinks that a field on the line has been changed in some way. The assignment of $1 to itself is sufficient to trigger that reformat.

Note that it is not sufficient to make a change that affects the line as a whole. For example, the following does not trigger a reformat:

$ echo $'a\tb\tc' | awk -v OFS=, '{$0=$0;print}'
a       b       c

It is necessary to change one or more fields of the line individually. In the following, sub operates on $0 as a whole and, consequently, no reformat is triggered:

$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW");print}'
NEW     b       c

In the example below, however, sub operates specifically on field $1 and hence triggers a reformat:

$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW", $1);print}'
NEW,b,c