Merge two CSV files based on a data from a column

abn picture abn · Nov 20, 2014 · Viewed 9.6k times · Source

I have two csv files like below.

CSV1

data13      data23      d      main_data1;main_data2      data13         data23
data12      data22      d      main_data1;main_data2      data12         data22
data11      data21      d      main_data1;main_data2      data11         data21
data3       data4       d      main_data2;main_data4      data3          data4
data52      data62      d      main_data3                 data51         data62
data51      data61      d      main_data3                 main_data3     data61
data7       data8       d      main_data4                 data7          data8

CSV2

id1      main_data1      a1      a2      a3
id2      main_data2      b1      b2      b3
id3      main_data3      c1      c2      c3
id4      main_data4      d1      d2      d3
id5      main_data5      e1      e2      e3

Now my question is, I know how to merge two CSV files when one of the columns is exactly the same in both the files. But my question is a little different. column 4 from CSV1 could contain column 2 from CSV2. I'd like to get a CSV file as below

FINAL_CSV

id1      main_data1      a1      a2      a3      data13
id2      main_data2      b1      b2      b3      data3
id3      main_data3      c1      c2      c3      main_data3
id4      main_data4      d1      d2      d3      data7
id5      main_data5      e1      e2      e3

where:
1. it matches the data from both the columns and gets corresponding rows from the first occurrence and write to the csv file.
2. When there's no match, it can leave the last column in FINAL_CSV blank or write 'NA' or anything of that sort.
3. When data in columns 4 and 5 of CSV1 match exactly, it returns that row instead of the first occurrence.

I'm totally lost on how to do this. Helping with a part of it is fine too. Any suggestions are highly appreciated.
PS- I know data from csv file should be separated by a comma, but for the sake of clarity, I preferred tabs, though the actual data is separated by commas.

EDIT: Actually, the 'main_data' can be in any column in CSV2, not in just column2. The same 'main_data' could also repeat in multiple rows, then I'd like to get all the corresponding rows.

Answer

user3442743 picture user3442743 · Jan 2, 2015

A way with (g)awk .

 awk -F, 'NR==FNR{a[$2]=$0;next}
         {split($4,b,";");x=b[1]}
         (x in a)&&!c[x]++{d[x]=$5}
         ($5 in a){d[$5]=$5}
         END{n=asorti(a,e);for(i=1;i<=n;i++)print a[e[i]]","d[e[i]]}'  CSV1 CSV2

Output

id1,main_data1,a1,a2,a3,data13
id2,main_data2,b1,b2,b3,data3
id3,main_data3,c1,c2,c3,main_data3
id4,main_data4,d1,d2,d3,data7
id5,main_data5,e1,e2,e3,