Merging two CSV Files in Azure Data Factory by using custom .NET activity

Jayendran picture Jayendran · Jan 26, 2017 · Viewed 8k times · Source

I have two CSV file which contains which contains many n-columns.I have to merge this two csv files with a single CSV file which having one unique column from the both input file.

I browsed thoroughly all the blogs and sites.All will result into using the custom .NET Activity.So i just go through this site

But still am not able to figure out which part in the C# Coding.Can any one share the code for how to merge this two CSV files using custom .NET Activity in Azure Data Factory.

Answer

wBob picture wBob · Jan 29, 2017

Here is an example of how to join those two tab-separated files on Zip_Code column using U-SQL. This example assumes both files are held in Azure Data Lake Storage (ADLS). This script could easily be incorporated into a Data Factory pipeline:

// Get raw input from file A
@inputA =
    EXTRACT 
        Date_received   string,
        Product string,
        Sub_product string,
        Issue   string,
        Sub_issue   string,
        Consumer_complaint_narrative    string,
        Company_public_response string,
        Company string,
        State   string,
        ZIP_Code    string,
        Tags    string,
        Consumer_consent_provided   string,
        Submitted_via   string,
        Date_sent_to_company    string,
        Company_response_to_consumer    string,
        Timely_response string,
        Consumer_disputed   string,
        Complaint_ID    string

    FROM "/input/input48A.txt"
    USING Extractors.Tsv();


// Get raw input from file B
@inputB =
    EXTRACT Provider_ID string,
            Hospital_Name string,
            Address string,
            City string,
            State string,
            ZIP_Code string,
            County_Name string,
            Phone_Number string,
            Hospital_Type string,
            Hospital_Ownership string,
            Emergency_Services string,
            Meets_criteria_for_meaningful_use_of_EHRs string,
            Hospital_overall_rating string,
            Hospital_overall_rating_footnote string,
            Mortality_national_comparison string,
            Mortality_national_comparison_footnote string,
            Safety_of_care_national_comparison string,
            Safety_of_care_national_comparison_footnote string,
            Readmission_national_comparison string,
            Readmission_national_comparison_footnote string,
            Patient_experience_national_comparison string,
            Patient_experience_national_comparison_footnote string,
            Effectiveness_of_care_national_comparison string,
            Effectiveness_of_care_national_comparison_footnote string,
            Timeliness_of_care_national_comparison string,
            Timeliness_of_care_national_comparison_footnote string,
            Efficient_use_of_medical_imaging_national_comparison string,
            Efficient_use_of_medical_imaging_national_comparison_footnote string,
            Location string

    FROM "/input/input48B.txt"
    USING Extractors.Tsv();


// Join the two files on the Zip_Code column
@output =
    SELECT b.Provider_ID,
           b.Hospital_Name,
           b.Address,
           b.City,
           b.State,
           b.ZIP_Code,
           a.Complaint_ID

    FROM @inputA AS a
         INNER JOIN
             @inputB AS b
         ON a.ZIP_Code == b.ZIP_Code
    WHERE a.ZIP_Code == "36033";


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

This could also be converted into a U-SQL stored procedure with parameters for the filenames and Zip Code.

There are of course may ways to achieve this, each with their own pros and cons. The .net custom activity for example might feel more comfortable for someone with a .net background but you'll need some compute to run it on. Importing the files into an Azure SQL Database would be a good option for someone with a SQL / database background and an Azure SQL DB in the subscription.