How can I incorporate the current input filename into my Pig Latin script?

Kevin Fink · Mar 17, 2012 · Viewed 10.3k times

I am processing data from a set of files which contain a date stamp as part of the filename. The data within the file does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script. Is there a way to do that within Pig Latin (an extension to PigStorage maybe?) or do I need to preprocess all of the files using Perl or the like beforehand?

I envision something like the following:

-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);

-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
  REGEX_EXTRACT(field3, '.*-(20\\d{6})-.*', 1) AS datestamp,
  field1, field2;

Note the special "filename" datatype in the LOAD statement. It seems this would have to happen there, because once the data has been loaded it is too late to get back to the source filename.

Answer

user1591487 · Dec 19, 2012

You can do this with PigStorage by specifying the -tagsource option, as follows:

A = LOAD 'input' using PigStorage(',','-tagsource'); 
B = foreach A generate INPUT_FILE_NAME; 

The first field in each tuple will contain the input path (INPUT_FILE_NAME).

This is according to the PigStorage API doc: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
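
Applied to the script in the question, this might look like the sketch below. It is untested and rests on a few assumptions: the input is tab-delimited (the PigStorage default), the AS schema covers the tagged field as its first column (the name source is made up for illustration), and each filename carries the date between hyphens, as the question's regex implies. Behaviour may differ between Pig versions, so it is worth checking against a small input first.

```pig
-- Assumption: -tagsource prepends the input source as the first field,
-- so the schema lists it before the two data fields.
rawdata = LOAD '/directory/of/files/'
          USING PigStorage('\t', '-tagsource')
          AS (source:chararray, field1:chararray, field2:int);

-- Pull the 8-digit date (20YYMMDD) out of the tagged source name.
-- The backslash is doubled because Pig unescapes the string literal
-- before handing it to the Java regex engine.
annotated = FOREACH rawdata GENERATE
  REGEX_EXTRACT(source, '.*-(20\\d{6})-.*', 1) AS datestamp,
  field1, field2;
```

If the regex does not match a given filename, REGEX_EXTRACT returns null for that row, so a FILTER on datestamp IS NOT NULL may be useful afterwards.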

Dan