I am very new to PIG and I am having what feels like a very basic problem. I have a line of code that reads:
A = load 'Sites/trial_clustering/shortdocs/*'
AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);
where each file is basically a line of 4 comma separated words. However PIG is not splitting this into the 4 words. When I do dump A, I get: (Money, coins, loans, debt,,,)
I have tried googling and I cannot seem to find what format my file needs to be in so that PIG will interpret it properly. Please help!
Your problem is that Pig, by default, loads files delimited by tab, not comma. What's happening is that the whole line "Money, coins, loans, debt" is getting stuck in your first column, word1. When you print it, you get the illusion that you have multiple columns, but really the first one is filled with your whole line and the others are null.
To fix this, you should tell PigStorage to split on commas instead of tabs by doing:
A = LOAD '...' USING PigStorage(',') AS (...);
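Filled in with the path and schema from your question, the statement might look like the sketch below. One assumption on my part: your sample line has a space after each comma, and PigStorage(',') splits on the comma only, so the later fields would probably keep a leading space. The extra FOREACH with the built-in TRIM function is my suggestion for cleaning that up (the alias B is just a name I picked); drop it if your real files have no spaces.

```pig
-- Split each line on commas rather than the default tab delimiter.
A = LOAD 'Sites/trial_clustering/shortdocs/*' USING PigStorage(',')
    AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);

-- Assumption: fields may carry a leading space (e.g. " coins"), so trim each one.
B = FOREACH A GENERATE TRIM(word1) AS word1, TRIM(word2) AS word2,
                       TRIM(word3) AS word3, TRIM(word4) AS word4;

DUMP B;
```

Running DESCRIBE A; after the load is a quick way to confirm the schema, and dumping A before and after the change should show whether the words now land in separate columns.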