ansaurus

Question

Pig Latin: Load multiple files from a date range (part of the directory structure)

Answer 1

A:

Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?

Probably not. This can be done with a custom Load UDF, or you could rethink your directory structure (which works well if your ranges are mostly static).

Additionally, Pig accepts parameters, which might help. Maybe you could write a function that loads the data for one day and unions it into the result set, but I don't know whether that's possible.

Edit: writing a simple Python or bash script that generates the list of dates (folders) is probably the easiest solution. You then just pass that list to Pig, and this should work fine.
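A rough sketch of such a generator script, in case it helps. It assumes bash and GNU date (the -d option; BSD/macOS date takes different flags), and the function names date_range and date_glob are made up for this example. Stepping through real calendar dates matters once a range crosses a month boundary, because a plain numeric range from 20100830 to 20100902 would also produce non-dates such as 20100832.

```shell
#!/usr/bin/env bash
# Print every date from $1 to $2 (inclusive, YYYYMMDD), one per line.
# Assumes GNU date; TZ=UTC avoids daylight-saving surprises at midnight.
date_range() {
  local current="$1" end="$2"
  while [ "$current" -le "$end" ]; do
    echo "$current"
    current=$(TZ=UTC date -d "$current +1 day" +%Y%m%d)
  done
}

# Join the dates with commas and wrap them in braces, giving a
# {a,b,c} style pattern under the base directory in $1.
date_glob() {
  local base="$1" start="$2" end="$3"
  echo "${base}/{$(date_range "$start" "$end" | paste -sd, -)}"
}

date_glob /user/training/test 20100830 20100902
# prints /user/training/test/{20100830,20100831,20100901,20100902}
```

The printed string can then be handed to Pig as the path for LOAD, for example through a parameter.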

Wojtek 2010-08-18 19:57:13

Thanks Wojtek. Well, the grid is already in place and it's not feasible to change the directory structure. I see that temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...); and hadoop fs -ls /user/training/test/{20100810,20100811,20100812} both work fine. hadoop fs -ls /user/training/test/{20100810..20100812} also works, but temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader() AS (...); fails at dump temp or store temp.

Andriyev 2010-08-18 20:51:45

Answer 2

A:

Pig supports HDFS glob patterns (globStatus), so I think Pig can handle the pattern '/user/training/test/{20100810,20100811,20100812}'. Could you paste the error logs?

zjffdu 2010-08-20 06:14:10

Hi zjffdu, I have copied the error log into the question. Thanks

Andriyev 2010-08-26 19:11:35

Answer 3

A:

Hi Andriyev,

I found that this problem is caused by the Linux shell. The shell expands {20100810..20100812} to 20100810 20100811 20100812 for you, so the command you actually run is bin/hadoop fs -ls 20100810 20100811 20100812. The HDFS API, however, does not expand that expression.
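A quick way to see this for yourself, assuming bash (the {a..b} range syntax is a bash feature, not something every shell has). The quoted form is what a program receives when the shell is not involved, which is the situation for a path written inside a Pig LOAD statement.

```shell
#!/usr/bin/env bash
# Unquoted: bash expands the range before the command ever runs.
echo /user/training/test/{20100810..20100812}
# prints /user/training/test/20100810 /user/training/test/20100811 /user/training/test/20100812

# Quoted: no expansion, so the command sees the literal pattern.
echo '/user/training/test/{20100810..20100812}'
# prints /user/training/test/{20100810..20100812}
```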

zjffdu 2010-09-15 10:12:44

Answer 4

A:

Hi,

As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to simply use Pig parameters (which is a good way to make your script more reusable anyway):

shell:

pig -f script.pig -param input=/user/training/test/{20100810..20100812}

script.pig:

temp = LOAD '$input' USING SomeLoader() AS (...);
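One thing to watch, sketched below under the assumption that the shell is bash: an unquoted brace range attached to input= expands into several separate words, so the parameter may not arrive as a single value. Building the comma-separated {a,b,c} form first and quoting it keeps it as one argument. The helper count_args is hypothetical and only there to show the word count; the final pig call is left as a comment and has not been tested against a Pig installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted, the brace range turns the one parameter into three words.
count_args -param input=/user/training/test/{20100810..20100812}   # prints 4

# Build a comma-separated brace list instead and keep it quoted.
days=$(echo {20100810..20100812} | tr ' ' ',')
input="/user/training/test/{${days}}"
echo "$input"   # prints /user/training/test/{20100810,20100811,20100812}

# Quoted, the parameter stays a single word.
count_args -param input="$input"   # prints 2

# Then, not run here: pig -f script.pig -param input="$input"
```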

Ro 2010-09-24 18:07:41
