tags:

views:

47

answers:

2

I'm working on a Pig script (my first) that loads a large text file. For each record in that text file, the content of one field needs to be sent off to a RESTful service for processing. Nothing needs to be evaluated or filtered. Capture data, send it off and the script doesn't need anything back.

I'm assuming that a UDF is required for this kind of functionality, but I'm new enough to Pig that I don't have a clear picture of what type of function I should build. My best guess would be a Store Function since the data is ultimately getting stored somewhere, but I feel like the amount of guesswork involved in coming to that conclusion is higher than I'd like.

Any insight or guidance would be much appreciated.

A: 

Having never found even a hint of an answer to this, I decided to move in a different direction. I'm using Pig to load and parse the large file, but then streaming each record that I care about to PHP for additional processing that Pig doesn't seem to have the capability to handle cleanly.

It's still not complete (read: there's a great big, very unhappy bug in the mix), but I think the concept is solid--just need to work out the implementation details.

everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
-- apply filter
-- apply filter
-- ...
-- apply last filter
ordered  = ORDER filtered_categories BY category;

streamed = STREAM limited THROUGH `php -nF process_categories.php`;
DUMP streamed;
Rob Wilkerson
A: 

Have you had a look to DBStorage which does something similar?

everything = LOAD 'categories.txt' USING PigStorage() AS (category:chararray);
...
STORE ordered INTO RestStorage('https://...');
Ro