I have a Pig script--currently running in local mode--that processes a huge file containing a list of categories:

/root/level1/level2/level3
/root/level1/level2/level3/level4
...

I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'm trying to get something done by streaming the file's content through a PHP script.

I'm finding, though, that the PHP script only sees half of the category lines I pass through it. More precisely, I get back ceil( pig_categories/2 ) records: a limit of 15 produces 8 entries after streaming through the PHP script, and the last one is empty.

-- Pig script snippet
ordered  = ORDER mappable_categories BY category;
limited  = LIMIT ordered 20;

categories = FOREACH limited GENERATE category;
DUMP categories; -- Displays all 20 categories

streamed = STREAM limited THROUGH `php -nF categorize.php`;
DUMP streamed; -- Displays 10 categories

# categorize.php
$category = fgets( STDIN );
echo $category;

Any thoughts on what I'm missing? I've pored over the Pig reference manual for a while now and there doesn't seem to be much information related to streaming through a PHP script. I've also tried the #hadoop channel on IRC to no avail. Any guidance would be much appreciated.

Thanks.

UPDATE

It's becoming evident that this is EOL-related. If I change the PHP script from using fgets() to stream_get_line(), then I get 10 items back, but the record that should be first is skipped and there's a trailing empty record that gets displayed.

(Arts/Animation)
(Arts/Animation/Anime)
(Arts/Animation/Anime/Characters)
(Arts/Animation/Anime/Clubs_and_Organizations)
(Arts/Animation/Anime/Collectibles)
(Arts/Animation/Anime/Collectibles/Cels)
(Arts/Animation/Anime/Collectibles/Models_and_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)
()

In that result set, there should be a first item of (Arts). Closing in, but there's still some gap to close.

A: 

So it turns out that this is one of those instances where whitespace matters. I had an empty line in front of my opening <?php tag. Once I tightened all of that up, everything sailed through and produced as expected. /punitive headslap/
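
For illustration, here is a minimal sketch of what a tightened streaming script might look like. PHP echoes anything that sits outside its tags verbatim, so a blank line above the opening tag ends up in the output stream as an extra record. This is not the exact script from the question: it assumes the script is invoked once as php -n categorize.php (without -F) and reads every record from STDIN itself, which is a different invocation from the one shown above.

```php
<?php
// categorize.php (sketch): the opening tag above must be the very first
// bytes of the file, with no blank line or whitespace in front of it.

// Assumption: run as `php -n categorize.php` (no -F), so the script is
// started once and consumes all of STDIN in this loop.
while (($line = fgets(STDIN)) !== false) {
    // Strip the line ending so each record is re-emitted with exactly one "\n".
    $category = rtrim($line, "\r\n");
    echo $category, "\n";
}
// No closing tag on purpose: it avoids accidental trailing whitespace.
```

Leaving off the closing tag is a common way to make sure nothing after the code leaks into the output either.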

Rob Wilkerson