I have a Pig script--currently running in local mode--that processes a huge file containing a list of categories:
/root/level1/level2/level3
/root/level1/level2/level3/level4
...
I need to insert each of these into an existing database by calling a stored procedure. Because I'm new to Pig and the UDF interface is a little daunting, I'm trying to get something done by streaming the file's content through a PHP script.
I'm finding that the PHP script only sees half of the category lines I'm passing through it, though. More precisely, I see a record returned for ceil( pig_categories/2 )
. A limit of 15 will produce 8 entries after streaming through the PHP script--the last one will be empty.
-- Pig script snippet
ordered = ORDER mappable_categories BY category;
limited = LIMIT ordered 20;
categories = FOREACH limited GENERATE category;
DUMP categories; -- Displays all 20 categories
streamed = STREAM limited THROUGH `php -nF categorize.php`;
DUMP streamed; -- Displays 10 categories
# categorize.php
$category = fgets( STDIN );
echo $category;
Any thoughts on what I'm missing. I've poured over the Pig reference manual for a while now and there doesn't seem to be much information related to streaming through a PHP script. I've also tried the #hadoop channel on IRC to no avail. Any guidance would be much appreciated.
Thanks.
UPDATE
It's becoming evident that this is EOL-related. If I change the PHP script from using fgets()
to stream_get_line()
, then I get 10 items back, but the record that should be first is skipped and there's a trailing empty record that gets displayed.
(Arts/Animation)
(Arts/Animation/Anime)
(Arts/Animation/Anime/Characters)
(Arts/Animation/Anime/Clubs_and_Organizations)
(Arts/Animation/Anime/Collectibles)
(Arts/Animation/Anime/Collectibles/Cels)
(Arts/Animation/Anime/Collectibles/Models_and_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures)
(Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam)
()
In that result set, there should be a first item of (Arts)
. Closing in, but there's still some gap to close.