views:

24

answers:

1

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids to another file that contains 2 columns of mappings like (so column 1 is our data, column 2 is a 3rd parties data):

35 6009
521 21599
225 51991
12 6129

We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value and iterate over each and return the first mapped value from the passed in mappings (thinking that is how it would logically work).

We are loading the data in PIG like this:

data = LOAD 'input.txt' USING PigStorage() AS (name:chararray, category:chararray);

mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);

Then our generate is:

output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);

However the error we get is:
'there is an error during parsing: Invalid alias mappings in [data::title: chararray,data::category, chararray]`

It seems that Pig is trying to find a column called "mappings" on our original data. Which if course isn't there. Is there any way to pass a relation that is loaded into a UDF?

Is there any way the "Map" type in PIG will help us here? Or do we need to somehow join the values?

EDIT: To be more specific - we don't want to map ALL of the category ids to the 3rd party ids. We just wanted to map the first. The UDF will iterate over the list of our category ids - and will return when it finds the first mapped value. So if the input looked like:

someProduct\t35 521 225

the output would be:
someProduct\t6009

A: 

Hi,

I don't think you can do it this wait in Pig.

A solution similar to what you wanted to do would be to load the mapping file in the UDF and then process each record in a FOREACH. An example is available in PiggyBank LookupInFiles. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.

DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');

data = LOAD 'input.txt' USING PigStorage() AS (name:chararray, category:chararray);

output = FOREACH data GENERATE title, MAP_PRODUCT(category);

This will work if your mapping file is not too big. If it does not fit in memory you will have to partition the mapping file and run the script several time or tweak the mapping file's schema by adding a line number and use a native join and nested FOREACH ORDER BY/LIMIT 1 for each product.

Ro