I have a User Defined Function (UDF) written in Java that parses lines of a log file and returns the extracted fields to Pig, which does the rest of the processing.

It looks something like this:

public abstract class Foo extends EvalFunc<Tuple> {
    public Foo() {
        super();
    }

    public Tuple exec(Tuple input) throws IOException {
        try {
            // do stuff with input
        } catch (Exception e) {
            throw WrappedIOException.wrap("Error with line", e);
        }
    }
}

My question is: if the UDF throws the IOException, will the job stop completely, or will it still return results for the remaining lines that don't throw an exception?

Example: I run this in Pig:

REGISTER myjar.jar
DEFINE Extractor com.namespace.Extractor();

logs = LOAD '$IN' USING TextLoader AS (line: chararray);
events = FOREACH logs GENERATE FLATTEN(Extractor(line));

With this input:

1.5 7 "Valid Line"
1.3 gghyhtt Inv"alid line"" I throw an exceptioN!!
1.8 10 "Valid Line 2"

Will it process the two valid lines, so that 'events' has 2 tuples, or will it just die in a fire?

+3  A: 

If the UDF throws an exception, the task fails and is retried.

It will fail again three more times (4 attempts by default), and then the whole job is marked FAILED.

If you want to log the error without stopping the job, you can return null:

public Tuple exec(Tuple input) throws IOException {
    try {
        // do stuff with input
    } catch (Exception e) {
        System.err.println("Error with ...");
        return null;
    }
}

And filter them later in Pig:

events_all = FOREACH logs GENERATE Extractor(line) AS line;
events_valid = FILTER events_all by line IS NOT null;
events = FOREACH events_valid GENERATE FLATTEN(line);

In your example the output will contain only the two valid lines. Be careful with this behavior, though: the error shows up only in the logs and won't fail your job!
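
To try the return-null-then-filter pattern without a cluster, here is a minimal, self-contained Java sketch. It deliberately does not use the Pig API: `LineParser`, `parse` and `parseAll` are hypothetical names, and the line format (a decimal number, an integer, then a quoted string) is only an assumption guessed from the sample input in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the UDF: plain Java, no Pig classes involved.
final class LineParser {
    // Assumed format: <decimal> <int> "<text>", guessed from the sample input.
    private static final Pattern LINE =
        Pattern.compile("^(\\d+\\.\\d+) (\\d+) \"([^\"]*)\"$");

    // Returns the parsed fields, or null when the line cannot be parsed.
    // This mirrors returning null from exec() instead of throwing.
    static Object[] parse(String line) {
        try {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            return new Object[] {
                Double.parseDouble(m.group(1)),
                Integer.parseInt(m.group(2)),
                m.group(3)
            };
        } catch (Exception e) {
            // Log and swallow the error so the remaining lines are still processed.
            System.err.println("Error with line: " + e.getMessage());
            return null;
        }
    }

    // Mirrors the FILTER ... IS NOT null step: keep only lines that parsed.
    static List<Object[]> parseAll(List<String> lines) {
        List<Object[]> valid = new ArrayList<>();
        for (String line : lines) {
            Object[] fields = parse(line);
            if (fields != null) {
                valid.add(fields);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("1.5 7 \"Valid Line\"");
        lines.add("1.3 gghyhtt Inv\"alid line\"\" I throw an exceptioN!!");
        lines.add("1.8 10 \"Valid Line 2\"");
        System.out.println(parseAll(lines).size()); // prints 2
    }
}
```

The invalid line is reported on stderr only, which is the same trade-off as in the UDF: nothing fails, so the logs are the only place where the problem is visible.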

Reply to comment #1:

Actually, the whole resulting tuple would be null (so there are no fields inside).

For example if your schema has 3 fields:

 events_all = FOREACH logs
              GENERATE Extractor(line) AS line:tuple(a:int,b:int,c:int);

and some lines are incorrect we would get:

 ()
 ((1,2,3))
 ((1,2,3))
 ()
 ((1,2,3))

And if you don't filter out the null tuples and then try to access a field, you get a java.lang.NullPointerException:

events = FOREACH events_all GENERATE line.a;
Ro
In my case, I also define a schema in the UDF, so by returning null, everything in the resultant tuple would be null, correct?
Daniel Huckstep
How do you filter that then? FILTER events BY a IS NOT NULL, assuming the EvalFunc always returns null if it can't figure out 'a'?
Daniel Huckstep
You need to filter on the name of the field returned by the UDF. In our case its name is 'line' and its value could be 'null' or '(1,2,3)'. So you do a 'FILTER events_all BY line IS NOT null' as shown in the first Pig example. If you were returning a tuple with 3 null fields, e.g. '(,,)', instead of 'null', you could do your 'FILTER events BY line.a IS NOT NULL', but that is less simple.
Ro
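
The difference between a null tuple and a tuple of null fields, as discussed in the last comment, can be modelled in plain Java. This is only an analogy, not Pig code: `Row`, `NullFiltering`, `keepNonNullRows` and `keepRowsWithA` are hypothetical names, a null `Row` stands in for a null tuple, and a `Row` whose fields are all null stands in for '(,,)'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the UDF output with three int fields a, b, c.
final class Row {
    final Integer a, b, c;

    Row(Integer a, Integer b, Integer c) {
        this.a = a;
        this.b = b;
        this.c = c;
    }
}

final class NullFiltering {
    // Rough analogue of filtering on the whole tuple: line IS NOT null.
    static List<Row> keepNonNullRows(List<Row> rows) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (r != null) {
                out.add(r);
            }
        }
        return out;
    }

    // Rough analogue of filtering on one field: line.a IS NOT NULL.
    // The row is checked first; dereferencing a null row would throw
    // a NullPointerException.
    static List<Row> keepRowsWithA(List<Row> rows) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (r != null && r.a != null) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            null,                       // UDF returned null
            new Row(1, 2, 3),           // UDF returned (1,2,3)
            new Row(null, null, null)); // UDF returned (,,)
        System.out.println(keepNonNullRows(rows).size()); // prints 2
        System.out.println(keepRowsWithA(rows).size());   // prints 1
    }
}
```

Filtering on the whole value needs a single check, while filtering on a field has to account for both kinds of "empty" result, which is why returning null from the UDF is the simpler convention.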