ansaurus

Question

Splitting input into substrings in PIG (Hadoop)

Answer 1

+1 A:

Niels, TOKENIZE takes a delimiter argument, so you can make it split each letter; however I can't think of a way to make it produce overlapping tokens.

It's pretty straightforward to write a UDF in Pig, though. You just implement a simple interface called EvalFunc (details here: http://wiki.apache.org/pig/UDFManual ). Pig was built around the idea of users writing their own functions to process most anything, and writing your own UDF is therefore a common and natural thing to do.

An even easier option, although not as efficient, is to use Pig streaming to pass your data through a script (I find whipping up a quick Perl or Python script to be faster than implementing Java classes for one-off jobs). There is an example of this here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ -- it demonstrates the use of a pre-existing library, a Perl script, a UDF, and even an on-the-fly awk script.
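
As a rough illustration of the streaming route, here is a minimal sketch of a filter that emits overlapping substrings, the case TOKENIZE does not cover. It assumes "overlapping tokens" means fixed-length sliding windows; the names `sliding_windows` and `emit_windows` and the default width of 3 are hypothetical choices for illustration, not something taken from the question.

```python
import io


def sliding_windows(word, width=3):
    """Return every overlapping substring of `width` characters in `word`.

    Words shorter than `width` yield an empty list.
    """
    return [word[i:i + width] for i in range(len(word) - width + 1)]


def emit_windows(lines, out, width=3):
    """Read one word per line and write one window per line to `out`."""
    for line in lines:
        word = line.rstrip("\n")
        for token in sliding_windows(word, width):
            out.write(token + "\n")


# Local check with an in-memory stream; under Pig streaming the call
# would be emit_windows(sys.stdin, sys.stdout) instead.
demo = io.StringIO()
emit_windows(["hadoop\n"], demo)
print(demo.getvalue(), end="")
```

Keeping the window logic in its own function makes it easy to test locally before shipping the script to the cluster.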

SquareCog 2009-09-09 16:10:01

Answer 2

A:

Hi Niels,

Here is how you might do it with Pig streaming and Python, without writing a custom UDF:

Suppose your data is a single column of words. The Python script (let's call it wordSeq.py) that emits every prefix of each word would be:

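#!/usr/bin/python
### wordSeq.py ### [don't forget to chmod u+x wordSeq.py !]
import sys
for word in sys.stdin:
  # Strip trailing whitespace, including the newline on each streamed record.
  word = word.rstrip()
  # Emit every prefix of the word, one per line (range works on Python 2 and 3).
  sys.stdout.write('\n'.join([word[:i+1] for i in range(len(word))]) + '\n')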

Then, in your Pig script, define the streaming command and tell Pig to ship the script along with the job:

-- wordSplitter.pig ---
DEFINE CMD `wordSeq.py` ship('wordSeq.py');
W0 = LOAD 'words';
W = STREAM W0 THROUGH CMD as (word: chararray);
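
To make the expected result concrete, here is a small sketch of the same prefix logic as a plain function that can be checked locally before running the Pig job. The function name `prefixes` is made up for illustration; the slicing is the same as in wordSeq.py.

```python
def prefixes(word):
    """Return every leading substring of `word`, shortest first."""
    return [word[:i + 1] for i in range(len(word))]


# Each input word becomes one output row per prefix.
for row in prefixes("pig"):
    print(row)
# prints:
# p
# pi
# pig
```

So a single input row "pig" should come back from the STREAM statement as three rows in W.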

eytan 2009-11-13 08:59:31

Answer 3

+1 A:

Use the piggybank library.

http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/piggybank/evaluation/string/SUBSTRING.html

Use like this:

REGISTER /path/to/piggybank.jar;
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

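-- A stands for a relation you have already loaded. INPUT and OUTPUT are
-- reserved words in Pig Latin, so they cannot be used as aliases.
B = FOREACH A GENERATE SUBSTRING((chararray)$0, 0, 10);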

morganchristiansson 2010-06-17 16:36:50

Sounds good. Is this a new feature in the 0.7.0 release?

Niels Basjes 2010-06-17 18:32:37
