Hi Niels,
Here is how you might do it with Pig streaming and Python, without writing custom UDFs:
Suppose your data is just one column of words. The Python script (let's call it wordSeq.py) to process it would be:
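#!/usr/bin/python
### wordSeq.py ### [don't forget to chmod u+x wordSeq.py !]
import sys

# Pig streaming sends one record per line on stdin.
for word in sys.stdin:
    word = word.rstrip()
    # Emit every prefix of the word, one per line (range works on Python 2 and 3).
    sys.stdout.write('\n'.join([word[:i+1] for i in range(len(word))]) + '\n')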
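To make the output concrete, here is a minimal sketch of the same prefix logic pulled out into a function. The name `prefixes` is only for illustration and is not part of wordSeq.py:

```python
def prefixes(word):
    """Return every leading substring of word, shortest first."""
    return [word[:i + 1] for i in range(len(word))]


# Each element corresponds to one line the streaming script would write.
print(prefixes("pig"))  # ['p', 'pi', 'pig']
```

So a single input row produces one output row per character, which is why the streamed relation ends up with more rows than the input.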
Then, in your Pig script, you tell Pig that you are streaming through the above script and that it should ship the script to the cluster nodes as needed:
-- wordSplitter.pig ---
DEFINE CMD `wordSeq.py` ship('wordSeq.py');
W0 = LOAD 'words';
W = STREAM W0 THROUGH CMD as (word: chararray);
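If you want to sanity-check the script before running it on the cluster, something like the following sketch may help. It feeds a few words to the script on stdin, one per line, roughly the way the STREAM statement would, and collects the lines that come back. The helper name `run_stream` is hypothetical, and it assumes the script runs under the same Python interpreter you launch the helper with:

```python
import subprocess
import sys


def run_stream(script_path, words):
    """Pipe one word per line into the script and return its output rows."""
    payload = ''.join(w + '\n' for w in words)
    result = subprocess.run(
        [sys.executable, script_path],   # assumption: script works with this interpreter
        input=payload,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,                      # raise if the script exits non-zero
    )
    return result.stdout.splitlines()
```

Calling `run_stream('wordSeq.py', ['pig', 'ab'])` should give you the rows you would expect to see in W for those two input words.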