pig

Reference manual for Apache Pig Latin

Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin. Does anyone know of a good reference manual for PigLatin? I'm looking for something that includes all the syntax and commands descriptions for the language. Unfortunately the wiki page in Pig wiki is broken. ...

Splitting input into substrings in PIG (Hadoop)

Assume I have the following input in Pig: some And I would like to convert that into: s so som some I've not (yet) found a way to iterate over a chararray in pig latin. I have found the TOKENIZE function but that splits on word boundries. So can "pig latin" do this or is this something that requires a Java class to do that? ...

What language could I use for fast execution of this database summarization task?

So I wrote a Python program to handle a little data processing task. Here's a very brief specification in a made-up language of the computation I want: parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \ flatten | format "%s %lf %s" aa bb cc That is, for each line, parse out a word, a floating-point number, an...

STREAM keyword in pig script that runs in Amazon Mapreduce

Hi. I have a pig script, that activates another python program. I was able to do so in my own hadoop environment, but I always fail when I run my script in Amazon map reduce WS. The log say: org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: '' failed with exit status: 127...

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader: REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar; DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); log = LOAD '/data/logs' USING SequenceFileLoader AS (...) Is there also a library out there that w...

Does throwing an exception in an EvalFunc pig UDF skip just that line, or stop completely?

I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to pig, so it can do all the processing. It looks something like this: public abstract class Foo extends EvalFunc<Tuple> { public Foo() { super(); } public Tuple exec(Tuple input) throws IOException { ...

Examples of simple stats calculation with hadoop

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my...

How can I load a file into a DataBag from within a Yahoo PigLatin UDF?

I have a Pig program where I am trying to compute the minimum center between two bags. In order for it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or to be able to pass another relation into the UDF without ne...

How to use Cassandra's Map Reduce with or w/o Pig?

Can someone explain how MapReduce works with Cassandra .6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end. https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ For instance, let's say I'm using Python and Pycassa, how would I load in a new ma...

Regexp matching in pig

Using apache pig and the text hahahah. my brother just didnt do anything wrong. He cheated on a test? no way! I'm trying to match "my brother just didnt do anything wrong." Ideally, I'd want to match anything beginning with "my brother just" and end with either punctuation(end of sentence) or EOL. Looking at the pig docs, and then ...

Pass a relation to a PIG UDF when using FOREACH on another relation?

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids to another file that contains 2 columns of mappings like (so column 1 is our data, column 2 is a 3rd parties data): 35 6009 521 21599 225 51991 12 6129 We wrote a UD...

Difference between Pig and Hive? Why have both?

Hi My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS. I understand that- Pig's language Pig Latin is a shift from(suits the way programmers think) SQL like declarative style of programming and Hive's query language closely ...

Retrieving information from aggregated weblogs data, how to do it?

Hello, I would like to know how to retrieve data from aggregated logs? This is what I have: - about 30GB daily of uncompressed log data loaded into HDFS (and this will grow soon to about 100GB) This is my idea: - each night this data is processed with Pig - logs are read, split, and custom UDF retrieves data like: timestamp, url, user_id...

generating bigram combinations from grouped data in pig.

given my input data in userid,itemid format: raw: {userid: bytearray,itemid: bytearray} dump raw; (A,1) (A,2) (A,4) (A,5) (B,2) (B,3) (B,5) (C,1) (C,5) grpd = GROUP raw BY userid; dump grpd; (A,{(A,1),(A,2),(A,4),(A,5)}) (B,{(B,2),(B,3),(B,5)}) (C,{(C,1),(C,5)}) I'd like to generate all of the combinations(order not important) of ...

How can I group a large dataset

I have simple text file containing two columns, both integers 1 5 1 12 2 5 2 341 2 12 and so on.. I need to group the dataset by second value, such that the output will be. 5 1 2 12 1 2 341 2 Now the problem is that the file is very big around 34 Gb in size, I tried writing a python script to group them into a dictionary with val...

Bundling jars when submitting map/reduce jobs via Pig?

I'm trying to combine Hadoop, Pig and Cassandra to be able to work on data stored in Cassandra by means of simple Pig queries. Problem is I can't get Pig to create Map/Reduce jobs that actually work with the CassandraStorage. What I did is I copied the storage-conf.xml file from one of my cluster machines on top of the one in contrib/pi...

Is there a canonical problem that provably can't be aided with map/reduce?

I'm trying to understand the boundaries of hadoop and map/reduce and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't assist in. It certainly would be interesting if changing one factor of the problem would allow simplification from map/reduce. Thank you ...

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario- Pig version used 0.70 Sample HDFS directory structure: /user/training/test/20100810/<data files> /user/training/test/20100811/<data files> /user/training/test/20100812/<data files> /user/training/test/20100813/<data files> /user/training/test/20100814/<data files> As you can see in the paths listed abo...

Merging multiple files into one within Hadoop

Hello, I get multiple small files into my input directory which I want to merge into a single file without using the local file system or writing mapreds. Is there a way I could do it using hadoof fs commands or Pig? Thanks! ...

Call RESTful service in Pig script

I'm working on a Pig script (my first) that loads a large text file. For each record in that text file, the content of one field needs to be sent off to a RESTful service for processing. Nothing needs to be evaluated or filtered. Capture data, send it off and the script doesn't need anything back. I'm assuming that a UDF is required for...