Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin.
Does anyone know of a good reference manual for Pig Latin? I'm looking for something that includes descriptions of all the syntax and commands of the language. Unfortunately the relevant page on the Pig wiki is broken.
...
Assume I have the following input in Pig:
some
And I would like to convert that into:
s
so
som
some
I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries.
So can Pig Latin do this, or is this something that requires a Java class?
...
So I wrote a Python program to handle a little data processing task.
Here's a very brief specification in a made-up language of the computation I want:
parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
flatten | format "%s %lf %s" aa bb cc
That is, for each line, parse out a word, a floating-point number, an...
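A rough Pig Latin equivalent of that pipeline, assuming space-delimited input and reading quickselect --key=bb 0:5 as "keep the five smallest bb per group" (the nested ORDER/LIMIT needs a reasonably recent Pig):
raw = LOAD 'input.txt' USING PigStorage(' ') AS (aa:chararray, bb:double, cc:chararray);
grp = GROUP raw BY aa;
top5 = FOREACH grp {
    srt = ORDER raw BY bb ASC; -- sort each group by the float field
    lim = LIMIT srt 5;         -- keep the five smallest
    GENERATE FLATTEN(lim);
};
STORE top5 INTO 'output' USING PigStorage(' ');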
Hi.
I have a Pig script that invokes another Python program.
I was able to do so in my own Hadoop environment, but it always fails when I run the script on Amazon's MapReduce web service.
The log says:
org.apache.pig.backend.executionengine.ExecException: ERROR 2090: Received Error while processing the reduce plan: '' failed with exit status: 127...
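The post doesn't show how the Python program is invoked, but a common pattern is Pig streaming; a minimal sketch with a hypothetical process.py (on Elastic MapReduce the SHIP clause matters, since the script has to travel to the task nodes):
DEFINE my_cmd `process.py` SHIP('process.py'); -- hypothetical script, shipped to the cluster
data = LOAD 'input' AS (line:chararray);
out = STREAM data THROUGH my_cmd AS (result:chararray);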
Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...)
Is there also a library out there that w...
I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to Pig, so it can do all the processing.
It looks something like this:
public abstract class Foo extends EvalFunc<Tuple> {
    public Foo() {
        super();
    }

    public Tuple exec(Tuple input) throws IOException {
    ...
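For reference, a sketch of how such a UDF is typically wired into a script; the jar name and the concrete subclass are assumptions, since Foo above is abstract:
REGISTER myudfs.jar;                    -- hypothetical jar built from the UDF
DEFINE ParseLine com.example.FooImpl(); -- hypothetical concrete subclass of Foo
logs = LOAD '/data/logs' AS (line:chararray);
parsed = FOREACH logs GENERATE ParseLine(line);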
I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my...
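For a sense of what "basic stats" looks like in Pig, here is a minimal sketch using the built-in aggregates, with an assumed two-column schema:
data = LOAD 'input' AS (key:chararray, val:double);
grp = GROUP data BY key;
stats = FOREACH grp GENERATE group, COUNT(data), AVG(data.val), MIN(data.val), MAX(data.val);
DUMP stats;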
I have a Pig program where I am trying to compute the minimum center between two bags. In order for it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or to be able to pass another relation into the UDF without ne...
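For context, the COGROUP shape described there looks roughly like this; the relation names, schemas, and the MinCenter UDF are all placeholders:
A = LOAD 'bag_a' AS (id:int, x:double, y:double);
B = LOAD 'bag_b' AS (id:int, x:double, y:double);
C = COGROUP A BY id, B BY id;
D = FOREACH C GENERATE group, FLATTEN(MinCenter(A, B)); -- hypothetical UDF over both bags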
Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end.
https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/
For instance, let's say I'm using Python and Pycassa, how would I load in a new ma...
Using Apache Pig and the text
hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!
I'm trying to match "my brother just didnt do anything wrong."
Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.
Looking at the pig docs, and then ...
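One sketch uses REGEX_EXTRACT (built into recent Pig versions); the exact pattern is an assumption about the intended match:
lines = LOAD 'input.txt' AS (line:chararray);
m = FOREACH lines GENERATE REGEX_EXTRACT(line, '(my brother just[^.!?]*[.!?]?)', 1) AS hit:chararray;
hits = FILTER m BY hit IS NOT NULL; -- REGEX_EXTRACT returns null when nothing matches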
We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map those ids using another file that contains 2 columns of mappings (column 1 is our data, column 2 is a third party's data):
35 6009
521 21599
225 51991
12 6129
We wrote a UD...
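One UDF-free sketch: split the id list with TOKENIZE, flatten, and JOIN against the mapping file (file names and delimiters are assumptions):
data = LOAD 'data.txt' AS (ids:chararray);
ours = FOREACH data GENERATE FLATTEN(TOKENIZE(ids)) AS id:chararray;
mapping = LOAD 'mapping.txt' USING PigStorage(' ') AS (our_id:chararray, their_id:chararray); -- assuming space-delimited
joined = JOIN ours BY id, mapping BY our_id;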
Hi
My background: 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's papers on MapReduce and GFS.
I understand that Pig's language, Pig Latin, is a shift away from the SQL-like declarative style of programming (it suits the way programmers think), and that Hive's query language closely
...
Hello,
I would like to know how to retrieve data from aggregated logs. This is what I have:
- about 30 GB of uncompressed log data loaded into HDFS daily (and this will soon grow to about 100 GB)
This is my idea:
- each night this data is processed with Pig
- logs are read, split, and custom UDF retrieves data like: timestamp, url, user_id...
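A sketch of that nightly pass; the jar, UDF class, output fields, and the $date parameter are all placeholders:
REGISTER myudfs.jar;
DEFINE ParseLog com.example.LogParser(); -- hypothetical UDF returning a tuple of fields
raw = LOAD '/logs/$date' AS (line:chararray);
parsed = FOREACH raw GENERATE FLATTEN(ParseLog(line)) AS (timestamp:long, url:chararray, user_id:chararray);
STORE parsed INTO '/aggregated/$date';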
Given my input data in userid,itemid format:
raw: {userid: bytearray,itemid: bytearray}
dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)
grpd = GROUP raw BY userid;
dump grpd;
(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})
I'd like to generate all of the combinations (order not important) of ...
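One sketch without a UDF: self-join on userid through a duplicate alias, then keep only one ordering of each pair so you get combinations rather than permutations:
copy = FOREACH raw GENERATE userid, itemid;          -- duplicate alias for the self-join
joined = JOIN raw BY userid, copy BY userid;
pairs = FILTER joined BY raw::itemid < copy::itemid; -- drops self-pairs and mirrored duplicates
combos = FOREACH pairs GENERATE raw::userid, raw::itemid, copy::itemid;
dump combos;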
I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by the second value, such that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group them into a dictionary with val...
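A direct Pig sketch of that grouping, assuming space-delimited input:
data = LOAD 'input.txt' USING PigStorage(' ') AS (a:int, b:int);
grp = GROUP data BY b;
out = FOREACH grp GENERATE group, data.a; -- second value, then the bag of first values
STORE out INTO 'output';
Note that the first-column values come out as a bag, e.g. (5,{(1),(2)}), rather than the flat "5 1 2" layout; producing that exact format would take an extra step (a small UDF, for instance).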
I'm trying to combine Hadoop, Pig and Cassandra to be able to work on data stored in Cassandra by means of simple Pig queries. The problem is that I can't get Pig to create MapReduce jobs that actually work with CassandraStorage.
What I did is I copied the storage-conf.xml file from one of my cluster machines on top of the one in contrib/pi...
I'm trying to understand the boundaries of Hadoop and map/reduce, and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't assist with.
It would certainly be interesting if changing one factor of the problem would allow it to be simplified by map/reduce.
Thank you
...
I have the following scenario:
Pig version used: 0.70
Sample HDFS directory structure:
/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>
As you can see in the paths listed abo...
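For reading several of those date directories at once: LOAD accepts Hadoop file globs, so something like the following should cover a date range (the schema is an assumption):
logs = LOAD '/user/training/test/2010081{0,1,2,3,4}' AS (line:chararray);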
Hello,
I get multiple small files into my input directory which I want to merge into a single file, without using the local file system or writing MapReduce jobs. Is there a way to do this using hadoop fs commands or Pig?
Thanks!
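One Pig sketch that merges without touching the local file system: GROUP ... ALL funnels everything through a single reducer, so the output directory ends up with a single part file (it does run a MapReduce job under the hood, though):
small = LOAD '/input/dir/*' AS (line:chararray);
one = GROUP small ALL;                        -- single group, hence a single reducer
merged = FOREACH one GENERATE FLATTEN(small);
STORE merged INTO '/output/merged';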
...
I'm working on a Pig script (my first) that loads a large text file. For each record in that text file, the content of one field needs to be sent off to a RESTful service for processing. Nothing needs to be evaluated or filtered: capture the data, send it off, and the script doesn't need anything back.
I'm assuming that a UDF is required for...
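A sketch of how such a UDF might be invoked; the jar, class, and field names are placeholders, and the HTTP call itself would live inside the UDF's exec() method:
REGISTER restudf.jar;
DEFINE Notify com.example.RestNotify(); -- hypothetical EvalFunc that POSTs its input
recs = LOAD 'bigfile.txt' AS (id:chararray, payload:chararray);
sent = FOREACH recs GENERATE Notify(payload);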