hadoop

Needed software that uses Hadoop/Nagios.

Hello, I want to experiment with Hadoop and related projects such as Pig and HBase, and I'm trying to think of an interesting (and, most importantly, useful) project to start with that would be related to Nagios. I have access to ~200 hosting servers running Apache/MySQL/PHP/Nagios (so I have lots of logs/data that I cou...

Can cat but cannot ls file in Hadoop DFS

This is the weirdest thing ever. So I can see these files and cat them: [jchen@host hadoop-0.20.2]$ bin/hadoop fs -ls /users/jchen/ Found 3 items -rw-r--r-- 1 jchen supergroup 26553445 2010-07-14 21:10 /users/jchen/20100714T192827^AS17.data -rw-r--r-- 1 jchen supergroup 461957962 2010-07-14 21:10 /users/j...

Hadoop... Text.toString() conversion problems

Hi everyone, I'm writing a simple program for enumerating triangles in directed graphs for my project. First, for each input arc (e.g. a b, b c, c a; note: a tab symbol serves as the delimiter) I want my map function to output the following pairs ([a, to_b], [b, from_a], [a_b, -1]): public void map(LongWritable key, Text value, ...
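To make the intended map output concrete, here is a minimal sketch in plain Python rather than the Hadoop Java API. The function name `map_arc` is hypothetical, and the `to_`/`from_` prefixes and the `-1` marker are taken from the pairs listed in the question.

```python
def map_arc(line):
    """Return the three (key, value) pairs for one tab-separated arc 'src<TAB>dst'."""
    src, dst = line.rstrip("\n").split("\t")
    return [
        (src, "to_" + dst),        # src has an outgoing arc to dst
        (dst, "from_" + src),      # dst has an incoming arc from src
        (src + "_" + dst, "-1"),   # marker recording that the arc src -> dst exists
    ]


if __name__ == "__main__":
    for pair in map_arc("a\tb"):
        print(pair)
```

Keeping every value a plain string sidesteps type conversions inside the mapper, which seems relevant given that the question is about `Text.toString()`.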

Jar works with standalone hadoop, but not on the actual cluster (java.lang.ClassNotFoundException: org.jfree.data.xy.XYDataset)

Hi, I am trying to build my project using eclipse on windows and execute on a linux cluster. The project depends on some external jars, which I enclosed using eclipse's "Export->Runnable JAR -> Package required library into jar" build option. I checked the jar contains the classes within a folder structure, and the external jars are in ...

Partitioner without Mapper

I have essentially a series of reduce jobs that I am running on a lot of data using Hadoop Streaming. I am not really using my Mappers for anything, so I am just using Identity Mappers, but I do need the default partitioner Hadoop is giving me to group my data in a different manner for each step of my MR job. I don't know enough the system we...

Hadoop beginners

Hi, I'm trying to practice some data mining algorithms over hadoop. Can I do it with HDFS alone or do I need to use the sub-projects like hive/hbase/pig? Thanks, ram. ...

Which key class is suitable for secondary sort?

In Hadoop you can use the secondary-sort mechanism to sort the values before they are sent to the reducer. The way this is done in Hadoop is that you add the value to sort by to the key and then have some custom group and key compare methods that hook into the sorting system. So you'll need to have a key that consists essentially of bo...
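The mechanism described above can be sketched outside Hadoop in a few lines of Python. This is only an illustration of the idea (sort on the full composite key, group on the natural key alone) and does not use the Hadoop API. The function name and the record layout `(natural_key, sort_value, payload)` are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def secondary_sort(records):
    """Yield (natural_key, values) with values ordered by sort_value.

    records: iterable of (natural_key, sort_value, payload) tuples.
    """
    # Sort comparator: compares the whole composite key (natural_key, sort_value).
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    # Grouping comparator: compares only the natural key, so all values for
    # one natural key reach the same "reduce call", already sorted.
    for natural_key, group in groupby(ordered, key=itemgetter(0)):
        yield natural_key, [(r[1], r[2]) for r in group]


if __name__ == "__main__":
    data = [("u1", 3, "c"), ("u2", 1, "x"), ("u1", 1, "a")]
    for key, values in secondary_sort(data):
        print(key, values)
```

In a real job the same split shows up as two separate comparators on one composite key class: one for sorting and one for grouping.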

Regexp matching in pig

Using Apache Pig and the text hahahah. my brother just didnt do anything wrong. He cheated on a test? no way! I'm trying to match "my brother just didnt do anything wrong." Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL. Looking at the pig docs, and then ...
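One way to express "begins with the phrase, ends at sentence punctuation or end of line" is sketched below in Python. Pig evaluates Java regular expressions, so this is not Pig code. The pattern sticks to constructs that, as far as I know, behave the same in both engines, but that is an assumption to verify. The names `PATTERN` and `find_sentences` are made up for the example.

```python
import re

# "my brother just", then any run of characters that are not sentence
# punctuation or a newline, then either one punctuation mark or end of line.
PATTERN = re.compile(r"my brother just[^.!?\n]*(?:[.!?]|$)", re.MULTILINE)


def find_sentences(text):
    """Return every substring that starts with the phrase and ends at . ! ? or EOL."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    sample = ("hahahah. my brother just didnt do anything wrong. "
              "He cheated on a test? no way!")
    print(find_sentences(sample))
```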

a Reducer per HBase table

Basically, I need to route data to the right Reducer. Each Reducer is going to be a TableReducer. I have the following file venodor1, user1, xxxx=n venodor1, user1, xxxx=n venodor2, user2, xxxx=n venodor2, user2, xxxx=n I need to insert that in the following hbase tables Table vendor1: [user1] => {data:xxxx = n} [user2] => {data:xx...

Just how much Java does one need to use Hadoop and Mahout effectively?

I'm a PHP developer. Let's just get that out of the way now. But Hadoop – and Mahout in particular – have piqued my interest. I'm ready to take the dive into Java in order to use them. So, from people experienced enough to know, just how much Java will I need to be able to use these effectively? From what I've seen, programming mappers/re...

Hadoop and Eclipse

Hi, I'm trying to implement the PageRank algorithm on the Hadoop platform with Eclipse, but I'm facing some unusual problems :). I tried locally: installed cygwin, set up Hadoop 0.19.2 (and 0.18.0), started the necessary daemons and installed Eclipse 3.3.1. I uploaded a testing .txt file and then tried to run the WordCount example or even a simpl...

Pass a relation to a PIG UDF when using FOREACH on another relation?

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids to another file that contains 2 columns of mappings like this (column 1 is our data, column 2 is a third party's data): 35 6009 521 21599 225 51991 12 6129 We wrote a UD...

Hadoop in windows : file not found exception

Hi, I'm using Hadoop on Windows and I've configured everything correctly (installing cygwin, passwordless ssh, etc.). I've compiled the wordcount program into WC.jar and tried to run it. It runs perfectly in standalone mode, but in fully distributed mode it gives a FileNotFoundException. Please look into the logs and tell me what is wrong wit...

Hadoop Streaming Multiline Input

I'm using Dumbo for some Hadoop Streaming jobs. I have a bunch of JSON dictionaries, each containing an article (multiline text) and some metadata. I know Hadoop performs best when given large files, so I want to concat all the JSON dictionaries into a single file. The problem is that I don't know how to make Hadoop read each dictionar...
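One possible approach, sketched here in plain Python without Dumbo, is to write each dictionary on its own line. `json.dumps` escapes the newlines inside the article text, so every record fits on one physical line and the usual line-oriented streaming input applies. The field names `id` and `text`, and the helper names, are hypothetical.

```python
import json


def to_line(record):
    """Serialize one dictionary to a single line (embedded newlines become \\n escapes)."""
    return json.dumps(record)


def mapper(lines):
    """Parse one JSON dictionary per input line and emit (id, number of text lines)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield record["id"], len(record["text"].splitlines())


if __name__ == "__main__":
    article = {"id": "a1", "text": "first line\nsecond line"}
    encoded = to_line(article)
    print(encoded)
    print(list(mapper([encoded + "\n"])))
```

The trade-off is a one-time conversion pass over the existing files, in exchange for not needing a custom input format.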

How to reference subclasses of static Java classes with generics in Scala

I have this Java code: public class TestMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> { public TestMapper() { } // [... other overriden methods ...] @Override public void setup(Context context) { log.warning("Doing per-worker setup"); } } ...which I've converted to: class TestMa...

File Processing with Elastic MapReduce - No Reducer Step?

I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job. I have tried using NONE as my reducer,...

Hadoop job fails when invoked by cron

I have created the following shell script for invoking a hadoop job: #!/bin/bash /opt/hadoop/bin/hadoop jar /path/to/job.jar com.do.something <param-1> ... <param-n> & wait %1 STATUS=$? if [ $STATUS -eq 0 ] then echo "SUCCESS" | mailx -s "Status: "$STATUS -r "[email protected]" "[email protected]" exit $STATUS else echo "FAI...

Hadoop Map Reduce: Algorithms

Can someone point me to a good web site with a good collection of Hadoop algorithms? For example, the most complex thing that I can do with Hadoop right now is PageRank. Other than that, I can do trivial things like word counting and stuff. I want to see a web site that shows me other uses of Hadoop. Thanks! ...

Twitter (Social networking) Dataset

I am looking for a dataset from Twitter or another social networking site for my project. I currently have the CAW 2.0 Twitter dataset, but it only contains users' tweets. I want data that shows the number of friends, followers and such. It does not have to be Twitter, but I would prefer Twitter or Facebook. I already tried infochimps but app...

hadoop on vmware, namenode not finding slaves

I set up 3 identical Linux (CentOS) servers on VMware. Basically, I built one and made 2 full clones. I edited each server's hostname (server1, server2, server3) and added them to each other's hosts. Worked with ssh and enabled passwordless ssh. server1 # ssh server2 server2 # So this works. Formatted the dfs on the namenode. started the d...