Hello,
I want to experiment with Hadoop and related projects such as Pig and HBase, and I'm trying to think of an interesting (and, most importantly, useful) Nagios-related project to start with.
I have access to ~200 hosting servers running Apache/MySQL/PHP/Nagios (so I have lots of logs/data that I cou...
This is the weirdest thing ever. So I can see these files and cat them:
[jchen@host hadoop-0.20.2]$ bin/hadoop fs -ls /users/jchen/
Found 3 items
-rw-r--r-- 1 jchen supergroup 26553445 2010-07-14 21:10 /users/jchen/20100714T192827^AS17.data
-rw-r--r-- 1 jchen supergroup 461957962 2010-07-14 21:10 /users/j...
Hi everyone,
I'm writing a simple program for enumerating triangles in directed graphs for my project. First, for each input arc (e.g. a b, b c, c a; a tab serves as the delimiter), I want my map function to output the following pairs ([a, to_b], [b, from_a], [a_b, -1]):
public void map(LongWritable key, Text value,
...
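For illustration, the map step described above can be sketched in plain Python. This is not the poster's Java mapper: the function name `map_arc` is hypothetical, and the value -1 is emitted as text on the assumption that every output is a tab-separated string.

```python
def map_arc(line):
    """Return the three (key, value) pairs for one input arc 'a<TAB>b'."""
    src, dst = line.rstrip("\n").split("\t")
    return [
        (src, "to_" + dst),        # [a, to_b]: a has an outgoing arc to b
        (dst, "from_" + src),      # [b, from_a]: b has an incoming arc from a
        (src + "_" + dst, "-1"),   # [a_b, -1]: marker that the arc a -> b exists
    ]


if __name__ == "__main__":
    # The three example arcs from the question form one triangle.
    for arc in ["a\tb", "b\tc", "c\ta"]:
        for key, value in map_arc(arc):
            print(key + "\t" + value)
```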
Hi,
I am trying to build my project using Eclipse on Windows and execute it on a Linux cluster. The project depends on some external jars, which I enclosed using Eclipse's "Export->Runnable JAR -> Package required library into jar" build option. I checked that the jar contains the classes within a folder structure, and the external jars are in ...
I have, essentially, a series of reduce jobs that I am running on a lot of data using Hadoop Streaming. I am not really using my Mappers for anything, so I am just using Identity Mappers, but I do need the default partitioner Hadoop gives me to group my data in a different manner for each step of my MR job. I don't know enough the system we...
Hi,
I'm trying to practice some data mining algorithms on Hadoop. Can I do it with HDFS alone, or do I need to use sub-projects like Hive/HBase/Pig?
Thanks,
ram.
...
In Hadoop you can use the secondary-sort mechanism to sort the values before they are sent to the reducer.
The way this is done in Hadoop is that you add the value to sort by to the key and then have some custom group and key compare methods that hook into the sorting system.
So you'll need to have a key that consists essentially of bo...
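As a rough sketch of the idea described above, the following plain-Python simulation (no Hadoop involved) builds a composite key, sorts on the whole key, and then groups on the natural key only. The function name `secondary_sort` and the sample records are hypothetical; in a real job the two comparison steps would be custom comparator classes.

```python
from itertools import groupby


def secondary_sort(records):
    """records: iterable of (natural_key, value) pairs.

    Returns a list of (natural_key, values) where values arrive sorted,
    mimicking what a reducer would see after a secondary sort.
    """
    # Step 1: move the value into the key, giving a composite key.
    composite = [((key, value), value) for key, value in records]
    # Step 2: the "key comparator" orders by the full composite key.
    composite.sort(key=lambda pair: pair[0])
    # Step 3: the "grouping comparator" looks only at the natural key,
    # so all values for one natural key land in a single reduce call.
    return [
        (natural_key, [value for _, value in group])
        for natural_key, group in groupby(composite, key=lambda pair: pair[0][0])
    ]


if __name__ == "__main__":
    for key, values in secondary_sort([("b", 3), ("a", 2), ("b", 1), ("a", 1)]):
        print(key, values)
```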
Using Apache Pig and the text
hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!
I'm trying to match "my brother just didnt do anything wrong."
Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.
Looking at the Pig docs, and then ...
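One possible pattern for the match described above, sketched in Python rather than Pig. The assumption here is that "punctuation" means one of `.`, `!` or `?`; the same pattern text should be close to what a Java-style regex would need, but that has not been checked against Pig itself.

```python
import re

# "my brother just", then anything that is not sentence punctuation or a
# newline, then either one punctuation mark or the end of the line.
PATTERN = re.compile(r"my brother just[^.!?\n]*(?:[.!?]|$)", re.MULTILINE)


def find_sentence(text):
    """Return the first matching sentence, or None if there is no match."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    sample = "hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!"
    print(find_sentence(sample))
```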
Basically, I need to route data to the right Reducer. Each Reducer is going to be a TableReducer.
I have the following file
vendor1, user1, xxxx=n
vendor1, user1, xxxx=n
vendor2, user2, xxxx=n
vendor2, user2, xxxx=n
I need to insert that into the following HBase tables
Table vendor1:
[user1] => {data:xxxx = n}
[user2] => {data:xx...
I'm a PHP developer. Let's just get that out of the way now. But Hadoop – and Mahout in particular – have piqued my interest. I'm ready to take the dive into Java in order to use them.
So, from people experienced enough to know, just how much Java will I need to be able to use these effectively? From what I've seen, programming mappers/re...
Hi,
I'm trying to implement the PageRank algorithm on Hadoop with Eclipse, but I'm facing some unusual problems :). I tried locally: installed Cygwin, set up Hadoop 0.19.2 (and 0.18.0), started the necessary daemons and installed Eclipse 3.3.1. I uploaded a test .txt file and then tried to run the WordCount example or even a simpl...
We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map each of those ids using another file that contains 2 columns of mappings like this (column 1 is our data, column 2 is a third party's data):
35 6009
521 21599
225 51991
12 6129
We wrote a UD...
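The lookup itself, independent of Pig, can be sketched in a few lines of Python. This is not the UDF mentioned above; `load_mapping` and `map_ids` are hypothetical names, and ids without a mapping are simply skipped, which is an assumption about the desired behaviour.

```python
def load_mapping(lines):
    """Build {our_id: third_party_id} from whitespace-separated lines."""
    mapping = {}
    for line in lines:
        ours, theirs = line.split()
        mapping[ours] = theirs
    return mapping


def map_ids(id_list, mapping):
    """Translate a space-separated id list, dropping ids with no mapping."""
    return [mapping[i] for i in id_list.split() if i in mapping]


if __name__ == "__main__":
    mapping = load_mapping(["35 6009", "521 21599", "225 51991", "12 6129"])
    print(map_ids("35 521 225", mapping))
```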
Hi. I'm using Hadoop on Windows and I've configured everything correctly (installing Cygwin, passwordless SSH, etc.).
I've compiled the wordcount program into WC.jar and tried to run it. It runs perfectly in standalone mode, but in fully distributed mode it gives a FileNotFoundException.
Please look into the logs and tell me what is wrong wit...
I'm using Dumbo for some Hadoop Streaming jobs. I have a bunch of JSON dictionaries, each containing an article (multiline text) and some metadata. I know Hadoop performs best when given large files, so I want to concatenate all the JSON dictionaries into a single file.
The problem is that I don't know how to make Hadoop read each dictionar...
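One common way to handle this, sketched here under assumptions, is to serialize each dictionary onto a single line so that a line-oriented reader sees one record per line. `json.dumps` escapes newlines inside strings, so multiline article text stays on one physical line. The field names `title` and `text` and the helper names are hypothetical.

```python
import json


def to_lines(dicts):
    """Serialize each dictionary as one JSON document per line."""
    return "\n".join(json.dumps(d) for d in dicts) + "\n"


def from_lines(blob):
    """Parse the one-document-per-line format back into dictionaries."""
    return [json.loads(line) for line in blob.split("\n") if line.strip()]


if __name__ == "__main__":
    articles = [
        {"title": "first", "text": "line one\nline two"},
        {"title": "second", "text": "single line"},
    ]
    blob = to_lines(articles)
    print(blob, end="")
    print(from_lines(blob) == articles)
```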
I have this Java code:
public class TestMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
public TestMapper() {
}
// [... other overridden methods ...]
@Override
public void setup(Context context) {
log.warning("Doing per-worker setup");
}
}
...which I've converted to:
class TestMa...
I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job.
I have tried using NONE as my reducer,...
I have created the following shell script for invoking a hadoop job:
#!/bin/bash
/opt/hadoop/bin/hadoop jar /path/to/job.jar com.do.something <param-1> ... <param-n> &
wait %1
STATUS=$?
if [ $STATUS -eq 0 ]
then
echo "SUCCESS" | mailx -s "Status: "$STATUS -r "[email protected]" "[email protected]"
exit $STATUS
else
echo "FAI...
Can someone point me to a good web site with a good collection of Hadoop algorithms? For example, the most complex thing that I can do with Hadoop right now is PageRank. Other than that, I can do trivial things like word counting.
I want to see a web site that shows me other uses of Hadoop.
Thanks!
...
I am looking for a dataset from Twitter or another social networking site for my project. I currently have the CAW 2.0 Twitter dataset, but it only contains users' tweets. I want data that shows the number of friends, followers and so on.
It does not have to be Twitter, but I would prefer Twitter or Facebook. I already tried Infochimps but app...
I set up 3 identical Linux (CentOS) servers on VMware. Basically, I built one and made 2 full clones.
I edited each server's hostname (server1, server2, server3) and added them to each other's hosts files. I set up SSH and enabled passwordless SSH.
server1 # ssh server2
server2 #
So this works.
Formatted the DFS on the namenode. Started the d...