hadoop

Better to build or buy a compute grid platform?

I am looking to do some fairly processor-intensive brute-force processing for string matching. I have run my prototype in a multi-threaded environment and compared its performance to an implementation using GridGain with a couple of nodes (also multi-threaded). What I observed was that my GridGain implementation performed slow...

How do you cleanly uninstall the Eclipse MapReduce plugin?

I'm trying to get the Eclipse plugin for Hadoop development to work; I'm using Hadoop 0.18.3. I installed the old MapReduce plugin (http://www.alphaworks.ibm.com/tech/mapreducetools) on Eclipse v3.5.2 (M20100211-1343) by copying it to /Applications/eclipse/plugins and restarting Eclipse, but that didn't work. I figured it was because it w...

CouchDB, HDFS, or HBase: which is right for my situation?

Hello all. This question is about data storage systems such as CouchDB, HDFS and HBase, and specifically about which is right for my situation. I am looking at building a simple, customized Document Management System for my organization. Basically, we need the ability to store some Word documents, PDFs and other similar files. I also want to store metada...

Does throwing an exception in an EvalFunc Pig UDF skip just that line, or stop completely?

I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to Pig, so it can do all the processing. It looks something like this: public abstract class Foo extends EvalFunc<Tuple> { public Foo() { super(); } public Tuple exec(Tuple input) throws IOException { ...
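
In the Pig of this era, the usual way to drop just the offending record is to return null from exec() rather than throw: an exception that escapes the UDF is generally treated as a task failure, not a per-line skip. A minimal sketch (the class name and parsing helper are invented):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical sketch: skip unparseable lines by returning null instead
    // of throwing, so only the bad record is dropped.
    public class SafeParse extends EvalFunc<Tuple> {
        @Override
        public Tuple exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null; // null result means "no output for this record"
            }
            try {
                String line = (String) input.get(0);
                return parse(line); // hypothetical parsing helper
            } catch (Exception e) {
                // Swallow the error for this one record; an exception that
                // propagates out of exec() can fail the whole task instead.
                return null;
            }
        }

        private Tuple parse(String line) {
            // ... real parsing logic would go here ...
            return null;
        }
    }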

Hadoop MapReduce

I'm very new to MapReduce, and I completed the Hadoop WordCount example. That example produces an unsorted file of (key, value) word counts. Is it possible to sort it according to the number of word occurrences by chaining another MapReduce job onto the earlier one? Thanks in advance ...
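
Yes, chaining a second job is the standard trick, as in Hadoop's own Grep example: swap (word, count) to (count, word) in the map and let the shuffle sort by count with a decreasing comparator. A sketch using the old API, assuming the first job wrote its counts as a SequenceFile of (Text, LongWritable), with made-up paths:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.hadoop.mapred.lib.InverseMapper;

    // Second job chained after word count: emit (count, word) and let the
    // sort phase order the records by count, descending.
    public class SortByCount {
        public static void main(String[] args) throws Exception {
            JobConf sortJob = new JobConf(SortByCount.class);
            sortJob.setJobName("sort-by-count");

            // Read the (word, count) output of the word count job.
            sortJob.setInputFormat(SequenceFileInputFormat.class);
            FileInputFormat.setInputPaths(sortJob, new Path("wordcount-out"));

            // InverseMapper swaps key and value: (word, count) -> (count, word).
            sortJob.setMapperClass(InverseMapper.class);
            sortJob.setReducerClass(IdentityReducer.class);
            sortJob.setNumReduceTasks(1); // one reducer => one sorted file

            sortJob.setOutputKeyClass(LongWritable.class);
            sortJob.setOutputValueClass(Text.class);
            // Sort descending so the most frequent words come first.
            sortJob.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);

            FileOutputFormat.setOutputPath(sortJob, new Path("wordcount-sorted"));
            JobClient.runJob(sortJob);
        }
    }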

Examples of simple stats calculation with Hadoop

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my...
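
For orientation, per-key summary statistics fit the reduce step naturally: each reducer sees all values for one key and can accumulate count, min, max and sum in a single pass. An illustrative old-API reducer (the types and output layout are assumptions):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Illustrative only: compute count, min, max and mean of the numeric
    // values seen for each key.
    public class StatsReducer extends MapReduceBase
            implements Reducer<Text, DoubleWritable, Text, Text> {

        public void reduce(Text key, Iterator<DoubleWritable> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            long count = 0;
            double sum = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            while (values.hasNext()) {
                double v = values.next().get();
                count++;
                sum += v;
                if (v < min) min = v;
                if (v > max) max = v;
            }
            double mean = sum / count;
            output.collect(key,
                new Text(count + "\t" + min + "\t" + max + "\t" + mean));
        }
    }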

Hadoop Hive web interface options

I've been experimenting with Hive for some data-mining activities and would like to make it easily available to less command-line-oriented colleagues. Hive does now ship with a web interface (http://wiki.apache.org/hadoop/Hive/HiveWebInterface), but it's very basic at this stage. My question is: does a visually polished and fully featu...

How do I concatenate a lot of files into one inside Hadoop, with no mapping or reduction?

I'm trying to combine multiple files in multiple input directories into a single file, for various odd reasons I won't go into. My initial try was to write a 'null' mapper and reducer that just copied input to output, but that failed. My latest try is: vcm_hadoop lester jar /vcm/home/apps/hadoop/contrib/streaming/hadoop-*-streaming.ja...
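
If no MapReduce pass is actually needed, the HDFS client API can do the concatenation directly; `hadoop fs -getmerge` is the command-line counterpart (it merges to the local filesystem). A sketch using FileUtil.copyMerge from the Hadoop of this era, with invented paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Sketch: merge every file under an input directory into one output
    // file using the HDFS client API instead of a MapReduce job.
    public class Concat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // copyMerge concatenates all files in the source directory into
            // one destination file; the last argument is an optional string
            // appended after each file (null for none).
            FileUtil.copyMerge(fs, new Path("/input/dir"),
                               fs, new Path("/output/merged.txt"),
                               false /* don't delete sources */, conf, null);
        }
    }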

Hadoop 0.2: How to read outputs from TextOutputFormat?

My reducer class produces its output with TextOutputFormat (the default OutputFormat given by Job). I would like to consume this output after the MapReduce job completes, in order to aggregate it. In addition, I would like to write out the aggregated information with TextInputFormat so that the output from this process can be consumed by the nex...
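
Since TextOutputFormat writes plain text, one key and value per line separated by a tab, the aggregation step can simply open the part-* files through the FileSystem API and split each line. (TextInputFormat is the reading side; to feed the next job you would just write text lines again.) A sketch with an assumed output directory name:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: read the part-* files that TextOutputFormat produced.
    // Each line is "key<TAB>value".
    public class ReadOutputs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("job-output"))) {
                if (!status.getPath().getName().startsWith("part-")) continue;
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())));
                String line;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t", 2); // key and value
                    // ... aggregate kv[0] / kv[1] here ...
                }
                in.close();
            }
        }
    }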

Tracking Hadoop job status via web interface? (Exposing Hadoop to internal clients in the company)

I want to develop a website that will allow analysts within the company to run Hadoop jobs (chosen from a set of defined jobs) and see their job's status/progress. Is there an easy way to do this (get running job statuses etc.) via Ruby/Python? How do you expose your Hadoop cluster to internal clients in your company? ...
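
The JobTracker already serves a status page (http://jobtracker-host:50030 by default) that you can link to or scrape. For something programmatic, the old-API JobClient can list jobs and their progress; a small service wrapping a loop like the sketch below could expose the data as JSON to Ruby/Python clients. This assumes the client's configuration points at the cluster's JobTracker:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobStatus;

    // Sketch: poll the JobTracker for the status of every known job.
    public class JobMonitor {
        public static void main(String[] args) throws Exception {
            // Assumes mapred.job.tracker is set in the loaded configuration.
            JobClient client = new JobClient(new JobConf());
            for (JobStatus status : client.getAllJobs()) {
                System.out.printf("%s state=%d map=%.0f%% reduce=%.0f%%%n",
                        status.getJobId(), status.getRunState(),
                        status.mapProgress() * 100,
                        status.reduceProgress() * 100);
            }
        }
    }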

Searches (and general querying) with HBase and/or Cassandra (best practices?)

I have a User model object with quite a few fields (properties, if you wish) in it. Say "firstname", "lastname", "city" and "year-of-birth". Each user also gets a "unique id". I want to be able to search by them. How do I do that properly? How do I do that at all? My understanding (which will work for pretty much any key-value storage) -- first goes ...
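
Neither HBase nor Cassandra of this vintage gives you secondary indexes for free, so a common practice is to maintain your own index table whose row keys lead with the indexed value; a prefix scan then answers "all users in city X". A sketch against the 0.20-era HBase client API (the table, family and key layout are all invented):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch of a hand-rolled secondary index: besides the main "users"
    // row, write an index row keyed "<field>:<value>:<userId>" so a prefix
    // scan over the index table finds all matching users.
    public class IndexedUserStore {
        public void saveUser(String userId, String city) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            HTable users = new HTable(conf, "users");
            HTable index = new HTable(conf, "users_index");

            Put user = new Put(Bytes.toBytes(userId));
            user.add(Bytes.toBytes("info"), Bytes.toBytes("city"),
                     Bytes.toBytes(city));
            users.put(user);

            // Index row: value-first key, so users in one city sort together.
            Put idx = new Put(Bytes.toBytes("city:" + city + ":" + userId));
            idx.add(Bytes.toBytes("ref"), Bytes.toBytes("id"),
                    Bytes.toBytes(userId));
            index.put(idx);
        }
    }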

How to learn Hadoop

Hi, I want to learn Hadoop; however, I don't have access to a cluster right now. Is it possible for me to learn it properly and use it for writing programs? Would it be helpful to run multiple Linux VMs and then use them as boxes to run Hadoop? Or do you think that is more of a stretch, and that running it on multiple hosts is the sam...

Running a Hadoop example in pseudo-distributed mode on a VM

I have set up Hadoop on an OpenSuse 11.2 VM using VirtualBox. I have made the prerequisite configs. I ran this example in standalone mode successfully. But in pseudo-distributed mode I get the following error: $./bin/hadoop fs -put conf input 10/04/13 15:56:25 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.Socke...

Using Hadoop, are my reducers guaranteed to get all the records with the same key?

I'm running a Hadoop job (using Hive, actually) which is supposed to uniq lines in a lot of text files. More specifically, it chooses the most recently timestamped record for each key in the reduce step. Does Hadoop guarantee that every record with the same key, output by the map step, will go to a single reducer, even if there are many r...
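
Yes: which reducer a record goes to is decided solely by the partitioner, and the default HashPartitioner hashes only the key, so all records sharing a key reach the same reduce task no matter how many mappers emitted them. Its old-API implementation is essentially this:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // The default HashPartitioner routes a record by hashing only the key,
    // so every record with an equal key lands on the same reduce task.
    public class HashPartitioner<K, V> implements Partitioner<K, V> {
        public void configure(JobConf job) { }

        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the modulo result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }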

Configuring Hadoop logging to avoid too many log files

I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the Ext3 filesystem allows only 32000 subdirectories), which looks like the same problem as in this question: http://stackoverflow.com/questions/2091287/error-in-hadoop-mapreduce My question is: does anyone know how to configure Hadoop to roll the log...
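
One knob worth checking (the property name is from the Hadoop of this era; verify against your version's mapred-default.xml) is the retention period for task logs, set on the TaskTrackers. A sketch for mapred-site.xml:

    <!-- Sketch only: task logs under userlogs are deleted after this many
         hours, which caps how many subdirectories can accumulate. -->
    <property>
      <name>mapred.userlog.retain.hours</name>
      <value>12</value> <!-- the default is 24 -->
    </property>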

Hadoop: Iterative MapReduce Performance

Is it correct to say that parallel computation with iterative MapReduce is mainly justified when the training data size is too large for non-parallel computation of the same logic? I am aware that there is overhead for starting MapReduce jobs. This can be critical for the overall execution time when a large number of iterat...

Is there a good library for accessing HBase from Python?

I spent some time looking around, and all I could find is Jython. It's an option, but is there something that could be used in a more Pythonic (simpler) way? ...

Changing the block size of a dfs file in Hadoop

I found that my map tasks are currently inefficient when parsing one particular set of files (2 TB total). I'd like to change the block size of those files in the Hadoop DFS (from 64 MB to 128 MB). I can't find how to do this in the documentation for only one set of files rather than the entire cluster. Does anyone know the command that would change t...
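
Block size is a per-file attribute fixed when the file is written, so changing it means rewriting the data: either when loading, with something like `hadoop fs -D dfs.block.size=134217728 -put ...`, or programmatically through the FileSystem API. A sketch that re-copies one file with a 128 MB block size (paths invented):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Sketch: rewrite one file with a 128 MB block size. The block size is
    // fixed at write time, so the bytes must be copied to a new file.
    public class Reblock {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path src = new Path("/data/input/part-00000");
            Path dst = new Path("/data/input-128mb/part-00000");

            FSDataInputStream in = fs.open(src);
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(dst, true, 4096,
                    fs.getDefaultReplication(), 128L * 1024 * 1024);
            IOUtils.copyBytes(in, out, 4096, true); // closes both streams
        }
    }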

Free Large datasets to experiment with Hadoop

Do you know of any large datasets for experimenting with Hadoop that are free/low cost? Any related pointers/links are appreciated. Preference: at least one GB of data; production log data of a web server. A few that I have found so far: http://dumps.wikimedia.org/enwiki/20100130/ http://wiki.freebase.com/wiki/Data_dumps http://aws.amazon.co...

Hadoop Map/Reduce - simple use example to do the following...

I have a MySQL database where I store the following: a BLOB (which contains a JSON object) and an ID (for this JSON object). The JSON object contains a lot of different information. Say, "city:Los Angeles" and "state:California". There are about 500k such records for now, but they are growing. And each JSON object is quite big. My goal is to do ...
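
Whatever the processing turns out to be, one common way to pull rows like these into map tasks is DBInputFormat (in Hadoop from 0.19 onward), which wraps a JDBC query and hands each row to a mapper. A sketch with invented table, column and connection details:

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    // Sketch: feed MySQL rows (id, JSON text) straight into map tasks.
    // A real record class would typically also implement Writable.
    public class JsonRecord implements DBWritable {
        long id;
        String json;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            json = rs.getString("data"); // hypothetical column holding JSON
        }

        public void write(PreparedStatement st) throws SQLException {
            st.setLong(1, id);
            st.setString(2, json);
        }

        public static void configure(JobConf job) {
            DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://dbhost/mydb", "user", "password");
            // setInput also registers DBInputFormat as the job's input format.
            DBInputFormat.setInput(job, JsonRecord.class,
                    "records", null /* conditions */, "id", "id", "data");
        }
    }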