I am looking to do some fairly processor-intensive brute-force string matching. I have run my prototype in a multi-threaded environment and compared the performance to an implementation using GridGain with a couple of nodes (also multithreaded).
The performance I observed was that my GridGain implementation performed slow...
I'm trying to get the Eclipse plugin for Hadoop development to work; I'm using Hadoop 0.18.3. I installed the old MapReduce plugin (http://www.alphaworks.ibm.com/tech/mapreducetools) on Eclipse v3.5.2 (M20100211-1343) by copying it to /Applications/eclipse/plugins and restarting Eclipse, but that didn't work. I figured it was because it w...
Hello all,
This question is about data storage systems such as CouchDB, HDFS and HBase, and specifically which one is the right fit.
I am looking at making a simple and customized Document Management System for my organization. Basically, we need the ability to store some Word Documents, PDFs and other similar files. I also want to store metada...
I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to Pig, so it can do all the processing.
It looks something like this:
public abstract class Foo extends EvalFunc<Tuple> {
    public Foo() {
        super();
    }

    public Tuple exec(Tuple input) throws IOException {
    ...
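For reference, a minimal, self-contained version of such a UDF might look like the sketch below; the class name LogLineParser and the tab-separated field layout are assumptions for illustration, not taken from the original code.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical concrete UDF: splits one log line into fields and returns them as a Tuple.
public class LogLineParser extends EvalFunc<Tuple> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String line = (String) input.get(0);
        // Assumed log format: tab-separated fields.
        String[] fields = line.split("\t");
        Tuple result = tupleFactory.newTuple(fields.length);
        for (int i = 0; i < fields.length; i++) {
            result.set(i, fields[i]);
        }
        return result;
    }
}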
I'm very new to MapReduce and I have completed the Hadoop WordCount example.
That example produces an unsorted file of word counts (key-value pairs). Is it possible to sort it by the number of word occurrences by chaining another MapReduce job onto the first one?
Thanks in Advance
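For illustration, a minimal sketch of such a chained second job: it reads the word<TAB>count output of WordCount, swaps key and value so the count becomes the sort key, and lets the shuffle do the sorting. The new org.apache.hadoop.mapreduce API and the class names (SortByCount, SwapMapper) are assumptions here.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortByCount {
    // Turns "word<TAB>count" lines into (count, word) pairs so the framework sorts by count.
    public static class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sort word counts");
        job.setJarByClass(SortByCount.class);
        job.setMapperClass(SwapMapper.class);
        // The default (identity) reducer just writes the sorted (count, word) pairs out.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // WordCount output directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This sorts ascending by count; for most-frequent-first you would plug in a descending IntWritable comparator via job.setSortComparatorClass(...).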
...
I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my...
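As a concrete starting point, per-group "basic stats" is the canonical first job; a rough sketch of the reduce side in plain Java MapReduce terms (rather than Pig), where the grouping key and the numeric value layout are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: per-group count, sum, mean, min and max of a numeric value.
public class BasicStatsReducer extends Reducer<Text, DoubleWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        double sum = 0.0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (DoubleWritable value : values) {
            double v = value.get();
            count++;
            sum += v;
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double mean = count == 0 ? 0.0 : sum / count;
        context.write(key, new Text(count + "\t" + sum + "\t" + mean + "\t" + min + "\t" + max));
    }
}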
I've been experimenting with Hive for some data mining activities and would like to make it easily available to less command-line-oriented colleagues.
Hive does now ship with a web interface (http://wiki.apache.org/hadoop/Hive/HiveWebInterface) but it's very basic at this stage.
My question is does a visually polished and fully featu...
I'm trying to combine multiple files in multiple input directories into a single file, for various odd reasons I won't go into. My initial try was to write a 'nul' mapper and reducer that just copied input to output, but that failed. My latest try is:
vcm_hadoop lester jar /vcm/home/apps/hadoop/contrib/streaming/hadoop-*-streaming.ja...
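For comparison with the streaming attempt, a rough sketch of the "nul" pass-through idea as a plain Java job: a trivial mapper emits each line unchanged, and a single reducer funnels everything into one output file. The class names and argument handling are placeholders, and note that the shuffle sorts lines by content rather than keeping the original order.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeFiles {
    // Pass-through mapper: drops the byte-offset key and emits each line as-is.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "merge files");
        job.setJarByClass(MergeFiles.class);
        job.setMapperClass(LineMapper.class);
        // Default (identity) reducer; one reducer means one output file.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        for (int i = 0; i < args.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(args[i]));  // each input directory
        }
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}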
My reducer class produces output with TextOutputFormat (the default OutputFormat given by Job). I'd like to consume this output after the MapReduce job completes in order to aggregate it. In addition, I'd like to write out the aggregated information with TextInputFormat so that the output from this process can be consumed by the nex...
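Since TextOutputFormat writes plain key<TAB>value lines into part-* files, the aggregation step can open those files with the FileSystem API and parse them directly. A minimal sketch, assuming numeric values and a sum-per-key aggregation (both assumptions, not from the question):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AggregateOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Map<String, Long> totals = new HashMap<String, Long>();

        // Read every part-* file produced by the reducers (TextOutputFormat: key<TAB>value).
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue;  // skip _SUCCESS, _logs, etc.
            }
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                long value = kv.length == 2 ? Long.parseLong(kv[1]) : 0L;
                Long prev = totals.get(kv[0]);
                totals.put(kv[0], (prev == null ? 0L : prev) + value);  // example aggregation: sum per key
            }
            reader.close();
        }

        // The aggregated map could then be written back out as key<TAB>value text
        // so a follow-up job can read it with TextInputFormat.
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}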
I want to develop a website that will allow analysts within the company to run Hadoop jobs (chosen from a set of defined jobs) and see their jobs' status/progress.
Is there an easy way to do this (get running jobs' statuses, etc.) via Ruby/Python?
How do you expose your Hadoop cluster to internal clients in your company?
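For what it's worth, the underlying hook such a web app would wrap is fairly small; in Java, the old-API JobClient exposes job statuses directly. A rough sketch (0.20-era mapred API assumed; the output formatting is illustrative only):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class ListJobs {
    public static void main(String[] args) throws Exception {
        // JobConf picks up the cluster address (mapred.job.tracker) from the classpath config.
        JobClient client = new JobClient(new JobConf());
        for (JobStatus status : client.getAllJobs()) {
            System.out.println(status.getJobID()
                    + "  state=" + status.getRunState()
                    + "  map=" + (int) (status.mapProgress() * 100) + "%"
                    + "  reduce=" + (int) (status.reduceProgress() * 100) + "%");
        }
    }
}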
...
I have a User model object with quite a few fields (properties, if you wish) in it, say "firstname", "lastname", "city" and "year-of-birth". Each user also gets a "unique id".
I want to be able to search by these fields. How do I do that properly? How do I do that at all?
My understanding (will work for pretty much any key-value storage -- first goes ...
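A common pattern in a key-value store is to keep, besides the primary id-to-record entry, one extra "index" entry per searchable field whose key is the field value and whose value is the set of matching ids. A toy in-memory sketch of that idea (all names here are made up, and a real store would persist these maps as its own key-value entries):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class UserIndex {
    // Primary "table": unique id -> user record (here just a field map).
    private final Map<String, Map<String, String>> usersById = new HashMap<String, Map<String, String>>();
    // Secondary indexes: field name -> (field value -> ids of matching users).
    private final Map<String, Map<String, Set<String>>> indexes = new HashMap<String, Map<String, Set<String>>>();

    public void put(String id, Map<String, String> user) {
        usersById.put(id, user);
        for (Map.Entry<String, String> field : user.entrySet()) {
            Map<String, Set<String>> byValue = indexes.get(field.getKey());
            if (byValue == null) {
                byValue = new HashMap<String, Set<String>>();
                indexes.put(field.getKey(), byValue);
            }
            Set<String> ids = byValue.get(field.getValue());
            if (ids == null) {
                ids = new HashSet<String>();
                byValue.put(field.getValue(), ids);
            }
            ids.add(id);
        }
    }

    // Look up users by a single field, e.g. byField("city", "Boston").
    public Set<String> byField(String field, String value) {
        Map<String, Set<String>> byValue = indexes.get(field);
        Set<String> ids = byValue == null ? null : byValue.get(value);
        return ids == null ? new HashSet<String>() : ids;
    }
}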
Hi,
I want to learn Hadoop. However, I don't have access to a cluster right now. Is it possible for me to learn it anyway, use it for writing programs, and learn it properly?
Would it be helpful to run multiple Linux VMs and then use them as boxes to run Hadoop? Or do you think that is more of a stretch, and running it on multiple hosts is the sam...
I have set up Hadoop on an openSUSE 11.2 VM using VirtualBox. I have made the prerequisite configs. I ran this example in standalone mode successfully.
But in pseudo-distributed mode I get the following error:
$./bin/hadoop fs -put conf input
10/04/13 15:56:25 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.Socke...
I'm running a Hadoop job (using Hive, actually) which is supposed to uniq lines in a lot of text files. More specifically, it chooses the most recently timestamped record for each key in the reduce step.
Does Hadoop guarantee that every record with the same key, output by the map step, will go to a single reducer, even if there are many r...
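For context, the reduce-side "keep the newest record" step described above could look roughly like this in plain MapReduce terms; the value layout ("timestamp<TAB>rest-of-record") is an assumption, not taken from the actual Hive job.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: for each key, keep only the value with the largest timestamp.
public class LatestRecordReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long bestTimestamp = Long.MIN_VALUE;
        String bestRecord = null;
        for (Text value : values) {
            // Assumed layout: "timestamp<TAB>rest-of-record".
            String[] parts = value.toString().split("\t", 2);
            long timestamp = Long.parseLong(parts[0]);
            if (timestamp > bestTimestamp) {
                bestTimestamp = timestamp;
                bestRecord = value.toString();
            }
        }
        if (bestRecord != null) {
            context.write(key, new Text(bestRecord));
        }
    }
}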
I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the Ext3 filesystem allows only 32000 subdirectories) which looks like the same problem in this question: http://stackoverflow.com/questions/2091287/error-in-hadoop-mapreduce
My question is: does anyone know how to configure Hadoop to roll the log...
Is it correct to say that parallel computation with iterative MapReduce is justified mainly when the training data is too large for non-parallel computation of the same logic?
I am aware that there is overhead for starting MapReduce jobs.
This can be critical for overall execution time when a large number of iterat...
I spent some time looking around, and all I could find is Jython. It's an option, but is there something that could be used in a more Pythonic (simpler) way?
...
I found that my map tasks are currently inefficient when parsing one particular set of files (total 2 TB). I'd like to change the block size of files in the Hadoop DFS (from 64 MB to 128 MB). I can't find in the documentation how to do it for only one set of files rather than the entire cluster; does anyone know the command that would change t...
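As a per-file illustration of the same idea, the HDFS client API lets you pass a block size when a file is written, so one data set can use a different block size than the cluster default. A minimal sketch (the 64 MB/128 MB figures are from the question; the path handling and payload are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long blockSize = 128L * 1024 * 1024;  // 128 MB instead of the 64 MB default
        short replication = fs.getDefaultReplication();
        int bufferSize = 4096;

        // create() takes an explicit block size, so it applies to this file only.
        FSDataOutputStream out =
            fs.create(new Path(args[0]), true, bufferSize, replication, blockSize);
        out.writeBytes("example payload\n");
        out.close();
    }
}

When re-copying existing files, the same effect is usually obtained by setting dfs.block.size on the client that performs the write, since the block size is fixed at write time.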
Do you know of any large datasets for experimenting with Hadoop that are free/low cost?
Any related pointers/links are appreciated.
Preference:
At least one GB of data.
Production web server log data.
A few that I have found so far:
http://dumps.wikimedia.org/enwiki/20100130/
http://wiki.freebase.com/wiki/Data_dumps
http://aws.amazon.co...
I have a MySQL database where I store the following: a BLOB (which contains a JSON object) and an ID (for this JSON object). The JSON object contains a lot of different information, say, "city:Los Angeles" and "state:California".
There are about 500k such records for now, but they are growing, and each JSON object is quite big.
My goal is to do ...