hadoop

Hadoop searching words from one file in another file

Hi, I want to build a Hadoop application that reads words from one file and searches for them in another file. If the word exists, it has to be written to one output file. If the word doesn't exist, it has to be written to another output file. I tried a few examples in Hadoop. I have two questions. The two files are approximately 200MB each. Checking ever...

Efficient search in a corpus

I have a few million words that I want to search for in a billion-word corpus. What would be an efficient way to do this? I am thinking of a trie, but is there an open source implementation of a trie available? Thank you -- Updated -- Let me add a few more details about what exactly is required. We have a system where we crawled news...

I'm using Hadoop for data processing with python, what file format should be used?

I have a project with a substantial amount of text pages. Each text file has some header information that I need to preserve during the processing; however, I don't want the headers to interfere with the clustering algorithms. I'm using python on Hadoop (...

New to Hadoop and dumbo, how to correctly sequence these operations?

Consider the following log file format:

id  v1  v2  v3
1   15  30  25
2   10  10  20
3   50  30  30

We are to calculate the average value frequency (AVF) for each data row on a Hadoop cluster using dumbo. AVF for a data point with m attributes is defined as: avf ...

Distributed Computing applications

MapReduce/Hadoop is one of the frameworks/programs used for distributed systems. What are some other popular frameworks/programs? Thanks. ...

Is there anything like Hadoop in C++?

What is the closest thing to Hadoop, but in C++? In particular, I want to do distributed computing using MapReduce. Thanks! ...

Using set/list data types for intermediate keys in Hadoop

In an Apache Hadoop map-reduce program, what are the options for using sets/lists as keys in the output from the mapper? My initial idea was to use ArrayWritable as the key type, but that is not allowed, as the class does not implement WritableComparable. Do I need to define a custom class, or is there some other set-like class in the Hadoo...

Generating Multiple Output files with Hadoop 0.20+

I am trying to output the results of my reducer to multiple files. The data results are all contained in one file, and the rest of the results are split based on a category in their respective files. I know with 0.18 that you can do this with MultipleOutputs and it has not been removed. However, I am trying to make my application 0.20+ co...

Any tested Frameworks/Solutions similar to Apache Hadoop?

Hello, I am interested in the Apache Hadoop project, but I would like to know if any other tested (please mind the 'tested') projects/frameworks are out there. I'd appreciate any information/links to projects similar to Apache Hadoop and any comments on the Apache Hadoop project from anyone who has used it. Regards, ...

Hadoop for super O(N) or O(N log N) algorithms?

Is there any documented case of Hadoop working for any algorithm that's more than approximately linear? Or do huge data sets pretty much mean that anything above linear is unacceptable? I'm trying to find algorithms that run on Hadoop that do more complicated things than just sorting/aggregating. Thanks! ...

Finding matching lines with Hadoop/MapReduce

I am playing around with Hadoop and have set up a two-node cluster on Ubuntu. The WordCount example runs just fine. Now I'd like to write my own MapReduce program to analyze some log data (main reason: it looks simple and I have plenty of data). Each line in the log has this format: <UUID> <Event> <Timestamp> where event can be INIT,...

Hadoop: Disadvantages of using just 2 machines?

I want to do log parsing of huge amounts of data and gather analytic information. However, all the data comes from external sources and I have only 2 machines to store it, one of them as backup/replication. I'm trying to use Hadoop, Lucene... to accomplish that. But all the training docs mention that Hadoop is useful for distributed processing...

Is this architecture possible in Hadoop MR?

Is the following architecture possible in Hadoop MapReduce? A distributed key-value store is used (HBase). So along with the values, there would be a timestamp associated with each value. Map and Reduce tasks are executed iteratively. Map, in each iteration, should take in values which were added to the store in the previous iteration (perhaps...

Inject and index a single url with Nutch

Hello, I want to inject a single URL into the crawldb as a string, not a urlDir. I'm thinking of adding a modified version of the Injector.inject method that takes the URL as a string parameter, but I can't inject the string URL into the crawldb; I guess the current injector uses the fileInput.. from Hadoop. How can I do this? And I test to crawl url...

Synchronizing data between Hadoop and PostgreSQL using SymmetricDS

Hi, I'm using Hadoop to store the data of our application. Can someone suggest how to synchronize data between PostgreSQL and Hadoop? I'm using SymmetricDS as the replication tool. Thanks ...

Running multiple hadoop instances on same machine

I wish to run a second instance of Hadoop on a machine which already has an instance of Hadoop running. After untarring the Hadoop distribution, some config files in the hadoop-version/conf directory need to be changed. The Linux user will be the same for both instances. I have identified the following attributes, but I am not sure if this is go...

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions. All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once and pass the mapped data to multiple reduce functions. Can I do this with Hadoop? I've se...

Computational Linguistics project idea using Hadoop MapReduce

I need to do a project for a Computational Linguistics course. Is there any interesting "linguistic" problem which is data intensive enough to work on using Hadoop MapReduce? The solution or algorithm should analyse the data and provide some insight into the "linguistic" domain. However, it should be applicable to large datasets so that I can use hadoo...

What should hadoop.tmp.dir be?

Hadoop has a configuration parameter, hadoop.tmp.dir, which, as per the documentation, is "A base for other temporary directories." I presume this path refers to the local file system. I set this value to /mnt/hadoop-tmp/hadoop-${user.name}. After formatting the namenode and starting all services, I see exactly the same path created on HDFS. Doesn th...

How to ensure MapReduce tasks are independent from each other?

I'm curious: how do MapReduce, Hadoop, etc. break a chunk of data into independently operated tasks? I'm having a hard time imagining how that can be, considering it is common to have data that is quite interrelated, with state conditions between tasks, etc. Thanks. ...