hadoop

Where does HDFS store files locally by default?

Hi, I am running Hadoop with the default configuration on a one-node cluster, and would like to find where HDFS stores files locally. Any ideas? Thanks. ...

Processing live feed of logs from web server using Hadoop

I want to process the logs from my web server as they come in, using Hadoop (Amazon Elastic MapReduce). I googled for help but found nothing useful. I would like to know whether this can be done or whether there is an alternative way to do it. ...

Project Idea with Hadoop MapReduce

Hello, I learnt Hadoop a few months back and managed to do a very introductory programming project on it. I want to do a small to medium-sized project or a series of small programming assignments with Hadoop. I have seen a lot of ideas around, but I don't see anything that can be finished in about 60-70 hours of work, so a pretty small scale pr...

Matching large datasets using Hadoop?

I would love to get a sense of whether Hadoop is the right tool for the problem I have. I'm building an offline process (run once a month or once a quarter) that matches 2 data sets: A and B. Dataset A is stored in Oracle; dataset B is an XML file. Dataset A is about 20M records; dataset B is 6M records. Each record represents a musical song and ha...

Splitting large XML files into manageable sections for Hadoop

Is there an input class to deal with [multiple] large XML files based on their tree structure in Hadoop? I have a set of XML files that are of the same schema, but I need to split them into sections of data, as opposed to breaking the sections up. For example the XML file would be: <root> <parent> data </parent> <parent> more data<...
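To illustrate what "splitting into sections without breaking a section up" means, here is a minimal standalone sketch in Python. It is not a Hadoop input class; it only shows the record-boundary idea using the standard library's streaming parser. The function name `iter_records`, the `parent` tag default, and the sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="parent"):
    """Yield each complete <tag> element as a string, one record at a time.

    `source` is a file name or a binary file object. The file is parsed as a
    stream, so the whole tree does not have to be held in memory at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # The element is complete at its "end" event, so it is never split.
            yield ET.tostring(elem, encoding="unicode").strip()
            # Drop the element's children and text to keep memory use low.
            elem.clear()


# Hypothetical sample input, made up for this sketch.
sample = io.BytesIO(b"<root><parent>data</parent><parent>more data</parent></root>")
for record in iter_records(sample):
    print(record)
```

Each yielded string is a whole section, so it could be written out one per line and handed to a job as line-oriented input. The sketch assumes that `parent` elements are not nested inside one another.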

Hadoop dfs -ls returns list of files in my hadoop/ dir

I've set up a single-node Hadoop configuration running via cygwin under Win7. After starting Hadoop by bin/start-all.sh I run bin/hadoop dfs -ls which returns a list of files in my hadoop directory. Then I run bin/hadoop datanode -format and bin/hadoop namenode -format but -ls still returns the contents of my hadoop directory. As far as I...

Hadoop - job statistics

Hi, I used Hadoop to run map-reduce applications on our cluster. The jobs take around 10 hours to complete daily. I want to know the time taken for each job, the time taken by the longest job, etc., so that I can optimize those jobs. Is there any plugin or script that does this? Thank you Bala ...

Concepts and tools required to scale up algorithms

Hi, I'd like to begin thinking about how I can scale up the algorithms that I write for data analysis so that they can be applied to arbitrarily large sets of data. I wonder what are the relevant concepts (threads, concurrency, immutable data structures, recursion) and tools (Hadoop/MapReduce, Terracotta, and Eucalyptus) to make this happe...

Free data warehouse - Infobright, Hadoop/Hive or what?

I need to store a large number of small data objects (millions of rows per month). Once they're saved they won't change. I need to: store them securely; use them for analysis (mostly time-oriented); retrieve some raw data occasionally. It would be nice if it could be used with JasperReports or BIRT. My first shot was Infobright Community - ...

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader: REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar; DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); log = LOAD '/data/logs' USING SequenceFileLoader AS (...) Is there also a library out there that w...

ClassNotFoundException error in implementing Bayesian algorithm in Apache Mahout on Hadoop

Hi, I have a problem executing the Bayesian algorithm in Mahout. I built it with Maven, and the job file is in the target directory. When I run it from the terminal using hadoop, I get a ClassNotFoundException. What should be done? $HADOOP_HOME/bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.bayes.mapre...

Number of connections to the host at the same time

How can I handle this? ...

Hadoop beginner's question

I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture but am finding it hard to judge whether it would fit our setup. The question isn't programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup. We use: Oracle for backend; Java (Struts2/Servl...

What is a data serialization system?

According to the Apache Avro project, "Avro is a serialization system". By saying data serialization system, does it mean that Avro is a product or an API? Also, I am not quite sure what a data serialization system is. For now, my understanding is that it is a protocol that defines how a data object is passed over the network. Can anyone he...

Hadoop streaming job: stuck

Hi, I am running a Hadoop streaming job. It got stuck for no apparent reason. I am not sure how to cancel the task so that Hadoop schedules another task for the same job. I tried killing the job, but it still doesn't work. Does anyone know how to do this? Thank you Bala ...

Can somebody give a high-level, simple explanation to a beginner about how Hadoop works?

I know how memcached works. How does Hadoop work? ...

Chaining multiple MapReduce jobs in Hadoop.

In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps, i.e. Map1, Reduce1, Map2, Reduce2, etc. So you have the output from the last reduce that is needed as the input for the next map. The intermediate data is something you (in general) do not want to keep once the pipel...
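The Map1, Reduce1, Map2, Reduce2 shape described above can be sketched as a small in-memory simulation in Python. This is not the Hadoop API; `run_step`, `run_pipeline`, and the word-count mappers and reducers are hypothetical names invented for illustration. The point it shows is that each step's output becomes the next step's input, and the intermediate data is dropped once the next step has consumed it.

```python
from collections import defaultdict


def run_step(records, mapper, reducer):
    """Run one map/reduce step in memory: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def run_pipeline(records, steps):
    """Feed each step's output into the next step.

    Rebinding `records` means the previous intermediate result is no longer
    referenced, which mirrors discarding intermediate data in a real pipeline.
    """
    for mapper, reducer in steps:
        records = run_step(records, mapper, reducer)
    return records


# Step 1: count words.
def map1(line):
    for word in line.split():
        yield word, 1


def reduce1(word, counts):
    yield word, sum(counts)


# Step 2: group words by how often they occurred.
def map2(pair):
    word, count = pair
    yield count, word


def reduce2(count, words):
    yield count, sorted(words)


result = run_pipeline(["a b a", "b c a"], [(map1, reduce1), (map2, reduce2)])
print(result)  # [(1, ['c']), (2, ['b']), (3, ['a'])]
```

On a real cluster the same shape would typically be a driver that submits one job per step, points each job's input at the previous job's output directory, and deletes that directory once the next job has finished.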

Hadoop application development, and PHP

For Hadoop application development, are PHP frameworks less popular? If so, why? Otherwise, please point me to literature/documentation/tutorials for a specific framework (stuff for Symfony would be awesome!) ...

How to pick random (small) data samples using Map/Reduce?

I want to write a map/reduce job to select a number of random samples from a large dataset based on a row level condition. I want to minimize the number of intermediate keys. Pseudocode:

    for each row
        if row matches condition
            put the row.id in the bucket if the bucket is not already large enough

Have you done something like th...
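Note that the pseudocode above keeps the first matching rows it sees, which is not a uniform random sample. One common alternative, swapped in here as an illustration and not taken from the question, is reservoir sampling: keep a bucket of size k and replace entries with decreasing probability as more matching rows arrive. A minimal sketch in Python, assuming rows are dicts with an `id` field; `reservoir_sample` and `matches` are hypothetical names.

```python
import random


def reservoir_sample(rows, k, matches, rng=None):
    """Return up to k row ids chosen uniformly from the rows that match.

    Uses reservoir sampling (Algorithm R): one pass, O(k) memory.
    `matches` is a caller-supplied predicate (an assumption of this sketch).
    """
    rng = rng or random.Random()
    bucket = []
    seen = 0  # number of matching rows seen so far
    for row in rows:
        if not matches(row):
            continue
        seen += 1
        if len(bucket) < k:
            # Bucket not full yet: always keep the row.
            bucket.append(row["id"])
        else:
            # Keep the new row with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                bucket[j] = row["id"]
    return bucket


# Made-up data: sample 10 ids from the even-numbered rows.
rows = ({"id": i, "even": i % 2 == 0} for i in range(1000))
print(reservoir_sample(rows, 10, lambda r: r["even"], random.Random(0)))
```

In a map/reduce setting, each mapper could hold one such bucket and emit at most k ids when it finishes, which keeps the number of intermediate keys small. Combining the per-mapper buckets into a single uniform sample needs extra care, since mappers may see different numbers of matching rows.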

OpenStreetMap and Hadoop

Hi, I need some ideas for a weekend project about Hadoop and OpenStreetMap. I have access to an AWS EC2 instance with an OpenStreetMap snapshot in my EBS volume. The OpenStreetMap data is in a PostgreSQL database. What kind of MapReduce function can be run on the OpenStreetMap data, assuming I can export it into XML format, and then place...