hadoop

How to convert a Hadoop Path object into a Java File object

Hi, is there a way to convert a valid, existing Hadoop Path object into a useful Java File object? Is there a nice way of doing this, or do I need to bludgeon the code into submission? The more obvious approaches don't work, and it seems like it would be a common bit of code: void func(Path p) { if (p.isAbsolute()) { File f = new ...
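One commonly suggested route is to go through the Path's URI and hand its path component to java.io.File (this assumes the Path points at a locally mounted file system; it will not make an HDFS-only file local). A stdlib-only sketch, with java.net.URI standing in for the Hadoop Path:

```java
import java.io.File;
import java.net.URI;

public class PathToFile {
    public static void main(String[] args) {
        // A Hadoop Path wraps a URI; Path.toUri().getPath() yields the
        // scheme-less path component, which File can consume directly.
        // Here plain java.net.URI plays the role of the Hadoop Path.
        URI uri = URI.create("file:///user/training/input/data.txt");
        File f = new File(uri.getPath());
        System.out.println(f.getPath()); // /user/training/input/data.txt
    }
}
```

With a real Hadoop Path the equivalent call would be `new File(p.toUri().getPath())`.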

How does Hadoop's RunJar method distribute class/jar files across nodes?

I'm trying to use JIT compilation in Clojure to generate mapper and reducer classes on the fly. However, these classes aren't being recognized by the JobClient (it's the usual ClassNotFoundException). If I AOT compile the Mapper, Reducer, and Tool, and run the job using RunJar, everything seems fine. After looking through the source, it s...

How do I use Elastic MapReduce to run an XSLT transformation on millions of small S3 xml files?

More specifically, is there a somewhat easy streaming solution? ...

Requiring external libraries in ruby streaming scripts for Amazon EMR

How do I require external libraries when running Amazon EMR streaming jobs written in Ruby? I've defined my mapper, and am getting this output in my logs: /mnt/var/lib/hadoop/mapred/taskTracker/jobcache/job_201008110139_0001/attempt_201008110139_0001_m_000000_0/work/./mapper_stage1.rb: line 1: require: command not found My ...

How to keep the sequence file created by map in hadoop

Hi, I am using Hadoop and working with a map task that creates files I want to keep. Currently I pass these files through the collector to the reduce task, and the reduce task then passes them on to its collector, which allows me to retain the files. My question is how do I reliably and efficiently keep the files created by...

Is it possible to pick specific machines to run a particular type of hadoop jobs?

As far as I understand, the Hadoop architecture considers all machines to be equal, with any task/job able to run on any machine in the cluster. Is there a way to change this model to tag certain machines as having certain capabilities, and then only pick machines that have the capabilities a job requires to run that job? ...

What does this Java Syntax mean?

In the code below, what does Iterator<V> and OutputCollector<K, V> mean? Is it a special data type? public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter) throws IOException { ...
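Those angle brackets are Java generics: Iterator&lt;V&gt; and OutputCollector&lt;K, V&gt; are ordinary types parameterized by type variables K and V, which the job fills in with concrete key/value classes. A stdlib-only sketch of the same shape (the Collector interface below is a simplified stand-in for Hadoop's OutputCollector, mirroring only its signature):

```java
import java.util.Arrays;
import java.util.Iterator;

public class GenericsDemo {
    // Simplified stand-in for Hadoop's OutputCollector<K, V>:
    // K and V are type parameters chosen by the code that uses it.
    interface Collector<K, V> {
        void collect(K key, V value);
    }

    // Iterator<V> with V = Integer: next() is typed, no casting needed.
    static int drain(Iterator<Integer> values, Collector<String, Integer> out) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        out.collect("sum", sum);
        return sum;
    }

    public static void main(String[] args) {
        drain(Arrays.asList(1, 2, 3).iterator(),
              (k, v) -> System.out.println(k + "\t" + v)); // prints: sum	6
    }
}
```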

Manipulating an iterator in MapReduce

I was trying to find the sum of given points using Hadoop, but my problem is getting all values for a given key in a single reducer. It is something like this: I have this reducer public static class Reduce extends MapReduceBase implements Reducer { public void reduce(Text key, Iterator<IntWritable> values, ...
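The usual pattern is to drain the iterator exactly once inside reduce(), accumulating as you go (the old-API iterator cannot be rewound). A minimal stdlib sketch, using Integer in place of IntWritable (an assumption about the job's value type):

```java
import java.util.Arrays;
import java.util.Iterator;

public class SumReducerSketch {
    // Mirrors the body of reduce(Text key, Iterator<IntWritable> values, ...):
    // the iterator can only be traversed once, so accumulate in a single pass.
    static int sumValues(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next(); // in Hadoop: values.next().get()
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumValues(Arrays.asList(4, 7, 9).iterator())); // 20
    }
}
```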

On demand slave generation in Hadoop cluster on EC2

Hi, I am planning to use Hadoop on EC2. Since we pay per instance, it is wasteful to keep more instances running than a job actually requires. In our application, many jobs execute concurrently, and we do not always know the slave requirement in advance. Is it possible to start the hadoop cluster with mini...

Where do I begin learning Lucene.NET Solr Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large scale search service that removes entries the end user doesn't have access to (i.e. a user can search for all documents with clearance level 3 or higher, but not clearance level 2 or 1). Where do I start learning, and which products should I consider? To be hon...

Can the Hadoop distributed cache addFileToClassPath .class files or is it limited to .jar files?

I've tried the following: DistributedCache.addFileToClassPath(new Path("something.jar"), config); DistributedCache.addFileToClassPath(new Path("something.class"),config); The first one works, the second doesn't. Does addFileToClassPath only work for jars? This seems weird because there's also an addArchiveToClassPath method. ...
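A workaround often suggested for this situation is to wrap the lone .class file in a jar and ship that instead. The jar-building half can be sketched with only the JDK (the entry name and bytes below are made up for illustration; in a real job you would write the jar out and then call addFileToClassPath on it):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

public class WrapClassInJar {
    // Packages a single .class file's bytes into an in-memory jar.
    // In practice you would write this jar to HDFS and then call
    // DistributedCache.addFileToClassPath(new Path("wrapped.jar"), conf).
    static byte[] makeJar(String entryName, byte[] classBytes) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bos)) {
            jar.putNextEntry(new JarEntry(entryName));
            jar.write(classBytes);
            jar.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] jarBytes = makeJar("Something.class", new byte[] {(byte) 0xCA, (byte) 0xFE});
        try (JarInputStream in = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            System.out.println(in.getNextJarEntry().getName()); // Something.class
        }
    }
}
```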

MultipleOutputFormat in hadoop

Hi, I'm a newbie in Hadoop. I'm trying out the WordCount program. Now, to try out multiple output files, I use MultipleOutputFormat. This link helped me in doing it: http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html In my driver class I had MultipleOutputs.addNamedOutput(conf, "...
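Conceptually, MultipleOutputs routes each record to a separate sink keyed by the named output registered in the driver. A plain-Java sketch of that routing idea (the names and records are invented for illustration; this mirrors the concept, not Hadoop's implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class NamedOutputsSketch {
    // One buffer per named output; MultipleOutputs keeps one RecordWriter
    // per name in much the same spirit.
    final Map<String, StringBuilder> outputs = new HashMap<>();

    // Analogous to MultipleOutputs.getCollector(name, reporter).collect(k, v):
    // look up (or create) the sink for the named output, then append.
    void collect(String name, String key, int value) {
        outputs.computeIfAbsent(name, n -> new StringBuilder())
               .append(key).append('\t').append(value).append('\n');
    }

    public static void main(String[] args) {
        NamedOutputsSketch mo = new NamedOutputsSketch();
        mo.collect("text", "hello", 3);
        mo.collect("seq", "world", 5);
        System.out.print(mo.outputs.get("text")); // hello	3
    }
}
```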

How to connect to Hadoop/Hive from .NET

I am working on a solution where I will have a Hadoop cluster with Hive running and I want to send jobs and hive queries from a .NET application to be processed and get notified when they are done. I can't find any solutions for interfacing with Hadoop other than directly from a Java app, is there an API I can access that I am just not f...

Spring-Batch for a massive nightly / hourly Hive / MySQL data processing

I'm looking into replacing a bunch of Python ETL scripts that perform a nightly / hourly data summary and statistics gathering on a massive amount of data. What I'd like to achieve is robustness: a failing job / step should be automatically restarted. In some cases I'd like to execute a recovery step instead. The framework must be ab...

Does it make sense to use Hadoop for import operations and Solr to provide a web interface?

I'm looking at the need to import a lot of data in realtime into a Lucene index. This will consist of files of various formats (Doc, Docx, Pdf, etc). The data will be imported as batches of compressed files, so they will need to be decompressed, each indexed as an individual file, and somehow related to the file batch as a whole. I'm ...

Hadoop: Do not re-schedule a failed reducer

Hello, This is how Hadoop currently works: If a reducer fails (throws a NullPointerException for example), Hadoop will reschedule another reducer to do the task of the reducer that failed. Is it possible to configure Hadoop to not reschedule failed reducers i.e. if any reducer fails, Hadoop merely reports failure and does nothing else....
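If the intent is for a single reducer failure to fail the job outright, one approach is to cap the attempts per reduce task at one (a sketch against the old mapred configuration; the property name assumes a 0.20-era Hadoop):

```xml
<!-- mapred-site.xml (or per-job configuration): allow only one attempt
     per reduce task, so a failed reducer fails the job instead of being
     re-scheduled on another node -->
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>1</value>
</property>
```

The same knob is exposed programmatically on the old API as JobConf.setMaxReduceAttempts(1).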

Cassandra or Hadoop Hive or MYSQL?

Hey, I am developing a web crawler. Which is good for storing the data: Cassandra, Hadoop Hive, or MySQL? And why? I have 1 TB of data from the past 6 months in my MySQL DB. I need to index it, and I need to get the output in my search ASAP. As I see it, it will store a much larger amount of data, like 10 petabytes, as my crawlers are working fast. I nee...

Hadoop Reducer::cleanup function: Is it called only after the reducer has successfully finished execution?

Can the Reducer::cleanup function get called if the reducer had to abort due to an error, such as an exception being thrown from the reduce(...) function? ...

hadoop inputFile as a BufferedImage

Hi everybody, sorry for my poor English; I hope you'll understand my problem. I have a question about Hadoop development. I have to train myself on a simple image processing project using Hadoop. All I want to do is rotate an image with Hadoop (of course I don't want Hadoop to use the whole image). I have a problem with the inputFormat....
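Whatever input format ends up delivering the image bytes to the mapper (a whole-file input format is one assumption here), the rotation itself is plain AWT. A minimal sketch of a 90-degree clockwise rotation on a BufferedImage:

```java
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;

public class RotateSketch {
    // Rotates an image 90 degrees clockwise. In a Hadoop job the mapper
    // would decode the incoming bytes (e.g. with ImageIO.read) into a
    // BufferedImage and then apply this transform.
    static BufferedImage rotate90(BufferedImage src) {
        AffineTransform tx = new AffineTransform();
        tx.translate(src.getHeight(), 0); // shift right by the rotated width
        tx.rotate(Math.toRadians(90));    // then rotate about the origin
        AffineTransformOp op =
                new AffineTransformOp(tx, AffineTransformOp.TYPE_NEAREST_NEIGHBOR);
        return op.filter(src, null);      // null dest: sized from the transform
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB);
        BufferedImage rotated = rotate90(img);
        System.out.println(rotated.getWidth() + "x" + rotated.getHeight()); // 2x4
    }
}
```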

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario. Pig version used: 0.70. Sample HDFS directory structure:
/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>
As you can see in the paths listed abo...
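Hadoop's glob syntax is accepted in Pig LOAD paths, so a small date range like the one above can often be expressed with brace expansion (a sketch using only the five example dates shown; the relation and storage function names are placeholders):

```
data = LOAD '/user/training/test/{20100810,20100811,20100812,20100813,20100814}'
       USING PigStorage();
```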