hadoop

How to convert dumbo sequence file input to tab-separated text

I have an input, which could be a single primitive or a list or tuple of primitives. I'd like to flatten it to just a list, like so: def flatten(values): return list(values) The normal case would be flatten(someiterablethatisn'tastring). But if values = '1234', I'd get ['1', '2', '3', '4'] when I'd want ['1234']. And if values = ...
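
A minimal sketch of the behaviour being asked for, assuming strings should be kept whole rather than iterated character by character:

def flatten(values):
    # Treat a lone primitive (including a string) as a single value;
    # only unpack real containers such as lists and tuples.
    if isinstance(values, (list, tuple)):
        return list(values)
    return [values]

# flatten('1234')    -> ['1234']
# flatten((1, 2, 3)) -> [1, 2, 3]
# flatten(5)         -> [5]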

Generating Separate Output files in Hadoop Streaming

Using only a mapper (a Python script) and no reducer, how can I write each line of output to a separate file named after its key, rather than producing a few long output files? ...

Is HBase meaningful if it's not running in a distributed environment?

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values rather than as foreign keys to other tables. I'm re-writing the software...

Is this a suitable (or possible) use of HBase?

I want to use HBase as a store where I can push in a few million entries of the format {document => {term => weight}}, e.g. "Insert term X into document Y with weight Z", and then issue a command like "Select the top 1000 terms for this document" or "Select the top 1000 terms for each document". This works in my current MySQL implementation...
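
As a rough, non-HBase illustration of the access pattern described, a plain-Python model of {document => {term => weight}} with a top-N query might look like this (the names and sample values are hypothetical):

import heapq

index = {}  # {document: {term: weight}}

def insert(document, term, weight):
    # "Insert term X into document Y with weight Z"
    index.setdefault(document, {})[term] = weight

def top_terms(document, n=1000):
    # "Select the top N terms for this document", highest weight first
    return heapq.nlargest(n, index.get(document, {}).items(), key=lambda kv: kv[1])

insert("docY", "termX", 0.8)
insert("docY", "termA", 0.2)
print(top_terms("docY", 2))  # [('termX', 0.8), ('termA', 0.2)]

In HBase this shape typically maps to one row per document with one column per term, but the excerpt is cut off before the author's actual schema is given.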

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table and then performi...

MapReduce, Python and NetworkX

I have implemented an unweighted random walk function for a graph that I built in Python using NetworkX. Below is a snippet of my program that deals with the random walk. Elsewhere in my program, I have a method that creates the graph, and I have a method that simulates various custom graph testing methods that I've written. One of these...
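
For context, an unweighted random walk over a NetworkX graph usually amounts to repeated uniform sampling of neighbours; a minimal sketch (not the author's code) could be:

import random
import networkx as nx

def random_walk(G, start, steps):
    # Take `steps` hops, choosing the next node uniformly among neighbours.
    path = [start]
    node = start
    for _ in range(steps):
        neighbors = list(G.neighbors(node))
        if not neighbors:
            break  # dead end, stop the walk
        node = random.choice(neighbors)
        path.append(node)
    return path

G = nx.path_graph(5)
print(random_walk(G, 0, 10))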

Building Apache Hive - impossible to resolve dependencies

I am trying out Apache Hive as per http://wiki.apache.org/hadoop/Hive/GettingStarted and am getting this error from Ivy: "Downloaded file size doesn't match expected Content Length for http://archive.apache.org/dist/hadoop/core/hadoop-0.19.0/hadoop-0.19.0.tar.gz. Please retry." This error repeats 4 times for 4 different versions of ...

Hadoop mapreduce streaming from HBase

I'm building a Hadoop (0.20.1) mapreduce job that uses HBase (0.20.1) as both the data source and data sink. I would like to write the job in Python which has required me to use hadoop-0.20.1-streaming.jar to stream data to and from my Python scripts. This works fine if the data source/sink are HDFS files. Does Hadoop support streaming...

How to parallelize execution on remote systems

What's a good method for assigning work to a set of remote machines? Consider an example where the task is very CPU and RAM intensive, but doesn't actually process a large dataset. The language of choice would be Java. I was thinking Hadoop would be a good option, but the dataset passed between remote machines is fairly small, and Had...

Looking for a good HBase tutorial

I'm looking for a good and tested HBase tutorial. Where can I find one? ...

Hadoop - determine if a file is being written to

Is there a way to determine if a file in Hadoop is being written to? E.g. I have a process that puts logs into HDFS, and another process that monitors for the existence of new logs in HDFS, but I'd like it to make sure a file has been completely uploaded into HDFS before processing it. Is something like this possible? ...
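
One common workaround (an assumption here, not a built-in HDFS feature) is for the uploader to write under a temporary name and rename on completion, so the monitor only picks up finished files. A sketch of the monitoring side, where the directory name and suffix are made up for illustration:

import subprocess

LOG_DIR = "/logs"     # hypothetical HDFS directory being watched
TMP_SUFFIX = ".tmp"   # suffix the uploader is assumed to use while still writing

def completed_logs():
    # List the directory with the hadoop CLI and keep only plain files
    # (lines starting with '-'), skipping anything still carrying the
    # temporary suffix, i.e. uploads that have not finished yet.
    out = subprocess.check_output(["hadoop", "fs", "-ls", LOG_DIR]).decode()
    files = [line.split()[-1] for line in out.splitlines() if line.startswith("-")]
    return [f for f in files if not f.endswith(TMP_SUFFIX)]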

Hadoop: intervals and JOIN

Hi all, I'm very new to Hadoop and I'm currently trying to join two sources of data where the key is an interval (say [date-begin/date-end]). For example:

input1:
20091001-20091002 A
20091011-20091104 B
20080111-20091103 C
(...)

input2:
20090902-20091003 D
20081015-20091204 E
20040011-20050101 F
(...)

I'd like to...
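
The excerpt is cut off, but the heart of any interval join is an overlap test between [begin, end] pairs; a small non-MapReduce sketch of that logic, assuming the field layout shown above and an overlap (rather than exact-match) join:

def parse(line):
    # "20091001-20091002 A" -> (("20091001", "20091002"), "A")
    interval, value = line.split()
    begin, end = interval.split("-")
    return (begin, end), value

def overlaps(a, b):
    # Two intervals overlap when each begins no later than the other ends.
    # YYYYMMDD strings compare correctly as plain strings.
    return a[0] <= b[1] and b[0] <= a[1]

input1 = ["20091001-20091002 A", "20091011-20091104 B"]
input2 = ["20090902-20091003 D", "20081015-20091204 E"]

for l1 in input1:
    for l2 in input2:
        (i1, v1), (i2, v2) = parse(l1), parse(l2)
        if overlaps(i1, i2):
            print(v1, v2)

In MapReduce this nested loop would typically be replaced by having both mappers emit a coarser bucket key (e.g. the month) so that potentially overlapping intervals meet in the same reducer.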

Hadoop MapReduce job on file containing HTML tags

I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and reducer in Python and used Hadoop streaming to run them. Here is my mapper:

#!/usr/bin/env python
import sys
import re
import string

def remove_html_tags(in_text):
    ''' Remove any HTM...
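
The excerpt truncates the mapper mid-docstring; purely as a sketch of the shape such a streaming mapper takes (not the author's actual script), it might look like:

#!/usr/bin/env python
import re
import sys

TAG_RE = re.compile(r'<[^>]+>')

def remove_html_tags(in_text):
    # Crude regex-based tag stripping, in the spirit of the original helper.
    return TAG_RE.sub(' ', in_text)

for line in sys.stdin:
    text = remove_html_tags(line).lower()
    for word in re.findall(r'[a-z]+', text):
        # Hadoop streaming expects tab-separated key/value pairs on stdout.
        print('%s\t1' % word)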

Reading Hadoop Writable objects

Hi, I have an application which stores Hadoop Writable objects in an Amazon S3 bucket. How do I now read the objects through a Java application? The problem I am facing is that the SequenceFileRecordReader is unable to read from the S3 bucket, whereas it can read the same Writable object from local disk. Any suggestions will be greatly app...

Writing single Hadoop map reduce output into multiple S3 objects

Hi, I am implementing a Hadoop MapReduce job that needs to create output in multiple S3 objects. Hadoop itself creates only a single output file (an S3 object), but I need to partition the output into multiple files. How do I achieve this? Any pointers will be much appreciated. Thanks ...

Should map() and reduce() return key/value pairs of the same type?

When writing a MapReduce job (specifically Hadoop, if relevant), one must define a map() and a reduce() function, both yielding a sequence of key/value pairs. The data types of the key and value are free to be defined by the application. In the canonical example of word counting, both functions yield pairs of type (string, int) with the k...
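
To make the word-count case concrete, a plain-Python sketch of the two functions and the pair types they emit (illustrative only; in a real Hadoop job the types are declared on the job configuration):

def map_fn(doc_id, text):
    # map() emits (str, int) pairs: one (word, 1) per occurrence.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce() here also emits a (str, int) pair, but nothing forces its
    # output types to match the map output; it could just as well yield
    # (word, some_float) or a different key type if the job declares it so.
    yield word, sum(counts)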

Which Hadoop product is more appropriate for a quick query on a large data set?

I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set). The queries will be performed against chip sequencing data. Each record is one line in a file. To be clear, below is a sample record from the data set. One line (record) looks like: 1-1-174-418 ...

How to set access control on the HDFS that results from installing the Hadoop plugin for Hudson

I installed the Hudson plugin that enables Hadoop. Now I find that I don't have access as myself to put any data in there. It's not at all obvious to me how Hudson has configured Hadoop. Can someone tell me how to change these permissions? ...

Establishing Eclipse project environment for HadoopDB

I have checked out a project from SourceForge named HadoopDB. It uses some classes from another project named Hive. I have used the Eclipse Java build path settings to link the source to the Hive project root folder, but classes in the HadoopDB project still show errors such as: The import org.**.**.classname can't be resolved. Should I link the Hive root...

Crawling engine architecture - Java/ Perl integration

Hi all, I am looking to develop a management and administration solution around our web-crawling Perl scripts. Basically, right now our scripts are saved in SVN and are manually kicked off by SysAdmins/devs etc. Every time we need to retrieve data from new sources we have to create a ticket with business instructions and goals. As you can...