hadoop

Parsing bulk text with Hadoop: best practices for generating keys.

Hello, I have a 'large' set of line-delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to them. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reduce phase is to collect these results into ...
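One common answer (a sketch, not the asker's code) is to emit a composite key of (sentence id, technique name), so the reducer receives every technique's output for one sentence together. A minimal streaming-style illustration in Python, with invented placeholder techniques:

```python
# Sketch: emit composite keys "sentence_id|technique" so a reducer can
# group every technique's output for one sentence. The technique
# functions below are hypothetical placeholders, not from the question.

def tokenize(sentence):          # placeholder NLP technique 1
    return sentence.split()

def char_count(sentence):        # placeholder NLP technique 2
    return len(sentence)

TECHNIQUES = {"tokens": tokenize, "chars": char_count}

def mapper(lines):
    """Yield (key, value) pairs; key = 'sentence_id|technique'."""
    for sent_id, sentence in enumerate(lines):
        for name, fn in TECHNIQUES.items():
            yield f"{sent_id}|{name}", fn(sentence)

def reducer(pairs):
    """Collect all technique results per sentence id."""
    out = {}
    for key, value in pairs:
        sent_id, name = key.split("|", 1)
        out.setdefault(int(sent_id), {})[name] = value
    return out

# sorted() stands in for the shuffle phase
results = reducer(sorted(mapper(["hello world", "foo bar baz"])))
```

In an actual job the sort/shuffle does the grouping; the point is only that the key carries both the sentence identity and the technique name.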

Difference between Pig and Hive? Why have both?

Hi, my background: four weeks old in the Hadoop world. I've dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM, and have read Google's papers on MapReduce and GFS. I understand that Pig's language, Pig Latin, is a shift from the SQL-like declarative style of programming (it suits the way programmers think), and Hive's query language closely ...

Retrieving information from aggregated weblogs data, how to do it?

Hello, I would like to know how to retrieve data from aggregated logs. This is what I have: about 30GB of uncompressed log data loaded into HDFS daily (and this will soon grow to about 100GB). This is my idea: each night this data is processed with Pig; logs are read, split, and a custom UDF retrieves data like timestamp, url, user_id...
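As a sketch of the UDF step only, here is what extracting those fields from one log line might look like. The log format and regex below are invented for illustration; they are not the asker's actual format:

```python
# Sketch: parse one (hypothetical) log line into timestamp, url, user_id.
# The combined-log-like format here is an assumption for illustration.
import re

LINE_RE = re.compile(
    r'(?P<user_id>\S+) \[(?P<timestamp>[^\]]+)\] "GET (?P<url>\S+)'
)

def parse_line(line):
    m = LINE_RE.search(line)
    if m is None:
        return None                      # skip malformed lines
    return m.group("timestamp"), m.group("url"), m.group("user_id")

rec = parse_line('u42 [2010-08-06T12:00:00] "GET /index.html HTTP/1.0"')
```

The same function body would sit inside a Pig `exec()` UDF; the downstream aggregation question (how to query the extracted fields) is separate.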

generating bigram combinations from grouped data in pig.

given my input data in userid,itemid format:

raw: {userid: bytearray,itemid: bytearray}

dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)

grpd = GROUP raw BY userid;

dump grpd;
(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})

I'd like to generate all of the combinations (order not important) of ...
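In Pig this is typically done with a UDF or a self-join on the grouped bag; the grouped-pairs logic itself can be sketched in plain Python (data copied from the question):

```python
# Sketch of GROUP BY userid followed by emitting every unordered pair
# of a user's items -- the combinations the question asks for.
from itertools import combinations

raw = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
       ("B", 2), ("B", 3), ("B", 5),
       ("C", 1), ("C", 5)]

grouped = {}
for user, item in raw:                        # GROUP raw BY userid
    grouped.setdefault(user, []).append(item)

pairs = {user: list(combinations(items, 2))   # order not important
         for user, items in grouped.items()}
```

A user with k items yields k*(k-1)/2 pairs, so A's bag of 4 items produces 6 bigrams.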

Including jars in hadoop streaming using groovy

I love hadoop streaming for its ability to quickly pump out quick and dirty one-off map reduce jobs. I also love groovy for making all my carefully coded java accessible to a scripting language. Now I'd like to put the two together: I'd like to take a jar with some of my java classes, and utilize these in groovy-based mappers and reducers...

What is the computational complexity of the MapReduce overhead

Given that the complexity of the map and reduce tasks is O(map) = f(n) and O(reduce) = g(n), has anybody taken the time to write down how the Map/Reduce intrinsic operations (sorting, shuffling, sending data, etc.) increase the computational complexity? What is the overhead of the Map/Reduce orchestration? I know that this is a nonsense wh...

Where should Map put temporary files when running under Hadoop

Hi, I am running Hadoop 0.20.1 under SLES 10 (SUSE). My Map task takes a file and generates a few more; I then generate my results from these files. I would like to know where I should place these files, so that performance is good and there are no collisions. If Hadoop can delete the directory automatically, that would be nice. Right...
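The usual answer is to write intermediates into the task's current working directory, which Hadoop places on local disk (under the mapred.local.dir locations) and discards when the task attempt ends. A small sketch of creating collision-free scratch files there, assuming that behavior:

```python
# Sketch: create intermediate files in the task's current working
# directory. Under Hadoop this directory lives on local disk and is
# cleaned up with the task attempt; mkstemp guarantees unique names,
# so concurrent tasks on one node do not collide.
import os
import tempfile

def make_scratch_file(prefix="intermediate-"):
    fd, path = tempfile.mkstemp(prefix=prefix, dir=os.getcwd())
    os.close(fd)                 # caller reopens/writes as needed
    return path

p = make_scratch_file()
```

The `prefix` name is arbitrary; the point is unique names inside a directory whose lifecycle Hadoop already manages.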

mapreduce distance calculation in hadoop

Hi, is there a distance calculation implementation using hadoop map/reduce? I am trying to calculate the distance between a given set of points. Looking for any resources. //edited: This is a very intelligent solution. I have tried something like the first algorithm, and I get almost what I was looking for. I am not concer...
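One map/reduce-shaped approach (a sketch, not the answer the asker refers to) is to have the map side emit each candidate pair of points and the reduce side compute the distance per pair. The point data and ids below are made up for illustration:

```python
# Sketch of a map/reduce-style pairwise Euclidean distance computation:
# the "mapper" emits each unordered pair once, the "reducer" computes
# the distance for each pair. Points are hypothetical sample data.
from itertools import combinations
from math import dist  # Python 3.8+

points = {"p1": (0.0, 0.0), "p2": (3.0, 4.0), "p3": (0.0, 4.0)}

def mapper(pts):
    for a, b in combinations(sorted(pts), 2):   # emit each pair once
        yield (a, b), (pts[a], pts[b])

def reducer(pairs):
    return {key: dist(pa, pb) for key, (pa, pb) in pairs}

distances = reducer(mapper(points))
```

At scale, the hard part is avoiding the full n^2 pair emission (e.g. by blocking/bucketing points), which is exactly where the cleverer algorithms the asker mentions come in.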

How to avoid OutOfMemoryException when running Hadoop?

Hi all! I'm running a Hadoop job over 1.5 TB of data, doing a lot of pattern matching. I have several machines with 16GB of RAM each, and I always get an OutOfMemoryException on this job with this data (I'm using Hive). I would like to know how to optimally set the HADOOP_HEAPSIZE option in the file hadoop-env.sh so that my job would not fail. Is it eve...

How can I group a large dataset

I have a simple text file containing two columns, both integers:

1 5
1 12
2 5
2 341
2 12

and so on. I need to group the dataset by the second value, such that the output will be:

5 1 2
12 1 2
341 2

Now the problem is that the file is very big, around 34 GB in size. I tried writing a python script to group them into a dictionary with val...
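The dictionary approach blows up because it holds every group in memory at once; the MapReduce answer is to sort by the grouping key first, after which rows with the same key arrive together and grouping needs almost no extra memory. A sketch of that reduce-side logic on the question's sample rows:

```python
# Sketch: key each row by its second column, sort (what MapReduce's
# shuffle does for you, externally, for data bigger than RAM), then
# group consecutive equal keys -- O(1) extra memory per group.
from itertools import groupby

rows = [(1, 5), (1, 12), (2, 5), (2, 341), (2, 12)]  # sample from the question

# "map": key by second column; "shuffle": sort by that key
keyed = sorted((second, first) for first, second in rows)

grouped = {key: [v for _, v in grp]
           for key, grp in groupby(keyed, key=lambda kv: kv[0])}
```

For 34 GB outside Hadoop, the same idea works with an external sort (e.g. Unix `sort -k2n`) followed by a single streaming pass.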

custom word count using hadoop

Hi all, I'm a beginner in hadoop. I've understood the WordCount program. Now I have a problem: I don't want counts for all the words. Given

Words_I_Want.txt: hello echo raj
Text.txt: hello everyone. I want hello and echo

the count output should be:

hello 2
echo 1
raj 0

Now that was an example; my actual data is very large. ...
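The core change to WordCount is a whitelist check, plus emitting 0 for wanted words that never occur. A sketch of that logic in Python, using the file contents from the question:

```python
# Sketch: count only the whitelisted words, reporting 0 for wanted
# words that never appear in the text.
from collections import Counter
import re

wanted = "hello echo raj".split()                  # Words_I_Want.txt
text = "hello everyone. I want hello and echo"     # Text.txt

counts = Counter(re.findall(r"\w+", text.lower()))
result = {w: counts.get(w, 0) for w in wanted}     # raj -> 0
```

In a real Hadoop job the whitelist file would be shipped to every mapper (e.g. via the distributed cache), the mapper would emit only whitelisted words, and the zero entries would be filled in after the reduce.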

Running a standalone Hadoop application on multiple CPU cores.

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multicore server will do fine for the coming year or so. We do not (yet) need a multiserver Hadoop cluster, yet we chose to start this project "being prepared". When I run...

Hadoop block size issues

I've been tasked with processing multiple terabytes' worth of SCM data for my company. I set up a hadoop cluster and have a script to pull data from our SCM servers. Since I'm processing data in batches through the streaming interface, I came across an issue with the block sizes that O'Reilly's Hadoop book doesn't seem to address: wha...

Why does DistributedCache mangle my file names?

Hi, I have a weird problem: DistributedCache appears to change the names of my files. It uses the original name as the parent folder and adds the file as a child, i.e. folder\filename.ext becomes folder\filename.ext\filename.ext. Any ideas? My code is below. Thanks, Akintayo. String paramsLocation="/user/fwang/settings/ecgparams.txt"; D...

What do you recommend for learning HBase?

What are recommended resources for learning HBase? The only ones I can think of are the HBase wiki and one chapter in the book "Hadoop: The Definitive Guide"; are there any other good resources? I'm looking for links, books, wikis, etc. Stuff about BigTable is also welcome. Thanks. ...

Running MRToolkit hadoop jobs on AWS elastic map/reduce

Hi all, loving MRToolkit -- great to get away from Java while writing Hadoop jobs. It has become apparent that the library was written to interface with an EC2 cluster, and not with Amazon's elastic map/reduce system. Does anybody have insights into running jobs defined using the toolkit on elastic map/reduce servers? It isn't readil...

How HBase partitions table across regionservers?

Hello, please tell me how HBase partitions a table across regionservers. For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers. Does this mean that the first regionserver will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ..., the tenth 9M - 10M? I would like my row key to be ti...
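Roughly speaking, HBase does not pre-assign one fixed range per server: a table starts as a single region and is split into contiguous, sorted row-key ranges as it grows, and those regions are then spread over the regionservers. Keys compare lexicographically as bytes. A toy Python sketch of the range-partitioning idea, with invented split points:

```python
# Toy sketch of range partitioning by sorted row key -- roughly how
# HBase maps rows to regions: each region owns one contiguous, sorted
# key range. Split points below are invented; HBase picks its own as
# regions grow and split.
from bisect import bisect_right

split_points = ["2000000", "4000000", "6000000", "8000000"]  # => 5 regions

def region_for(row_key):
    """Index of the region owning this row key (keys compare as strings)."""
    return bisect_right(split_points, row_key)

# Because keys compare lexicographically, integer keys must be
# zero-padded to a fixed width, or "42" would sort after "1000000".
region = region_for("5000000")
```

This is also why monotonically increasing keys (like timestamps) funnel all writes into the last region; the usual advice is to salt or reverse such keys.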

Is there a canonical problem that provably can't be aided with map/reduce?

I'm trying to understand the boundaries of hadoop and map/reduce, and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't assist with. It certainly would be interesting if changing one factor of the problem would allow simplification via map/reduce. Thank you ...

Why does the Hadoop incompatible namespaceIDs issue happen?

This is a fairly well-documented error and the fix is easy, but does anyone know why Hadoop datanode NamespaceIDs can get screwed up so easily, or how Hadoop assigns the NamespaceIDs when it starts up the datanodes? Here's the error: 2010-08-06 12:12:06,900 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Inco...

Can DistributedCache copy an entire folder of files

Hi, I am using hadoop and working on my map and reduce tasks, where I need to be able to access configuration files that are stored in a named folder. I also have a folder of jars that map must call using ProcessBuilder, as these jars don't belong to me. At this point I am storing the files in NFS, where all the hadoop nodes can ac...