Hello,
I have a 'large' set of line-delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to it. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reducing phase is to collect these results into ...
Hi
My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS.
I understand that Pig's language, Pig Latin, is a shift from the SQL-like declarative style of programming (it suits the way programmers think), and that Hive's query language closely
...
Hello,
I would like to know how to retrieve data from aggregated logs? This is what I have:
- about 30GB daily of uncompressed log data loaded into HDFS (and this will grow soon to about 100GB)
This is my idea:
- each night this data is processed with Pig
- logs are read, split, and custom UDF retrieves data like: timestamp, url, user_id...
given my input data in userid,itemid format:
raw: {userid: bytearray,itemid: bytearray}
dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)
grpd = GROUP raw BY userid;
dump grpd;
(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})
I'd like to generate all of the combinations (order not important) of ...
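The question is cut off before it says what should be combined, so the following is only a minimal sketch under one assumption: that "combinations" means unordered pairs of itemids within each user's group. It is written in plain Python rather than Pig Latin, and the names `group_by_user` and `item_pairs` are hypothetical helpers introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows copied from the `dump raw;` output above.
raw = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
       ("B", 2), ("B", 3), ("B", 5),
       ("C", 1), ("C", 5)]

def group_by_user(rows):
    """Mimic GROUP raw BY userid: map each userid to its list of itemids."""
    grouped = defaultdict(list)
    for userid, itemid in rows:
        grouped[userid].append(itemid)
    return dict(grouped)

def item_pairs(rows):
    """Assumption: 'combinations' means unordered pairs of items per user."""
    return {user: list(combinations(sorted(items), 2))
            for user, items in group_by_user(rows).items()}

pairs = item_pairs(raw)
for user, user_pairs in sorted(pairs.items()):
    print(user, user_pairs)
```

Because `itertools.combinations` never emits both (1,2) and (2,1), the "order not important" requirement is met without a separate de-duplication step.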
I love Hadoop streaming for its ability to quickly pump out quick-and-dirty one-off map reduce jobs. I also love Groovy for making all my carefully coded Java accessible to a scripting language. Now I'd like to put the two together. I'd like to take a jar with some of my Java classes and use these in Groovy-based mappers and reducers...
Given that the complexities of the map and reduce tasks are O(map)=f(n) and O(reduce)=g(n), has anybody taken the time to write down how the Map/Reduce intrinsic operations (sorting, shuffling, sending data, etc.) increase the computational complexity? What is the overhead of the Map/Reduce orchestration?
I know that this is a nonsense wh...
Hi,
I am running Hadoop 0.20.1 under SLES 10 (SUSE).
My Map task takes a file and generates a few more; I then generate my results from these files. I would like to know where I should place these files so that performance is good and there are no collisions. If Hadoop can delete the directory automatically, that would be nice.
Right...
Hi,
Is there a distance calculation implementation using Hadoop map/reduce? I am trying to calculate the distances between the points in a given set.
Looking for any resources ..
//edited ............
This is a very intelligent solution. I tried something like the first algorithm, and I get almost what I was looking for. I am not concer...
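The question does not say which distance is meant or how the points are stored, so this is only a rough sketch under two assumptions: Euclidean distance, and a point set small enough to enumerate every pair. It runs locally in plain Python (3.8+ for `math.dist`) and only imitates the shape of a map step; `map_pair` and `pairwise_distances` are hypothetical names, not part of any Hadoop API.

```python
import math
from itertools import combinations

def map_pair(pair):
    """Map step: take two labelled points and emit ((id_a, id_b), distance)."""
    (id_a, a), (id_b, b) = pair
    return (id_a, id_b), math.dist(a, b)

def pairwise_distances(points):
    """points: dict of point id -> coordinate tuple.

    Returns a dict mapping each unordered id pair to its Euclidean distance.
    """
    labelled = sorted(points.items())  # fixed order so each pair appears once
    return dict(map(map_pair, combinations(labelled, 2)))

if __name__ == "__main__":
    sample = {"p": (0.0, 0.0), "q": (3.0, 4.0), "r": (6.0, 8.0)}
    for ids, dist in pairwise_distances(sample).items():
        print(ids, dist)
```

The number of pairs grows quadratically with the number of points, so a real job on a large set would need some way to split the pairs across mappers; that part is not shown here.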
Hi all!
I'm running a Hadoop job over 1.5 TB of data that does a lot of pattern matching. I have several machines with 16GB RAM each, and I always get OutOfMemoryException on this job with this data (I'm using Hive).
I would like to know how to optimally set the HADOOP_HEAPSIZE option in hadoop-env.sh so that my job does not fail. Is it eve...
I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by the second value, so that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group them into a dictionary with val...
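One way to avoid holding the whole file in a dictionary is to treat the problem as a map step that swaps the columns, a sort, and a reduce step that only ever looks at one key at a time. The sketch below runs that pipeline locally on the sample rows; `mapper`, `reducer` and `run_local` are hypothetical names, and the call to `sorted` is a local stand-in for the sort that a MapReduce framework performs between the two phases.

```python
from itertools import groupby

def mapper(lines):
    """Swap the columns so the second value becomes the key."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            first, second = parts
            yield second, first

def reducer(sorted_pairs):
    """Collect the first-column values for each key.

    The input must already be sorted by key, so only one group
    is held in memory at a time.
    """
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]

def run_local(lines):
    """Local stand-in for map -> sort -> reduce on a small input."""
    shuffled = sorted(mapper(lines), key=lambda kv: int(kv[0]))
    return [" ".join([key] + values) for key, values in reducer(shuffled)]

if __name__ == "__main__":
    sample = ["1 5", "1 12", "2 5", "2 341", "2 12"]
    print("\n".join(run_local(sample)))
```

On the five sample rows this produces the three lines shown in the question. For the full 34 GB file the in-memory `sorted` call would have to be replaced by an external or distributed sort.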
Hi all,
I'm a beginner in Hadoop.
I've understood the WordCount program. Now I have a problem: I don't want the output for all the words.
- Words_I_Want.txt -
hello
echo
raj
- Text.txt -
hello eveyone. I want hello and echo count
output should be
hello 2
echo 1
raj 0
Now that was an example. My actual data is very large.
...
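The counting logic for the example above can be sketched in a few lines of plain Python. This is not a Hadoop job: `count_wanted` is a hypothetical helper, matching is assumed to be case-sensitive on whole words, and in a real job the wanted-word list would also have to be made available to every mapper, which is not shown here.

```python
import re

def count_wanted(wanted_words, text_lines):
    """Count only the words listed in wanted_words.

    Every wanted word starts at 0, so words that never occur
    (like 'raj' in the example) still appear in the result.
    """
    counts = {word: 0 for word in wanted_words}
    for line in text_lines:
        for token in re.findall(r"\w+", line):
            if token in counts:
                counts[token] += 1
    return counts

if __name__ == "__main__":
    wanted = ["hello", "echo", "raj"]
    text = ["hello eveyone. I want hello and echo count"]
    for word, n in count_wanted(wanted, text).items():
        print(word, n)
```

Starting every wanted word at zero is what makes the `raj 0` line appear; a standard word count only emits words it has actually seen.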
My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output.
Given the current load a single multicore server will do fine for the coming year or so. We do not (yet) have the need to go for a multiserver Hadoop cluster, yet we chose to start this project "being prepared".
When I run...
I've been tasked with processing multiple terabytes' worth of SCM data for my company. I set up a Hadoop cluster and have a script to pull data from our SCM servers.
Since I'm processing data in batches through the streaming interface, I came across an issue with the block sizes that O'Reilly's Hadoop book doesn't seem to address: wha...
Hi
I have a weird problem: DistributedCache appears to change the names of my files. It uses the original name as the parent folder and adds the file as a child,
i.e. folder\filename.ext becomes folder\filename.ext\filename.ext
Any ideas? My code is below.
Thanks
Akintayo
String paramsLocation="/user/fwang/settings/ecgparams.txt";
D...
What are recommended resources for learning HBase? The only ones I can think of are the HBase wiki and one chapter in the book "Hadoop: The Definitive Guide". Are there any other good resources? I'm looking for links, books, wikis, etc.
Stuff about BigTable is also welcome.
Thanks.
...
Hi All,
Loving MRToolkit -- great to get away from Java while writing Hadoop jobs. It has become apparent that the library was written to interface with an EC2 cluster, and not with Amazon's elastic map/reduce system. Does anybody have insights into running jobs defined using the toolkit on elastic map/reduce servers? It isn't readil...
Hello,
Please tell me how HBase partitions a table across regionservers.
For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers.
Does this mean that the first regionserver will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ... the tenth 9M - 10M?
I would like my row key to be ti...
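HBase keeps rows sorted by row key and divides a table into regions that each cover a contiguous key range, with row keys compared as raw bytes. The sketch below only illustrates that range lookup. The split points are made up for the example (HBase chooses its own as regions grow, or takes them from a pre-split), region-to-regionserver assignment is not modeled, and `encode_key` and `region_index` are hypothetical names.

```python
from bisect import bisect_right

def encode_key(n: int) -> bytes:
    """Zero-pad to 8 digits so byte-wise order matches numeric order."""
    return f"{n:08d}".encode()

# Hypothetical split points: nine boundaries giving ten regions,
# at 1M, 2M, ..., 9M. Real split points depend on the table.
SPLIT_KEYS = [encode_key(i * 1_000_000) for i in range(1, 10)]

def region_index(row_key: bytes, split_keys=SPLIT_KEYS) -> int:
    """Return the index of the region whose [start, end) range holds row_key."""
    return bisect_right(split_keys, row_key)

if __name__ == "__main__":
    for n in (0, 999_999, 1_000_000, 5_432_100, 9_999_999):
        print(n, "-> region", region_index(encode_key(n)))
    # Without padding, byte order differs from numeric order:
    print(b"10" < b"9")
```

The padding matters because keys are ordered as bytes: an unpadded "10" sorts before "9", so integer keys stored as plain decimal strings would not fall into numeric ranges.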
I'm trying to understand the boundaries of Hadoop and map/reduce, and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't assist with.
It certainly would be interesting if changing one factor of the problem would allow simplification from map/reduce.
Thank you
...
This is a fairly well-documented error and the fix is easy, but does anyone know why Hadoop datanode NamespaceIDs can get screwed up so easily or how Hadoop assigns the NamespaceIDs when it starts up the datanodes?
Here's the error:
2010-08-06 12:12:06,900 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Inco...
Hi,
I am using Hadoop and working on my map and reduce tasks, where I need to be able to access configuration files that are stored in a named folder. I also have a folder with jars that the map task must call using process builder, as these jars don't belong to me.
At this point I am storing the files in NFS, where all the Hadoop nodes can ac...