hadoop

Implementing large-scale log file analytics

Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, and Google perform the large-scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics? Focusing on web analytics in particular, I'm interested in two closely related aspects: query perf...
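For flavor, analyses like this typically decompose into MapReduce jobs over the raw logs. Below is a hedged sketch of the map side of a per-URL hit count in Java, assuming a hypothetical tab-delimited log format with the URL in the first field; a reducer that sums the emitted 1s then yields the per-URL totals.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (url, 1) for every log line; the shuffle groups by URL for a summing reducer.
public class HitMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
    String[] fields = line.toString().split("\t");  // hypothetical log layout
    out.collect(new Text(fields[0]), ONE);          // field 0 assumed to be the URL
  }
}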

hadoop behind the scenes

Can someone explain what Hadoop is in terms of the ideas behind the software? What makes it so popular and/or powerful? ...
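The core idea is the map/shuffle/reduce model: map tasks run in parallel over chunks of the input and emit key/value pairs, the framework groups all values by key, and reduce tasks aggregate each group, with HDFS replicating the data underneath. A hedged sketch of the canonical word count in Java, using the old org.apache.hadoop.mapred API current at the time (details vary by version):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // Map: emit (word, 1) for every word in the line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable key, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      for (String w : line.toString().split("\\s+")) {
        if (!w.isEmpty()) out.collect(new Text(w), ONE);
      }
    }
  }
  // Reduce: sum the counts the framework has grouped under each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      int sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(word, new IntWritable(sum));
    }
  }
}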

Parallelizing Ruby reducers in Hadoop?

A simple wordcount reducer in Ruby looks like this:

#!/usr/bin/env ruby
wordcount = Hash.new
STDIN.each_line do |line|
  keyval = line.split("|")
  wordcount[keyval[0]] = wordcount[keyval[0]].to_i + keyval[1].to_i
end
wordcount.each_pair do |word, count|
  puts "#{word}|#{count}"
end

It receives all the mappers' intermediate values on STDIN. Not f...
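On the parallelism itself: Hadoop Streaming will run several reducer processes, each receiving all the lines for its share of the keys. A hedged invocation for the Hadoop versions of that era (the jar path and script names are placeholders):

hadoop jar hadoop-streaming.jar \
    -jobconf mapred.reduce.tasks=8 \
    -input logs -output counts \
    -mapper wcmap.rb -reducer wcreduce.rb

Because the partitioner routes every occurrence of a key to exactly one reducer, the Hash-accumulating reducer above still sees complete counts for each word.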

How do I control the output file names and content of a Hadoop Streaming job?

Is there a way to control the output filenames of a Hadoop Streaming job? Specifically, I would like my job's output files' content and names to be organized by the key the reducer outputs: each file would only contain values for one key, and its name would be the key. Update: Just found the answer - Using a Java class that derives from ...
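For reference, the usual base class for this pattern in the old API is org.apache.hadoop.mapred.lib.MultipleTextOutputFormat; a hedged sketch (the class name KeyBasedOutput is made up here):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to an output file named after its key and strips the
// key from the file body, so each file holds only that key's values.
public class KeyBasedOutput<K, V> extends MultipleTextOutputFormat<K, V> {
  @Override
  protected String generateFileNameForKeyValue(K key, V value, String name) {
    return key.toString();  // one output file per distinct key
  }
  @Override
  protected K generateActualKey(K key, V value) {
    return null;            // write only the value into the file
  }
}

A Streaming job would wire this in with the -outputformat option.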

Nutch search always returns 0 results

I have set up Nutch 1.0 on a cluster, and it has successfully crawled. I copied the crawl directory using dfs -copyToLocal and set the value of searcher.dir in the nutch-site.xml file located in the Tomcat directory to point to that directory. Still, when I try to search, I receive 0 results. Any help would be greatly a...

Hadoop cluster: 2 fast, 4 medium, or 8 slower machines?

We're going to purchase some new hardware to use just for a Hadoop cluster and we're stuck on what we should purchase. Say we have a budget of $5k: should we buy two super-nice machines at $2500 each, four at around $1200 each, or eight at around $600 each? Will Hadoop work better with more, slower machines or fewer, much faster machines? ...

How do I set the priority/pool of a Hadoop Streaming job?

How can I set the priority/pool of a Hadoop Streaming job? It's probably a command-line jobconf parameter (e.g. -jobconf something=pool.name), but I haven't been able to find any documentation on this online... ...
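On the Hadoop versions of that era, the priority was the mapred.job.priority property (VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW), and with the fair scheduler the pool was mapred.fairscheduler.pool; a hedged example with placeholder paths and scripts:

hadoop jar hadoop-streaming.jar \
    -jobconf mapred.job.priority=HIGH \
    -jobconf mapred.fairscheduler.pool=mypool \
    -input in -output out -mapper map.rb -reducer reduce.rb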

Is HBase stable and production-ready?

For folks who have deployed HBase on their own clusters, do you feel that it's sufficiently stable for production use? What types of troubles or issues have you run into? I do see a bunch of companies listed as using HBase in production (http://wiki.apache.org/hadoop/Hbase/PoweredBy), but I'm curious as to whether a lot of maintenance,...

getting data in and out of hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop: is it possible to have the nodes on my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit the temp...
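On the ingestion point: HDFS files can be written as a stream through the FileSystem API, with no local temp file needed. A minimal sketch, assuming the data arrives on an InputStream called source:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsIngest {
  // Copies an arbitrary input stream into an HDFS file as the bytes arrive.
  public static void ingest(InputStream source, String target) throws Exception {
    Configuration conf = new Configuration();    // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path(target));
    IOUtils.copyBytes(source, out, 4096, true);  // true = close both streams when done
  }
}

One caveat for the HDFS of that era: bytes only become visible to readers (and fully durable) once the file is closed, so streamed input is often rolled into a new file every few minutes.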

Processing files with headers in Hadoop

I want to process a lot of files in Hadoop -- each file has some header information, followed by a lot of records, each stored in a fixed number of bytes. Any suggestions on that? ...
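One hedged approach, assuming the records are not interpretable without their file's header: mark the files non-splittable so each is read whole by one mapper, and parse the header once in a custom reader. In the sketch below, FixedRecordReader is a hypothetical class that would read the header and then return one fixed-size record per next() call:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FixedRecordInputFormat extends FileInputFormat<LongWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // records need the header, so keep each file whole
  }
  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FixedRecordReader((FileSplit) split, job);  // hypothetical reader
  }
}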

HBase distributed scanner

In the "API usage example" on "Getting started" page in HBase documentation there is an example of scanner usage: Scanner scanner = table.getScanner(new String[]{"myColumnFamily:columnQualifier1"}); RowResult rowResult = scanner.next(); while (rowResult != null) { //... rowResult = scanner.next(); } As I understand, t...

hadoop hive question

I'm trying to create tables programmatically using JDBC. However, I can't actually see the table I created from the Hive shell. What's worse, when I access the Hive shell from different directories, I see different contents of the database. Is there any setting I need to configure? Thanks in advance. ...
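A frequent cause of exactly this symptom is the default embedded Derby metastore: Hive creates a metastore_db directory under whatever working directory it is launched from, so each directory effectively gets its own database. If that is the case here, pinning the metastore to one absolute path in hive-site.xml should help; a hedged sketch (the path is a placeholder):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore_db;create=true</value>
</property>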

copy ResultSet without using CachedRowSet

I'm trying to close the connection after executing a query. Previously, I would just create a CachedRowSetImpl instance and it would take care of releasing the resources for me. However, I am using the Hive database driver from the Hadoop project, and it doesn't support CachedRowSetImpl.execute(). I'm wondering whether there is any other way that allows me to copy the re...
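A hedged workaround, assuming the result set fits in memory: walk the ResultSet once and copy its rows into a plain list, after which the connection can be closed. (Early Hive JDBC drivers left many methods unimplemented, so getString(i + 1) may be a safer call than getObject.)

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.ArrayList;
import java.util.List;

public class ResultSetCopier {
  // Materializes all rows so the ResultSet and its connection can be closed.
  public static List<Object[]> copyRows(ResultSet rs) throws Exception {
    ResultSetMetaData meta = rs.getMetaData();
    int cols = meta.getColumnCount();
    List<Object[]> rows = new ArrayList<Object[]>();
    while (rs.next()) {
      Object[] row = new Object[cols];
      for (int i = 0; i < cols; i++) {
        row[i] = rs.getObject(i + 1);  // JDBC column indexes are 1-based
      }
      rows.add(row);
    }
    return rows;
  }
}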

How does the MapReduce sort algorithm work?

Hi, One of the main examples used to demonstrate the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment. To me, sorting simply involves determining the relative position of an element in relation to all other elements. So sor...
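The piece that usually resolves the confusion: each reducer's input already arrives sorted by key, and Terasort adds a range partitioner so that every key sent to reducer i is smaller than every key sent to reducer i+1; concatenating the reducers' output files in order is then a globally sorted result. A toy sketch of that kind of partitioner (the real one samples the input to choose balanced cut points):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Toy range partitioner: assumes keys start with uniformly distributed
// lowercase letters; Terasort instead samples the data to pick cut points.
public class RangePartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) {}
  public int getPartition(Text key, Text value, int numPartitions) {
    int bucket = (key.toString().charAt(0) - 'a') * numPartitions / 26;
    return Math.min(Math.max(bucket, 0), numPartitions - 1);
  }
}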

Java Generics & Hadoop: how to get a class variable

Hi, I'm a .NET programmer doing some Hadoop work in Java and I'm kind of lost here. In Hadoop I am trying to set up a MapReduce job where the output key of the map job is of the type Tuple<IntWritable,Text>. When I set the output key using setOutputKeyClass as follows:

JobConf conf2 = new JobConf(OutputCounter.class);
conf2.setOutputKey...
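If the stumbling block is that Tuple<IntWritable,Text>.class is not legal Java, the usual resolution is to pass the raw class, since type parameters are erased at runtime (the class must also implement WritableComparable to serve as a map output key):

conf2.setOutputKeyClass(Tuple.class);  // generics are erased; only the raw class exists at runtime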

Thrift C# getRows

I'm having trouble implementing the Thrift API in my C# program. The libs are built and it seems to run like it should, but one function is giving me trouble. As I understand it, getRows() is supposed to return a list of TRowResult; however, it's only returning the first row in my table. My foreach loop only runs once. Anyone have experi...

Dealing with Gigabytes of Data

I am going to start on a new project in which I need to deal with hundreds of gigabytes of data in a .NET application. It is too early to give much detail about this project, but some overview follows: lots of writes and lots of reads on the same tables, very real-time; scaling is very important, as the client insists on expansion of database serv...

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I...

Hadoop Input Files

Is there a difference between having, say, n files with 1 line each in the input folder and having 1 file with n lines in the input folder when running Hadoop? If there are n files, does the InputFormat just see it all as 1 continuous file? ...

How can I use .pcap (binary) input logs with MapReduce in Hadoop?

Tcpdump logs are binary files, and I want to know which Hadoop FileInputFormat I should use to split the input data into chunks... please help me! ...
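There is no stock Hadoop FileInputFormat for pcap; the usual route is a custom, non-splittable FileInputFormat whose RecordReader walks the length-prefixed packet records. A hedged sketch of the reader's inner step, assuming a little-endian capture (the magic number in the 24-byte pcap global header tells you the byte order):

import java.io.DataInputStream;
import java.io.IOException;

public class PcapRecords {
  // Reads one pcap packet record (16-byte header + payload) from a
  // little-endian capture; returns null at end of stream.
  static byte[] nextPacket(DataInputStream in) throws IOException {
    if (in.available() < 16) return null;               // crude end-of-stream check
    int tsSec   = Integer.reverseBytes(in.readInt());   // timestamp, seconds
    int tsUsec  = Integer.reverseBytes(in.readInt());   // timestamp, microseconds
    int inclLen = Integer.reverseBytes(in.readInt());   // bytes stored in the file
    int origLen = Integer.reverseBytes(in.readInt());   // original length on the wire
    byte[] packet = new byte[inclLen];
    in.readFully(packet);
    return packet;
  }
}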