Can anyone point me to a reference, or provide a high-level overview, of how companies like Facebook, Yahoo, and Google perform the large-scale (e.g. multi-TB) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query perf...
Can someone explain what Hadoop is, in terms of the ideas behind the software? What makes it so popular and/or powerful?
...
A simple wordcount reducer in Ruby looks like this:
#!/usr/bin/env ruby
wordcount = Hash.new
STDIN.each_line do |line|
  keyval = line.split("|")
  wordcount[keyval[0]] = wordcount[keyval[0]].to_i + keyval[1].to_i
end
wordcount.each_pair do |word, count|
  puts "#{word}|#{count}"
end
It receives all of the mappers' intermediate values on STDIN. Not f...
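For comparison, a matching mapper can be sketched in Ruby as well (a hypothetical sketch, not from the original post): it reads raw text on STDIN and emits one word|1 pair per word, using the same "|" separator the reducer splits on.

```ruby
#!/usr/bin/env ruby
# Hypothetical wordcount mapper to pair with the reducer above. Hadoop
# Streaming runs it with raw input lines on STDIN; it prints one
# "word|1" intermediate pair per line of output.

# Turn one line of raw text into "word|1" intermediate pairs.
def map_line(line)
  line.split.map { |word| "#{word}|1" }
end

STDIN.each_line do |line|
  map_line(line).each { |pair| puts pair }
end
```

Between the two phases, Hadoop Streaming sorts the mappers' output by key, which is why the reducer can total counts as lines arrive.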
Is there a way to control the output filenames of a Hadoop Streaming job?
Specifically, I would like my job's output files' contents and names to be organized by the key the reducer outputs: each file would contain values for only one key, and its name would be the key.
Update:
Just found the answer - Using a Java class that derives from ...
I have set up Nutch 1.0 on a cluster. It has successfully crawled; I copied the crawl directory using dfs -copyToLocal and set the value of searcher.dir in the nutch-site.xml file located in the Tomcat directory to point to that directory. Still, when I try to search, I receive 0 results.
Any help would be greatly a...
We're going to purchase some new hardware to use just for a Hadoop cluster and we're stuck on what we should purchase. Say we have a budget of $5k: should we buy two very nice machines at $2500 each, four at around $1200 each, or eight at around $600 each? Will Hadoop work better with more, slower machines or with fewer, much faster machines? ...
How can I set the Priority/Pool of a Hadoop Streaming job?
It's probably a command-line jobconf parameter (e.g. -jobconf something=pool.name), but I haven't been able to find any documentation on this online...
...
For folks who have deployed HBase on their own clusters, do you feel that it's sufficiently stable for production use? What types of troubles or issues have you run into?
I do see a bunch of companies listed as using HBase in production (http://wiki.apache.org/hadoop/Hbase/PoweredBy), but I'm curious as to whether a lot of maintenance,...
I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data as they get it into HDFS? Or would each node need to write to a local temp file and submit the temp...
I want to process a lot of files in Hadoop -- each file has some header information, followed by a lot of records, each stored in a fixed number of bytes. Any suggestions on that?
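Setting Hadoop aside for a moment, the record layout itself is straightforward to handle: skip the header once, then read fixed-size chunks until EOF. A plain-Ruby sketch with hypothetical sizes (an 8-byte header and 4-byte records here; a custom RecordReader would do the same thing per split):

```ruby
require "stringio"

HEADER_SIZE = 8   # hypothetical: bytes of header to skip in each file
RECORD_SIZE = 4   # hypothetical: fixed size of each record in bytes

# Read a file-like object: skip the header, then yield fixed-size records.
def each_record(io)
  io.read(HEADER_SIZE)                     # discard the header
  while (record = io.read(RECORD_SIZE))
    break if record.bytesize < RECORD_SIZE # ignore a trailing partial record
    yield record
  end
end

# Example: an 8-byte header followed by three 4-byte records.
data = StringIO.new("HEADER!!" + "AAAA" + "BBBB" + "CCCC")
records = []
each_record(data) { |r| records << r }
# records is now ["AAAA", "BBBB", "CCCC"]
```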
...
In the "API usage example" on the "Getting started" page of the HBase documentation, there is an example of scanner usage:
Scanner scanner = table.getScanner(
    new String[] { "myColumnFamily:columnQualifier1" });
RowResult rowResult = scanner.next();
while (rowResult != null) {
    //...
    rowResult = scanner.next();
}
As I understand, t...
I'm trying to create tables programmatically using JDBC. However, I can't see the tables I created from the Hive shell. Worse, when I access the Hive shell from different directories, I see different contents of the database.
Is there any setting I need to configure?
Thanks in advance.
...
I'm trying to close the connection after executing a query. Previously, I would just create a CachedRowSetImpl instance and it would take care of releasing the resources for me. However, I am using the Hive database driver from the Hadoop project, and it doesn't support CachedRowSetImpl.execute(). I'm wondering whether there is any other way that allows me to copy the re...
Hi,
One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.
To me sorting simply involves determining the relative position of an element in relationship to all other elements. So sor...
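The missing piece is range partitioning: keys are partitioned into disjoint, ordered ranges (TeraSort picks the split points by sampling the input), each reducer sorts only its own range, and concatenating the reducers' outputs in range order is already a global sort. A plain-Ruby sketch of the idea (the keys and split points below are made up for illustration):

```ruby
# Sketch of distributed sort via range partitioning, the idea behind
# TeraSort: route keys into disjoint ranges, sort each range
# independently, then concatenate the sorted ranges in order.

# Assign a key to a partition based on ordered split points.
def partition_for(key, split_points)
  split_points.each_with_index do |point, i|
    return i if key < point
  end
  split_points.size # key belongs in the last partition
end

keys = [42, 7, 99, 15, 63, 3, 88, 27]
split_points = [30, 70] # hypothetical, chosen by sampling; 3 partitions

# "Map" phase: route each key to a partition (one per reducer).
partitions = Array.new(split_points.size + 1) { [] }
keys.each { |k| partitions[partition_for(k, split_points)] << k }

# "Reduce" phase: each reducer sorts only its own range, in parallel.
sorted_partitions = partitions.map(&:sort)

# Concatenating the partitions in range order yields a global sort.
result = sorted_partitions.flatten
# result == [3, 7, 15, 27, 42, 63, 88, 99]
```

No single node ever has to compare its elements against all the others; the split points alone guarantee everything in partition i sorts before everything in partition i+1.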
Hi,
I'm a .NET programmer doing some Hadoop work in Java and I'm kind of lost here. In Hadoop I am trying to set up a Map-Reduce job where the output key of the Map job is of the type Tuple<IntWritable,Text>. When I set the output key using setOutputKeyClass as follows
JobConf conf2 = new JobConf(OutputCounter.class);
conf2.setOutputKey...
I'm having trouble implementing the Thrift API in my C# program. The libs are built and it seems to run as it should, but one function is giving me trouble. As I understand it, getRows() is supposed to return a list of TRowResult; however, it's only returning the first row in my table. My foreach loop only runs once. Anyone have experi...
I am going to start on a new project. I need to deal with hundreds of gigs of data in a .NET application. It is too early to give much detail about this project. An overview follows:
Lots of writes and lots of reads on the same tables, very real-time
Scaling is very important, as the client insists on expansion of database serv...
I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I...
Is there a difference between having, say, n files with 1 line each in the input folder and having 1 file with n lines in the input folder when running Hadoop?
If there are n files, does the "InputFormat" just see it all as 1 continuous file?
...
Tcpdump logs are binary files. I want to know which Hadoop FileInputFormat I should use to split the input data into chunks. Please help!
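For background on why no stock FileInputFormat fits: a tcpdump capture (pcap) file has a 24-byte global header followed by variable-length packet records, each with a 16-byte header whose incl_len field gives the payload size, so a split at an arbitrary byte offset can land mid-record. A plain-Ruby sketch of walking those record boundaries (assuming little-endian headers):

```ruby
require "stringio"

GLOBAL_HEADER_SIZE = 24  # pcap file-level header
PACKET_HEADER_SIZE = 16  # per-packet: ts_sec, ts_usec, incl_len, orig_len

# Walk a pcap stream and yield each captured packet's payload.
# Assumes little-endian headers (the common on-disk byte order).
def each_packet(io)
  io.read(GLOBAL_HEADER_SIZE) # skip the file-level header
  while (header = io.read(PACKET_HEADER_SIZE))
    break if header.bytesize < PACKET_HEADER_SIZE
    _ts_sec, _ts_usec, incl_len, _orig_len = header.unpack("V4")
    yield io.read(incl_len)   # payload length varies per packet
  end
end

# Tiny synthetic example: blank global header + two packets (3 and 5 bytes).
data  = "\x00" * 24
data += [0, 0, 3, 3].pack("V4") + "abc"
data += [0, 0, 5, 5].pack("V4") + "hello"
packets = []
each_packet(StringIO.new(data)) { |p| packets << p }
# packets == ["abc", "hello"]
```

A splitter has to follow this chain of length fields from the start of the file (or write whole files as unsplittable), which is exactly what a custom InputFormat/RecordReader would encode.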
...