In the past I built web-analytics systems using OLAP cubes running on MySQL.
Now an OLAP cube, the way I used it, is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. which pagename, useragent, ip,...
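As a concrete (hypothetical) illustration, one row of such a cube, reduced to Python terms, is just a handful of dimension values plus an aggregated measure:

# A hypothetical fact row: dimension values plus one aggregated measure.
row = {
    "pagename":  "/index.html",   # dimension
    "useragent": "Mozilla/5.0",   # dimension
    "ip":        "10.0.0.1",      # dimension
    "pageviews": 42,              # aggregated measurement
}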
I'm thinking about building a small testing application in Hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "the 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have pla...
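A minimal sketch of such a streaming reducer (assuming "worst" means largest, tab-separated key<TAB>value lines, and input grouped by key, as streaming guarantees): a bounded heap keeps memory constant no matter how many values a key has.

#!/usr/bin/env python
import sys
import heapq

TOP_N = 10

def emit(key, heap):
    # Print the retained values for this key, worst (largest) first.
    for value in sorted(heap, reverse=True):
        print('%s\t%d' % (key, value))

current, heap = None, []
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current:
        if current is not None:
            emit(current, heap)
        current, heap = key, []
    # Keep only the TOP_N largest values seen so far for this key;
    # heapq is a min-heap, so popping discards the smallest.
    heapq.heappush(heap, int(value))
    if len(heap) > TOP_N:
        heapq.heappop(heap)
if current is not None:
    emit(current, heap)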
As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.
In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of results. Seems simple, but since you can iterate the process, even algorithms like K-means or P...
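For what it's worth, one K-means round does fit that shape. A plain-Python sketch (not actual Hadoop code; points and centroids are hypothetical lists of float tuples) of a single map/reduce iteration:

from collections import defaultdict

def nearest(point, centroids):
    # Index of the closest centroid by squared Euclidean distance.
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def kmeans_iteration(points, centroids):
    # Map step: emit (centroid index, point) pairs.
    groups = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)
    # Reduce step: recompute each centroid as the mean of its points.
    return [tuple(sum(dim) / len(pts) for dim in zip(*pts))
            for pts in groups.values()]

Iterating means re-running the job with the new centroids as input, which is exactly where the MapReduce-versus-MPI control question bites.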
Using only a mapper (a Python script) and no reducer, how can I output a separate file, named after the key, for each line of output, rather than having long files of output?
...
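One sketch of the idea, core logic only: in a real streaming job these files land on the task node's local disk rather than in HDFS, so treat this as the shape of the answer, not a drop-in solution. It also assumes keys are safe to use as filenames.

#!/usr/bin/env python
# Instead of printing to stdout (which Hadoop streaming collects into
# part-* files), route each record to a file named after its key.
import sys

handles = {}
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if key not in handles:
        handles[key] = open(key, 'a')
    handles[key].write(value + '\n')

for f in handles.values():
    f.close()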
Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table and then performi...
I have implemented an unweighted random walk function for a graph that I built in Python using NetworkX. Below is a snippet of my program that deals with the random walk. Elsewhere in my program, I have a method that creates the graph, and I have a method that simulates various custom graph testing methods that I've written. One of these...
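For reference, a minimal unweighted random walk over a NetworkX graph looks something like this (names and the example graph are illustrative):

import random
import networkx as nx

def random_walk(G, start, steps):
    # Walk `steps` edges, choosing uniformly among neighbors each time.
    path = [start]
    node = start
    for _ in range(steps):
        neighbors = list(G.neighbors(node))
        if not neighbors:      # dead end: stop the walk early
            break
        node = random.choice(neighbors)
        path.append(node)
    return path

# Example: a short walk on a small random graph.
G = nx.erdos_renyi_graph(20, 0.2)
print(random_walk(G, start=0, steps=10))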
I'm building a Hadoop (0.20.1) mapreduce job that uses HBase (0.20.1) as both the data source and data sink. I would like to write the job in Python which has required me to use hadoop-0.20.1-streaming.jar to stream data to and from my Python scripts. This works fine if the data source/sink are HDFS files.
Does Hadoop support streaming...
I have a need to perform distributed searching across a largish set of small files (~10M) with each file being a set of key: value pairs. I have a set of servers with a total of 56 CPU cores available for this - these are mostly dual core and quad core, but also a large DL785 with 16 cores.
The system needs to be designed for online qu...
I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and reducer in Python and used Hadoop streaming to run them.
Here is my mapper:
#!/usr/bin/env python
import sys
import re
import string
def remove_html_tags(in_text):
    '''
    Remove any HTML tags from the input text.
    '''
    # One common implementation: strip anything between angle brackets.
    return re.sub(r'<[^>]+>', '', in_text)
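The reducer side of such a streaming word count, for reference (a sketch assuming the mapper ultimately emits "word<TAB>1" lines, which streaming delivers sorted by key):

#!/usr/bin/env python
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip('\n').split('\t', 1)
    if word != current:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print('%s\t%d' % (current, count))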
When writing a MapReduce job (specifically Hadoop, if relevant), one must define a map() and a reduce() function, both yielding a sequence of key/value pairs. The data types of the key and value are free to be defined by the application.
In the canonical example of word counting, both functions yield pairs of type (string, int) with the k...
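Sketched as plain Python generators, those (string, int) signatures look like this:

# Word count with both functions yielding (string, int) pairs.
def map_fn(_key, document):
    for word in document.split():
        yield word, 1              # (string, int)

def reduce_fn(word, counts):
    yield word, sum(counts)        # (string, int)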
I'm doing some work to analyse the access logs from a Catalyst web application. The data comes from the load balancers in front of the web farm and totals about 35 GB per day. It's stored in a Hadoop HDFS filesystem and I use MapReduce (via Dumbo, which is great) to crunch the numbers.
The purpose of the analysis is to try to establish a usage...
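For readers who haven't met Dumbo, a job in it is just a pair of Python generators; a rough sketch counting hits per request path (the field index is hypothetical, since the log format isn't shown):

def mapper(key, value):
    fields = value.split()
    if len(fields) > 6:
        yield fields[6], 1     # hypothetical position of the path

def reducer(key, values):
    yield key, sum(values)

if __name__ == '__main__':
    import dumbo
    dumbo.run(mapper, reducer)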
I have been looking at using MapReduce to build a parallelized record combining system. The language doesn't matter, I can use a pre-existing library such as Hadoop or build my own if necessary, I'm not worried about that.
The problem that I keep running into, however, is that I need the records to be matched on multiple criteria. For e...
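One common shape for this in MapReduce is to emit each record once per criterion, so records sharing any one criterion value meet at the same reducer; a plain-Python sketch with hypothetical field names:

def map_record(record):
    # Emit the record under each criterion it has a value for.
    for criterion in ('email', 'phone', 'ssn'):   # hypothetical fields
        value = record.get(criterion)
        if value:
            yield (criterion, value), record

def reduce_matches(key, records):
    # All records here share one criterion value; groups that overlap
    # across criteria still need a second pass (e.g. union-find on ids).
    records = list(records)
    if len(records) > 1:
        yield key, records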
I hope I'm asking this in the right way. I'm learning my way around Elastic MapReduce and I've seen numerous references to the "Aggregate" reducer that can be used with "Streaming" job flows.
In Amazon's "Introduction to Amazon Elastic MapReduce" PDF it states "Amazon Elastic MapReduce has a default reducer called aggregate".
What I wo...
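For context, the aggregate reducer works by having the mapper prefix each output key with the name of an aggregation function; a streaming word-count mapper written for it (the job is then run with -reducer aggregate) looks roughly like this:

#!/usr/bin/env python
# The LongValueSum prefix tells the aggregate reducer to sum the
# long values emitted for each key.
import sys

for line in sys.stdin:
    for word in line.split():
        print('LongValueSum:%s\t1' % word)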
I'm about to start a MapReduce project which will run on AWS, and I am presented with a choice: to use either Java or C++.
I understand that writing the project in Java would make more functionality available to me, however C++ could pull it off too, through Hadoop Streaming.
Mind you, I have little background in either language. A simi...
I am trying to create a mapper only job via AWS (a streaming job).
The reducer field is required, so I am giving a dummy executable and adding -jobconf mapred.map.tasks=0 to the Extra Args box. In the Hadoop environment (version 0.20) I've installed, no reducer jobs launch, but on AWS the dummy executable launches and fails.
...
I am working on a project that deals with analyzing a very large amount of data, so I discovered MapReduce fairly recently, and before I dive any further into it, I would like to make sure my expectations are correct.
The interaction with the data will happen from a web interface, so response time is critical here; I am thinking a 10-1...
When I executed a MapReduce program in Eclipse using Hadoop, I got the error below.
It must be some change in the path, but I'm not able to figure it out.
Any ideas?
16:35:39 INFO mapred.JobClient: Task Id : attempt_201001151609_0001_m_000006_0, Status : FAILED
java.io.FileNotFoundException: File C:/tmp/hadoop-Shwe/mapred/local/taskTracker...
I'm looking at building some data warehousing/querying infrastructure, right now on top of Map/Reduce solutions like Hadoop.
However, it strikes me that all the M/R work is just repeating what the RDBMS guys have solved for the last 20 years with parallel SQL databases. Parallel SQL implementations scale reads and writes across nodes, j...
When I run a mapreduce program using Hadoop, I get the following error.
10/01/18 10:52:48 INFO mapred.JobClient: Task Id : attempt_201001181020_0002_m_000014_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
10/01/18 10:52:48 WARN map...
I want to know effective algorithms/data structures for answering the queries below over streaming data.
Consider real-time streaming data, like Twitter. I am mainly interested in the queries below rather than in storing the actual data.
I need my queries to run on the actual data but not on any duplicates.
As I am not i...
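For the duplicate-suppression part, the usual space-bounded answer is a Bloom filter: an exact seen-set works until memory runs out, after which a Bloom filter trades a small false-positive rate for bounded space. A minimal sketch (the hashing scheme is illustrative, not tuned):

import hashlib

class BloomFilter(object):
    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive `hashes` bit positions from salted MD5 digests.
        for i in range(self.hashes):
            h = hashlib.md5(('%d:%s' % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def unique(stream):
    # Yield each item once; may very rarely drop a genuinely new item
    # (a Bloom-filter false positive), but never yields a duplicate.
    seen = BloomFilter()
    for item in stream:
        if item not in seen:
            seen.add(item)
            yield item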