Hi,

Is there a distance calculation implementation using Hadoop map/reduce? I am trying to calculate the distances between a given set of points.

Looking for any resources ..

// Edit:

This is a very intelligent solution. I have tried something like the first algorithm, and I get almost what I was looking for. I am not concerned about optimizing the program at the moment, but my problem is that the dist(X, Y) function was not working: when all the points reach the reducer, I am unable to go through them with the Iterator and calculate the distances. Someone on stackoverflow.com told me that the Iterator in Hadoop is different from the normal Java Iterator; I am not sure about that. But if I can find a simple way to go through the Iterator in my dist() function, I can use your second algorithm to optimize. This is your code, and I am referring to it just to make my point clear:

map(x,y) {
  for i in 1:N  # number of points
     emit(i, (x,y))            # I did exactly like this
}

reduce (i, X) {
  p1 = X[i]
  for j in i:N
     emit(dist(X[i], X[j]))    # here is my problem, I can't get the values from the Iterator
}

Thanks,
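For what it is worth, here is a minimal sketch of a reducer using the old org.apache.hadoop.mapred API that walks the Iterator, buffers the points, and then runs the nested distance loop. The class name, the IntWritable key, and the "x,y" Text encoding of a point are assumptions for illustration, not the actual code from the question:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: each value is a point encoded as "x,y" in a Text.
public class DistReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, Text, DoubleWritable> {

  public void reduce(IntWritable key, Iterator<Text> values,
                     OutputCollector<Text, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    // The Iterator can be traversed only once, and Hadoop reuses the Text
    // object it hands back, so copy each value out before buffering it.
    List<double[]> points = new ArrayList<double[]>();
    while (values.hasNext()) {
      String[] xy = values.next().toString().split(",");
      points.add(new double[] { Double.parseDouble(xy[0]),
                                Double.parseDouble(xy[1]) });
    }
    // Nested loop over the buffered points: each unique pair exactly once.
    for (int i = 0; i < points.size(); i++) {
      for (int j = i + 1; j < points.size(); j++) {
        double dx = points.get(i)[0] - points.get(j)[0];
        double dy = points.get(i)[1] - points.get(j)[1];
        output.collect(new Text(i + "," + j),
                       new DoubleWritable(Math.sqrt(dx * dx + dy * dy)));
      }
    }
  }
}

The key point is simply that the values have to be copied out of the Iterator before the second loop, because the Iterator cannot be rewound.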

A: 

This problem does not sound like a good fit for map-reduce since you're not really able to break it into pieces and calculate each piece independently. If you could have a separate program that generates the complete graph of your points as a list (x1,y1,x2,y2) then you could do a straightforward map to get the distance.

Jieren
Thanks for the reply, but I don't understand the idea of a separate program to generate the complete graph of the points. Can you make this clearer so that I can apply your idea to my program?
tsegay
Well, you would want a program to create every unique combination of the points. You can sort of say that each point is a node in a graph and you want to generate the complete graph (http://en.wikipedia.org/wiki/Complete_graph) of the points. I don't know of any good way to do this with MapReduce, because every point needs to be aware of every other point. You're better off just using a nested loop.
Jieren
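As a rough illustration of the nested-loop approach Jieren describes, here is a small stand-alone sketch that writes out every unique pair as x1,y1,x2,y2; the point data is made up, and in practice it would be read from a file:

// Emits each edge of the complete graph of points exactly once.
public class PairGenerator {
  public static void main(String[] args) {
    double[][] points = { {0, 0}, {1, 2}, {3, 4} };   // stand-in data
    for (int i = 0; i < points.length; i++) {
      for (int j = i + 1; j < points.length; j++) {
        // One output line per unique pair: x1,y1,x2,y2
        System.out.println(points[i][0] + "," + points[i][1] + ","
            + points[j][0] + "," + points[j][1]);
      }
    }
  }
}

The resulting list could then be fed to a map-only job that computes one distance per record, as suggested in the answer above.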
A: 

You need to do a self-join on that data set. In Hive, that would look more or less like this:

select dist(P1.x,P1.y,P2.x, P2.y) from points P1 join points P2 on (True) where P1.x < P2.x or (P1.x = P2.x and P1.y < P2.y) 
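For reference, here is a minimal sketch of what such a dist function could look like as a Java UDF, assuming Euclidean distance; the package and class names are made up:

package example.udf;   // hypothetical package

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.DoubleWritable;

// Euclidean distance between (x1, y1) and (x2, y2).
public final class Dist extends UDF {
  public DoubleWritable evaluate(DoubleWritable x1, DoubleWritable y1,
                                 DoubleWritable x2, DoubleWritable y2) {
    if (x1 == null || y1 == null || x2 == null || y2 == null) {
      return null;   // propagate NULLs like the built-in functions do
    }
    double dx = x1.get() - x2.get();
    double dy = y1.get() - y2.get();
    return new DoubleWritable(Math.sqrt(dx * dx + dy * dy));
  }
}

It would be registered with something like ADD JAR dist-udf.jar; CREATE TEMPORARY FUNCTION dist AS 'example.udf.Dist'; (the jar name is assumed).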

The function dist would need to be implemented using other Hive functions, or written in Java and added as a UDF, along the lines sketched above. Also, I am not sure about the True constant, but you can write 0=0 to the same effect. The where clause is there to avoid computing the same distance twice, or zero distances. The question is: would Hive optimize this the way you can by programming carefully in Hadoop? I am not sure. Here is a sketch in Hadoop:

map(x,y) {
  for i in 1:N        # N = number of points
     emit(i, (x,y))
}

reduce(i, X) {
  p1 = X[i]
  for j in i:N
     emit(dist(X[i], X[j]))
}

For this to work you need X to get to the reducer sorted in some order, for instance by x and then by y, using secondary sort keys (that do not affect the grouping). This way every reducer gets a copy of all the points and works on one column of the distance matrix you are trying to generate. The memory requirements are minimal.

You could trade some communication for memory by reorganizing the computation so that every reducer computes a square submatrix of the final matrix, knowing only two subsets of the points and calculating the distances between all of them. To achieve this, you need to make the order of your points explicit, say you are storing i, x, y:

map(i,x,y) {
  for j in 1:N/k      # k = side of the submatrix
     emit((i/k, j), ("row", (x,y)))
     emit((j, i/k), ("col", (x,y)))
}

reduce((a,b), Z) {
  split Z into rows X and cols Y
  for x in X
     for y in Y
        emit(dist(x,y))
}

In this case you can see that the map phase emits only 2*N*N/k points, whereas the previous algorithm emitted N^2. Here we have (N/k)^2 reducers vs. N for the other one. Each reducer has to hold k values in memory (using the secondary-key technique to have all the rows reach the reducer before all the columns), vs. only 2 before. So you see there are tradeoffs, and for the second algorithm you can use the parameter k for performance tuning.
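To make the tradeoff concrete with made-up numbers: for N = 100,000 points and k = 1,000, the first sketch's map phase emits N^2 = 10^10 records to N = 100,000 reduce keys, while the second emits 2*N*N/k = 2*10^7 records to (N/k)^2 = 10,000 reduce keys, each of which has to buffer k = 1,000 row points.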

piccolbo
Thanks for the post, I have updated my question as a reply to your answer. Please refer to that for your reply.
tsegay
So this is now an iterator problem. I think if you look at the standard WordCount example, there is a loop using an iterator over the reduced values in the reducer. It looks pretty standard to me: you declare it as Iterator<T> and call next() and hasNext() on it (see http://wiki.apache.org/hadoop/WordCount). If you could be more specific as to what happens when you try to get your values, instead of what you are expecting, maybe I could be more helpful. Errors? The wrong values? Nothing? Can you share the declaration of the Iterator and the lines of code where you access it?
piccolbo
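For reference, the reduce step of the WordCount example piccolbo links to looks roughly like this (old org.apache.hadoop.mapred API), with the Iterator simply walked via hasNext()/next():

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // next() hands back the next value
    }
    output.collect(key, new IntWritable(sum));
  }
}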
Also, I think you edited away the definition of your problem, which takes away from the value of this conversation for other people. You are not supposed to edit questions this way. Adding clarifications is fine, and I thank you for the comments on my solution, but please restore the detailed problem definition and example.
piccolbo