Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end.

https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/

For instance, let's say I'm using Python and Pycassa: how would I load in a new MapReduce function, and then call it? Does my MapReduce function have to be Java that's installed on the Cassandra server? If so, how do I call it from Pycassa?

There's also mention of Pig making this all easier, but I'm a complete Hadoop noob, so that didn't really help.

Your answer can use Thrift or whatever; I just mentioned Pycassa to denote the client side. I'm trying to understand the difference between what runs in the Cassandra cluster vs. what runs on the machine actually making the requests.

+3  A: 

From what I've heard (and from here), the way a developer writes a MapReduce program that uses Cassandra as the data source is as follows. You write a regular MapReduce program (the example you linked to is the pure-Java version), and the jars that are now available provide a custom InputFormat that allows the input source to be Cassandra (instead of the default, which is HDFS).
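
For concreteness, here's a condensed sketch of roughly what that looks like, loosely following the linked word_count contrib example for Cassandra 0.6. The keyspace, column family, and column names are illustrative, cluster-contact configuration is omitted, and the exact signatures vary between Cassandra versions:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountSketch {
        static final String KEYSPACE = "Keyspace1";       // illustrative
        static final String COLUMN_FAMILY = "Standard1";  // illustrative

        // In 0.6 the mapper receives each row key plus the columns selected
        // by the slice predicate; otherwise it's an ordinary Hadoop Mapper.
        static class TokenizerMapper
                extends Mapper<String, SortedMap<byte[], IColumn>, Text, IntWritable> {
            @Override
            protected void map(String key, SortedMap<byte[], IColumn> columns, Context ctx)
                    throws IOException, InterruptedException {
                for (IColumn column : columns.values())
                    for (String word : new String(column.value()).split("\\s+"))
                        ctx.write(new Text(word), new IntWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(WordCountSketch.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path("/tmp/word_count_out"));

            // The Cassandra-specific part: read rows straight out of
            // Cassandra instead of HDFS, selecting one column per row.
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
            SlicePredicate predicate = new SlicePredicate()
                    .setColumn_names(Arrays.asList("text".getBytes()));
            ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

            job.waitForCompletion(true);
        }
    }

You'd build this into a jar and submit it to the Hadoop cluster with hadoop jar; the Cassandra nodes only serve the data, while the map/reduce tasks themselves run on the Hadoop tasktrackers (which is the point of the comment thread below).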

If you're using Pycassa I'd say you're out of luck until either (1) the maintainer of that project adds support for MapReduce or (2) you throw some Python functions together that write up a Java MapReduce program and run it. The latter is definitely a bit of a hack but would get you up and going.

Chris Bunch
So the Cassandra nodes aren't doing the MapReduce; it runs wherever your Java is running?
UltimateBrent
Yes, the Hadoop jobtracker schedules the m/r jobs onto its tasktrackers.
jbellis
So isn't the point of MapReduce that it's distributed? If it doesn't run on the Cassandra nodes, what's the point?
UltimateBrent
It's still distributed, just distributed over the Hadoop nodes in your system. The main point of the Cassandra interface is that, before, some people were dumping a subset of their Cassandra database to files, reading those in, running an MR job, and then dumping the results back into Cassandra. This removes most of that boilerplate.
Chris Bunch
+1  A: 

The big win of using a direct InputFormat from Cassandra is that it streams the data efficiently. Each input split covers a range of tokens and rolls off the disk at full bandwidth: no seeking, no complex querying. I don't think it knows about locality, though -- that is, having each tasktracker prefer input splits from a Cassandra process on the same node.

You can try using Pig with the STREAM operator as a hack until more direct Hadoop streaming support is in place.

mrflip
A: 

It does know about locality: the input splits that the Cassandra InputFormat produces override getLocations() to preserve data locality.
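
For anyone curious how that works mechanically, here is a hypothetical, stripped-down sketch (not Cassandra's actual source; the real split class is more involved) of how a Hadoop InputSplit advertises locality. The Hadoop scheduler uses the hostnames returned by getLocations() to prefer tasktrackers running on those machines:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;

    // Hypothetical, simplified Cassandra-style split over a token range.
    public class TokenRangeSplit extends InputSplit implements Writable {
        private String startToken;
        private String endToken;
        private String[] replicaEndpoints; // nodes holding this token range

        public TokenRangeSplit() {} // required for Writable deserialization

        public TokenRangeSplit(String startToken, String endToken, String[] replicaEndpoints) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.replicaEndpoints = replicaEndpoints;
        }

        @Override
        public long getLength() {
            return Long.MAX_VALUE; // a token range has no cheap byte-length estimate
        }

        @Override
        public String[] getLocations() {
            // The locality hook: the scheduler tries to run the task on a
            // tasktracker whose host appears in this list.
            return replicaEndpoints;
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(startToken);
            out.writeUTF(endToken);
            out.writeInt(replicaEndpoints.length);
            for (String endpoint : replicaEndpoints)
                out.writeUTF(endpoint);
        }

        public void readFields(DataInput in) throws IOException {
            startToken = in.readUTF();
            endToken = in.readUTF();
            replicaEndpoints = new String[in.readInt()];
            for (int i = 0; i < replicaEndpoints.length; i++)
                replicaEndpoints[i] = in.readUTF();
        }
    }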

Radha