ansaurus

Question

MapReduce on more than one datastore kind in Google App Engine

Answer 1

+2 A:

It's difficult to know what to recommend without more details of what you're actually calculating. One simple option is to simply fetch the related entity inside the map call - there's nothing preventing you from doing datastore operations there.

This will result in a lot of small calls, though. Writing a custom InputReader, as you suggest, will allow you to fetch both sets of entities in parallel, which will significantly improve performance.

If you give more details as to how you need to join these entities, we may be able to provide more concrete suggestions.

Nick Johnson 2010-09-22 12:27:38

Just added more information!

fjsj 2010-09-23 19:15:13

Answer 2

A:

Following Nick Johnson suggestion, I wrote my own InputReader. This reader fetch entities from two different kinds. It yields tuples with all combinations of these entities. Here it is:

class TwoKindsInputReader(InputReader):
    _APP_PARAM = "_app"
    _KIND1_PARAM = "kind1"
    _KIND2_PARAM = "kind2"
    MAPPER_PARAMS = "mapper_params"

    def __init__(self, reader1, reader2):
        self._reader1 = reader1
        self._reader2 = reader2

    def __iter__(self):
        for u in self._reader1:
            for e in self._reader2:
                yield (u, e)

    @classmethod
    def from_json(cls, input_shard_state):
        reader1 = DatastoreInputReader.from_json(input_shard_state[cls._KIND1_PARAM])
        reader2 = DatastoreInputReader.from_json(input_shard_state[cls._KIND2_PARAM])

        return cls(reader1, reader2)

    def to_json(self):
        json_dict = {}
        json_dict[self._KIND1_PARAM] = self._reader1.to_json()
        json_dict[self._KIND2_PARAM] = self._reader2.to_json()
        return json_dict

    @classmethod
    def split_input(cls, mapper_spec):
        params = mapper_spec.params
        app = params.get(cls._APP_PARAM)
        kind1 = params.get(cls._KIND1_PARAM)
        kind2 = params.get(cls._KIND2_PARAM)
        shard_count = mapper_spec.shard_count
        shard_count_sqrt = int(math.sqrt(shard_count))

        splitted1 = DatastoreInputReader._split_input_from_params(app, kind1, params, shard_count_sqrt)
        splitted2 = DatastoreInputReader._split_input_from_params(app, kind2, params, shard_count_sqrt)
        inputs = []

        for u in splitted1:
            for e in splitted2:
                inputs.append(TwoKindsInputReader(u, e))

        #mapper_spec.shard_count = len(inputs) #uncomment this in case of "Incorrect number of shard states" (at line 408 in handlers.py)
        return inputs

    @classmethod
    def validate(cls, mapper_spec):
        return True #TODO

This code should be used when you need to process all combinations of entities of two kinds. You can also generalize this for more than two kinds.

Here it is a valid the mapreduce.yaml for TwoKindsInputReader:

mapreduce:
- name: recommendationMapReduce
  mapper:
    input_reader: customInputReaders.TwoKindsInputReader
    handler: recommendation.calculateCorrelationHandler
    params:
    - name: kind1
      default: kinds.User
    - name: kind2
      default: kinds.Item
    - name: shard_count
      default: 16

fjsj 2010-10-20 13:38:48

ansaurus

tags:

views:

answers:

MapReduce on more than one datastore kind in Google App Engine

related questions