I am trying to use an early experimental release of the appengine-mapreduce mapper implementation to empty the datastore. This solution was proposed in a similar SO question.

This is the AppEngineMapper I am currently using. It just deletes the entity.

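package com.google.appengine.demos.mapreduce;

import java.io.IOException;
import java.util.logging.Logger;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;
import com.google.appengine.tools.mapreduce.DatastoreMutationPool;

public class EmptyFixesMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {

    // Logger used by map() below.
    private static final Logger log = Logger.getLogger(EmptyFixesMapper.class.getName());

    public EmptyFixesMapper() {
    }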

    @Override
    public void taskSetup(Context context) {
    }

    @Override
    public void taskCleanup(Context context) {
    }

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
    }

    @Override
    public void cleanup(Context context) {
        getAppEngineContext(context).flush();
    }

    @Override
    public void map(Key key, Entity value, Context context) {
        log.warning("Mapping key: " + key);

        DatastoreMutationPool mutationPool = 
                    this.getAppEngineContext(context).getMutationPool();
        mutationPool.delete(value.getKey());
    }
}

This is my mapreduce.xml configuration file:

<configurations>
    <configuration name="Empty Entities">
        <property>
            <name>mapreduce.map.class</name>
            <value>com.google.appengine.demos.mapreduce.EmptyFixesMapper</value>
        </property>
        <property>
            <name>mapreduce.inputformat.class</name>
            <value>com.google.appengine.tools.mapreduce.DatastoreInputFormat</value>
        </property>
        <property>
            <name human="Entity Kind to Map Over">mapreduce.mapper.inputformat.datastoreinputformat.entitykind</name>
            <value template="optional">Fix</value>
        </property>
    </configuration>
...

When I open the mapreduce control panel at mydomain/mapreduce/status, I can launch the tasks, but they never complete. This screenshot shows the field "0/0 shards":

mapreduce control panel

I can also see that some tasks are created in the App Engine default task queue, with a lot of retries:

appengine task queue

And finally, in my GAE application logs I see:

1. 09-11 03:23AM 08.556 /mapreduce/mapperCallback 500 10081ms 0cpu_ms 0kb AppEngine-Google; (+http://code.google.com/appengine)

  0.1.0.2 - - [11/Sep/2010:03:23:18 -0700] "POST /mapreduce/mapperCallback HTTP/1.1" 500 0 "http://xxx.appspot.com/mapreduce/command/start_job" "AppEngine-Google; (+http://code.google.com/appengine)" "xxx.appspot.com" ms=10081 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000057 queue_name=default task_name=worker-attempt-1284198892815-0001-m-000002-1--0

2. W 09-11 03:23AM 18.638

  Request was aborted after waiting too long to attempt to service your request. This may happen sporadically when the App Engine serving cluster is under unexpectedly high or uneven load. If you see this message frequently, please contact the App Engine team.

What could be happening? I'm sure I've followed steps described in the getting started guide, and I have less than 1000 entities in the datastore...

A: 

Well, the problem has nothing to do with appengine-mapreduce. I was securing the /mapreduce/** URIs, so the tasks in the default task queue could not reach /mapreduce/mapperCallback, /mapreduce/command/start_job, etc., because task queue requests carry no username/password information.

It is an interesting issue anyway, because I don't really want to open /mapreduce/** to everyone...
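
One option that may fit here, assuming the restriction can be declared as a standard security constraint in web.xml rather than in an application-level security framework: App Engine lets requests issued by the task queue through to URLs restricted to the admin role, so limiting /mapreduce/* to admin should keep the callbacks reachable while still blocking ordinary visitors. A minimal sketch (the URL pattern is an assumption based on the paths above; adjust it to your own mapping):

```xml
<!-- Sketch only: goes inside <web-app> in WEB-INF/web.xml. -->
<!-- Assumes the mapreduce servlet is mapped under /mapreduce/*. -->
<security-constraint>
    <web-resource-collection>
        <url-pattern>/mapreduce/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
        <!-- "admin" is App Engine's built-in role for application administrators; -->
        <!-- task queue requests are allowed through this constraint. -->
        <role-name>admin</role-name>
    </auth-constraint>
</security-constraint>
```

With a constraint like this, the status page at /mapreduce/status would still require signing in as an application administrator. Any other security layer that guards the same paths would need to exclude /mapreduce/**, otherwise it would keep rejecting the task requests.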

Guido