+1  A: 

There is no way "around" the deadline exception other than making your code execute within the allotted timeframe.

Amber
The way around it is to redesign your code to fit the App Engine infrastructure and leverage it to run the work as small batches across tons of machines.
TheJacobTaylor
+1  A: 

I have had great success with datetimes on GAE.

from datetime import datetime, timedelta
time_start = datetime.now()
# ... do a chunk of work ...
time_taken = datetime.now() - time_start

time_taken will be a timedelta. You can compare it against another timedelta that has the duration you are interested in.

ten_seconds = timedelta(seconds=10)
if time_taken > ten_seconds:
    pass  # ...do something quick: save progress and reschedule

It sounds like you would be far better served using mapreduce or Task Queues. Both are great fun for dealing with huge numbers of records.

A cleaner pattern for the code you have is to fetch only some records.

nobranches=TreeNode.all().fetch(100)

This code will pull only 100 records. If you got a full 100, then when you are done you can throw another task on the queue to launch off more.
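
A sketch of that chaining pattern, using a datastore cursor so each task picks up where the last one stopped (the /worker URL and handle() function are assumptions, not code from your question):

    from google.appengine.api import taskqueue

    def process_batch(cursor=None):
        query = TreeNode.all()
        if cursor:
            query.with_cursor(cursor)
        nodes = query.fetch(100)
        for node in nodes:
            handle(node)  # hypothetical per-node processing
        if len(nodes) == 100:
            # A full batch means there may be more; chain another task.
            taskqueue.add(url='/worker', params={'cursor': query.cursor()})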

-- Based on comment about needing trees without branches --

I do not see your model up there, but if I were trying to create a list of all of the trees without branches and process them, I would:

  • Fetch the keys only for trees, in blocks of 100 or so.
  • Fetch all of the branches that belong to those trees using an IN query, ordered by the tree key.
  • Scan the list of branches; the first time you find a tree's key, pull that tree key from the list.
  • When done, you will have a list of "branchless" tree keys. Schedule each one of them for processing.
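
A hedged sketch of that scan; since your model is not shown, the Branch kind and its tree reference property here are assumptions:

    from google.appengine.ext import db

    def branchless_tree_keys(batch_size=25):
        # Keys-only fetch for a block of trees. Keep blocks small: an
        # IN filter is limited to 30 values on App Engine.
        tree_keys = TreeNode.all(keys_only=True).fetch(batch_size)
        candidates = set(tree_keys)

        # One IN query pulls every branch belonging to these trees.
        branches = Branch.all().filter('tree IN', tree_keys).fetch(1000)
        for branch in branches:
            # Read the raw key without dereferencing the ReferenceProperty.
            owner = Branch.tree.get_value_for_datastore(branch)
            candidates.discard(owner)

        # Whatever keys remain belong to trees with no branches.
        return candidates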

A simpler version is to use MapReduce on the trees: for each tree, find one branch that matches its ID; if you cannot, flag the tree for follow-up. By default, the mapper will pull batches of trees (I think 25) with 8 simultaneous workers, and it manages the job queues internally so you don't have to worry about timing out.
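
Something like this, as a sketch against the App Engine mapreduce library (reusing the hypothetical Branch model from above; the flagged property is also an assumption):

    from mapreduce import operation as op

    def flag_branchless(tree):
        # Mapper: called once per TreeNode; batching and scheduling are
        # handled by the mapreduce framework.
        branch = Branch.all().filter('tree =', tree.key()).get()
        if branch is None:
            tree.flagged = True  # hypothetical follow-up flag
            yield op.db.Put(tree)  # the framework batches these puts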

Cheers, Jacob

TheJacobTaylor
Thank you, datetime seems to be working better. I can't use `nobranches=TreeNode.all().fetch(100)` since my for loop after that looks only for nodes whose branches are `[]` (`for tree in nobranches: if tree.branches==[]:`); using fetch(100) would return the same 100 nodes every time, and I want to add new untouched branches. I wish I could get the nodes without branches in GQL, but this seems to be the only way.
Venkat S. Rao
I would say either use map-reduce on the whole beast or create a filtered query that retrieves only the nodes you need. I also noticed that you are retrieving records one by one in add_branches; if you get all of the child_node records in one round trip, it should speed up your function.
TheJacobTaylor
Glad datetime is working for you.
TheJacobTaylor
+1  A: 

When a DeadlineExceededError happens, you want the request to eventually succeed if it is called again. This may require guaranteeing that your crawling state makes some progress on each attempt, so that completed work can be skipped the next time. (Not addressed here.)

Parallelized calls can help tremendously.

  • Urlfetch
  • Datastore put (batch mixed entities together using db.put)
  • Datastore Query (queries in parallel - asynctools)

Urlfetch:

  • When you make your urlfetch calls, be sure to use the asynchronous mode to collapse your loop.
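
A minimal sketch of the asynchronous pattern (the urls list here stands in for whatever you are crawling):

    from google.appengine.api import urlfetch

    # Start all fetches at once instead of waiting on each in turn.
    rpcs = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=10)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append(rpc)

    # Collect results; total wall time is roughly the slowest fetch.
    results = [rpc.get_result() for rpc in rpcs]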

Datastore

  • Combine Entities being put into a single round trip call.

    # put newNodes and tree in the same round trip
    db.put(newNodes + [tree])
    
  • Pull TreeNode.gql from inside the loop up into a parallel query tool like asynctools: http://asynctools.googlecode.com

Asynctools Example

    if pyNode is not None:

        runner = AsyncMultiTask()
        titles = []
        for child in pyNode:
            title = child.attributes["title"].value
            titles.append(title)
            query = db.GqlQuery("SELECT __key__ FROM TreeNode WHERE name = :1", title)
            runner.append(QueryTask(query, limit=1, client_state=title))

        # kick off the work
        runner.run()

        # peel out the results
        treeNodes = []
        for task in runner:
            # get_result() re-raises any exception that occurred for the
            # given query; a keys-only query with limit=1 should yield a
            # list containing at most one key
            result = task.get_result()
            treeNodes.append(result[0] if result else None)

        newNodes = []
        for title, node in zip(titles, treeNodes):
            if node is None:
                # no existing TreeNode with this name, so create one
                newNodes.append(TreeNode(name=title))
            else:
                # the keys-only query already returned a key
                tree.branches.append(node)

        # batch-put the new nodes first; unsaved entities have no keys
        # to append to tree.branches until they are stored
        new_keys = db.put(newNodes)
        tree.branches.extend(new_keys)
        for node in newNodes:
            self.log.debug("Node Added: %s" % node.name)

        db.put(tree)
        return tree.branches

DISCLOSURE: I am associated with asynctools.

kevpie
+1  A: 

The problem here is that you're doing a query operation for every link in your document. Since Wikipedia pages can contain a lot of links, this means a lot of queries, and hence you run out of processing time. This approach will also consume your quota at a fantastic rate!

Instead, you should use the page name of the Wikipedia page as the key name of the entity. Then, you can collect up all the links from the document into a list, construct keys from them (which is an entirely local operation), and do a single batch db.get for all of them. Once you've updated and/or created them as appropriate, you can do a batch db.put to store them all to the datastore - reducing your total datastore operations from numlinks*2 to just 2!
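
A sketch of that pattern; the Page model and store_links helper are illustrative assumptions, not Nick's code:

    from google.appengine.ext import db

    class Page(db.Model):
        # hypothetical model, keyed by Wikipedia page name
        links = db.StringListProperty()

    def store_links(link_names):
        # Building keys from page names is an entirely local operation.
        keys = [db.Key.from_path('Page', name) for name in link_names]

        # Datastore operation 1: a single batch get for all linked pages.
        pages = db.get(keys)

        to_put = []
        for name, page in zip(link_names, pages):
            if page is None:
                page = Page(key_name=name)  # create pages we haven't seen
            # ...update the page as appropriate...
            to_put.append(page)

        # Datastore operation 2: a single batch put stores everything.
        db.put(to_put)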

Nick Johnson
Agree! For added bonus points, I think you might be able to stick a yield in front of that db.put to allow the put to be asynchronous. (I know you can in mapreduce workers).
TheJacobTaylor
Putting `yield` in front of a normal `db.put` operation does not magically turn it into an asynchronous operation. What you `yield` from `mapreduce` workers is a "special" put operation, and the `mapreduce` framework itself is built to work efficiently with the generators created by yielding these special operations.
Will McCutchen