views:

1078

answers:

4

I'm using the following view function to iterate over all items in the database (in order to find a tag), but I think the performance is very poor if the dataset is large. Any other approach?

def by_tag(tag):
return  '''
        function(doc) {
            if (doc.tags.length > 0) {
                for (var tag in doc.tags) {
                    if (doc.tags[tag] == "%s") {
                        emit(doc.published, doc)
                    }
                }
            }
        };
        ''' % tag
+1  A: 

You are very much on the right track with the view. A list of thoughts though:

View generation is incremental. If you're read traffic is greater than you're write traffic, then your views won't cause an issue at all. People that are concerned about this generally shouldn't be. Frame of reference, you should be worried if you're dumping hundreds of records into the view without an update.

Emitting an entire document will slow things down. You should only emit what is necessary for use of the view.

Not sure what the val == "%s" performance would be, but you shouldn't over think things. If there's a tag array you should emit the tags. Granted if you expect a tags array that will contain non-strings, then ignore this.

Paul J. Davis
Hi Paul, thanks for the points. Does temporary view is also generated incrementally? as Jan L. mentioned in another thread that, temp-views should not be used in production. Dose it means I have to create perm-views for each of the tags I have?
Senmiao Liu
Senmiao,The problem with temp views is that they're temporary. Once there are no clients using them they get deleted and the next request requires the entire view to be rebuilt.
Paul J. Davis
Try not to think of views as an SQL query. A view should generally be designed to answer a class of queries rather than a specific query. In your case, you want one view for all tags, not one view for each tag.
Paul J. Davis
Thanks Paul, one view for one tag, that helps.
Senmiao Liu
+7  A: 

Disclaimer: I didn't test this and don't know if it can perform better.

Create a single perm view:

function(doc) {
  for (var tag in doc.tags) {
    emit([tag, doc.published], doc)
  }
};

And query with _view/your_view/all?startkey=['your_tag_here']&endkey=['your_tag_here', {}]

Resulting JSON structure will be slightly different but you will still get the publish date sorting.

Bahadır
Indeed! Using `startkey` and such would be the 'correct' way to do this.
thenduks
Shouldn't it be emit([doc.tags[tag], doc.published], doc)? otherwise, if I'm not mistaken, you'll just be emitting the indices.
Markus Olsson
+3  A: 

You can define a single permanent view, as Bahadir suggests. when doing this sort of indexing, though, don't output the doc for each key. Instead, emit([tag, doc.published], null). In current release versions you'd then have to do a separate lookup for each doc, but SVN trunk now has support for specifying "include_docs=True" in the query string and CouchDB will automatically merge the docs into your view for you, without the space overhead.

Nick Johnson
A: 
# Works on CouchDB 0.8.0
from couchdb import Server # http://code.google.com/p/couchdb-python/

byTag = """
function(doc) {
if (doc.type == 'post' && doc.tags) {
    doc.tags.forEach(function(tag) {
        emit(tag, doc);
    });
}
}
"""

def findPostsByTag(self, tag):
    server = Server("http://localhost:1234")
    db = server['my_table']
    return [row for row in db.query(byTag, key = tag)]

The byTag map function returns the data with each unique tag in the "key", then each post with that tag in value, so when you grab key = "mytag", it will retrieve all posts with the tag "mytag".

I've tested it against about 10 entries and it seems to take about 0.0025 seconds per query, not sure how efficient it is with large data sets..

dbr
It'll be more efficient if you use a permanent view as suggested by Bahadır.
Carlos Villela