views: 611

answers: 6

How is it possible to implement an efficient search of a large SQLite db (more than 90000 entries)?

I'm using Python and the SQLObject ORM:

    import re
    ...

    def search1():
        cr = re.compile(ur'foo')
        for item in Item.select():
            if cr.search(item.name) or cr.search(item.skim):
                print item.name

This function takes more than 30 seconds to run. How can I make it faster?

UPD: The test:

    for item in Item.select():
        pass

... takes almost the same time as my initial function (0:00:33.093141 vs. 0:00:33.322414), so the regexes consume almost no time.

A Sqlite3 shell query:

    select '' from item where name like '%foo%';

runs in about a second, so the main cost is the ORM's inefficient data retrieval from the db. I guess SQLObject fetches entire rows here, while SQLite touches only the necessary fields.

+3  A: 

The best way would be to rework your logic to do the selection in the database instead of in your Python program.

Instead of doing a bare Item.select(), you should rework it to do something like Item.select("""name LIKE ...""") so the filtering happens in SQL; see the sketch below.

If you do this, and make sure you have the name and skim columns indexed, it will return very quickly. 90000 entries is not a large database.
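
For example, a minimal sketch using SQLObject's string-form WHERE clause (table and column names taken from the question):

    def search1():
        # let SQLite do the matching instead of fetching every row
        for item in Item.select("name LIKE '%foo%' OR skim LIKE '%foo%'"):
            print item.name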

Reed Copsey
Alas, indexing won't help for a "LIKE '%foo%'", which appears to be what @Alex needs :-(
Alex Martelli
Ahh, true. It'll still be much faster than returning each record (constructing all of the objects) and doing the search in application code with regular expressions.
Reed Copsey
If the search is one that can be done against a straight value, instead of a LIKE expression, indexing will help. Given that this was using "foo", I'm guessing the real code is slightly different.
Reed Copsey
A: 

If you really need to use a regular expression, there's not really anything you can do to speed that up tremendously.

The best thing would be to write an SQLite function that performs the comparison for you in the db engine, instead of in Python.
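
For example, a minimal sketch using the sqlite3 module's create_function() to expose a REGEXP operator (db filename assumed; note the regex itself still runs in Python per row, but this skips the ORM's object construction):

    import re
    import sqlite3

    def regexp(pattern, value):
        # SQLite rewrites "x REGEXP y" as a call to regexp(y, x)
        return value is not None and re.search(pattern, value) is not None

    conn = sqlite3.connect('items.db')  # db filename assumed
    conn.create_function('regexp', 2, regexp)

    for (name,) in conn.execute(
            "SELECT name FROM item WHERE name REGEXP ? OR skim REGEXP ?",
            ('foo', 'foo')):
        print name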

You could also switch to a db server like PostgreSQL, which supports SIMILAR TO.

http://www.postgresql.org/docs/8.3/static/functions-matching.html

Christopher
+1  A: 

30 seconds to fetch 90,000 rows might not be all that bad.

Have you benchmarked the time required to do the following?

    for item in Item.select():
        pass

Just to see whether the time is DB time, network time, or application time.

If your SQLite DB is physically very large, you could be looking at -- simply -- a lot of physical I/O to read all that database stuff in.

S.Lott
@S.Lott: The test you're suggesting takes almost the same time as my initial function (0:00:33.093141 vs. 0:00:33.322414), so the regexes consume almost no time. A "select '' from item where name like '%foo%'" query runs in about a second in the sqlite3 shell. So the main cost is the ORM's inefficient data retrieval from the db. I guess SQLObject fetches entire rows here, while SQLite touches only the necessary fields.
Alex
Not surprising. Here's the golden rule: "Before optimizing, measure." Since your question didn't have any measurements, that had to be the first step. Please update your question with this benchmark information.
S.Lott
A: 

Given your example and expanding on Reed's answer, your code could look a bit like the following:

    import sqlalchemy.sql.expression as expr

    ...

    def search1():
        searchStr = u'foo'
        # .contains() wraps the value in %...% and emits a LIKE clause
        whereClause = expr.or_(itemsTable.c.nameColumn.contains(searchStr),
                               itemsTable.c.skimColumn.contains(searchStr))
        # iterating requires executing the select;
        # assumes itemsTable is bound to an engine
        for item in itemsTable.select().where(whereClause).execute():
            print item.nameColumn

which translates to

    SELECT * FROM items WHERE name LIKE '%foo%' OR skim LIKE '%foo%'

This will have the database do all the filtering work for you instead of fetching all 90000 records and doing possibly two regex operations on each record.

You can find some info on the .contains() method here: http://bit.ly/zj8Pb

As well as the SQLAlchemy SQL Expression Language Tutorial here: http://bit.ly/10NIC7

Of course, the example above assumes variable names for your itemsTable and the columns it has (nameColumn and skimColumn).

pboucher
A: 

I would definitely take Reed's suggestion to push the filter into the SQL (forget the index part, though).

I do not think that selecting only specific fields rather than all fields makes a difference (unless you have a lot of large fields). I would bet that SQLObject creates/instantiates 90K objects and puts them into a Session/UnitOfWork for tracking, which could definitely take some time.

Also, if you do not need the objects in your session, there should be a way to select just the fields you need via a custom query, so that no Item objects are created, only tuples; see the sketch below.
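
For instance, a minimal sketch using SQLObject's raw connection, whose queryAll() returns plain tuples instead of Item instances (table and column names assumed from the question):

    conn = Item._connection
    rows = conn.queryAll(
        "SELECT name FROM item WHERE name LIKE '%foo%' OR skim LIKE '%foo%'")
    for (name,) in rows:
        print name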

van
A: 

Initially, doing regex matching via Python was considered for y_serial, but that was dropped in favor of SQLite's GLOB (which is far faster). GLOB is similar to LIKE except that its syntax is more conventional: * instead of %, ? instead of _.
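
A minimal sketch of the GLOB approach via the sqlite3 module (db filename and schema assumed from the question):

    import sqlite3

    conn = sqlite3.connect('items.db')  # db filename assumed
    # GLOB uses * and ? as wildcards and is case-sensitive
    for (name,) in conn.execute(
            "SELECT name FROM item WHERE name GLOB '*foo*' OR skim GLOB '*foo*'"):
        print name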

See the Endnotes at http://yserial.sourceforge.net/ for more details.

code43