ansaurus

Question

How to optimize Core Data query for full text search

Answer 1

+2 A:

Dive in.

Here's one way to go about it:

Put your records into a Core Data persistent store
Use NSFetchedResultsController to manage a result set on your Word entities (Core Data equivalent with SQL "words" table)
Use UISearchDisplayController to apply an NSPredicate on the result set in real time

Once you have a result set via NSFetchedResultsController, it is quite easy to apply a predicate. In my experience it will be responsive, too. For example:

if ([self.searchBar.text length]) {
    _predicate = [NSPredicate predicateWithFormat:[NSString stringWithFormat:@"(word contains[cd] '%@')", self.searchBar.text]];
    [self.fetchedResultsController.fetchRequest setPredicate:_predicate];
}

NSError *error;
if (![self.fetchedResultsController performFetch:&error]) {
    // handle error...
}
NSLog(@"filtered results: %@", [self.fetchedResultsController fetchedObjects]);

will filter the result set [self.fetchedResultsController fetchedObjects] on the fly, doing a case-insensitive search on word.

Alex Reynolds 2009-11-21 04:49:11

Thanks for your reply. I'm just now writing up the command-line tool to get the initial sqlite data loaded into an xcdatamodel compliant db. Substantial labor involved. I'll report back on my experience.

dk 2009-11-23 17:38:02

To follow up on your example, I think the problem is that a fetch request wouldn't be on the Word entity, but on the textTable entity. (E.g. suppose textTable contains email messages and Word contains all the words in all the email fields.) I think this significantly complicates the matter because the fetchResultsController must hold textTable entities that are filtered via a predicate -- and such a ANY or SUBQUERY predicate is slow.Maybe there is a way to do this in the "opposite" direction: by starting w/ Word matches, following the inverse relationship, and uniquifying textTable. Hmmm.

dk 2009-12-11 06:27:33

If the first portion of your predicate reduces the search space as much as possible, the remainder of the predicate will perform faster, overall, with less space it has to search within. Take a peek at the performance section of the Core Data guide here: http://developer.apple.com/mac/library/documentation/cocoa/conceptual/CoreData/Articles/cdPerformance.html#//apple_ref/doc/uid/TP40003468

Alex Reynolds 2009-12-11 06:47:24

Answer 2

+1 A:

To follow up on this question, I've found that querying is dog slow using Core Data. I've scratched my head on this one for many hours.

As in the SQL example in my question, there are two entities: textTable and words where words contains each word, it is indexed, and there is a many-to-many relationship between textTable and words. I populated the database with a mere 4000 words and 360 textTable objects. Suppose the textTable relationship to the words object is called searchWords, then I can use a predicate on the textTable entity that looks like

predicate = [NSPredicate predicateWithFormat:@"ANY searchWords.word BEGINSWITH %@", query];

(I can add conjunctions of this predicate for multiple query terms.)

On the iPhone this query takes multiple seconds. The response for my hand-coded SQL using a larger test set was instant.

But this isn't even the end of it. There are limitations to NSPredicate that make rather simple queries slow and complex. For example, imagine in the above example that you want to filter using a scope button. Suppose the words entity contains all words in all text fields, but the scope would limit it to words from specific fields. Thus, words might have a "source" attribute (e.g. header and message body of email).

Naturally, then, a full text would ignore the source attribute, as in the example above, but a filtered query would limit the search to a particular source value. This seemingly simple change requires a SUBQUERY. For example, this doesn't work:

ANY searchWords.word BEGINSWITH "foo" AND ANY searchWords.source = 3

because the entities that are true for the two expressions can be different. Instead, you have to do something like:

SUBQUERY(searchWords, $x, $x.word BEGINSWITH "foo" AND $x.source = 3).@count > 0

I've found that these subqueries are, perhaps not surprisingly, slower than predicates using "ANY".

At this point I'm very curious how Cocoa programmers efficiently use Core Data for full text search because I'm discouraged by both the speed of predicate evaluation and expressibility of NSPredicates. I've run up against a wall.

dk 2009-12-11 06:12:21

Consider taking a look at the performance section here: http://developer.apple.com/mac/library/documentation/cocoa/conceptual/CoreData/Articles/cdPerformance.html#//apple_ref/doc/uid/TP40003468

Alex Reynolds 2009-12-11 08:04:03

Thanks for that link. From there I discovered that the executable argument "-com.apple.CoreData.SQLDebug 1" will send sqlite debug to stderr. From that dump I saw the query. There was nothing actually wrong with the query, but because the word <=> textTable relationship is many-to-many there is a relationship table to join. Thus, the query must join across 3 tables. When I removed the inverse the query now runs much faster on the iPhone hardware! Alas, the new schema has the foreign key in the Word table, thus the word itself and metadata are repeated for each occurrence. Space wasted.

dk 2009-12-11 17:17:27

You may gain speed, but Apple recommends maintaining inverse relationships to maintain data integrity. "You should typically model relationships in both directions, and specify the inverse relationships appropriately. Core Data uses this information to ensure the consistency of the object graph if a change is made (see “Manipulating Relationships and Object Graph Integrity”)." Take a look here for more info: http://developer.apple.com/DOCUMENTATION/Cocoa/Conceptual/CoreData/Articles/cdRelationships.html

Alex Reynolds 2009-12-16 22:02:26

http://stackoverflow.com/questions/2197496/fixing-predicated-nsfetchedresultscontroller-nsfetchrequest-performance-with-sqli - here's a related thread about many-to-many relations and SQLite performance. Forcing to pick between performance and integrity (by removing inverse relationship) is not acceptable. I need to have both. This problem seems specific to SQLite backend, smaller projects may be able to fix it by just using a binary store.

Jaanus 2010-02-04 19:45:18

Answer 3

+2 A:

After struggling with this same issue, I ran across a series of posts where the author had the same problem and came up with this solution. He reports an improvement from 6-7 second search time to between 0.13 and 0.05 seconds.

His dataset for FTS was 79 documents (file size 175k, 3600 discrete tokens, 10000 references). I haven't yet tried his solution, but thought I'd post ASAP. See also Part 2 of his posts for his documentation of the problem and Part 1 for his documentation of the dataset.

jluckyiv 2010-02-21 15:58:33

The problem I have with this solution is that the query and keyword must be an exact match. For real time results you want any keyword prefix to match the query. In that case it's not possible to use the object instead of the string in the predicate.

dk 2010-02-23 05:00:28

Tried implementing this myself and got no improvement, probably because I was using contains[cd]. I gave up and started with sqlite3 fts. Peter, thanks for the extra links. I was limited to only one.

jluckyiv 2010-03-16 05:49:05

Answer 4

+1 A:

I made a workaround solution. I think it's similar to this post. I added the amalgamation source code to my Core Data project, then created a full-text search class that was not a managed object subclass. In the FTS class I #import "sqlite3.h" (the source file) instead of the sqlite framework. The FTS class saves to a different .sqlite file than the Core Data persistent store.

When I import my data, the Core Data object stores the rowid of the related FTS object as an integer attribute. I have a static dataset, so I don't worry about referential integrity, but the code to maintain integrity should be trivial.

To perform FTS, I MATCH query the FTS class, returning a set of rowids. In my managed object class, I query for the corresponding objects with [NSPredicate predicateWithFormat:@"rowid IN %@", rowids]. I avoid traversing any many-to-many relationships this way.

The performance improvement is dramatic. My dataset is 142287 rows, comprising 194MB (Core Data) and 92MB (FTS with stopwords removed). Depending on the search term frequency, my searches went from several seconds to 0.1 seconds for infrequent terms (<100 hits) and 0.2 seconds for frequent terms (>2000 hits).

I'm sure there are myriad problems with my approach (code bloat, possible namespace collisions, loss of some Core Data features), but it seems to be working.

jluckyiv 2010-04-10 21:18:43

ansaurus

tags:

views:

answers:

How to optimize Core Data query for full text search

related questions