views:

61

answers:

3

I'm using Core Data to store a lot (1000s) of items. A pair of properties on each item are used to determine uniqueness, so when a new item comes in, I compare it against the existing items before inserting it. Since the incoming data is in the form of an RSS feed, there are often many duplicates, and the cost of the uniquing step is O(N^2), which has become significant.

Right now, I create a set of existing items before iterating over the list of (possible) new items. My theory is that on the first iteration, all the items will be faulted in, and assuming we aren't pressed for memory, most of those items will remain resident over the course of the iteration.

I see my options thusly:

  1. Use string comparison for uniquing, iterating over all "new" items and comparing to all existing items (Current approach)
  2. Use a predicate to filter the set of existing items against the properties of the "new" items.
  3. Use a predicate with Core Data to determine uniqueness of each "new" item (without retrieving the set of existing items).

Is option 3 likely to be faster than my current approach? Do you know of a better way?

+2  A: 

A comparison of integer hash values — generate the hash of the components of an incoming RSS news item, use a predicate to filter all items with the same hash attribute value, and see if any object is fetched — may yield better performance than string comparisons of all the attributes within every entity.

Alex Reynolds
I wound up going this route and it seems pretty performant so far. Thanks for the help.
warrenm
A: 

As per Alex's answer, a predicate on an integer property should be quicker, but the strategy should be adjusted to better suit the task:

  1. collect a list of all incoming item hashes

  2. fetch all objects matching that hash list (only fetch the hash property)

  3. iterate over the incoming items, skipping those that have their hash in the fetched matches

Additionally you can fetch a dictionary result to avoid having the managed objects set up that you will not use (unless you intend on updating the existing objects instead of just skipping the identical incoming items)

ohhorob
+1  A: 

The third step of ohhorob's proposed solution is probably most efficently implemented as described in the Core Data docs in the section 'Implementing Find-or-Create Efficiently'. That is, sorting both the incoming items and their correspondant existing items after the hash property, then looping over the two collections in parallell.

niblha
That's an outstanding document. I don't know how I haven't seen it before.
warrenm