views:

1059

answers:

4

If you have an answer that is not Java / SQLite related, I'd be pleased to read it.

The environment

I store items in a database with the following scheme :

###################
#       Item      #    
###################
#      _id        #    This is the primary key
#    parent_id    #    If set, it the ID of the item containing this item
#      date       #    An ordinary date
#  geocontext_id  #    Foreign key to a pair of named coordinates
###################

###################
#   Geocontext    #    
###################
#       _id       #    This is the primary key
#       name      #    Way for the user to label a pair of coordinates (e.g : "home", "work")
#         x       #    One of the coordinate
#         y       #    The other one
###################

The problem

I must filter the items according to the geocontext and the date. It would be a easy job if items were all at the same level, but the trick is it's recursive. E.G :

root
      |_item 1
      |_item 2 
      |      |_item 4
      |      |_item 5
      |             |_item 6
      |_item 3
      |      |_item 8
      |             |_item 10
      |_item 11
      |       |_item 12
      |_item 7

There is no explicit limit to the recursive depth.

Now, if we are in any node and filter with the date "1st of April", we must see not only the items directly contained in the node that match the date, but we must see the items that contain items matching the date as well.

E.G : We are in "Items 2", if "item 6" matches the date, then we consider "item 5" matches the date too and we must keep it. If we are at the root, then item 2 must be displayed.

Same goes for the geocontext, but it's even harder because  :

  • It's stored in an other table.
  • Matching the context is a costly mathematical computation.

Of course, brute forcing the matching would result in the software to be slow and a very poor user experience.

NOTE : I don't need to display a tree. I display a list of filtered data from a tree. We must see only a flat list of the top elements. The challenge is to decide whether to display each element or not, according to all the children hierachy.

How I tried to solve it

I thought I could ease a bit the problem by using more tables to cache flat data :

###################
# Geocontex_cache #    
###################
#     item_id     #     I can Join the items table on this field
#     child_id    #     I can delete / update a child, and so delete / update the cache
#  geocontext_id  #     I can delete / update a geocontext, and so delete / update the cache
#        x        #      Here, I can brute force :-)
#        y        # 
###################

###################
#    Date_cache   #    
###################
#     item_id     #     
#     child_id    #    
#       date      #    
###################

This seems reasonable, but I haven't tried it yet. Nevertheless, it should the following drawbacks :

  • I moved the costly process to the get / set / create / delete methods that will have to manage the cached date. That will be a troublesome code to write and maintain. A five depth level item will triger a process that will hit recursively five parents.

  • The size ot the data base could become HUGE. A five de depth level item store cached data for five parents. Don't know if it's relevant, since this a a mono-user app with a manual input. I don't think anybody would insert more thatn 1000 items with more than 10 level of depth.

Now the good news is we go from the bottom of the pyramid to the top, not the other way, so it's not has horrible as it seems. When I will have to deal with parent item deletion, it will be another nice headache, but I save it for another question ;-).

Now my question

How would you store the data and process the filtering int the most optimal way ?

Optional :

Should I define an explicit recursive depth limit ? Should I perform the filtering using SQL or Java ? SQL surely will be faster, but matching the geocontext is far easier to do in Java.

As I am working on the Android platform, I have the following constraints :

  • Java is the only language available, and not with the entire standard lib.

  • SQLite is the only DBMS available.

  • Performance and memory are important issues. In case you have to choose, battery life and therefore performance is the priority.

  • Exotics external libs may not be able to be used.

P.S : I dug into SO and found some interesting pieces of information (espacially http://stackoverflow.com/questions/192220/what-is-the-most-efficientelegant-way-to-parse-a-flat-table-into-a-tree). It's a hint, but not a problem solver.

+1  A: 

This might be offtopic but.. have you considered using serialization?

Google Protocol Buffers could be used to serialize the data in a very efficient manner (time and space), you'd have to then create a suitable tree structure (look in any CS book) to help with the searching.

I mentioned protocol buffers because being a Google library they may be available on Android.

Just a thought.

Fortyrunner
+1 just for giving me to think. But If you can give me more details, I'd me glad. I am a total noob when it comes from serialization. I always tried to avoid it because of my web background and trouble with loosing state object while trying to pass it in POST / GET METHODS.
e-satis
http://code.google.com/apis/protocolbuffers/Just look in any standard Java text for details of serialization. We use it in a big production system for storing archived data and it is very fast and easy to use.
Fortyrunner
A: 

AFAICT you can use hierarchical queries (google for "CONNECT BY" "START WITH") in SQLite...

Massa
Are you sure about that ? I really thought it was an Oracle noly keyword. I tried a quick google but found nothing...
e-satis
You can't: http://www.sqlite.org/lang.html
Soonil
+4  A: 

1) First, let's look at simply putting everything in memory. This is simple, flexible, and above all, fast, solution. Drawbacks include the fact that you'll have to read everything into memory at startup (give the user a pretty loading bar and they won't even notice), and perhaps have to do a little extra work to ensure everything is reflected to disk when the user thinks it is, so that data isn't lost.

In this analysis I'm making some generic assumptions about Android/Dalvik I don't really know that much about, so hopefully it's somewhat accurate :) Remember the G1 has 192MB of RAM. Also, your assumption above was a max around 1000 items.

Object superclass ~ 8 bytes
parent/child pointer ~ 4 bytes
date (long) ~ 8 bytes
name (non interned string avg 32 chars) ~ 64 bytes
x point (int) ~ 4 bytes
y point (int) ~ 4 bytes

Total = 92 bytes + possible memory alignment + fudge factor = 128 bytes
1000 items = 125kB
10000 items = 1.22MB

Note: I realize that while a child can only have one pointer, a parent can have multiple children. However, the number of parent->child pointers is (elements - 1), so the average cost of parent->child pointer is (elements - 1)/elements ~ 1 element or 4 bytes. This assumes a child structure that doesn't allocate unused memory, such as a LinkedList (as opposed to an ArrayList)

2) The nerd in me says that this would be a fun place to profile a B+ tree, but I think that's overkill for what you want at the moment :) However, whatever solution you end up adopting, if you are not holding everything in memory, you will definitely want to cache as much of the top levels of the tree in memory as you can. This may cut down on the amount of disk activity drastically.

3) If you don't want to go all memory, another possible solution might be as follows. Bill Karwin suggests a rather elegant RDBMS structure called a Closure Table for optimizing tree based reads, while making writes more complex. Combining this with top level cache may give you performance benefits, although I would test this before taking my word on it:

When evaluating a view, use whatever you have in memory to evaluate as many children as you can. For those children that do not match, use an SQL join between the closure table and the flat table with an appropriate where clause to find out if there are any matching children. If so, you'll be displaying that node on your result list.

Hope this all makes sense and seems like it would work for what you need.

Soonil
Sometimes I am so nicely surprised by he quality of the answers on SO.
e-satis
I'd wish I could vote this one twice :-)
e-satis
+1  A: 

I listened to Soonil and gave a try to the « closure table ». I added the following table :

################
#   Closure    #
################
# ancestor_id  #
#   item_id    #
################

If like me you never used that model before, it works that way :

You add a row for every direct or indirect relationship in the hierarchy. If C is a child of B, and B a child of A, you've got :

ancestor    item
   B         C
   A         B
   A         C      # you add the indirect relationship   
   A         A
   B         B
   C         C      # don't forget any item is in relation with himself

Nevertheless, with this scheme, you are missing an important information : what are the direct relationships ? What if you want only the direct children of an item ?

For that, you can add a column “is_direct” with a bool in the closure table, or you can just keep the column “parent_id” in the “item” table. That what I did because it prevents me from rewriting a lot of my previous code.

The nice part is that I can now check if an item matches a date or a geocontext in one single query.

E.G, if I am browsing all the items contained in the item number 4 and want to get only the ones matching or containing a children matching the date D :

SELECT ti.parent_id, ti.id, ti.title 
FROM item AS di                                  # item to filter with the date
              JOIN closure AS c                  # closure table
                  ON (di.id = c.item_id) 
              JOIN item AS ti 
                  ON (c.ancestor_id = ti.id)     # top item to display
WHERE di.date = D                                # here you filter by date   
AND ti.parent_id = 4                             # here you ensure you got only the top items

So I can throw away all my *_cache tables. I still have a lot of work to do one UPDATE / DELETE / CREATE, but everything is centralized and most of it is procedural, not recursive. Pretty cool :-)

The only pain is that I must recursively add an item to all its ancestor. But getting the ancestors is a one query shot, so it's really reasonable. And of course the closure table take a lot of space, but in my case I just don't care. Don't forget to index it if you are looking for perfs...

Love this SQL trick, thanks a lot guys ! It's a bit tricky to get at first glance, but so obvious once you have it done ;-)

e-satis