ansaurus

Question

Recursive data processing performance using Java and SQLite

Answer 1

+1 A:

This might be off-topic, but have you considered using serialization?

Google Protocol Buffers could be used to serialize the data very efficiently (in both time and space). You'd then have to create a suitable tree structure (look in any CS book) to help with the searching.

I mention Protocol Buffers because, being a Google library, they may be available on Android.

Just a thought.

Fortyrunner 2009-04-04 10:28:34

+1 just for giving me something to think about. If you can give me more details, I'd be glad. I am a total noob when it comes to serialization. I have always tried to avoid it because of my web background and the trouble I had with losing object state while trying to pass it through POST / GET methods.

e-satis 2009-04-04 10:38:28

http://code.google.com/apis/protocolbuffers/ Just look in any standard Java text for details of serialization. We use it in a big production system for storing archived data and it is very fast and easy to use.
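For what it's worth, here is a minimal sketch of a round trip with plain Java serialization (java.io.Serializable from the standard library, not Protocol Buffers). The ArchivedItem class and its fields are made up for illustration and are not taken from the question:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical item type; the field names are illustrative only.
final class ArchivedItem implements Serializable {
    private static final long serialVersionUID = 1L;

    final String name;
    final long date; // e.g. milliseconds since the epoch

    ArchivedItem(String name, long date) {
        this.name = name;
        this.date = date;
    }

    // Turn the object into a byte array that could be written to a file.
    static byte[] toBytes(ArchivedItem item) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(item);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild the object from bytes produced by toBytes.
    static ArchivedItem fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ArchivedItem) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not deserialize item", e);
        }
    }

    public static void main(String[] args) {
        ArchivedItem copy = fromBytes(toBytes(new ArchivedItem("root", 1238840914000L)));
        System.out.println(copy.name + " " + copy.date);
    }
}
```

The copy is a new object with the same field values, so nothing is shared with the original after the round trip.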

Fortyrunner 2009-04-04 10:44:58

Answer 2

A:

AFAICT you can use hierarchical queries (google for "CONNECT BY" "START WITH") in SQLite...

Massa 2009-04-04 16:35:47

Are you sure about that? I really thought it was an Oracle-only keyword. I tried a quick Google search but found nothing...

e-satis 2009-04-04 17:02:07

You can't: http://www.sqlite.org/lang.html

Soonil 2009-04-07 15:21:48

Answer 3

+4 A:

1) First, let's look at simply putting everything in memory. This is a simple, flexible, and above all fast solution. Drawbacks include the fact that you'll have to read everything into memory at startup (give the user a pretty loading bar and they won't even notice), and perhaps do a little extra work to ensure everything is written to disk when the user thinks it is, so that data isn't lost.

In this analysis I'm making some generic assumptions about Android/Dalvik, which I don't really know that much about, so hopefully it's somewhat accurate :) Remember the G1 has 192MB of RAM. Also, your assumption above was a maximum of around 1000 items.

Object superclass ~ 8 bytes
parent/child pointer ~ 4 bytes
date (long) ~ 8 bytes
name (non interned string avg 32 chars) ~ 64 bytes
x point (int) ~ 4 bytes
y point (int) ~ 4 bytes

Total = 92 bytes + possible memory alignment + fudge factor = 128 bytes
1000 items = 125kB
10000 items = 1.22MB
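The arithmetic behind those totals can be checked with a few lines of Java. This is only a sketch of the estimate above: the 128 bytes per item is the rounded guess from the list, not a measurement of what Dalvik actually allocates, and the class and method names are mine.

```java
final class MemoryEstimate {
    // Per-item guess from the list above: 92 bytes of fields, rounded up to 128
    // to leave room for alignment and a fudge factor. An estimate, not a measurement.
    static final int BYTES_PER_ITEM = 128;

    // Estimated heap use in bytes for the given number of items.
    static long totalBytes(int items) {
        return (long) items * BYTES_PER_ITEM;
    }

    public static void main(String[] args) {
        int fieldBytes = 8 + 4 + 8 + 64 + 4 + 4; // header, pointer, date, name, x, y
        System.out.println(fieldBytes + " bytes of fields per item");
        System.out.println(totalBytes(1000) / 1024.0 + " kB for 1000 items");
        System.out.println(totalBytes(10000) / (1024.0 * 1024.0) + " MB for 10000 items");
    }
}
```

With 1024-based units this gives 125 kB for 1000 items and roughly 1.22 MB for 10000 items, matching the figures above.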

Note: I realize that while a child can only have one parent pointer, a parent can have multiple children. However, the total number of parent->child pointers is (elements - 1), so the average number of parent->child pointers per element is (elements - 1)/elements ~ 1, or about 4 bytes. This assumes a child structure that doesn't allocate unused memory, such as a LinkedList (as opposed to an ArrayList).
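To make the in-memory option concrete, here is a rough sketch of what a node could look like, using the fields from the estimate above. The ItemNode class, its field names, and the matchesOrContains method are my own illustration, and I'm assuming the filter you care about is "this item or something below it has a given date":

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical node type mirroring the fields in the estimate above.
final class ItemNode {
    final ItemNode parent;                               // null for the root
    final List<ItemNode> children = new LinkedList<>();  // no unused slots
    final long date;
    final String name;
    final int x;
    final int y;

    ItemNode(ItemNode parent, String name, long date, int x, int y) {
        this.parent = parent;
        this.name = name;
        this.date = date;
        this.x = x;
        this.y = y;
        if (parent != null) {
            parent.children.add(this); // register with the parent on creation
        }
    }

    // True if this item, or any item below it, has the given date.
    boolean matchesOrContains(long wantedDate) {
        if (date == wantedDate) {
            return true;
        }
        for (ItemNode child : children) {
            if (child.matchesOrContains(wantedDate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ItemNode a = new ItemNode(null, "A", 1L, 0, 0);
        ItemNode b = new ItemNode(a, "B", 2L, 0, 0);
        new ItemNode(b, "C", 3L, 0, 0);
        System.out.println(a.matchesOrContains(3L)); // A contains C, which has date 3
        System.out.println(b.matchesOrContains(1L)); // only A has date 1, and A is above B
    }
}
```

With around 1000 items the recursion touches at most every node once per query, which should be cheap compared with hitting the disk.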

2) The nerd in me says that this would be a fun place to profile a B+ tree, but I think that's overkill for what you want at the moment :) However, whatever solution you end up adopting, if you are not holding everything in memory, you will definitely want to cache as much of the top levels of the tree in memory as you can. This may cut down on the amount of disk activity drastically.

3) If you don't want to go all memory, another possible solution might be as follows. Bill Karwin suggests a rather elegant RDBMS structure called a Closure Table, which optimizes tree-based reads at the cost of making writes more complex. Combining this with a top-level cache may give you performance benefits, although I would test this before taking my word for it:

When evaluating a view, use whatever you have in memory to evaluate as many children as you can. For those children that do not match, use an SQL join between the closure table and the flat table with an appropriate where clause to find out if there are any matching children. If so, you'll be displaying that node on your result list.

Hope this all makes sense and seems like it would work for what you need.

Soonil 2009-04-07 15:56:15

Sometimes I am so nicely surprised by the quality of the answers on SO.

e-satis 2009-04-07 17:56:35

I wish I could vote for this one twice :-)

e-satis 2009-04-21 10:22:10

Answer 4

+1 A:

I listened to Soonil and gave the "closure table" a try. I added the following table:

################
#   Closure    #
################
# ancestor_id  #
#   item_id    #
################

If, like me, you have never used that model before, it works this way:

You add a row for every direct or indirect relationship in the hierarchy. If C is a child of B, and B a child of A, you get:

ancestor    item
   B         C
   A         B
   A         C      # you add the indirect relationship
   A         A
   B         B
   C         C      # don't forget that every item is related to itself
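If it helps, the rows above can be derived mechanically from the parent links. Here is a small Java sketch of that idea. ClosureBuilder and closureRows are names I made up, the "ancestor>item" string format is only for printing, and the code assumes the parent links really form a tree (a cycle would loop forever):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ClosureBuilder {
    // parentOf maps each item to its direct parent; a root maps to null.
    // Returns one "ancestor>item" string per closure row.
    static List<String> closureRows(Map<String, String> parentOf) {
        List<String> rows = new ArrayList<>();
        for (String item : parentOf.keySet()) {
            // Start at the item itself so the self row (X, X) is included,
            // then walk up through the parent, the grandparent, and so on.
            for (String ancestor = item; ancestor != null; ancestor = parentOf.get(ancestor)) {
                rows.add(ancestor + ">" + item);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new LinkedHashMap<>();
        parentOf.put("A", null); // A is the root
        parentOf.put("B", "A");
        parentOf.put("C", "B");
        System.out.println(closureRows(parentOf));
    }
}
```

For the A, B, C example this yields the same six rows as the table: three self rows, two direct relationships, and the indirect A to C one.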

Nevertheless, with this scheme you are missing an important piece of information: which relationships are direct? What if you want only the direct children of an item?

For that, you can add a boolean column “is_direct” to the closure table, or you can just keep the column “parent_id” in the “item” table. That is what I did, because it saves me from rewriting a lot of my previous code.

The nice part is that I can now check if an item matches a date or a geocontext in one single query.

E.g., if I am browsing all the items contained in item number 4 and want only the ones that match the date D, or that contain a child matching it:

SELECT ti.parent_id, ti.id, ti.title
FROM item AS di                                  -- item to filter with the date
              JOIN closure AS c                  -- closure table
                  ON (di.id = c.item_id)
              JOIN item AS ti
                  ON (c.ancestor_id = ti.id)     -- top item to display
WHERE di.date = D                                -- here you filter by date
AND ti.parent_id = 4                             -- here you ensure you get only the top items
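To show what that join does, here is the same logic written as plain Java loops over in-memory rows. This is only an illustration and not how I actually run it (the real thing is the SQL above). The Item and Closure classes are hypothetical stand-ins for the two tables, -1 as "no parent" is an assumption of the sketch, and the Set removes the duplicates the raw join would return when several descendants match:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ClosureQuery {
    // Hypothetical in-memory stand-in for a row of the "item" table.
    static final class Item {
        final int id;
        final int parentId; // -1 for a root item (an assumption of this sketch)
        final long date;
        Item(int id, int parentId, long date) {
            this.id = id;
            this.parentId = parentId;
            this.date = date;
        }
    }

    // Hypothetical in-memory stand-in for a row of the "closure" table.
    static final class Closure {
        final int ancestorId;
        final int itemId;
        Closure(int ancestorId, int itemId) {
            this.ancestorId = ancestorId;
            this.itemId = itemId;
        }
    }

    // Ids of the direct children of parentId that have the given date
    // themselves or have a descendant with that date.
    static Set<Integer> matchingChildren(List<Item> items, List<Closure> closure,
                                         int parentId, long date) {
        Set<Integer> result = new TreeSet<>();
        for (Item di : items) {                 // item to filter with the date
            if (di.date != date) continue;
            for (Closure c : closure) {         // closure rows that end at di
                if (c.itemId != di.id) continue;
                for (Item ti : items) {         // top item to display
                    if (ti.id == c.ancestorId && ti.parentId == parentId) {
                        result.add(ti.id);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
                new Item(4, -1, 0L), new Item(5, 4, 0L),
                new Item(6, 4, 0L), new Item(7, 5, 100L));
        List<Closure> closure = Arrays.asList(
                new Closure(4, 4), new Closure(5, 5), new Closure(6, 6), new Closure(7, 7),
                new Closure(4, 5), new Closure(4, 6), new Closure(5, 7), new Closure(4, 7));
        // Item 7 has the wanted date and sits under item 5, a direct child of item 4.
        System.out.println(matchingChildren(items, closure, 4, 100L));
    }
}
```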

So I can throw away all my *_cache tables. I still have a lot of work to do on UPDATE / DELETE / CREATE, but everything is centralized and most of it is procedural, not recursive. Pretty cool :-)

The only pain is that I must link each item to all of its ancestors. But getting the ancestors is a one-query shot, so it's really reasonable. And of course the closure table takes a lot of space, but in my case I just don't care. Don't forget to index it if you are looking for performance...

Love this SQL trick, thanks a lot guys! It's a bit tricky to get at first glance, but so obvious once you have it done ;-)
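Roughly, adding an item means copying the parent's ancestor rows and pointing them at the new item. Here is a sketch of that step in Java over in-memory rows. The Row class and rowsForNewItem are names invented for this example, and it assumes the parent already has all of its own rows, including its self row:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ClosureInsert {
    // One closure row: ancestorId is an ancestor of itemId, or the item itself.
    static final class Row {
        final int ancestorId;
        final int itemId;
        Row(int ancestorId, int itemId) {
            this.ancestorId = ancestorId;
            this.itemId = itemId;
        }
    }

    // Rows to add when newId is inserted as a child of parentId.
    // Assumes the closure already holds the parent's rows, including (parentId, parentId).
    // In SQL this would roughly be an INSERT ... SELECT over the rows WHERE item_id = parentId.
    static List<Row> rowsForNewItem(List<Row> closure, int parentId, int newId) {
        List<Row> added = new ArrayList<>();
        for (Row row : closure) {
            if (row.itemId == parentId) {
                // The parent and each of its ancestors become ancestors of the new item.
                added.add(new Row(row.ancestorId, newId));
            }
        }
        added.add(new Row(newId, newId)); // the new item's self row
        return added;
    }

    public static void main(String[] args) {
        // A = 1 and B = 2, with B a child of A.
        List<Row> closure = Arrays.asList(new Row(1, 1), new Row(2, 2), new Row(1, 2));
        // Insert C = 3 as a child of B.
        for (Row row : rowsForNewItem(closure, 2, 3)) {
            System.out.println(row.ancestorId + " " + row.itemId);
        }
    }
}
```

No recursion is needed here because the parent's rows already list every ancestor.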

e-satis 2009-04-17 15:58:17
