views:

123

answers:

2

I'm building a management application to help manage my mobile auto detailing company (and hopefully others). I'm struggling to figure out how to model some of the data.

This question is related to a previous question that I've posted, but I've reproduced the relevant information below: http://stackoverflow.com/questions/3120192/database-design-google-app-engine

In this application, there are concepts of "Appointments" and "Line Items."

Appointments are a place and time where employees are expected to be in order to deliver a service.

Line Items are a service, fee or discount and its associated information. An example of line items that might go into an appointment:

Name:                          Price: Commission: Time estimate   
Full Detail, Regular Size:        160       75       3.5 hours 
$10 Off Full Detail Coupon:       -10        0         0 hours 
Premium Detail:                   220      110       4.5 hours 
Derived totals(not a line item): $370     $185       8.0 hours

In my previous implementation of this application, Line Items were contained by a single appointment. This worked fine most of the time, but caused problems sometimes. An example would be if an appointment got interrupted half-way through because of rain and the technician had to come back out the next day and finish up. This situation required two appointments for the same line item. In cases like this, I would just fudge the data a little by setting the "line item" on the second appointment to read something like "Finish Up" and then the cost would be $0.

In this next version, I am considering enabling Line Items to be matched with more than one appointment with a table structure that looks like this:

Appointment
 start_time
 etc...

Line_Item
 appointment_Key_List
 name
 price
 etc...

A general problem with this structure is that it is complicated and I'm not even sure if its appropriate to match one line item with multiple appointments. If Line Items can only be part of one Appointment, then I can actually just put a list of line items IN each Appointment, when I get Appointments, I'd already be getting Line Items.

A more specific problem is that I am using google app engine and if I want to query for a set of appointments and their associated line items, I'd have to first query for the set of appointments and then do a second query for the line items using the IN operator to test if any of the Line_Item's appointment keys fall into the set of appointment keys the were returned from the previous query. The second query will fail if I have more than 30 keys requiring me to shard the query. I could denormalize the data to avoid this complicated and extensive read query, and I will probably have to denormalize to some degree anyway, but I'd rather avoid complexity where appropriate.

My question is how is this type of situation usually modeled? Is it even appropriate for a Line Item to be paired with more than one appointment, or is it normal to simply split line items into separate ones for each appointment such as "1st half of 2 day job" and "2nd half of two day job." How do similar successful applications do this? What are the rules of thumb in this type of situation? What implementations have turned out to be less problematic?

Thanks!

+1  A: 

The usual solution for this kind of problems is normalizing the model, i.e. to the First Normal Form.

Your model, in normalized form, would have a third table, with references to the Appointment and Line_Item rows:

Appointment
 start_time
 ...

Line_Item
 name
 price
 ...

Appointment_Line_Item
 appointment_key
 line_item_key

There is a problem however! Since you are using Google App Engine, and their Datastore is quite limited ("GQL cannot perform an SQL-like JOIN") and mostly requires denormalization.

You suggested using a list-like field. It is a possiblity to use this, but it is very hard to index it. Searching for a key (the appointment_key) in a list per row in the database is not really performing. I propose two possiblities:

  1. Duplicate Line_Item.

    Line_Item
     appointment_key
     name
     price
     finished
     ...
    

    A Line_Item should have a finished state, when the item was finished or not by the employee. If an employee hadn't finished all line items, mark them as unfinished, create a new appointment and copy all items that were unfinished. You can index on the appointment_key field on all Line_Items, which is a Good Thing. However, the duplicated data may be a problem.

  2. Dynamic fields for Line_Item:

    Line_Item
     duplicate_key
     appointment_key
     name
     price
     finished
     ...
    

    Create a new field, duplicate_key, for Line_Item which points to another Line_Item or to null (reserve this key!). Null means that the Line_Item is original, any other value means that this Line_Item is a duplicate of the Line_Item the field points to. All fields of Line_Item marked as a duplicate inherit the fields of the original Line_Item, except the appointment_key: so it will take less storage. Also this solution should have appointment_key indexed, to speed up lookup times. This requires one additional query per duplicated Line_Item, which may be a problem.

Now, it's a clear choice: either better speed or better storage. I would go for the first, as it reduces complexity of your model, and storage is never a problem with modern systems. Less complexity generally means less bugs and less development/testing costs, which justifies the cost of the storage requirement.

Pindatjuh
Thanks for your response. I never thought about the duplicate key approach, thats a really interesting solution. One thing to bear in mind with app engine is that they do index lists and let you search on them. They call it a "merge-join" and it seems to expand their capabilities beyond a simple key-value store:http://code.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html
DutrowLLC
" Searching for a key (the appointment_key) in a list per row in the database is not really performing." - not true. You can filter on list properties in App Engine just as efficiently as on non-lists.
Nick Johnson
@Nick Johnson - Thanks for chiming in with that. I think thats a key game changer with the app engine that is unexpected and not well known.
DutrowLLC
No, it doesn't. Lists are stored in the datastore as a sequence of key, value pairs with the same key, so the property "foo = db.IntegerProperty()" is stored the same way (well, almost) as "foo = db.ListProperty(int)" with one value would be. Indexing works in the same manner, enumerating the values from a list property (a multi-valued property, in low-level parlance). Thus, indexing and querying on a list is precisely the same as on singly-valued properties.
Nick Johnson
Deserialization has nothing to do with App Engine's indexing. Index rows are created from the Protocol Buffer that entities are represented as; multi-valued properties (lists) are treated the same way as singly valued properties, and can be queried on just as efficiently. The answer to the question you linked to is one I wrote, along with hundreds of other App Engine answers here on SO. I'm on the Google App Engine team; I _wrote_ parts of App Engine. Before that, I was on the Bigtable team. I know what I'm talking about with reference to the App Engine datastore.
Nick Johnson
"You can filter on list properties in App Engine just as efficiently as on non-lists.", the problem with this statement is that Google needs an O(1) algorithm to determine whether a value is in a list. According to the talks I've seen, this is not the case. Hence: searching in lists is *not* as efficient as searching for a value in a non-lists. Please provide a better answer, if you think my answer is wrong.
Pindatjuh
Very well. See my answer for details.
Nick Johnson
+2  A: 

The approach you're suggesting will work fine; you can model the line item's 'appointment_Key_list' as a list property and it will work as you expect. You don't have to use the IN operator - that's for matching a single value in the datastore against a list of keys you have (eg, "WHERE datastore_column IN ('a', 'b', 'c')), while you're doing the reverse - matching a single value against a list in the datastore.

I would suggest, though, that the reverse might be better suited to your task: Have each Appointment have a list of line item keys. This operates much the same way, but to retrieve all the data on an appointment, you instead first fetch the appointment, then do a bulk get on the line items, using the keys from the Appointment entity. If you know the key of the Appointment, you're thus avoiding the need to do any queries at all.

I've been trying to explain to Pindatjuh why querying a list property is no less efficient than a singly-valued one, but apparrently a more detailed description is required, so without any further ado, here is...

A brief primer on App Engine Datastore indexing

Although Python and Java provide various high level interfaces to the datastore, the datastore itself speaks a lower-level abstraction, called entities. An entity consist of the following:

  1. A unique primary key
  2. A list of (name, value) pairs

The primary key is the Datastore key you're already familiar with. The list of (name, value) pairs is App Engine's representation for the data in your entity. So far so straightforward. An entity with the following values:

a_string = "Hello, world"
an_int = 123

would be serialized to something resembling this:

[('a_string', 'Hello, world'), ('an_int', 123)]

But how does this interact with lists? Well, lists are treated as 'multiply valued' properties. That is, a list with n items is stored as n separate properties. An example probably makes this clearer:

a_string = "Hello, world"
an_int = 123
a_list_of_ints = [42, 314, 9]

will be serialized as:

[('a_string', 'Hello, world'), ('an_int', 123), ('a_list_of_ints', 42), ('a_list_of_ints', 314), ('a_list_of_ints', 9)]

As you can see, the list gets represented a series of values, all with the same name. When you load data from the datastore, the SDK sees the repeated value and turns it into a list.

Where this gets important is when it interacts with indexing. Suppose you have an index on 'a_string' and 'an_int'. When you insert or modify a value, App Engine generates a set of index entries for it; for the above index and the above entity, it generates a single row in the index that looks something like this:

('Hello, world', 123, a_key)

('a_key' here is a placeholder for the key of the original entity.) When you do a query that uses this index, it just needs to do a lookup on the index to find rows with the appropriate prefix (Eg, 'SELECT * FROM Kind WHERE a_string = "Hello, world" ORDER BY an_int').

When you index a list, though, App Engine inserts multiple index rows. An index on 'an_int' and 'a_list_of_ints' would generate these rows for the above entity:

(123, 42, a_key)
(123, 314, a_key)
(123, 9, a_key)

Again, querying works the same as it did previously - App Engine just has to look up the row with the correct prefix in the index. The number of entries in the list has no impact on how fast the query is - only on how long it took to generate and write the index entries. In fact, the query planner has no idea that 'a_list_of_ints' is a multiply valued property - it just treats it like any other index entry.

So in a nutshell:

  1. There's no practical difference between a list with one element in it and an individual property, in indexing and querying terms
  2. The size of an indexed list affects the time and space required for indexing, but not for querying.
  3. You can do a query that matches any entity with a given value in a list using a simple equality filter.
Nick Johnson
Very informative answer! Thank you for sharing this information with SO. @DutrowLLC please mark this answer as the correct one, as it is, in my opinion, a far better answer to your question. @Nick Johnson My apologies for believing the wrong stuff. Thank you for explaining and providing this very nice answer with great information for everybody!
Pindatjuh
@Pindatjuh - Its a lot to take in. This video also goes into some detail about how lists are indexed and search. I found the second half on merge-join extremely helpful. It was a pdf with slides that you can look at while watching the video:http://code.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html
DutrowLLC
@Nick Johnson - Thanks for taking the time to answer this question so thoroughly, I hope other people will also be able to find your answer and benefit from it.
DutrowLLC