As an example, Google App Engine uses data stores, not a database, to store data. Does anybody have any tips for using data stores instead of databases? It seems I've trained my mind to think 100% in object relationships that map directly to table structures, and now it's hard to see anything differently. I can understand some of the benefits of data stores (e.g. performance and the ability to distribute data), but some good database functionality is sacrificed (e.g. joins).

Does anybody who has worked with data stores like BigTable have any good advice for working with them?

A: 

Coming from the database world, I think of a data store as a giant table (hence the name "bigtable"). BigTable is a bad example, though, because it does a lot of other things that a typical database might not do, and yet it is still a database. Unless you know you need to build something like Google's "bigtable", you will probably be fine with a standard database. Google needs it because they handle insane amounts of data across many systems, and no commercially available system does the job exactly the way they can demonstrate they need it done.

(bigtable reference: http://en.wikipedia.org/wiki/BigTable)

devinmoore
The question relates specifically to Google App Engine, which uses Bigtable; using a relational database isn't an option.
Nick Johnson
+11  A: 

The way I have been going about the mind switch is to forget about the database altogether. In the relational db world you always have to worry about data normalization and your table structure. Ditch it all. Just lay out your web pages. Lay them all out. Now look at them. You're already 2/3 there. If you forget the notion that database size matters and data shouldn't be duplicated, then you're 3/4 there, and you didn't even have to write any code! Let your views dictate your Models. You don't have to take your objects and make them 2 dimensional anymore as in the relational world. You can store objects with shape now.
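To make "let your views dictate your models" concrete, here is a minimal sketch in plain Python rather than the App Engine API. The page, field names, and `render` helper are all hypothetical; the point is only that the stored record has the same shape as the page that displays it, with duplicated data and nested lists instead of joins.

```python
# Hypothetical blog page: everything the view renders lives in one record.
post_page = {
    "title": "Hello datastore",
    "author_name": "Ann",             # duplicated from the author's profile
    "tags": ["appengine", "python"],  # a list, not a join table
    "comments": [
        {"by": "Bob", "text": "Nice post"},
        {"by": "Cy", "text": "Thanks"},
    ],
}


def render(page):
    # The view reads one record; no joins are needed to assemble it.
    lines = ["%s by %s" % (page["title"], page["author_name"])]
    lines.append("tags: " + ", ".join(page["tags"]))
    lines += ["%s: %s" % (c["by"], c["text"]) for c in page["comments"]]
    return "\n".join(lines)
```

The trade-off is that a change to the author's name has to be written to every record that copies it.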

Yes, this is a simplified explanation of the ordeal, but it helped me forget about databases and just make an application. I have made 4 App Engine apps so far using this philosophy and there are more to come.

I like the "Let your views dictate your Models." bit. I think that's a hang-up coming from RDBMS, but it simplifies everything.
banzaimonkey
+2  A: 

If you're used to thinking about ORM-mapped entities, then that's basically how an entity-based datastore like Google App Engine's works. For something like joins, you can look at reference properties. You don't really need to be concerned about whether it uses BigTable for the backend or something else, since the backend is abstracted by the GQL and Datastore APIs.

Mark Cidade
One issue with reference properties is that they can quickly create a 1+N query problem. (Pull 1 query to find 100 people, then make another query for each one of them to get person.address.)
0124816
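The 1+N problem described in the comment above can be sketched without the real API. `FakeDatastore` below is a hypothetical stand-in that only counts round trips; `get_multi` is an assumed name for "fetch a list of keys in one call" (in the Python `db` API, `db.get()` accepts a list of keys).

```python
class FakeDatastore:
    """Counts round trips; a stand-in, not the App Engine API."""

    def __init__(self, entities):
        self._entities = dict(entities)
        self.round_trips = 0

    def get(self, key):
        # One round trip per single-key fetch.
        self.round_trips += 1
        return self._entities[key]

    def get_multi(self, keys):
        # One round trip for the whole batch of keys.
        self.round_trips += 1
        return [self._entities[key] for key in keys]


addresses = {"addr:%d" % i: {"street": "%d Main St" % i} for i in range(100)}
people = [{"name": "p%d" % i, "address_key": "addr:%d" % i} for i in range(100)]

# 1+N pattern: one fetch per person on top of the query that found them.
store = FakeDatastore(addresses)
slow = [store.get(p["address_key"]) for p in people]
slow_trips = store.round_trips

# Batched pattern: collect the keys first, then fetch them in one call.
store = FakeDatastore(addresses)
fast = store.get_multi([p["address_key"] for p in people])
fast_trips = store.round_trips
```

Both approaches return the same addresses; only the number of round trips differs.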
Link to 'reference properties' is broken, probably by addition of Java support. Try: http://code.google.com/appengine/docs/python/datastore/entitiesandmodels.html#References
Spike0xff
link fixed. feel free to edit any answer if/when you have enough rep.
Mark Cidade
+45  A: 

There are two main things to get used to about the App Engine datastore when compared to 'traditional' relational databases:

  • The datastore makes no distinction between inserts and updates. When you call put() on an entity, that entity gets stored to the datastore with its unique key, and anything that has that key gets overwritten. Basically, each entity kind in the datastore acts like an enormous map or sorted list.
  • Querying, as you alluded to, is much more limited. No joins, for a start.

The key thing to realise - and the reason behind both these differences - is that Bigtable basically acts like an enormous ordered dictionary. Thus, a put operation just sets the value for a given key - regardless of any previous value for that key, and fetch operations are limited to fetching single keys or contiguous ranges of keys. More sophisticated queries are made possible with indexes, which are basically just tables of their own, allowing you to implement more complex queries as scans on contiguous ranges.
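A toy model may help here. The sketch below is plain Python, not the Bigtable or datastore API; `OrderedStore` and the key formats are made up for illustration. It shows a put that overwrites, a scan over a contiguous key range, and an index built as a second sorted table.

```python
import bisect


class OrderedStore:
    """Toy stand-in for a sorted key/value store (not the real Bigtable API)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        # No insert/update distinction: the new value replaces any old one.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, end):
        # Yield (key, value) pairs for start <= key < end, in key order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


people = OrderedStore()
people.put("person:1", {"name": "Ann", "city": "Oslo"})
people.put("person:2", {"name": "Bob", "city": "Rome"})
people.put("person:1", {"name": "Ann", "city": "Paris"})  # overwrites

# An index is just another sorted table whose keys encode the indexed value.
city_index = OrderedStore()
for key, person in people.scan("person:", "person;"):
    city_index.put((person["city"], key), None)

# "WHERE city = 'Paris'" becomes a scan over one contiguous key range.
paris_keys = [k[1] for k, _ in city_index.scan(("Paris", ""), ("Paris", "\uffff"))]
```

A query that cannot be expressed as one such contiguous scan is, roughly speaking, a query this kind of store cannot answer efficiently.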

Once you've absorbed that, you have the basic knowledge needed to understand the capabilities and limitations of the datastore. Restrictions that may have seemed arbitrary probably make more sense.

The key thing here is that although these are restrictions over what you can do in a relational database, these same restrictions are what make it practical to scale up to the sort of magnitude that Bigtable is designed to handle. You simply can't execute the sort of query that looks good on paper but is atrociously slow in an SQL database.

In terms of how to change how you represent data, the most important thing is precalculation. Instead of doing joins at query time, precalculate data and store it in the datastore wherever possible. If you want to pick a random record, generate a random number and store it with each record. There's a whole cookbook of these sorts of tips and tricks here
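The random-record trick can be sketched in plain Python, with a sorted list standing in for a datastore index on the stored random value. The names `add_record` and `pick_random` are hypothetical, and the selection is only approximately uniform, since records that follow a large gap in the stored values are picked more often.

```python
import bisect
import random


def add_record(index, record, rng):
    # Precalculate: attach a random sort value when the record is written.
    record["rand"] = rng.random()
    bisect.insort(index, (record["rand"], record["name"]))


def pick_random(index, rng):
    # Find the first record whose stored value is >= a fresh random number,
    # wrapping around to the first record if there is none.
    probe = rng.random()
    pos = bisect.bisect_left(index, (probe, ""))
    return index[pos % len(index)][1]


rng = random.Random(0)
index = []
for name in ["a", "b", "c"]:
    add_record(index, {"name": name}, rng)

choice = pick_random(index, rng)
```

Against the datastore, the lookup would correspond to a query filtered and ordered on the stored property with a limit of one, which is a single scan over the index.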

Nick Johnson
hashtable -> map. (Hash tables do not support efficient sub-range iteration.)
0124816
"as lookups on contiguous ranges" -> "as scans over contiguous ranges"
0124816
Both fixed, thanks.
Nick Johnson
+1  A: 

I think this question should probably be merged with my earlier question.

fuentesjr
+4  A: 

Hi,

I always chuckle when people come out with "it's not relational." I've written cellectr in Django, and here's a snippet of my model below. As you'll see, I have leagues that are managed or coached by users. From a league I can get all the managers, and from a given user I can return the leagues she coaches or manages.

Just because there's no specific foreign key support doesn't mean you can't have a database model with relationships.

My two pence.


from google.appengine.ext import db


# BaseModel is not shown in this snippet; presumably a db.Model subclass.
class League(BaseModel):
    name = db.StringProperty()
    managers = db.ListProperty(db.Key)  # all the users who can view/edit this league
    coaches = db.ListProperty(db.Key)  # all the users who are able to view this league

    def get_managers(self):
        # This returns the models themselves, not just the keys stored in managers
        return UserPrefs.get(self.managers)

    def get_coaches(self):
        # This returns the models themselves, not just the keys stored in coaches
        return UserPrefs.get(self.coaches)

    def __str__(self):
        return self.name

    # Need to delete all the associated games, teams and players.
    # The leagues_* collections come from models that are not shown here.
    def delete(self):
        for player in self.leagues_players:
            player.delete()
        for game in self.leagues_games:
            game.delete()
        for team in self.leagues_teams:
            team.delete()
        super(League, self).delete()


class UserPrefs(db.Model):
    user = db.UserProperty()
    league_ref = db.ReferenceProperty(reference_class=League,
                                      collection_name='users')  # league the users are managing

    def __str__(self):
        # nickname is a method on users.User, so it has to be called
        return self.user.nickname()

    # many-to-many relationship, a user can coach many leagues, a league can be
    # coached by many users
    @property
    def managing(self):
        return League.gql('WHERE managers = :1', self.key())

    @property
    def coaching(self):
        return League.gql('WHERE coaches = :1', self.key())

    # remove all references to me when I'm deleted
    def delete(self):
        for league in self.managing:
            league.managers.remove(self.key())
            league.put()
        for league in self.coaching:
            league.coaches.remove(self.key())
            league.put()
        super(UserPrefs, self).delete()
A: 

DataSource is an old API that we are gradually removing; it was very tied to a database connection model.

DataStore is the low-level API that allows access to a "raw" streaming-based approach to GIS content, using FeatureReaders and FeatureWriters.

murali