I am currently building a web application on Google App Engine in Python to harvest horse racing form data. The basic data structure is: a Course has many Meetings, a Meeting has many Races, a Race has many Horses, and each Horse has one Jockey and one Trainer. So far I have the following models (with the number of fields reduced for brevity).
from google.appengine.ext import db

class Course(db.Model):
    course_number = db.IntegerProperty()      # course id (third party)
    course_description = db.StringProperty()  # course name

class Meeting(db.Model):
    course = db.ReferenceProperty(Course)     # reference to course
    meeting_number = db.IntegerProperty()     # lifetime meeting number for course
    meeting_date = db.DateProperty()          # meeting date

class Race(db.Model):
    meeting = db.ReferenceProperty(Meeting)   # reference to meeting
    race_number = db.IntegerProperty()        # e.g. 1 for 1st race of meeting
    race_name = db.StringProperty()           # race name
    time_of_race = db.TimeProperty()          # race time
I am having trouble working out how to store data on Horses, Trainers and Jockeys in the datastore.
My application will be harvesting data for, say, the last 2 years, and for this period I will be saving the relevant result information for each Horse, Trainer and Jockey. The information on a particular horse's result is the same for its Trainer and Jockey at that point in time. However, over time a Horse can have different trainers and different jockeys.
My main headache comes when I realise that, in analysis, I may need to look at the results of the last 10 races for a Horse, Jockey or Trainer. Some of those results may not be stored, either because they occurred outside UK racing (the data is still available) or because they happened before the date from which I start storing complete races.
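To make the problem concrete, here is a rough plain-Python sketch of the shape I have in mind. It is not datastore code, and the names `Result`, `race_key` and `last_n_results` are purely illustrative assumptions, not part of my models. It assumes one record per horse per run, shared by the jockey and trainer at that time, with an optional link to a stored race.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Result:
    """One horse's run in one race (hypothetical entity, not in my models)."""
    horse_id: int
    jockey_id: int
    trainer_id: int
    run_date: date
    finishing_position: int
    # None when the race itself is not stored, e.g. a run outside the UK
    # or one from before the harvesting window.
    race_key: Optional[str] = None


def last_n_results(results: List[Result], field: str,
                   entity_id: int, n: int = 10) -> List[Result]:
    """Return the n most recent results for a horse, jockey or trainer.

    field is one of "horse_id", "jockey_id" or "trainer_id".
    """
    matching = [r for r in results if getattr(r, field) == entity_id]
    return sorted(matching, key=lambda r: r.run_date, reverse=True)[:n]


# Made-up sample data: horse 1 changes jockey and trainer over time.
results = [
    Result(1, 10, 100, date(2011, 5, 1), 2, race_key="race-a"),
    Result(1, 11, 100, date(2011, 6, 1), 1, race_key="race-b"),
    Result(1, 11, 101, date(2010, 3, 1), 4),  # run in a race I have not stored
    Result(2, 10, 100, date(2011, 6, 1), 3, race_key="race-b"),
]

recent_for_horse = last_n_results(results, "horse_id", 1)
```

In this sketch the same list answers the "last 10" question for a horse, a jockey or a trainer, and a result without a stored race simply has `race_key` left as `None`. Whether this maps sensibly onto datastore entities and queries is exactly what I am unsure about.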
Can anyone shed any light on how to optimise the storage of Horse, Jockey and Trainer results so that I can accommodate this?
Source of data: http://form.horseracing.betfair.com/timeform All required data can be easily accessed via JSON requests.