I am currently building a web application on Google App Engine in Python to harvest horse racing form data. The basic data structure is: a Course has many Meetings, a Meeting has many Races, a Race has many Horses, and each Horse has one Jockey and one Trainer. So far I have the following models (with the number of fields reduced for brevity).

from google.appengine.ext import db

class Course(db.Model):
  course_number = db.IntegerProperty()     # course id (third party)
  course_description = db.StringProperty() # course name

class Meeting(db.Model):
  course = db.ReferenceProperty(Course)    # reference to course
  meeting_number = db.IntegerProperty()    # lifetime meeting number for course
  meeting_date = db.DateProperty()         # meeting date

class Race(db.Model):
  meeting = db.ReferenceProperty(Meeting)  # reference to meeting
  race_number = db.IntegerProperty()       # eg 1 for 1st race of meeting
  race_name = db.StringProperty()          # race name
  time_of_race = db.TimeProperty()         # race time

I am having trouble working out how to store data on Horses, Trainers, and Jockeys in the datastore.

My application will be harvesting data for, say, the last 2 years, and for this I will be saving the relevant result information for each Horse, Trainer, and Jockey. The information on a particular horse's result is the same for the Trainer and Jockey at that point in time. However, over time a Horse can have different trainers and different jockeys.

My main brain ache comes when I realise that in analysis I may need to look at the results of the last 10 races for a Horse, Jockey, or Trainer. Some of those results may not be stored, either because they occurred outside UK racing (the data is still available) or because they happened before the date from which I start storing complete races.

Can anyone shed any light on how to optimise the storage of Horse, Jockey, and Trainer results so that I can accommodate this?

Source of data: http://form.horseracing.betfair.com/timeform All required data can be easily accessed via JSON requests.

+1  A: 

You are on the right track with using HorseResult, TrainerResult, and JockeyResult models. Do not forget, the datastore does not have grouping or aggregate functions, so you might want to pre-compute any aggregates or statistics of interest when you are loading the data.

Perhaps you will also want to have statistics type models for tracking horse, jockey, and trainer performance over time and the combinations of each. Something like HorseMonth, which might track how many races the horse was involved in and how it placed by month.
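As a rough sketch of the pre-computation idea, the plain-Python snippet below rolls results up by horse and calendar month as they are loaded. The counter names (`runs`, `wins`, `places`), the `record_result` helper, and the rule that a top-three finish counts as a place are all assumptions made for illustration. On App Engine the same counters would presumably live on a HorseMonth entity rather than in a dict.

```python
import datetime
from collections import defaultdict


def new_month_stats():
    # Hypothetical counters; pick whatever statistics matter to you.
    return {"runs": 0, "wins": 0, "places": 0}


# Maps (horse_id, year, month) -> counters for that horse in that month.
horse_months = defaultdict(new_month_stats)


def record_result(horse_id, race_date, finishing_position):
    """Update the monthly roll-up for one horse when a result is loaded."""
    stats = horse_months[(horse_id, race_date.year, race_date.month)]
    stats["runs"] += 1
    if finishing_position == 1:
        stats["wins"] += 1
    if finishing_position <= 3:  # assumption: top three counts as "placed"
        stats["places"] += 1
    return stats


# Example load: two runs in May 2010 and one in June 2010 for horse 101.
record_result(101, datetime.date(2010, 5, 3), 1)
record_result(101, datetime.date(2010, 5, 20), 4)
record_result(101, datetime.date(2010, 6, 1), 2)
```

Updating the roll-up at load time means a later query only has to fetch a handful of monthly rows instead of scanning every result.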

I would also consider keeping details on how the combinations of horse and jockey, or horse and trainer did over time. Unfortunately I do not know enough about horse racing to give you specific suggestions for which combinations are meaningful.
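One possible way to track combinations is to keep counters keyed by the pair of ids. The sketch below does this for horse and jockey in plain Python; `record_pair` and `win_rate` are hypothetical helper names, and the same pattern would apply to horse and trainer.

```python
from collections import Counter

# Counters keyed by (horse_id, jockey_id).
pair_runs = Counter()
pair_wins = Counter()


def record_pair(horse_id, jockey_id, won):
    """Record one race for a horse and jockey combination."""
    key = (horse_id, jockey_id)
    pair_runs[key] += 1
    if won:
        pair_wins[key] += 1


def win_rate(horse_id, jockey_id):
    """Return the fraction of races won together, or None if never paired."""
    key = (horse_id, jockey_id)
    runs = pair_runs[key]
    if runs == 0:
        return None
    return pair_wins[key] / float(runs)


record_pair(1, 7, True)
record_pair(1, 7, False)
```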

Since it sounds like this is a tool largely for your own use, you might look into the mapper API. It might be of great value when you are exploring the data.

If a race is not included in your data, aside from expanding the harvest range, there may not be a lot you can do. You will probably just want to return the results you have, and perhaps something indicating there is not enough data in the date range?
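A minimal sketch of returning what is available along with a completeness flag might look like this. The `last_results` function, its return keys, and the `(race_date, finishing_position)` tuple shape are assumptions for illustration, not part of any App Engine API.

```python
import datetime


def last_results(results, wanted=10):
    """Return the most recent `wanted` results plus a completeness flag.

    `results` is a list of (race_date, finishing_position) tuples in any order.
    """
    newest_first = sorted(results, key=lambda r: r[0], reverse=True)
    found = newest_first[:wanted]
    return {
        "results": found,
        "complete": len(found) >= wanted,
        "missing": wanted - len(found),
    }


stored = [
    (datetime.date(2010, 3, 1), 2),
    (datetime.date(2010, 6, 1), 1),
    (datetime.date(2010, 4, 15), 5),
]
summary = last_results(stored, wanted=10)
```

Here `summary["complete"]` is false and `summary["missing"]` is 7, so the caller can show the three stored results and note that seven earlier races fall outside the harvested range.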

Robert Kluin