I'm trying to build an app that will take real estate MLS (Multiple Listing Service) data from a CSV and insert it into the database. I have the CSV parsing taken care of, but I'm having trouble making the database efficient. The problem is that the MLS data providers are known to change the format of the property elements quickly without much notice. So having one table that matches up 1-to-1 with the data would likely cause issues when loading data in the future.

It seems most developers put each element in its own row, e.g. my current setup:

id = int
property_id = longint 
element_key = char
element_value = text

As you can imagine, this is very slow with thousands of properties at 80+ elements each.
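To make the layout concrete, here is a sketch of that entity–attribute–value (EAV) table in SQLite; the table name, field names, and the 80 generated elements are hypothetical stand-ins for the real MLS feed:

```python
import sqlite3

# Hypothetical sketch of the EAV layout described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE property_elements (
        id            INTEGER PRIMARY KEY,
        property_id   INTEGER,
        element_key   TEXT,
        element_value TEXT
    )
""")

# One listing with 80 elements becomes 80 rows; thousands of listings
# multiply that quickly, which is where the slowness comes from.
listing = {f"field_{i}": f"value_{i}" for i in range(80)}
conn.executemany(
    "INSERT INTO property_elements (property_id, element_key, element_value) "
    "VALUES (?, ?, ?)",
    [(1001, key, value) for key, value in listing.items()],
)
print(conn.execute("SELECT COUNT(*) FROM property_elements").fetchone()[0])
# prints 80
```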

How can I make this more efficient but keep the database flexible?

And yes I know about memcache and plan on using it.

A: 

It really depends what you want to do with the data. A document-style database plus a full-text indexer may be enough for you (really, just a persistent form of memcache). You would just store all the item data in a single row/document then, and unpack it when you need it.
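A minimal sketch of that idea, using SQLite as the persistent store and JSON as the document format (table and field names are made up for illustration):

```python
import json
import sqlite3

# Store the whole listing as one JSON document per property and
# unpack it on read, rather than one row per element.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (property_id INTEGER PRIMARY KEY, doc TEXT)")

listing = {"list_price": "450000", "beds": "3", "baths": "2"}
conn.execute("INSERT INTO listings VALUES (?, ?)", (1001, json.dumps(listing)))

# Providers can add, drop, or rename fields without any schema change;
# only the code that reads the document needs to care.
doc = json.loads(conn.execute(
    "SELECT doc FROM listings WHERE property_id = ?", (1001,)
).fetchone()[0])
```

Each property is then a single insert and a single read, instead of 80+ of each.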

Perhaps some of the stuff here may be useful.

Nicholas Riley
A: 

You are at the mercy of the data providers, unless there's a way to bring them under control. This has been the bane of database work for about fifty years now, and it isn't likely to change any time soon. The use of CSV has little to do with the underlying problem.

I suspect that it isn't just the format of the data that changes, but also the semantics of the data, even though you didn't say so.

Your best bet is to have one or more staging tables that record the CSV data pretty much in the format you receive it. Be prepared to change these tables whenever the providers change things on you. Then write some procedures that transform this data into a suitable form for your base tables, and copy the transformed data to the base tables. These procedures will need periodic maintenance, but your base tables will remain more stable, as long as you don't have to add more information-storing capability in order to match the changes offered by the suppliers.
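The staging-then-transform flow might look like this in SQLite; the column names and the single transform step are hypothetical, standing in for whatever the feed actually contains:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Mirrors whatever the provider currently sends; expect to ALTER
    -- this table whenever their format changes.
    CREATE TABLE staging_listings (mls_id TEXT, price TEXT, beds TEXT, extra TEXT);

    -- Stable schema that the rest of the application queries.
    CREATE TABLE listings (mls_id TEXT PRIMARY KEY, price INTEGER, beds INTEGER);
""")

# Raw CSV rows land in staging as-is, everything as text.
conn.execute("INSERT INTO staging_listings VALUES ('A100', '450000', '3', 'junk')")

# The transform procedure: cast, clean, and copy into the base table.
# Only this step needs maintenance when the provider changes the feed.
conn.execute("""
    INSERT INTO listings (mls_id, price, beds)
    SELECT mls_id, CAST(price AS INTEGER), CAST(beds AS INTEGER)
    FROM staging_listings
""")
```

The point of the split is that churn from the provider is absorbed by the staging table and the transform, while the base tables and the queries against them stay put.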

If your database has to be completely dynamic in order to keep up with the changing inputs, then you're stuck with some model like EAV, which is completely ignorant of the logical structure of the data. That ignorance makes it possible for EAV to be very dynamic, but it will create havoc for you when you try to turn the EAV data into meaningful information.
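To illustrate that havoc: reassembling even one logical row from EAV data requires a pivot, e.g. one conditional aggregate per attribute you want back (keys and values here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (property_id INTEGER, k TEXT, v TEXT)")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "price", "450000"), (1, "beds", "3"),
    (2, "price", "300000"), (2, "beds", "2"),
])

# Each attribute you want as a column needs its own CASE expression;
# with 80+ elements per property this query gets unwieldy fast.
rows = conn.execute("""
    SELECT property_id,
           MAX(CASE WHEN k = 'price' THEN v END) AS price,
           MAX(CASE WHEN k = 'beds'  THEN v END) AS beds
    FROM eav
    GROUP BY property_id
    ORDER BY property_id
""").fetchall()
```

Every value also comes back as text, so type checking and constraints that a normal schema would enforce for free have to be reimplemented in the queries or the application.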

Walter Mitty