views: 40

answers: 2

We work with a lot of real estate data, and while rearchitecting how it is imported, I came across an interesting issue.

Firstly, the way our system works (loosely speaking) is that we run a ColdFusion process once a day that picks up data an IDX vendor pushes to us via FTP. Whatever they send us is what we get.

Over the years, this has proven to be rather unstable.

I am rearchitecting it in PHP on the RETS standard, which retrieves data through web-service-style requests over HTTP, and it has already proven to be much more reliable than what we had.

When it comes to updating existing data, my initial thought was to query only for records that have changed. There is a 'Modified' field that tells you when a listing was last updated, and the code I have grabs any listing updated within the last 6 hours (giving myself a window in case something goes wrong).
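Here is a minimal sketch of that window query, assuming a DMQL2 Search transaction and the field names Modified and PhotoModified (the phRETS call in the comment is only one possible client):

    <?php
    // Minimal sketch: build a DMQL2 filter for listings whose Modified or
    // PhotoModified timestamp falls within the last 6 hours. The field
    // names are assumptions; match them to your vendor's RETS metadata.
    $since = gmdate('Y-m-d\TH:i:s', time() - 6 * 3600);

    // In DMQL2 a trailing '+' means "on or after" and '|' is a logical OR.
    $query = "(Modified={$since}+)|(PhotoModified={$since}+)";

    // Hand $query to your RETS client's Search transaction, e.g. phRETS:
    // $results = $rets->SearchQuery('Property', 'RES', $query);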

However, I see a lot of real estate developers suggest creating 'batch' processes that constantly run through all listings regardless of their updated status.

Is this the better way to do it? Or am I fine with just grabbing the data I know I need? It doesn't make a lot of sense to me to do more processing than necessary. Thoughts?

+3  A: 

If you can trust the Modified field, your approach is good enough and faster than parsing the whole file. Looping over the whole file trying to work out what has or hasn't changed is much harder.

There will also be a slight performance improvement if you can send the file directly to the database and let it import the data with its built-in file import functions. For example, MySQL has mysqlimport - http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html - and I am sure other databases like SQL Server and Oracle have similar mechanisms. This saves you the overhead of the programming language and uses only the database's resources.
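The same idea is reachable from PHP via LOAD DATA; a rough sketch, assuming a tab-delimited feed file and a listings table (both hypothetical):

    <?php
    // Rough sketch: bulk-load a delimited listings file straight into
    // MySQL, bypassing per-row processing in PHP. The file path, table
    // name, and delimiters are assumptions about the feed's format.
    $pdo = new PDO('mysql:host=localhost;dbname=idx', 'user', 'pass', [
        PDO::MYSQL_ATTR_LOCAL_INFILE => true, // allow LOAD DATA LOCAL
    ]);

    $pdo->exec("
        LOAD DATA LOCAL INFILE '/tmp/listings.txt'
        INTO TABLE listings
        FIELDS TERMINATED BY '\\t'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
    ");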

Ivo Sabev
Sorry, by 'query' I meant using DMQL in my request. For example, Query()->SELECT->(Modified|PhotoModified), which is translated on the other side as 'give me any listing where the modified or photo-modified date is ______', where ______ is today minus 6 hours. So that part of the processing is on their end, but I know what you mean.
Kevin
+1  A: 

If you are lucky enough to have a last-updated date, then it is certainly more efficient to limit the number of rows you are updating or importing. Just be sure the date is actually being set by the data source when records change; not all databases are well designed, and they don't always behave the way you would expect.

And whatever you do, don't process row by row; that is the most inefficient way to do things. I can do a batch insert of a million records faster than I can process a data set of 100,000 records one row at a time. But there are set-based ways to do the type of insert you are talking about, and a set-based update of the 100 records that changed is going to be faster than a batch insert of 50,000 records.
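One common set-based pattern, sketched here assuming MySQL, a staging table named listings_staging, and a unique key on mls_number (all hypothetical): load the changed rows into staging, then merge them with a single statement.

    <?php
    // Sketch of a set-based merge: the changed rows are assumed to be in
    // a staging table already; one statement inserts new listings and
    // updates existing ones. Table and column names are hypothetical.
    $pdo = new PDO('mysql:host=localhost;dbname=idx', 'user', 'pass');

    $pdo->exec("
        INSERT INTO listings (mls_number, price, status, modified)
        SELECT mls_number, price, status, modified
        FROM listings_staging
        ON DUPLICATE KEY UPDATE
            price    = VALUES(price),
            status   = VALUES(status),
            modified = VALUES(modified)
    ");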

HLGEM
The field appears to be updated, so I'm relying on it at the moment. What do you mean by a row-by-row process? The unfortunate side of RETS is that I must explicitly state a property type in the query, so in order to build my list of MLS numbers I have to loop over each property type (around 6) and pull them down. After that, each cron run only pulls down the full information for a few hundred listings.
Kevin
It means you process only one row at a time, which is an inefficient practice. If you are using a loop or a cursor, you are probably doing row-by-row processing. In that case, a batch process that truncates all records and reinserts them in one bulk insert could be faster, but you would have to test that on your system.
HLGEM