views: 27

answers: 2

I have an XML feed which contains 1000+ records of properties (rent, sale).

Currently I am calling this feed 16 times on the homepage, each call returning only a handful of properties for specific criteria: 3 new houses, 3 new flats, 5 recommended houses, 5 recommended flats, and so on.

This worked well for 7 months, while there were 200+ properties and only 100-200 views a day. It is now getting to the stage where I have 700+ visits a day and over 1000 properties, and downloading 16 feeds separately just to show the homepage is getting slower and the traffic load is getting massively larger.

Therefore I would like to cache these streams: only my 'robot' would download the streams directly from the source, and all visitors would use my local copy, which would make things much quicker and decrease the traffic load massively.

I don't have a problem with downloading the XML locally and calling the local files to show the data. But I would like to know how to solve possible issues like:

  • not showing data to clients while the robot is updating the XML files, because the original file would be overwritten and empty while the new data is loading
  • using the XML files as a local backup, meaning that if the source server is offline the homepage would still work and load
  • making sure that I won't lock the data for clients in such a way that the robot would be unable to update the files

My first thought would be to work with 2 XML files for every stream: one which is shown to clients and one which is being downloaded. If the download is correct, the downloaded XML would become the live data and the other one would be deleted - some kind of incremental marking, with one file holding the name of the file containing the actual data.
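
For illustration only, a minimal sketch of that two-file swap in .NET (the file names, the ".tmp"/".bak" suffixes and the validation step are assumptions, not part of the question):

    using System;
    using System.IO;
    using System.Net;
    using System.Xml.Linq;

    class FeedSwapper
    {
        public static void RefreshFeed(string feedUrl, string livePath)
        {
            string tempPath = livePath + ".tmp";

            using (var client = new WebClient())
            {
                client.DownloadFile(feedUrl, tempPath);
            }

            // Validate before going live; a failed download or parse leaves the old file untouched.
            XDocument.Load(tempPath);

            // Swap the new file in; readers get either the old file or the new one, never a partial write.
            if (File.Exists(livePath))
                File.Replace(tempPath, livePath, livePath + ".bak");
            else
                File.Move(tempPath, livePath);
        }
    }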

Is there any way to cache these XML files so it would do something similar? Really the main issue is to have a bulletproof solution so clients won't see error pages or empty results.

Thanks.

+1  A: 

Use the caching options built into HttpWebRequest/HttpWebResponse. These let you programmatically choose between obtaining straight from the cache (ignoring freshness), ignoring the cache, forcing the cache to be refreshed, forcing the cache to be revalidated, and the normal behaviour of using the cache if it's considered fresh according to the original response's age information and otherwise revalidating it.
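
For example, something along these lines - a minimal sketch, where the feed URL is a placeholder and the cache levels shown are from System.Net.Cache:

    using System;
    using System.IO;
    using System.Net;
    using System.Net.Cache;

    class FeedFetcher
    {
        public static string GetFeed(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);

            // Normal behaviour: use the cached copy if it is still fresh according
            // to the response headers, otherwise revalidate or refetch it.
            request.CachePolicy = new HttpRequestCachePolicy(HttpRequestCacheLevel.Default);

            // Other levels cover the cases mentioned above, e.g.:
            //   HttpRequestCacheLevel.CacheOnly   - straight from the cache, ignoring freshness
            //   HttpRequestCacheLevel.BypassCache - ignore the cache entirely
            //   HttpRequestCacheLevel.Reload      - force the cache to be refreshed
            //   HttpRequestCacheLevel.Revalidate  - force the cache to be revalidated

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }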

Even if you have really specific caching requirements that go beyond that, build them on top of doing HTTP caching properly, rather than as a complete replacement.

If you do need to manage your own cache of the XML streams, then normal file locking and, if really necessary, .NET ReaderWriterLockSlims should suffice to keep different threads from messing each other up. One way to remove the risk of excessive contention is to fall back to direct access when the cache is contended. Consider that caching is ultimately an optimisation (conceptually you are still getting the file "from the server"; caching just makes this happen in a more efficient manner). Hence, if you fail to obtain a read lock quickly, you can revert to downloading directly. This in turn reduces the wait that can build up for the write lock (because pending read locks won't stack up while a write lock is requested). In practice it probably won't happen very often, but it will save you from the risk of unacceptable contention building up around one file and bringing the whole system down.
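
A rough sketch of that read-lock-with-timeout fallback; the 250 ms timeout, the paths and the downloadDirect delegate are placeholders rather than anything prescribed by the answer:

    using System;
    using System.IO;
    using System.Threading;

    class CachedFeed
    {
        private static readonly ReaderWriterLockSlim CacheLock = new ReaderWriterLockSlim();

        public static string Read(string cachePath, Func<string> downloadDirect)
        {
            // If we cannot get a read lock quickly, fetch from the source directly
            // rather than queuing up behind a writer.
            if (CacheLock.TryEnterReadLock(TimeSpan.FromMilliseconds(250)))
            {
                try { return File.ReadAllText(cachePath); }
                finally { CacheLock.ExitReadLock(); }
            }
            return downloadDirect();
        }

        public static void Update(string cachePath, string freshXml)
        {
            CacheLock.EnterWriteLock();
            try { File.WriteAllText(cachePath, freshXml); }
            finally { CacheLock.ExitWriteLock(); }
        }
    }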

Jon Hanna
Oh, I should add - as it might be relevant here - that another option when dealing with the normal cache is to allow an acceptable degree of staleness, e.g. "give me this if it's fresh, or if you would normally consider it out of date but it's out of date by less than 4 hours".
Jon Hanna
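
With the System.Net.Cache types, that would look roughly like the sketch below (the 4-hour figure is just the one from the comment; the URL and class name are placeholders):

    using System;
    using System.Net;
    using System.Net.Cache;

    class StaleTolerantFetch
    {
        public static HttpWebRequest Create(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            // Accept a cached copy that has gone stale, as long as it is stale
            // by less than 4 hours.
            request.CachePolicy = new HttpRequestCachePolicy(
                HttpCacheAgeControl.MaxStale, TimeSpan.FromHours(4));
            return request;
        }
    }
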
Would it be possible to check the actual size or status of the file, and only cache/sync if it differs?
feronovak
It would, though you'd have to go further, as changes can (and in real life often do) result in equal-sized files. You could store an MD5 of the file, or the E-tag that the WebResponse got (the latter is better if E-tags are being sent; if not, berate the person running the web side of things, if possible, until they are), or the last-mod date on the web response (if sub-second changes are impossible in this system). Again, checking last-mod and E-tags happens automatically with the appropriate use of the web cache built into HttpWebResponse when you use the appropriate options.
Jon Hanna
(MD5 is considered broken for many purposes; here, though, we're just using it as a CRC-on-steroids - much lower risk of accidental collision - rather than for security purposes, so those objections don't apply.)
Jon Hanna
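
A possible sketch of that check, assuming the feed is fetched with WebClient; the helper name and the fall-back to MD5 are illustrative, not something specified in the comments above:

    using System;
    using System.Net;
    using System.Security.Cryptography;

    class ChangeCheck
    {
        // Returns a fingerprint for the feed: the E-tag if the server sends one,
        // otherwise an MD5 of the body (used purely as a checksum, not for security).
        public static string Fingerprint(string url, out byte[] body)
        {
            using (var client = new WebClient())
            {
                body = client.DownloadData(url);
                string etag = client.ResponseHeaders[HttpResponseHeader.ETag];
                if (!string.IsNullOrEmpty(etag))
                    return etag;

                using (var md5 = MD5.Create())
                    return Convert.ToBase64String(md5.ComputeHash(body));
            }
        }
    }

The returned value can be compared against the fingerprint stored from the previous run, and the local copy rewritten only when it changes.
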
A: 

I'm going to start by assuming that you don't own the code that produces the source XML feed? Because if you do, I'd look at adding some specific support for the queries you want to run.

I had a similar issue with a third-party feed and built a job that runs a few times a day, downloads the feed, parses it, and stores the results locally in a database.

You need to do a bit of comparison each time you update the database, only adding new records and deleting old ones, but it ensures that you always have data to serve to your clients, and the database works around simple issues like file locking.

Then I'd look at a simple service layer to expose the data in your local store.
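
As an illustrative sketch of that compare step only - the element name "property", the "id" attribute and the insert/delete delegates are made up; the real feed schema and data-access code would go in their place:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    class FeedSync
    {
        public static void Sync(string feedXml, ISet<string> storedIds,
                                Action<XElement> insert, Action<string> delete)
        {
            var feedItems = XDocument.Parse(feedXml)
                                     .Descendants("property")
                                     .ToDictionary(p => (string)p.Attribute("id"));

            // New records: in the feed but not yet in the local store.
            foreach (var id in feedItems.Keys.Except(storedIds))
                insert(feedItems[id]);

            // Stale records: in the local store but no longer in the feed.
            foreach (var id in storedIds.Except(feedItems.Keys))
                delete(id);
        }
    }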

Stewart Ritchie
Simpler than doing that comparison in the database is just to have a version column that is automatically updated on update. Then you can use it for the last-modified value (if it's a datetime and single-second resolution suffices for your application) and/or for creating the e-tag (which works with datetimes of finer resolution, and with change counts, for any time difference between updates).
Jon Hanna
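
A small illustrative sketch of surfacing such a column as caching headers in the local service layer; the method names and the treatment of the version as a long are assumptions, not from the comment:

    using System;

    class VersionHeaders
    {
        // Values derived from a "version"/"updated" column, suitable for the
        // Last-Modified and ETag headers exposed by the local service layer.
        public static string LastModified(DateTime updatedUtc)
        {
            return updatedUtc.ToString("R"); // RFC 1123 format, one-second resolution
        }

        public static string ETag(long rowVersion)
        {
            return "\"" + rowVersion.ToString("x") + "\"";
        }
    }
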
No, I don't have access to the original XML feed code. I have a structure and need to work with that. I am thinking about caching every 5 minutes, as there is quite a high fluctuation of data.
feronovak