I'm trying to figure out the best way to do caching for a website I'm building. It relies heavily on screen scraping the Wikipedia website. Here is the process I'm currently following:
- User requests a topic from Wikipedia via my site (e.g. http://www.wikipedia.org/wiki/Kevin_Bacon would be http://www.wikipediamaze.com/wiki?topic=Kevin_Bacon). NOTE: Because IIS can't handle requests that end in a '.', I'm forced to use the query string parameter
- Check whether I've already stored the formatted HTML in my database; if so, just display it to the user
- Otherwise, I perform a web request to Wikipedia
- Decompress the stream if needed (a simplified sketch of these two steps follows this list)
- Do a bunch of DOM manipulation to get rid of the stuff I don't need (and inject stuff I do need).
- Store the HTML in my database for future requests
- Return the HTML to the browser
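For reference, here's a simplified sketch of the fetch/decompress steps above. The class and method names are just for illustration, not my real code:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net;

static class WikipediaFetcher
{
    // Simplified version of the fetch/decompress steps: request the page
    // from Wikipedia, then unwrap the response stream if it came back
    // gzip-encoded.
    public static string FetchWikipediaHtml(string topic)
    {
        var request = (HttpWebRequest)WebRequest.Create(
            "http://www.wikipedia.org/wiki/" + Uri.EscapeDataString(topic));

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Stream stream = response.GetResponseStream();

            // Some pages arrive as a gzip stream even without asking for
            // one, so decompress those before reading.
            if (response.ContentEncoding.IndexOf("gzip",
                    StringComparison.OrdinalIgnoreCase) >= 0)
                stream = new GZipStream(stream, CompressionMode.Decompress);

            using (var reader = new StreamReader(stream))
                return reader.ReadToEnd();
        }
    }
}
```

(I gather HttpWebRequest also has an AutomaticDecompression property that would do the unwrapping for me, but I haven't tried it.)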
Because the site relies on screen scraping and DOM manipulation, I'm trying to keep things speedy so that I only have to do all of that once per topic instead of on every single request. Here are my questions:
- Is there a better way of doing caching, or additional things I can do to help performance?
- I know ASP.NET has a built-in caching mechanism, but will it work the way I need it to? I don't want to have to retrieve the HTML (which is pretty heavy) from the database on every request, but I DO need to store the HTML so that every user gets the same page. I only ever want to get the data from Wikipedia once. (The first sketch after this list shows the layering I have in mind.)
- Is there anything I can do with compression to get the page to the browser quicker, and if so, can the browser handle decompressing and displaying the HTML? Or is this not even a consideration? The only reason I ask is that some of the pages Wikipedia sends me through the HttpWebRequest come through as a gzip stream. (The second sketch below shows the sort of thing I mean.)
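To make the second question concrete, here's the kind of layering I'm picturing: the ASP.NET in-memory cache in front of the database, which in turn is only ever populated from Wikipedia once. It's only a sketch; the delegates stand in for my real database and scraping code, and the expiration time is an arbitrary guess:

```csharp
using System;
using System.Web;
using System.Web.Caching;

static class TopicCache
{
    // Look up a topic's HTML: in-memory cache first, then the database,
    // then a live scrape of Wikipedia as a last resort. The loaders are
    // passed in as delegates so this sketch stays self-contained.
    public static string GetTopicHtml(
        string topic,
        Func<string, string> loadFromDatabase,
        Func<string, string> scrapeAndStore)
    {
        string cacheKey = "topic:" + topic;

        var html = (string)HttpRuntime.Cache[cacheKey];
        if (html == null)
        {
            // Fall back to the database, and only scrape Wikipedia
            // if it isn't stored there either.
            html = loadFromDatabase(topic) ?? scrapeAndStore(topic);

            // Keep it in memory so later requests skip the database too.
            // The one-hour absolute expiration is just a guess on my part.
            HttpRuntime.Cache.Insert(cacheKey, html, null,
                DateTime.UtcNow.AddHours(1), Cache.NoSlidingExpiration);
        }
        return html;
    }
}
```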
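And for the third question, this is the sort of thing I mean by compressing on the way out (again just a sketch; I assume an HttpModule or Application_BeginRequest would be the place to hook it in, and IIS-level compression might make it unnecessary):

```csharp
using System.IO.Compression;
using System.Web;

static class ResponseCompression
{
    // If the browser advertises gzip support in Accept-Encoding, wrap the
    // outgoing response in a GZipStream. Browsers decompress transparently,
    // so nothing special should be needed on the client side.
    public static void CompressResponseIfSupported(HttpContext context)
    {
        string acceptEncoding = context.Request.Headers["Accept-Encoding"] ?? "";

        if (acceptEncoding.Contains("gzip"))
        {
            context.Response.Filter = new GZipStream(
                context.Response.Filter, CompressionMode.Compress);
            context.Response.AppendHeader("Content-Encoding", "gzip");
        }
    }
}
```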
Any and all suggestions, guidance, etc. are much appreciated.
Thanks!