I'm trying to figure out the best way to do caching for a website I'm building. It relies heavily on screen scraping the Wikipedia website. Here is the process I'm currently following (sketched in code after the list):

  1. User requests a topic from Wikipedia via my site (e.g. http://www.wikipedia.org/wiki/Kevin_Bacon would be http://www.wikipediamaze.com/wiki?topic=Kevin_Bacon). NOTE: Because IIS can't handle requests that end in a '.', I'm forced to use a query string parameter.
  2. Check whether I've already stored the formatted HTML in my database; if I have, just display it to the user.
  3. Otherwise, perform a web request to Wikipedia.
  4. Decompress the stream if needed.
  5. Do a bunch of DOM manipulation to get rid of the stuff I don't need (and inject the stuff I do need).
  6. Store the HTML in my database for future requests.
  7. Return the HTML to the browser.
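
Roughly, that flow looks like the sketch below (simplified; `TryGetCachedHtml`, `CleanUpDom`, and `SaveHtml` are just placeholders for my real database and DOM code):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

public string GetTopicHtml(string topic)
{
    // 2. Serve from the database cache if this topic was already processed.
    string cached = TryGetCachedHtml(topic);
    if (cached != null)
        return cached;

    // 3. Otherwise fetch the page from Wikipedia.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
        "http://en.wikipedia.org/wiki/" + topic);

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream raw = response.GetResponseStream())
    {
        // 4. Decompress the stream if Wikipedia sent it gzipped.
        Stream body = raw;
        if (string.Equals(response.ContentEncoding, "gzip", StringComparison.OrdinalIgnoreCase))
            body = new GZipStream(raw, CompressionMode.Decompress);

        string html;
        using (StreamReader reader = new StreamReader(body, Encoding.UTF8))
            html = reader.ReadToEnd();

        // 5. Strip what I don't need and inject what I do.
        html = CleanUpDom(html);

        // 6. Store it for future requests, then 7. return it to the browser.
        SaveHtml(topic, html);
        return html;
    }
}
```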

Because this relies on screen scraping and DOM manipulation, I'm trying to keep things speedy, so that I only have to do it once per topic instead of on every single request. Here are my questions:

  1. Is there a better way of doing caching, or additional things I can do to help performance?
  2. I know ASP.NET has built-in caching mechanisms, but will they work the way I need them to? I don't want to have to retrieve the HTML (which is pretty heavy) from the database on every request, but I DO need to store the HTML so that every user gets the same page. I only ever want to get the data from Wikipedia once.
  3. Is there anything I can do with compression to get the page to the browser quicker, and if so, can the browser handle decompressing and displaying the HTML? Or is this not even a consideration? I'm only asking because some of the pages Wikipedia sends me through the HttpWebRequest come through as a gzip stream.

Any and all suggestions, guidance, etc. are much appreciated.

Thanks!

+1  A: 

Caching strategy: write the HTML to a static file and let users download from that file. Compression strategy: check out Google's PageSpeed Best Practices.
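
Something along these lines, for example (a rough sketch; `MakeSafeFileName` is just a placeholder for whatever sanitization you apply to topic names):

```csharp
using System.IO;

public string GetCachedUrl(string topic, string html)
{
    // Write the processed HTML to a folder that IIS serves as static content.
    string fileName = MakeSafeFileName(topic) + ".html";
    string path = Path.Combine(@"C:\inetpub\wwwroot\cache", fileName);

    // Only the very first request for a topic pays the scraping cost.
    if (!File.Exists(path))
        File.WriteAllText(path, html);

    // Send later requests straight to the static file; IIS serves it without
    // touching your scraping code or your database.
    return "/cache/" + fileName;
}
```

If you also enable static compression in IIS, those files get gzipped for the browser automatically, which covers part of question 3.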

Adrian Godong
+1  A: 

You can try enabling the OutputCache for your page with VaryByParam="topic". That keeps a copy of the rendered page in memory when multiple clients request it. When the page is not in memory, the server can retrieve it from your database. The beauty of OutputCache is that you can even store a gzipped version of the HTML (use VaryByContentEncoding).
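
For example, at the top of the .aspx page (a sketch; pick whatever Duration fits how stale you can afford the page to be):

```aspx
<%@ OutputCache Duration="86400" VaryByParam="topic" VaryByContentEncoding="gzip;deflate" %>
```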

If it's a problem for you to decompress the stuff you get from Wikipedia, then don't send an Accept-Encoding header. That should force Wikipedia to send the page to you uncompressed.
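
Alternatively, if you'd rather keep the compression on the wire, HttpWebRequest can decompress the response for you (a sketch):

```csharp
// (uses System.Net) Advertise gzip/deflate and let the framework decompress
// transparently, so GetResponseStream() already returns plain HTML.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
    "http://en.wikipedia.org/wiki/Kevin_Bacon");
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
```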

chris166