I’m working on a project that involves data mining from various sites; a good analogy is gathering statistical data on eBay auctions. However, as well as storing the key data, I really need to ensure access to the original page, and on some sites the original pages may not be permanent, e.g. if eBay removed an auction’s page after it completed. Ideally I’d like a system similar to how Google caches pages, i.e. storing a copy of each page on my own server. However, I’ve been advised there may be complications, as well as a big impact on the resources my database needs.
Even if each page you cache is only 5 KB, that still adds up over time - cache 200 pages and you've used an additional 1 MB in your DB; cache 20,000 pages and you've used 100 MB - and many pages (when you consider the markup plus the content) are going to be larger than 5 KB.
One alternative would be to save pages to disk as (potentially compressed) files in a directory and simply reference the saved filename in your database. If you don't need to query the contents of the page code after your initial data mining, this approach reduces the size of your database and query results while still preserving the full pages - see the sketch below.
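To make that concrete, here's a minimal sketch of the files-on-disk idea in Python, under assumed names (the `page_cache` directory, the `auctions.db` SQLite file, and the `cache_page`/`load_page` helpers are all made up for illustration). Each page is gzipped to disk, and the database stores only the URL, the filename, and the fetch time:

```python
import gzip
import hashlib
import os
import sqlite3
import time

CACHE_DIR = "page_cache"   # hypothetical directory for compressed page files
DB_PATH = "auctions.db"    # hypothetical SQLite database

os.makedirs(CACHE_DIR, exist_ok=True)
conn = sqlite3.connect(DB_PATH)
conn.execute("""CREATE TABLE IF NOT EXISTS cached_pages (
                    url        TEXT PRIMARY KEY,
                    filename   TEXT NOT NULL,
                    fetched_at INTEGER NOT NULL)""")

def cache_page(url: str, html: str) -> str:
    """Compress the page to disk; record only the filename in the DB."""
    # Name the file after a hash of the URL so a re-fetch overwrites cleanly.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html.gz"
    path = os.path.join(CACHE_DIR, name)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(html)
    conn.execute(
        "INSERT OR REPLACE INTO cached_pages (url, filename, fetched_at) "
        "VALUES (?, ?, ?)",
        (url, name, int(time.time())),
    )
    conn.commit()
    return name

def load_page(url: str) -> str:
    """Look up the filename in the DB and decompress the page from disk.

    Assumes the URL has already been cached.
    """
    row = conn.execute(
        "SELECT filename FROM cached_pages WHERE url = ?", (url,)
    ).fetchone()
    with gzip.open(os.path.join(CACHE_DIR, row[0]), "rt", encoding="utf-8") as f:
        return f.read()
```

HTML markup generally compresses well with gzip, and because the database row holds only a filename rather than the page itself, the DB stays small no matter how large the individual pages are.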
I would echo what Dav said, but perhaps also consider storing just the changes if you are indexing the same page over and over through time. Storing the text as varbinary would also go a long way towards saving space. As for searching, you could set up Lucene in parallel to index the pages.
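A rough sketch of the "store just the changes" idea (not anyone's actual implementation, just one way to do it with Python's standard difflib): keep the first snapshot of a URL in full, then for each later crawl store only a unified diff, which stays small when the page barely changes between crawls. The `make_delta` and `apply_delta` helpers below are hypothetical names for illustration.

```python
import difflib
import re

# Matches unified-diff hunk headers such as "@@ -12,7 +12,8 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@")

def make_delta(old_html: str, new_html: str) -> str:
    """Return a unified diff between two snapshots of the same page."""
    diff = difflib.unified_diff(
        old_html.splitlines(), new_html.splitlines(), lineterm=""
    )
    return "\n".join(diff)

def apply_delta(old_html: str, delta: str) -> str:
    """Rebuild the newer snapshot from the older one plus the stored delta."""
    old_lines = old_html.splitlines()
    new_lines = []
    orig = 0                       # current position in old_lines
    lines = delta.splitlines()
    i = 0
    while i < len(lines):
        m = HUNK_HEADER.match(lines[i])
        if not m:
            i += 1                 # skip the ---/+++ file headers
            continue
        start = int(m.group(1)) - 1
        new_lines.extend(old_lines[orig:start])   # unchanged region before the hunk
        orig = start
        i += 1
        while i < len(lines) and not lines[i].startswith("@@"):
            tag, body = lines[i][:1], lines[i][1:]
            if tag == " ":         # context line: present in both versions
                new_lines.append(old_lines[orig]); orig += 1
            elif tag == "-":       # line removed from the old version
                orig += 1
            elif tag == "+":       # line added in the new version
                new_lines.append(body)
            i += 1
    new_lines.extend(old_lines[orig:])            # unchanged tail
    return "\n".join(new_lines)

# Typical use: store the first crawl in full, then for each later crawl store
# only make_delta(previous, current); apply_delta(previous, delta) rebuilds
# the newer snapshot when you need it.
```

The trade-off is that reconstructing a given snapshot means replaying the deltas in order from the base copy, which is fine for an archive that is written on every crawl but read only occasionally.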
Is it a problem that the saved page will not include the externally linked CSS & JS files, meaning it would presumably render badly when served from the caching DB/file system?
What about photographs or other images on the page?
I think 5 KB is low for a page save, and I wonder about saving pages that contain JS, AJAX pages especially. I find it hard to visualize exactly what would happen, but maybe you only need to see the text on the page?
Anyone know how Google's caching works?