I've been doing a lot of webpage parsing lately, and my process usually looks something like this:
1. Obtain a list of links to parse
2. Import the list into a database
3. Download the entire webpage for each link and store it in MySQL (see the sketch after this list)
4. Add an index for each scraping session
5. Scrape the relevant sections (content, metas, whatever)
6. Rinse and repeat steps 4-5 -- it's common to want to scrape different content from the same page later on, modify your XPath, scrub said content, or whatever
7. Export the scraping database to the real database, dropping the raw-webpage column and the scraping indexes
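To make that concrete, step 3 and the table layout look roughly like this. This is just a sketch -- I'm using Python with mysql-connector and requests purely for illustration, and the table and column names are made up:

```python
import mysql.connector
import requests

# Illustrative schema: one row per page per batch; the full HTML lives in a
# LONGTEXT column and extracted fields get their own columns later.
conn = mysql.connector.connect(
    host="localhost", user="scraper", password="secret", database="scraping"
)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        id       INT AUTO_INCREMENT PRIMARY KEY,
        batch_id INT NOT NULL,
        url      VARCHAR(2048) NOT NULL,
        html     LONGTEXT,      -- the ENTIRE downloaded page
        title    VARCHAR(512),  -- filled in by a later scraping pass
        content  MEDIUMTEXT     -- ditto
    )
""")

def download_batch(batch_id, urls):
    """Step 3: fetch each link and dump the raw page into MySQL."""
    for url in urls:
        resp = requests.get(url, timeout=30)
        cur.execute(
            "INSERT INTO pages (batch_id, url, html) VALUES (%s, %s, %s)",
            (batch_id, url, resp.text),
        )
    conn.commit()
```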
Now, the easiest answer is of course to do the scraping at the same time as the download, but I don't think that lends itself to modular design very well, and I'd like to be able to grow this process a bit more.
Let me give you some examples of the problems I keep running into. For 50k pages (rows) I have around a 6 GB database. Remember, we are storing the ENTIRE webpage in one column, extracting the relevant data from it, and storing that in a different column.
Throwing an index on the table can take 7-10 minutes on a quad core with 6 GB of RAM. God forbid you screw something up and watch mysqld jump to 70% CPU and eat ALL of your RAM. This is why I have step 4 -- before every operation I throw an index on the column I'm about to use -- so if I want to grab metas, I'd index, say, the title column and then update each row where title is not null.
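The pattern for step 4 is roughly this (continuing with the conn/cur and the hypothetical pages table from the sketch above):

```python
batch_id = 1  # whichever scraping session is being reprocessed

# Before a pass, index the column that pass will filter on, so the WHERE
# clause can use the index instead of scanning the whole 6 GB table.
cur.execute("ALTER TABLE pages ADD INDEX idx_title (title)")

# The pass then only touches the rows the index narrows down, e.g.:
cur.execute(
    "SELECT id, html FROM pages WHERE title IS NOT NULL AND batch_id = %s",
    (batch_id,),
)
for page_id, html in cur:
    pass  # re-scrape / scrub this page, then UPDATE its columns by id
```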
I should state that I do NOT do all the rows in one go -- that tends to really screw me over, as it should; you are loading 6 GB into memory. ;)
I suppose the solution to this is to grab a total count and then iterate through with an offset of 100 or so rows at a time.
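Something like this is what I mean -- again just a sketch against the same hypothetical pages table:

```python
CHUNK = 100
batch_id = 1

# Grab the total once, then walk the table in LIMIT/OFFSET chunks so only
# ~100 full pages are ever in memory at a time.
cur.execute("SELECT COUNT(*) FROM pages WHERE batch_id = %s", (batch_id,))
(total,) = cur.fetchone()

for offset in range(0, total, CHUNK):
    cur.execute(
        "SELECT id, html FROM pages WHERE batch_id = %s "
        "ORDER BY id LIMIT %s OFFSET %s",
        (batch_id, CHUNK, offset),
    )
    rows = cur.fetchall()
    for page_id, html in rows:
        pass  # run the XPath extraction on html, UPDATE the extracted columns
    conn.commit()
```

(I realize seeking on the primary key with WHERE id > last_id would avoid the increasingly expensive OFFSET scans, but the shape is the same.)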
Still -- I think there are some storage problems here as well. Should I be storing the original webpages on the filesystem instead? I have also thought about storing the pages in a document-oriented database such as MongoDB or CouchDB.
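For what it's worth, the filesystem variant I'm picturing would just keep a pointer in MySQL and push the raw bytes to disk -- something like this (the path and the html_path column are made up for the example, and it reuses conn/cur from above):

```python
import hashlib
import pathlib

STORAGE_ROOT = pathlib.Path("/var/scraping/pages")  # made-up location

def store_page_on_disk(batch_id, url, html):
    """Write the raw page to disk and keep only its path in MySQL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = STORAGE_ROOT / str(batch_id) / (digest + ".html")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    cur.execute(
        "INSERT INTO pages (batch_id, url, html_path) VALUES (%s, %s, %s)",
        (batch_id, url, str(path)),
    )
    conn.commit()
```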
EDIT: Just to be clear here -- any solution presented should take into account the fact that 50k pages is just ONE BATCH from ONE USER. I'm not trying to support multiple users quite yet, but I would like the ability to store more than a couple of batches at a time.