I've been doing a lot of parsing of webpages lately and my process usually looks something like this:

  1. Obtain a list of links to parse
  2. Import the list into a database
  3. Download the entire webpage for each link and store it in MySQL
  4. Add an index for each scraping session
  5. Scrape the relevant sections (content, metas, whatever)
  6. Steps 4-5 -- rinse/repeat -- as it is common to want to scrape different content from the same page later on, modify your XPath, or scrub said content (see the extraction sketch after this list)
  7. Export the scraping database to the real database and remove the webpage column and scraping indexes
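
To make step 5 concrete, one extraction pass looks roughly like this for me; lxml and the particular fields pulled out here are just an illustration, not a fixed design:

    # Rough sketch of one extraction pass over a stored page (step 5).
    # lxml and the chosen fields are illustrative only.
    import lxml.html

    def extract_fields(html):
        doc = lxml.html.fromstring(html)
        title = doc.findtext('.//title') or ''
        metas = {m.get('name'): m.get('content')
                 for m in doc.xpath('//meta[@name and @content]')}
        paragraphs = [p.text_content().strip() for p in doc.xpath('//p')]
        return title, metas, paragraphs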

Now, the easiest answer is of course to do the scraping at the same time that you download the webpage, but I don't think that lends itself to modular design very well, as I'd like to be able to grow this process a bit more.

Let me give you some examples of the problems I keep running into: for 50k pages (rows) I have around a 6 GB database. Remember, we are storing the ENTIRE webpage in one column, extracting the relevant data from it, and storing that in a different column.

Throwing an index on the table can take 7-10 minutes on a quad core with 6 GB of RAM. God forbid you screw something up and watch mysqld jump to 70% CPU and ALL of your RAM. This is why I have step 4 -- before every operation I'll throw an index on the column involved -- so if I want to grab metas I'd throw an index on, say, the title column and then update each row where title is not null.

I should state that I do NOT do all rows in one go -- that tends to really screw me over badly -- as it should -- you are loading 6 GB into memory. ;)

I suppose the solution to this problem is to grab a total count and then iterate through with an offset of 100 or so at a time.
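
Something along these lines is what I have in mind (MySQLdb, the connection details, and the `pages` table here are just placeholders, not what I've settled on):

    # Count-then-offset iteration so only ~100 rows are in memory at a time.
    import MySQLdb

    BATCH = 100

    db = MySQLdb.connect(host="localhost", user="scraper",
                         passwd="secret", db="scraping")
    cur = db.cursor()

    cur.execute("SELECT COUNT(*) FROM pages")
    (total,) = cur.fetchone()

    for offset in range(0, total, BATCH):
        cur.execute("SELECT id, webpage FROM pages ORDER BY id "
                    "LIMIT %d OFFSET %d" % (BATCH, offset))
        for page_id, html in cur.fetchall():
            pass  # run the extraction here and UPDATE the row with the result
        db.commit()

(Deep OFFSETs get slow on big tables, so walking the primary key instead -- WHERE id > last_seen_id LIMIT 100 -- would probably scale better.)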

Still -- I think there are some storage problems here as well. Should I be storing the original webpages on the filesystem instead? I have thought about storing the pages in a document-oriented database such as Mongo or Couch.

EDIT Just to be clear here -- any solution presented should take into account the fact that 50k pages is just ONE BATCH by ONE USER. I'm not trying to have multiple users quite yet but I would like the ability to store more than a couple of batches at a time.

A: 

You could use an existing web crawler such as wget or one of the many others. It can download the files to the hard disk, and you can then parse the files and store information about them in the database.
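
For example, something like this (the links file, the output directory, and the wait interval are just placeholders):

    # Drive wget for one batch of links; -i reads the URLs from a file,
    # -x recreates each site's directory structure, -P sets an output prefix.
    import subprocess

    def fetch_batch(links_file, out_dir):
        subprocess.check_call([
            "wget",
            "-x",              # --force-directories
            "-P", out_dir,     # --directory-prefix=PREFIX
            "-i", links_file,  # read the list of links from a file
            "--wait=1",        # be polite to the sites being crawled
        ])

    fetch_batch("job1_links.txt", "pages/user1/job1")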

Mark Byers
you are still running into problems involving the number of files you can store in any given directory -- so if we went this route you'd probably want to store X number of pages in each directory and reference the directories by your unique indices or whatever, right? -- btw, not sure if I made it clear, but 50k pages is just ONE batch -- I want to store hundreds or thousands of these at a time
feydr
wget has numerous options regarding directory structure, e.g. `-x, --force-directories` (force creation of directories), `-P, --directory-prefix=PREFIX` (save files to `PREFIX/...`), etc.
Mark Byers
+2  A: 

Why don't you add the index to the table BEFORE inserting your data? This way the index is built as the rows are added to the table.
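
For example (the table layout here is invented, just to show the index being declared up front; this assumes MySQLdb):

    # Declare the index when the table is created so it is maintained as the
    # rows are inserted, instead of ALTER TABLE-ing a 6 GB table afterwards.
    import MySQLdb

    db = MySQLdb.connect(host="localhost", user="scraper",
                         passwd="secret", db="scraping")
    cur = db.cursor()
    cur.execute("""
        CREATE TABLE pages (
            id      INT AUTO_INCREMENT PRIMARY KEY,
            url     VARCHAR(255) NOT NULL,
            title   VARCHAR(255),
            webpage LONGTEXT,
            INDEX idx_title (title)
        )
    """)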

Mike Sherov
this works for common things that you KNOW you are going to get, but not if you don't know what you will be parsing... -- I guess using this advice what we'd want to do is create a generic table for 'content' and have a 'type-of' column (paragraph1, table2, heading3, etc.) instead of storing it all in one table like we are doing now
feydr
@feydr, yes, or at least maintain an index on the content you are most likely to parse. You don't have to index ALL cases to get the benefit for the most likely ones.
Mike Sherov
+1  A: 

If you have more hardware to throw at the problem, you can start distributing your database over multiple servers via sharding.

I would also suggest you consider removing useless information from the webpages you're capturing (e.g. page structure tags, JavaScript, styling, etc.), and perhaps compressing the results if appropriate.
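
For instance, something along these lines (BeautifulSoup is just one way to do the trimming; the function is illustrative):

    # Strip tags the scraper will never care about, then compress what's left
    # before it goes to disk or the database.
    import gzip
    from bs4 import BeautifulSoup

    def slim_and_compress(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        return gzip.compress(str(soup).encode("utf-8"))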

Dolph
gzip compression is in effect when we pull down the webpage -- as for removing 'useless information' when you are capturing it -- I think that kinda goes against what I'm trying to do -- I'd like to retain all the information for later on -- sometimes it's helpful to go back and do more extractions on the same dataset -- think of someone extracting a bit of data today and then some more 2 weeks from now on the same stuff
feydr
also, on this note -- I'm not trying to download the source multiple times as it takes time for me and is not nice to the sites I'm pulling from
feydr
I was only supposing that you could identify portions of the page that you knew in advance you would never be interested in, such as JavaScript, styles, etc. Furthermore, without knowing your intentions, I doubt tags such as `<br />` and `<hr />`, line breaks, etc would have any long term value. Eliminating this could significantly reduce your storage requirements as you scale up.
Dolph
Also, I doubt anyone would mind if you crawl periodically to index, as long as you throttle the request rate quite low.
Dolph
A: 

Thanks for helping me think this out everyone!

I'm going to try a hybrid approach here:

1) Pull down pages to a tree structure on the filesystem.

2) Put content into a generic content table that does not contain any full webpages (this means that our average 63 KB column is now maybe a tenth of a KB).

THE DETAILS

1) My tree structure for housing the webpages will look like this:

|-- usr_id1k
|   |-- user1
|   |   |-- job1
|   |   |   |-- pg_id1k
|   |   |   |   |-- p1
|   |   |   |   |-- p2
|   |   |   |   `-- p3
|   |   |   |-- pg_id2k
|   |   |   `-- pg_id3k
|   |   |-- job2
|   |   `-- job3
|   |-- user2
|   `-- user3
|-- usr_id2k
`-- usr_id3k
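
In code, the mapping from ids to paths will be something like this (I'm reading the "1k" directories as buckets of a thousand ids so no single directory gets too crowded; the exact scheme may still change):

    # Map (user, job, page) ids onto the bucketed tree shown above.
    import os

    def page_path(root, user_id, job_id, page_id):
        user_bucket = "usr_id%dk" % (user_id // 1000 + 1)
        page_bucket = "pg_id%dk" % (page_id // 1000 + 1)
        return os.path.join(root, user_bucket, "user%d" % user_id,
                            "job%d" % job_id, page_bucket, "p%d" % page_id)

    # e.g. page_path("pages", 1, 1, 2) -> pages/usr_id1k/user1/job1/pg_id1k/p2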

2) Instead of creating a table for each 'job' and then exporting it, we'll have a couple of different tables -- the primary one being a 'content' table:

content_type, Integer # fkey to content_types table
user_id, Integer # fkey to users table
content, Text # actual content, no full webpages

.... other stuff like created_at, updated_at, perms, etc...
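
Roughly, in MySQL terms (the exact types, the id columns on the referenced tables, and the housekeeping columns are still guesses):

    # Rough DDL for the content table; assumes MySQLdb and that content_types
    # and users both have an `id` primary key.
    import MySQLdb

    db = MySQLdb.connect(host="localhost", user="scraper",
                         passwd="secret", db="scraping")
    cur = db.cursor()
    cur.execute("""
        CREATE TABLE content (
            id           INT AUTO_INCREMENT PRIMARY KEY,
            content_type INT NOT NULL,  -- fkey to content_types table
            user_id      INT NOT NULL,  -- fkey to users table
            content      TEXT,          -- actual content, no full webpages
            created_at   DATETIME,
            updated_at   DATETIME,
            FOREIGN KEY (content_type) REFERENCES content_types(id),
            FOREIGN KEY (user_id) REFERENCES users(id)
        )
    """)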

feydr