I've been doing a lot of parsing of webpages lately and my process usually looks something like this:

  1. Obtain a list of links to parse
  2. Import the list into a database
  3. Download the entire webpage for each link and store it in MySQL
  4. Add an index for each scraping session
  5. Scrape the relevant sections (content, metas, whatever)
  6. Steps 4-5 -- rinse/repeat -- as it is common to want to scrape different content from the same page later on, modify your XPath, or scrub said content (see the extraction sketch after this list)
  7. Export the scraping database to the real database and remove the webpage column and scraping indexes
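
To make step 5 concrete, one extraction pass looks roughly like this for me; lxml and the particular fields pulled out here are just an illustration, not a fixed design:

    # Rough sketch of one extraction pass over a stored page (step 5).
    # lxml and the chosen fields are illustrative only.
    import lxml.html

    def extract_fields(html):
        doc = lxml.html.fromstring(html)
        title = doc.findtext('.//title') or ''
        metas = {m.get('name'): m.get('content')
                 for m in doc.xpath('//meta[@name and @content]')}
        paragraphs = [p.text_content().strip() for p in doc.xpath('//p')]
        return title, metas, paragraphs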

Now, the easiest answer is of course to do the scraping at the same time that you download the webpage, but I don't think that lends itself to modular design very well, as I'd like to be able to grow this process a bit more.

Let me give you some examples of the problems I keep running into: for 50k pages (rows) I have around a 6 GB database. Remember, we are storing the ENTIRE webpage in one column, extracting the relevant data from it, and storing that in a different column.

Throwing an index on the table can take 7-10 minutes on a quad core with 6 GB of RAM. God forbid you screw something up and watch mysqld jump to 70% CPU and ALL of your RAM. This is why I have step 4 -- before every operation I'll throw an index on the column involved -- so if I want to grab metas I'd throw an index on, say, the title column and then update each row where title is not null.

I should state that I do NOT do all rows in one go -- that tends to really screw me over badly -- as it should -- you are loading 6 GB into memory. ;)

I suppose the solution to this problem is to grab a total count and then iterate through with an offset of 100 or so at a time.
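
Something along these lines is what I have in mind (MySQLdb, the connection details, and the `pages` table here are just placeholders, not what I've settled on):

    # Count-then-offset iteration so only ~100 rows are in memory at a time.
    import MySQLdb

    BATCH = 100

    db = MySQLdb.connect(host="localhost", user="scraper",
                         passwd="secret", db="scraping")
    cur = db.cursor()

    cur.execute("SELECT COUNT(*) FROM pages")
    (total,) = cur.fetchone()

    for offset in range(0, total, BATCH):
        cur.execute("SELECT id, webpage FROM pages ORDER BY id "
                    "LIMIT %d OFFSET %d" % (BATCH, offset))
        for page_id, html in cur.fetchall():
            pass  # run the extraction here and UPDATE the row with the result
        db.commit()

(Deep OFFSETs get slow on big tables, so walking the primary key instead -- WHERE id > last_seen_id LIMIT 100 -- would probably scale better.)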

Still -- I think there are some storage problems here as well. Should I be storing the original webpages on the filesystem instead? I have thought about storing the pages in a document-oriented database such as Mongo or Couch.

EDIT Just to be clear here -- any solution presented should take into account the fact that 50k pages is just ONE BATCH by ONE USER. I'm not trying to have multiple users quite yet but I would like the ability to store more than a couple of batches at a time.

A: 

You could use an existing web crawler such as wget or one of the many others. It can download the files to the hard disk, and you can then parse the files and store information about them in the database.
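
For example, something like this (the links file, the output directory, and the wait interval are just placeholders):

    # Drive wget for one batch of links; -i reads the URLs from a file,
    # -x recreates each site's directory structure, -P sets an output prefix.
    import subprocess

    def fetch_batch(links_file, out_dir):
        subprocess.check_call([
            "wget",
            "-x",              # --force-directories
            "-P", out_dir,     # --directory-prefix=PREFIX
            "-i", links_file,  # read the list of links from a file
            "--wait=1",        # be polite to the sites being crawled
        ])

    fetch_batch("job1_links.txt", "pages/user1/job1")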

Mark Byers
you are still running into problems involving the number of files you can store in any given directory -- so if we went this route you'd probably want to store X number of pages in each directory and reference the directories by your unique indices or whatever, right? -- btw, not sure if I made it clear, but 50k pages is just ONE batch -- I want to store hundreds or thousands of these at a time
feydr
wget has numerous options regarding directory structure, e.g. `-x, --force-directories` (force creation of directories), `-P, --directory-prefix=PREFIX` (save files to `PREFIX/...`), etc.
Mark Byers
+2  A: 

Why don't you add the index to the table BEFORE inserting your data? This way the index is built as the rows are added to the table.
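
For example (the table layout here is invented, just to show the index being declared up front; this assumes MySQLdb):

    # Declare the index when the table is created so it is maintained as the
    # rows are inserted, instead of ALTER TABLE-ing a 6 GB table afterwards.
    import MySQLdb

    db = MySQLdb.connect(host="localhost", user="scraper",
                         passwd="secret", db="scraping")
    cur = db.cursor()
    cur.execute("""
        CREATE TABLE pages (
            id      INT AUTO_INCREMENT PRIMARY KEY,
            url     VARCHAR(255) NOT NULL,
            title   VARCHAR(255),
            webpage LONGTEXT,
            INDEX idx_title (title)
        )
    """)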

Mike Sherov
this works for common things that you KNOW you are going to get, but not if you don't know what you will be parsing... -- I guess using this advice what we'd want to do is create a generic table for 'content' and have a 'type-of' column (paragraph1, table2, heading3, etc.) instead of storing it all in one table like we are doing now
feydr
@feydr, yes, or at least maintain an index on the content you are most likely to parse. You don't have to index ALL cases to get the benefit for the most likely ones.
Mike Sherov
+1  A: 

If you have more hardware to throw at the problem, you can start distributing your database over multiple servers via sharding.

I would also suggest you consider removing useless information from the webpages you're capturing (e.g. page structure tags, JavaScript, styling, etc.), and perhaps compressing the results if appropriate.
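
For instance, something along these lines (BeautifulSoup is just one way to do the trimming; the function is illustrative):

    # Strip tags the scraper will never care about, then compress what's left
    # before it goes to disk or the database.
    import gzip
    from bs4 import BeautifulSoup

    def slim_and_compress(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        return gzip.compress(str(soup).encode("utf-8"))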

Dolph
gzip compression is in effect when we pull down the webpage -- as for removing 'useless information' when you are capturing it -- I think that kinda goes against what I'm trying to do -- I'd like to retain all the information for later on -- sometimes it's helpful to go back and do more extractions on the same dataset -- think of someone extracting a bit of data today and then some more 2 weeks from now on the same stuff
feydr
also, on this note -- I'm not trying to download the source multiple times as it takes time for me and is not nice to the sites I'm pulling from
feydr
I was only supposing that you could identify portions of the page that you knew in advance you would never be interested in, such as JavaScript, styles, etc. Furthermore, without knowing your intentions, I doubt tags such as `<br />` and `<hr />`, line breaks, etc would have any long term value. Eliminating this could significantly reduce your storage requirements as you scale up.
Dolph
Also, I doubt anyone would mind if you crawl periodically to index, as long as you throttle the request rate quite low.
Dolph
A: 

Thanks for helping me think this out everyone!

I'm going to try a hybrid approach here:

1) Pull down pages to a tree structure on the filesystem.

2) Put content into a generic content table that does not contain any full webpages (this means that our average 63 KB column is now maybe a tenth of a KB).

THE DETAILS

1) My tree structure for housing the webpages will look like this:

|-- usr_id1k
|   |-- user1
|   |   |-- job1
|   |   |   |-- pg_id1k
|   |   |   |   |-- p1
|   |   |   |   |-- p2
|   |   |   |   `-- p3
|   |   |   |-- pg_id2k
|   |   |   `-- pg_id3k
|   |   |-- job2
|   |   `-- job3
|   |-- user2
|   `-- user3
|-- usr_id2k
`-- usr_id3k
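
In code, the mapping from ids to paths will be something like this (I'm reading the "1k" directories as buckets of a thousand ids so no single directory gets too crowded; the exact scheme may still change):

    # Map (user, job, page) ids onto the bucketed tree shown above.
    import os

    def page_path(root, user_id, job_id, page_id):
        user_bucket = "usr_id%dk" % (user_id // 1000 + 1)
        page_bucket = "pg_id%dk" % (page_id // 1000 + 1)
        return os.path.join(root, user_bucket, "user%d" % user_id,
                            "job%d" % job_id, page_bucket, "p%d" % page_id)

    # e.g. page_path("pages", 1, 1, 2) -> pages/usr_id1k/user1/job1/pg_id1k/p2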

2) Instead of creating a table for each 'job' and then exporting it, we'll have a couple of different tables -- the primary one being a 'content' table:

content_type, Integer # fkey to content_types table
user_id, Integer # fkey to users table
content, Text # actual content, no full webpages

.... other stuff like created_at, updated_at, perms, etc...
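
Roughly, in MySQL terms (the exact types, the id columns on the referenced tables, and the housekeeping columns are still guesses):

    # Rough DDL for the content table; assumes MySQLdb and that content_types
    # and users both have an `id` primary key.
    import MySQLdb

    db = MySQLdb.connect(host="localhost", user="scraper",
                         passwd="secret", db="scraping")
    cur = db.cursor()
    cur.execute("""
        CREATE TABLE content (
            id           INT AUTO_INCREMENT PRIMARY KEY,
            content_type INT NOT NULL,  -- fkey to content_types table
            user_id      INT NOT NULL,  -- fkey to users table
            content      TEXT,          -- actual content, no full webpages
            created_at   DATETIME,
            updated_at   DATETIME,
            FOREIGN KEY (content_type) REFERENCES content_types(id),
            FOREIGN KEY (user_id) REFERENCES users(id)
        )
    """)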

feydr