views: 67

answers: 3
I'm building, or at least trying to build, a simple web crawler that extracts data according to predefined rules.

After reading posts here, I have based my spider on libxml2, curl, and C++. Now I would like to learn about web spider and data extraction algorithms, if there are any. What should I learn?

A: 

In order to adequately traverse something like a web site, you should use recursion.

amphetamachine
No man, not recursion.
A: 

Your question is rather vague, so this is all I can offer in the way of help:

No one will be able to write your web crawler program for you. You need to break the project down into steps, then come back to Stack Overflow and ask how to solve ONE of those steps IF you are stuck. But you need to have a good go at it yourself FIRST.


Unless you want to code your web spider (do we need another one of those?) and "data extraction" application from scratch, you probably want to learn a framework that has already solved these problems for you.

That said, I don't know of any that exist, probably because the only people who do this are highly specialised web search companies and spammers. It's not mainstream enough for anyone to write a framework for, but I'll bet someone smarter than me knows of someone who has actually done it.

Brock Woolf
I don't want someone to write my program, no way. I just need to read some implementation ideas.
+1  A: 

I have a good cache of academia here: http://arachnode.net/media/8/default.aspx

Also, read the Wiki on the subject: http://en.wikipedia.org/wiki/Web_crawler

Finally, you definitely do not want to use recursion. :) Imagine recursing into a site with a depth of 1000 and millions of pages. You will likely run out of stack space.

arachnode.net