Last year I dabbled in a bit of Perl programming. The first thing I wrote was a simple script that took a web page and counted how many times a word or name appeared on that page. I refer to this as "crawling"; is that correct? I was wondering if this is a native process in other languages like PHP and ROR. Essentially I want to build my own "API" for a site without a public "API", and possibly pass in the keywords dynamically from another site's "API" (just for reading and organizing public data). Sorry for the high level of abstraction; my head has just been in the clouds lately.

+2  A: 

Your problem is very tractable, and in fact many people and companies have solved it already, but alas you are still a long way off. Loosely speaking, "crawling" usually refers to a breadth-first or depth-first search of the internet, using the anchor tags in HTML pages as the edges between nodes.
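To make the graph idea concrete, here is a minimal breadth-first sketch in Python using only the standard library. The `crawl` function, the `LinkExtractor` class, and the in-memory `PAGES` dictionary are names I made up for illustration; a real crawler would pass in a function that fetches over HTTP and would also need to respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first walk; returns URLs in the order they were visited.

    `fetch` is a caller-supplied function mapping a URL to HTML text,
    or None if the page cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical three-page site held in memory so the sketch runs offline.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">missing</a>',
}

order = crawl("http://example.test/", PAGES.get)
print(order)
```

Swapping `queue.popleft()` for `queue.pop()` would turn this into a depth-first walk. The `seen` set is what keeps the crawler from looping when pages link back to each other.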

What you did in Perl was basically just a search of an HTML string.
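For comparison, a plain string search looks roughly like the Python sketch below. I have not seen the original Perl script, so `count_word` and the sample `page` are my own stand-ins for what it probably did.

```python
import re


def count_word(html, word):
    """Count whole-word, case-insensitive matches in the raw HTML string."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, html, flags=re.IGNORECASE))


# Made-up page: the word appears twice in the text and once in an attribute.
page = '<html><body><p class="perl">Perl is fun. I like perl.</p></body></html>'
print(count_word(page, "perl"))  # prints 3
```

Note that the count is 3 even though a reader of the page sees the word only twice, because the search also matches the `class` attribute. Searching the raw string cannot tell markup from content.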

For your API I would suggest finding a DOM parser, so that you don't have to parse HTML strings by hand and deal with the errors that approach inevitably produces.
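As a rough illustration of why a parser helps, the Python sketch below counts a keyword in the visible text only. It uses the standard library's `html.parser`, which is an event-based parser rather than a true DOM parser, so treat it as a stand-in for whatever DOM library your language offers. `TextExtractor`, `count_visible`, and the sample `page` are hypothetical names of my own.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_visible(html, word):
    """Count whole-word, case-insensitive matches in the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up page: the keyword also hides in an attribute and in a script.
page = (
    '<p class="perl">Perl is fun.</p>'
    "<script>var perl = 1;</script>"
    "<p>I like perl.</p>"
)
print(count_visible(page, "perl"))  # prints 2
```

The attribute value and the script body are ignored here, so only the two occurrences a reader would actually see are counted.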

A few years back I wanted to generate some data on apartment prices in regions of Massachusetts, so I wrote a bit of a crawler to extract all of the apartment listings on craigslist and toss them in a DB.

If anyone is interested I can go on, but it's outside the scope of this answer.

Oh yeah, and it was in PHP...

umassthrower
Just looked at my code and I used the native "DOMDocument" class: http://php.net/DOMDocument
umassthrower
Here is the text of the class I mentioned. I don't remember if I took care in writing this, and there are a few places where I hardcoded things, but I would expect this to be a good example to get you started. http://jeffreyjason.com/Craigslist.class.php.txt http://jeffreyjason.com/HTMLParser.class.php.txt Note: I'm not doing any posting of data here, strictly GETs.
umassthrower
Thanks a lot, I think I understand it a lot better now.
ThomasReggi
+2  A: 

If I understand correctly, you want to take a URL, pass it to your program, and have it crawl the site looking for user-supplied keywords?

If that is correct, then no, this is not a native process for ANY language and you will have to write the necessary logic yourself.

For each language/framework (and please note, ROR is not a language, it is a framework built on the Ruby language) there are tools that can assist you (for example, in Ruby you should look at the Nokogiri gem to parse the HTML), but you will have to supply the bulk of the logic yourself.

It is not a very hard thing to do but it will take some time and effort. Best of luck to you.

sosborn