views: 319
answers: 5

Hi,

I am doing a hobby project to scrape the content of an ASP.NET website using Ruby, PHP, or Java. For example, if the website URL is "www.myaspnet.com/home.aspx", I would like to extract the Unicode text content from home.aspx and save it to a text file that I can open in Notepad. Are there any libraries available in any of the above languages? Can anybody help me by pointing to suitable resources?

Thanks!

+1  A: 

If you are using PHP, you should be able to do this without too much difficulty: you can use cURL to fetch the content, and SimpleXML and DOM to navigate the HTML.

You might also want to take a look at XPath for extracting content easily.
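
For illustration, here is a minimal sketch of that approach, assuming the page parses with DOMDocument's lenient HTML loader (the URL is the example one from the question):

// Fetch the page with cURL.
$ch = curl_init('http://www.myaspnet.com/home.aspx');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML; libxml tolerates most real-world markup.
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings about sloppy HTML

// Use XPath to pull out just the text nodes inside <body>.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//body//text()') as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        echo $text, "\n";
    }
}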

RageZ
+1 for using curl
Pascal Thivent
+1  A: 

Sounds like you just want to scrape content.

You don't "paste it into Notepad"; you write it to a plain text file (which you can then open in Notepad, or any other program that can read a text file).

Generally:

$content = file_get_contents('http://example.com/url.asp');
// Do some stuff to extract what you want and format it,
// probably using SimpleXML or regular expressions.
$fp = fopen('some_file.txt', 'w'); // open for writing
fwrite($fp, $stuff);
fclose($fp);

And you're done.

timdev
As a sidenote, if you DO choose to use a regular expression, make sure to make it flexible, i.e., use `"\s+"` instead of a bunch of literal spaces. It's very easy for people to intentionally, or even unintentionally, screw up a rigid regular expression.
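
As a hypothetical illustration of the difference (the pattern and input here are made up):

$html = '<span>hello   world</span>';
// Rigid: fails because the page has three spaces instead of one.
preg_match('/hello world/', strip_tags($html), $m1);   // no match
// Flexible: \s+ tolerates any run of whitespace.
preg_match('/hello\s+world/', strip_tags($html), $m2); // matches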
Matthew Scharley
Indeed, regexes are dirty, and you'll get sick and have to go to a doctor and take care of it. Infinitely better to parse the output properly.
timdev
@tim Thanks for your answer. I tried your script, but I am getting the HTML content of the page, which can also be done using cURL. My requirement is to scrape only the Unicode text content, i.e. just the text, without the HTML tags.
YetAnotherCoder
Right, once you get the HTML content, you'll need to extract and process the stuff you want. As others have pointed out, you can do this in a variety of ways using SimpleXML, DOM, or XPath. If your needs are very simple, you could probably just pass the entire HTML contents into strip_tags() and take it from there.
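
A minimal sketch of that simple route (the file name is illustrative):

$content = file_get_contents('http://example.com/url.asp');
// strip_tags() removes all the HTML tags, leaving just the text.
$text = strip_tags($content);
// Collapse leftover runs of whitespace so the output is readable.
$text = preg_replace('/\s+/', ' ', $text);
file_put_contents('some_file.txt', $text);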
timdev
+1  A: 

If you need some kind of screen scraping and you favor Ruby, then you can check out scRubyt or watch this screencast by Ryan Bates.

khelll
A: 

With Java you could write a simple utility using HttpUnit to extract the text from a page, then follow each of the links presented on the page, visiting them recursively and performing the same task.

It should be a fairly trivial exercise; external links need to be treated with some caution, though, depending on how far you want to go with indexing.
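
Since the question also allows PHP, here is a rough sketch of the same recursive idea in PHP (fetch a page, dump its text, follow links on the same host); the depth limit and host check are assumptions added to keep the crawl bounded, and relative links are skipped for simplicity:

function crawl($url, $depth, &$seen) {
    if ($depth <= 0 || isset($seen[$url])) return;
    $seen[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) return;

    // Dump the visible text of this page.
    echo strip_tags($html), "\n";

    // Follow absolute links on the same host only.
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $host = parse_url($url, PHP_URL_HOST);
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (parse_url($href, PHP_URL_HOST) === $host) {
            crawl($href, $depth - 1, $seen);
        }
    }
}

$seen = array();
crawl('http://www.myaspnet.com/home.aspx', 2, $seen);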

Jon
A: 

This screencast basically shows how to scrape a website step by step in Ruby. It uses the output in a Rails application, but it should be easy to focus on just the Ruby library (ScrAPI).

jhc_