views: 319
answers: 5

Hi,

I am doing a hobby project to scrape the content of an ASP.NET website using Ruby, PHP, or Java. For example, if the website URL is "www.myaspnet.com/home.aspx", I would like to extract the Unicode text content from home.aspx and save it to a text file that I can open in Notepad. Are there any libraries available in any of the above languages? Can anybody help me by pointing to suitable resources?

Thanks!

+1  A: 

If you are using PHP, you should be able to do this without too much difficulty: you can use cURL to fetch the content, and SimpleXML and DOM to navigate the HTML.

You might also want to take a look at XPath for extracting content easily.
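
For illustration, here is a minimal sketch of that approach, assuming the page parses with DOMDocument's lenient HTML loader (the URL is the example one from the question):

// Fetch the page with cURL.
$ch = curl_init('http://www.myaspnet.com/home.aspx');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML; libxml tolerates most real-world markup.
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings about sloppy HTML

// Use XPath to pull out just the text nodes inside <body>.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//body//text()') as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        echo $text, "\n";
    }
}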

RageZ
+1 for using curl
Pascal Thivent
+1  A: 

Sounds like you just want to scrape content.

You don't "paste it into Notepad"; you write it to a plain text file (which you can then open in Notepad, or any other program that can read a text file).

Generally:

$content = file_get_contents('http://example.com/url.asp');
// Do some stuff to extract what you want and format it,
// probably using SimpleXML or regular expressions.
$fp = fopen('some_file.txt', 'w'); // open for writing
fwrite($fp, $stuff);
fclose($fp);

And you're done.

timdev
As a sidenote, if you DO choose to use a regular expression, make sure to make it flexible, i.e., use `"\s+"` instead of a bunch of literal spaces. It's very easy for people to intentionally, or even unintentionally, screw up a rigid regular expression.
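
As a hypothetical illustration of the difference (the pattern and input here are made up):

$html = '<span>hello   world</span>';
// Rigid: fails because the page has three spaces instead of one.
preg_match('/hello world/', strip_tags($html), $m1);   // no match
// Flexible: \s+ tolerates any run of whitespace.
preg_match('/hello\s+world/', strip_tags($html), $m2); // matches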
Matthew Scharley
Indeed, regexes are dirty, and you'll get sick and have to go to a doctor and take care of it. Infinitely better to parse the output properly.
timdev
@tim Thanks for your answer. I tried your script, but I am getting the HTML content of the page, which can also be done using cURL. My requirement is to scrape only the Unicode text content, i.e. just the text, without the HTML tags.
YetAnotherCoder
Right, once you get the HTML content, you'll need to extract and process the stuff you want. As others have pointed out, you can do this in a variety of ways using SimpleXML, DOM, or XPath. If your needs are very simple, you could probably just pass the entire HTML contents into strip_tags() and take it from there.
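
A minimal sketch of that simple route (the file name is illustrative):

$content = file_get_contents('http://example.com/url.asp');
// strip_tags() removes all the HTML tags, leaving just the text.
$text = strip_tags($content);
// Collapse leftover runs of whitespace so the output is readable.
$text = preg_replace('/\s+/', ' ', $text);
file_put_contents('some_file.txt', $text);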
timdev
+1  A: 

If you need some kind of screen scraping and you favor Ruby, then you can check out scRubyt or watch this screencast by Ryan Bates.

khelll
A: 

With Java you could write a simple utility using HttpUnit to extract the text from a page, then follow each of the links presented on the page, visiting them recursively and performing the same task.

It should be a fairly trivial exercise; external links need to be treated with some caution, though, depending on how far you want to go with indexing.
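
Since the question also allows PHP, here is a rough sketch of the same recursive idea in PHP (fetch a page, dump its text, follow links on the same host); the depth limit and host check are assumptions added to keep the crawl bounded, and relative links are skipped for simplicity:

function crawl($url, $depth, &$seen) {
    if ($depth <= 0 || isset($seen[$url])) return;
    $seen[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) return;

    // Dump the visible text of this page.
    echo strip_tags($html), "\n";

    // Follow absolute links on the same host only.
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $host = parse_url($url, PHP_URL_HOST);
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (parse_url($href, PHP_URL_HOST) === $host) {
            crawl($href, $depth - 1, $seen);
        }
    }
}

$seen = array();
crawl('http://www.myaspnet.com/home.aspx', 2, $seen);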

Jon
A: 

This screencast basically shows how to scrape a website step by step in Ruby. It uses the output in a Rails application, but it should be easy to focus on just the Ruby library (ScrAPI).

jhc_