views:

99

answers:

4

How can I easily extract only the content that is embedded in or linked from HTML pages (such as img, pdf, flv, doc, rtf, wmc, etc.), but not CSS, CSS background images, or JavaScript?

I'm migrating content from an old site to a new site, and I need to re-upload all images, linked PDF and FLV files, etc.

A: 

For that, you need an HTML parser. In Perl, there is the HTML::Parser module.
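
To illustrate the parser approach, here is a minimal sketch in Python rather than Perl, using the standard library's `html.parser` module, which works in a similar event-driven way. The tag/attribute table, the extension list, and the names `AssetCollector` and `collect_assets` are assumptions made for this example, not part of the original answer.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch: which tag/attribute pairs count as
# embedded content, and which link extensions count as documents.
EMBED_ATTRS = {"img": "src", "embed": "src", "object": "data", "a": "href"}
DOC_EXTENSIONS = (".pdf", ".flv", ".doc", ".rtf")


class AssetCollector(HTMLParser):
    """Collect URLs of embedded or linked files, ignoring CSS and JavaScript."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_name = EMBED_ATTRS.get(tag)
        if attr_name is None:
            return  # <link>, <script>, <style> and all other tags are skipped
        url = dict(attrs).get(attr_name)
        if not url:
            return
        # For plain links, keep only those that point at a document file.
        if tag == "a" and not url.lower().endswith(DOC_EXTENSIONS):
            return
        self.assets.append(url)


def collect_assets(html):
    """Parse one HTML string and return the asset URLs in document order."""
    parser = AssetCollector()
    parser.feed(html)
    parser.close()
    return parser.assets
```

Calling `collect_assets(page_source)` on each old page should give a list of relative or absolute URLs to fetch and re-upload. Stylesheets and scripts never make it into the list because their tags are not in the table.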

Alan Haggai Alavi
+1  A: 

If you've used XHTML, you can use a normal XML parser.
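
As a rough sketch of what that could look like in Python, the standard `xml.etree.ElementTree` module can read well-formed XHTML directly. The function name `xhtml_image_sources` is made up for this example, and the code assumes the pages declare the usual XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Elements in an XHTML document live in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_image_sources(xhtml):
    """Return the src of every <img> in a well-formed XHTML document."""
    root = ET.fromstring(xhtml)  # raises ET.ParseError on malformed markup
    return [
        img.get("src")
        for img in root.iter(XHTML_NS + "img")
        if img.get("src")
    ]
```

An unclosed `<img>` or any other markup error makes `ET.fromstring` raise `ET.ParseError`, so this route only works when the old pages really are well-formed.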

r3zn1k
Just add "valid" before XHTML :)
Bozho
+1  A: 

The BeautifulSoup class in Python is a very good parser that is extremely handy for operations like this.
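
A minimal sketch of the process, assuming the third-party `beautifulsoup4` package is installed: parse each old page, pick out the image and document URLs, and resolve them against the page address so they can be downloaded. The function name `absolute_asset_urls` and the extension list are illustrative assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

DOC_EXTENSIONS = (".pdf", ".flv", ".doc", ".rtf")  # illustrative list


def absolute_asset_urls(html, page_url):
    """Return absolute URLs of images and linked documents on one page."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # Images: every <img> that actually has a src attribute.
    for img in soup.find_all("img", src=True):
        urls.append(urljoin(page_url, img["src"]))
    # Linked documents: <a href> whose target ends in a known extension.
    for link in soup.find_all("a", href=True):
        if link["href"].lower().endswith(DOC_EXTENSIONS):
            urls.append(urljoin(page_url, link["href"]))
    return urls
```

Only `src` and `href` attributes are read, so CSS background images in `style` attributes or stylesheets are ignored. The returned URLs can then be fetched and re-uploaded to the new site.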

Vincent Osinga
How can I do that? What is the process?
metal-gear-solid
Sorry, I don't understand your question.
Vincent Osinga
A: 
  1. You can use the Firebug add-on for Firefox to inspect pages (read-only).
  2. You can build a custom app using the HTML Agility Pack:
    http://www.codeplex.com/htmlagilitypack
Brij