views: 108
answers: 6

I want to extract specific data from a website's pages...

I don't want to get the full contents of a page; I only need a portion of it (for example, the data inside a table or a content_div), and I want to do this repeatedly across all the pages of the website.

How can i do that?

+1  A: 

Use curl to retrieve the content and XPath to select the individual elements.
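The curl + XPath approach can be sketched as follows. This is a rough Python equivalent rather than the answer's curl invocation; the sample HTML, the `content_div` id, and the table contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# In practice you would fetch the page first, e.g. with curl or:
#   import urllib.request
#   html = urllib.request.urlopen("http://example.com/page1").read().decode()
html = """
<html>
  <body>
    <div id="nav">skip this</div>
    <div id="content_div">
      <table>
        <tr><td>Alice</td><td>30</td></tr>
        <tr><td>Bob</td><td>25</td></tr>
      </table>
    </div>
  </body>
</html>
"""

root = ET.fromstring(html)
# XPath-style query for the one div we care about, then its table cells.
div = root.find(".//div[@id='content_div']")
cells = [td.text for td in div.iter("td")]
print(cells)  # ['Alice', '30', 'Bob', '25']
```

Note that ElementTree only accepts well-formed markup and a small XPath subset; on real-world HTML you would typically reach for lxml or a forgiving HTML parser instead.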

Be aware of copyright though.

Visage
For example, if I want to get images from a website matching a certain category, how can I do that?
kvijayhari
You could use Google Image Search and restrict the search to a site. It may or may not work; Google somehow has to tag the pictures into categories. This is also a hint.
Paul
A: 

You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos and substr.
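The answer names PHP's strstr/strpos/substr; the same marker-based extraction looks like this in Python using str.find and slicing. The markers and the sample page are invented for illustration — a real page needs its own start/end strings.

```python
def extract_between(haystack, start, end):
    """Return the text between the first `start` and the following `end`."""
    i = haystack.find(start)        # like strpos($haystack, $start)
    if i == -1:
        return ""
    i += len(start)
    j = haystack.find(end, i)       # search only after the start marker
    if j == -1:
        return ""
    return haystack[i:j]            # like substr($haystack, $i, $j - $i)

page = '<div id="content_div"><p>Hello, world</p></div>'
print(extract_between(page, "<p>", "</p>"))  # Hello, world
```

This is fast to write but brittle: any change in the site's markup breaks the markers, which is why the DOM/XPath answers are usually preferred.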

Sarfraz
A: 

There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked in the right places and logged the information into an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never released it.

I would recommend using RSS feeds to extract content.
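The point of the RSS suggestion is that a feed already gives you structured XML, so no screen scraping is needed. A minimal sketch, with an invented feed:

```python
import xml.etree.ElementTree as ET

# A real feed would be fetched from the site's /rss or /feed URL;
# this sample channel is made up for illustration.
rss = """
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>
"""

root = ET.fromstring(rss)
# Each <item> carries the data already separated into fields.
items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
print(items)
```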

Zeb
A: 

I think you need to implement something like a spider. You can make an XMLHTTP request, get the content, and then parse it.
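The spider idea can be sketched as a queue of URLs: request a page, parse out its links, enqueue the unvisited ones, repeat. Here `SITE` is a dictionary standing in for real HTTP fetches (which would use urllib or curl); the pages and links are invented.

```python
import re

SITE = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <p>data A</p>',
    "/b": '<p>data B</p>',
}

def crawl(start):
    seen, queue, order = set(), [start], []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        html = SITE[url]  # real code: urllib.request.urlopen(url).read()
        # Parse out the links and enqueue the ones we haven't visited.
        queue += [u for u in re.findall(r'href="([^"]+)"', html)
                  if u not in seen]
    return order

print(crawl("/"))  # ['/', '/a', '/b']
```

At each visited page you would also run your extraction step (XPath, string markers, or a DOM parser) on `html` before moving on.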

Kangkan
A: 

"Extracting content from other websites" is called screen scraping or web scraping.

PHP Simple HTML DOM Parser is the easiest way (that I know of) to do it.
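PHP Simple HTML DOM lets you write things like `$html->find('div#content_div')` against messy real-world markup. Python's stdlib html.parser can do a cruder version of the same lookup; the target id and sample markup here are made up for illustration.

```python
from html.parser import HTMLParser

class DivTextExtractor(HTMLParser):
    """Collect the text inside <div id=target>, tolerating loose HTML."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # div-nesting depth inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entered the div we want

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

p = DivTextExtractor("content_div")
p.feed('<div id="nav">skip</div><div id="content_div"><b>keep</b> this</div>')
print(p.chunks)  # ['keep', 'this']
```

Unlike ElementTree, html.parser does not require well-formed XML, which is closer to what Simple HTML DOM gives you in PHP.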

vsr
A: 

How about the iMacros tool for Firefox for repetitive tasks? Can that be used to get data from a site that displays its data in a standard format?

kvijayhari