tags:

views:

320

answers:

2

Hi,

I'm trying to write a text parser with PHP, like Instapaper did. What I want to do is; get a webpage and parse it in text-only mode.

It's simple to get the webpage with cURL and strip HTML tags. But every webpage have some common areas; like header, navigation, sidebar, footer, banners etc. I only want to get the article in text mode and exclude all other parts. It's also simple to exclude those parts if I know the "id" or "class" info. But I'm trying to automatize this process and apply for any page, like Instapaper.

I get all the content between but I don't know how to exclude header, sidebar or footer and get only the main article body. I have to develop a logic to get only the main article part.

It's not important for me to find the exact code. It would also be useful to understand how to exclude unnecessary parts as I can try to write my own code with PHP. It would also be useful if there any examples in other languages.

Thanks for helping.

A: 

You really should consider using a HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes.

Ignacio Vazquez-Abrams
+1  A: 

You might try looking at the algorithms behind this bookmarklet, readability - It's got a decent success rate for extracting content among on all web page rubbish.

Friend of mine made it, that's why I'm recommending it - since I know it works, and I'm aware of the many techniques he's using to parse the data. You could apply these techniques for what your asking.

Mr-sk
Looks like it's working perfect and does exactly what I want. Now I have to solve the algorithm. Thank you Mr-sk
Burak Erdem