Reverse Engineer a web page

+5 Q:

Hi, I wish to reverse engineer any web page into a logical representation of the page. For example, if a web page has a menu, I want a logical menu structure, perhaps in XML. If the page has an article, I want an article XML node; if the article has a title, I want a title XML node. Basically, I want the logical form of the web page without any of the user interface.

This logical model could be either objects in code or XML; it doesn't matter. The important part is that it has identified what everything on the page means.

+3 A:

Sounds like what you want requires a human to categorise a page's contents.

This could be automated; however, it would produce false positives and would not work in every case.

For example, what if one page used a ul for a menu and another one used table cells?
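To make the ul-versus-table problem concrete, here is a minimal Python sketch of one possible heuristic, not a definitive implementation. It treats any `<ul>` or `<table>` that directly collects at least two links as a menu and emits a small XML tree. The names `MenuGuesser`, `guess_menus`, `MIN_LINKS` and the `<page>`/`<menu>`/`<item>` output format are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

CONTAINERS = {"ul", "table"}  # two markups that sites may use for menus
MIN_LINKS = 2                 # assumed threshold; would need tuning per site


class MenuGuesser(HTMLParser):
    """Groups link text/href pairs by their innermost enclosing <ul> or <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one list of links per currently open container
        self.groups = []   # closed containers that passed the threshold
        self.href = None   # href of the <a> currently open, if any
        self.text = []     # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append([])
        elif tag == "a" and self.stack:
            self.href = dict(attrs).get("href") or ""
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            if self.stack:
                label = "".join(self.text).strip()
                self.stack[-1].append((label, self.href))
            self.href = None
        elif tag in CONTAINERS and self.stack:
            links = self.stack.pop()
            if len(links) >= MIN_LINKS:
                self.groups.append(links)


def guess_menus(html):
    """Return an XML string with one <menu> per container that looks like a menu."""
    parser = MenuGuesser()
    parser.feed(html)
    parser.close()
    page = ET.Element("page")
    for links in parser.groups:
        menu = ET.SubElement(page, "menu")
        for label, href in links:
            item = ET.SubElement(menu, "item", href=href)
            item.text = label
    return ET.tostring(page, encoding="unicode")
```

Both markups then map to the same logical menu, but so does any data table that happens to contain two links, which is exactly the kind of false positive mentioned above.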

Do you want this for one site in particular, or any site on the Internet?

alex 2010-05-30 11:20:09

How about parsing the XML already on the page? See:

http://en.wikipedia.org/wiki/XHTML
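A rough sketch of that idea in Python, assuming the page really is well-formed XHTML in the standard XHTML namespace. The function name `outline` and the choice of fields to pull out are hypothetical. Note that this recovers markup structure only; it does not decide what any element means.

```python
import xml.etree.ElementTree as ET

# Namespace prefix map for XPath-style lookups; "x" is an arbitrary local prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def outline(xhtml):
    """Parse an XHTML document and pull out its title, h1 headings and links.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed XML,
    which is the case for a lot of real-world HTML.
    """
    root = ET.fromstring(xhtml)
    title = root.findtext("x:head/x:title", default="", namespaces=XHTML_NS)
    headings = [h.text or "" for h in root.iterfind(".//x:h1", XHTML_NS)]
    links = [
        (a.text or "", a.get("href", ""))
        for a in root.iterfind(".//x:a", XHTML_NS)
    ]
    return {"title": title, "headings": headings, "links": links}
```

Anything that is not strict XML, such as an unclosed `<li>`, makes `ET.fromstring` raise `ParseError`, so ordinary HTML pages would need a more tolerant parser first.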

wiifm 2010-05-30 11:20:21

I was also going to suggest that he convert the entire Internet to XHTML ;)

Onots 2010-05-30 11:22:21

Makes me want to find that GIF of the Windows transfer dialog saying 'Downloading the Internet...'

alex 2010-05-30 11:44:26

@alex - http://www.gifbin.com/982378 :) Though, the size seems a bit small now....

Nick Craver 2010-05-30 12:07:44

Too simple. Need to be able to recognize buttons on the page and know, based on their location, what they mean, i.e. whether a submit button is for cancel or submission. Also the menu structure, text on graphics, etc.

Phil 2010-05-30 13:08:04

@Nick Thanks... I think I must have seen that GIF for the first time 10 years ago or so...

alex 2010-05-30 14:29:21

Cheers for the GIF, guys, well good effort. @Phil, how is parsing XHTML too simple? This is, of course, how your browser renders the page...

wiifm 2010-05-30 19:28:29
