Hi!
Is it possible to write code that generates a regular expression or an XPath expression for extracting links from a given HTML document?
What I want is to extract a particular set of links from a page. The only thing I know in advance is that the majority of the links on the page are the ones I'm after.
For a simple example, take a Google search results page, e.g. http://www.google.com/search?hl=en&q=stackoverflow&btnG=Google-search. The majority of the links there come from the search results and look something like this:
    <h3 class="r"><a onmousedown="return rwt(this,'','','res','1','AFQjCNERidL9Hb6OvGW93_Y6MRj3aTdMVA','')" class="l" href="http://stackoverflow.com/"><em>Stack Overflow</em></a></h3>
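(For this particular markup, I guess a hand-written XPath would be something like the line below; what I'm really asking is whether a string like this can be derived automatically.)

    //h3[@class="r"]/a/@href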
Is it possible to write code that learns and recognizes this pattern and can extract all such links, even if Google changes their markup?
I'm thinking of extracting all the links, looking at X characters before and after each tag, and then working from that.
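A rough sketch of that idea in Python (purely illustrative; the regex, the window size, and the exact-match grouping are all naive guesses on my part):

    import re
    from collections import Counter

    def majority_links(page_source, window=80):
        """Group every href by the tag that immediately precedes its <a>,
        then return the links in the biggest group."""
        found = []
        for match in re.finditer(r'<a\b[^>]*href=["\']([^"\']+)', page_source):
            before = page_source[max(0, match.start() - window):match.start()]
            # Use the nearest preceding tag as the "context signature",
            # e.g. '<h3 class="r">' in the Google markup above.
            tag = re.search(r'<[^<>]*>\s*$', before)
            sig = tag.group().strip() if tag else before
            found.append((sig, match.group(1)))
        top_sig = Counter(sig for sig, _ in found).most_common(1)[0][0]
        return [href for sig, href in found if sig == top_sig]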
I understand that this could also be done with XPath, but the question is still the same: can I parse this content and generate a valid XPath expression that finds the SERP links?
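Something like this sketch is what I have in mind for the XPath variant (Python with lxml; the names and the depth cutoff are mine — it builds a tag-plus-class path for every <a> element and keeps the most common one):

    from collections import Counter
    from lxml import html

    def element_path(el, depth=2):
        """Build an XPath step list from the element and its ancestors,
        using the tag name plus the class attribute where present."""
        steps = []
        while el is not None and len(steps) < depth:
            cls = el.get('class')
            steps.append('%s[@class="%s"]' % (el.tag, cls) if cls else el.tag)
            el = el.getparent()
        return '//' + '/'.join(reversed(steps))

    def generate_xpath(page_source):
        """The path shared by the most <a> elements wins, on the
        assumption that the majority of links are the wanted ones."""
        tree = html.fromstring(page_source)
        paths = Counter(element_path(a) for a in tree.iter('a'))
        return paths.most_common(1)[0][0] + '/@href'

On the page above this should come out as something like //h3[@class="r"]/a[@class="l"]/@href, which only keeps working until Google changes the markup; the interesting part is making the signature fuzzy enough to survive small changes.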
Thanks