Hi!

Is it possible to write code that generates a regular expression or XPath expression for parsing links, based on an HTML document?

What I want is to parse a page for certain links. The only thing I know is that those links make up the majority of the links on the page.

For a simple example, take a Google search results page, such as http://www.google.com/search?hl=en&q=stackoverflow&btnG=Google-search. The majority of the links come from the search results and look something like this:

<h3 class="r"><a onmousedown="return rwt(this,'','','res','1','AFQjCNERidL9Hb6OvGW93_Y6MRj3aTdMVA','')" class="l" href="http://stackoverflow.com/"><em>Stack Overflow</em></a></h3>

Is it possible to write code that learns this pattern, recognizes it, and can still parse all the links even if Google changes its presentation?

I'm thinking of extracting all the links, looking at X characters before and after each tag, and working from that.

I understand that this could also be done with XPath, but the question is still the same: can I parse this content and generate a valid XPath expression that finds the SERP links?

Thanks

A: 

If I understand your question, there's really no need to write a learning algorithm. Regular expressions are powerful enough to pick this up. You can get all the links in an HTML page with the following regular expression:

(?<=href=")[^"]+(?=")

Verified in Regex Hero, this regular expression uses a positive lookbehind and a positive lookahead to grab the URL inside href="".
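As a rough illustration, here is how that same pattern could be applied in Python. The helper name extract_hrefs and the sample markup are made up for the example; the pattern itself is unchanged, and Python's re module accepts it because the lookbehind is fixed-width.

```python
import re

# Same pattern as above. The lookbehind (?<=href=") is fixed-width,
# so Python's re module accepts it as written.
HREF_PATTERN = re.compile(r'(?<=href=")[^"]+(?=")')


def extract_hrefs(html):
    """Return every double-quoted href value found in the HTML string."""
    return HREF_PATTERN.findall(html)


# Hypothetical sample: one stylesheet reference and one result link.
sample = (
    '<link rel="stylesheet" href="/style.css">'
    '<h3 class="r"><a class="l" href="http://stackoverflow.com/">'
    '<em>Stack Overflow</em></a></h3>'
)

print(extract_hrefs(sample))
# → ['/style.css', 'http://stackoverflow.com/']
```

Note that the stylesheet reference is picked up along with the real link, since the pattern only looks at href="" and not at the tag it belongs to.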

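If you want to take it a step further, you can also look for the anchor tag to ensure you're getting an actual anchor link and not a reference to a CSS file or something similar. You can do that like this (the lookbehind here is variable-length, which .NET, the engine behind Regex Hero, supports but many other regex engines do not):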

(?<=<a[^<]+href=")[^"]+(?=")

This should work fine as long as the page follows the href="" convention for its links. If the page uses onclick events instead, everything becomes more complicated, as you're going to be dealing with the unpredictability of JavaScript. Even Google doesn't crawl JavaScript links.
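For engines without variable-length lookbehind, the anchor-only idea can be approximated with a capturing group instead. The following is a minimal Python sketch, not a drop-in equivalent of the .NET pattern: the name extract_anchor_hrefs and the sample markup are assumptions for illustration, and it still only handles double-quoted href values.

```python
import re

# Python's re rejects variable-length lookbehind, so the <a ...> prefix is
# matched normally and only the URL is captured in group 1.
ANCHOR_HREF = re.compile(r'<a\b[^<>]*?\shref="([^"]+)"', re.IGNORECASE)


def extract_anchor_hrefs(html):
    """Return href values that appear inside <a> tags only."""
    return ANCHOR_HREF.findall(html)


# Hypothetical sample: the <link> tag should be ignored, the <a> tag kept.
sample = (
    '<link rel="stylesheet" href="/style.css">'
    '<h3 class="r"><a onmousedown="return rwt(this)" class="l" '
    'href="http://stackoverflow.com/"><em>Stack Overflow</em></a></h3>'
)

print(extract_anchor_hrefs(sample))
# → ['http://stackoverflow.com/']
```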

Does that help?

Steve Wortham
A: 

As I understand them, most machine learning algorithms work best when they have many examples from which they generalize an 'intelligent' behavior. In this case, you don't have many examples. Google isn't likely to change their format often. Even if it feels often to us, it's probably not enough for a machine learning algorithm.

It may be easier to monitor the current format and if it changes, change your code. If you make the expected format a configurable regular expression, you can re-deploy the new format without rebuilding the rest of your project.
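One way this could look in practice is sketched below in Python. Everything here is an assumption made for the example: the link_pattern key, the function names, and the use of JSON. The pattern is kept as data, so a format change only requires editing the config.

```python
import json
import re

# Hypothetical config contents. In practice this would live in a file that
# can be edited and redeployed without touching the parsing code.
CONFIG_TEXT = json.dumps(
    {"link_pattern": r'<h3 class="r"><a\s[^<>]*?href="([^"]+)"'}
)


def load_pattern(config_text):
    """Compile the link pattern stored under the (assumed) link_pattern key."""
    config = json.loads(config_text)
    return re.compile(config["link_pattern"])


def extract_links(html, pattern):
    """Return the URLs captured by group 1 of the configured pattern."""
    return pattern.findall(html)


# Hypothetical page: one navigation link and one search-result link.
html = (
    '<a href="/preferences">Settings</a>'
    '<h3 class="r"><a class="l" href="http://stackoverflow.com/">'
    '<em>Stack Overflow</em></a></h3>'
)

pattern = load_pattern(CONFIG_TEXT)
print(extract_links(html, pattern))
# → ['http://stackoverflow.com/']
```

If the markup around the result links changes, only the link_pattern value needs updating.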

Corbin March
Yeah, this is the approach I'm using at the moment, and I'm going to stick with it. Thanks.
alexn