ansaurus

Question

Self learning regular expression or xpath query?

Answer 1

A:

If I understand your question, there's really no need to write a learning algorithm. Regular expressions are powerful enough to pick this up. You can get all the links in an HTML page with the following regular expression:

(?<=href=")[^"]+(?=")

This regular expression, which I verified in Regex Hero, uses a positive lookbehind and a positive lookahead to grab the URL inside href="".
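As a rough illustration, here is how that pattern might be applied in Python. This is only a sketch: the function name extract_hrefs and the sample HTML are made up for the example, and it assumes the page always writes href with double quotes.

```python
import re
from typing import List

# Same pattern as above: lookbehind for href=", then everything up to the next quote.
HREF_PATTERN = re.compile(r'(?<=href=")[^"]+(?=")')


def extract_hrefs(html: str) -> List[str]:
    """Return every double-quoted href value found in the HTML text."""
    return HREF_PATTERN.findall(html)


# Hypothetical sample markup, not taken from any real page.
sample = (
    '<link href="style.css">'
    '<a href="https://example.com/a">A</a> '
    '<a class="x" href="/b">B</a>'
)

print(extract_hrefs(sample))
# ['style.css', 'https://example.com/a', '/b']
```

Note that the stylesheet reference is picked up too, since the pattern does not care which tag the href belongs to.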

If you want to take it a step further, you can also look for the anchor tag to ensure you're getting an actual anchor link and not a reference to a CSS file or something similar. You can do that like this:

(?<=<a[^<]+href=")[^"]+(?=")

This should work fine as long as the page follows the href="" convention for its links. If the page uses onclick events instead, everything becomes more complicated, because you'll be dealing with the unpredictability of JavaScript. Even Google doesn't crawl JavaScript links.
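One caveat if you try the anchor-only pattern outside of .NET: it relies on a variable-length lookbehind, which the .NET engine behind Regex Hero accepts but Python's built-in re module rejects. A possible workaround, sketched below, is to swap the lookarounds for a capture group. The pattern and the name extract_anchor_hrefs are my own substitution for illustration, not the expression above, and it still assumes double-quoted href attributes.

```python
import re
from typing import List

# Match an opening <a ...> tag and capture its href value.
# A capture group replaces the lookbehind, so the prefix may be any length.
ANCHOR_HREF_PATTERN = re.compile(r'<a\b[^<>]*?\shref="([^"]+)"', re.IGNORECASE)


def extract_anchor_hrefs(html: str) -> List[str]:
    """Return href values that belong to <a> tags only."""
    # With exactly one group, findall returns the captured text of that group.
    return ANCHOR_HREF_PATTERN.findall(html)


# Hypothetical sample markup, not taken from any real page.
sample = (
    '<link href="style.css">'
    '<a href="https://example.com/a">A</a> '
    '<a class="x" href="/b">B</a>'
)

print(extract_anchor_hrefs(sample))
# ['https://example.com/a', '/b']
```

Here the stylesheet link is skipped because the match has to start at an anchor tag.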

Does that help?

Steve Wortham 2009-05-27 21:37:45

Answer 2

A:

As I understand them, most machine learning algorithms work best when they have many examples from which to generalize an 'intelligent' behavior. In this case, you don't have many examples. Google isn't likely to change its format often. Even if it feels often to us, it's probably not often enough for a machine learning algorithm.

It may be easier to monitor the current format and if it changes, change your code. If you make the expected format a configurable regular expression, you can re-deploy the new format without rebuilding the rest of your project.
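A minimal sketch of that idea in Python, under some assumptions of mine: the pattern lives in a JSON config under a made-up key link_pattern, and "monitoring" is reduced to raising an error when the pattern stops matching anything. The function names are hypothetical as well.

```python
import json
import re
from typing import List


def load_pattern(config_text: str) -> re.Pattern:
    """Compile the link pattern stored in a JSON config string.

    The "link_pattern" key is a hypothetical name chosen for this sketch.
    """
    config = json.loads(config_text)
    return re.compile(config["link_pattern"])


def extract_links(html: str, pattern: re.Pattern) -> List[str]:
    """Apply the configured pattern and fail loudly if nothing matches.

    Zero matches is treated as a hint that the page format has changed
    and the configured expression needs updating.
    """
    links = pattern.findall(html)
    if not links:
        raise ValueError("no links matched; the page format may have changed")
    return links


# In practice this text would be read from a file that ships beside the code.
config_text = json.dumps({"link_pattern": r'(?<=href=")[^"]+(?=")'})
pattern = load_pattern(config_text)

print(extract_links('<a href="/results?page=2">Next</a>', pattern))
# ['/results?page=2']
```

When the format changes, only the config entry has to be edited; the code around it stays as it is.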

Corbin March 2009-05-27 21:39:36

Yeah, this is the approach I'm using at the moment, and I'm going to stick with it. Thanks.

alexn 2009-05-28 06:02:06
