I've been wanting to improve my regex skills for some time, and "Mastering Regular Expressions" was recommended often enough that I bought it and have been reading it over the past day or so.

I have created the following regular expression:

^(?:<b>)?(?:^<i>)?<a href="/site\.php\?id=([0-9]*)">(.*?) \(([ a-z0-9]{2,10})\)</a>(?:^</i>)?(?:</b>)?$

It matches the first two links below but ignores the two enclosed by an <i> tag, and it extracts the id, title and type from each match.

<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>

<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>

Although it works, it seems fairly long for something so simple. Could it be improved?
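For anyone who wants to reproduce the behaviour, here is a minimal sketch in Python (the language is an assumption; the question does not say which regex engine is used). The names PATTERN, SAMPLES and extract are hypothetical. Note that inside (?:^<i>)? the ^ is still a start-of-string anchor, not a negation, so the <i> lines are rejected only because the rest of the pattern cannot match them.

```python
import re

# The expression from the question, applied to one line at a time.
PATTERN = re.compile(
    r'^(?:<b>)?(?:^<i>)?<a href="/site\.php\?id=([0-9]*)">(.*?) '
    r'\(([ a-z0-9]{2,10})\)</a>(?:^</i>)?(?:</b>)?$'
)

# The four sample lines from the question.
SAMPLES = [
    '<a href="/site.php?id=6321">site 1 title (type 1)</a>',
    '<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>',
    '<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>',
    '<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>',
]


def extract(line):
    """Return (id, title, type) for a matching line, or None otherwise."""
    match = PATTERN.match(line)
    return match.groups() if match else None


results = [extract(line) for line in SAMPLES]

for line, result in zip(SAMPLES, results):
    print(result, "<-", line)
```

The first two lines yield a tuple of id, title and type; the two <i> lines yield None.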

+5  A: 

Short of using shorthand character classes (\d for [0-9], etc.), I don't see that the regular expression in question could be shortened much; however...

As a side note, it is worth mentioning that parsing HTML with regular expressions is hazardous at best; when dealing with HTML (and, to a lesser extent, XML), DOM tools are generally better suited.
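To illustrate the parser-based route, here is a rough sketch using Python's standard html.parser module. It is an event-based parser rather than a full DOM, and the language choice is an assumption since the question does not name one. The names LinkCollector and parse_link are hypothetical, and the title/type split assumes the link text always ends in " (type)" as in the sample data.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class LinkCollector(HTMLParser):
    """Collect (href, text) for every <a> that is not nested inside an <i>."""

    def __init__(self):
        super().__init__()
        self.italic_depth = 0      # how many <i> tags are currently open
        self.current_href = None   # href of the <a> being read, if any
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.italic_depth += 1
        elif tag == "a" and self.italic_depth == 0:
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_endtag(self, tag):
        if tag == "i" and self.italic_depth > 0:
            self.italic_depth -= 1
        elif tag == "a" and self.current_href is not None:
            self.links.append((self.current_href, "".join(self.text_parts)))
            self.current_href = None

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)


def parse_link(href, text):
    """Split a collected link into (id, title, type)."""
    site_id = parse_qs(urlparse(href).query).get("id", [None])[0]
    title, _, rest = text.rpartition(" (")
    return site_id, title, rest.rstrip(")")


HTML = """<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>
<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>"""

collector = LinkCollector()
collector.feed(HTML)
sites = [parse_link(href, text) for href, text in collector.links]
print(sites)
```

Unlike the regex, this does not care about attribute order, extra whitespace, or how deeply the <i> is nested around the link.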

Williham Totland
Not just hazardous... just really incorrect... regular expressions are not designed for dealing with html because they are context insensitive. +1 for pointing it out that it's bad.
Tom
so many of these questions. how to let them know? make an faq? lol
Victor
@Victor: I wish I knew. I think part of the problem is that the term regular expression is abused. There are so many variations and extensions added by languages that make regexes more powerful... I think this makes people think they are THE solution to all parsing problems. Sometimes you can do quick and dirty stuff with them for html (if you make certain assumptions about your data)... but still, I wish there was an easy way for people to stumble across the fact that they shouldn't be using them to parse context sensitive grammars.
Tom
We proved that HTML can't be parsed by a regular expression something around the 6th week in the first semester, and this wasn't even a CS degree. I would assume that everybody who took a course that's at least tangentially related to CS and didn't drop out during, say, the first 10 weeks, would *know* that it is *mathematically proven to be impossible* to parse HTML with a regular expression. Still, questions like this pop up like clockwork.
Jörg W Mittag
+1  A: 

If you're writing screen scrapers then, as Williham rightly mentions, a DOM parser might be just as suitable as a regex, since HTML is a lot more forgiving than a regex is.

The regex below is not much shorter, but it is a bit more forgiving:

  • Removed the start-of-string and end-of-string anchors; did you really need them?
  • Added a negative lookbehind to make sure the <a> is not preceded by <i>.
  • Used the \d shorthand instead of [0-9]; a tad cleaner.
  • You had the type limited to 2 to 10 characters; I changed it to 2 or more.
  • Removed the checks for the closing tags; they carry no contextual meaning for your screen scraper (presumably).

(?<!<i>)<a href="/site\.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)
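A quick way to check the lookbehind version is to run it over the whole sample text at once. This is only a sketch in Python (an assumed language; the names REVISED and HTML are hypothetical), with the dot in site.php escaped as in the original question.

```python
import re

# Lookbehind version: no anchors, no closing-tag checks.
REVISED = re.compile(
    r'(?<!<i>)<a href="/site\.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)'
)

HTML = """<a href="/site.php?id=6321">site 1 title (type 1)</a>
<b><a href="/site.php?id=10254">site 2 title (type 2)</a></b>
<i><a href="/site.php?id=5479">site 3 title (type 3)</a></i>
<b><i><a href="/site.php?id=325">site 4 title (type 4)</a></i></b>"""

# findall returns one (id, title, type) tuple per match.
matches = REVISED.findall(HTML)
print(matches)
```

Because the anchors are gone, the pattern can be applied to a whole page instead of one line at a time. The lookbehind only rejects an <a> that is immediately preceded by <i>, so whitespace or another tag between the two would let the link through.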

Martijn Laarman
Thanks! That was really helpful. I will take a look at lookaheads and lookbehinds as they seem handy.
DanCake