ansaurus

Question

Answer 1

+5 A:

Short of using character classes (\d for 0-9 etc.) I don't see that the regular expression in question could be shortened much; however...

As a side note, it is worth mentioning that parsing HTML with regular expressions is hazardous at best; when dealing with HTML (and, to a lesser extent, XML), DOM tools are generally better suited.
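To illustrate the parser-based route, here is a minimal sketch in Python. It uses the standard library's event-based `html.parser` rather than a full DOM library, which is a substitution made only to keep the example self-contained. The `LinkCollector` class, the `extract_links` helper, and the sample markup are hypothetical names and data invented for this example; they are not taken from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a> with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: feed markup to the parser and return the links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, assumed for illustration only
sample = (
    '<p><i><a href="/site.php?id=1">Skip</a></i> '
    '<a href="/site.php?id=42">Some name (abc 123)</a></p>'
)
links = extract_links(sample)
```

Because the parser tracks tags rather than raw character patterns, attribute order, extra whitespace, and surrounding markup do not need to be anticipated in a pattern.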

Williham Totland 2009-07-13 17:55:59

Not just hazardous, but really incorrect: regular expressions are not designed for dealing with HTML because they are context-insensitive. +1 for pointing out that it's bad.

Tom 2009-07-13 18:00:54

So many of these questions. How do we let them know? Make an FAQ? lol

Victor 2009-07-13 18:03:36

@Victor: I wish I knew. I think part of the problem is that the term "regular expression" is abused. There are so many variations and extensions added by languages that make regexes more powerful... I think this makes people think they are THE solution to all parsing problems. Sometimes you can do quick and dirty stuff with them for HTML (if you make certain assumptions about your data)... but still, I wish there were an easy way for people to stumble across the fact that they shouldn't be using them to parse context-sensitive grammars.

Tom 2009-07-13 18:13:47

We proved that HTML can't be parsed by a regular expression sometime around the 6th week of the first semester, and this wasn't even a CS degree. I would assume that everybody who took a course that's at least tangentially related to CS and didn't drop out during, say, the first 10 weeks, would *know* that it is *mathematically proven to be impossible* to parse HTML with a regular expression. Still, questions like this pop up like clockwork.

Jörg W Mittag 2009-07-13 22:08:13

Answer 2

+1 A:

If you're writing screen scrapers then, as Williham rightly mentions, a DOM parser might be just as suitable as a regex, since HTML is a lot more forgiving than regex.

It's not shortened by much, but the regex below is a bit more forgiving:

Removed the start-of-string and end-of-string checks; did you really need them?
Added a negative lookbehind to make sure <a> is not preceded by <i>.
Used the \d shorthand instead of [0-9], which is a tad cleaner.
You had typed in a length of 3 to 11 characters; I changed it to 3 or more.
Removed the checks for end tags; they carry no contextual meaning for your screen scraper (presumably).

(?<!<i>)<a href="/site.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)
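As a rough illustration of how this pattern behaves, here is a small Python sketch. The pattern is copied verbatim from above; the sample markup and the names `LINK_RE`, `sample`, and `matches` are assumptions made up for this example, since the original input is not shown.

```python
import re

# The pattern from the answer, unchanged. Python's re module accepts the
# fixed-width negative lookbehind (?<!<i>).
LINK_RE = re.compile(
    r'(?<!<i>)<a href="/site.php\?id=(\d*)">(.*?) \(([ a-z\d]{2,})\)'
)

# Hypothetical input: the first link sits directly after <i> and should be
# skipped by the lookbehind; the second should match.
sample = (
    '<i><a href="/site.php?id=1">Skipped (abc)</a></i>\n'
    '<a href="/site.php?id=42">Some name (abc 123)</a>'
)

# findall returns one tuple per match: (id, name, text in parentheses)
matches = LINK_RE.findall(sample)
print(matches)
```

With this sample, only the second link matches, giving the groups `'42'`, `'Some name'`, and `'abc 123'`. Note that the lookbehind only rejects an `<a>` that is immediately preceded by `<i>`; any whitespace between the two tags would let the link through.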

Martijn Laarman 2009-07-13 18:11:27

Thanks! That was really helpful. I will take a look at lookaheads and lookbehinds as they seem handy.

DanCake 2009-07-13 18:30:26

Improving my regular expression skills
