ansaurus

Question

How can I improve this regular expression?

Answer 1

+1 A:

Your RE looks like it's doing pretty much exactly what you were asking for. In this case, though, I might recommend not using an RE at all: just split the input on whitespace into an array, then validate each value in the array on its own.

REs are cool, but sometimes, they aren't the best way to get the job done :)
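A minimal sketch of the split-and-validate idea, written in Python rather than the asker's ASP.NET setting. The limits (1 to 5 tags, each 1 to 30 letters, digits or dashes) are assumed from the patterns discussed in this thread, and the name `validate_tags` is hypothetical.

```python
import re

# Per-tag rule assumed from the thread: 1-30 letters, digits or dashes.
TAG_RE = re.compile(r"[a-zA-Z0-9-]{1,30}")
MAX_TAGS = 5


def validate_tags(text: str) -> bool:
    """Return True if text holds 1 to MAX_TAGS whitespace-separated valid tags."""
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    tags = text.split()
    if not 1 <= len(tags) <= MAX_TAGS:
        return False
    # Each tag is checked on its own, so the length cap cannot be bypassed.
    return all(TAG_RE.fullmatch(tag) for tag in tags)
```

Because every tag is matched separately, the per-tag length limit and the tag count are enforced by two independent, easy-to-read checks.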

zigdon 2010-03-24 22:25:37

I hear you--and on the server that's what I might end up doing. However, it's just so easy to plug a regex into the ASP.NET MVC model validation engine that I'd like to stick with it for now.

Michael Haren 2010-03-24 22:27:39

Or you could write a custom validation that does the split.

Dr. Zim 2010-03-24 22:30:37

Answer 2

A:

\w could replace the a-zA-Z0-9, but it also matches _, if that's OK.
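A small check of the underscore difference, using Python's `re` module as a stand-in for the asker's regex engine (an assumption; the variable names are illustrative only).

```python
import re

explicit = re.compile(r"[a-zA-Z0-9-]{1,30}")
with_w = re.compile(r"[\w-]{1,30}")

# The underscore is rejected by the explicit class but accepted via \w.
print(bool(explicit.fullmatch("my_tag")))  # False
print(bool(with_w.fullmatch("my_tag")))    # True
```

In Python, `\w` on a `str` pattern also matches non-ASCII letters by default, so the shorthand is broader than the explicit class in more ways than the underscore.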

You may also be able to break it down a little more like this:

(\s*[a-zA-Z0-9-]{1,30}){0,5}

if you are always guaranteed to have whitespace separating your tags.

2010-03-24 22:29:49

Wouldn't this match a single tag of 60, 90, 120... characters, though?

Michael Haren 2010-03-24 22:31:27

Yes, you are right, Michael: this one does not comply with your requirements.
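The over-long match can be reproduced with a short sketch. It uses Python's `re` module and `fullmatch` to test the whole string, which is an assumption about how the pattern would be applied.

```python
import re

# Pattern from the answer, checked against the whole string.
pattern = re.compile(r"(\s*[a-zA-Z0-9-]{1,30}){0,5}")

# \s* can match the empty string, so consecutive repetitions of the group
# each consume up to 30 characters of one unbroken run.
accepts_60 = pattern.fullmatch("a" * 60) is not None
accepts_150 = pattern.fullmatch("a" * 150) is not None
print(accepts_60, accepts_150)
```

Both checks succeed, which is the behaviour the comment describes: a single run of 60 (or up to 150) characters is accepted as if it were several tags.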

Pindatjuh 2010-03-24 22:34:41

Yeah, that's what I was unclear about, whether tags will always be separated by ws. If not, I'm not exactly sure how you'd determine what to do with a 60 character long string? Maybe posting an example would help.

2010-03-24 22:36:55

I added an example--it's basically the same as SO tags

Michael Haren 2010-03-24 22:48:19

Answer 3

A:

You could shorten it to something like

([a-zA-Z0-9-]{1,30}\s*){1,5}

I always like to make my regular expressions more concise (where it doesn't affect performance).

Ben 2010-03-24 22:31:38

Wouldn't this match a single tag of 60, 90, 120... characters, though?

Michael Haren 2010-03-24 22:34:03

This regexp doesn't work. The `{1,30}` limit is not enforced, because the `*` on `\s` lets consecutive groups match with no whitespace between them.

Pindatjuh 2010-03-24 22:35:25

Quite true both. :(

Ben 2010-03-24 22:43:01

Answer 4

+2 A:

Performance-wise, you can optimize (improve) it this way:

^(?:\s+[a-zA-Z0-9]{1,30}){1,5}\s*$

Then prepend a whitespace character to the input before testing the regexp.

^
(?:                    // non-capturing: don't keep track of groups
  \s+                  // whitespace: necessary before the first tag, or between tags
  [a-zA-Z0-9-]{1,30}   // unchanged
){1,5}                 // 1 to 5 tags
\s*$
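A minimal sketch of this approach in Python (an assumed stand-in for the asker's environment). It uses the character class from the commented breakdown, which includes the dash, and the helper name `is_valid` is hypothetical.

```python
import re

# Character class taken from the commented breakdown (with the dash).
pattern = re.compile(r"^(?:\s+[a-zA-Z0-9-]{1,30}){1,5}\s*$")


def is_valid(tags: str) -> bool:
    # Prepend one space so the first tag also satisfies the mandatory \s+.
    return pattern.search(" " + tags) is not None


print(is_valid("regex asp-net"))  # True
print(is_valid("a" * 60))         # False: no whitespace splits the run
```

Since `\s+` is mandatory before every tag, a long unbroken run can no longer be split across several repetitions of the group.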

Pindatjuh 2010-03-24 22:34:05

+1 for the nice explanation with comments - I just learned something about the group-tracking hint. I think you still need to deal with the dashes (-) though...

Neil Fenwick 2010-03-24 22:36:46

Pindatjuh 2010-03-24 22:38:03

I like this--thanks--unfortunately, I cannot preprocess the input in any way, e.g. adding a space.

Michael Haren 2010-03-24 22:43:37

Answer 5

+1 A:

I haven't tested this, but the idea is to look for word boundaries (\b) rather than worrying about spaces (word boundaries account for the start and end of the string, line ends, and spaces), and then look for 1 or more groups of word chars (\w) or digits (\d) with dashes between them:

(\b[a-zA-Z0-9-]{1,30}\b){1,5}

or

(\b[\w\d-]{1,30}\b){1,5}

The problem with this and yours is that it allows a tag to end with a dash (-), which is probably not ideal.

At least it's shorter and might be more readable.

Neil Fenwick 2010-03-24 22:34:35

I hadn't thought of the weakness of ending with a dash--I can accept that. I like this approach, thanks!

Michael Haren 2010-03-24 22:38:59

Neil Fenwick 2010-03-24 22:51:52

`\b` doesn't consume any characters, so this won't match a string with whitespace in it. But it *will* match a single sequence of 150 letters, digits and dashes, provided the dashes are evenly distributed.
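Both claims can be checked with a short sketch, assuming Python's `re` treats `\b` the same way for these ASCII inputs. The test strings are made up for illustration.

```python
import re

pattern = re.compile(r"(\b[a-zA-Z0-9-]{1,30}\b){1,5}")

# \b is zero-width, so nothing in the pattern can consume the space.
matches_two_tags = pattern.fullmatch("foo bar") is not None

# 150 characters with a dash as every 30th character except the last:
# each dash/letter transition is a word boundary, so the run splits
# into five 30-character chunks.
dashed_run = ("a" * 29 + "-") * 4 + "a" * 30
matches_dashed_run = pattern.fullmatch(dashed_run) is not None

# Without dashes there is no boundary inside the run, so it is rejected.
matches_plain_run = pattern.fullmatch("a" * 150) is not None

print(matches_two_tags, matches_dashed_run, matches_plain_run)
```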

Alan Moore 2010-03-24 23:23:14

Answer 6

+3 A:

You can simplify it a bit like this:

^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$

The (?: ) syntax is a noncapturing group, which I believe should improve performance when you don't need groups per se.

Then the trick is this statement:

(?:^|\s+)

Thanks to the caret, this will match the beginning of the line, or one or more characters of whitespace.
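A quick way to exercise this pattern, sketched in Python (an assumption; the thread concerns .NET and JavaScript engines). The sample inputs are illustrative.

```python
import re

pattern = re.compile(r"^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$")

samples = ["regex", "regex asp-net mvc", "  padded  ", "a b c d e f", "a" * 60]

# The inner ^ only succeeds at position 0 (no multiline flag), so every tag
# after the first must be preceded by at least one whitespace character.
results = {s: pattern.search(s) is not None for s in samples}
for text, ok in results.items():
    print(repr(text[:20]), ok)
```

The first three samples are accepted; six tags and the unbroken 60-character run are rejected.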

UPDATE: This works perfectly in my testing and there's certainly less redundant code. However, I just used the benchmarking in Regex Hero to find that your original regex is actually faster. That's probably because mine is causing more backtracking to occur.

UPDATE #2: I found another way that accomplishes the same thing, I think:

^(?:\s*[a-zA-Z0-9-]{1,30}){1,5}\s*$

I realized that I was trying too hard. \s* matches 0 or more spaces, which means that it'll work for a single tag. But... it'll work for 2-5 tags as well because the space is not in your character class [ ]. And indeed it fails with 6 tags as it should. That means this is a much more forward-looking regex with less backtracking, better performance, and less redundancy.

UPDATE #3:

I see the error in my ways. This should work better.

^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$

Putting the \b just before the last ) will assert a word boundary. That allows the 1-30 character length rule to work properly again.

Steve Wortham 2010-03-24 22:40:40

That's a neat site--thanks for the reference

Michael Haren 2010-03-24 22:56:53

@Michael - You're welcome. ;) And check out my second attempt here. It's even simpler and I think performance is about the same as your first expression.

Steve Wortham 2010-03-24 23:12:37

@Steve: I appreciate your extra efforts--unfortunately the latest regex doesn't cap out at 30 chars each, i.e. it matches a 40-char tag.

Michael Haren 2010-03-24 23:15:26

@Michael - Ah, I see the error in my ways. Third try is the charm.

Steve Wortham 2010-03-24 23:27:31

Adding `(?!\S)` after the `{1,30}` will enforce the 30-chars-per-tag rule. `\b` isn't good enough because it will match between a dash and a letter/digit.
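A sketch comparing the two variants in Python (assumed to behave like the target engine for these constructs). The 40-character input is a made-up example with a dash as its 30th character.

```python
import re

with_boundary = re.compile(r"^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$")
with_lookahead = re.compile(r"^(?:\s*[a-zA-Z0-9-]{1,30}(?!\S)){1,5}\s*$")

# 40 characters, no whitespace: too long to be a single valid tag.
too_long = "a" * 29 + "-" + "a" * 10

# \b is satisfied between the dash and the following letter, so the run
# is split into a 30-character chunk and a 10-character chunk.
boundary_accepts = with_boundary.search(too_long) is not None

# (?!\S) requires whitespace or end of input after each tag.
lookahead_accepts = with_lookahead.search(too_long) is not None

print(boundary_accepts, lookahead_accepts)
```

The `\b` version accepts the over-long input while the `(?!\S)` version rejects it, and ordinary input such as `regex asp-net` still passes the lookahead version.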

Alan Moore 2010-03-24 23:34:28

@Alan - Hmm, that stinks. And then (?!\S) won't really work since lookarounds won't work reliably in Javascript implementations. Oh well, at least my first regex worked.

Steve Wortham 2010-03-25 20:41:02

`(?!\S)` should be safe. AFAIK, it takes a rather more complex lookahead than that to confuse even JavaScript, e.g.: http://blog.stevenlevithan.com/archives/regex-lookahead-bug

Alan Moore 2010-03-25 21:08:55

Answer 7

A:

You're not going to improve on that. Anything you do to reduce the length will also make it harder to read, and regexes don't need any help in that regard. ;)

That said, your regex needs to be more complicated anyway. As written, it fails to ensure that tag names don't start or end with a hyphen, or contain two or more consecutive hyphens. The regex for a single tag would need to be structured like this:

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*

Then the base regex to match up to five tags would be

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\s+[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*){0,4}

...but that doesn't enforce the maximum tag length. I think the simplest way to do that would be to put your original regex in a lookahead:

/^\s*
 (?=[A-Za-z0-9-]{1,30}(\s+[A-Za-z0-9-]{1,30}){0,4}\s*$)
 (?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\s*)+$
/

The lookahead enforces the tag lengths as well as the overall structure of up to five tags separated by whitespace. Then the main body only has to enforce the structure of the individual tags.
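For experimenting with this pattern, here is a sketch that transcribes it into Python's verbose mode. Python stands in for the free-spacing notation used above, and the name `TAGS_RE` is hypothetical.

```python
import re

TAGS_RE = re.compile(
    r"""
    ^\s*
    # Lookahead: 1 to 5 whitespace-separated runs of 1 to 30 allowed characters.
    (?=[A-Za-z0-9-]{1,30}(\s+[A-Za-z0-9-]{1,30}){0,4}\s*$)
    # Body: each tag is one or more alphanumeric runs joined by single hyphens.
    (?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\s*)+$
    """,
    re.VERBOSE,
)

for text in ["regex asp-net-mvc", "-regex", "regex-", "asp--net", "a b c d e f"]:
    print(repr(text), TAGS_RE.search(text) is not None)
```

Only the first sample is accepted. The samples are kept short on purpose: the main body nests one `+` inside another with an optional separator, so a long tag that fails late (for example, many letters followed by a trailing dash) may backtrack heavily.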

I could have shortened the regex a bit by leaving the a-z out of the character classes and adding the i modifier. I didn't do that because you talked about using the regex in an ASP.NET validator, and as far as I know, they don't let you use regex modifiers. And, since JavaScript doesn't support the (?i) inline modifier syntax, case-insensitive validator regexes aren't possible. If I'm mistaken about that, I hope someone will correct me.

Alan Moore 2010-03-25 01:34:12
