I want a regular expression to match valid input into a Tags input field with the following properties:

  • 1-5 tags
  • Each tag is 1-30 characters long
  • Valid tag characters are [a-zA-Z0-9-]
  • input and tags can be separated by any amount of whitespace

For example:

Valid: tag1 tag2 tag3-with-dashes tag4-with-more-dashes tAaG5-with-MIXED-case

Here's what I have so far--it seems to work but I'm interested how it could be simplified or if it has any major flaws:

\s*[a-zA-Z0-9-]{1,30}(\s+[a-zA-Z0-9-]{1,30}){0,4}\s*

// that is: 
\s*                          // match all beginning whitespace
[a-zA-Z0-9-]{1,30}           // match the first tag
(\s+[a-zA-Z0-9-]{1,30}){0,4} // match all subsequent tags
\s*                          // match all ending whitespace

Preprocessing the input to make the whitespace issue easier isn't an option (e.g. trimming or adding a space).

If it matters, this will be used in JavaScript. Any suggestions would be appreciated, thanks!
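
For anyone trying the pattern from JavaScript, here is a minimal sketch. It assumes the pattern is wrapped in ^ and $ so that test() has to match the whole input; the anchors and the helper name isValidTagInput are additions for illustration, not part of the original question.

```javascript
// Assumption: ^ and $ are added so the whole input must match, not just a substring.
const TAGS_RE = /^\s*[a-zA-Z0-9-]{1,30}(\s+[a-zA-Z0-9-]{1,30}){0,4}\s*$/;

// Hypothetical helper name, used only for this illustration.
function isValidTagInput(input) {
  return TAGS_RE.test(input);
}

console.log(isValidTagInput('tag1 tag2 tag3-with-dashes')); // true
console.log(isValidTagInput('  tag1  '));                   // true: surrounding whitespace is allowed
console.log(isValidTagInput('a b c d e f'));                // false: six tags
console.log(isValidTagInput('x'.repeat(31)));               // false: tag longer than 30 characters
```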

+1  A: 

Your RE looks like it's doing pretty much exactly what you were asking for. I might recommend not using an RE at all though, in this case - just split the input on whitespace into an array, then validate each value in the array on its own.

REs are cool, but sometimes, they aren't the best way to get the job done :)
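
A rough sketch of the split-and-validate idea, assuming it runs somewhere a plain function can be used (for example a custom validator). The function name validateTagsBySplitting is hypothetical, and the per-tag pattern is taken from the rules in the question.

```javascript
// One tag: 1-30 characters from [a-zA-Z0-9-], per the question's rules.
const SINGLE_TAG = /^[a-zA-Z0-9-]{1,30}$/;

// Hypothetical helper name, used only for this illustration.
function validateTagsBySplitting(input) {
  // trim() and split() work on a copy, so the caller's input is left untouched.
  const tags = input.trim().split(/\s+/);
  // ''.split(/\s+/) yields [''], which SINGLE_TAG rejects, so empty input fails.
  return tags.length <= 5 && tags.every((tag) => SINGLE_TAG.test(tag));
}

console.log(validateTagsBySplitting('tag1   tag2 tag3-with-dashes')); // true
console.log(validateTagsBySplitting('a b c d e f'));                  // false: six tags
console.log(validateTagsBySplitting('   '));                          // false: no tags at all
```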

zigdon
I hear you--and on the server that's what I might end up doing. However, it's just so easy to plug a regex into the ASP.NET MVC model validation engine that I'd like to stick with it for now
Michael Haren
Or you could write a custom validation that does the split.
Dr. Zim
A: 

\w could replace the a-zA-Z0-9 but it also contains _ if that's ok.
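
A quick check of the underscore difference, as a small sketch (the variable names are made up for the example):

```javascript
// \w is shorthand for [A-Za-z0-9_], so it lets underscores through.
const withWordClass = /^[\w-]{1,30}$/;
const explicitClass = /^[a-zA-Z0-9-]{1,30}$/;

console.log(withWordClass.test('my_tag')); // true
console.log(explicitClass.test('my_tag')); // false
console.log(explicitClass.test('my-tag')); // true
```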

You may also be able to break it down a little more like this:

(\s*[a-zA-Z0-9-]{1,30}){0,5}

if you are always guaranteed to have whitespace separating your tags.

Wouldn't this match a single tag of 60, 90, 120... characters, though?
Michael Haren
Yes you are right Michael, this one does not comply with your requirements.
Pindatjuh
Yeah, that's what I was unclear about, whether tags will always be separated by ws. If not, I'm not exactly sure how you'd determine what to do with a 60 character long string? Maybe posting an example would help.
I added an example--it's basically the same as SO tags
Michael Haren
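
The objection raised in the comments can be reproduced with a short sketch. It assumes the pattern is anchored with ^ and $ (an addition for this example); looseTagsRe is a made-up name.

```javascript
// Assumption: anchored with ^ and $ so the whole input has to match.
const looseTagsRe = /^(\s*[a-zA-Z0-9-]{1,30}){0,5}$/;

console.log(looseTagsRe.test('tag1 tag2'));    // true
console.log(looseTagsRe.test('x'.repeat(60))); // true: read as two adjacent 30-character tags
console.log(looseTagsRe.test(''));             // true: {0,5} also accepts zero tags
```

Because \s* may match nothing, the pattern has no way to tell one long run of characters from several tags written back to back.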
A: 

You could shorten it to something like

([a-zA-Z0-9-]{1,30}\s*){1,5}

I always like to make my regular expressions more concise (where it doesn't affect performance).

Ben
Wouldn't this match a single tag of 60, 90, 120... characters, though?
Michael Haren
This regexp doesn't work. The `{1,30}` field fails, because of the `*` at `\s`.
Pindatjuh
Quite true both. :(
Ben
+2  A: 

Performance-wise, you can optimize (improve) it this way:

^(?:\s+[a-zA-Z0-9]{1,30}){1,5}\s*$

And add a whitespace in the front, before testing the regexp.

^
(?: // don't keep track of groups
\s+ // first (necessary whitespace) or between
  [a-zA-Z0-9-]{1,30} // unchanged
  ){1,5} // 1 to 5 tags
\s*$
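
In JavaScript this might look like the sketch below. It uses the character class from the commented version (with the dash), and the wrapper name isValidWithPrefix is hypothetical. Note that it only works if the caller is allowed to prepend the space.

```javascript
const leadingSpaceRe = /^(?:\s+[a-zA-Z0-9-]{1,30}){1,5}\s*$/;

// Hypothetical wrapper: prepends the space the pattern relies on before testing.
function isValidWithPrefix(input) {
  return leadingSpaceRe.test(' ' + input);
}

console.log(isValidWithPrefix('tag1 tag2'));   // true
console.log(leadingSpaceRe.test('tag1 tag2')); // false: no leading whitespace
console.log(isValidWithPrefix('a b c d e f')); // false: six tags
```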
Pindatjuh
+1 for nice explanation with comments - I just learned something about group tracking hint. I think you still need to deal with the dashes (-) though...
Neil Fenwick
I like this--thanks--unfortunately, I cannot preprocess the input in any way, e.g. adding a space
Michael Haren
+1  A: 

I haven't tested this, but the idea is to look for word boundaries (\b) rather than worrying about spaces (these take start, end, line end, space into account), and then look for 1 or more groups of word chars (\w) OR digits (\d) and dashes between...

(\b[a-zA-Z0-9-]{1,30}\b){1,5}

or

(\b[\w\d-]{1,30}\b){1,5}

The problem with this and yours is that it allows a tag to end with a dash (-), which is probably not ideal.

At least it's shorter and might be more readable.

Neil Fenwick
I hadn't thought of the weakness of ending with a dash--I can accept that. I like this approach, thanks!
Michael Haren
`\b` doesn't consume any characters, so this won't match a string with whitespace in it. But it *will* match a single sequence of 150 letters, digits and dashes, provided the dashes are evenly distributed.
Alan Moore
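
Both points from the comment above can be checked with a small sketch. It assumes the first pattern is anchored with ^ and $ (added here so that test() must match the whole input); boundaryRe and longRun are names invented for the example.

```javascript
// Assumption: anchored with ^ and $ so the whole input has to match.
const boundaryRe = /^(\b[a-zA-Z0-9-]{1,30}\b){1,5}$/;

// Four 30-character chunks that each end in a dash, then one chunk of 30 letters.
const longRun = ('a'.repeat(29) + '-').repeat(4) + 'a'.repeat(30);

console.log(boundaryRe.test('tag1'));      // true
console.log(boundaryRe.test('tag1 tag2')); // false: nothing in the pattern can consume the space
console.log(longRun.length);               // 150
console.log(boundaryRe.test(longRun));     // true: \b matches at every dash/letter junction
```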
+3  A: 

You can simplify it a bit like this:

^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$

The (?: ) syntax is a noncapturing group, which I believe should improve performance when you don't need groups per se.
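
To see what "noncapturing" changes in JavaScript, here is a small sketch comparing the result arrays. The patterns are simplified for the comparison and the variable names are made up; it shows only the difference in what is recorded, not any performance figure.

```javascript
const capturing = /^(\s*[a-zA-Z0-9-]{1,30})+$/;
const nonCapturing = /^(?:\s*[a-zA-Z0-9-]{1,30})+$/;

const withGroup = capturing.exec('tag1 tag2');
const withoutGroup = nonCapturing.exec('tag1 tag2');

console.log(withGroup.length);    // 2: the whole match plus capture group 1
console.log(withGroup[1]);        // ' tag2': a repeated group keeps only its last iteration
console.log(withoutGroup.length); // 1: just the whole match
```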

Then the trick is this statement:

(?:^|\s+)

Thanks to the caret, this will match the beginning of the line, or one or more characters of whitespace.
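
A minimal JavaScript sketch of that first pattern, to show the caret-or-whitespace alternative doing its job (caretOrSpaceRe is a made-up name):

```javascript
const caretOrSpaceRe = /^(?:(?:^|\s+)[a-zA-Z0-9-]{1,30}){1,5}\s*$/;

console.log(caretOrSpaceRe.test('tag1 tag2'));    // true
console.log(caretOrSpaceRe.test('  tag1  '));     // true: leading whitespace goes through the \s+ branch
console.log(caretOrSpaceRe.test('x'.repeat(40))); // false: a second tag needs whitespace before it
console.log(caretOrSpaceRe.test('a b c d e f'));  // false: six tags
```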

UPDATE: This works perfectly in my testing and there's certainly less redundant code. However, I just used the benchmarking in Regex Hero to find that your original regex is actually faster. That's probably because mine is causing more backtracking to occur.

UPDATE #2: I found another way that accomplishes the same thing, I think:

^(?:\s*[a-zA-Z0-9-]{1,30}){1,5}\s*$

I realized that I was trying too hard. \s* matches 0 or more spaces, which means that it'll work for a single tag. But... it'll work for 2-5 tags as well because the space is not in your character class [ ]. And indeed it fails with 6 tags as it should. That means this is a much more forward-looking regex with less backtracking, better performance, and less redundancy.

UPDATE #3:

I see the error in my ways. This should work better.

^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$

Putting the \b just before the last ) will assert a word boundary. That allows the 1-30 character length rule to work properly again.

Steve Wortham
That's a neat site--thanks for the reference
Michael Haren
@Michael - You're welcome. ;) And check out my second attempt here. It's even simpler and I think performance is about the same as your first expression.
Steve Wortham
@Steve: I appreciate your extra efforts--unfortunately the latest regex doesn't cap out at 30 chars/each--i.e. it matches a 40char tag
Michael Haren
@Michael - Ah, I see the error in my ways. Third try is the charm.
Steve Wortham
Adding `(?!\S)` after the `{1,30}` will enforce the 30-chars-per-tag rule. `\b` isn't good enough because it will match between a dash and a letter/digit.
Alan Moore
@Alan - Hmm, that stinks. And then (?!\S) won't really work since lookarounds won't work reliably in JavaScript implementations. Oh well, at least my first regex worked.
Steve Wortham
`(?!\S)` should be safe. AFAIK, it takes a rather more complex lookahead than that to confuse even JavaScript, e.g.: http://blog.stevenlevithan.com/archives/regex-lookahead-bug
Alan Moore
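
The three variants discussed in this thread can be compared side by side. This is a sketch only: the variable names are invented, and the last pattern is one way of placing the `(?!\S)` lookahead suggested in the comments.

```javascript
const update2 = /^(?:\s*[a-zA-Z0-9-]{1,30}){1,5}\s*$/;
const update3 = /^(?:\s*[a-zA-Z0-9-]{1,30}\b){1,5}\s*$/;
const withLookahead = /^(?:\s*[a-zA-Z0-9-]{1,30}(?!\S)){1,5}\s*$/;

const plain40 = 'a'.repeat(40);                          // 40 letters, no dash
const dashed40 = 'a'.repeat(29) + '-' + 'a'.repeat(10);  // 40 characters, dash at position 30

console.log(update2.test(plain40));         // true: split as 30 + 10 characters
console.log(update3.test(plain40));         // false: no word boundary inside a run of letters
console.log(update3.test(dashed40));        // true: \b matches between the dash and the next letter
console.log(withLookahead.test(dashed40));  // false: a tag must be followed by whitespace or the end
console.log(withLookahead.test('tag1 tag2 tag3-with-dashes')); // true
```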
A: 

You're not going to improve on that. Anything you do to reduce the length will also make it harder to read, and regexes don't need any help in that regard. ;)

That said, your regex needs to be more complicated anyway. As written, it fails to ensure that tag names don't start or end with a hyphen, or contain two or more consecutive hyphens. The regex for a single tag would need to be structured like this:

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*
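
A quick sketch of the single-tag structure on its own, anchored with ^ and $ for the test (the anchors and the name strictTagRe are additions for this example):

```javascript
// Assumption: anchored so the whole string must be one tag.
const strictTagRe = /^[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*$/;

console.log(strictTagRe.test('tag3-with-dashes')); // true
console.log(strictTagRe.test('-tag'));             // false: leading hyphen
console.log(strictTagRe.test('tag-'));             // false: trailing hyphen
console.log(strictTagRe.test('a--b'));             // false: consecutive hyphens
```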

Then the base regex to match up to five tags would be

[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\s+[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*){0,4}

...but that doesn't enforce the maximum tag length. I think the simplest way to do that would be to put your original regex in a lookahead:

/^\s*
 (?=[A-Za-z0-9-]{1,30}(\s+[A-Za-z0-9-]{1,30}){0,4}\s*$)
 (?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\s*)+$
/

The lookahead enforces the tag lengths as well as the overall structure of up to five tags separated by whitespace. Then the main body only has to enforce the structure of the individual tags.
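
JavaScript regex literals have no free-spacing mode, so to try the pattern above it has to be written on one line. A sketch, with combinedRe as a made-up name:

```javascript
// Same pattern as above with the layout whitespace removed.
const combinedRe = /^\s*(?=[A-Za-z0-9-]{1,30}(\s+[A-Za-z0-9-]{1,30}){0,4}\s*$)(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\s*)+$/;

console.log(combinedRe.test('tag1 tag2 tag3-with-dashes')); // true
console.log(combinedRe.test('  tag1  '));                   // true
console.log(combinedRe.test('tag-'));                       // false: trailing hyphen
console.log(combinedRe.test('a--b'));                       // false: consecutive hyphens
console.log(combinedRe.test('x'.repeat(31)));               // false: rejected by the lookahead's length limit
console.log(combinedRe.test('a b c d e f'));                // false: six tags
```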

I could have shortened the regex a bit by leaving the a-z out of the character classes and adding the i modifier. I didn't do that because you talked about using the regex in an ASP.NET validator, and as far as I know, they don't let you use regex modifiers. And, since JavaScript doesn't support the (?i) inline modifier syntax, case-insensitive validator regexes aren't possible. If I'm mistaken about that, I hope someone will correct me.

Alan Moore