Question

I would like to be able to use a single regex (if possible) to require that a string fits [A-Za-z0-9_] but doesn't allow:

  • Strings containing only numbers and/or symbols
  • Strings starting or ending with symbols
  • Multiple symbols next to each other

Valid

  • test_0123
  • t0e1s2t3
  • 0123_test
  • te0_s1t23
  • t_t

Invalid

  • t__t
  • ____
  • 01230123
  • _0123
  • _test
  • _test123
  • test_
  • test123_

Reasons for the Rules

The purpose of this is to filter usernames for a website I'm working on. I've arrived at the rules for specific reasons.

  • Usernames with only numbers and/or symbols could cause problems with routing and database lookups. The route for /users/#{id} allows id to be either the user's id or user's name. So names and ids shouldn't be able to collide.

  • _test looks weird and I don't believe it's a valid subdomain, i.e. _test.example.com

  • I don't like the look of t__t as a subdomain, i.e. t__t.example.com

A: 
[A-Za-z][A-Za-z0-9_]*[A-Za-z]

That would work for your first two rules (since it requires a letter at the beginning and end for the second rule, it automatically requires letters).

I'm not sure the third rule is possible using regexes.

Rusky
I think that you need to anchor your regex.
Telemachus
Doesn't allow `te_t`, `0123test`
epochwolf
You totally missed the underscore.
Sinan Taifour
Doesn't allow t.
Beta
Oops, yeah, forgot the underscore. And the single-character thing.>_<
Rusky
A: 
(?=.*[a-zA-Z].*)^[A-Za-z0-9](_?[A-Za-z0-9]+)*$

This one works.

Look ahead to make sure there's at least one letter in the string, then start consuming input. Every time there is an underscore, there must be a number or a letter before the next underscore.
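To see what each half contributes, the lookahead and the consuming part can be tried separately. This is only a sketch in Ruby (the asker's language); the constant names and the `parts` helper are invented for illustration, and the patterns are copied from the regex above.

```ruby
# Hypothetical names; the patterns are the pieces of the regex in this answer.
HAS_LETTER = /(?=.*[a-zA-Z])/                  # lookahead: a letter exists somewhere
BODY       = /^[A-Za-z0-9](_?[A-Za-z0-9]+)*$/  # alphanumeric runs, single underscores between them
FULL       = /(?=.*[a-zA-Z].*)^[A-Za-z0-9](_?[A-Za-z0-9]+)*$/

# Report which parts accept the name.
def parts(name)
  {
    letter: HAS_LETTER.match?(name),
    body:   BODY.match?(name),
    full:   FULL.match?(name)
  }
end

%w[test_0123 t_t 01230123 t__t _test test_].each do |name|
  puts "#{name.ljust(10)} #{parts(name)}"
end
```

`01230123` passes the body but fails the lookahead, while `t__t` passes the lookahead but fails the body; only names that pass both are accepted.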

Welbog
This would rule out a name ending in a number, wouldn't it? The OP considers `test_0123` valid.
Telemachus
Doesn't allow `te_t`, `0123test`.
epochwolf
Fair enough. That's easy to change.
Welbog
This one doesn't match things like "hello_t" (with a single character at the end), or strings starting or ending with a number.
Sinan Taifour
Which is why I updated it. Now it doesn't make sure a letter is present. Still thinking about it.
Welbog
Why do you say it doesn't match "a_a"? It looks like it should.
Michael Myers
It didn't before. Let me remove that.
Welbog
Just FYI, the second `.*` in the lookahead and the `?` after the underscore aren't doing anything useful. Also, I would put the `^` *before* the lookahead, if only to more clearly reflect my intentions.
Alan Moore
+2  A: 

I'm sure that you could put all this into one regular expression, but it won't be simple and I'm not sure why you insist on it being one regex. Why not use multiple passes during validation? If the validation checks are done when users create a new account, there really isn't any reason to try to cram it into one regex. (That is, you will only be dealing with one item at a time, not hundreds or thousands or more. A few passes over a normal-sized username should take very little time, I would think.)

First reject if the name contains anything other than letters, numbers, and underscores; then reject if the name doesn't contain at least one letter; then check that the start and end are correct; etc. Each of those passes could be a simple-to-read and easy-to-maintain regular expression.
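A minimal sketch of the multi-pass idea in Ruby, using the rules from the question. The method name `username_errors` and the messages are made up for illustration; in a Rails model the same checks would presumably live in a custom validation.

```ruby
# Hypothetical helper: each pass is one small check that is easy to read on its own.
def username_errors(name)
  errors = []
  errors << "only letters, digits and _ allowed" if name =~ /[^A-Za-z0-9_]/
  errors << "needs at least one letter"          unless name =~ /[A-Za-z]/
  errors << "cannot start or end with _"         if name =~ /\A_|_\z/
  errors << "no consecutive underscores"         if name =~ /__/
  errors
end

p username_errors("test_0123")
p username_errors("01230123")
p username_errors("____")
```

A side benefit of separate passes is that each failure can carry its own message, which a single combined regex cannot do.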

Telemachus
:) `validates_format_of :name, :with => /\A[A-Za-z0-9-_]{4,20}\Z/, :on => :create` and I'm trying to relearn regex. It's been a while since I've really worked on a difficult regex. (It's a personal project, not a work-related one, or I would do a multiple-regex solution.)
epochwolf
-1 since the question asked for a regex and not an alternate solution. A regex accomplishing the above is easy enough that there really is no need for an alternate solution IMO
Rado
@Rado: You're entitled to your opinion (and your votes), but I think that the trail of failed answers here makes pretty clear that a regex is *not* easy enough here. As a more general rule, I think it's not unreasonable sometimes to say "You could do that, but maybe it's not the best idea."
Telemachus
@Telemachus: it took me less than 5 minutes to come up with a working solution (although it could be shortened a bit). This is the kind of problem that is best suited for regex and I am not sure why one would want to avoid regex just because it "may" take a little longer to come up with a quick implementation.
Rado
`/^(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*$/i` is actually fairly straightforward as far as regex goes (and in fact the `?:` prefixes are optional if one didn't care about not generating captures). While there are certainly times where regex isn't the best idea, I don't think this is actually one of them, as long as one actually takes a moment to consider the problem.
Amber
@Rado: I wasn't thinking about implementation time, but about ease of maintenance. Complex regular expressions like this that try to do four or five things all at once are hard to get exactly right and hard to read. There are already a number of good working answers, so maybe I jumped the gun here, but in general I would choose cleaner and easier to maintain over a single monster regular expression. Maybe that just means I should work on my regex-fu.
Telemachus
+1  A: 

This doesn't block "__", but it does get the rest:

([A-Za-z]|[0-9][0-9_]*)([A-Za-z0-9]|_[A-Za-z0-9])*

And here's the longer form that gets all your rules:

([A-Za-z]|([0-9]+(_[0-9]+)*([A-Za-z]|_[A-Za-z])))([A-Za-z0-9]|_[A-Za-z0-9])*

Dang, that's ugly. I agree with Telemachus that you probably shouldn't do this with one regex, even though it's technically possible. Regex is often a pain to maintain.

Tim
+5  A: 

This matches exactly what you want:

/^(?!_)(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*(?<!_)$/i
  1. At least one alphabetic character (the [a-z] in the middle).
  2. Does not begin or end with an underscore (the (?!_) and (?<!_) at the beginning and end).
  3. May have any number of numbers, letters, or underscores before and after the alphabetic character, but every underscore must be separated by at least one number or letter (the rest).

Edit: In fact, you probably don't even need the lookahead/lookbehinds due to how the rest of the regex works - the first ?: parenthetical won't allow an underscore until after an alphanumeric, and the second ?: parenthetical won't allow an underscore unless it's before an alphanumeric:

/^(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*$/i

Should work fine.
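One way to check this is to run the shorter regex against the valid and invalid lists from the question. The Ruby sketch below does that; the constant and variable names are invented for illustration.

```ruby
# The lookaround-free regex from this answer; the constant name is hypothetical.
USERNAME_RE = /^(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*$/i

# Test cases copied from the question.
valid   = %w[test_0123 t0e1s2t3 0123_test te0_s1t23 t_t]
invalid = %w[t__t ____ 01230123 _0123 _test _test123 test_ test123_]

wrongly_rejected = valid.reject   { |name| name =~ USERNAME_RE }
wrongly_accepted = invalid.select { |name| name =~ USERNAME_RE }

puts "wrongly rejected: #{wrongly_rejected.inspect}"
puts "wrongly accepted: #{wrongly_accepted.inspect}"
```

Note that in Ruby `^` and `$` match at line boundaries, so for input that might contain newlines `\A` and `\z` are the safer anchors.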

Amber
Seems promising. Which regex flavors allow `(?<!_)`?
Sinan Taifour
Ruby is choking on `(?<!_)`. It would appear Ruby 1.8 doesn't support negative lookbehinds, which would make this problem much, much harder.
epochwolf
Indeed it would. I know that both Perl and PHP (using Perl-style regex) support negative lookbehinds; I couldn't tell you off the top of my head what else does.
Amber
However, you could probably actually omit the lookbehinds in this case - `/^(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*$/i` should work fine.
Amber
Does the second version check that the initial character is *not* an underscore?
Telemachus
Yes - an initial underscore would not match `(?:[a-z0-9]_?)*[a-z]` and thus would be rejected.
Amber
@Dav: Cool. I think I'm still misreading it. Why doesn't the `_?` in the initial non-capturing group allow a single underscore to start the string?
Telemachus
Because the initial group is (leaving out the non-capturing mark for clarity) the following: `([a-z0-9]_?)` - notice that the underscore is NOT within the character class definition, thus it actually matches either 1 (just an [a-z0-9]) or 2 (an [a-z0-9] followed by a _) characters. It can't match just a _ on its own.
Amber
@Dav: Yeah, I finally see it and was just about to write. Man I'm thick today. Thanks for clarifying.
Telemachus
You're welcome. :)
Amber
This site shows what features are supported on various regex flavors: http://www.regular-expressions.info/refflavors.html
Tim Sylvester
A: 

Here you go:

^(([a-zA-Z]([^a-zA-Z0-9]?[a-zA-Z0-9])*)|([0-9]([^a-zA-Z0-9]?[a-zA-Z0-9])*[a-zA-Z]+([^a-zA-Z0-9]?[a-zA-Z0-9])*))$

If you want to restrict which symbols are accepted, simply replace every [^a-zA-Z0-9] with a [] class containing all the allowed symbols.

Rado
This is far more complicated than necessary...
Amber
Yeah ... you could combine both cases, but it still accomplishes the required task
Rado
+2  A: 

What about:

/^(?=[^_])([A-Za-z0-9]+_?)*[A-Za-z](_?[A-Za-z0-9]+)*$/

It doesn't use a lookbehind.

Edit:

It succeeds for all your test cases and is Ruby-compatible.

Sinan Taifour
A: 
/^(?![\d_]+$)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$/

Your question is essentially the same as this one, with the added requirement that at least one of the characters has to be a letter. The negative lookahead - (?![\d_]+$) - takes care of that part, and is much easier (both to read and write) than incorporating it into the basic regex as some others have tried to do.
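To illustrate how the negative lookahead and the rest of the pattern divide the work, here is a rough Ruby sketch that tests them separately. The constant names and the `report` helper are hypothetical; the patterns are taken from the regex above.

```ruby
# Hypothetical names; the patterns are the pieces of the regex in this answer.
GUARD = /^(?![\d_]+$)/                       # fails if the whole string is digits/underscores
SHAPE = /^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$/   # alphanumeric runs joined by single underscores
USERNAME_WITH_GUARD = /^(?![\d_]+$)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$/

# Returns [guard passes?, shape passes?, full regex passes?]
def report(name)
  [GUARD.match?(name), SHAPE.match?(name), USERNAME_WITH_GUARD.match?(name)]
end

%w[0123_test 01230123 0_1 t__t].each do |name|
  puts "#{name.ljust(10)} #{report(name).inspect}"
end
```

The shape pattern alone would accept `01230123` and `0_1`; the lookahead is what rejects them, because neither contains a letter.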

Alan Moore
+1  A: 

The question asks for a single regexp, and implies that it should be a regexp that matches, which is fine, and answered by others. For interest, though, I note that these rules are rather easier to state directly as a regexp that should not match. I.e.:

x !~ /[^A-Za-z0-9_]|^_|_$|__|^\d+$/
  • no other characters than letters, numbers and _
  • can't start with a _
  • can't end with a _
  • can't have two _s in a row
  • can't be all digits

You can't use it this way in a Rails validates_format_of, but you could put it in a validate method for the class, and I think you'd have a much better chance of still being able to make sense of what you meant a month or a year from now.
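As a rough illustration, the check could look like the following. This is a plain-Ruby stand-in, not ActiveRecord; the class, the `BAD_NAME` constant, and the error message are all hypothetical, and only the regex comes from above.

```ruby
# Plain-Ruby stand-in for a model class (not ActiveRecord); names are hypothetical.
class User
  # A name is rejected if ANY alternative matches.
  BAD_NAME = /[^A-Za-z0-9_]|^_|_$|__|^\d+$/

  attr_reader :name, :errors

  def initialize(name)
    @name = name
    @errors = []
  end

  # Mimics the idea of a validate method: collect errors, then report validity.
  def valid?
    @errors = []
    @errors << "name is not an acceptable username" if name =~ BAD_NAME
    @errors.empty?
  end
end

puts User.new("test_0123").valid?
puts User.new("t__t").valid?
```

One caveat: as written, the last alternative only rejects all-digit names, so a name like `0_1` passes this check. If names made of digits and underscores should be rejected too, `^[\d_]+$` in place of `^\d+$` would presumably cover that.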

glenn mcdonald
This is a very good point. I've actually added my entire question and the response I've accepted into the bottom of the file where the regex appears. There is also a note on the regex to look at the bottom of the file. Because I know I will have fun trying to figure out regex later.
epochwolf