Hi All,

I want to create a regular expression for the following scenario:

If a string contains the percentage character (%) then it can only contain the following: %20, and cannot be preceded by another '%'.

So if the string contained %25, for instance, it would be rejected. The following string would be valid:

http://www.test.com/?&Name=My%20Name%20Is%20Vader

But these would fail:

http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25

%%%25

Any help would be greatly appreciated,

Kyle


EDIT:

The scenario, in a nutshell, is that a link is written in an encoded state and then launched via JavaScript. No decoding works: I tried .NET decoding and JS decoding, and both gave the same result. The link stays encoded when executed.

+5  A: 

Doesn't require a %:

/^[^%]*(%20[^%]*)*$/
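As a quick check, the pattern can be run against the three strings from the question. This is only a minimal Python sketch: the thread never says which language is in use, and the helper name `only_space_escapes` is made up for illustration.

```python
import re

# Pattern from the answer: runs of non-% characters, optionally separated by %20.
PATTERN = re.compile(r"^[^%]*(%20[^%]*)*$")

def only_space_escapes(s: str) -> bool:
    """Return True if every % in s is the start of a %20 sequence."""
    return PATTERN.fullmatch(s) is not None

if __name__ == "__main__":
    for candidate in (
        "http://www.test.com/?&Name=My%20Name%20Is%20Vader",
        "http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25",
        "%%%25",
    ):
        print(only_space_escapes(candidate), candidate)
```

Because both groups are starred, a string with no % at all (including the empty string) also matches, which is what "doesn't require a %" refers to.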
Mark Byers
Thanks Mark, this suits my needs.
Kyle Rozendo
+1  A: 

I think that would find what you need

/^([^%]|%%|%20)+$/

Edit: Added case where %% is valid string inside URI
Edit2: And fixed it for case where it should fail :-)
Edit3:

In case you need to use it in an editor (which would explain why you can't do this in a more programmatic way), you have to escape all the special characters correctly. For example, in Vim the regex should look like this:

/^\([^%]\|%%\|%20\)\+$/
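For comparison with the Vim form, here is a small Python sketch of the unescaped pattern, where no backslashes are needed. The helper name `is_valid` is hypothetical, and Python is an assumption since the question does not name a language.

```python
import re

# Each unit is a non-% character, a doubled %% (literal percent), or %20.
PATTERN = re.compile(r"^([^%]|%%|%20)+$")

def is_valid(s: str) -> bool:
    """Return True if s consists only of the three allowed units."""
    return PATTERN.fullmatch(s) is not None

if __name__ == "__main__":
    for candidate in ("My%20Name", "a%%25", "%%%25", "Accountant%25", ""):
        print(is_valid(candidate), repr(candidate))
```

Note that the `+` quantifier means the empty string does not match, and that `%%25` is accepted because `%%` is consumed as a pair before the `25`.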
MBO
Joey
You're right, it stopped before the invalid sequence and matched empty strings in a loop. Fixed now.
MBO
Seems to work now. Nice.
Joey
I don't seem to be getting any matches with this.
Kyle Rozendo
@Kyle I do get matches for the provided strings. You didn't mention which language you use, or whether you only want to test for a match or extract something with the regex. I tested mine in Ruby and with <http://www.rubular.com/regexes/12107>
MBO
+1  A: 

Another solution if look-arounds are not available:

^([^%]|%([013-9a-fA-F][0-9a-fA-F]|2[1-9a-fA-F]))*$
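Trying this pattern in Python suggests that it accepts every two-digit hex escape except %20, so a string containing %20 does not match. If that reading is right, a match here indicates a string without %20 rather than one that only contains %20. The sketch below is just a way to check the behaviour (the helper name `matches` is made up), not a statement of what the answer intended.

```python
import re

# Pattern copied verbatim from the answer.
PATTERN = re.compile(r"^([^%]|%([013-9a-fA-F][0-9a-fA-F]|2[1-9a-fA-F]))*$")

def matches(s: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return PATTERN.fullmatch(s) is not None

if __name__ == "__main__":
    for candidate in ("plain", "a%25", "My%20Name", "%%%25"):
        print(matches(candidate), repr(candidate))
```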
Gumbo
A: 

Maybe a better approach is to deal with that validation after you decode that string:

string name = HttpUtility.UrlDecode(Request.QueryString["Name"]);
Rubens Farias
Yep, tried this approach before and it unfortunately did not suit the scenario, thanks for the answer though!
Kyle Rozendo
+2  A: 

Which language are you using?

Most languages have a URI encoder / decoder function or class. I would suggest you decode the string first and then check for valid (or invalid) characters.

e.g. something like /[\w ]/ (the blank inside the brackets is a space)
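A rough Python sketch of the decode-then-check idea follows. It assumes the check is applied to a single parameter value rather than a whole URL, it extends the character class to `[\w ]+` with a full match so that every character is tested, and the function name `valid_after_decoding` is hypothetical.

```python
import re
from urllib.parse import unquote

# Whitelist from the answer: word characters and spaces only.
ALLOWED = re.compile(r"[\w ]+")

def valid_after_decoding(raw: str) -> bool:
    """Percent-decode raw, then require every character to be allowed."""
    return ALLOWED.fullmatch(unquote(raw)) is not None

if __name__ == "__main__":
    print(valid_after_decoding("My%20Name%20Is%20Vader"))
    print(valid_after_decoding("VadersAccountant%25"))
```

With this approach %25 decodes to a literal % and is then rejected by the whitelist, so no special handling of the escape itself is needed.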

If you apply a regex first, you need to take into account that www.example.com/index.html?user=admin&pass=%%250 means that the pass really is "%250".

SchlaWiener
Eep, very valid point. Thanks.
Joey
Ok, couldn't get the regex to work with that constraint. Another test string to consider (which should fail again): "example.com/?pass=%%%25"
Joey
Bugger, you're right. I need to update the question with that. Unfortunately I don't have the option of URL decoding here, so the regex is my last hope ;)
Kyle Rozendo
A: 

I agree with dominic's comment on the question. Don't use Regex.

If you want to avoid scanning the string twice, you can just iteratively search for % and then check that it is followed by 20 and nothing else. (Update: allow a following %, so that %% is interpreted as a literal % in front of an nnn sequence)

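// pseudo code: find(char, start) returns -1 when nothing is found,
// substring(start, length) returns at most length characters
pos = mystring.find('%', 0)
while (pos != -1)
{
     if mystring[pos+1] == "%" then
         pos = pos + 2 // ok, this is a literal, skip ahead
     else if mystring.substring(pos+1, 2) != "20" then
          return false; // string is invalid
     else
         pos = pos + 3 // valid %20, skip past it
     end if
     pos = mystring.find('%', pos) // look for the next %
}
return true;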
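The scan described above can be sketched as runnable Python. This is one possible reading of the pseudocode, under the assumption that %% is a literal percent sign and that a lone trailing % is invalid. The function name is made up for illustration.

```python
def only_literal_percent_or_space(s: str) -> bool:
    """Scan once; every % must start either %% (literal) or %20."""
    pos = s.find("%")
    while pos != -1:
        following = s[pos + 1 : pos + 3]  # up to two characters after the %
        if following.startswith("%"):
            pos += 2  # "%%" is a literal percent sign, skip both
        elif following == "20":
            pos += 3  # "%20" is the one allowed escape
        else:
            return False  # any other escape, or a trailing lone %
        pos = s.find("%", pos)
    return True

if __name__ == "__main__":
    for candidate in ("My%20Name", "a%%25", "%%%25", "Accountant%25", "abc%"):
        print(only_literal_percent_or_space(candidate), repr(candidate))
```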
Isak Savo
Why should `%200` be disallowed? If I want to type a `0` after a space this should be totally possible, actually.
Joey
As I said, it is not possible to do this in the current scenario, however I appreciate the answer, thanks.
Kyle Rozendo
Your approach suffers from the same problem as the regex one, though. See SchlaWiener's answer.
Joey
Johannes: good point
Isak Savo
About %200: I was under the impression that multi-octet characters (e.g. UTF-8 encoded characters) would be URL-encoded with a single '%' sign, but I may be wrong here. If so, then there is no need to check for subsequent digits
Isak Savo
A: 
/^([^%]|%20)*$/
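A short Python sketch for trying this pattern on the strings from the question (Python and the helper name `accepts` are assumptions, not part of the answer):

```python
import re

# Every unit is either a single non-% character or the sequence %20.
PATTERN = re.compile(r"^([^%]|%20)*$")

def accepts(s: str) -> bool:
    """Return True if the whole string is built from the allowed units."""
    return PATTERN.fullmatch(s) is not None

if __name__ == "__main__":
    for candidate in (
        "http://www.test.com/?&Name=My%20Name%20Is%20Vader",
        "http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25",
        "%%%25",
    ):
        print(accepts(candidate), candidate)
```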
Dave Hinton
Would someone care to explain the downvote? If my answer is wrong I would like to know why.
Dave Hinton
This means "either starts with something that is '%20' or not '%'" ...
Mez
...and continues with something that is either `%20` or not `%`, until we hit end of string. Which is what the OP asked for.
Dave Hinton
+1  A: 

Reject the string if it matches %[^2][^0]
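One thing worth checking with this pattern is that the two character classes test each position independently, so `%25` (a 2 in the first position) and `%30` (a 0 in the second) are not caught. The Python sketch below compares it with a negative-lookahead variant, `%(?!20)`, which is my own assumed alternative and not something proposed in the thread. The names `AS_POSTED`, `WITH_LOOKAHEAD` and `rejected` are made up.

```python
import re

# Rejection pattern exactly as posted in the answer.
AS_POSTED = re.compile(r"%[^2][^0]")
# Assumed alternative (not from the thread): a % not immediately followed by 20.
WITH_LOOKAHEAD = re.compile(r"%(?!20)")

def rejected(pattern: re.Pattern, s: str) -> bool:
    """Return True if the rejection pattern finds a bad sequence in s."""
    return pattern.search(s) is not None

if __name__ == "__main__":
    for candidate in ("My%20Name", "a%25", "a%30", "a%31"):
        print(
            repr(candidate),
            rejected(AS_POSTED, candidate),
            rejected(WITH_LOOKAHEAD, candidate),
        )
```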

Amarghosh
-1 This wouldn’t allow *any* string that contains `%2x` or `%x0` where `x` can be any arbitrary character.
Gumbo
@Gumbo **And that's exactly what OP wants**. Quoting from the question "If a string contains the percentage character (%) then it can only contain the following: %20, and cannot be preceded by another '%'"
Amarghosh
A: 

This requires a test against the "bad" patterns. If we're allowing %20 - we don't need to make sure it exists.

As others have said before, %% is valid too... and %%25 would be %25

The below regex matches anything that doesn't fit into the above rules

/(?<![^%]%)%(?!(20|%))/

The first brackets check whether there is a % before the character (meaning that it's %%) and also check that it's not %%%. The regex then matches a %, and the last brackets check that what follows is neither 20 nor another %.

This means that if anything is identified by the regex, then you should probably reject it.
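A hedged Python sketch of using the pattern as a rejection test follows. Python's `re` module accepts this lookbehind because it has a fixed width of two characters; other engines may differ. The helper name `should_reject` is hypothetical, and the test strings are the ones mentioned elsewhere in the thread.

```python
import re

# A % that is not the second half of a %% pair and is not followed
# by 20 or by another %.
BAD = re.compile(r"(?<![^%]%)%(?!(20|%))")

def should_reject(s: str) -> bool:
    """Return True if the string contains a disallowed % sequence."""
    return BAD.search(s) is not None

if __name__ == "__main__":
    for candidate in (
        "http://www.test.com/?&Name=My%20Name%20Is%20Vader",
        "http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25",
        "example.com/?pass=%%25",
        "example.com/?pass=%%%25",
    ):
        print(should_reject(candidate), candidate)
```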

Mez
It doesn't work correctly for `%%25` here.
Joey
Apologies... fixed.
Mez