Hi folks,

I am hoping that this will have a pretty quick and simple answer. I am using regular-expressions.info to help me get the right regular expression to turn a URL-encoded ISO-8859-1 pound sign ("%A3") into a URL-encoded UTF-8 pound sign ("%C2%A3").

In other words, I just want to replace %A3 with %C2%A3 when the %A3 is not already prefixed with %C2.

So I would have thought the following would work:

Regular Expression: (?!(\%C2))\%A3
Replace With:       %C2%A3

But it doesn't and I can't figure out why!

I assume my syntax is just slightly wrong, but I can't figure it out! Any ideas?

FYI - I know that the following will work (and have used this as a workaround in the meantime), but really want to understand why the former doesn't work.

Regular Expression: ([^\%C2])\%A3
Replace With:       $1%C2%A3

TIA!

+4  A: 

Unfortunately, the (?!...) syntax is negative lookahead: it checks the text that comes after the current position, not the text before it. To the best of my knowledge, JavaScript does not support negative lookbehind.
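
A small sketch of why the pattern from the question replaces every %A3, even one that already has the prefix. The variable names and the sample string are made up for illustration, and the `\%` escapes from the question are dropped because `%` needs no escaping.

```javascript
// (?!%C2) is a lookahead, so it inspects the text that starts at the
// current position. At that position the text is "%A3" itself, which never
// starts with "%C2", so the assertion always succeeds.
const lookaheadAttempt = /(?!%C2)%A3/g;

// Sample input that already contains the UTF-8 form.
const alreadyUtf8 = "price=%C2%A3100";

const result = alreadyUtf8.replace(lookaheadAttempt, "%C2%A3");
console.log(result); // "price=%C2%C2%A3100" (the prefix is doubled)
```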

What you could do is go forward with the replacement anyway, and end up with %C2%C2%A3 strings, but these could easily be converted in a second pass to the desired %C2%A3.
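
The two-pass idea might look roughly like this. `fixPoundTwoPass` is a hypothetical helper name, and the sketch assumes the input never legitimately contains the sequence %C2%C2%A3.

```javascript
// Hypothetical helper illustrating the two-pass approach.
function fixPoundTwoPass(encoded) {
  return encoded
    // Pass 1: prefix every %A3, including ones that were already correct.
    .replace(/%A3/g, "%C2%A3")
    // Pass 2: collapse the doubled prefix created by pass 1.
    .replace(/%C2%C2%A3/g, "%C2%A3");
}

console.log(fixPoundTwoPass("a%A3b%C2%A3c")); // "a%C2%A3b%C2%A3c"
```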

David Andres
I've asked two or three times whether lookbehind operators would be added to ECMAScript, on the mozilla.dev.tech.js-engine newsgroup and got no reply. Feel free to add your voice. http://groups.google.com/group/mozilla.dev.tech.js-engine/browse_thread/thread/5d8e24ca46aa72f1?hl=en#
Jason S
Thanks for the quick answer. It sounds silly, but I am finding it very hard to understand the major difference between lookahead and lookbehind. To my mind (and I know I am wrong, otherwise there wouldn't be two different names for it!), both just search for some characters without using them in the replacement? And thank you for the suggestion, but I think my workaround is slightly neater. :)
FrostbiteXIII
Think about it this way: regular expressions work by keeping track of where you currently are in the string. Lookahead checks where you're going (the text after that position) and lookbehind checks where you've been (the text before it). Perhaps lookbehind is missing because looking backwards from the current position is harder to implement.
David Andres
+3  A: 

You could replace

(^.?.?|(?!%C2)...)%A3

with

$1%C2%A3
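
As a rough sketch of how this pattern behaves in JavaScript (the names `notPrefixed` and `fixPoundWithLookahead` are made up for illustration): the group captures the up-to-three characters before %A3, and `$1` puts them back.

```javascript
// Either fewer than three characters precede %A3 (start of string),
// or the three preceding characters are anything other than "%C2".
const notPrefixed = /(^.?.?|(?!%C2)...)%A3/g;

// Hypothetical helper wrapping the replacement.
function fixPoundWithLookahead(encoded) {
  // $1 re-inserts the captured characters that came before %A3.
  return encoded.replace(notPrefixed, "$1%C2%A3");
}

console.log(fixPoundWithLookahead("ladskfjdkfj%A3")); // "ladskfjdkfj%C2%A3"
console.log(fixPoundWithLookahead("ladskfjd%C2%A3")); // unchanged
```

One case worth testing separately is two adjacent occurrences such as "%A3%A3": the three characters before the second %A3 are consumed by the first match, so the second one is not picked up in the same pass.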
Tomalak
This seems to match too much in some cases. Try to match this against the text "ladskfjdkfj%A3" and it appears that kfj%A3 is matched.
David Andres
...until after I remove the ellipses, but even then the string "ladskfjd%C2%A3" matches though it shouldn't...JavaScript isn't making this easy!
David Andres
@Tomalak: +1 That’s what I would have written.
Gumbo
@Gumbo: This is a sincere compliment for me. :)
Tomalak
@dandres109: Yes, the string is matched. In any case, that's what the backreference is for - the matched characters are added back where they belong.
Tomalak
@Tomalak: I wrote almost the same regular expression just a week ago (http://stackoverflow.com/questions/1357769).
Gumbo
@Tomalak: Yes, that makes sense
David Andres
This will probably work, but lookaheads are generally troublesome. IE/JScript/VBScript's regexp implementation has serious bugs (see http://blog.stevenlevithan.com/archives/regex-lookahead-bug ). Lookahead is a newer addition to ECMA-262 than most of JavaScript so there may also be browsers that don't support it. Care required!
bobince
Supported or not, buggy or not, lookahead seems to be overkill for this problem; it's *not* the same as that other question. See @Tashkant's answer -- http://stackoverflow.com/questions/1390037/javascript-regular-expressions-lookbehind-failing/1390187#1390187
Alan Moore
Thanks a lot guys. I think I was completely misunderstanding lookahead/behind - thanks for the link bobince, that helped me realise. Will go with what you suggest after all Alan and accept Tashkant's answer, given that lookahead really isn't suitable for this!
FrostbiteXIII
Look-ahead is not completely broken, even though this impression could arise after @bobince's comment. In fact it never failed me, even for expressions that were a lot more complicated than this. But I agree this solution is overkill for the problem, and @Tashkant's approach has my vote.
Tomalak
+1  A: 

I would suggest you use the functional form of JavaScript's String.replace (see the section "Specifying a function as a parameter"). This lets you put arbitrary logic, including state if necessary, into the replacement step. For your case, I'd use a simpler regexp that matches a superset of what you want, then test inside the function whether the match meets your exact criteria; if it doesn't, just return the matched string as is.

The only problem with this approach is that if you have overlapping potential matches, you have the possibility of missing the second match, since there's no way to return a value to tell the replace() method that it isn't really a match after all.
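
A minimal sketch of the function form, assuming the superset pattern is simply every %A3. `fixPoundWithCallback` is a hypothetical name; the callback arguments (matched text, offset, whole string) are the ones String.replace passes when the pattern has no capture groups.

```javascript
// Hypothetical helper: match every %A3, then decide in the callback.
function fixPoundWithCallback(encoded) {
  return encoded.replace(/%A3/g, (match, offset, whole) => {
    // Look at the three characters before the match in the original string.
    // Math.max avoids a negative start index, which slice would count
    // from the end of the string.
    const preceding = whole.slice(Math.max(0, offset - 3), offset);
    // Already prefixed: return the match as is. Otherwise add the prefix.
    return preceding === "%C2" ? match : "%C2%A3";
  });
}

console.log(fixPoundWithCallback("a%A3b%C2%A3c")); // "a%C2%A3b%C2%A3c"
```

Because the check reads from the original string rather than from what earlier matches consumed, adjacent occurrences like "%A3%A3" are both handled in this particular sketch.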

Jason S
+4  A: 

Why not just replace ((%C2)?%A3) with %C2%A3, making the prefix an optional part of the match? It means that you're "replacing" text with itself even when it's already right, but I don't foresee a performance issue.
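
In JavaScript this could be sketched as follows; `fixPoundOptionalPrefix` is a made-up helper name, and the `g` flag is an assumption so that every occurrence is replaced.

```javascript
// Hypothetical helper: the %C2 prefix is an optional part of the match,
// so both "%A3" and "%C2%A3" are replaced by "%C2%A3".
function fixPoundOptionalPrefix(encoded) {
  return encoded.replace(/((%C2)?%A3)/g, "%C2%A3");
}

console.log(fixPoundOptionalPrefix("a%A3b%C2%A3c")); // "a%C2%A3b%C2%A3c"
```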

Tashkant
Sounds great - don't know why I didn't think of that - thanks! :) Not accepting it as the answer as it is essentially another workaround (the point of the question was to find out why my lookbehind wasn't working), but thank you!
FrostbiteXIII
Ignore that - accepted answer - many thanks! :)
FrostbiteXIII
Very nice, +1. You could use `(?:(?:%C2)?%A3)` because capturing groups (and the backreferences they create) are not really necessary in this case.
Tomalak