I'm trying to do a replace on the following string prototype: "I&amp;lsquo;m singing &amp; dancing in the rain." The following regular expression matches the instance properly, but also captures the character following the instance of &amp;. "(&amp;)[#?a-zA-Z0-9;]" captures the following string from the above prototype: "&amp;l".

How can I limit it to only capture the &amp;?

Edit: I should add that I don't want to match "&amp;" by itself.

+1  A: 

Actually, you're matching the string &amp;l, but only the &amp; is captured. This is because the character class after the capture group matches one additional character.
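The difference between the whole match and the capture group can be checked directly. This is a minimal sketch, assuming the input is double-encoded so that it contains the literal text `&amp;lsquo;`; the variable names are only illustrative.

```javascript
// Assumed sample input: a double-encoded left single quote plus a plain "&amp;".
const sample = "I&amp;lsquo;m singing &amp; dancing in the rain.";

// The pattern from the question: a captured "&amp;" followed by ONE class character.
const match = /(&amp;)[#?a-zA-Z0-9;]/.exec(sample);

const fullMatch = match[0]; // the whole match, including the extra character
const captured = match[1];  // only what the parentheses enclose

console.log(fullMatch); // "&amp;l"
console.log(captured);  // "&amp;"
```

A replace operates on the whole match, so the extra character is lost unless it is put back or never consumed in the first place.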

But your original regex is a little flawed to begin with anyway. A (not optimal) replacement might be:

&amp;(#[0-9]+|#x[0-9a-fA-F]+|[a-zA-Z]+);

which will match the complete entity or character reference and capture the part between the &amp; and the semicolon.

Joey
This will match `"An ampersand "`
Tomalak
Good catch, thanks. Might be better now.
Joey
+3  A: 

Look for (this copes with named, decimal, and hexadecimal entities):

&amp;([A-Za-z]+|#x[\dA-Fa-f]+|#\d+);

replace with

&$1;
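As a rough illustration of this search and replace, here is a sketch in JavaScript. It assumes the text is double-encoded, so the search pattern starts with the literal characters `&amp;`; the sample input and variable names are made up for the example.

```javascript
// Assumed sample input: double-encoded named, decimal and hex entities,
// plus a stand-alone "&amp;" that should be left untouched.
const input = "I&amp;lsquo;m singing &amp; dancing &amp;#8217; &amp;#x2019;";

// "&amp;" followed by an entity name, "#x" + hex digits, or "#" + decimal digits,
// then a semicolon. Group 1 holds the part between "&amp;" and ";".
const doubleEncoded = /&amp;([A-Za-z]+|#x[\dA-Fa-f]+|#\d+);/g;

// String.prototype.replace refers to the first group as "$1".
const output = input.replace(doubleEncoded, "&$1;");

console.log(output);
// "I&lsquo;m singing &amp; dancing &#8217; &#x2019;"
```

The stand-alone `&amp;` survives because it is followed by a space rather than by an entity body and a semicolon.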

Be warned: there is a real chance of this going wrong. I recommend using an HTML parser to decode the text; if it was double-encoded, you can decode it twice. HTML and regex don't play well together, even on a small scale.

Since you are in JavaScript, I expect you are in a browser. If you are, you have a nice DOM parser at hand. Create a new element, assign the string to its `innerHTML` property, and read out the text value. Done.
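The DOM approach might look roughly like the sketch below. The helper names `decodeEntities` and `decodeTwice` are hypothetical. It also swaps in a `<textarea>` and reads its `value` rather than using a generic element, on the assumption that the input may contain markup that should not become live elements.

```javascript
// Decode HTML entities by letting the DOM parser do the work.
// `doc` is expected to be a browser Document, e.g. the global `document`.
function decodeEntities(doc, encoded) {
  // A <textarea> treats its content as text, so tags in the input are not
  // turned into elements; only character references are decoded.
  const scratch = doc.createElement("textarea");
  scratch.innerHTML = encoded;
  return scratch.value;
}

// Double-encoded input needs two passes.
function decodeTwice(doc, encoded) {
  return decodeEntities(doc, decodeEntities(doc, encoded));
}

// In a browser:
//   decodeTwice(document, "I&amp;lsquo;m singing &amp;amp; dancing");
```

Note that the second pass decodes everything that still looks like an entity, so this only suits text that is known to be double-encoded throughout.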

Tomalak
This works partially, but isn't matching strings like: `"ό"` I also had to change the replace string to: `""`
sholsinger
I ended up with the following search pattern: `""` and the following replace string: `""`
sholsinger
I should mention that this replace scheme actually removes leading 0s from the numeric entity, however this doesn't seem to affect rendering of the entities. Also, using `"\d+"` instead of `"[0-9]"` didn't work for reasons unknown to me.
sholsinger
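One possible explanation for `\d+` not working, assuming the pattern was built from a string (for example with `new RegExp`), which the comment does not actually say: inside a JavaScript string literal the backslash in `"\d"` is dropped before the regex engine ever sees it. A small sketch:

```javascript
// In a string literal, "\d" is not a recognised escape sequence,
// so the backslash is discarded and the pattern becomes "#d+".
const fromString = new RegExp("#\d+");

// Doubling the backslash keeps "\d" intact for the regex engine.
const escaped = new RegExp("#\\d+");

// A regex literal needs no doubling.
const literal = /#\d+/;

console.log(fromString.test("#972")); // false
console.log(escaped.test("#972"));    // true
console.log(literal.test("#972"));    // true
```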
Okay, back references are implementation-dependent. Some use backslashes, some use the dollar sign. I've edited the regex to include proper support for decimal and hexadecimal numeric entities.
Tomalak
Look-behinds aren't supported in JS, afaik...
J-P
Whoops, you are right. Damn. ;-) I'll remove the second regex.
Tomalak
This works marvelously. `""` It matches `` or `Ӓ` or `
sholsinger
A: 

If you only want to match &amp;, why did you include the character class [#?a-zA-Z0-9;] as well?

In English, your expression would be "Match &amp; followed by one character that is #, ?, a lowercase letter, an uppercase letter, a digit, or ;".

Just use (&amp;)

wwerner
A: 

You probably meant:

"&amp;([#a-zA-Z0-9]+;)"
Roger Pate
+2  A: 

I gather that you want to match &amp;, but only if it is followed by an alphanumeric character or certain punctuation. That calls for a lookahead. This regular expression should match what you want without capturing or consuming any additional characters.

(&amp;)(?=[#?a-zA-Z0-9;])
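To see what the lookahead does in a replace, here is a sketch assuming double-encoded input (literal `&amp;` in the text). The `wholeEntity` pattern is a hypothetical stricter variant, not part of the answer. It is included only to show the difference between looking ahead by one character and looking ahead for a complete entity.

```javascript
// Assumed sample input; "&amp;chips" is not a real entity.
const input = "I&amp;lsquo;m singing &amp; dancing, fish &amp;chips";

// Lookahead with a single class character: "&amp;" matches only if such a
// character follows, and that character is not consumed.
const singleChar = /&amp;(?=[#?a-zA-Z0-9;])/g;

// Hypothetical stricter variant: require a full entity body and a semicolon.
const wholeEntity = /&amp;(?=(?:[A-Za-z]+|#x[\dA-Fa-f]+|#\d+);)/g;

const loose = input.replace(singleChar, "&");
const strict = input.replace(wholeEntity, "&");

console.log(loose);  // "I&lsquo;m singing &amp; dancing, fish &chips"
console.log(strict); // "I&lsquo;m singing &amp; dancing, fish &amp;chips"
```

Both versions leave the stand-alone `&amp;` alone, but only the stricter one ignores an `&amp;` that merely happens to be followed by a letter.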

Rob Kennedy
This didn't do anything in Firefox... though I haven't checked other browsers. Are lookaheads supported by JavaScript regular expression engines?
sholsinger
Mozilla developer reference says yes. https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp
Rob Kennedy
Rob Kennedy: You are wrong here. `[#?a-zA-Z0-9;]` is a character class that matches *any single one* of these characters (no quantifier!). Also, a character class has no order, so it would be equivalent to, say, `[?0-9;#a-zA-Z]`.
Tomalak
Tomalak correctly gathered that I was intending to match HTML character entities, which require quantification, or, at least matching the semicolon after a qualifying single character.
sholsinger