ansaurus

Question

Regex To Match &entity; or &#0-9; And Capture &

Answer 1

+1 A:

Actually, you're matching the string &l, but only the & is captured. This is because the character class after the capture group matches one additional character.
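To make the difference between "matched" and "captured" concrete, here is a minimal sketch. It assumes the pattern in question was `(&)[#?a-zA-Z0-9;]` (pieced together from the other answers, not quoted from the question), and the sample string is made up.

```javascript
// Assumed original pattern: a capture group followed by a character class.
var pattern = /(&)[#?a-zA-Z0-9;]/;
var sample = "&lt;";            // hypothetical input

var result = pattern.exec(sample);
var wholeMatch = result[0];     // full match: "&" plus one character from the class
var captured = result[1];       // only what the parentheses enclose

console.log(wholeMatch);        // "&l"
console.log(captured);          // "&"
```

Any replacement that substitutes the whole match would therefore also swallow the character after the ampersand.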

But your original regex is a little flawed to begin with anyway. A (not optimal) replacement might be:

&amp;(#[0-9]+|#x[0-9a-zA-Z]+|[a-zA-Z]+);

which will match the complete entity or character declaration and capture the &.

Joey 2009-11-19 16:17:34

This will match `"An ampersand "`

Tomalak 2009-11-19 16:22:29

Good catch, thanks. Might be better now.

Joey 2009-11-19 16:37:55

Tomalak 2009-11-19 17:25:37

Joey 2009-11-19 17:27:22

Answer 2

+3 A:

Look for (this copes with named, decimal and hexadecimal entities):

&amp;([A-Za-z]+|#x[\dA-Fa-f]+|#\d+);

replace with

&$1;
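As a rough sketch of how this search and replace pair could be applied in JavaScript: the helper name `fixDoubleEncoding` and the sample input are assumptions for illustration, and the `g` flag is added so every occurrence is replaced.

```javascript
// Pattern from this answer: named, hexadecimal and decimal entities whose
// leading "&" was itself encoded as "&amp;".
var doubleEncoded = /&amp;([A-Za-z]+|#x[\dA-Fa-f]+|#\d+);/g;

// Hypothetical helper: undo one level of double encoding.
function fixDoubleEncoding(text) {
  // $1 is the entity name or number captured between "&amp;" and ";".
  return text.replace(doubleEncoded, "&$1;");
}

var input = "&amp;lt;b&amp;gt; &amp;#039; &amp;#x3CC; fish &amp; chips";
console.log(fixDoubleEncoding(input));
// "&lt;b&gt; &#039; &#x3CC; fish &amp; chips"
```

The `&amp;` in "fish &amp; chips" is left alone because it is not followed by something that looks like an entity.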

Be warned: this has a real chance of going wrong. I recommend using an HTML parser to decode the text. You can decode it twice if it was double-encoded. HTML and regex don't play well together, even on a small scale.

Since you are using JavaScript, I expect you are in a browser. If so, you have a nice DOM parser at hand. Create a new element, assign the string to its innerHTML property and read out the text value. Done.
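A minimal sketch of that DOM approach might look like the following. The helper name `decodeEntities` and its optional `doc` parameter are assumptions for illustration. Using a `textarea` as the scratch element is also my choice, not something the answer specifies: its content is read back through `value`, and it is treated as text rather than markup.

```javascript
// Hypothetical helper. `doc` defaults to the browser's global `document`;
// the parameter exists only so another document object can be supplied.
function decodeEntities(encoded, doc) {
  doc = doc || document;
  // The browser's parser decodes character references while filling the
  // textarea; the decoded text is then read back from `value`.
  var scratch = doc.createElement("textarea");
  scratch.innerHTML = encoded;
  return scratch.value;
}

// In a browser:
//   decodeEntities("&amp;lt;")                   one pass
//   decodeEntities(decodeEntities("&amp;lt;"))   two passes for double-encoded text
```

This relies on a DOM being available, so it will not work as-is outside a browser.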

Tomalak 2009-11-19 16:21:20

This works partially, but isn't matching strings like: `"ό"` I also had to change the replace string to: `""`

sholsinger 2009-11-19 16:36:27

I ended up with the following search pattern: `""` and the following replace string: `""`

sholsinger 2009-11-19 16:42:16

I should mention that this replace scheme actually removes leading 0s from the numeric entity, however this doesn't seem to affect rendering of the entities. Also, using `"\d+"` instead of `"[0-9]"` didn't work for reasons unknown to me.

sholsinger 2009-11-19 16:44:49

Okay, back references are implementation-dependent. Some use backslashes, some use the dollar sign. I've edited the regex to include proper support for decimal and hexadecimal numeric entities.

Tomalak 2009-11-19 17:15:43

Look-behinds aren't supported in JS, afaik...

J-P 2009-11-19 17:32:08

Whoops, you are right. Damn. ;-) I'll remove the second regex.

Tomalak 2009-11-19 17:34:20

This works marvelously. `""` It matches `` or `Ӓ` or `

sholsinger 2009-11-19 19:41:26

sholsinger 2009-11-19 19:52:26

Answer 3

A:

If you only want to match &, why did you include the character class [#?a-zA-Z0-9;] as well?

In English, your expression would be "Match & followed by a single character that is #, ?, a lowercase letter, an uppercase letter, a digit or ;".

Just use (&)

wwerner 2009-11-19 16:21:48

Answer 4

A:

You probably meant:

"&amp;([#a-zA-Z0-9]+;)"

Roger Pate 2009-11-19 16:22:10

Answer 5

+2 A:

I gather that you want to match &, but only if it is followed by an alphanumeric character or certain punctuation. That calls for lookahead. This regular expression should match what you want without capturing or consuming any additional characters.

(&)(?=[#?a-zA-Z0-9;])
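Here is a small sketch of what the lookahead does in JavaScript. The sample string is made up, and the `g` flag and the bracket markers are only there to make each match visible.

```javascript
// Pattern from this answer: capture "&" only when the next character is in
// the class; the lookahead checks that character without consuming it.
var ampBeforeEntityChar = /(&)(?=[#?a-zA-Z0-9;])/g;

var sample = "a &lt; b & c &x";   // hypothetical input

// Wrap every match in brackets to show exactly what was matched.
var marked = sample.replace(ampBeforeEntityChar, "[$1]");
console.log(marked);   // "a [&]lt; b & c [&]x"
```

Note that `&x` is matched as well: the class only looks at one character, so this pattern does not verify that a complete entity follows.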

Rob Kennedy 2009-11-19 16:29:16

This didn't do anything in Firefox... though I haven't checked other browsers. Are lookaheads supported by JavaScript regular expression engines?

sholsinger 2009-11-19 16:41:22

Mozilla developer reference says yes. https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp

Rob Kennedy 2009-11-19 17:10:52

Rob Kennedy: You are wrong here `[#?a-zA-Z0-9;]` - this is a character class that matches *any single one* of these characters (no quantifier!). Also, a character class has no order, so it would be equivalent to, say `[?0-9;#a-zA-Z]`.

Tomalak 2009-11-19 17:20:57

Rob Kennedy 2009-11-19 18:16:32

Tomalak correctly gathered that I was intending to match HTML character entities, which requires quantification, or at least matching the semicolon after a single qualifying character.

sholsinger 2009-11-19 19:33:23
