ansaurus

Question

How do I remove &#x2002; &#x2014; &#x2013; special characters from my XML files?

Answer 1

+1 A:

Because your regex is of the form /^...$/, it will only match against the entire string. You will only skip text that consists entirely of one HTML entity.
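To make the anchoring point concrete, here is a minimal sketch. The asker's actual regex is not shown in this thread, so ANCHORED and UNANCHORED below are hypothetical patterns, and both helper names are made up for illustration.

```ruby
# Hypothetical patterns for illustration; the asker's real regex is not shown.
ANCHORED   = /^&#x[0-9a-fA-F]+;$/  # ^ and $ force the match to span a whole line
UNANCHORED = /&#x[0-9a-fA-F]+;/    # no anchors: matches a reference anywhere

# Returns true only when the text is nothing but one character reference.
def only_a_reference?(text)
  !!(text =~ ANCHORED)
end

# Removes every hexadecimal character reference from raw (unparsed) text.
def strip_references(text)
  text.gsub(UNANCHORED, "")
end

puts only_a_reference?("&#x2014;")    # true
puts only_a_reference?("a&#x2014;b")  # false
puts strip_references("a&#x2014;b")   # ab
```

Note that in Ruby ^ and $ anchor to line boundaries; \A and \z anchor to the whole string. Either way, an anchored pattern cannot pick a reference out of the middle of surrounding text.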

Ned Batchelder 2010-07-27 18:46:48

It doesn't even skip that one; I tried it out in irb...

charudatta 2010-07-27 19:22:36

Answer 2

+2 A:

&#x...; isn't an entity reference, it's a character reference. To an XML parser, &#x2014; is absolutely identical to the raw character —, so when you look at the DOM produced by an XML parser through a property such as element.text you won't see anything with an ampersand in it, just a plain — character.
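A quick way to check this is to parse a tiny document and look at the text that comes back. This sketch assumes REXML is available (it ships with standard Ruby installs); the thread does not say which parser the asker uses, and root_text is a hypothetical helper name.

```ruby
require "rexml/document"

# Parses a small XML snippet and returns the text of its root element.
# By this point the parser has already decoded any character references.
def root_text(xml)
  REXML::Document.new(xml).root.text
end

text = root_text("<p>a&#x2014;b</p>")
puts text.include?("&")    # no ampersand survives parsing
puts text == "a\u2014b"    # the reference became a real U+2014 character
```

So any removal done after parsing has to target the characters themselves, not the &#x...; spelling.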

So in principle, you'd match it with a regex something like /[—– ]/ (em dash, en dash and en space). However, if you are using Ruby 1.8, you've got the problem that the language itself doesn't have support for Unicode, so the character group in /[—– ]/ won't quite work properly: it'll try to remove every byte in the UTF-8 representation of —, – and the en space (U+2002), which will likely mangle any other characters that contain one of those bytes.
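For comparison, on Ruby 1.9 or later (an assumption; the answer above is about 1.8) strings and regexes are encoding-aware, so a character class does work per character. A minimal sketch, with hypothetical names TARGETS and strip_targets, using \u escapes so the script's own file encoding does not matter:

```ruby
# Character class of the three targets, written with \u escapes:
# U+2002 EN SPACE, U+2013 EN DASH, U+2014 EM DASH.
# Assumes Ruby 1.9+; this will not behave this way on Ruby 1.8.
TARGETS = /[\u2002\u2013\u2014]/

# Removes the three target characters and leaves everything else alone.
def strip_targets(text)
  text.gsub(TARGETS, "")
end

puts strip_targets("a\u2014b\u2013c\u2002d")  # abcd
puts strip_targets("caf\u00e9\u2014ok")       # caféok
```

The second call shows that other multi-byte characters such as é come through intact, which is exactly what a byte-wise class on 1.8 cannot promise.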

A simple string replace for each target character would work correctly, as that doesn't require special character handling. (Naturally if you included characters like — directly in the source code you'd also have to get the file encoding of that script right, so probably easier to use a string literal escape like "\xe2\x80\x94".)
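The plain string replace could look something like this. It is only a sketch: TARGET_BYTES and remove_each are hypothetical names, and the three escapes are the UTF-8 byte sequences for U+2002, U+2013 and U+2014.

```ruby
# UTF-8 byte sequences for the three targets, written as escapes so the
# script's own file encoding does not matter.
TARGET_BYTES = [
  "\xe2\x80\x82",  # U+2002 EN SPACE
  "\xe2\x80\x93",  # U+2013 EN DASH
  "\xe2\x80\x94",  # U+2014 EM DASH
]

# Removes each target in turn. gsub with a String pattern matches the
# literal byte sequence, so no character class is involved.
def remove_each(text)
  TARGET_BYTES.inject(text) { |acc, target| acc.gsub(target, "") }
end

puts remove_each("a\xe2\x80\x94b\xe2\x80\x93c\xe2\x80\x82d")  # abcd
```

Because each replacement matches a complete three-byte sequence, a character that merely shares a byte with one of the targets is left untouched.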

bobince 2010-07-27 19:27:46
