This is a sample of the XML file:

<row tnote="0">
<entry namest="col2" nameend="col4" us="none" emph="bld"><blst>
<li><text>Single, head of household, or qualifying widow(er)&#x2014;$55,000</text></li>
<li><text>Married filing jointly&#x2014;$115,000</text></li>
</blst></entry>
<entry colname="col6" ldr="1" valign="middle">&#x2002;</entry>
<entry colname="col7" valign="middle"> 5.</entry>
</row>

The &#x2014; etc. represent HTML 4.0 entities. I want to store each line's text as an element of an array, but not if the line is just &#x2002;

if e.text.strip =~ /^&#x20[0-9][0-9];$/ then
  next
else
  subLines << e.text
end

But it doesn't seem to be working. Is my regex incorrect?

+1  A: 

Because your regex is of the form /^...$/, it will only match against the entire string. You will only skip text that consists entirely of one HTML entity.
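To illustrate the anchoring point, here is a minimal sketch that runs the question's pattern against two plain strings. The variable names are made up for the example, and the strings are literal, unparsed text rather than values coming from an XML parser. (In Ruby, ^ and $ anchor at line boundaries; for single-line strings like these that amounts to the whole string.)

```ruby
# The pattern from the question, anchored at both ends.
PATTERN = /^&#x20[0-9][0-9];$/

# Hypothetical sample strings: literal text, not parser output.
only_reference = '&#x2002;'
mixed_content  = 'Married filing jointly&#x2014;$115,000'

# =~ returns the match position, or nil when there is no match.
whole_match = !(only_reference =~ PATTERN).nil?
mixed_match = !(mixed_content =~ PATTERN).nil?

puts "reference only: #{whole_match}"
puts "mixed content:  #{mixed_match}"
```

Only the string that consists of nothing but the reference matches; the one with surrounding text does not.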

Ned Batchelder
It doesn't even skip that one; I tried it out in irb...
charudatta
+2  A: 

&#x...; isn't an entity reference, it's a character reference. To an XML parser, &#x2014; is absolutely identical to the raw character — (U+2014), so when you look at the DOM produced by an XML parser through a property such as element.text you won't see anything with an ampersand in it, just a single character.
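A quick way to see this for yourself is the sketch below. It assumes REXML (the question doesn't say which parser is in use, so that is a guess) and Ruby 1.9 or later for the "\u2002" escape; the trimmed-down XML string is made up from the sample above.

```ruby
require "rexml/document"

# Cut-down version of the sample row (assumption: REXML is the parser).
xml = '<row><entry colname="col6">&#x2002;</entry>' \
      '<entry colname="col7"> 5.</entry></row>'

doc   = REXML::Document.new(xml)
texts = doc.root.elements.to_a("entry").map { |e| e.text }

first = texts[0]

# The parser has already decoded the character reference, so the text
# is a single U+2002 character with no ampersand in it.
puts first == "\u2002"
puts first.include?("&")

# String#strip only removes ASCII whitespace (and NUL), so the en space
# survives the strip call used in the question.
puts first.strip == "\u2002"
```

That would also explain the irb result: by the time the regex runs, there is no "&#x2002;" text left to match.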

So in principle, you'd match it with a regex something like /[—– ]/. However, if you are using Ruby 1.8, you've got the problem that the language itself doesn't have support for Unicode, so the character group in /[—– ]/ won't quite work properly: it'll treat every byte in the UTF-8 representation of —, – and the en space as a separate character to remove, which will likely mangle any other characters that share those bytes.

A simple string replace for each target character would work correctly, as that doesn't require special character handling. (Naturally if you included characters like — directly in the source code you'd also have to get the file encoding of that script right, so it's probably easier to use a string literal escape like "\xe2\x80\x94".)
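As a rough sketch of the string-replace approach applied to the original loop: the FILLER list and the texts array below are stand-ins invented for the example (texts plays the role of the e.text values a parser would return). It assumes Ruby 1.9 or later for the "\u" escapes; on 1.8 the byte form, e.g. "\xe2\x80\x82" for U+2002, would be the equivalent.

```ruby
# Hypothetical list of characters that should not count as content.
# U+2002 (en space) is the one from the sample; extend as needed.
FILLER = ["\u2002"]

# Stand-ins for the e.text values produced by the XML parser.
texts = ["\u2002", "Married filing jointly\u2014$115,000", " 5."]

sub_lines = []
texts.each do |text|
  # Plain string replacement for each target character, no regex.
  remainder = FILLER.inject(text) { |s, ch| s.gsub(ch, "") }
  # Skip lines that contain nothing but filler and ASCII whitespace.
  next if remainder.strip.empty?
  sub_lines << text
end

puts sub_lines.length
```

The original text is stored unchanged; the replacement is only used to decide whether the line is worth keeping.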

bobince