I have the following attribute in an xml node I'm reading with libxml. It prints out normally with the accented character if I print out reader.node.

reader = XML::Reader.new(File.open("somefile.xml", "r"))
reader.read
reader.read
...
p reader.node

=> ... Full_Name="Univisión Network - East Feed" ...

If I do this, though, it comes out escaped.

p reader.node["Full_Name"]
=> "Univisi\xC3\xB3n Network - East Feed"

And when I try to convert this value to JSON later, I get the following error.

Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8

Here is the xml line in the document

<?xml version="1.0" encoding="ISO-8859-1"?>

I don't have control over the xml document itself. How can I get that unicode character back into json, or into a format json understands?

EDIT: Oh, I forgot to mention - this is how it looks in the actual XML document

Full_Name="Univisi&#243;n Network - East Feed" 
A: 

If I do this, though, it comes out escaped.

Not quite. What you're seeing is UTF-8 output interpreted as a string of bytes.

The problem is that your XML document says it's ISO-8859-1, while it is really UTF-8. Fix the encoding problems and it should work.
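
To make the "string of bytes" point concrete, here is a small sketch in plain Ruby 1.9+ with no libxml involved. The literal below is only a stand-in for the value `reader.node["Full_Name"]` returned in the question, and the constant names `RAW` and `UTF8` are made up for the illustration.

```ruby
# "ó" is U+00F3; its UTF-8 encoding is the two bytes 0xC3 0xB3.
# RAW stands in for the attribute value from the question: UTF-8 bytes
# that Ruby has tagged as binary (ASCII-8BIT).
RAW = "Univisi\xC3\xB3n".b

# Same bytes, relabelled as UTF-8. Nothing is transcoded here.
UTF8 = RAW.dup.force_encoding(Encoding::UTF_8)

puts RAW.encoding == Encoding::BINARY  # true
puts RAW.inspect                       # "Univisi\xC3\xB3n"
puts RAW.length                        # 10 (counted as bytes)
puts UTF8.length                       # 9  (counted as characters)
puts UTF8.valid_encoding?              # true
puts RAW.bytes == UTF8.bytes           # true
```

So the backslashes are not escaping in the data itself. They are how `p` displays bytes above 0x7F when the string carries no real encoding.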

Jim Garrison
Do you know off the top of your head how to override the xml document's encoding? Again, I can't change the xml document.
Sean Clark Hess
This is not working: `reader = XML::Reader.io(file, :encoding => XML::Encoding::UTF_8)`. When I ask for the document's encoding, it still returns the old one.
Sean Clark Hess
it should be `XML::Encoding::ISO_8859_1` - see my answer.
ax
I still get the error even when setting the encoding to ISO_8859_1. Sorry I'm so confused here
Sean Clark Hess
A: 

EDIT
so i've been trying to figure this out for quite some time now. funny thing: your code works without error in ruby 1.8 (at least here). so i think the error has to do with ruby 1.9's new encoding handling. somehow it cannot figure out that the parsed and read XML is in (libxml's internal) utf-8 format. the document encoding doesn't matter here: in 1.8 it works with both iso-8859-1 and utf-8, even with the wrong xml encoding declaration. instead, ruby 1.9 treats the string as ASCII-8BIT, or BINARY. in other words, it doesn't know the encoding, which is why to_json fails when it tries to convert the string to utf-8.

your easiest way to solve it might be to downgrade to ruby 1.8.

alternatively, your approach of force_encoding('UTF-8') seems to be reasonable.
EDIT END
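
for what it's worth, the failure can be reproduced without libxml. this is only a sketch: `VALUE` is a stand-in for what the reader returns, and `transcode_fails?` and `relabel_as_utf8` are hypothetical helper names, not part of any library. it uses `String#encode` to trigger the conversion, on the assumption that this is the same conversion to_json attempts.

```ruby
require "json"

# Stand-in for reader.node["Full_Name"]: UTF-8 bytes tagged as binary.
VALUE = "Univisi\xC3\xB3n Network - East Feed".b

# Hypothetical helper: true if converting the string to UTF-8 raises
# the same error class as in the question.
def transcode_fails?(str)
  str.encode(Encoding::UTF_8)
  false
rescue Encoding::UndefinedConversionError
  true
end

# Hypothetical helper: return a copy of the same bytes labelled as UTF-8.
# force_encoding changes only the label; it never converts bytes.
def relabel_as_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

puts transcode_fails?(VALUE)                   # true
puts transcode_fails?(relabel_as_utf8(VALUE))  # false
puts JSON.generate({ "Full_Name" => relabel_as_utf8(VALUE) })
```

the binary string cannot be transcoded because ruby has no mapping for bytes above 0x7F in ASCII-8BIT. once the label says UTF-8, there is nothing left to convert.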

you can try passing the proper encoding to the reader:

reader = XML::Reader.new(File.open("somefile.xml", "r"), 
  XML::Encoding::ISO_8859_1)
ax
Wait, I thought I wanted to force it to use UTF-8? Wouldn't I want `reader = XML::Reader.new(File.open("somefile.xml", "r"), :encoding => XML::Encoding::UTF_8)`? I still get the error when pasting your code in blindly, and setting it to UTF_8 doesn't seem to change anything (when I ask for its encoding it still gives ISO_8859_1)
Sean Clark Hess
so what is the real encoding of `somefile.xml` (regardless of the `<?xml version="1.0" encoding="ISO-8859-1"?>`)?
ax
I'm not quite sure how to check. `file --mime somefile.xml` gives `application/xml; charset=us-ascii`
Sean Clark Hess
+1  A: 

So, I'm still completely lost as to why I couldn't figure out the "Right" way to do it, but this thread helped to find the force_encoding method on the String class. Since my code involves copying attributes into a hash anyway, it's not a big deal to call force_encoding when I copy the value.

I doubly made sure I had saved the file as UTF-8, and put the right xml declaration at the top. It still failed.

Anyway, until I can figure out how to fix the actual problem, this code fixed it.

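  object = { type: node.name }
  node.attributes.each do |attribute|
    # Strip underscores from the attribute name, e.g. "Full_Name" becomes "FullName".
    name = attribute.name.gsub(/_/, "")
    # The value arrives as UTF-8 bytes tagged as binary; relabel it without transcoding.
    value = attribute.value.force_encoding('UTF-8')

    object[name] = value
  end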

Note this would not be appropriate if I weren't already needing to copy the node into a hash, since it definitely wouldn't be worth all the trouble. If I then do

object.to_json

It works without a problem. Thanks for all your help ax! Do you have any idea how I can force the encoding on the xml?
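
One caveat with force_encoding is that it trusts the bytes blindly. A slightly more defensive variant, sketched here in plain Ruby, checks that the relabelled bytes really are valid UTF-8 before using them. The names `utf8_or_raise` and `attributes_to_hash` are hypothetical, and the input is assumed to be plain `[name, value]` pairs rather than libxml attribute objects so the snippet runs on its own.

```ruby
# Hypothetical helper: relabel a copy of the string as UTF-8 and fail
# loudly if the bytes are not valid UTF-8.
def utf8_or_raise(value)
  copy = value.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "not valid UTF-8: #{value.inspect}" unless copy.valid_encoding?
  copy
end

# Hypothetical helper: build the hash from [name, value] pairs,
# dropping underscores from the names.
def attributes_to_hash(pairs)
  pairs.each_with_object({}) do |(name, value), object|
    object[name.delete("_")] = utf8_or_raise(value)
  end
end

pairs = [["Full_Name", "Univisi\xC3\xB3n Network - East Feed".b]]
object = attributes_to_hash(pairs)
puts object["FullName"].encoding == Encoding::UTF_8  # true
```

A lone `"\xF3"` byte (what real ISO-8859-1 data would contain for "ó") would raise here instead of slipping through as a broken string.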

Sean Clark Hess
Well, Ruby 1.9's been hard on you here.
Julik
The problem is that the XML document is just plain WRONG. It claims to be ISO-8859-1 when it's actually UTF-8. You are fixing it by using force-encoding. Not being a Ruby developer, I don't know if you can force the XML parser to ignore the declared encoding and use UTF-8 for the entire document.
Jim Garrison
@Julik - I'd say. @Jim - yeah, that's what I thought too, but I changed the declaration at the top AND made sure it was saved as UTF-8 (using TextMate), AND forced the encoding (at least according to the documentation for the libxml library) and it's still being weird. Oh well
Sean Clark Hess