(Sorry if a newb question...I've done quite a bit of research, honestly...)

I'm writing some Ruby on Rails code to parse RSS/ATOM feeds. My code is throwing up on a pesky '£' symbol.

I've been trying the approach of normalizing the description and title fields of the feeds before doing anything else:

descr = self.description.mb_chars.normalize(:kc)

However, when it hits the string with the '£', I'm guessing that mb_chars hits a problem and returns a regular Ruby String object. I get the error:

undefined method `normalize' for #<String:0x5ef8490>
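A guard for this case might look like the sketch below. It assumes Ruby 2.2 or later (the code above is from the Ruby 1.8 era) and uses the standard library's String#unicode_normalize in place of mb_chars. The helper name safe_normalize is hypothetical.

```ruby
# Hypothetical helper: normalize only when the bytes are valid UTF-8.
# Returns nil for input that is not valid UTF-8, so the caller can
# decide how to re-decode it instead of hitting a NoMethodError.
def safe_normalize(text)
  candidate = text.dup.force_encoding('UTF-8')
  return nil unless candidate.valid_encoding?
  candidate.unicode_normalize(:nfkc)
end

safe_normalize("caf\u00e9")   # valid UTF-8, returns the NFKC form
safe_normalize("\xA3100".b)   # lone \xA3 byte is not valid UTF-8, returns nil
```

A nil result is a hint that the string is in some other encoding and needs converting before any normalization.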

So what is the best process to defensively prep these strings for insertion into the database? (I need to do a bunch of string processing on them as well)

My problem is compounded in that I don't know the format of the feed I'm processing. For instance, I've had some luck with the following line:

descr = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv descr

However, when it encounters the '£' it simply truncates everything after that point.

When I display the '£' symbol with String#inspect, it shows up as '\243'. Failing a method to 'correctly' deal with this symbol, I'd be happy enough to replace it with another value (like 'GBP'). So help with that code would be appreciated as well.
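For reference, octal \243 is byte 0xA3, which is the pound sign in ISO-8859-1 (and Windows-1252) but not a valid byte sequence on its own in UTF-8. A minimal sketch of decoding and substituting, assuming Ruby 1.9 or later and assuming the text really is ISO-8859-1 (the helper name latin1_to_utf8 is hypothetical):

```ruby
# Hypothetical helper: treat the raw bytes as ISO-8859-1 and transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

raw  = "Price: \xA3100".b        # \xA3 is the same byte as octal \243
text = latin1_to_utf8(raw)       # now a valid UTF-8 string containing the pound sign
puts text.gsub("\u00A3", 'GBP')  # prints "Price: GBP100"
```

If the source encoding is guessed wrong, the conversion still succeeds but yields the wrong characters, so this only helps once the real encoding is known.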

The feed in question is http://www.dailymail.co.uk/sport/football/index.rss

A: 

Maybe this helps you? http://railscasts.com/episodes/168-feed-parsing

Lichtamberg
That feedzirra scares me...the dependencies seem scary and I think it's overkill for my needs. I'll look at it if I can get the basics working though. Thanks.
Phil McT
A: 

I was missing something pretty basic - I was guessing at the encoding of the feed that was coming in.

So now I'm looking at (a) the charset in the HTTP response headers, then (b) the encoding in the XML declaration in the feed itself.

Once I have the encoding I use iconv to move it into UTF-8.

So far so good.
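The two-step lookup described above can be sketched roughly as follows. This is not the exact code in use: it assumes Ruby 2.1 or later and uses String#encode rather than iconv, the helper names declared_encoding and feed_to_utf8 are hypothetical, and falling back to UTF-8 when nothing is declared is an assumption.

```ruby
# Hypothetical helper: (a) charset from the Content-Type header,
# then (b) encoding from the XML declaration, else assume UTF-8.
def declared_encoding(content_type, body)
  from_header = content_type.to_s[/charset=["']?([^\s;"']+)/i, 1]
  from_xml    = body.b[/\A\s*<\?xml[^>]*encoding=["']([^"']+)["']/i, 1]
  from_header || from_xml || 'UTF-8'
end

# Hypothetical helper: label the bytes with the declared encoding and
# transcode to UTF-8. Raises ArgumentError if Ruby does not know the name.
def feed_to_utf8(content_type, body)
  source = declared_encoding(content_type, body)
  body.dup.force_encoding(source)
      .encode('UTF-8', invalid: :replace, undef: :replace)
      .scrub # covers the UTF-8 to UTF-8 case, where encode is a no-op
end

body = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>\xA3</t>".b
puts declared_encoding(nil, body)   # prints "ISO-8859-1"
puts feed_to_utf8(nil, body)        # the \xA3 byte comes out as a pound sign
```

The header is checked first here because that is the order given above; feeds whose header and XML declaration disagree would need a decision about which one to trust.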

Phil McT
A: 

I've found one solution:

To fix it, I had to define the $KCODE (encoding) for the document:

require 'rubygems'
require 'active_support/all'

$KCODE = 'UTF8'

str = "test ščž"
puts str.parameterize.inspect
puts str.parameterize.to_s

# => test-scz

Original post: https://rails.lighthouseapp.com/projects/8994/tickets/3504-string-parameterize-normalize-bug

Fernando Kosh