Hi all-

I am currently scraping an RSS feed from last.fm, and the title attribute appears to contain a Unicode dash that shows up as \u2013 in Firebug. Here is the feed for those who are curious:

http://ws.audioscrobbler.com/2.0/user/rj/recenttracks.rss

When I write something like this

feedentry.title.split('-')

it doesn't find the Unicode dash. I have also tried this:

@feedsplit = feedentry.title.gsub(/\u2013/,'-').split("-") 

and some variations like using [] ranges. No luck. I took a look at the other answers floating around, and none of them seem to work for me, so this is my last hope.

Thanks for your time!

A: 

The \u2013 escape only works in Ruby 1.9 and later, which is fully Unicode aware. I'm guessing that you are running Ruby 1.8.

In Ruby 1.8, you can still pass the Unicode dash as the argument to split. These both work:

feedentry.title.split("–")             # The actual UTF-8 char
feedentry.title.split("\342\200\223")  # The sequence of bytes
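To see where the byte sequence comes from, here is a minimal sketch that prints the UTF-8 bytes of the en dash (U+2013) in octal and then splits on them. The sample title is made up for illustration; the real feed titles may be shaped differently.

```ruby
# The en dash U+2013 written as its three UTF-8 bytes (octal escapes),
# so the result does not depend on how the editor saves this file.
EN_DASH_BYTES = "\342\200\223"

# Hypothetical sample title, not taken from the actual feed.
title = "Artist \342\200\223 Track"

# Each byte rendered in octal; this is where \342\200\223 comes from.
octal_bytes = EN_DASH_BYTES.unpack("C*").map { |b| b.to_s(8) }
# => ["342", "200", "223"]

# Split on the byte sequence and trim the surrounding spaces.
parts = title.split(EN_DASH_BYTES).map { |s| s.strip }
# => ["Artist", "Track"]
```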

In regular expressions, remember to set the u modifier for Unicode compatibility (outside of Rails):

@feedsplit = feedentry.title.gsub(/–/u,'-').split("-") 

Alternatively, set $KCODE = "U", which implies the u modifier for all regular expressions. Rails does this for you already.
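If you would rather not depend on regex flags or $KCODE at all, one option is to pass a plain String to gsub, since a String pattern is matched literally. The sketch below assumes the titles use either a plain hyphen or an en dash as the separator; split_title is a hypothetical helper name, not something Feedzirra provides.

```ruby
# UTF-8 bytes of the en dash (U+2013), written as octal escapes.
EN_DASH = "\342\200\223"

# Hypothetical helper: normalise the en dash to a plain hyphen,
# then split on the hyphen and trim each piece.
def split_title(title)
  # A String pattern is matched literally, so no /u modifier is involved.
  normalized = title.gsub(EN_DASH, "-")
  normalized.split("-").map { |part| part.strip }
end
```

For example, split_title("Artist \342\200\223 Track") and split_title("Artist - Track") should both give the same two pieces. Note that a title containing extra hyphens would be split into more than two parts.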

molf
Thanks for the quick response. I tried this, but had no luck. I am using Ruby 1.8.6. I am using Feedzirra to fetch and parse the feeds, and it works fine with most other ones. Last.fm seems to be causing all kinds of problems though.
Using the actual byte sequence did work, however: @feedsplit = feedentry.title.gsub(/\342\200\223/u,"-").split("-") Thanks for the help!
If the literal char does not work, your editor may be saving the source code as something other than UTF-8.
molf
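A quick way to check the editor-encoding theory is to compare the bytes of the literal character, as saved in the source file, with the expected UTF-8 bytes. This is only a sketch; the variable names are made up for illustration.

```ruby
# The literal en dash exactly as the editor saved it in this file.
literal = "–"

# The bytes U+2013 should have when the file is saved as UTF-8.
expected = [0xE2, 0x80, 0x93]

# True only if the source file really stores the dash as UTF-8.
saved_as_utf8 = (literal.unpack("C*") == expected)

puts "literal bytes: #{literal.unpack('C*').inspect}"
puts "saved as UTF-8? #{saved_as_utf8}"
```

If this reports false, the literal character in the source is not the byte sequence the feed contains, which would explain why only the escaped bytes matched.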