I have the following string...

"Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."

I need to turn it into this string...

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.

This is pretty standard HTML encoding, and I can't for the life of me figure out how to convert it in Python.

I found this: GitHub

It's very close to working; however, it does not output an apostrophe but instead some odd Unicode character.

Here is an example of the output from the GitHub script...

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.

+2  A: 

What you're trying to do is called "HTML entity decoding", and it's covered in a number of past Stack Overflow questions.

Here's a code snippet using the Beautiful Soup HTML parsing library to decode your example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup

string = "Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
s = BeautifulSoup(string,convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0]
print s

Here's the output:

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
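
For readers on Python 3, where the old `BeautifulSoup` package is not available, the standard library's `html.unescape` performs the same kind of entity decoding. This is only a minimal sketch: the entity in the original string is not visible in the text above, so the numeric form `&#8217;` (the right single quotation mark, U+2019) is an assumption, and the string is shortened for illustration.

```python
import html

# Assumed input: the apostrophe is encoded as the numeric entity &#8217;.
# The exact entity in the original string is not shown, so this is a guess.
encoded = "he&#8217;s gonna work his way to the bottom of the sordid tale"

# html.unescape converts named and numeric character references to Unicode.
decoded = html.unescape(encoded)
print(decoded)

# The named form of the same character decodes identically.
named = html.unescape("he&rsquo;s")
```

Note that the decoded character is the typographic apostrophe (U+2019), not the ASCII `'`, so code that later writes the result to a file or terminal needs an encoding such as UTF-8 that can represent it.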

las3rjock
I had tried the BeautifulSoup code earlier and was still getting an exception. It turns out the problem was elsewhere in my code, which had trouble handling the decoded Unicode characters.
Lounges