Hi there,

This will surely be an easy one but it is really bugging me.

I have a script that reads in a webpage and uses BeautifulSoup to parse it. From the soup I extract all the links, as my final goal is to print out the link.contents.

All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.

Every time I go to print out a variable that holds 'String' I get u['String'] printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?

Thanks, John.

A: 

Do you really mean u'String'?

In any event, can't you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)

Andrew Jaffe
I should have been clearer. I am using str() but still getting output like below when I print. [u'ABC'] [u'DEF'] [u'GHI'] [u'JKL'] The data is stripped as text from a webpage, then inserted into a database (Google Appstore), then retrieved and printed.
gnuchu
+1  A: 

Use dir or type on the 'string' to find out what it is. I suspect that it's one of BeautifulSoup's tag objects, which prints like a string but really isn't one. Otherwise, it's inside a list and you need to convert each string separately.
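
A minimal sketch of that inspection, assuming the retrieved value looks like the [u'ABC'] from the comment above. The name value is hypothetical, and the single-argument print calls are written so the snippet runs on Python 2 or 3:

```python
# Hypothetical stand-in for whatever the script actually retrieved.
value = [u'ABC']

# type() tells you whether you hold a list, a string, or something else.
print(type(value))       # the container
print(type(value[0]))    # the element inside it

# dir() lists the attributes; an 'encode' method suggests a string-like object.
print('encode' in dir(value[0]))
```

If type() reports a list, the brackets and quotes in the output come from the list, not from the string itself.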

In any case, why are you objecting to using Unicode? Any specific reason?

sykora
I've been looking at BeautifulSoup for the last few days. I couldn't figure out how gnuchu would get u['string'] rather than [u'String']. His comment to Andrew Jaffe seems to prove it is a list.
batbrat
+1 on teaching him to fish instead of catching a fish and giving it to him.
batbrat
+6  A: 

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don't know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:

 soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it's latin-1 or utf-8.

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

Or you can ask Beautiful Soup what the original encoding was and get the text back in that encoding:

 soup[0].encode(soup.originalEncoding)
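
To see why the choice of encoding matters, here is a small sketch that does not need Beautiful Soup at all. The sample string text is a made-up example with one non-ASCII character; it is not taken from the original page:

```python
# Hypothetical sample: "cafe" with an accented e, written as an escape.
text = u'caf\xe9'

utf8_bytes = text.encode('utf-8')       # two bytes for the accented character
latin1_bytes = text.encode('latin-1')   # one byte for the accented character

# Strict ASCII encoding fails as soon as a non-ASCII character appears.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# 'replace' substitutes a question mark instead of raising.
ascii_safe = text.encode('ascii', 'replace')

print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(ascii_failed)
print(repr(ascii_safe))
```

If the data really is pure ASCII, all three encodings produce identical bytes, so picking utf-8 costs nothing.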
oefe
Brilliant. Thanks. Apologies for the typo.
gnuchu
You actually don't have to do the encoding. The OP is only seeing the string repr because that's how elements are shown when you print a list. Printing soup[0] is enough to show the str instead of the repr: the contents of the string without the quotes and the unicode prefix.
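
A rough illustration of that difference, using a hypothetical list named items and single-argument print calls so it runs on Python 2 or 3:

```python
# Hypothetical data shaped like the OP's output.
items = [u'ABC', u'DEF']

# Converting or printing the list uses repr() on each element,
# which is where the quotes (and the u prefix on Python 2) come from.
list_text = str(items)
print(list_text)

# Printing each element on its own shows only the text.
for item in items:
    print(item)

# Joining the elements gives one plain string with no brackets or quotes.
joined = u', '.join(items)
print(joined)
```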
ironfroggy
+2  A: 

You probably have a list containing one unicode string. The repr of this is [u'String'].

You can convert this to a list of byte strings using any variation of the following:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple (this does not work for a plain string).
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)
ddaa