ansaurus

Question

How to split line at non-printing ascii character in Python

Answer 1

+1 A:

You can use re.split.

>>> import re
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']

Adjust the pattern so that it matches only the characters you want to split on.
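
As a rough sketch of such a pattern, the snippet below splits on two specific bytes at once. It is written in Python 3 syntax with bytes literals (an assumption; the thread itself uses Python 2), and the sample line and the choice of \x91 and \x97 are purely illustrative.

```python
import re

# Hypothetical sample line; \x91 and \x97 stand in for the separator bytes.
line = b'first\x97second\x91third'

# The character class matches either separator byte, so the line is cut at both.
parts = re.split(br'[\x91\x97]', line)
print(parts)
```

For a single fixed separator, the plain split method does the same job without a regex.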

See also: stripping-non-printable-characters-from-a-string-in-python

Example (w/ the long minus):

>>> # \xe2\x80\x93 is the UTF-8 encoding of a long dash (or long minus)
>>> s = 'hello – world'
>>> s
'hello \xe2\x80\x93 world'
>>> import re
>>> re.split("\xe2\x80\x93", s)
['hello ', ' world']

Or, the same with unicode:

>>> # \u2013 represents a long dash or long minus, the so-called en dash
>>> s = u'hello – world'
>>> s
u'hello \u2013 world'
>>> import re
>>> re.split(u"\u2013", s)
[u'hello ', u' world']

The MYYN 2010-05-29 18:39:41

How do I specify that I want to split exactly at hex character 97?

Donnied 2010-05-29 18:46:09

See my updated post; hope it helps.

The MYYN 2010-05-29 18:53:36

I think re.split("\x97", s) should do it ..

amir75 2010-05-29 19:25:27

Excellent. Thank you.

Donnied 2010-05-29 21:03:00

-1 (0) The OP has an EM DASH (U+2014, cp1252 0x97), not an EN DASH (U+2013, cp1252 0x96). (1) Your second example is in terms of UTF-8, which obviously (??) the OP is not using. (2) Using re.split instead of str.split is gross overkill.

John Machin 2010-05-30 22:06:02

Answer 2

A:

Just use the string/unicode split method. (It doesn't really care about the string you split upon, other than it is a constant. If you want to use a regex, then use re.split.)

To get the separator string, either escape it as the other people have shown ("\x97")

or

use chr(0x97) for strings (0-255) or unichr(0x97) for unicode.

So an example would be:

'will not be split'.split(chr(0x97))

'will be split here:\x97 and this is the second string'.split(chr(0x97))
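
For reference, here is a sketch of what those two calls return. It assumes Python 3, where the data would be bytes and `bytes([0x97])` plays the role that `chr(0x97)` plays for Python 2 byte strings; the variable names are illustrative only.

```python
# Assumption: Python 3 bytes; bytes([0x97]) is the same value as b'\x97'.
sep = bytes([0x97])

# No separator present: split returns a one-element list with the input unchanged.
no_sep = b'will not be split'.split(sep)

# Separator present once: split returns the text before and after it.
with_sep = b'will be split here:\x97 and this is the second string'.split(sep)

print(no_sep)
print(with_sep)
```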

Terence Honles 2010-05-29 20:55:39

Thanks. I like the chr() use.

Donnied 2010-05-29 21:03:37

(0) You mean str/unicode split method (1) "other than it is a constant": It can be any expression that evaluates to a single string (like, for example, `chr(0x97)`) (2) using `[uni]chr(0x97) instead of [u]"\x97"` is obfuscatory/redundant/wasteful/deprecable (IMHO) -- would you write `float("1.23")` instead of `1.23`?? (3) If operating in unicode, he wouldn't need `unichr(0x97)`, he would need `u"\u2014"`, which is `"\x97".decode("cp1252")`

John Machin 2010-05-30 21:58:09

(0) In my *english* explanation do I really have to specify that it is the *str* method rather than a method that operates on a string... which **is** the str class??? (1) "It is a constant" was referring to the fact that the string can't specify more than one separator (chr(0x97) will always be '\x97'), whereas re.split could handle '\x97|\x91'. **OF COURSE** you could write chr(i) where i is a variable which can change. (2) Yes... of course you wouldn't do a float conversion, but chr may be useful if he needed to convert a number into a string **at runtime**.

Terence Honles 2010-05-30 22:45:37

(3) And no I didn't check what 0x97 was in unicode... why should I? he asked for 0x97... I gave that to him. It's up to him to figure out that character hex values in ASCII are different than in unicode (I was merely showing that there *was* an equivalent that would generate a unicode character string)

Terence Honles 2010-05-30 22:46:53

(0) a string is an instance of the str type OR the unicode type (1) "constant" != "only one string" (3) You shouldn't need to "check what 0x97 was in unicode" ... characters in the range U+0080 to U+009F are C1 control characters, nothing to do with dashes. If you have them in your unicode data, you are either working with some ancient/arcane protocol (prob=0.001) or some wally has decoded using latin1 instead of cp1252 (prob=0.999). The first 128 Unicode characters were deliberately made same as ASCII; "character hex values in ASCII" are NOT "different than in unicode". 0x97 isn't in ASCII.

John Machin 2010-05-31 01:24:56
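
The cp1252-versus-latin1 point can be checked with a short sketch. This assumes Python 3 syntax (bytes literals and `str` results); the variable names are illustrative.

```python
import unicodedata

raw = b'\x97'  # the byte the question is about

# Decoded as cp1252, the byte is the em dash U+2014.
as_cp1252 = raw.decode('cp1252')

# Decoded as latin-1, the same byte becomes U+0097, a C1 control character.
as_latin1 = raw.decode('latin-1')

print(repr(as_cp1252), unicodedata.name(as_cp1252))
print(repr(as_latin1), unicodedata.category(as_latin1))
```

Category `Cc` is the Unicode general category for control characters, which is why U+0097 turning up in decoded text usually points to a wrong choice of codec.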

I still think you are a *little* too picky about the string thing, and I was a *little* wary about putting "constant" when I wrote it (I thought with context it was obvious that it was not an re). Well thank you for your knowledge about unicode... I haven't really used it much. And finally I was afraid you were going to bring that up (I was not 100% sure about the mapping of characters from ASCII to unicode)... and about 0x97, it **is** if you are using extended ASCII (which I was including when I wrote the comment because I had already written over one comment worth)

Terence Honles 2010-05-31 03:19:53

Unicode knowledge: you may like to read http://www.amk.ca/python/howto/unicode and the references (especially the articles by Czyborra, Spolsky and Orendorff). "extended ASCII" is not a very technical description and is meaningless without specifying somehow (e.g. by naming an encoding) what codepoints 0x80 to 0xFF mean. I don't understand your reason for using "ASCII" when you meant "extended ASCII" ("because I had already written over one comment worth").

John Machin 2010-05-31 05:57:44

I apologize if I described the character wrongly. cat -e showed 'M-^W', which gave the octal value of 227, which by googling I found was equivalent to: U+0097, character —, decimal 151, hex 0x97, octal \227, binary 10010111

Donnied 2010-05-31 18:40:10

@Donnied: You didn't describe the character wrongly; you gave enough info (long minus sign). However you are NOW going wrong; U+0097 is (as I wrote above) a control character, not a dash/minus. cat -e? What's that? In the Python 2.x context, `print repr(your_data)` shows unambiguously and portably what you have; try using it when you ask your next question.

John Machin 2010-06-01 01:18:14

"cat -e" is a Linux command that I was using to see what was getting munged. The "U+0097" bit was what I got from googling for the significance of octal \227. u0097 is a control character - why was it listed as an equivalent? Thanks for the heads up.

Donnied 2010-06-01 02:00:08

"""u0097 is a control character - why was it listed as an equivalent?""" Please don't be shocked: Some people who write articles that you can find with google are a few sandwiches short of a picnic :-) See my comments on "extended ASCII" and encoding above. Also read the articles that I recommended to Terence.

John Machin 2010-06-02 04:10:19

Answer 3

+2 A:

_, _, your_result= your_input_string.partition('\x97')

or

your_result= your_input_string.partition('\x97')[2]

If your_input_string does not contain a '\x97', then your_result will be empty. If your_input_string contains multiple '\x97' characters, your_result will contain everything after the first '\x97' character, including other '\x97' characters.
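
Both cases can be seen in a small sketch. It assumes Python 3 bytes, and `after_first` is a hypothetical helper name used only for illustration.

```python
def after_first(data, sep=b'\x97'):
    """Return everything after the first sep, or b'' if sep is absent."""
    # partition returns (head, sep, tail); tail is empty when sep is not found.
    return data.partition(sep)[2]


missing = after_first(b'no separator here')
several = after_first(b'a\x97b\x97c')
print(missing)
print(several)
```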

ΤΖΩΤΖΙΟΥ 2010-05-30 18:27:34
