tags:

views:

1094

answers:

8

I am learning Python and trying to figure out an efficient way to tokenize a string of numbers separated by commas into a list. Well formed cases work as I expect, but less well formed cases not so much.

If I have this:

A = '1,2,3,4'
B = [int(x) for x in A.split(',')]

B results in [1, 2, 3, 4]

which is what I expect, but if the string is something more like

A = '1,,2,3,4,'

if I'm using the same list comprehension expression for B as above, I get an exception. I think I understand why (because some of the "x" string values are not integers), but I'm thinking that there would be a way to parse this still quite elegantly such that tokenization of the string a works a bit more directly like strtok(A,",\n\t") would have done when called iteratively in C.

To be clear what I am asking; I am looking for an elegant/efficient/typical way in Python to have all of the following example cases of strings:

A='1,,2,3,\n,4,\n'
A='1,2,3,4'
A=',1,2,3,4,\t\n'
A='\n\t,1,2,3,,4\n'

return with the same list of:

B=[1,2,3,4]

via some sort of compact expression.

+22  A: 

How about this:

A = '1, 2,,3,4  '
B = [int(x) for x in A.split(',') if x.strip()]

x.strip() trims whitespace from the string, which will make it empty if the string is all whitespace. An empty string is "false" in a boolean context, so it's filtered by the if part of the list comprehension.

Dave Ray
-1? What? How does this not do exactly what he asked for?
Dave Ray
strip() is overkill here, converting to int already takes care of the whitespace... better filter out the invalid substrings.
Algorias
+1 Without the test, it'll fail for e.g. a = "1, 2, , 3, 4"
Ryan Ginstrom
A: 

I'd guess regular expressions are the way to go: http://docs.python.org/library/re.html

Simon Groenewolt
Seems overkill in this case...
Algorias
A: 

This will work, and never raise an exception, if all the numbers are ints. The isdigit() call is false if there's a decimal point in the string.

>>> nums = ['1,,2,3,\n,4\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
>>> for n in nums:
...     [ int(i.strip()) for i in n if i.strip() and i.strip().isdigit() ]
... 
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
runeh
The isdigit check is not necessary for the test cases provided, but it does add extra robustness.
Carl Meyer
what if he considers '1,2,a' to be an error?
hasen j
The first i.strip() is redundant.
Ryan Ginstrom
You're performing i.strip() three times per element. Yikes.
Triptych
+1  A: 

How about this?

>>> a = "1,2,,3,4,"
>>> map(int,filter(None,a.split(",")))
[1, 2, 3, 4]

filter will remove all false values (i.e. empty strings), which are then mapped to int.

EDIT: Just tested this against the above posted versions, and it seems to be significantly faster, 15% or so compared to the strip() one and more than twice as fast as the isdigit() one

Algorias
he needs it to filter whitespace
hasen j
Yes, this will filter out whitespace: >>> int(" 1 ") ==> 1
Algorias
This fails if any of the empty entries has whitespace, e.g. '\n\t,1,2,3,,4\n'
Dave Ray
Doesn't work when list has elements containing whitespace.
Triptych
+2  A: 

Generally, I try to avoid regular expressions, but if you want to split on a bunch of different things, they work. Try this:

import re
result = [int(x) for x in filter(None, re.split('[,\n,\t]', A))]
Nick
A: 

Why not just wrap in a try except block which catches anything not an integer?

Josh Smeaton
+3  A: 

Mmm, functional goodness (with a bit of generator expression thrown in):

a = "1,2,,3,4,"
print map(int, filter(None, (i.strip() for i in a.split(','))))

For full functional joy:

import string
a = "1,2,,3,4,"
print map(int, filter(None, map(string.strip, a.split(','))))
Alec Thomas
`print map(int, filter(len, map(str.strip, a.split(','))))` Note: str.strip(i) and i.strip() are the same (no need in `string` module). `len` is used for readability.
J.F. Sebastian
I prefer `string.strip()` as it safely deals with Unicode strings.
Alec Thomas
@Alec: `map(type(a).strip, a.split(','))` will work both for Unicode and encoded strings.
J.F. Sebastian
A: 

Why accept inferior substitutes that cannot segfault your interpreter? With ctypes you can just call the real thing! :-)

# strtok in Python
from ctypes import c_char_p, cdll

try: libc = cdll.LoadLibrary('libc.so.6')
except WindowsError:
     libc = cdll.LoadLibrary('msvcrt.dll')

libc.strtok.restype = c_char_p
dat = c_char_p("1,,2,3,4")
sep = c_char_p(",\n\t")
result = [libc.strtok(dat, sep)] + list(iter(lambda: libc.strtok(None, sep), None))
print(result)
joeforker
added `cdll.LoadLibrary('msvcrt.dll')`
J.F. Sebastian
When I saw the `while` loop I'd thought of `lambda tokenize dat, sep: itertools.chain((strtok(dat, sep),), iter(lambda: strtok(None, sep), None))`. It is funny how similar programmers' minds work. Add the smile emoticon back otherwise somebody can think that the code is not a joke.
J.F. Sebastian