views:

119

answers:

5

Hi,

I am trying to tokenize a string using the pattern below.

>>> splitter = re.compile(r'((\w*)(\d*)\-\s?(\w*)(\d*)|(?x)\$?\d+(\.\d+)?(\,\d+)?|([A-Z]\.)+|(Mr)\.|(Sen)\.|(Miss)\.|.$|\w+|[^\w\s])')
>>> splitter.split("Hello! Hi, I am debating this predicament called life. Can you help me?")

I get the following output. Could someone point out what I need to correct, please? I'm confused by the whole bunch of Nones. Also, if there is a better way to tokenize a string, I'd really appreciate the additional help.

['', 'Hello', None, None, None, None, None, None, None, None, None, None, '', '!', None, None, None, None, None, None, None, None, None, None, ' ', 'Hi', None, None, None, None, None, None, None, None, None, None, '', ',', None, None, None, None, None, None, None, None, None, None, ' ', 'I', None, None, None, None, None, None, None, None, None, None, ' ', 'am', None, None, None, None, None, None, None, None, None, None, ' ', 'debating', None, None, None, None, None, None, None, None, None, None, ' ', 'this', None, None, None, None, None, None, None, None, None, None, ' ', 'predicament', None, None, None, None, None, None, None, None, None, None, ' ', 'called', None, None, None, None, None, None, None, None, None, None, ' ', 'life', None, None, None, None, None, None, None, None, None, None, '', '.', None, None, None, None, None, None, None, None, None, None, ' ', 'Can', None, None, None, None, None, None, None, None, None, None, ' ', 'you', None, None, None, None, None, None, None, None, None, None, ' ', 'help', None, None, None, None, None, None, None, None, None, None, ' ', 'me', None, None, None, None, None, None, None, None, None, None, '', '?', None, None, None, None, None, None, None, None, None, None, '']

The output that I'd like is:

['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']

Thank you.

+2  A: 

I could be missing something, but I believe something like the following would work:

s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
s.split(" ")

This assumes you want to split on spaces. You should get something along the lines of:

['Hello!', 'Hi,', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me?']

With this, if you needed a specific piece, you could loop through it to get what you need.
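
For example (just a quick illustration of my own), you could pull out a particular piece like this:

s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
# Purely illustrative: grab whichever piece mentions "life"
wanted = [piece for piece in s.split(" ") if "life" in piece]
# wanted == ['life.']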

Hopefully this helps....

Frank V
I'm sorry, but I did not originally specify the output I was aiming for. I've edited my question above. Sorry for any inconvenience.
rookie
I didn't nail what you needed exactly, but I should have given you enough to move forward. :-) Cheers!
Frank V
A space is the default delimiter, so you could just call s.split().
GreenMatt
@GreenMatt: """A space is the default delimiter"""??? Not so, the default action is radically different: delimiter is any run of whitespace; leading and trailing whitespace is ignored. s.split() would be BETTER than s.split(' ') but still not what the OP wants.
John Machin
split won't work anyway; the OP wants punctuation separated out too...
Tim McNamara
@John Machin (and @Tim McNamara): Granted, to be specific, a space isn't the default delimiter. I probably should have known better than to say that here. However, as John pointed out, s.split() will split the string the OP presented, except for the punctuation. As for that punctuation, I was following the spirit of the answerer's comment that he had presented a place for the OP to get started. If I'd wanted to be more detailed, I'd have posted an answer myself!
GreenMatt
@GreenMatt, @Frank V: str.split is NOT a place to get started. Even using regexes is iffy for anything moderately complicated.
John Machin
Everyone, please note that when I wrote this, the details of exactly what the OP wanted weren't posted. The question has evolved with additional detail since then... I openly admit that better answers have since been posted, but this answer may prove relevant to someone down the road who'd like a simple solution.
Frank V
+4  A: 

re.split rapidly runs out of puff when used as a tokeniser. Preferable is findall (or match in a loop) with a pattern of alternatives like this|that|another|more.

>>> s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
>>> import re
>>> re.findall(r"\w+|\S", s)
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']
>>>

This defines tokens as either one or more "word" characters, or a single character that's not whitespace. You may prefer [A-Za-z] or [A-Za-z0-9] or something else instead of \w (which allows underscores). You may even want something like r"[A-Za-z]+|[0-9]+|\S".
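
For instance (my own example sentence), that last pattern separates digits from letters and splits on underscores, where \w+ would not:

>>> re.findall(r"\w+|\S", "room_101 costs $45!")
['room_101', 'costs', '$', '45', '!']
>>> re.findall(r"[A-Za-z]+|[0-9]+|\S", "room_101 costs $45!")
['room', '_', '101', 'costs', '$', '45', '!']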

If things like Sen., Mr. and Miss (what happened to Mrs and Ms?) are significant to you, your regex should not list them out; it should just define a token that ends in ., and you should have a dictionary or set of probable abbreviations.
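
A rough sketch of that idea (the pattern and the abbreviation set here are just placeholders, not a complete list):

>>> abbrevs = set(["mr", "mrs", "ms", "miss", "sen", "dr"])  # probable abbreviations
>>> toks = re.findall(r"[A-Za-z]+\.?|[0-9]+|\S", "Mr. Smith met Sen. Jones.")
>>> toks
['Mr.', 'Smith', 'met', 'Sen.', 'Jones.']
>>> [t for t in toks if t.endswith(".") and t[:-1].lower() in abbrevs]
['Mr.', 'Sen.']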

Splitting text into sentences is complicated. You may like to look at the nltk package instead of trying to reinvent the wheel.

Update: if you need/want to distinguish between the types of tokens, you can get an index or a name like this without a (possibly long) chain of if/elif/elif/.../else:

>>> s = "Hello! Hi, I we 0 1 987?"

>>> pattern = r"([A-Za-z]+)|([0-9]+)|(\S)"
>>> list((m.lastindex, m.group()) for m in re.finditer(pattern, s))
[(1, 'Hello'), (3, '!'), (1, 'Hi'), (3, ','), (1, 'I'), (1, 'we'), (2, '0'), (2, '1'), (2, '987'), (3, '?')]

>>> pattern = r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<other>\S)"
>>> list((m.lastgroup, m.group()) for m in re.finditer(pattern, s))
[('word', 'Hello'), ('other', '!'), ('word', 'Hi'), ('other', ','), ('word', 'I'), ('word', 'we'), ('number', '0'), ('number', '1'), ('number', '987'), ('other', '?')]
>>>
John Machin
It seems a bit ironic to denigrate regexes in a comment to another answer but then use them here.
GreenMatt
+1  A: 

The reason you're getting all of those None's is because you have lots of parenthesized groups in your regular expression separated by |'s. Every time your regular expression finds a match, it's only matching one of the alternatives given by the |'s. The parenthesized groups in the other, unused alternatives get set to None. And re.split by definition reports the values of all parenthesized groups every time it gets a match, hence lots of None's in your result.
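
You can see the same effect with a much smaller, contrived pattern:

>>> import re
>>> re.split(r"(,)|(;)", "a,b;c")
['a', ',', None, 'b', None, ';', 'c']

The group for the unused alternative is reported as None at every split point; with the many groups in your pattern, each match contributes a long run of None's.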

You could filter those out pretty easily (e.g. tokens = [t for t in tokens if t] or something similar), but I think split isn't really the tool you want for tokenizing. split is meant for just throwing away whitespace. If you really want to use regular expressions to tokenize something, here's a toy example of another method (I'm not going to even try to unpack that monster r.e. you're using... use the re.VERBOSE option for the love of Ned... but hopefully this toy example will give you the idea):

import re

tokenpattern = re.compile(r"""
(?P<words>\w+) # Things with just letters and underscores
|(?P<numbers>\d+) # Things with just digits
|(?P<other>.+?) # Anything else
""", re.VERBOSE)

The (?P<something>...) business lets you identify the type of token you're looking for by name in the code below:

for match in tokenpattern.finditer("99 bottles of beer"):
  if match.group('words'):
    # This token is a word
    word = match.group('words')
    #...
  elif match.group('numbers'):
    # This token is a number
    number = int(match.group('numbers'))
    #...
  else:
    # Anything else
    other = match.group('other')

Note that this is still a r.e. using a bunch of parenthesized groups separated by |'s, so the same thing is going to happen as in your code: for each match, one group will be defined and the others will be set to None. This method checks for that explicitly.

Peter Milley
+4  A: 

I recommend NLTK's tokenizers. Then you don't need to worry about tedious regular expressions yourself:

>>> import nltk
>>> nltk.word_tokenize("Hello! Hi, I am debating this predicament called life. Can you help me?")
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me', '?']
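
If you also want that trailing 'life.' split into 'life' and '.', one option (a sketch I haven't benchmarked) is to split into sentences first, so the word tokenizer sees the full stop at the end of a sentence:

>>> s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
>>> [tok for sent in nltk.sent_tokenize(s) for tok in nltk.word_tokenize(sent)]  # 'life.' should now come out as 'life', '.'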
Tim McNamara
A: 

Perhaps he didn't mean it as such, but John Machin's comment "str.split is NOT a place to get started" (as part of the exchange after Frank V's answer) came as a bit of a challenge. So ...

the_string = "Hello! Hi, I am debating this predicament called life. Can you help me?"
tokens = the_string.split()
punctuation = ['!', ',', '.', '?']
output_list = []
for token in tokens:
    if token[-1] in punctuation:
        output_list.append(token[:-1])
        output_list.append(token[-1])
    else:
        output_list.append(token)
print output_list

This seems to provide the requested output.

Granted, John's answer is simpler in terms of number of lines of code. However, I have a couple of points to make supporting this sort of solution.

I don't completely agree with Jamie Zawinski's 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' (Neither did he from what I've read.) My point in quoting this is that regular expressions can be a pain to get working if you're not accustomed to them.

Also, while it won't normally be an issue, the performance of the above solution was consistently better than the regex solution when measured with timeit. The above solution (with the print statement removed) came in at about 8.9 seconds; John's regular expression solution came in at about 11.8 seconds. This involved 10 tries, each of 1 million iterations, on a quad-core, dual-processor system running at 2.4 GHz.
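
For reference, here is roughly how such a comparison can be set up with timeit (my own reconstruction, not necessarily the exact harness used for the numbers above):

import timeit

setup = 'import re; the_string = "Hello! Hi, I am debating this predicament called life. Can you help me?"'

split_version = """
tokens = the_string.split()
punctuation = ['!', ',', '.', '?']
output_list = []
for token in tokens:
    if token[-1] in punctuation:
        output_list.append(token[:-1])
        output_list.append(token[-1])
    else:
        output_list.append(token)
"""

regex_version = r'tokens = re.findall(r"\w+|\S", the_string)'

# 10 repeats of 1,000,000 iterations each, as described above
print(timeit.repeat(split_version, setup, repeat=10, number=1000000))
print(timeit.repeat(regex_version, setup, repeat=10, number=1000000))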

GreenMatt