Here is what I have so far

import re

text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}\s{1,}\w{3,}', text)
# ['Hello world', 'nice day', 'you think']

The desired output would be ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']

Can this be done with a simple regex pattern?

A: 

This is an excellent example of when not to use regular expressions for parsing.

anthony
This is an excellent example of when not to post an answer.
SilentGhost
Well, is there a simple alternative?
tomfmason
+1  A: 
list(map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)))

Maybe you can shorten the lambda (to something like just '+'). And BTW, ' is not part of \w or \s.

Lucho
Ok, the uber-neater way: map("".join, re.findall('(\w{3,}(?=(\s{1,}\w{3,})))',text))
Lucho
@Lucho: Nice, but your example confirms to me that regex will make your Python look like Perl.
pyfunc
Yes, everything that uses regexps looks "very much like" Perl, because Perl is the foundation of today's regexps: PCRE (Perl Compatible Regular Expressions) - http://en.wikipedia.org/wiki/Regular_expression
Lucho
+1  A: 

Something like this with additional checks for list boundaries should do:

>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']

>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']
>>> 
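
The step of 2 in the last expression yields non-overlapping pairs, and filtering before pairing can join words that were not neighbours in the original text. A minimal sketch of one way to address both, assuming a pair should be kept only when both neighbouring tokens have at least three characters (the names `tokens` and `pairs` are illustrative and not part of the answer above):

```python
text = "Hello world. It is a nice day today. Don't you think so?"

tokens = text.split()

# Pair each token with its immediate successor, then keep the pair only
# if both tokens are at least three characters long.
pairs = [
    first + " " + second
    for first, second in zip(tokens, tokens[1:])
    if len(first) > 2 and len(second) > 2
]
print(pairs)
```

Punctuation still counts towards the length here, so 'so?' passes the check and 'think so?' is included. Punctuation would have to be stripped first to match the desired output exactly.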
pyfunc
Sometimes, regexes are far more trouble than they are worth. +1
jkerian
+1  A: 

There are two problems with your approach:

  1. Neither \w nor \s matches punctuation.
  2. When you match a string with a regular expression using findall, that part of the string is consumed. Searching for the next match commences immediately after the end of the previous match. Because of this a word can't be included in two separate matches.

To solve the first issue you need to decide what you mean by a word. Regular expressions aren't good for this sort of parsing. You might want to look at a natural language parsing library instead.

But assuming that you can come up with a regular expression that works for your needs, to fix the second problem you can use a lookahead assertion to check the second word. This won't return the entire match as you want but you can at least find the first word in each word pair using this method.

 re.findall('\w{3,}(?=\s{1,}\w{3,})',text)
                   ^^^            ^
                  lookahead assertion
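
As a hedged sketch of the lookahead idea: the pattern above returns only the first word of each pair, but wrapping the whole pair in a capturing group inside the lookahead lets findall return both words while still consuming nothing, so matches can overlap. The `\b` anchor is an addition here (not in the pattern above) to stop matches from starting in the middle of a word. Words containing punctuation such as "Don't" are still not handled.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# Lookahead only: returns the first word of each qualifying pair.
first_words = re.findall(r"\w{3,}(?=\s+\w{3,})", text)

# Capturing group inside a zero-width lookahead: the pair is returned,
# but nothing is consumed, so the second word can start the next match.
pairs = re.findall(r"\b(?=(\w{3,}\s+\w{3,}))", text)

print(first_words)  # ['Hello', 'nice', 'day', 'you']
print(pairs)        # ['Hello world', 'nice day', 'day today', 'you think']
```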
Mark Byers
+1  A: 
import itertools as it
import re 

three_pat=re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"
for key,group in it.groupby(text.split(),lambda x: bool(three_pat.match(x))):
    if key:
        group=list(group)       
        for i in range(0,len(group)-1):
            print(' '.join(group[i:i+2]))

# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think

It is not clear to me what you want done with punctuation. On the one hand, it looks like you want periods to be removed, but single quotation marks to be kept. It would be easy to implement the removal of periods, but before I do, would you clarify what you want to happen to all punctuation?
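
One possible reading of the desired output is that punctuation at the ends of a word is dropped while apostrophes inside a word are kept. A sketch under that assumption, reusing the groupby approach; note that the length test here is `len(w) >= 3` on the stripped word rather than the `\w{3}` regex above, which is a substitution made for this illustration:

```python
import itertools as it
import string

text = "Hello world. It is a nice day today. Don't you think so?"

# Assumption: strip punctuation from both ends of each token only,
# so the apostrophe inside "Don't" survives.
words = [w.strip(string.punctuation) for w in text.split()]

pairs = []
# Group consecutive words by whether they have at least three characters.
for is_long, group in it.groupby(words, key=lambda w: len(w) >= 3):
    if is_long:
        group = list(group)
        # Join each word in the run with its immediate successor.
        pairs.extend(" ".join(p) for p in zip(group, group[1:]))

print(pairs)
# ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']
```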

unutbu