I have a string that is randomly generated:

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"

I'd like to find the longest sequence of "diNCO diol" and the longest of "diNCO diamine". So in the case above the longest "diNCO diol" sequence is 1 and the longest "diNCO diamine" is 3.

How would I go about doing this using Python's re module?

Thanks in advance.

EDIT:
I mean the longest number of repeats of a given string. So the longest string with "diNCO diamine" is 3:
diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine

A: 

One way is to use findall:

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
len(re.findall("diNCO diamine", polymer_str)) # returns 4.
Sinan Taifour
That finds the total number, not the longest sequence
Dan Lorenc
Sorry, I misunderstood the question.
Sinan Taifour
A: 

Using re:

 m = re.search(r"(\bdiNCO diamine\b\s?)+", polymer_str)
 len(m.group(0)) / len("bdiNCO diamine")
This doesn't account for the spaces correctly. You also have an extra "b" in the second line. +1 for being closer than I was!
Sinan Taifour
good show sir! thank you :) adding the extra \b did it :)
Casey
It has been a pleasure :)
This does not work. It finds the first match, not the longest match. polymer_str = "diol diNCO diamine tacos diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine" The correct result is 3; this returns 1.
Glenn Maynard
Actually group(0) gets overwritten every time a new match is found. There is currently no way to get multiple groups from a '+' or '*' in a regex using the 're' module.
tgray
+3  A: 

I think the OP wants the longest contiguous sequence. You can get all contiguous sequences like this (the \s? allows for the space between repeats):

seqs = re.findall(r"(?:diNCO diamine\s?)+", polymer_str)

and then find the longest.
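
The "find the longest" step might look something like the following. This is only a minimal sketch: it assumes repeats are separated by at most one whitespace character, which the `\s?` in the pattern allows for, and the helper name `longest_run` is made up for illustration.

```python
import re

def longest_run(unit, text):
    """Return the number of back-to-back repeats of `unit` in the longest run.

    `longest_run` is a hypothetical helper, not part of the `re` module.
    """
    # Non-capturing group so findall returns whole runs; \s? (an assumption)
    # lets consecutive repeats be separated by one whitespace character.
    pattern = r"(?:%s\s?)+" % re.escape(unit)
    runs = re.findall(pattern, text)
    if not runs:
        return 0
    # The longest run by character count also holds the most repeats.
    return max(runs, key=len).count(unit)

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
print(longest_run("diNCO diamine", polymer_str))  # 3
print(longest_run("diNCO diol", polymer_str))     # 1
```

Escaping the unit with `re.escape` is a precaution in case the search string ever contains regex metacharacters.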

Ealdwulf
do you mind explaining this a little more?
Casey
I've expanded on this in my answer.
tgray
tgray has mostly done it. I would only add that the '?:' rune is necessary as otherwise findall returns each match inside the parentheses as a separate entry in the list, which in this case defeats the object.
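
To make the effect of '?:' concrete, here is a small comparison. It is a sketch on a made-up input string, not the OP's data: with a capturing group, findall hands back only what the group captured (the last repeat of each run), while the non-capturing form hands back each whole run.

```python
import re

# Hypothetical sample text, chosen so the first run has two repeats.
text = "diNCO diamine diNCO diamine diol diNCO diamine"

# Capturing group: findall returns only what the group captured,
# i.e. the last repeat of each run, not the whole run.
captured = re.findall(r"(diNCO diamine\s?)+", text)

# Non-capturing group: findall returns the full text of each run.
whole_runs = re.findall(r"(?:diNCO diamine\s?)+", text)

print(captured)    # ['diNCO diamine ', 'diNCO diamine']
print(whole_runs)  # ['diNCO diamine diNCO diamine ', 'diNCO diamine']
```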
Ealdwulf
+4  A: 

Expanding on Ealdwulf's answer:

Documentation on re.findall can be found here.

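import re

def getLongestSequenceSize(search_str, polymer_str):
    # Each match is one contiguous run of search_str repeats.
    matches = re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str)
    # key=len picks the longest run; plain max() would compare strings alphabetically.
    longest_match = max(matches, key=len) if matches else ''
    return longest_match.count(search_str)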

This could be written as one line, but it becomes less readable in that form.

Alternative:

If polymer_str is huge, it will be more memory efficient to use re.finditer. Here's how you might go about it:

def getLongestSequenceSize(search_str, polymer_str):
    longest_match = ''
    for match in re.finditer(r'(?:\b%s\b\s?)+' % search_str, polymer_str):
        if len(match.group(0)) > len(longest_match):
            longest_match = match.group(0)
    return longest_match.count(search_str)

The main difference between findall and finditer is that findall builds and returns a list of all matched strings at once, while finditer returns an iterator that yields Match objects one at a time. The finditer approach may also be somewhat slower.
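
If only the count is needed, the iterator can be fed straight into max() so that no matched strings are kept around. This is a rough sketch assuming Python 3.4 or later (for the `default=` argument); `longest_repeat_count` is a hypothetical name, not something from the re module.

```python
import re

def longest_repeat_count(search_str, polymer_str):
    """Hypothetical helper: largest number of consecutive repeats of search_str."""
    pattern = re.compile(r"(?:\b%s\b\s?)+" % re.escape(search_str))
    # Generator expression: one count is produced at a time,
    # so no list of matches is ever built.
    counts = (m.group(0).count(search_str) for m in pattern.finditer(polymer_str))
    # default=0 covers the case where search_str never occurs (Python 3.4+).
    return max(counts, default=0)

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
print(longest_repeat_count("diNCO diamine", polymer_str))  # 3
```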

tgray
This actually returns the number of characters in the final match, not the number of matches (as seems to be suggested in the question) and not a string containing the longest match.
Whisty
Good point, I've fixed the code to return the number of matches. If they want the actual string, they just need to return 'longest_match'.
tgray
+2  A: 
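import re
# Collapse each "diNCO diamine" to a single "|" and drop the spaces,
# so a run of n repeats becomes n consecutive pipes.
p = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine".replace("diNCO diamine","|").replace(" ","")
# Splitting on every run of non-pipe characters leaves only the pipe runs.
pat = re.compile("[^|]+")
print(max(map(len, pat.split(p))))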
ghostdog74