ansaurus

Question

Answer 1

A:

One way is to use findall:

import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine"
len(re.findall("diNCO diamine", polymer_str))  # returns 4

Sinan Taifour 2009-07-20 19:25:40

That finds the total number, not the longest sequence

Dan Lorenc 2009-07-20 19:29:00

Sorry, I misunderstood the question.

Sinan Taifour 2009-07-20 19:37:27

Answer 2

A:

Using re:

 m = re.search(r"(\bdiNCO diamine\b\s?)+", polymer_str)
 len(m.group(0)) / len("bdiNCO diamine")

2009-07-20 19:29:42

This doesn't account for the spaces correctly. You also have an extra "b" in the second line. +1 for being closer than I was!

Sinan Taifour 2009-07-20 19:39:04

2009-07-20 19:40:34

good show sir! thank you :) adding the extra \b did it :)

Casey 2009-07-20 19:48:46

It has been a pleasure :)

2009-07-20 19:53:32

This does not work. It finds the first match, not the longest match. polymer_str = "diol diNCO diamine tacos diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine" The correct result is 3; this returns 1.

Glenn Maynard 2009-07-20 20:12:54

Actually group(0) gets overwritten every time a new match is found. There is currently no way to get multiple groups from a '+' or '*' in a regex using the 're' module.

tgray 2009-07-20 20:24:12
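The two objections above can be checked directly. The following is only a minimal sketch using the pattern from this answer and the counterexample string from the comments: re.search stops at the leftmost run, group(0) holds the whole text of that run, and a repeated capturing group such as group(1) keeps only its last repetition.

```python
import re

pattern = r"(\bdiNCO diamine\b\s?)+"
polymer_str = ("diol diNCO diamine tacos diNCO diamine diNCO diamine "
               "diNCO diamine diNCO diol diNCO diamine")

# re.search returns the leftmost run, even though a longer run follows it.
m = re.search(pattern, polymer_str)
first_run_count = m.group(0).count("diNCO diamine")
print(repr(m.group(0)), first_run_count)

# On a run of two units, group(0) is the whole run while group(1)
# holds only the last repetition of the capturing group.
m2 = re.search(pattern, "diNCO diamine diNCO diamine")
print(repr(m2.group(0)))
print(repr(m2.group(1)))
```

Counting the units with str.count, as done here, also sidesteps the length division, which is thrown off by the separating spaces.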

Answer 3

+3 A:

I think the OP wants the longest contiguous sequence. You can get all contiguous sequences like:

seqs = re.findall(r"(?:diNCO diamine\s?)+", polymer_str)

and then find the longest.

Ealdwulf 2009-07-20 19:37:33

do you mind explaining this a little more?

Casey 2009-07-20 20:26:34

I've expanded on this in my answer.

tgray 2009-07-20 20:33:58

tgray has mostly done it. I would only add that the '?:' rune is necessary as otherwise findall returns each match inside the parentheses as a separate entry in the list, which in this case defeats the object.

Ealdwulf 2009-07-22 19:58:34
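To illustrate the point about '?:', here is a small sketch comparing the two forms on a shortened example string (the string and variable names are made up for illustration). With a non-capturing group, findall returns each whole run; with a capturing group, it returns only the text of the group, which for a repeated group is the last repetition in each run.

```python
import re

polymer_str = "diol diNCO diamine diNCO diamine diNCO diol diNCO diamine"

# Non-capturing group: each list entry is one complete contiguous run.
runs = re.findall(r"(?:diNCO diamine\s?)+", polymer_str)

# Capturing group: each list entry is only the last repetition of a run.
last_reps = re.findall(r"(diNCO diamine\s?)+", polymer_str)

print(runs)
print(last_reps)

# The longest run can then be picked by length and its units counted.
longest_count = max(runs, key=len).count("diNCO diamine")
print(longest_count)
```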

Answer 4

+4 A:

Expanding on Ealdwulf's answer:

See the Python documentation for re.findall.

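import re

def getLongestSequenceSize(search_str, polymer_str):
    # Each match is one contiguous run of search_str repetitions.
    matches = re.findall(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str)
    # Select by length explicitly; without key=len, max() compares the strings lexicographically.
    longest_match = max(matches, key=len) if matches else ''
    return longest_match.count(search_str)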

This could be written as one line, but it becomes less readable in that form.

Alternative:

If polymer_str is huge, it will be more memory efficient to use re.finditer. Here's how you might go about it:

def getLongestSequenceSize(search_str, polymer_str):
    longest_match = ''
    for match in re.finditer(r'(?:\b%s\b\s?)+' % re.escape(search_str), polymer_str):
        if len(match.group(0)) > len(longest_match):
            longest_match = match.group(0)
    return longest_match.count(search_str)

The biggest difference between findall and finditer is that findall builds and returns a list (here, a list of the matched strings), while finditer yields Match objects one at a time. The finditer approach may also be somewhat slower.

tgray 2009-07-20 20:31:51
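As a quick check, the counterexample string raised in the comments on Answer 2 can be run through the same pattern. This is only a sketch: longest_run is a hypothetical helper name, and it counts units per run instead of keeping the longest string, which is a small variation on the finditer version above.

```python
import re

def longest_run(unit, text):
    """Return how many times `unit` repeats in its longest contiguous run."""
    # re.escape guards against regex metacharacters in `unit`.
    pattern = r'(?:\b%s\b\s?)+' % re.escape(unit)
    best = 0
    for match in re.finditer(pattern, text):
        # Count the units in this run and keep the largest count seen so far.
        best = max(best, match.group(0).count(unit))
    return best

polymer_str = ("diol diNCO diamine tacos diNCO diamine diNCO diamine "
               "diNCO diamine diNCO diol diNCO diamine")
print(longest_run("diNCO diamine", polymer_str))  # prints 3
```

When the unit does not occur at all, the loop body never runs and the function returns 0.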

This actually returns the number of characters in the final match, not the number of matches (as seems to be suggested in the question) and not a string containing the longest match.

Whisty 2009-07-21 00:58:14

Good point, I've fixed the code to return the number of matches. If they want the actual string, they just need to return 'longest_match'.

tgray 2009-07-21 15:25:27

Answer 5

+2 A:

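import re

# Collapse each "diNCO diamine" unit to "|" and drop spaces, so a run of units becomes a run of "|".
p = "diol diNCO diamine diNCO diamine diNCO diamine diNCO diol diNCO diamine".replace("diNCO diamine", "|").replace(" ", "")
# Splitting on everything that is not "|" leaves only the runs of "|"; the longest run is the answer.
pat = re.compile("[^|]+")
print(max(map(len, pat.split(p))))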

ghostdog74 2009-07-21 00:25:54


Python: re..find longest sequence
