ansaurus

Question

Python string pattern recognition/compression

Answer 1

A:

How about subbing out the known text, and then splitting?

import re
[re.sub('(sometxt|moretxt)', ',', x).split(',') for x in lst]
# results in
[['a', '0', ''], ['b', '1', ''], ['aa', '10', ''], ['zz', '999', '']]

EMiller 2009-12-16 12:14:44

OP doesn't know what `sometxt` and `moretxt` are

SilentGhost 2009-12-16 12:16:19

@SilentGhost I'm not sure if he does or not, to be honest. The question leaves it pretty unclear.

mavnn 2009-12-16 12:19:14

@mavnn: Let's read it again: *namely I don't know what the pattern is going to be.*

SilentGhost 2009-12-16 12:21:52

@SilentGhost: Given it follows _I can do basic regex..._ it could be read as _I don't know how to construct the regex pattern needed_ rather than _I don't know the pattern of sometxt and moretxt_. It's a much more interesting question in the latter case, though.

mavnn 2009-12-16 12:26:22

This is what re.split is for.

codeape 2009-12-16 13:11:07

EMiller 2009-12-16 13:50:26

Answer 2

+1 A:

I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.

I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.

TC 2009-12-16 12:30:30

Answer 3

+3 A:

Here is a scary one to get the ball rolling.

>>> import re
>>> makere = lambda n: ''.join(['(.*?)(.+)(.*?)(.+)(.*?)'] + ['(.*)(\\2)(.*)(\\4)(.*)'] * (n - 1))
>>> inp = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
>>> re.match(makere(len(inp)), ''.join(inp)).groups()
('a', 'sometxt', '0', 'moretxt', '', 'b', 'sometxt', '1', 'moretxt', 'aa', '', 'sometxt', '10', 'moretxt', 'zz', '', 'sometxt', '999', 'moretxt', '')

I hope its sheer ugliness will inspire better solutions :)

Constantin 2009-12-16 12:31:44

+1: Very elegant

codeape 2009-12-16 12:42:07

Beautiful solution, a bit slow, but I can wait. Ugliness doesn't bother me, good code worth more than gold.Never thought of using regex that way, but this is way beyond my basic regex knowledge. I am trying to figure out how to separate the output into separate list of constants and variables.

iCy 2009-12-16 14:40:51

@iCy, it works significantly faster if there is a character that is never encountered in input data, like '\0'. In that case replace both ''.join with '\0'.join. It also gives a prettier result.

Constantin 2009-12-16 15:34:44

Brilliant! Yea, the strings only contain printable characters. '\0'.join made it much faster.

iCy 2009-12-16 15:46:22

Yes, that speeds it up at a great deal. My solution is still 200 times faster, though :-) Having more test cases would be nice...

codeape 2009-12-16 21:16:47

Answer 4

+1 A:

This solution finds the two longest common substrings and uses them to delimit the input strings:

def an_answer_to_stackoverflow_question_1914394(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> an_answer_to_stackoverflow_question_1914394(lst)
    (['sometxt', 'moretxt'], [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')])
    """
    delimiters = find_delimiters(lst)
    return delimiters, list(split_strings(lst, delimiters))

find_delimiters and friends finds the delimiters:

import itertools

def find_delimiters(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> find_delimiters(lst)
    ['sometxt', 'moretxt']
    """
    candidates = list(itertools.islice(find_longest_common_substrings(lst), 3))
    if len(candidates) == 3 and len(candidates[1]) == len(candidates[2]):
        raise ValueError("Unable to find useful delimiters")
    if candidates[1] in candidates[0]:
        raise ValueError("Unable to find useful delimiters")
    return candidates[0:2]

def find_longest_common_substrings(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(itertools.islice(find_longest_common_substrings(lst), 3))
    ['sometxt', 'moretxt', 'sometx']
    """
    for i in xrange(min_length(lst), 0, -1):
        for substring in common_substrings(lst, i):
            yield substring


def min_length(lst):
    return min(len(item) for item in lst)

def common_substrings(lst, length):
    """
    >>> list(common_substrings(["hello", "world"], 2))
    []
    >>> list(common_substrings(["aabbcc", "dbbrra"], 2))
    ['bb']
    """
    assert length <= min_length(lst)
    returned = set()
    for i, item in enumerate(lst):
        for substring in all_substrings(item, length):
            in_all_others = True
            for j, other_item in enumerate(lst):
                if j == i:
                    continue
                if substring not in other_item:
                    in_all_others = False
            if in_all_others:
                if substring not in returned:
                    returned.add(substring)
                    yield substring

def all_substrings(item, length):
    """
    >>> list(all_substrings("hello", 2))
    ['he', 'el', 'll', 'lo']
    """
    for i in range(len(item) - length + 1):
        yield item[i:i+length]

split_strings splits the strings using the delimiters:

import re

def split_strings(lst, delimiters):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(split_strings(lst, find_delimiters(lst)))
    [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]
    """
    for item in lst:
        parts = re.split("|".join(delimiters), item)
        yield tuple(part for part in parts if part != '')

codeape 2009-12-16 12:31:59

Why the downvote? My answer at least solves the problem.

codeape 2009-12-16 12:48:08

Have an upvote to counteract (and a bit) the random driveby. There's some interesting and useful stuff in your answer.

mavnn 2009-12-16 13:18:51

Thanks, much appreciated!

codeape 2009-12-16 13:46:16

We might be using a different version of python, 2.4.3 is what I am using. I got TypeError: min() takes no keyword arguments.

iCy 2009-12-17 10:53:51

Yes, I'm on 2.5. I edited the answer to make it work with 2.4

codeape 2009-12-17 13:08:55

Answer 5

+1 A:

This look much like the LZW algorithm for data (text) compression. There should be python implementations out there, which you may be able to adapt to your need.

I assume you have no a priori knowledge of these sub strings that repeat often.

Ber 2009-12-16 12:46:51

it's spelled "a priori"

B Johnson 2009-12-17 13:15:52

Answer 6

+2 A:

This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and is such the simplest, especially since it apparently is non-heuristic.

The first link contains detailed discussion and (pseudo) code examples. Assuming, of course, Im not completely of the track here.

mizipzor 2009-12-16 12:48:35

Yea, I thought about the diff alg. as well, just have no idea where to start.

iCy 2009-12-16 14:18:52

ansaurus

tags:

views:

answers:

Python string pattern recognition/compression

related questions