ansaurus

Question

What is the best way to remove duplicate lines matching a regex from a string using Python?

Answer 1

+1 A:

The rematcher function seems to do what you want:

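import re
import sys

def rematcher(re_str, iterable):
    """Yield items, collapsing each run of consecutive matches into its first item plus a summary."""
    matcher = re.compile(re_str)
    in_match = 0  # number of consecutive matching items seen so far
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item  # pass through only the first item of a run
            in_match += 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match - 1)
            in_match = 0
            yield item
    # Flush the summary if the input ended in the middle of a run.
    if in_match > 1:
        yield "%s repeats %d more times\n" % (re_str, in_match - 1)

for line in rematcher(".*Dog.*", sys.stdin):
    sys.stdout.write(line)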

EDIT

In your case, the final string should be:

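# Keep the line endings so the summary line (which already ends in "\n")
# is not followed by an extra blank line.
final_string = ''.join(rematcher(".*Dog.*", your_initial_string.splitlines(True)))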

ΤΖΩΤΖΙΟΥ 2008-10-03 17:34:20

Going to have to do some reading to wrap my mind around use of yield. Thanks.

Ethan Post 2008-10-03 18:48:14

Yield is "return keeping state". OK, forget that. You ask me to start reciting the powers of two, which you will use in some calculations of your own. I start with "1" and you do your stuff. You then ask me, "next?". I say "2". This goes on. Every time you ask "next?", I _yield_ a value.

ΤΖΩΤΖΙΟΥ 2008-10-03 20:52:42
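
The powers-of-two analogy in the comment above can be written out as a generator. This is only an illustrative sketch: the name powers_of_two is made up for the example and appears nowhere in the answer.

```python
from itertools import islice

def powers_of_two():
    """Yield 1, 2, 4, 8, ... one value per request, remembering where it left off."""
    value = 1
    while True:
        yield value   # hand back one value and pause here
        value *= 2    # execution resumes from this line on the next request

gen = powers_of_two()
first = next(gen)     # the first value
second = next(gen)    # asking "next?" resumes the function where it paused

# islice takes a fixed number of values from the otherwise endless generator.
first_five = list(islice(powers_of_two(), 5))
```

Each call to next() plays the role of asking "next?". The function body does not restart; it continues from the yield where it last stopped.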

Answer 2

+1 A:

I updated your code to be a bit more efficient:

#!/usr/bin/env python

import re

def remove_repeats(l_string, l_regex):
    """Take a string, remove similar lines and replace with a summary message.

    l_regex accepts strings/patterns or tuples of strings/patterns.
    """

    # Wrap a single string/pattern in a tuple. A str is itself iterable in
    # Python 3, so it has to be checked for explicitly.
    if isinstance(l_regex, str) or not hasattr(l_regex, '__iter__'):
        l_regex = l_regex,

    ret = []
    last_regex = None
    count = 0
    summary = "::::: Pattern %r repeats %d more times.\n"

    for line in l_string.splitlines(True):
        if last_regex:
            # Previous line matched one of the regexes
            if re.match(last_regex, line):
                # This one does too
                count += 1
                continue  # skip to next line
            elif count > 1:
                ret.append(summary % (last_regex, count - 1))
            count = 0
            last_regex = None

        ret.append(line)

        # Look for other patterns that could match
        for regex in l_regex:
            if re.match(regex, line):
                # Found one
                last_regex = regex
                count = 1
                break  # exit inner loop

    # Emit the summary for a run that reaches the end of the string
    if last_regex and count > 1:
        ret.append(summary % (last_regex, count - 1))

    return ''.join(ret)

MizardX 2008-10-03 17:44:00

Answer 3

A:

First, your regular expression will match more slowly than if you had left off the greedy match.

.*Dog.*

is equivalent, when used with re.search (as below) rather than re.match, to

Dog

but the latter matches more quickly because no backtracking is involved. The longer the strings, the more likely "Dog" appears multiple times and thus the more backtracking work the regex engine has to do. As it is, ".*D" virtually guarantees backtracking.
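
As a quick check of the equivalence claimed above, the sketch below compares the two patterns on a handful of made-up sample lines. It assumes the pattern is used only to decide whether a single line (no embedded newlines) matches, and it also shows that the plain pattern needs re.search, because re.match anchors at the start of the string.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["Dog", "Hot Dog stand", "Cat", "Dog and another Dog", ""]

# The original pattern, matched from the start of each line.
with_wildcards = [bool(re.match(r".*Dog.*", s)) for s in lines]

# The plain pattern, searched for anywhere in the line.
plain_search = [bool(re.search(r"Dog", s)) for s in lines]

# The plain pattern with re.match only finds "Dog" at the very start.
plain_match = [bool(re.match(r"Dog", s)) for s in lines]
```

Here with_wildcards and plain_search agree on every line, while plain_match misses "Hot Dog stand".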

That said, how about:

#!/usr/bin/env python

import re            # regular expressions
import fileinput     # read from STDIN or file

my_regex = '.*Dog.*'
my_matches = 0

for line in fileinput.input():
    line = line.strip()

    if re.search(my_regex, line):
        if my_matches == 0:
            print(line)
        my_matches += 1
    else:
        if my_matches != 0:
            print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
        print(line)
        my_matches = 0

It's not clear what should happen with non-neighboring matches.

It's also not clear what should happen with single-line matches surrounded by non-matching lines. Append "Doggy" and "Hula" to the input file and you'll get a message saying the pattern repeats "0" more times.

2009-08-08 23:45:20

Thanks. Non-neighboring matches are not counted. Single-line matches are not counted.

Ethan Post 2009-08-10 13:22:04
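
Given that clarification (only neighboring matches are collapsed, and a single match gets no summary), one possible way to express the rule is with itertools.groupby. This is a sketch and not code from any of the answers: the function name collapse_runs and the sample text are assumptions made for illustration, and the message format is borrowed from the script above.

```python
import re
from itertools import groupby

def collapse_runs(text, pattern):
    """Keep the first line of each run of consecutive matching lines and
    summarise the rest; non-matching lines and single matches pass through."""
    regex = re.compile(pattern)
    out = []
    # groupby splits the lines into consecutive runs of matching / non-matching.
    for is_match, group in groupby(text.splitlines(), key=lambda s: bool(regex.search(s))):
        run = list(group)
        if not is_match:
            out.extend(run)
            continue
        out.append(run[0])
        if len(run) > 1:  # a lone match produces no summary line
            out.append("::::: Pattern %s repeats %d more times." % (pattern, len(run) - 1))
    # Note: any trailing newline in the input is not preserved.
    return "\n".join(out)

sample = "Cat\nDog 1\nDog 2\nDog 3\nHula\nDoggy\nHula\nDog 4\nDog 5"
result = collapse_runs(sample, "Dog")
```

With this sample, the lone "Doggy" line is printed as-is, and the run at the very end of the input still gets its summary line.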
