This is a pretty straightforward attempt. I haven't been using Python for long. It seems to work, but I'm sure I have much to learn, so someone let me know if I'm way off here. The function needs to find lines that match a pattern, keep the first matching line, replace the remaining consecutive matching lines with a summary message, and return the modified string.

Just to be clear, the regex .*Dog.* would take

Cat
Dog
My Dog
Her Dog
Mouse

and return

Cat
Dog
::::: Pattern .*Dog.* repeats 2 more times.
Mouse
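
The expected behaviour above can also be written down as a small executable check. This is only a minimal sketch, assuming every line ends with a newline; the helper name `summarize_runs` is hypothetical, it is built on `itertools.groupby`, and it is not the code under review in this question.

```python
import re
from itertools import groupby


def summarize_runs(text, pattern):
    """Collapse each run of consecutive matching lines to its first line plus a summary.

    Hypothetical helper for illustration only.
    """
    matcher = re.compile(pattern)
    out = []
    lines = text.splitlines(True)  # keep line endings
    # groupby splits the lines into runs that either all match or all fail to match.
    for matched, run in groupby(lines, key=lambda s: bool(matcher.search(s))):
        run = list(run)
        if not matched:
            out.extend(run)
            continue
        out.append(run[0])  # keep the first matching line of the run
        if len(run) > 1:    # a single match gets no summary line
            out.append("::::: Pattern %s repeats %d more times.\n" % (pattern, len(run) - 1))
    return "".join(out)


sample = "Cat\nDog\nMy Dog\nHer Dog\nMouse\n"
print(summarize_runs(sample, ".*Dog.*"))
```

Running it on the sample input should reproduce the four output lines shown above.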


#!/usr/bin/env python
#

import re

def remove_repeats (l_string, l_regex):
   """Take a string, remove similar lines and replace with a summary message.

   l_regex accepts strings and tuples.
   """

   # Convert string to tuple.
   if isinstance(l_regex, str):
      l_regex = l_regex,


   for t in l_regex:
      r = ''
      p = ''
      m = 0 # Count of consecutive matches; must exist before the first m += 1.
      for l in l_string.splitlines(True):
         if l.startswith('::::: Pattern'):
            r = r + l
         else:
            if re.search(t, l): # If line matches regex.
                m += 1
                if m == 1: # If this is first match in a set of lines add line to file.
                   r = r + l
                elif m > 1: # Else update the message string.
                   p = "::::: Pattern '" + t + "' repeats " + str(m-1) +  ' more times.\n'
            else:
                if p: # Write the message string if it has value.
                   r = r + p
                   p = ''
                m = 0
                r = r + l

      if p: # Write the message if loop ended in a pattern.
          r = r + p
          p = ''

      l_string = r # Reset string to modified string.

   return l_string
+1  A: 

The rematcher function seems to do what you want:

def rematcher(re_str, iterable):

    matcher= re.compile(re_str)
    in_match= 0
    for item in iterable:
        if matcher.match(item):
            if in_match == 0:
                yield item
            in_match+= 1
        else:
            if in_match > 1:
                yield "%s repeats %d more times\n" % (re_str, in_match-1)
            in_match= 0
            yield item
    if in_match > 1:
        yield "%s repeats %d more times\n" % (re_str, in_match-1)

import sys, re

for line in rematcher(".*Dog.*", sys.stdin):
    sys.stdout.write(line)

EDIT

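In your case, where the input is a single string, keep the line endings (the summary line already ends with a newline) and join without a separator:

final_string= ''.join(rematcher(".*Dog.*", your_initial_string.splitlines(True)))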
ΤΖΩΤΖΙΟΥ
Going to have to do some reading to wrap my mind around use of yield. Thanks.
Ethan Post
Yield is "return keeping state". OK, forget that. You ask me to start reciting the powers of two, which you will use in some calculations of your own. I start with "1" and you do your stuff. You then ask me, "next?". I say "2". This goes on. Every time you ask "next?", I _yield_ a value.
ΤΖΩΤΖΙΟΥ
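
The powers-of-two analogy in the comment above can be sketched as a generator. This is a minimal illustration only; the name `powers_of_two` is made up for the example and does not appear in the answer's code.

```python
def powers_of_two():
    """Yield 1, 2, 4, 8, ... one value per request, remembering where it left off."""
    value = 1
    while True:
        yield value   # hand back one value and pause here
        value *= 2    # execution resumes from this line on the next request


gen = powers_of_two()
first_five = [next(gen) for _ in range(5)]  # each next() is one "next?"
print(first_five)  # [1, 2, 4, 8, 16]
print(next(gen))   # 32
```

A `for` loop over a generator, as in the rematcher answer, simply keeps asking "next?" until the generator function returns.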
+1  A: 

I updated your code to be a bit more efficient: it makes a single pass over the lines instead of one pass per pattern.

#!/usr/bin/env python
#

import re
import types

def remove_repeats (l_string, l_regex):
   """Take a string, remove similar lines and replace with a summary message.

   l_regex accepts strings/patterns or tuples of strings/patterns.
   """

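   # Convert a single string/pattern to a tuple.
   # (Testing for '__iter__' only works on Python 2; on Python 3 str is iterable too.)
   if not isinstance(l_regex, (tuple, list)):
      l_regex = l_regex,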

   ret = []
   last_regex = None
   count = 0

   for line in l_string.splitlines(True):
      if last_regex:
         # Previous line matched one of the regexes
         if re.match(last_regex, line):
            # This one does too
            count += 1
            continue  # skip to next line
         elif count > 1:
            ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))
         count = 0
         last_regex = None

      ret.append(line)

      # Look for other patterns that could match
      for regex in l_regex:
         if re.match(regex, line):
            # Found one
            last_regex = regex
            count = 1
            break  # exit inner loop

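   # Write the summary if the input ended in the middle of a run.
   if count > 1:
      ret.append("::::: Pattern %r repeats %d more times.\n" % (last_regex, count-1))

   return ''.join(ret)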
MizardX
A: 

First, your regular expression will match more slowly than it would without the leading and trailing greedy wildcards. When used with re.search, as in your code,

.*Dog.*

matches exactly the same lines as

Dog

but the latter matches more quickly because no backtracking is involved. With the greedy form, ".*" first consumes the whole line, and the regex engine then has to back up until it finds "Dog", so the longer the line, the more backtracking work it may have to do. As it is, ".*D" virtually guarantees backtracking.
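
A quick way to check the equivalence claimed above is to run both patterns over the sample lines from the question. This is a minimal sketch; note that the equivalence relies on re.search, which scans the whole line, and does not carry over to re.match (used in the other answers in this thread), which is anchored at the start of the line.

```python
import re

lines = ["Cat", "Dog", "My Dog", "Her Dog", "Mouse"]

# With re.search, the bare literal and the padded pattern select the same lines.
padded = [s for s in lines if re.search(r".*Dog.*", s)]
bare = [s for s in lines if re.search(r"Dog", s)]
print(padded == bare)  # True

# With re.match the bare literal is NOT equivalent:
# it only accepts lines that begin with "Dog".
anchored = [s for s in lines if re.match(r"Dog", s)]
print(anchored)  # ['Dog']
```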

That said, how about:

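#! /usr/bin/env python

import re            # regular expressions
import fileinput     # read from STDIN or file

my_regex = '.*Dog.*'
my_matches = 0       # length of the current run of matching lines

for line in fileinput.input():
    line = line.strip()

    if re.search(my_regex, line):
        if my_matches == 0:
            print(line)  # first line of a run is printed as-is
        my_matches = my_matches + 1
    else:
        if my_matches != 0:
            print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))
        print(line)
        my_matches = 0

# Report a run that reaches the end of the input.
if my_matches != 0:
    print('::::: Pattern %s repeats %i more times.' % (my_regex, my_matches - 1))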

It's not clear what should happen with non-neighboring matches.

It's also not clear what should happen with single-line matches surrounded by non-matching lines. Append "Doggy" and "Hula" to the input file and you'll get the message "repeats 0 more times".

Thanks. Non-neighboring matches are not counted. Single-line matches are not counted.
Ethan Post