How can I use grep to output occurrences of the string 'export to excel' in the input files given below? Specifically, how do I handle line breaks that fall in the middle of the search string? Is there a switch in grep that can do this, or some other command?

Input files:

File a.txt:

blah blah ... export to
excel ...
blah blah..

File b.txt:

blah blah ... export to excel ...
blah blah..

A: 

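With GNU grep you could do grep -Pzo "export[\r\n\s]*to[\r\n\s]*excel" files. The -z option makes grep read each file as a single record, so the pattern can match across a line break; without it, grep matches one line at a time and the pattern never sees the newline. Each match is printed with a trailing NUL byte instead of a newline.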

Tordek
A: 

Do you just want to find files that contain the pattern, ignoring linebreaks, or do you want to actually see the matching lines?

If the former, you can use tr to convert newlines to spaces:

tr '\n' ' ' < file | grep 'export to excel'

If the latter, you can do the same thing, but you may want to use the -o flag to print only the actual match. You'll then want to adjust your regex to include any extra context you want.
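For comparison, here is a minimal Python sketch of the same idea: flatten newlines to spaces, then pull out each match with a little context, roughly what tr followed by grep -o with a widened regex would give. The function name find_across_newlines and the context parameter are made up for this illustration.

```python
import re

def find_across_newlines(text, context=10):
    # Same effect as tr: every newline becomes a space.
    flat = text.replace("\n", " ")
    # Like grep -o with a widened regex: keep up to `context`
    # characters on each side of the match.
    pattern = re.compile(r".{0,%d}export to excel.{0,%d}" % (context, context))
    return pattern.findall(flat)

sample = "blah blah ... export to\nexcel ...\nblah blah..\n"
for hit in find_across_newlines(sample):
    print(hit)
```

Like the tr pipeline, this holds the whole input in memory as one string.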

Laurence Gonsalves
The tr + grep solution is not really suitable for big files, as it is going to form one BIG string.
A: 

Use gawk. Set the record separator to "excel", then check for "export to".

gawk -vRS="excel" '/export.*to/{print "found export to excel at record: "NR}' file

or

gawk '/export.*to.*excel/{print}
/export to/&&!/excel/{
  s=$0
  getline line
  if (line~/excel/){
   printf "%s\n%s\n",s,line
  } 
}' file
How would you print the actual lines as `grep` would (for matches within its capability)?
Dennis Williamson
Print the record, $0. Otherwise, I don't understand what you mean.
I think your edit takes care of that. However, it fails for some edge cases. If the input was something like "excel export to\nexcel" or "export to\nsomething other than excel", for example. To answer your question in your comment: the original one-liner, if $0 were added to the output, would not show the "excel" and especially the "..." after it that is indicated in the OP's question.
Dennis Williamson
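To make those two edge cases concrete, here is a rough Python emulation of the two-line gawk script's logic, next to a stricter whole-text regex. The names pair_check and strict_check are invented for this sketch, and the emulation only approximates gawk's getline behavior at end of file.

```python
import re

def pair_check(lines):
    # Rough emulation of the two-rule gawk script above.
    hits = []
    it = iter(lines)
    for line in it:
        # Rule 1: everything on one line.
        if re.search(r"export.*to.*excel", line):
            hits.append(line)
        # Rule 2: "export to" here, "excel" anywhere on the next line.
        if "export to" in line and "excel" not in line:
            nxt = next(it, None)  # like getline: consumes the next line
            if nxt is not None and "excel" in nxt:
                hits.append(line + "\n" + nxt)
    return hits

def strict_check(text):
    # Only whitespace (including a newline) may separate the three words.
    return re.findall(r"export\s+to\s+excel", text)

print(pair_check(["excel export to", "excel"]))                # [] : real match missed
print(strict_check("excel export to\nexcel"))                  # ['export to\nexcel']
print(len(pair_check(["export to", "something other than excel"])))  # 1 : false positive
print(strict_check("export to\nsomething other than excel"))   # []
```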
A: 

I have tested this a little and it seems to work:

sed -n '/export to excel/{p; b}; $b; N; /export to\nexcel/{p; b}; D' filename

The single-line test comes before $b so that a match on the last line of the file is still printed. You can allow for some extra white space at the end and beginning of the lines like this:

sed -n '/export to excel/{p; b}; $b; N; /export to\s*\n\s*excel/{p; b}; D' filename
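
Since the sed one-liner is terse, here is a rough Python rendering of the same two-line sliding window idea. It is a sketch in the spirit of the N and D commands, not an exact translation, and the function name sliding_window_matches is made up for the example.

```python
import re

SINGLE = re.compile(r"export to excel")
SPLIT = re.compile(r"export to\s*\n\s*excel")

def sliding_window_matches(lines):
    """Yield matching text using a window of at most two lines."""
    i = 0
    while i < len(lines):
        current = lines[i]
        if SINGLE.search(current):
            yield current                # like p; b (print, start next cycle)
            i += 1
            continue
        if i + 1 < len(lines):
            joined = current + "\n" + lines[i + 1]   # like N
            if SPLIT.search(joined):
                yield joined             # both lines printed and consumed
                i += 2
                continue
        i += 1                           # like D: drop first line, keep second

a_txt = ["blah blah ... export to", "excel ...", "blah blah.."]
for hit in sliding_window_matches(a_txt):
    print(hit)
```

Only two lines are held at any time, so memory use does not grow with file size.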
Dennis Williamson
A: 

I don't know how to do this in grep. I checked the man page for egrep(1) and it can't match with a newline in the middle either.

I like the solution @Laurence Gonsalves suggested, of using tr(1) to wipe out the newlines. But as he noted, it will be a pain to print the matching lines if you do it that way.

If you want to match despite a newline and then print the matching line(s), I can't think of a way to do it with grep, but it would be not too hard in any of Python, AWK, Perl, or Ruby.

Here's a Python script that solves the problem. I decided that, for lines that only match when joined to the previous line, I would print a --> arrow before the second line of the match. Lines that match outright are always printed without the arrow.

This was written for Python 2.x, but it only uses syntax that also runs under Python 3.x.

#!/usr/bin/python

import re
import sys

# raw string, so the regex escapes reach the re module untouched
s_pat = r"export\s+to\s+excel"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        f = open(fname, "rt")
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)

    prev_line = ""
    i_last = -10
    for i, line in enumerate(f):
        # is ete within current line?
        if pat.search(line):
            print("%s:%d: %s" % (fname, i+1, line.strip()))
            i_last = i
        else:
            # construct extended line that includes previous line
            # note the previous line's newline is stripped
            s = prev_line.strip("\n") + " " + line
            # is ete within extended line?
            if pat.search(s):
                # matched ete in extended line, so we want both lines printed
                # did we print prev line?
                if not i_last == (i - 1):
                    # no, so print it now
                    print("%s:%d: %s" % (fname, i, prev_line.strip()))
                # print cur line with special marker
                print("-->  %s:%d: %s" % (fname, i+1, line.strip()))
                i_last = i
        # make sure we don't match ete twice
        prev_line = re.sub(pat, "", line)
    f.close()

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
            "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])

EDIT: added comments.

I went to some trouble to make it print the correct line number on each line, using a format similar to what you would get with grep -Hn.

It could be much shorter and simpler if you don't need line numbers, and you don't mind reading in the whole file at once into memory:

#!/usr/bin/python

import re
import sys

# This pattern is compiled without re.MULTILINE; that flag only changes
# how ^ and $ behave, and the pattern uses neither.
# We *want* the \s pattern to match a newline here so it can
# match across multiple lines.
# Note the match group that gathers text around ete pattern uses a character
# class that matches anything but "\n", to grab text around ete.
s_pat = r"([^\n]*export\s+to\s+excel[^\n]*)"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        f = open(fname, "rt")
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)

    # read the whole file into memory at once
    text = f.read()
    f.close()

    for s_match in pat.findall(text):
        print(s_match)

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
            "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])
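
If line numbers are still wanted with the whole-file approach, one possible way is to count the newlines that precede each match. This is only a sketch; the helper name matches_with_line_numbers is hypothetical and not part of the scripts above.

```python
import re

PAT = re.compile(r"export\s+to\s+excel")

def matches_with_line_numbers(text):
    """Return (first_line, last_line, matched_text) for every match."""
    results = []
    for m in PAT.finditer(text):
        # Line number = newlines seen before that offset, plus one.
        first = text.count("\n", 0, m.start()) + 1
        last = text.count("\n", 0, m.end()) + 1
        results.append((first, last, m.group(0)))
    return results

sample = "blah blah ... export to\nexcel ...\nblah blah..\n"
print(matches_with_line_numbers(sample))
# [(1, 2, 'export to\nexcel')]
```

Counting from the start of the text for every match is quadratic in the worst case, which should be acceptable for files small enough to read into memory in one go.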
steveha
I don't see that you compiled the regex with re.MULTILINE, so how does it check for "excel" on another line?
re.MULTILINE was *not* what I wanted, so I didn't specify it. re.MULTILINE only changes where ^ and $ match; \s matches a newline with or without it, and I wanted a newline treated like any other white space in the matching. I will add some comments to the code.
steveha
Actually, my first version would work the same with or without re.MULTILINE, since it builds a special single line and strips the newline between the two lines in the process. The second, slurp-in-whole-file version depends on \s matching around a newline, which the flag does not affect because the pattern has no ^ or $ anchors.
steveha
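
For reference, Python's re documentation describes re.MULTILINE as changing only where ^ and $ match. A small check along those lines, using the sample text from the question (the variable names are just for illustration):

```python
import re

text = "blah blah ... export to\nexcel ...\n"
pattern = r"export\s+to\s+excel"

# \s matches the newline whether or not re.MULTILINE is given.
plain = re.search(pattern, text)
multi = re.search(pattern, text, re.MULTILINE)
print(plain.group(0) == multi.group(0))                    # True

# The flag only matters for anchors: "excel" starts line 2, not the string.
print(re.search(r"^excel", text))                          # None
print(bool(re.search(r"^excel", text, re.MULTILINE)))      # True
```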