Hi,

I have a text file which has a lot of random occurrences of the string @STRING_A, and I would like to write a short script which removes only some of them. In particular, one that scans the file and, once it finds a line which starts with this string, like

@STRING_A

then checks if 3 lines backwards there is another occurrence of a line starting with the same string, like

@STRING_A


@STRING_A

and, if so, deletes the occurrence 3 lines backward. I was thinking about bash, but I do not know how to "go backwards" with it. So I am sure that this is not possible with bash. I also thought about python, but then I would have to store the whole file in memory in order to go backwards, which would be infeasible for long files.

What do you think? Is it possible to do it in bash or python?

Thanks

A: 

In bash you can use sort -r filename and tail -n filename to read the file backwards.

LINES=`tail -n filename | sort -r`
# now iterate through the lines and do your checking
Andrew Austin
How in the world does sorting a file alphabetically (sort -r) or outputting the last n lines of a file (tail -n) solve this problem?
Eric Smith
What OS are you using? sort -r does not sort alphabetically under linux. I edited for clarity. http://www.thelinuxblog.com/linux-man-pages/1/sort http://www.thelinuxblog.com/linux-man-pages/1/tail
Andrew Austin
+1  A: 

Why shouldn't it be possible in bash? You don't need to keep the whole file in memory, just the last three lines (if I understood correctly), and write what's appropriate to standard output. Redirect that into a temporary file, check that everything worked as expected, and overwrite the source file with the temporary one.

Same goes for Python.
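
A minimal Python 3 sketch of that workflow, assuming the whole earlier line is to be dropped when both it and the line three lines later start with the marker. The names drop_earlier and rewrite_in_place are made up for illustration, and the manual "check that everything worked" step is skipped here: the temporary file replaces the source as soon as it has been written.

```python
import os
import tempfile
from collections import deque


def drop_earlier(lines, marker, gap=3):
    """Yield lines, skipping a marker line when the line `gap` lines
    after it also starts with `marker` (hypothetical helper)."""
    window = deque()  # holds at most `gap` lines that are not yet decided
    for line in lines:
        if len(window) == gap:
            old = window.popleft()
            if not (old.startswith(marker) and line.startswith(marker)):
                yield old
        window.append(line)
    yield from window  # flush the last few lines


def rewrite_in_place(path, marker, gap=3):
    """Write the filtered lines to a temporary file, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path) as src, tempfile.NamedTemporaryFile(
            "w", dir=directory, delete=False) as tmp:
        tmp.writelines(drop_earlier(src, marker, gap))
    # Replace the source only after the temporary file is complete.
    os.replace(tmp.name, path)
```

Only `gap` lines are held in memory at any time, so the file size should not matter.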

I'd provide a script of my own, but that wouldn't be tested. ;-)

DevSolar
+2  A: 

Of course Python will work as well. Simply store the last three lines in an array and check if the first element in the array is the same as the value you are currently reading. Then delete the value and print out the current array. You would then move over your elements to make room for the new value and repeat. Of course when the array is filled you'd have to make sure to continue to move values out of the array and put in the newly read values, stopping to check each time to see if the first value in the array matches the value you are currently reading.

AlbertoPL
+1  A: 

Try this in Python, it will scan through the file and keep only 3 lines in memory by default:

from collections import deque

def delete(fp, marker, gap=3):
    """Delete lines from *fp* if they start with *marker* and are followed
    by another line starting with *marker* *gap* lines after.

    >>> from StringIO import StringIO
    >>> fp = StringIO('''a
    ... b
    ... xxx 1
    ... c
    ... xxx 2
    ... d
    ... e
    ... xxx 3
    ... f
    ... g
    ... h
    ... xxx 4
    ... i''')
    >>> print ''.join(delete(fp, 'xxx'))
    a
    b
    xxx 1
    c
    d
    e
    xxx 3
    f
    g
    h
    xxx 4
    i
    """
    buf = deque()
    for line in fp:
        if len(buf) < gap:
            buf.append(line)
        else:
            old = buf.popleft()
            if not (line.startswith(marker) and old.startswith(marker)):
                yield old
            buf.append(line)
    for line in buf:
        yield line

I've only tested it using the doctest you see.

Martin Geisler
Doesn't look correct to me. The OP said nothing about deleting a region. He did say: """random occurrences of the string @STRING_A, and I would be interested in writing a short script which removes only some of them""" and """delete the occurrence 3 lines backward""".
John Machin
Well, it was trivial to update the code to match the question :-)
Martin Geisler
s/trivial update/rewrite/ ... and also you still haven't got the point that the OP said he wanted to remove all occurrences of the string, NOT the whole line.
John Machin
@John: Come on, that is really not the point. The OP was concerned with having to go backwards in the file. My code and the code by goger show how one can avoid this by using a small ring buffer. I don't see why you're so upset with the exact formulation of the original question -- the question is not very precise so the OP should not be surprised if he has to adapt the answers slightly.
Martin Geisler
A: 

I would consider using sed. GNU sed supports line-range addresses. If sed fails, there is another beast, awk, and I'm sure you can do it with awk.

O.K., I feel I should post my awk proof of concept. I could not figure out how to use sed addresses. I have not tried a combination of awk+sed, but it seems like overkill to me.

My awk script works as follows:

  • It reads lines and stores them in a 3-line buffer

  • once the desired pattern is found (/^data.*/ in my case), the 3-line buffer is searched to check whether the pattern was also seen three lines ago

  • if the pattern was seen, then the 3 buffered lines are scratched

To be honest, I would probably go with python too, given that awk is really awkward. The AWK code follows:

# Ring buffer of the last three lines: buf[] holds them, n counts how
# many are buffered, and r is the index of the oldest one.
BEGIN {
    n = 0;  # number of buffered lines
    r = 0;  # read index (oldest buffered line)
}

{
    # pattern found now and also three lines ago
    # -> scratch the buffered lines
    if (n == 3 && $0 ~ /^data/ && buf[r] ~ /^data/) {
        n = 0;
    }
    # if the history is full, print out the oldest line
    if (n == 3) {
        print buf[r];
        r = (r + 1) % 3;
        n--;
    }
    # store the current line into the buffer
    buf[(r + n) % 3] = $0;
    n++;
}

END {
    # flush buffer, oldest line first
    for (k = 0; k < n; k++) {
        print buf[(r + k) % 3];
    }
}
SashaN
You could do it in Brainfuck or INTERCAL, too. The trick is in the "how"...
DevSolar
awk can probably do it by itself... but I suspect it's cleaner to actually use awk+sed, as per my solution above.
jkerian
+1  A: 

As AlbertoPL said, store lines in a fifo for later use--don't "go backwards". For this I would definitely use python over bash+sed/awk/whatever.

I took a few moments to code this snippet up:

from collections import deque
line_fifo = deque()
for line in open("test"):
    line_fifo.append(line)
    if len(line_fifo) == 4:
        # "look 3 lines backward"
        if line_fifo[0] == line_fifo[-1] == "@STRING_A\n":
            # get rid of that match
            line_fifo.popleft()
        else:
            # print out the top of the fifo
            print line_fifo.popleft(),
# don't forget to print out the fifo when the file ends
for line in line_fifo: print line,
goger
The OP says that he wants only the occurrence of "@STRING_A" deleted from the start of the line ... "line starting with", "delete the occurrence" ('occurrence' is used everywhere to mean that string). Everybody seems to be assuming that the whole line is (a) to be tested against (b) deleted. Point 2: why roll your own fifo when there's a deque provided?
John Machin
@John: I think the OP could have made things more precise by giving an example of how the file should look before and after the deletion. Both my code above and goger's code should be enough to solve the problem.
Martin Geisler
@John: IMO the first point is an implementation detail for the OP, tangential to the meat of the question. Your second point regarding the deque is a good one and I've updated my code.
goger
@Martin: I feel a bit silly...I didn't see your code, just your doctest and didn't scroll past. Now that I look, your code looks good to me.
goger
A: 

My awk-fu has never been that good... but the following may provide you what you're looking for in a bash-shell/shell-utility form:

sed `awk 'BEGIN{ORS=";"}
/@STRING_A/ {
  if(LAST!="" && LAST+3 >= NR) print LAST "d"
  LAST = NR
}' test_file` test_file

Basically... awk is producing a command for sed to strip certain lines. I'm sure there's a relatively easy way to make awk do all of the processing, but this does seem to work.

The bad part? It does read the test_file twice.

The good part? It is a bash/shell-utility implementation.

Edit: Alex Martelli points out that the sample file above might have confused me. (my above code deletes the whole line, rather than the @STRING_A flag only)

This is easily remedied by adjusting the command to sed:

sed `awk 'BEGIN{ORS=";"}
/@STRING_A/ {
  if(LAST!="" && LAST+3 >= NR) print LAST "s/@STRING_A//"
  LAST = NR
}' test_file` test_file
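
The same two-pass idea (first work out which line numbers to drop, then filter the file) can be sketched in Python 3 as well. This is only an illustration under a few assumptions that differ from the awk above: the marker must be at the start of the line, the gap must be exactly three lines, and whole lines are dropped. The function names lines_to_drop and filtered are made up.

```python
def lines_to_drop(path, marker, gap=3):
    """First pass: collect the 1-based numbers of marker lines that have
    another marker line exactly `gap` lines after them."""
    marked = set()  # numbers of all marker lines seen so far
    drop = set()
    with open(path) as fh:
        for number, line in enumerate(fh, start=1):
            if line.startswith(marker):
                marked.add(number)
                if number - gap in marked:
                    drop.add(number - gap)
    return drop


def filtered(path, drop):
    """Second pass: yield every line whose number is not in `drop`."""
    with open(path) as fh:
        for number, line in enumerate(fh, start=1):
            if number not in drop:
                yield line
```

Like the sed/awk version it reads the file twice, but it only keeps line numbers in memory, never the lines themselves.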
jkerian
+2  A: 

Here is a more fun solution, using two iterators with a three element offset :)

from itertools import izip, chain, tee
f1, f2 = tee(open("foo.txt"))
for third, line in izip(chain("   ", f1), f2):
    if not (third.startswith("@STRING_A") and line.startswith("@STRING_A")):
        print line,
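
For Python 3, where izip is gone and print is a function, roughly the same trick might look like the sketch below. The function name skip_later is made up; like the snippet above, it omits the later of the two marker lines and keeps the earlier one.

```python
from itertools import chain, tee


def skip_later(lines, marker, gap=3):
    """Yield lines, omitting a marker line when the line `gap` lines
    before it also starts with `marker`."""
    behind, current = tee(lines)
    # Pad the lagging iterator so it trails `current` by `gap` items.
    padded = chain([""] * gap, behind)
    for earlier, line in zip(padded, current):
        if not (earlier.startswith(marker) and line.startswith(marker)):
            yield line
```

It could be used as `sys.stdout.writelines(skip_later(open("foo.txt"), "@STRING_A"))`; tee only has to buffer the few lines between the two iterators.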
truppo
Very cool! :-) Using the tee function from itertools ("T", as in the kind of piping you use to split a water pipe into two pipes) you can get two iterators for the file and thus avoid reading the file twice. I don't think it will matter much here since the OS would buffer the file anyway, but it's fun to play with iterators :-)
Martin Geisler
Tee sounds good, updated the code.
truppo
+4  A: 
Alex Martelli
Why would you repost the code from truppo like that? And why do you guys keep complaining about our perfectly fine solutions when the question is not very clear to start with?
Martin Geisler
@Martin, I agree with John Machin about the likely interpretation of the question -- though you're right it's slightly ambiguous, and your or truppo's solutions would be fine under a different interpretation, I thought that posting the solution to the most likely interpretation was better than leaving that unanswered. I picked truppo's answer (with full credit, of course!) as a base because I agree with your comment about it being cool, and did not edit it in-place because that would violate the editing guidelines. Hope this helps!
Alex Martelli
A: 

This may be what you're looking for?

lines = open('sample.txt').readlines()

needle = "@string "

for i,line in enumerate(lines):
    if line.startswith(needle) and lines[i-3].startswith(needle):
        lines[i-3] = lines[i-3].replace(needle, "")
print ''.join(lines)

this outputs:

string 0 extra text
string 1 extra text
string 2 extra text
string 3 extra text
--replaced --  4 extra text
string 5 extra text
string 6 extra text
@string 7 extra text
string 8 extra text
string 9 extra text
string 10 extra text
lyrae
Replaces needle instead of removing it. Reads the whole file into memory and then makes ANOTHER copy during the print statement at the end. Writes an extra newline at the end of the output. Will crash (IndexError) if needle occurs in the first 3 lines.
John Machin
Easily fixable. He can make the replacement be "". That will delete the needle from the line. Does not cause indexerror. Doesn't need to print at the end; can write directly to another file. It does however copy entire file into memory.
lyrae
Can be made to cause IndexError. Can be made to munch an innocent line. Making the replacement "" is not enough. See the "answer" for demonstration.
John Machin
A: 

This "answer" is for lyrae ... I'll amend my previous comment: if the needle is in the first 3 lines of the file, your script will either cause an IndexError or access a line that it shouldn't be accessing, sometimes with interesting side-effects.

Example of your script causing IndexError:

>>> lines = "@string line 0\nblah blah\n".splitlines(True)
>>> needle = "@string "
>>> for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: list index out of range

and this example shows not only that the Earth is round but also why your "fix" to the "don't delete the whole line" problem should have used .replace(needle, "", 1) or [len(needle):] instead of .replace(needle, "")

>>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
>>> needle = "NEEDLE"
>>> # Expected result: no change to the file
... for i,line in enumerate(lines):
...     if line.startswith(needle) and lines[i-3].startswith(needle):
...         lines[i-3] = lines[i-3].replace(needle, "")
...
>>> print ''.join(lines)
 x  y   <<<=== whoops!
noddle
nuddle
        <<<=== still got unwanted newline in here
>>>
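
Putting both fixes together, a guarded version might look like the following Python 3 sketch. The function name strip_earlier_marker is made up, and it assumes lines is a list so that it can be indexed.

```python
def strip_earlier_marker(lines, needle, gap=3):
    """Return a copy of `lines` in which `needle` is cut from the start
    of a line when the line `gap` lines after it also starts with it."""
    result = list(lines)
    for i, line in enumerate(lines):
        # i >= gap stops lines[i - gap] from wrapping around to the end
        # of the list (or raising IndexError on short input).
        if (i >= gap and line.startswith(needle)
                and lines[i - gap].startswith(needle)):
            # Slice off only the leading needle; later copies in the
            # same line are left alone.
            result[i - gap] = lines[i - gap][len(needle):]
    return result
```

Writing the result with `sys.stdout.writelines(...)` instead of print would also avoid the extra newline at the end.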
John Machin
ahh gotcha. thanks.
lyrae