views: 242

answers: 7

On a Linux system, I have a very large text file and I need to create a new text file which contains every line between the first and last occurrences of a particular sessionId (those lines included).

I guess I probably need to use sed or something?

As a bonus, sometimes I won't know which log file will contain the session trace. So a script that can work with regular expressions would be ideal. In this case I would expect the script to find the first file with the sessionId in it and then crop that file before exiting.

Example log file, looking for sessionId 1111-ABCD-1111-SOME-GUID:

line one containing other session id: 2222-ABCD-1111-SOME-GUID blaa blaa blaa
line two blaa blaa blaa
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line four containing other session id: 2222-ABCD-1111-SOME-GUID
line five blaa blaa blaa
line six containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line seven containing other session id: 2222-ABCD-1111-SOME-GUID
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line nine containing other session id: 3333-ABCD-1111-SOME-GUID
line ten containing my session id: 1111-ABCD-1111-SOME-GUID
line eleven
line twelve containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line thirteen containing my session id: 1111-ABCD-1111-SOME-GUID
line fourteen blaa blaa blaa
line fifteen containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa

The output file would contain lines three to thirteen, inclusive.

A: 

The following script will do all of what you asked, including the bonus. Put it in the top-level directory that contains all the possible files with the 'uid' you want to crop. The script will recursively search that directory, crop every file that matches, and put the result in a new file with a .crp extension (see example below). I took special care to make sure this script works with whatever filename you throw at it, whether it contains spaces, newlines, or anything else in its name.

#!/bin/bash
uid="1111-ABCD-1111-SOME-GUID"

# grep -lZR: recursively (-R) list the names (-l) of files containing $uid,
# NUL-delimited (-Z), skipping this script itself and any previous .crp output.
while IFS= read -r -d $'\0' file; do
    # Three ex commands, fed via printf:
    #   ?$uid?+1,$d   delete from just after the LAST occurrence to EOF
    #   1,/$uid/-1d   delete from line 1 to just before the FIRST occurrence
    #   %p            print the surviving lines to stdout
    printf "%s\n" "?$uid?+1,\$d" "1,/$uid/-1d" "%p" | ex -s "$file" > "$file".crp
    echo "$file being cropped"
done < <(grep -lZR --exclude="${0#*/}" --exclude="*.crp" "$uid" .)

Result

$ ./uid.sh
./sample1.txt being cropped
./subdir/sample2.txt being cropped

$ cat ./sample1.txt.crp
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line four containing other session id: 2222-ABCD-1111-SOME-GUID
line five blaa blaa blaa
line six containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line seven containing other session id: 2222-ABCD-1111-SOME-GUID
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line nine containing other session id: 3333-ABCD-1111-SOME-GUID
line ten containing my session id: 1111-ABCD-1111-SOME-GUID
line eleven
line twelve containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line thirteen containing my session id: 1111-ABCD-1111-SOME-GUID

$ cat ./subdir/sample2.txt.crp
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
foo
bar
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
baz
line ten containing my session id: 1111-ABCD-1111-SOME-GUID

As you can see in the example above, my script found two files which matched, one of which was in a sub-directory below the top-level directory.
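For a single, known file, the same first-to-last crop can also be sketched without ex, using two awk passes over the file. This is only a comparison sketch, not part of the script above; the demo input below stands in for a real log file.

```shell
# Comparison sketch: pass 1 records the first and last line numbers
# containing $uid; pass 2 prints everything between them, inclusive.
# The demo file stands in for a real log.
uid="1111-ABCD-1111-SOME-GUID"
demo=$(mktemp)
printf '%s\n' "line one" "line two $uid" "line three" "line four $uid" "line five" > "$demo"
awk -v uid="$uid" '
    NR == FNR { if (index($0, uid)) { if (!first) first = FNR; last = FNR } next }
    FNR >= first && FNR <= last
' "$demo" "$demo" > "$demo.crp"
cat "$demo.crp"
```

For the demo input this prints lines two to four, inclusive. Reading the file twice trades a second pass for not having to buffer lines in memory.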

SiegeX
SessionId is like a GUID and appears many times in the file.
TiGz
That doesn't help. I need to see an ***example*** of what a sessionId can look like, and where and how it can change.
SiegeX
Well, an example is: 4934FF07-436E-8D2A-C7C2-A3328B371005_1260470734931_143. However, I don't see how that helps. Bear in mind that the log file contains many different session ids from many different sessions, but at any particular time I am only interested in one particular instance, i.e. the sid needs to be an input to the script (however I can do that manually if need be).
TiGz
@SiegeX: I may be wrong, but I think TiGz means he would like the regex to match the *files* to check, not the sessionId within the files.
Grundlefleck
I misunderstood your original 'bonus' question; I had thought that you wanted to use a regex to pick out the sessionId, not which file it was in. Question for you: can the relevant sessionId appear in a single file more than two times? Meaning I would need to skip past all the middle occurrences and crop until the very last occurrence?
SiegeX
Yes, the target session id will appear many, many times within the log file.
TiGz
A: 

I'd probably do this using cat and awk. Something like

cat *.log | awk 'BEGIN { sidFound = 0 } {
    if ($0 ~ /1111-ABCD-1111-SOME-GUID/) {   # your session id goes here
        sidFound = !sidFound;
    }

    if (sidFound) {
        print $0
    }
}'
Adam Luchjenbroers
cat is not needed.
He mentions in the question that he may need to scan multiple log files for this Session ID. Cat provides an easy way of scanning multiple files at once.
Adam Luchjenbroers
awk can take in file input as well. --> awk '{blah blah}' *.log
A: 

Either a few lines of Perl, or:

grep -no <session_ID> <log_file>

(make a note of the first and last line numbers with your session ID on)

awk 'NR==3,NR==935' <log_file>

(where 3 and 935 are the first and last line numbers returned from the grep command)

I can't currently think of a way to make that a one-liner.
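For reference, the two manual steps can be chained with command substitution; this is a sketch, and the demo log below stands in for a real log file.

```shell
# Chain the two steps: grep -n emits "line:match" pairs, head/tail/cut
# pull out the first and last matching line numbers, and awk prints that
# range inclusively. The demo file stands in for a real log.
sid="1111-ABCD-1111-SOME-GUID"
log=$(mktemp)
printf '%s\n' "aaa" "bbb $sid" "ccc" "ddd $sid" "eee" > "$log"
first=$(grep -n "$sid" "$log" | head -1 | cut -d: -f1)   # first matching line number
last=$(grep -n "$sid" "$log" | tail -1 | cut -d: -f1)    # last matching line number
awk -v a="$first" -v b="$last" 'NR >= a && NR <= b' "$log"
```

For the demo log this prints lines two to four, inclusive.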

Richard Gaywood
This is definitely not what he is looking for.
A: 

I'd propose something like this:

# Find all occurrences of the session id in the input file
grep -n "<session id>" "<input file>" > /tmp/grep.$$

# Get the first line number where the session id appears
FIRST_LINE=$(head -1 /tmp/grep.$$ | cut -d: -f1)

# Get the last line number where the session id appears
LAST_LINE=$(tail -1 /tmp/grep.$$ | cut -d: -f1)

# Display only the part (inclusive) between the first and last appearance
sed -n "${FIRST_LINE},${LAST_LINE}p" "<input file>"

This retrieves the line numbers of the first and last occurrences of your pattern in the input file, and then, using sed, displays only the lines in between (inclusive). It could be optimised further (e.g. by avoiding the temporary file), but it should work.
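The temporary file can be avoided by keeping grep's line numbers in a shell variable instead; this is only a sketch, and the demo input (with a made-up SID-42 id) stands in for "<input file>".

```shell
# Temp-file-free variant of the same idea: one grep pass collects all
# matching line numbers into a variable, head/tail pick the first and
# last, and sed prints the range. Demo input stands in for a real file.
in=$(mktemp)
printf '%s\n' "one" "two SID-42" "three" "four SID-42" "five" > "$in"
nums=$(grep -n "SID-42" "$in" | cut -d: -f1)
FIRST_LINE=$(printf '%s\n' "$nums" | head -1)
LAST_LINE=$(printf '%s\n' "$nums" | tail -1)
sed -n "${FIRST_LINE},${LAST_LINE}p" "$in"
```

For the demo input this prints lines two to four, inclusive.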

Zeograd
there's no need to open input file 3 times.
The first two openings can be merged if you keep the result of grep in a temporary file, OK. However, with linear stream-processing programs, I don't see how you can know in advance whether the current line must be printed or not, i.e. whether there's another occurrence of the session id later in the file. Keep in mind that it's the same pattern which starts AND ends the portion to dump, as far as I understood.
Zeograd
For this question, one has to have better control over file manipulation: opening the file, searching through it, storing lines in temporary memory, printing out when necessary, etc.
A: 
gawk 'BEGIN{ c = 0 }
/1111-ABCD-1111-SOME-GUID/{
    # On each occurrence of the session id: flush the buffered
    # in-between lines, then print the matching line itself.
    f = 1
    for (i = 1; i <= c; i++) print _[i]
    print
    delete _
    c = 0
}
# After the first match, buffer non-matching lines; they are only
# printed if another occurrence shows up later.
!/1111-ABCD-1111-SOME-GUID/ && f { _[++c] = $0 }
' file
A: 

The following Perl script (session_id.pl) does the job:

#!/usr/bin/perl

my $session_id = '1111-ABCD-1111-SOME-GUID';

while ( <> ) {
    if ( /$session_id/ ... /$session_id/ ) {
        print;
    }
}

Make it executable and run it:

./session_id.pl < session.data
fgm
A: 

What about:

sed -n "/$session_id/,/$session_id/p" file.txt

?

ADEpt