views: 242

answers: 7

On a Linux system, I have a very large text file and I need to create a new text file which contains every line between the first and last occurrences of a particular sessionId (those lines included).

I guess I probably need to use sed or something?

As a bonus, sometimes I won't know which log file will contain the session trace. So a script that can work with regular expressions would be ideal. In this case I would expect the script to find the first file with the sessionId in it and then crop that file before exiting.

Example log file, looking for sessionId 1111-ABCD-1111-SOME-GUID:

line one containing other session id: 2222-ABCD-1111-SOME-GUID blaa blaa blaa
line two blaa blaa blaa
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line four containing other session id: 2222-ABCD-1111-SOME-GUID
line five blaa blaa blaa
line six containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line seven containing other session id: 2222-ABCD-1111-SOME-GUID
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line nine containing other session id: 3333-ABCD-1111-SOME-GUID
line ten containing my session id: 1111-ABCD-1111-SOME-GUID
line eleven
line twelve containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line thirteen containing my session id: 1111-ABCD-1111-SOME-GUID
line fourteen blaa blaa blaa
line fifteen containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa

The output file would contain lines three to thirteen, inclusive.

A: 

The following script will do all of what you asked, including the bonus. Put it in the top-level directory that contains all the possible files with the 'uid' you want to crop. The script will recursively search that directory, crop every file that matches, and put the result in a new file with a .crp extension (see example below). I took special care to make sure this script works with whatever filename you throw at it, whether it contains spaces, newlines, or anything else in its name.

#!/bin/bash
uid="1111-ABCD-1111-SOME-GUID"

# grep -lZR: recursively (-R) list the names (-l) of files containing $uid,
# NUL-delimited (-Z), skipping this script itself and any previous .crp output.
while IFS= read -r -d $'\0' file; do
    # Three ex commands, fed via printf:
    #   ?$uid?+1,$d   delete from just after the LAST occurrence to EOF
    #   1,/$uid/-1d   delete from line 1 to just before the FIRST occurrence
    #   %p            print the surviving lines to stdout
    printf "%s\n" "?$uid?+1,\$d" "1,/$uid/-1d" "%p" | ex -s "$file" > "$file".crp
    echo "$file being cropped"
done < <(grep -lZR --exclude="${0#*/}" --exclude="*.crp" "$uid" .)

Result

$ ./uid.sh
./sample1.txt being cropped
./subdir/sample2.txt being cropped

$ cat ./sample1.txt.crp
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line four containing other session id: 2222-ABCD-1111-SOME-GUID
line five blaa blaa blaa
line six containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line seven containing other session id: 2222-ABCD-1111-SOME-GUID
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
line nine containing other session id: 3333-ABCD-1111-SOME-GUID
line ten containing my session id: 1111-ABCD-1111-SOME-GUID
line eleven
line twelve containing other session id: 3333-ABCD-1111-SOME-GUID blaa blaa blaa
line thirteen containing my session id: 1111-ABCD-1111-SOME-GUID

$ cat ./subdir/sample2.txt.crp
line three containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
foo
bar
line eight containing my session id: 1111-ABCD-1111-SOME-GUID blaa blaa blaa
baz
line ten containing my session id: 1111-ABCD-1111-SOME-GUID

As you can see in the example above, my script found two files which matched, one of which was in a sub-directory below the top-level directory.
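For a single, known file, the same first-to-last crop can also be sketched without ex, using two awk passes over the file. This is only a comparison sketch, not part of the script above; the demo input below stands in for a real log file.

```shell
# Comparison sketch: pass 1 records the first and last line numbers
# containing $uid; pass 2 prints everything between them, inclusive.
# The demo file stands in for a real log.
uid="1111-ABCD-1111-SOME-GUID"
demo=$(mktemp)
printf '%s\n' "line one" "line two $uid" "line three" "line four $uid" "line five" > "$demo"
awk -v uid="$uid" '
    NR == FNR { if (index($0, uid)) { if (!first) first = FNR; last = FNR } next }
    FNR >= first && FNR <= last
' "$demo" "$demo" > "$demo.crp"
cat "$demo.crp"
```

For the demo input this prints lines two to four, inclusive. Reading the file twice trades a second pass for not having to buffer lines in memory.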

SiegeX
SessionId is like a GUID and appears many times in the file.
TiGz
That doesn't help. I need to see an ***example*** of what a sessionId can look like, and where and how it can change.
SiegeX
Well, an example is: 4934FF07-436E-8D2A-C7C2-A3328B371005_1260470734931_143. However, I don't see how that helps. Bear in mind that the log file contains many different session ids from many different sessions, but at any particular time I am only interested in one particular instance, i.e. the sid needs to be an input to the script (however I can do that manually if need be).
TiGz
@SiegeX: I may be wrong, but I think TiGz means he would like the regex to match the *files* to check, not the sessionId within the files.
Grundlefleck
I misunderstood your original 'bonus' question; I had thought that you wanted to use a regex to pick out the sessionId, not which file it was in. Question for you: can the relevant sessionId appear in a single file more than two times? Meaning I would need to skip past all the middle occurrences and crop until the very last occurrence?
SiegeX
Yes, the target session id will appear many, many times within the log file.
TiGz
A: 

I'd probably do this using cat and awk. Something like

cat *.log | awk 'BEGIN { sidFound = 0 } {
    if ($0 ~ /1111-ABCD-1111-SOME-GUID/) {   # your session id goes here
        sidFound = !sidFound;
    }

    if (sidFound) {
        print $0
    }
}'
Adam Luchjenbroers
cat is not needed.
He mentions in the question that he may need to scan multiple log files for this Session ID. Cat provides an easy way of scanning multiple files at once.
Adam Luchjenbroers
awk can take in file input as well. --> awk '{blah blah}' *.log
A: 

Either a few lines of Perl, or:

grep -no <session_ID> <log_file>

(make a note of the first and last line numbers with your session ID on)

awk 'NR==3,NR==935' <log_file>

(where 3 and 935 are the first and last line numbers returned from the grep command)

I can't currently think of a way to make that a one-liner.
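For reference, the two manual steps can be chained with command substitution; this is a sketch, and the demo log below stands in for a real log file.

```shell
# Chain the two steps: grep -n emits "line:match" pairs, head/tail/cut
# pull out the first and last matching line numbers, and awk prints that
# range inclusively. The demo file stands in for a real log.
sid="1111-ABCD-1111-SOME-GUID"
log=$(mktemp)
printf '%s\n' "aaa" "bbb $sid" "ccc" "ddd $sid" "eee" > "$log"
first=$(grep -n "$sid" "$log" | head -1 | cut -d: -f1)   # first matching line number
last=$(grep -n "$sid" "$log" | tail -1 | cut -d: -f1)    # last matching line number
awk -v a="$first" -v b="$last" 'NR >= a && NR <= b' "$log"
```

For the demo log this prints lines two to four, inclusive.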

Richard Gaywood
This is definitely not what he is looking for.
A: 

I'd propose something like this:

# Find all occurrences of the session id in the input file
grep -n "<session id>" "<input file>" > /tmp/grep.$$

# Get the first line number where the session id appears
FIRST_LINE=$(head -1 /tmp/grep.$$ | cut -d: -f1)

# Get the last line number where the session id appears
LAST_LINE=$(tail -1 /tmp/grep.$$ | cut -d: -f1)

# Display only the part (inclusive) between the first and last appearance
sed -n "${FIRST_LINE},${LAST_LINE}p" "<input file>"

This retrieves the line numbers of the first and last occurrences of your pattern in the input file, and then, using sed, displays only the lines in between (inclusive). It could be optimised further (e.g. by avoiding the temporary file), but it should work.
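The temporary file can be avoided by keeping grep's line numbers in a shell variable instead; this is only a sketch, and the demo input (with a made-up SID-42 id) stands in for "<input file>".

```shell
# Temp-file-free variant of the same idea: one grep pass collects all
# matching line numbers into a variable, head/tail pick the first and
# last, and sed prints the range. Demo input stands in for a real file.
in=$(mktemp)
printf '%s\n' "one" "two SID-42" "three" "four SID-42" "five" > "$in"
nums=$(grep -n "SID-42" "$in" | cut -d: -f1)
FIRST_LINE=$(printf '%s\n' "$nums" | head -1)
LAST_LINE=$(printf '%s\n' "$nums" | tail -1)
sed -n "${FIRST_LINE},${LAST_LINE}p" "$in"
```

For the demo input this prints lines two to four, inclusive.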

Zeograd
there's no need to open input file 3 times.
The first two openings can be merged if you keep the result of grep in a temporary file, OK. However, with linear stream-processing programs, I don't see how you can know in advance whether the current line must be printed or not, i.e. whether there's another occurrence of the session id later in the file. Keep in mind that it's the same pattern which starts AND ends the portion to dump, as far as I understood.
Zeograd
For this question, one has to have better control over file manipulation: opening the file, searching through it, storing lines in temporary memory, printing out when necessary, etc.
A: 
gawk 'BEGIN{ c = 0 }
/1111-ABCD-1111-SOME-GUID/{
    # On each occurrence of the session id: flush the buffered
    # in-between lines, then print the matching line itself.
    f = 1
    for (i = 1; i <= c; i++) print _[i]
    print
    delete _
    c = 0
}
# After the first match, buffer non-matching lines; they are only
# printed if another occurrence shows up later.
!/1111-ABCD-1111-SOME-GUID/ && f { _[++c] = $0 }
' file
A: 

The following Perl script (session_id.pl) does the job:

#!/usr/bin/perl

my $session_id = '1111-ABCD-1111-SOME-GUID';

while ( <> ) {
    if ( /$session_id/ ... /$session_id/ ) {
        print;
    }
}

Make it executable and run it:

./session_id.pl < session.data
fgm
A: 

What about:

sed -n "/$session_id/,/$session_id/p" file.txt

?

ADEpt