Let's say I have something like this (this is only an example; the actual requests will be different. I loaded Stack Overflow with LiveHTTPHeaders enabled to get some samples to work on):

http://stackoverflow.com/

GET / HTTP/1.1
Host: stackoverflow.com
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 200 OK
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Expires: Sat, 28 Nov 2009 16:04:24 GMT
Vary: Accept-Encoding
Server: Microsoft-IIS/7.0
Date: Sat, 28 Nov 2009 16:04:23 GMT
Content-Length: 19015
----------------------------------------------------------
...

Full log of requests and responses is available on pastebin

I want to remove all responses (for example, HTTP/1.x 200 OK and everything in that response) and all the one-line entries showing the page address. Only the requests should remain in the text file holding the saved LiveHTTPHeaders output.

So, the output would be:

GET / HTTP/1.1
Host: stackoverflow.com
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

GET /so/all.css?v=5290 HTTP/1.1
Host: sstatic.net
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2
Accept: text/css,*/*;q=0.1
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://stackoverflow.com/

...

Again, the full text of what I want to keep is available on pastebin.

If I save a captured LiveHTTPHeaders session to a text file, how do I get a result like the second code sample in this question? Maybe with awk, sed, or perl? Or something else? I'm on Linux.


Edit: I'm trying to run Sinan's script. Script is this:

#!/usr/bin/perl
local $/ = "\n\n";
while (<>) {
    print if /^GET|POST/; # Add more request types as needed
}

I tried running it this way:

./cleanup-headers.pl livehttp.txt > filtered.txt

And this way:

perl cleanup-headers.pl < livehttp.txt > filtered.txt

... the file filtered.txt was created, but it's totally empty.

Has anyone tried it on the FULL headers I pasted into pastebin? Did it work?

Full headers

+5  A: 

In Perl:

local $/ = "\n\n";
while (<>) {
    print if /^(?:GET|POST)/; # Add more request types as needed
}

Notes: Looking at the output generated by LiveHTTPHeaders, entries are quite clearly separated by two newlines, so I think setting $/ = "\n\n" is more appropriate than setting $/ = ''. I believe your problems were due to the fact that the lines in your input file were actually indented.
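A small shell sketch of the matching issue, using `grep -E`, whose extended regular expressions give alternation the same low precedence as Perl does. The `matches` helper and the sample strings are made up for illustration; they are not taken from the pastebin log.

```shell
#!/bin/sh
# matches STRING REGEX: print "yes" if the extended regex matches, else "no".
# (Hypothetical helper, for illustration only.)
matches() {
    if printf '%s\n' "$1" | grep -Eq -- "$2"; then echo yes; else echo no; fi
}

# Without grouping, ^ binds only to GET, so POST matches anywhere in the line.
matches 'foo POST' '^GET|POST'

# With grouping, both alternatives are anchored to the start of the line.
matches 'foo POST' '^(GET|POST)'

# An indented request line defeats the anchored pattern ...
matches '   GET / HTTP/1.1' '^(GET|POST)'

# ... unless leading whitespace is explicitly allowed.
matches '   GET / HTTP/1.1' '^[[:space:]]*(GET|POST)'
```

If the records in the saved file are indented, an anchored pattern without the whitespace allowance would be expected to match nothing, which is consistent with an empty output file.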

I did originally download the file from pastebin and use the full file to test my script. I do not believe the file you were using to test on your computer was identical to the one you put on pastebin.

If you want to robustly deal with possibly indented lines while remaining consistent with the format of the output of LiveHTTPHeaders, you should use something like the following:

#!/usr/bin/perl

use strict; use warnings;

local $/ = "\n\n";
while (<>) {
    next unless /^\s*(?:GET|POST)/;
    s!^\s+!!gm;
    print;
}

I consider using sed and perl in the same pipeline to be a little bit of an abomination.

Sinan Ünür
That's really clever idea of accomplishing this, I wouldn't think about it :). Thanks a lot, I'll try it!
Tim
...but this prints whole section or just one line that has get or post in it?
Tim
`./test.pl > livehttp.txt` and then it starts to work, won't end (I don't get the console back), and when I look into the file all contents are deleted :(
Tim
@Tim The output file cannot be the same as the input file: `./test.pl input.log > filtered.log`. There is nothing in the program that overwrites anything. The program outputs one request section at a time.
Sinan Ünür
./cleanup-headers.pl livehttp.txt > filtered.txt and filtered.txt is created by this and is empty file... i seriously don't know what's wrong here
Tim
@Tim post the exact contents of the script you are running in **your question** as well as the exact command line by copying and pasting, not retyping.
Sinan Ünür
Ok, I edited it into my question.
Tim
...are you sure your code is correct? It only creates empty file or removes file content depending on how it's run.
Tim
Assigning the empty string to `$/` puts `<>` in paragraph mode.
Greg Bacon
You also need parens to have GET and POST share the anchor: `perl -le 'print "Match" if "foo POST" =~ /^GET|POST/'`
Greg Bacon
A: 

Run Sinan's code as perl test.pl < infile.txt > outfile.txt

clscott
Same - filtered.txt is totally empty. Now i feel dumb, I can't even run script :/
Tim
Has anyone tried it on the FULL headers I pasted into pastebin? Did it work?
Tim
There's no difference between the two. From the "I/O Operators" section of perlop: "The null filehandle `<>` is special: it can be used to emulate the behavior of `sed` and `awk`. Input from `<>` comes either from standard input, or from each file listed on the command line." http://perldoc.perl.org/perlop.html#I%2fO-Operators
Greg Bacon
+1  A: 

Looks like you're having trailing whitespace issues.

$ sed -e 's/^\s*$//' livehttp.txt | \
  perl -e '$/ = ""; while (<>) { print if /^(GET|POST)/ }'

This works by putting Perl's readline operator into paragraph mode (via $/ = ""), which grabs records a chunk at a time, separated by two or more consecutive newlines.

It's nice when it works, but it's a bit brittle. Blank but not empty lines will gum up the works, but sed can clean those up.
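To see the brittleness in isolation, here is a minimal sketch that counts paragraph-mode records with awk, whose empty `RS` behaves like Perl's `$/ = ""` for this purpose. The `count_records` helper and the two-request sample input are invented for illustration.

```shell
#!/bin/sh
# count_records: read stdin in paragraph mode (RS set to the empty string)
# and print how many records were seen. Hypothetical helper for illustration.
count_records() {
    awk 'BEGIN { RS = "" } { n++ } END { print n + 0 }'
}

# Two requests separated by a truly empty line: counted as two records.
printf 'GET / HTTP/1.1\nHost: a\n\nGET /x HTTP/1.1\nHost: b\n' | count_records

# Same input, but the separator line holds a single space: the records merge.
printf 'GET / HTTP/1.1\nHost: a\n \nGET /x HTTP/1.1\nHost: b\n' | count_records

# Emptying whitespace-only lines first restores the separation.
printf 'GET / HTTP/1.1\nHost: a\n \nGET /x HTTP/1.1\nHost: b\n' |
    sed 's/^[[:space:]]*$//' | count_records
```

The sketch uses `[[:space:]]` rather than `\s` because `\s` in sed is a GNU extension; on Linux either form should work.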

Equivalent and more concise command:

$ sed -e 's/^\s*$//' livehttp.txt | perl -000 -ne 'print if /^(GET|POST)/'
Greg Bacon
Works great! Thank you.
Tim
You're welcome!
Greg Bacon
no need to use Perl together with sed. sed is useless.
A: 

just one gawk command

awk -vRS= '/^(GET|POST)/' ORS="\n\n" file
ghostdog74
A: 

you can use the bash shell

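flag=0    # 1 while inside a request block; set up front so the test below never sees an empty value
while read -r line
do
    case "$line" in
        GET*|POST*) flag=1;;    # a request line starts a block
        "") flag=0;;            # a blank line ends it
    esac
    [ "$flag" -eq 1 ] && echo "$line"
done < "file"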