Tcpflow outputs a bunch of files, many of which are HTTP responses from a web server. Inside, they contain HTTP headers, including Content-Type: and other important ones. I'm trying to write a script that can extract just the payload data (e.g. image/jpeg, text/html, et al.) and save it to a file [optionally with an appropriate name and file extension].

The EOL chars are \r\n (CRLF), which makes the files difficult to process with the usual GNU tools (in my experience).

I've been trying something along the lines of:

sed '/HTTP/,/^$/d'

to delete all text from the beginning of the HTTP headers (inclusive) to the blank \r\n\r\n separator line (inclusive), but I've had no luck. I'm looking for help from anyone with good experience in sed and/or awk. I have zero experience with Perl, and I'd prefer to use common GNU command-line utilities for this.
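For what it's worth, one way this can be made to work, assuming GNU sed (which accepts \r in a regex): under CRLF the header/body separator is not truly empty, it still contains a carriage return, so /^$/ never matches; match the \r explicitly instead. The filenames below are placeholders, not from the question:

```shell
# Delete everything from the first line through the blank separator
# line, which under CRLF is a line containing only \r. Starting the
# range at line 1 (rather than /HTTP/) avoids re-triggering the range
# on body lines that happen to contain "HTTP".
# "response.raw" and "payload.bin" are hypothetical filenames.
sed '1,/^\r$/d' response.raw > payload.bin
```

Note this handles a single response per file; tcpflow files holding several pipelined responses would need the range applied repeatedly.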

Find a sample tcpflow output file here.

Thanks,
Felipe

+1  A: 

This article recommends running foremost on output from tcpflow to extract the images. It's available at that link and in the repositories of (at least) Debian, Fedora and Ubuntu.

I tried it on the sample file you linked to and it seemed to work fine.

foremost -i tcpflow.out

It created a directory called "output" with subdirectories called "gif" and "jpeg" with files in each. The names of the files don't match the filenames in the headers, though.

To change the line endings of your files, run:

dos2unix filename

or in a pipe:

dos2unix < filename | nextcommand
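If dos2unix isn't installed, tr does the same job with nothing but POSIX tools (a common equivalent, not part of the answer above). "tcpflow.out" is the sample filename from the foremost command; the output name is arbitrary:

```shell
# Strip carriage returns (CRLF -> LF) from a tcpflow output file.
tr -d '\r' < tcpflow.out > tcpflow.unix
```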


Dennis Williamson
foremost is excellent! Thanks for the tips.
Felipe Alvarez