I want to parse a log file (log.txt) which contains rows similar to these:

2010-10-19 07:56:14 URL:http://www.website.com/page.php?ID=26 [13676] -> "www.website.com/page.php?ID=26" [1]
2010-10-19 07:56:14 URL:http://www.website.com/page.php?ID=44 [14152] -> "www.website.com/page.php?ID=44" [1]
2010-10-19 07:56:14 URL:http://www.website.com/page.php?ID=13 [13681] -> "www.website.com/page.php?ID=13" [1]
2010-10-19 07:56:14 ERROR:Something bad happened
2010-10-19 07:56:14 ERROR:Something really bad happened
2010-10-19 07:56:15 URL:http://www.website.com/page.php?ID=14 [12627] -> "www.website.com/page.php?ID=14" [1]
2010-10-19 07:56:14 ERROR:Page not found
2010-10-19 07:56:15 URL:http://www.website.com/page.php?ID=29 [13694] -> "www.website.com/page.php?ID=29" [1]

As you might have guessed:

1) I need to extract this portion from each row:

2010-10-19 07:56:15 URL:http://www.website.com/page.php?ID=29 [13694] -> "www.website.com/page.php?ID=29" [1]
------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2) This portion goes to another file (log.html) like this:

<a href="http://www.website.com/page.php?ID=29">http://www.website.com/page.php?ID=29</a>

I need to do this with a bash script that will run on a *nix platform. I have no experience with shell programming, so a detailed script would be much appreciated; pointers to a bash programming reference would also help.

+2  A: 

Here's a small awk script that should do what you need.

awk '/URL:/ { sub(/^URL:/, "", $3); printf "<a href=\"%s\">%s</a>\n", $3, $3 }' log.txt
ar
+1  A: 

This should work:

sed -n 's%^.* URL:\(.*\) \[[0-9]*\] -> .*$%<a href="\1">\1</a>%p' log.txt
mouviciel
Do you *really* need the backslash before round brackets?
Salman A
With `sed`, yes I do.
mouviciel
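
To illustrate the point about backslashes: `sed` uses basic regular expressions by default, where a capture group is written `\( ... \)`. The sketch below assumes a `sed` that accepts the `-E` option (GNU and BSD `sed` do), which switches to extended regular expressions so the parentheses need no backslashes. The variable names `line`, `bre` and `ere` are only for illustration.

```shell
#!/bin/bash
# Sample line taken from the log in the question.
line='2010-10-19 07:56:14 URL:http://www.website.com/page.php?ID=26 [13676] -> "www.website.com/page.php?ID=26" [1]'

# Basic regular expressions (the default): groups are written \( ... \).
bre=$(printf '%s\n' "$line" | sed -n 's%^.* URL:\(.*\) \[[0-9]*\] -> .*$%<a href="\1">\1</a>%p')

# Extended regular expressions (-E): groups are written ( ... ) without backslashes.
ere=$(printf '%s\n' "$line" | sed -E -n 's%^.* URL:(.*) \[[0-9]*\] -> .*$%<a href="\1">\1</a>%p')

# Both variables should hold the same link.
printf '%s\n%s\n' "$bre" "$ere"
```

Both commands should print the same link, so the choice is mostly a matter of portability and readability.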
+1  A: 

What about sed:

sed -n 's/.*URL:\([^ ]\+\) .*/<a href="\1">\1<\/a>/;/<a href/p' logfile

(Please note: you could anchor the URL match more precisely, e.g. by using the length of the date string in front of it, but I was just lazy.)
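
A minimal sketch of that stricter variant, assuming every line starts with a 19-character timestamp followed by one space, as in the sample log. The variable names `sample` and `links` are hypothetical; in practice `sed` would read the log file directly.

```shell
#!/bin/bash
# Sample lines from the question; in practice sed would read the log file directly.
sample='2010-10-19 07:56:14 URL:http://www.website.com/page.php?ID=26 [13676] -> "www.website.com/page.php?ID=26" [1]
2010-10-19 07:56:14 ERROR:Something bad happened'

# ^.\{20\} skips the 19-character timestamp plus the space after it,
# so URL: has to start at column 21 of the line.
links=$(printf '%s\n' "$sample" | sed -n 's/^.\{20\}URL:\([^ ]*\) .*/<a href="\1">\1<\/a>/p')

printf '%s\n' "$links"
```

Lines where `URL:` appears anywhere other than right after the timestamp are skipped by this version.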

Zsolt Botykai
+2  A: 

Here's a bash solution

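#!/bin/bash
# Read log.txt through file descriptor 4.
exec 4<"log.txt"
while IFS= read -r line <&4
do
  case "$line" in
    *URL:* )
      url="${line#*URL:}"   # drop everything up to and including "URL:"
      url=${url%% \[*}      # drop everything from the first " [" onward
      echo "<a href=\"${url}\">${url}</a>"
      ;;
  esac
done
exec 4<&-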
ghostdog74
+1  A: 

Something like this:

while IFS= read -r line
do
        # Quote "$line" so that the ? in the URL is not treated as a glob character.
        URL=$(echo "$line" | egrep -o 'URL:[^ ]+' | sed 's/^URL://')
        if [ -n "$URL" ]; then
                echo "<a href=\"$URL\">$URL</a>" >> output.txt
        fi
done < input.txt
codaddict
Using `egrep` to read the file is faster than the outer while loop: `egrep -o 'URL:[^ ]+' input.txt | sed ..... | while read ....`. By the way, `egrep` is `grep -E` now.
ghostdog74
@ghostdog74: Thanks for the `egrep` tip. But I didn't get the first part.
codaddict
You have an outer while read loop that iterates over the file, and for each line you are calling two external commands, `egrep` and `sed`, through pipes. This is an expensive operation. Hence the suggestion to let `egrep` iterate over the file instead, since it's optimized to go over files, large or small, more efficiently. And no, your script is not wrong, just not optimized in terms of speed, that's all. :)
ghostdog74
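
The suggestion above could be sketched roughly as follows. It assumes a `grep` that supports the `-o` option (GNU and BSD `grep` do), and it builds a small temporary sample file in place of input.txt; the names `input` and `links` are hypothetical.

```shell
#!/bin/bash
# Hypothetical sample input; the answer above reads input.txt instead.
input=$(mktemp)
cat > "$input" <<'EOF'
2010-10-19 07:56:14 URL:http://www.website.com/page.php?ID=26 [13676] -> "www.website.com/page.php?ID=26" [1]
2010-10-19 07:56:14 ERROR:Something bad happened
2010-10-19 07:56:15 URL:http://www.website.com/page.php?ID=14 [12627] -> "www.website.com/page.php?ID=14" [1]
EOF

# grep scans the whole file once and sed strips the URL: prefix once;
# the loop only formats the URLs that were found.
links=$(grep -Eo 'URL:[^ ]+' "$input" | sed 's/^URL://' |
  while IFS= read -r url; do
    printf '<a href="%s">%s</a>\n' "$url" "$url"
  done)

printf '%s\n' "$links"
rm -f "$input"
```

Here `grep` and `sed` each start once for the whole file rather than once per line, which is where the speed-up comes from.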
@ghostdog74: Got it, thanks :)
codaddict
I used part of this script to post-process the file.
Salman A