I need help getting a regex to pick out, from an Apache access log, only the referrers that come from real off-site links followed by real people rather than by bots or spiders. I'm working in Perl.
This code almost works already (the access log is open on the filehandle $fh):
my $totalreferals = 0;
while ( my $line = <$fh> ) {
    # Count the line unless it has no referrer or an internal one.
    if ($line !~ m!
            \[\d{2}/\w{3}/\d{4}(?::\d\d){3}.+?\]
            \s"GET\s\S+\sHTTP/\d\.\d"
            \s\S+
            \s\S+
            \s("-"|"http://(www\.|)mywebsite\.com.*")
        !xi
    )
    {
        $totalreferals++;
    }
    # Only continue for Google searches; capture date, path, status and query.
    $line =~ m!
        \[(\d{2}/\w{3}/\d{4})(?::\d\d){3}.+?\]
        \s"GET\s(\S+)\sHTTP/\d\.\d"
        \s(\S+)
        \s\S+
        \s"http://w{1,3}\.google\.
        (?:[a-z]{2}|com?\.[a-z]{2}|com)\.?/
        [^\"]*q=([^\"&]+)[^\"]*"
    !xi or next;
    my ( $datestr, $path, $status, $query ) = ( $1, $2, $3, $4 );
    .
    .
    #do other stuff
    .
    .
}
The first regex above successfully excludes internal links recorded in the access_log, plus records that have no referrer, but the resulting $totalreferals is still far too large.
Examples of log lines ($line) that the first regex counts but that I want excluded:
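One way to see where the excess comes from is to tally the counted lines by referring host. The following is only a minimal sketch, assuming the combined log format shown in this post; tally_referrer_hosts is a hypothetical helper name, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical diagnostic helper (an assumption, not from the original code):
# count log lines per referring host so the biggest contributors to the
# total can be inspected. Lines with a "-" referrer are skipped.
sub tally_referrer_hosts {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        # The referrer is the first quoted field after the status and byte count.
        my ($referrer) = $line =~ m!"\s\d{3}\s\S+\s"([^"]*)"! or next;
        # Keep only the host part; a "-" referrer fails this match and is skipped.
        my ($host) = $referrer =~ m!^https?://([^/:]+)!i or next;
        $count{ lc $host }++;
    }
    return \%count;
}

# Possible usage, with $fh already open on the access log:
#   my $count = tally_referrer_hosts($fh);
#   printf "%6d  %s\n", $count->{$_}, $_
#       for sort { $count->{$b} <=> $count->{$a} } keys %$count;
```

Sorting the result by count should make image hotlinkers and lookup services stand out quickly.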
61.247.221.45 - - [02/Jan/2009:20:51:41 -0600] "GET /oil-paintings/section.php/2451/0 HTTP/1.1" 200 85856 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"
-- Appears to be a spider from Korea
93.84.41.131 - - [31/Dec/2008:02:36:54 -0600] "GET /paintings/artists/w/Waterhouse_John_William/oil-big/Waterhouse_Destiny.jpg HTTP/1.1" 200 19924 "http://smrus.web-box.ru/Schemes" "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"
-- Request is for an image embedded within another website (we allow this)
87.115.8.230 - - [31/Dec/2008:03:08:17 -0600] "GET /paintings/artists/recently-added/july2008/big/Crucifixion-of-St-Peter-xx-Guido-Reni.JPG HTTP/1.1" 200 37348 "http://images.google.co.uk/im........DN&frame=small" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"
-- Request is from Google Images (could be viewing the image full-size, or spidering it)
216.145.5.42 - - [31/Dec/2008:02:21:49 -0600] "GET / HTTP/1.1" 200 53508 "http://whois.domaintools.com/mywebsite.com" "SurveyBot/2.3 (Whois Source)"
-- Request is from a whois bot
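
The exclusions listed above could be tested one field at a time instead of in a single large regex. What follows is only a sketch under stated assumptions: is_real_referral is a hypothetical helper name, and both the bot keyword list and the image-extension list are illustrative guesses rather than a complete or authoritative set.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): returns 1 when a
# combined-format log line looks like a page view referred from another
# site by a person, and 0 otherwise. The keyword and extension lists
# below are illustrative assumptions.
sub is_real_referral {
    my ($line) = @_;

    # Pull out the request path, status, referrer and user agent.
    my ( $path, $status, $referrer, $agent ) = $line =~ m!
        \]\s"GET\s(\S+)\sHTTP/\d\.\d"   # request line after the timestamp
        \s(\d{3})                       # status code
        \s\S+                           # bytes sent
        \s"([^"]*)"                     # referrer
        \s"([^"]*)"                     # user agent
    !x or return 0;

    return 0 if $referrer eq '-';                                   # no referrer
    return 0 if $referrer =~ m!^http://(?:www\.)?mywebsite\.com!i;  # internal link
    return 0 if $path     =~ m!\.(?:jpe?g|gif|png)$!i;              # embedded or hotlinked image
    return 0 if $agent    =~ m!bot|spider|crawl|yeti|slurp!i;       # self-identified robots
    return 1;
}

# Possible usage inside the existing loop:
#   $totalreferals++ if is_real_referral($line);
```

Keeping each test on its own line makes it easy to add or drop a criterion, for example when image requests should be counted after all. Robots that send a browser-like user agent would still get through a check like this.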