I need some help getting a regex working to parse all referrers from an Apache access log file that come from real offsite links, i.e. valid referrals from real people rather than bots or spiders. I'm working in Perl.

This bit of code almost works already [the access log is opened with the filehandle $fh]:

my $totalreferals = 0;
while ( my $line = <$fh> ) {
    if ($line !~ m!

        \[\d{2}/\w{3}/\d{4}(?::\d\d){3}.+?\]
        \s"GET\s\S+\sHTTP/\d.\d"
        \s\S+
        \s\S+
        \s("-"|"http://(www\.|)mywebsite\.com.*"                

        !xi
        )
    {
          $totalreferals++;  
    }

    $line =~ m!

        \[(\d{2}/\w{3}/\d{4})(?::\d\d){3}.+?\]
        \s"GET\s(\S+)\sHTTP/\d.\d"
        \s(\S+)
        \s\S+
        \s"http://w{1,3}\.google\.
        (?:[a-z]{2}|com?\.[a-z]{2}|com)\.?/
        [^\"]*q=([^\"&]+)[^\"]*"

    !xi or next;

    my ( $datestr, $path, $status, $query ) = ( $1, $2, $3, $4 );
    .
    .
    #do other stuff  
    .
    .
}

The above regex successfully eliminates all internal links recorded in the access_log, plus records that don't have a referrer at all, but the $totalreferals count it produces is still far too large.

Examples of log lines ($line) that are being counted by the first regex, but which I want excluded, are:

61.247.221.45 - - [02/Jan/2009:20:51:41 -0600] "GET /oil-paintings/section.php/2451/0 HTTP/1.1" 200 85856 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"

-- Appears to be a spider from Korea


93.84.41.131 - - [31/Dec/2008:02:36:54 -0600] "GET /paintings/artists/w/Waterhouse_John_William/oil-big/Waterhouse_Destiny.jpg HTTP/1.1" 200 19924 "http://smrus.web-box.ru/Schemes" "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"

-- Request is for an image embedded within another website (we allow this)


87.115.8.230 - - [31/Dec/2008:03:08:17 -0600] "GET /paintings/artists/recently-added/july2008/big/Crucifixion-of-St-Peter-xx-Guido-Reni.JPG HTTP/1.1" 200 37348 "http://images.google.co.uk/im........DN&frame=small" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"

-- Request is from google images (could be viewing the image full-size, or spidering it)


216.145.5.42 - - [31/Dec/2008:02:21:49 -0600] "GET / HTTP/1.1" 200 53508 "http://whois.domaintools.com/mywebsite.com" "SurveyBot/2.3 (Whois Source)"

-- Request is from a whois bot


+3  A: 

Unless you have some really weird requirement to reinvent the wheel,

http://search.cpan.org/search?query=apache+log&mode=all
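
For instance, one module in that vein is Apache::LogRegex. A rough sketch, assuming the standard "combined" LogFormat and that parse() returns a hash keyed by the format directives (check the module's docs before relying on this):

use strict;
use warnings;
use Apache::LogRegex;

# Assumed: the log was written with the standard "combined" LogFormat.
my $format = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"';
my $lr     = Apache::LogRegex->new($format);

open my $fh, '<', 'access_log' or die "can't open log: $!";

while ( my $line = <$fh> ) {
    my %data = $lr->parse($line) or next;       # skip lines that don't match the format
    my $referer    = $data{'%{Referer}i'};      # referrer field, already unquoted
    my $user_agent = $data{'%{User-agent}i'};   # user-agent field
    # ... apply the referrer/bot filtering here
}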

derobert
My requirements are really weird.
rwired
+1  A: 

I think your problem is here:

\s"http://w{0,3}\.mywebsite\.com[^\"]*"

This will not catch the "http://mywebsite.com" case because it will always require a dot before "mywebsite".
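
One way to write that part so the "www." is genuinely optional, as a rough sketch (the filename and counter variable are just placeholders for whatever you already use):

# Sketch: "-" means no referrer was sent; making "www." optional also catches
# referrers like "http://mywebsite.com/foo".
my $internal = qr{"(?:-|http://(?:www\.)?mywebsite\.com[^"]*)"}i;

open my $fh, '<', 'access_log' or die "can't open log: $!";
my $totalreferals = 0;
while ( my $line = <$fh> ) {
    next if $line =~ /\s$internal\s/;   # skip no-referrer and internal-referrer lines
    $totalreferals++;                   # everything else counts as an external referral
}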

Also, you are only excluding GET requests. What about POST and HEAD?

Edit: If you still get numbers that seem wrong, you should definitely capture the referrer with your regex and print it.
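
Something along these lines, as a quick debugging sketch (it assumes the standard combined format, where the referrer is the second-to-last quoted field):

open my $fh, '<', 'access_log' or die "can't open log: $!";

while ( my $line = <$fh> ) {
    # Referrer is the second-to-last quoted field, user-agent the last one.
    next unless $line =~ /"([^"]*)"\s+"[^"]*"\s*$/;
    my $referer = $1;

    # Print only the referrers you would currently be counting,
    # so you can see what is inflating the total.
    print "counted: $referer\n"
        unless $referer eq '-'
            or $referer =~ m{^http://(?:www\.)?mywebsite\.com}i;
}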

innaM
You are right, thanks. I noticed that and fixed it. I also noticed I wasn't taking into account https://, which some of my local referrals were using. Also, I wasn't eliminating "-", which means there wasn't a referral at all. I'm now getting a total of 436296 for my 6-day sample. Still too big.
rwired
OK. So could you update your question accordingly?
innaM
A: 

For regexes matching URIs, try Regexp::Common (or more specifically Regexp::Common::URI::http).
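
For instance, a minimal sketch (assuming $RE{URI}{HTTP}, the pattern Regexp::Common::URI::http provides, fits your referrer URLs):

use strict;
use warnings;
use Regexp::Common qw(URI);

open my $fh, '<', 'access_log' or die "can't open log: $!";

while ( my $line = <$fh> ) {
    # Only proceed if the second-to-last quoted field is a well-formed http URI.
    next unless $line =~ /"($RE{URI}{HTTP})"\s+"[^"]*"\s*$/;
    my $referer = $1;
    # ... filter or count $referer here
}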

Joe Casadonte
Parsing the available data isn't the problem I'm trying to solve. I'm trying to match the records I don't want to count, because they aren't referrals from other websites clicked on by real people.
rwired
+1  A: 

Manni's suggestion to also eliminate POST and HEAD was indeed correct, because I'm looking for negative matches (so I shouldn't restrict the pattern to GET, as I do when parsing for query strings). Likewise for the error of requiring a dot before the host when there is no www, and the need to eliminate "-" (no referrer) as well.

Also, I eliminated all matches against image files which more often than not are not direct referrals from external sites but are embedded within those sites, or are being indexed by a search engine (Google Images mostly).

I also found that many of the server's image files include spaces in their filenames, which broke the regex where \S+ was used for the filename; I've changed this to .+

Finally, since I don't need to capture the date when eliminating records, I was able to simplify the first part of the regex.

The result is much closer to the numbers I'm expecting, although I have yet to find a good way to eliminate all requests from bots and spiders.

For those that are interested, the final code looks like this:

my $totalreferals = 0;
while ( my $line = <$fh> ) {
    if ($line !~ m!

        \[.+\]                                            # [timestamp]
        \s("\S+\s.+\sHTTP/\d.\d"                          # "METHOD /any/path HTTP/x.x"
        \s\S+                                             # status code
        \s\S+                                             # bytes sent
        \s("-"|"http://(www\.|)mywebsite\.com.*")|        # no referrer, or an internal one
        "\S+\s.+\.(jpg|jpeg|gif|png)\sHTTP/\d.\d"         # ... or any request for an image
        \s\S+                                             # status code
        \s\S+                                             # bytes sent
        \s".*")                                           # with any referrer at all
        !xi
        )
    {
      $totalreferals++;
    }

    $line =~ m!

        \[(\d{2}/\w{3}/\d{4})(?::\d\d){3}.+?\]            # capture the date
        \s"GET\s(\S+)\sHTTP/\d.\d"                        # capture the requested path
        \s(\S+)                                           # capture the status code
        \s\S+                                             # bytes sent
        \s"http://w{1,3}\.google\.                        # a Google referrer, under
        (?:[a-z]{2}|com?\.[a-z]{2}|com)\.?/               # any national or .com TLD
        [^\"]*q=([^\"&]+)[^\"]*"                          # capture the search query

    !xi or next;

    my ( $datestr, $path, $status, $query ) = ( $1, $2, $3, $4 );
    .
    .
    #do other stuff  
    .
    .
}
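
One small follow-up for the "#do other stuff" part: the captured $query is still URL-encoded, so (assuming URI::Escape from the URI distribution on CPAN is available) it can be decoded roughly like this:

use URI::Escape qw(uri_unescape);

# Sketch: turn a captured Google query back into plain text.
# '+' stands for a space in query strings; percent-escapes are decoded after that.
my $query = 'monet+water%20lilies';          # hypothetical captured value
( my $decoded = $query ) =~ tr/+/ /;
$decoded = uri_unescape($decoded);
print "$decoded\n";                          # prints "monet water lilies"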

Edit: In the course of my research, it seems that the only really viable way to distinguish between automatic crawlers and real human visitors is with tracking cookies. I doubt there's a way to account for that with pure log analysis. If anyone knows of a way to do it by analyzing logs, please let me know. For now I will just add a footnote to my log reports indicating that they include traffic from bots.
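
One partial, log-only heuristic is to skip requests whose User-Agent identifies itself as a crawler (as the Yeti and SurveyBot lines above do); a rough sketch, with the obvious caveat that it misses bots pretending to be browsers:

# Sketch: the User-Agent is the last quoted field in the combined format.
my $bot_ua = qr/bot|spider|crawl|slurp|yeti|survey/i;

open my $fh, '<', 'access_log' or die "can't open log: $!";
while ( my $line = <$fh> ) {
    my ($user_agent) = $line =~ /"([^"]*)"\s*$/;
    next if defined $user_agent && $user_agent =~ $bot_ua;
    # ... count the line as before
}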

rwired
A: 

My choice, at the moment, is to filter the log by IP address. The most active bots are Google, Yahoo, MSN, etc., so I took their IP ranges and excluded them from the counts.
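
A rough sketch of that approach (the CIDR blocks below are documentation placeholders, not the crawlers' real ranges; Net::CIDR does the range check):

use Net::CIDR ();

# Sketch: CIDR blocks identified as belonging to crawlers.
# 192.0.2.0/24 and 198.51.100.0/24 are placeholder (documentation) ranges.
my @bot_ranges = ( '192.0.2.0/24', '198.51.100.0/24' );

open my $fh, '<', 'access_log' or die "can't open log: $!";
while ( my $line = <$fh> ) {
    my ($ip) = $line =~ /^(\S+)/ or next;               # client IP is the first field
    next if Net::CIDR::cidrlookup( $ip, @bot_ranges );  # skip known crawler addresses
    # ... count the line as before
}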