I need some help getting a regex working to parse all referrers from an Apache access log file that come from real offsite links, i.e. valid referrals from real people rather than bots or spiders. I'm working in Perl.

This bit of code almost works already [the access log is opened with the filehandle $fh]:

my $totalreferals = 0;
while ( my $line = <$fh> ) {
    if ($line !~ m!

        \[\d{2}/\w{3}/\d{4}(?::\d\d){3}.+?\]
        \s"GET\s\S+\sHTTP/\d.\d"
        \s\S+
        \s\S+
        \s("-"|"http://(www\.|)mywebsite\.com.*"                

        !xi
        )
    {
          $totalreferals++;  
    }

    $line =~ m!

        \[(\d{2}/\w{3}/\d{4})(?::\d\d){3}.+?\]
        \s"GET\s(\S+)\sHTTP/\d.\d"
        \s(\S+)
        \s\S+
        \s"http://w{1,3}\.google\.
        (?:[a-z]{2}|com?\.[a-z]{2}|com)\.?/
        [^\"]*q=([^\"&]+)[^\"]*"

    !xi or next;

    my ( $datestr, $path, $status, $query ) = ( $1, $2, $3, $4 );
    .
    .
    #do other stuff  
    .
    .
}

The above regex successfully eliminates all internal links recorded in the access_log, plus records that don't have a referrer at all, but the $totalreferals count it produces is still far too large.

Examples of log lines ($line) that are being counted by the first regex, but which I want excluded, are:

61.247.221.45 - - [02/Jan/2009:20:51:41 -0600] "GET /oil-paintings/section.php/2451/0 HTTP/1.1" 200 85856 "-" "Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)"

-- Appears to be a spider from Korea


93.84.41.131 - - [31/Dec/2008:02:36:54 -0600] "GET /paintings/artists/w/Waterhouse_John_William/oil-big/Waterhouse_Destiny.jpg HTTP/1.1" 200 19924 "http://smrus.web-box.ru/Schemes" "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"

-- Request is for an image embedded within another website (we allow this)


87.115.8.230 - - [31/Dec/2008:03:08:17 -0600] "GET /paintings/artists/recently-added/july2008/big/Crucifixion-of-St-Peter-xx-Guido-Reni.JPG HTTP/1.1" 200 37348 "http://images.google.co.uk/im........DN&frame=small" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5"

-- Request is from google images (could be viewing the image full-size, or spidering it)


216.145.5.42 - - [31/Dec/2008:02:21:49 -0600] "GET / HTTP/1.1" 200 53508 "http://whois.domaintools.com/mywebsite.com" "SurveyBot/2.3 (Whois Source)"

-- Request is from a whois bot


+3  A: 

Unless you have some really weird requirement to reinvent the wheel,

http://search.cpan.org/search?query=apache+log&mode=all
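
For instance, one module in that vein is Apache::LogRegex. A rough sketch, assuming the standard "combined" LogFormat and that parse() returns a hash keyed by the format directives (check the module's docs before relying on this):

use strict;
use warnings;
use Apache::LogRegex;

# Assumed: the log was written with the standard "combined" LogFormat.
my $format = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"';
my $lr     = Apache::LogRegex->new($format);

open my $fh, '<', 'access_log' or die "can't open log: $!";

while ( my $line = <$fh> ) {
    my %data = $lr->parse($line) or next;       # skip lines that don't match the format
    my $referer    = $data{'%{Referer}i'};      # referrer field, already unquoted
    my $user_agent = $data{'%{User-agent}i'};   # user-agent field
    # ... apply the referrer/bot filtering here
}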

derobert
My requirements are really weird.
rwired
+1  A: 

I think your problem is here:

\s"http://w{0,3}\.mywebsite\.com[^\"]*"

This will not catch the "http://mywebsite.com" case because it will always require a dot before "mywebsite".
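
One way to write that part so the "www." is genuinely optional, as a rough sketch (the filename and counter variable are just placeholders for whatever you already use):

# Sketch: "-" means no referrer was sent; making "www." optional also catches
# referrers like "http://mywebsite.com/foo".
my $internal = qr{"(?:-|http://(?:www\.)?mywebsite\.com[^"]*)"}i;

open my $fh, '<', 'access_log' or die "can't open log: $!";
my $totalreferals = 0;
while ( my $line = <$fh> ) {
    next if $line =~ /\s$internal\s/;   # skip no-referrer and internal-referrer lines
    $totalreferals++;                   # everything else counts as an external referral
}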

Also, you are only excluding GET requests. What about POST and HEAD?

Edit: If you still get numbers that seem wrong, you should definitely capture the referrer with your regex and print it.
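
Something along these lines, as a quick debugging sketch (it assumes the standard combined format, where the referrer is the second-to-last quoted field):

open my $fh, '<', 'access_log' or die "can't open log: $!";

while ( my $line = <$fh> ) {
    # Referrer is the second-to-last quoted field, user-agent the last one.
    next unless $line =~ /"([^"]*)"\s+"[^"]*"\s*$/;
    my $referer = $1;

    # Print only the referrers you would currently be counting,
    # so you can see what is inflating the total.
    print "counted: $referer\n"
        unless $referer eq '-'
            or $referer =~ m{^http://(?:www\.)?mywebsite\.com}i;
}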

innaM
You are right, thanks. I noticed that and fixed it. I also noticed I wasn't taking into account https://, which some of my local referrals were using. Also, I wasn't eliminating "-", which means there wasn't a referral at all. I'm now getting a total of 436296 for my 6-day sample. Still too big.
rwired
OK. So could you update your question accordingly?
innaM
A: 

For regexes matching URIs, try Regexp::Common (or more specifically Regexp::Common::URI::http).
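
For instance, a minimal sketch (assuming $RE{URI}{HTTP}, the pattern Regexp::Common::URI::http provides, fits your referrer URLs):

use strict;
use warnings;
use Regexp::Common qw(URI);

open my $fh, '<', 'access_log' or die "can't open log: $!";

while ( my $line = <$fh> ) {
    # Only proceed if the second-to-last quoted field is a well-formed http URI.
    next unless $line =~ /"($RE{URI}{HTTP})"\s+"[^"]*"\s*$/;
    my $referer = $1;
    # ... filter or count $referer here
}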

Joe Casadonte
Parsing the available data isn't the problem I'm trying to solve. I'm trying to match the records I don't want to count, because they aren't referrals from other websites clicked on by real people.
rwired
+1  A: 

Manni's suggestion to also eliminate POST and HEAD was indeed correct, because I'm looking for negative matches (so I shouldn't restrict the pattern to GET, as I do when parsing for query strings). Likewise for the error of requiring a dot before the host when there is no www, and the need to eliminate "-" (no referrer) as well.

Also, I eliminated all matches against image files which more often than not are not direct referrals from external sites but are embedded within those sites, or are being indexed by a search engine (Google Images mostly).

I also found that many of the server's image files include spaces in their filenames, which broke the regex where \S+ was used for the filename; I've changed this to .+

Finally, since I don't need to capture the date when eliminating records, I was able to simplify the first part of the regex.

The result is much closer to the numbers I'm expecting, although I have yet to find a good way to eliminate all requests from bots and spiders.

For those that are interested, the final code looks like this:

my $totalreferals = 0;
while ( my $line = <$fh> ) {
    if ($line !~ m!

        \[.+\]                                            # [timestamp]
        \s("\S+\s.+\sHTTP/\d.\d"                          # "METHOD /any/path HTTP/x.x"
        \s\S+                                             # status code
        \s\S+                                             # bytes sent
        \s("-"|"http://(www\.|)mywebsite\.com.*")|        # no referrer, or an internal one
        "\S+\s.+\.(jpg|jpeg|gif|png)\sHTTP/\d.\d"         # ... or any request for an image
        \s\S+                                             # status code
        \s\S+                                             # bytes sent
        \s".*")                                           # with any referrer at all
        !xi
        )
    {
      $totalreferals++;
    }

    $line =~ m!

        \[(\d{2}/\w{3}/\d{4})(?::\d\d){3}.+?\]            # capture the date
        \s"GET\s(\S+)\sHTTP/\d.\d"                        # capture the requested path
        \s(\S+)                                           # capture the status code
        \s\S+                                             # bytes sent
        \s"http://w{1,3}\.google\.                        # a Google referrer, under
        (?:[a-z]{2}|com?\.[a-z]{2}|com)\.?/               # any national or .com TLD
        [^\"]*q=([^\"&]+)[^\"]*"                          # capture the search query

    !xi or next;

    my ( $datestr, $path, $status, $query ) = ( $1, $2, $3, $4 );
    .
    .
    #do other stuff  
    .
    .
}
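
One small follow-up for the "#do other stuff" part: the captured $query is still URL-encoded, so (assuming URI::Escape from the URI distribution on CPAN is available) it can be decoded roughly like this:

use URI::Escape qw(uri_unescape);

# Sketch: turn a captured Google query back into plain text.
# '+' stands for a space in query strings; percent-escapes are decoded after that.
my $query = 'monet+water%20lilies';          # hypothetical captured value
( my $decoded = $query ) =~ tr/+/ /;
$decoded = uri_unescape($decoded);
print "$decoded\n";                          # prints "monet water lilies"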

Edit: In the course of my research, it seems that the only really viable way to distinguish between automatic crawlers and real human visitors is with tracking cookies. I doubt there's a way to account for that with pure log analysis. If anyone knows of a way to do it by analyzing logs, please let me know. For now I will just add a footnote to my log reports indicating that they include traffic from bots.
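
One partial, log-only heuristic is to skip requests whose User-Agent identifies itself as a crawler (as the Yeti and SurveyBot lines above do); a rough sketch, with the obvious caveat that it misses bots pretending to be browsers:

# Sketch: the User-Agent is the last quoted field in the combined format.
my $bot_ua = qr/bot|spider|crawl|slurp|yeti|survey/i;

open my $fh, '<', 'access_log' or die "can't open log: $!";
while ( my $line = <$fh> ) {
    my ($user_agent) = $line =~ /"([^"]*)"\s*$/;
    next if defined $user_agent && $user_agent =~ $bot_ua;
    # ... count the line as before
}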

rwired
A: 

My choice, at the moment, is to filter the log by IP address. The most active bots are Google, Yahoo, MSN, etc., so I took their IP ranges and excluded them from the counts.
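
A rough sketch of that approach (the CIDR blocks below are documentation placeholders, not the crawlers' real ranges; Net::CIDR does the range check):

use Net::CIDR ();

# Sketch: CIDR blocks identified as belonging to crawlers.
# 192.0.2.0/24 and 198.51.100.0/24 are placeholder (documentation) ranges.
my @bot_ranges = ( '192.0.2.0/24', '198.51.100.0/24' );

open my $fh, '<', 'access_log' or die "can't open log: $!";
while ( my $line = <$fh> ) {
    my ($ip) = $line =~ /^(\S+)/ or next;               # client IP is the first field
    next if Net::CIDR::cidrlookup( $ip, @bot_ranges );  # skip known crawler addresses
    # ... count the line as before
}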