views:

42

answers:

2

I'm trying to write a Ruby program that parses a web log and makes sure each part of every entry is valid. I'm stuck on the case where the request string of an entry contains additional double quotes besides the opening and closing ones. I wrote the log format as a single regular expression because that's easier to read than making variables for each part. Here's what I have so far:

isVal = true
lines = lg.readlines
logLine_regex = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}) - (\w*|-) \[(\d{2})\/(\w{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s(-0400)\] (".*") (\d+) (\d+|-)$/

lines.each{ |line|

 linePos = logLine_regex.match(line)

 if linePos == nil
  isVal = false
 elsif linePos[0] != line.chomp
  isVal = false
 elsif !((0..255).include?(linePos[1].to_i))
  isVal = false
 elsif !((0..255).include?(linePos[2].to_i))
  isVal = false
 elsif !((0..255).include?(linePos[3].to_i))
  isVal = false
 elsif !((0..255).include?(linePos[4].to_i))
  isVal = false
 #linePos[5] = Username or hyphen
 elsif !((1..31).include?(linePos[6].to_i))
  isVal = false
 elsif !(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"].include?(linePos[7]))
  isVal = false
 elsif !((0..9999).include?(linePos[8].to_i))
  isVal = false
 elsif !((0..23).include?(linePos[9].to_i))
  isVal = false
 elsif !((0..59).include?(linePos[10].to_i))
  isVal = false
 elsif !((0..59).include?(linePos[11].to_i))
  isVal = false
 #linePos[12] = -0400
 #linePos[13] = request
 elsif !((0..9999).include?(linePos[14].to_i))
  isVal = false
 #linePos[15] = bytes
 else
  isVal = true
 end

}

I know that additional double quotes can be escaped by prefixing them with a backslash, but I have no idea how to code that in Ruby. Please help?

+1  A: 

A straightforward approach is to define your regexp in terms of simple tokens. Here I've defined the contents of the string to be zero or more instances of backslash-quote or a non-quote character.

examples = [
  '"foo"',
  '"foo\"bar\""',
  'empty',
  'one more "time"',
  'the "last" man "standing"'
]

examples.each do |example|
  puts "%s => %s" % [ example, example.match(/\"(?:\\"|[^"])*?\"/) ]
end

You can see how it performs on the various examples given.

As a note about your strategy for decoding log file contents, doing verification as a series of long, tedious if statements is likely to be a performance drag. You may want to benchmark various approaches to validating the contents of specific fields. For example, it may be more efficient to store the Fixnum equivalents of all valid numbers 0..255 in a Hash than it is to run .to_i and then do comparisons between a low and high value.
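As a rough illustration of that comparison, here is a minimal sketch that checks an octet both ways and times them with the standard Benchmark module. The names OCTET_VALUES, octet_by_range? and octet_by_hash? are made up for this example, and the timings depend on your machine and Ruby version, so treat it as a starting point for your own measurements rather than a verdict.

```ruby
require 'benchmark'

# Hypothetical lookup table from octet text to its integer value ("42" => 42).
OCTET_VALUES = (0..255).each_with_object({}) { |n, h| h[n.to_s] = n }.freeze

# Range approach, as in the question: convert, then compare against bounds.
def octet_by_range?(text)
  (0..255).include?(text.to_i)
end

# Hash approach: a single lookup on the raw text, no conversion.
def octet_by_hash?(text)
  OCTET_VALUES.key?(text)
end

samples = %w[0 7 255 256 999]
range_results = samples.map { |s| octet_by_range?(s) }
hash_results  = samples.map { |s| octet_by_hash?(s) }
puts "approaches agree: #{range_results == hash_results}"

# Timings vary by machine and Ruby version, so measure rather than assume.
Benchmark.bm(6) do |bm|
  bm.report('range') { 50_000.times { samples.each { |s| octet_by_range?(s) } } }
  bm.report('hash')  { 50_000.times { samples.each { |s| octet_by_hash?(s) } } }
end
```

Note that the two checks are not interchangeable on every input: the hash as built here rejects text with leading zeroes such as "007", while .to_i accepts it.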

tadman
A: 

First off, let's break down your regex into pieces, so you don't have to do all that post-validation:

BYTE_RE = /25[0-5]|2[0-4]\d|[01]?\d?\d/
IP_RE = /#{BYTE_RE}(?:\.#{BYTE_RE}){3}/
DAY_RE = /0?[1-9]|[12]\d|3[01]/
MONTH_RE = /Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec/
YEAR_RE = /\d{4}/
DATE_RE = %r!#{DAY_RE}/#{MONTH_RE}/#{YEAR_RE}!
HOUR_RE = /[01]?\d|2[0-3]/
MIN_RE = /[0-5]\d/
SEC_RE = MIN_RE
TIME_RE = /#{HOUR_RE}:#{MIN_RE}:#{SEC_RE}\s-0400/
DATETIME_RE = /#{DATE_RE}:#{TIME_RE}/
STRING_RE = /"(?:\\.|[^\\"])*"/

logLine_regex = /^#{IP_RE} - (?:\w*|-) \[#{DATETIME_RE}\] #{STRING_RE} \d{1,4} (?:\d+|-)$/
isVal = lg.readlines.all? { |line| line =~ logLine_regex }

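The BYTE_RE is meant to accept only strings whose integer value is 0-255, so we don't have to validate that afterwards; /25[0-5]|2[0-4]\d|[01]?\d?\d/ does this, whereas a looser pattern such as /(?:[012]?\d)?\d/ would also let 256-299 through. This does include 000, so if you want to limit it to numbers without leading zeroes, change it to /25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d/.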

The DAY_RE only accepts strings whose integer value is 1-31. Again, if you want to eliminate leading zeroes, use /[1-9]|[12]\d|3[01]/. There's no need to validate the year in your example - since it's exactly four digits, it must be between 0 and 9999 inclusive. The same goes for position 14: limiting it to one to four digits with \d{1,4} keeps it between 0 and 9999 without a separate check (\d{4} alone would reject three-digit status codes such as 200).

The HOUR_RE only accepts strings whose integer value is 0-23. Not accepting leading zeroes would give /1?\d|2[0-3]/. The MIN_RE and SEC_RE limit accepted strings to those with an integer value between 0 and 59.
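If you want to convince yourself that range-limited patterns like these accept exactly the intended values, one option is to test them exhaustively. The sketch below does that for a byte pattern capped at 255 and for the day and hour patterns; the helper name accepted and the 0..999 search range are choices made for this example only.

```ruby
# Patterns under test. byte_re is one way to cap an octet at 255; day_re and
# hour_re are the day and hour patterns discussed above.
byte_re = /25[0-5]|2[0-4]\d|[01]?\d?\d/
day_re  = /0?[1-9]|[12]\d|3[01]/
hour_re = /[01]?\d|2[0-3]/

# Return the integers in 0..999 whose decimal text matches the whole pattern.
# \A and \z force the pattern to cover the entire string.
def accepted(re)
  anchored = /\A(?:#{re})\z/
  (0..999).select { |n| anchored.match?(n.to_s) }
end

puts "byte ok: #{accepted(byte_re) == (0..255).to_a}"
puts "day ok:  #{accepted(day_re) == (1..31).to_a}"
puts "hour ok: #{accepted(hour_re) == (0..23).to_a}"
```

This only exercises numbers written without leading zeroes, since n.to_s never produces them; strings such as "007" would need their own test cases.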

Then, to validate a string we use STRING_RE. I'll break this one down.

  • " - the open quote
  • (?:...) - non-capturing parens, good for grouping.
    • \\. - a backslash followed by any single character - matches string escapes like \n, \a, \\, or \"
    • | - either the preceding pattern or the following
    • [^\\"] - any character except a backslash or a doublequote
  • * - zero or more of the preceding atom
  • " - the close quote

So this matches an open doublequote, any number of escaped characters or regular characters, and then a close quote.
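To see the string pattern at work, here is a small sketch that runs it over a few request strings. The sample requests are invented for illustration, and the pattern is wrapped in \A and \z so that only a match covering the whole string counts.

```ruby
# Same pattern as described above: an opening quote, then escaped pairs or
# plain characters, then the closing quote.
string_re = /"(?:\\.|[^\\"])*"/

# Anchor at both ends so a partial match does not count as valid.
whole_string_re = /\A#{string_re}\z/

# Hypothetical request strings, written with %q so backslashes stay literal.
samples = [
  %q("GET /index.html HTTP/1.0"),
  %q("GET /search?q=\"ruby\" HTTP/1.0"),
  %q("GET /search?q="ruby" HTTP/1.0"),
  %q("unterminated \"),
]

results = samples.map { |s| [s, whole_string_re.match?(s)] }
results.each { |s, ok| puts "#{ok ? 'valid  ' : 'invalid'} #{s}" }
```

The first two samples are accepted; the third fails because its inner quotes are not escaped, and the fourth fails because its final quote is escaped and so never closes the string.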

There's no need to make sure that the amount checked by the regex is the entire line, since the starting ^ and closing $ anchors take care of that.

So we've eliminated all your validation after the fact. Since we only want to know if all the lines match the given regex, we can use Enumerable#all?, which will return true if and only if all the lines match the given regex. Plus, as a side benefit, it will exit early as soon as any line fails to match, which means this will run a little faster in that case.
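The early exit is easy to observe with a counter. In this sketch the log is a StringIO stand-in and line_re is a deliberately tiny placeholder pattern, not the real log regex; both are assumptions made for the example.

```ruby
require 'stringio'

# Placeholder pattern: a line consisting of exactly three digits.
line_re = /\A\d{3}\n?\z/

# Stand-in for the log file; the third line is deliberately invalid.
log = StringIO.new("200\n404\noops\n500\n")

checked = 0
is_valid = log.each_line.all? do |line|
  checked += 1
  line_re.match?(line)
end

puts "valid: #{is_valid}, lines examined: #{checked}"
```

Only three of the four lines are examined, because all? stops at the first failure. Using each_line instead of readlines also avoids loading the whole file into memory first, which may matter for large logs.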

rampion