ansaurus

Question

Getting rid of additional double quotes in a web log request string

Answer 1

+1 A:

A straightforward approach is to define your regexp in terms of simple tokens. Here I've defined the contents of the string to be zero or more instances of backslash-quote or a non-quote character.

examples = [
  '"foo"',
  '"foo\"bar\""',
  'empty',
  'one more "time"',
  'the "last" man "standing"'
]

examples.each do |example|
  puts "%s => %s" % [ example, example.match(/\"(?:\\"|[^"])*?\"/) ]
end

You can see how it performs on the various examples given.
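Note that match only reports the first quoted string on a line. If every quoted string is wanted, String#scan with the same pattern is one option. This is a minimal sketch; QUOTED_RE and quoted_strings are illustrative names, not part of the code above.

```ruby
# Same pattern as above: an opening quote, then backslash-quote pairs or
# non-quote characters, then the closing quote.
QUOTED_RE = /"(?:\\"|[^"])*?"/

# Returns every quoted string found in the line (an empty array if none).
def quoted_strings(line)
  line.scan(QUOTED_RE)
end

p quoted_strings('the "last" man "standing"')
p quoted_strings('empty')
```

Because the group is non-capturing, scan returns the full matches rather than arrays of captures.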

As a note about your strategy for decoding log file contents, doing verification as a series of long, tedious if statements is likely to be a serious performance drag. You may want to benchmark the different approaches to validating the contents of specific fields. For example, it may be more efficient to store the Fixnum equivalents of all valid numbers 0..255 in a Hash than it is to run .to_i and then compare the result against a low and a high value.
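A rough way to compare the two approaches is the standard Benchmark module. The sketch below is only a starting point: BYTE_LOOKUP, the two helper methods, the sample fields and the iteration count are all assumptions made for illustration, and real timings depend on your Ruby version and data.

```ruby
require 'benchmark'

# Lookup table built once: "0".."255" => 0..255 (illustrative name).
BYTE_LOOKUP = (0..255).each_with_object({}) { |n, table| table[n.to_s] = n }

# Approach 1: the field is valid if it is a key in the table.
def valid_byte_by_hash?(field)
  BYTE_LOOKUP.key?(field)
end

# Approach 2: check the shape, convert, then compare against low and high.
def valid_byte_by_to_i?(field)
  return false unless field =~ /\A\d{1,3}\z/
  field.to_i.between?(0, 255)
end

fields = %w[0 17 255 256 999 abc]
n = 50_000

Benchmark.bm(8) do |bm|
  bm.report('hash') { n.times { fields.each { |f| valid_byte_by_hash?(f) } } }
  bm.report('to_i') { n.times { fields.each { |f| valid_byte_by_to_i?(f) } } }
end
```

The two helpers are not fully equivalent: the hash version rejects fields with leading zeroes such as "007", while the to_i version accepts them, so pick whichever rule matches your log format before comparing timings.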

tadman 2009-06-09 16:22:28

Answer 2

A:

First off, let's break down your regex into pieces, so you don't have to do all that post-validation:

BYTE_RE = /25[0-5]|2[0-4]\d|[01]?\d?\d/
IP_RE = /#{BYTE_RE}(?:\.#{BYTE_RE}){3}/
DAY_RE = /0?[1-9]|[12]\d|3[01]/
MONTH_RE = /Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec/
YEAR_RE = /\d{4}/
DATE_RE = %r!#{DAY_RE}/#{MONTH_RE}/#{YEAR_RE}!
HOUR_RE = /[01]?\d|2[0-3]/
MIN_RE = /[0-5]\d/
SEC_RE = MIN_RE
TIME_RE = /#{HOUR_RE}:#{MIN_RE}:#{SEC_RE}\s-400/
DATETIME_RE = /#{DATE_RE}:#{TIME_RE}/
STRING_RE = /"(?:\\.|[^\\"])*"/

logLine_regex = /^#{IP_RE} - (?:\w*|-) \[#{DATETIME_RE}\] #{STRING_RE} \d{4} (?:\d+|-)$/
isVal = lg.readlines.all? { |line| line =~ logLine_regex }

The BYTE_RE only accepts strings whose integer value is 0-255, so we don't have to validate that afterwards. This does include forms with leading zeroes such as 000, so if you want to limit it to numbers without leading zeroes, change it to /25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d/.

The DAY_RE only accepts strings whose integer value is 1-31. Again, if you want to eliminate leading zeroes, use /[1-9]|[12]\d|3[01]/. There's no need to validate the year in your example - since it's exactly four digits, it must be between 0 and 9999 inclusive. We can do the same for position 14 to avoid validating that.

The HOUR_RE only accepts strings whose integer value is 0-23. Not accepting leading zeroes would give /1?\d|2[0-3]/. The MIN_RE and SEC_RE limit accepted strings to those with an integer value between 0 and 59.
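One way to convince yourself that a numeric pattern accepts exactly the range you intend is to anchor it and run it against every candidate value. The helper below, accepted_values, is a hypothetical name introduced for this sketch, and STRICT_BYTE_RE is a byte pattern written out range by range as an assumption for the example.

```ruby
# Returns the integers in `range` whose decimal form (no leading zeroes)
# is matched in full by `re`.
def accepted_values(re, range)
  anchored = /\A#{re}\z/   # interpolation wraps `re` in its own group
  range.select { |n| n.to_s =~ anchored }
end

DAY_RE  = /0?[1-9]|[12]\d|3[01]/
HOUR_RE = /[01]?\d|2[0-3]/
# Assumed for this sketch: 250-255, 200-249, then 0-199.
STRICT_BYTE_RE = /25[0-5]|2[0-4]\d|[01]?\d?\d/

p accepted_values(DAY_RE, 0..99) == (1..31).to_a
p accepted_values(HOUR_RE, 0..99) == (0..23).to_a
p accepted_values(STRICT_BYTE_RE, 0..999) == (0..255).to_a
```

The same check run against a looser pattern shows immediately which out-of-range values slip through.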

Then, to validate a string we use STRING_RE. I'll break this one down.

" - the open quote
(?:...) - non-capturing parens, good for grouping.
- \\. - a backslash followed by any single character - matches string escapes like \n, \a, \\, or \"
- | - either the preceding pattern or the following
- [^\\"] - any character except a backslash or a doublequote
* - zero or more of the preceding atom
" - the close quote

So this matches an open doublequote, any number of escaped characters or regular characters, and then a close quote.
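To see the pattern behave as described, it can be anchored and tried against a few sample fields. This is a minimal sketch; WHOLE_STRING_RE, quoted_field? and the sample strings are made up for illustration.

```ruby
STRING_RE = /"(?:\\.|[^\\"])*"/
# Anchored so the whole text must be a single quoted string.
WHOLE_STRING_RE = /\A#{STRING_RE}\z/

# True when `text` is exactly one double-quoted string with valid escapes.
def quoted_field?(text)
  !!(text =~ WHOLE_STRING_RE)
end

# Single-quoted Ruby literals keep \" as a literal backslash plus quote.
samples = [
  '"GET /a \"b\" HTTP/1.1"',  # escaped quotes inside the string
  '""',                       # empty string
  '"dangling \"',             # final quote is escaped, so never closed
  '"a"b"'                     # unescaped quote in the middle
]

samples.each { |s| puts "%s => %s" % [s, quoted_field?(s)] }
```

Excluding the backslash from the character class is what stops an escaped quote from being read as the closing quote.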

There's no need to make sure that the amount checked by the regex is the entire line, since the starting ^ and closing $ anchors take care of that.

So we've eliminated all your validation after the fact. Since we only want to know if all the lines match the given regex, we can use Enumerable#all?, which will return true if and only if all the lines match the given regex. Plus, as a side benefit, it will exit early as soon as any line fails to match, which means this will run a little faster in that case.
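The early exit is easy to observe by counting how many lines the block actually sees. In this small sketch the sample lines and the simplified pattern are stand-ins, not the real log format.

```ruby
lines = ["ok 1", "bad", "ok 2", "ok 3"]
checked = 0

all_valid = lines.all? do |line|
  checked += 1             # count how many lines the block is called for
  line =~ /\Aok \d+\z/     # nil (falsy) for "bad", which stops the iteration
end

puts "all_valid=#{all_valid} checked=#{checked}"
```

Only the first two lines are examined here, since all? stops at the first line whose block result is falsy.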

rampion 2009-06-10 01:14:02
