How can I create a regex for a string such as this:

<SERVER> <SERVERKEY> <COMMAND> <FOLDERPATH> <RETENTION> <TRANSFERMODE> <OUTPUTPATH> <LOGTO> <OPTIONAL-MAXSIZE> <OPTIONAL-OFFSET>

Most of these fields are simple words, but some of them, such as FOLDERPATH and OUTPUTPATH, can be paths. These paths can also have a filename and wildcard appended.

RETENTION is a number, and TRANSFERMODE can be bin or ascii. LOGTO is trickier: it can be a path with the log file name appended to it, or it can be NO, which means no log file.

The main issue is the optional arguments. Both are numbers; OFFSET can't exist without MAXSIZE, but MAXSIZE can exist without OFFSET.

Here are some examples:

loveserver love copy /muffin* 20 bin C:\Puppies\ NO 256 300
loveserver love copy /muffin* 20 bin C:\Puppies\ NO 256
loveserver love copy /hats* 300 ascii C:\Puppies\no\ C:\log\love.log 256

The main problem is that paths can have spaces in them. If I use . to match everything, the regex breaks when parsing the optional arguments: the log destination ends up attached to the output path.

Also, if I keep using . and start removing parts of the pattern, the regex starts putting things where it shouldn't.

Here's my regex:

^(\s+)?(?P<SRCHOST>.+)(\s+)(?P<SRCKEY>.+)(\s+)(?P<COMMAND>COPY)(\s+)(?P<SRCDIR>.+)(\s+)(?P<RETENTION>\d+)(\s+)(?P<TRANSFER_MODE>BIN|ASC|BINARY|ASCII)(\s+)(?P<DSTDIR>.+)(\s+)(?P<LOGFILE>.+)(\s+)?(?P<SIZE>\d+)?(\s+)?(?P<OFFSET>\d+)?$
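
To make the symptom concrete, here is a small sketch that runs the pattern above against the first example line. The `re.IGNORECASE` flag is an assumption (the pattern spells COPY and BIN in upper case while the examples are lower case), and `ASKER_PATTERN` is just an illustrative name.

```python
import re

# The question's pattern, copied unchanged. re.IGNORECASE is an assumption:
# the pattern says COPY/BIN but the sample lines are lower case.
ASKER_PATTERN = re.compile(
    r"^(\s+)?(?P<SRCHOST>.+)(\s+)(?P<SRCKEY>.+)(\s+)(?P<COMMAND>COPY)(\s+)"
    r"(?P<SRCDIR>.+)(\s+)(?P<RETENTION>\d+)(\s+)"
    r"(?P<TRANSFER_MODE>BIN|ASC|BINARY|ASCII)(\s+)(?P<DSTDIR>.+)(\s+)"
    r"(?P<LOGFILE>.+)(\s+)?(?P<SIZE>\d+)?(\s+)?(?P<OFFSET>\d+)?$",
    re.IGNORECASE,
)

line = r"loveserver love copy /muffin* 20 bin C:\Puppies\ NO 256 300"
fields = ASKER_PATTERN.match(line).groupdict()

# The greedy DSTDIR group swallows the log target and the first number,
# so the optional groups at the end never get a chance to match.
print(fields["DSTDIR"])   # C:\Puppies\ NO 256
print(fields["LOGFILE"])  # 300
print(fields["SIZE"])     # None
```

Because every `.+` is greedy and the trailing groups are optional, the engine is free to hand the numbers to the path groups instead.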
+4  A: 

The problem is that because you're allowing spaces in filenames and using spaces to separate fields, the solution is ambiguous. You either need to use a different field separator character that can't appear in filenames, or use some other method of representing filenames with spaces in them, e.g. putting them in quotation marks.
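
As a rough sketch of the quotation-mark option, the standard `csv` module can split a space-separated line while keeping quoted fields intact. The quoted input format and the helper name `split_quoted` are assumptions for illustration; the original data is not quoted this way.

```python
import csv

def split_quoted(line):
    """Split a space-separated line, keeping double-quoted text as one field."""
    # csv does not treat backslash as an escape character by default, so
    # Windows paths that end in a backslash survive unchanged.
    reader = csv.reader([line], delimiter=" ", quotechar='"',
                        skipinitialspace=True)
    return next(reader)

line = r'loveserver love copy "/muffin tops*" 20 bin "C:\My Puppies\" NO 256'
for field in split_quoted(line):
    print(field)
```

Once the fields are split this way, the optional MAXSIZE and OFFSET are simply whatever comes after the eighth field.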

Adam Rosenfield
+3  A: 

It is theoretically possible, but you are making things incredibly difficult for yourself. You have a number of problems here:

1) You are using space as a separator and you are also allowing spaces in the path names. You can avoid this by forcing the application to use paths without spaces in them.

2) You have two optional parameters on the end. This means that with the line ending "C:\LogTo Path 256 300" you have no idea whether the path is C:\LogTo Path 256 300 with no optional parameters, C:\LogTo Path 256 with one optional parameter, or C:\LogTo Path with two optional parameters.
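
The ambiguity in point 2 can be shown with a short sketch that lists every valid reading of a line's tail. The function name `tail_parses` is hypothetical, and it only looks at the LOGTO, MAXSIZE and OFFSET portion of a line.

```python
def tail_parses(tail):
    """List every (LOGTO[, MAXSIZE[, OFFSET]]) reading of a line's tail."""
    tokens = tail.split(" ")
    parses = []
    for n_optional in (0, 1, 2):
        cut = len(tokens) - n_optional
        optional = tokens[cut:]
        # LOGTO needs at least one token; optional fields must be numeric.
        if cut < 1 or not all(tok.isdigit() for tok in optional):
            continue
        parses.append((" ".join(tokens[:cut]), *optional))
    return parses

for parse in tail_parses(r"C:\LogTo Path 256 300"):
    print(parse)
```

All three readings satisfy the field rules, so no regex can pick the intended one without extra constraints on the paths.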

This ambiguity could be remedied with a replacement algorithm on the output: replace spaces with an underscore and underscores with a double underscore. You could then reverse the substitution after splitting the line on spaces.

Even a human could not parse such a line reliably.

If you presume that all paths end with an asterisk, a backslash or .log, you could use positive lookahead to find the end of each path, but without some rules like these you are stuffed.
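
Here is a sketch of the escaping idea. It deliberately swaps the underscore scheme for percent-encoding from the standard library, because the underscore version can be ambiguous when a space sits next to an underscore (both "a _b" and "a_ b" would encode to "a___b"). The helper names and the `SAFE` character set are assumptions.

```python
from urllib.parse import quote, unquote

# Characters left readable in the encoded form (an illustrative choice).
# Spaces and '%' are always escaped, which keeps the scheme reversible.
SAFE = "/:\\*"

def encode_field(value):
    """Escape one field so that it contains no spaces."""
    return quote(value, safe=SAFE)

def decode_line(line):
    """Split an encoded line on whitespace and undo the escaping."""
    return [unquote(token) for token in line.split()]

fields = ["loveserver", "love", "copy", "/muffin tops*", "20", "bin",
          "C:\\My Puppies\\", "C:\\log files\\love_100%.log", "256"]
line = " ".join(encode_field(field) for field in fields)
print(line)
print(decode_line(line) == fields)
```

This only works if whatever writes the lines applies the encoding, and it assumes no field is empty.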

I get the feeling that a single regex would be too difficult for this and would make anyone trying to maintain the code insane. I am a regex whore, using them whenever possible and I would not attempt this.

Xetius
A: 

You need to restrict the fields between the paths so that the regexp can distinguish them from the path names.

So unless you put in a special separator, the sequence

<OUTPUTPATH> <LOGTO>

with optional spaces will not work.

And if a path can look like those fields, you might get surprising results. e.g.

c:\ 12 bin \ 250 bin \output

for

<FOLDERPATH> <RETENTION> <TRANSFERMODE> <OUTPUTPATH>

is indistinguishable.

So, let's try to restrict allowed characters a bit:

<SERVER>, <SERVERKEY>, <COMMAND> no spaces -> [^ ]+
<FOLDERPATH> allow anything -> .+
<RETENTION> integer -> [0-9]+
<TRANSFERMODE> allow only bin and ascii -> (bin|ascii)
<OUTPUTPATH> allow anything -> .+
<LOGTO> allow anything -> .+
<OPTIONAL-MAXSIZE> integer -> [0-9]*
<OPTIONAL-OFFSET> integer -> [0-9]*

So, I'd go with something along the lines of

[^ ]+ [^ ]+ [^ ]+ .+ [0-9]+ (bin|ascii) .+ \> .+( [0-9]*( [0-9]*)?)?

With a ">" to separate the two paths. You might want to quote the path names instead.
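
A hedged Python sketch of the ">" separator idea follows. The named groups, the pattern name `SEPARATED_RE`, the lazy `.+?` for the log target, and the use of `[0-9]+` for the optional numbers are my own adjustments so that the trailing numbers are not swallowed by the log path; the input format with " > " between the two paths is an assumption.

```python
import re

SEPARATED_RE = re.compile(
    r"^(?P<server>[^ ]+) (?P<key>[^ ]+) (?P<command>[^ ]+) "
    r"(?P<folder>.+) (?P<retention>[0-9]+) (?P<mode>bin|ascii) "
    r"(?P<output>.+) > (?P<logto>.+?)"          # ">" splits the two paths
    r"(?: (?P<maxsize>[0-9]+)(?: (?P<offset>[0-9]+))?)?$"
)

line = (r"loveserver love copy /muffin tops* 20 bin "
        r"C:\My Puppies\ > C:\log files\love.log 256 300")
match = SEPARATED_RE.match(line)
print(match.group("output"))   # C:\My Puppies\
print(match.group("logto"))    # C:\log files\love.log
print(match.group("maxsize"), match.group("offset"))  # 256 300
```

A log path that itself ends in a space followed by digits would still be misread, which is why quoting the paths is the safer option.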

NB: This was done in a hurry.

Stroboskop
+1  A: 

Just splitting on whitespace is never going to work. But if you can make some assumptions on the data it could be made to work.

Some assumptions I had in mind:

  • SERVER, SERVERKEY and COMMAND not containing any spaces: \S+
  • FOLDERPATH beginning with a slash: /.*?
  • RETENTION being a number: \d+
  • TRANSFERMODE not containing any spaces: \S+
  • OUTPUTPATH beginning with a drive and ending with a slash: [A-Z]:\\.*?\\
  • LOGTO either being the word "NO", or a path beginning with a drive: [A-Z]:\\.*?
  • MAXSIZE and OFFSET being a number: \d+

Putting it all together:

^\s*
(?P<SERVER>\S+)\s+
(?P<SERVERKEY>\S+)\s+
(?P<COMMAND>\S+)\s+
(?P<FOLDERPATH>/.*?)\s+   # Slash not that important, but should start with non-whitespace
(?P<RETENTION>\d+)\s+
(?P<TRANSFERMODE>\S+)\s+
(?P<OUTPUTPATH>[A-Z]:\\.*?\\)\s+   # Could also support network paths
(?P<LOGTO>NO|[A-Z]:\\.*?)
(?:
  \s+(?P<MAXSIZE>\d+)
  (?:
    \s+(?P<OFFSET>\d+)
  )?
)?
\s*$
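
The multi-line form can be used directly in Python if it is compiled with `re.VERBOSE`, which ignores the layout whitespace and the `#` comments. The sketch below also converts the numeric fields to integers; `LINE_RE` and `parse_line` are illustrative names, not part of the original answer.

```python
import re

LINE_RE = re.compile(r"""
    ^\s*
    (?P<SERVER>\S+)\s+
    (?P<SERVERKEY>\S+)\s+
    (?P<COMMAND>\S+)\s+
    (?P<FOLDERPATH>/.*?)\s+              # assumed to start with a slash
    (?P<RETENTION>\d+)\s+
    (?P<TRANSFERMODE>\S+)\s+
    (?P<OUTPUTPATH>[A-Z]:\\.*?\\)\s+     # drive letter ... trailing backslash
    (?P<LOGTO>NO|[A-Z]:\\.*?)
    (?:\s+(?P<MAXSIZE>\d+)(?:\s+(?P<OFFSET>\d+))?)?
    \s*$
""", re.VERBOSE)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    for name in ("RETENTION", "MAXSIZE", "OFFSET"):
        if fields[name] is not None:
            fields[name] = int(fields[name])
    return fields

print(parse_line(r"loveserver love copy /muffin tops* 20 bin C:\My Puppies\ NO 256 300"))
```

Note that the paths in this example contain spaces; they parse correctly only because they follow the stated assumptions about how each path starts and ends.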

In one line:

^\s*(?P<SERVER>\S+)\s+(?P<SERVERKEY>\S+)\s+(?P<COMMAND>\S+)\s+(?P<FOLDERPATH>/.*?)\s+(?P<RETENTION>\d+)\s+(?P<TRANSFERMODE>\S+)\s+(?P<OUTPUTPATH>[A-Z]:\\.*?\\)\s+(?P<LOGTO>NO|[A-Z]:\\.*?)(?:\s+(?P<MAXSIZE>\d+)(?:\s+(?P<OFFSET>\d+))?)?\s*$

Testing:

>>> import re
>>> p = re.compile(r'^(?P<SERVER>\S+)\s+(?P<SERVERKEY>\S+)\s+(?P<COMMAND>\S+)\s+(?P<FOLDERPATH>/.*?)\s+(?P<RETENTION>\d+)\s+(?P<TRANSFERMODE>\S+)\s+(?P<OUTPUTPATH>[A-Z]:\\.*?\\)\s+(?P<LOGTO>NO|[A-Z]:\\.*?)(?:\s+(?P<MAXSIZE>\d+)(?:\s+(?P<OFFSET>\d+))?)?\s*$',re.M)
>>> data = r"""loveserver love copy /muffin* 20 bin C:\Puppies\ NO 256 300
... loveserver love copy /muffin* 20 bin C:\Puppies\ NO 256
... loveserver love copy /hats* 300 ascii C:\Puppies\no\ C:\log\love.log 256"""
>>> import pprint
>>> for match in p.finditer(data):
...   pprint.pprint(match.groupdict())
...
{'COMMAND': 'copy',
 'FOLDERPATH': '/muffin*',
 'LOGTO': 'NO',
 'MAXSIZE': '256',
 'OFFSET': '300',
 'OUTPUTPATH': 'C:\\Puppies\\',
 'RETENTION': '20',
 'SERVER': 'loveserver',
 'SERVERKEY': 'love',
 'TRANSFERMODE': 'bin'}
{'COMMAND': 'copy',
 'FOLDERPATH': '/muffin*',
 'LOGTO': 'NO',
 'MAXSIZE': '256',
 'OFFSET': None,
 'OUTPUTPATH': 'C:\\Puppies\\',
 'RETENTION': '20',
 'SERVER': 'loveserver',
 'SERVERKEY': 'love',
 'TRANSFERMODE': 'bin'}
{'COMMAND': 'copy',
 'FOLDERPATH': '/hats*',
 'LOGTO': 'C:\\log\\love.log',
 'MAXSIZE': '256',
 'OFFSET': None,
 'OUTPUTPATH': 'C:\\Puppies\\no\\',
 'RETENTION': '300',
 'SERVER': 'loveserver',
 'SERVERKEY': 'love',
 'TRANSFERMODE': 'ascii'}
>>>
MizardX
That was amazing. Thank you very much.
UberJumper
A: 

Are less-than/greater-than signs allowed inside the values? If not, you have a very simple solution:

Just replace every occurrence of "> " with just ">", split on "><", and strip all less-than/greater-than signs from each item. It will probably be longer than the regex code, but it will be clearer what's going on.
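
For completeness, those three steps could look like the sketch below. It assumes each field is literally wrapped in angle brackets in the data, which (as the comment under this answer points out) is not the case for the questioner's actual strings; `split_bracketed` is a hypothetical helper name.

```python
def split_bracketed(line):
    """Split '<a> <b c> <d>' into ['a', 'b c', 'd'] using the steps above."""
    joined = line.replace("> ", ">")   # "<a> <b c>" becomes "<a><b c>"
    pieces = joined.split("><")        # cut between adjacent fields
    return [piece.strip("<>") for piece in pieces]

line = r"<loveserver> <love> <copy> </muffin tops*> <20> <bin> <C:\My Puppies\> <NO> <256>"
for field in split_bracketed(line):
    print(field)
```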

Joel Coehoorn
<> are not used to quote the tokens in the actual strings - they are just in the questioner's specification of the string format.
mackenir