I need help to create a regex (for JavaScript .match and PHP preg_match) that validates a unix type absolute path to a file (with international characters such as åäöøæð and so on) so that:

  1. /path/to/someWhere is valid
  2. /path/tø/sömewhere is valid
  3. /path/to//somewhere is invalid
  4. path/to/somewhere is invalid
  5. /path/to/somewhere/ is invalid

The regex needs to handle paths regardless of their depth (/path/to or /path/to/somewhere or /path/to/somewhere/else)

I have a regexp, /^\/.+[^\/]$/, that marks 1 to 3 as valid. The problem is making it reject 3, which contains // with no other character in between.

A: 

This should work:

^/[^/]?$|^/[^/]([^/]|/[^/])*?[^/]$

It allows any character except /, or a / followed by any character except /. It also makes sure that the last character isn’t a /, and that the second character isn’t one either.

Finally, this uses / without escaping. To use it in PHP, don’t use / as the regex delimiter – this just makes the regular expression hard to read. Use any other character, e.g. ; to delimit the expression instead:

;^/[^/]?$|^/[^/]([^/]|/[^/])*?[^/]$;
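For the JavaScript side, here is a small sketch that runs the pattern above against the sample paths from the question. The names pathPattern and samples are just for illustration. It builds the regex with the RegExp constructor so the slashes can stay unescaped. I added two extra inputs, "/" and "/a/b"; as far as I can tell the second one is rejected, because the pattern always wants a separate final character after the repeated group, so treat that as a known limitation.

```javascript
// Pattern from the answer, built with the RegExp constructor so that
// the slashes need no escaping (in a regex literal each / would be \/).
const pathPattern = new RegExp("^/[^/]?$|^/[^/]([^/]|/[^/])*?[^/]$");

// Sample paths from the question, plus two extra edge cases.
const samples = [
  "/path/to/someWhere",
  "/path/tø/sömewhere",
  "/path/to//somewhere",
  "path/to/somewhere",
  "/path/to/somewhere/",
  "/",
  "/a/b", // one-character last component below another directory
];

// Pair each path with the result of testing it against the pattern.
const results = samples.map((p) => [p, pathPattern.test(p)]);

for (const [p, ok] of results) {
  console.log(p, ok ? "valid" : "invalid");
}
```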

EDIT: Added special handling for the root path, "/", and paths that consist of a single-letter directory.

Konrad Rudolph
This doesn't match "/", a single slash, which is the pathname for the root directory. Also, non-greedy quantifiers can cause performance problems.
Pointy
@Pointy: it’s the *greedy* quantifiers that may cause performance problems. But good call about the root path.
Konrad Rudolph
Well it depends on the regex. Using a greedy quantifier can significantly reduce backtracking. [See this excellent blog post.](http://blog.stevenlevithan.com/archives/greedy-lazy-performance)
Pointy
@Pointy: True but in this case there are no nested quantifiers so catastrophic backtracking cannot happen, and in this case it actually saves us one backtracking because the last character won’t be consumed and then spit out again.
Konrad Rudolph
I'm no expert. One of these days I need to try out Regex Buddy :-)
Pointy
+4  A: 

Regex isn't really needed here. As far as I can see, there are three things you want to ensure:

  1. The string starts with /
  2. The string doesn't end with /, unless the whole string is /
  3. The string doesn't contain any instances of //

All three of the above can be done with string functions.

In PHP:

if ($string != '/' && ($string[0] != '/' || $string[strlen($string)-1] == '/' || strpos($string, '//') !== false))
{
  // string is invalid
}

In Javascript:

if (string != '/' && (string.charAt(0) != '/' || string.charAt(string.length - 1) == '/' || string.indexOf('//') > -1))
{
  // string is invalid
}
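
If you want to reuse the check, the same three conditions can be wrapped in a function. This is only a sketch: the name isValidPath is made up, and the guard for empty or non-string input is my own addition, not one of the three rules listed above.

```javascript
// Hypothetical helper wrapping the three string checks.
function isValidPath(path) {
  // Added assumption: empty or non-string input counts as invalid.
  if (typeof path !== "string" || path.length === 0) return false;
  // The root path is valid on its own.
  if (path === "/") return true;
  // Rule 1: must start with a slash.
  if (path.charAt(0) !== "/") return false;
  // Rule 2: must not end with a slash.
  if (path.charAt(path.length - 1) === "/") return false;
  // Rule 3: must not contain an empty component ("//").
  return path.indexOf("//") === -1;
}

console.log(isValidPath("/path/tø/sömewhere"));
console.log(isValidPath("/path/to//somewhere"));
```

Returning early on each rule keeps the conditions readable and makes it easy to add more rules later.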

Daniel Vandersluis
+1. Regexps aren't silver bullets!
August Lilleaas
The pathname consisting of a single slash **is** a valid pathname.
Pointy
@Pointy fair enough, updated.
Daniel Vandersluis
A: 

If the path matches ^[^\/]|\/\/|.\/$, it is invalid. Otherwise it is valid.
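
A rough JavaScript sketch of this "match means invalid" approach follows. The names invalidPattern and isValidByExclusion are invented for the example. Note that the empty string matches none of the three alternatives, so I added an explicit check for it; that part is my assumption about what you would want.

```javascript
// Matches when the path is INVALID:
//   ^[^\/]  does not start with a slash
//   \/\/    contains an empty component
//   .\/$    ends with a slash that is not the only character
const invalidPattern = /^[^\/]|\/\/|.\/$/;

// Hypothetical wrapper: valid means "non-empty and no invalid pattern found".
function isValidByExclusion(path) {
  return path !== "" && !invalidPattern.test(path);
}

for (const p of ["/path/to/someWhere", "/path/to//somewhere", "path/to", "/path/", "/"]) {
  console.log(p, isValidByExclusion(p));
}
```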

sth
A pathname consisting of a single slash is a valid pathname.
Pointy
+1  A: 

I think this will do it:

^(?:\/$|(?:\/[^/]+)+$)

That says to accept any string that's either just a /, or any string formed from a sequence of one or more repetitions of a / followed by one or more non-/ characters.

This uses all greedy quantifiers so it should be fast; also, for performance, the ^ anchor is factored out.

That's a Javascript regex. I'm not a PHP programmer so the main thing I don't know is whether the non-capturing group syntax works in PHP. Also, I'm not sure how you'd handle "quoting" the slash characters.
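
Here is a quick sketch of how that could be used from JavaScript, with the groups written as (?:...) non-capturing groups and the slash inside the character class escaped to be safe in a regex literal. The names pathRegex and matchesPath are made up for the example. As far as I know, PCRE (which preg_match uses) accepts the same (?:...) syntax, but I have only tried the JavaScript version.

```javascript
// Either a lone slash, or one or more "/component" repetitions.
const pathRegex = /^(?:\/$|(?:\/[^\/]+)+$)/;

// Hypothetical helper around RegExp.prototype.test.
function matchesPath(path) {
  return pathRegex.test(path);
}

const checks = ["/", "/path/to", "/path/to/somewhere/else", "/path//to", "/path/to/", "path/to"];
for (const p of checks) {
  console.log(p, matchesPath(p));
}
```

Because every component must start with a slash and contain at least one non-slash character, "//" and a trailing "/" fall out as invalid without any extra rules.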

Pointy
+1  A: 

A Solution for PHP:

    $lines =  array(
        "/path/to/someWhere",
        "/path/tø/sömewhere",
        "/path/to//somewhere",
        "path/to/somewhere",
        "/path/to/somewhere/",
    );

    foreach($lines as $line){
        var_dump(preg_match('#^(/[^/]+)+$#',$line)); // dumps int(1) int(1) int(0) int(0) int(0) 
    }
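
Since the question also asks about JavaScript .match, the same pattern might be used like this. It is only a sketch: the helper name isPath is invented, and the slashes are escaped because JavaScript regex literals use / as the delimiter. Like the PHP version, it does not accept a lone "/".

```javascript
// Same pattern as the PHP example, written as a JavaScript regex literal.
const pattern = /^(\/[^\/]+)+$/;

// Hypothetical helper: String.prototype.match returns null when nothing matches.
function isPath(line) {
  return line.match(pattern) !== null;
}

const lines = [
  "/path/to/someWhere",
  "/path/tø/sömewhere",
  "/path/to//somewhere",
  "path/to/somewhere",
  "/path/to/somewhere/",
];

for (const line of lines) {
  console.log(line, isPath(line));
}
```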
Hannes
That pattern does not match "/", a single slash, which is a valid pathname.
Pointy
^(/[^/]+)+$ was exactly what I was looking for, so simple when you see it. I forgot to say that I did not want to validate only / as this root level is kind of a directory. This regexp is perfect for my needs. Thanks
Tirithen
np, repetition rules, repetition rules
Hannes
A: 

It's not regex, but it works just as well.

str_replace('//', '/', $file)
AutoSponge
This doesn't do any of the 3 checks required, it only replaces on one condition.
I'm only suggesting to use your normal RegEx with str_replace instead of going nuts with the RegEx and making a suboptimal token disaster. Given the RegEx in the OP, yes it does do the required.
AutoSponge