I want to be able to parse file paths like this one:

 /var/www/index.(htm|html|php|shtml)

into an ordered array:

 array("htm", "html", "php", "shtml")

and then produce a list of alternatives:

/var/www/index.htm
/var/www/index.html
/var/www/index.php
/var/www/index.shtml

Right now, I have a preg_match statement that can split two alternatives:

 preg_match_all ("/\(([^)]*)\|([^)]*)\)/", $path_resource, $matches);

Could somebody give me a pointer on how to extend this to accept an unlimited number of alternatives (at least two)? I only need help with the regular expression; the rest I can deal with.

The rule is:

  • The list needs to start with a ( and close with a )

  • There must be at least one | in the list (i.e. at least two alternatives)

  • Any other occurrence(s) of ( or ) are to remain untouched.

Update: I need to be able to also deal with multiple bracket pairs such as:

 /var/(www|www2)/index.(htm|html|php|shtml)

sorry I didn't say that straight away.

Update 2: If you're looking to do what I'm trying to do in the filesystem, then note that glob() already provides this functionality out of the box. There is no need to implement a custom solution. See @Gordon's answer below for details.

+4  A: 

Not exactly what you are asking, but what's wrong with just taking what you have to get the list (ignoring the |s), putting it into a variable and then exploding it on the |s? That would give you an array of however many items there were (including 1 if there wasn't a | present).
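
A minimal sketch of that idea, assuming the bracket group contains no nested parentheses. The function name `split_alternatives` is made up for illustration:

```php
<?php
// Hypothetical helper: returns the alternatives of the first "(...)" group
// in $path, or an empty array if there is no such group.
function split_alternatives($path)
{
    // Capture everything between "(" and the next ")".
    if (!preg_match('/\(([^()]*)\)/', $path, $m)) {
        return array();
    }

    // explode() yields a single element when no "|" is present.
    return explode('|', $m[1]);
}

print_r(split_alternatives('/var/www/index.(htm|html|php|shtml)'));
```

The "at least two alternatives" rule is not enforced here; the caller would have to check the size of the returned array.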

Blair McMillan
True, good point. Trying that out now.
Pekka
+5  A: 

I think you're looking for:

/\(([^|]+)(\|([^|]+))+\)/

Basically, put the splitter '|' into a repeating pattern.

Also, your words should be made up of 'not pipes' instead of 'not parens', per your third requirement.

Also, prefer + to * for this problem. + means 'at least one'. * means 'zero or more'.
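
One thing to keep in mind: a repeated capturing group only retains its last repetition, so it may be easier to capture the whole list and split it afterwards. The following is a sketch, not a drop-in solution. It assumes alternatives contain neither parentheses nor pipes, and `find_alternative_groups` is a hypothetical name:

```php
<?php
// Hypothetical helper: returns one array of alternatives per "(a|b|...)" group.
function find_alternative_groups($path)
{
    // "(" + first word + one or more "|word" + ")". The "+" on the
    // non-capturing group enforces at least two alternatives.
    preg_match_all('/\(([^()|]+(?:\|[^()|]+)+)\)/', $path, $matches);

    $groups = array();
    foreach ($matches[1] as $list) {
        $groups[] = explode('|', $list);
    }

    return $groups;
}

print_r(find_alternative_groups('/var/(www|www2)/(ignore)/index.(htm|html)'));
```

Groups without a pipe, such as `(ignore)`, are skipped because the repeated part must match at least once.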

CWF
Cheers @CWF, this is exactly what I asked for. I've run out of votes for today, otherwise I'd +1. I will look into this some more tomorrow, I'm not yet sure how to build the variation strings, I may need a preg_match_callback - will try. Anyway, thanks a lot already for the repeating pattern.
Pekka
+3  A: 

Non-regex solution :)

<?php

$test = '/var/www/index.(htm|html|php|shtml)';

/**
 *
 * @param string $str "/var/www/index.(htm|html|php|shtml)"
 * @return array "/var/www/index.htm", "/var/www/index.php", etc
 */
function expand_bracket_pair($str)
{
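    // Only get the very last "(" and ignore all others.
    $bracketStartPos = strrpos($str, '(');
    $bracketEndPos = strrpos($str, ')');

    // Nothing left to expand: return the input unchanged.
    if (false === $bracketStartPos || false === $bracketEndPos)
    {
        return array($str);
    }

    // Split on "|".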
    $exts = substr($str, $bracketStartPos, $bracketEndPos - $bracketStartPos);
    $exts = trim($exts, '()|');
    $exts = explode('|', $exts);

    // List all possible file names.
    $names = array();

    $prefix = substr($str, 0, $bracketStartPos);
    $affix = substr($str, $bracketEndPos + 1);
    foreach ($exts as $ext)
    {
        $names[] = "{$prefix}{$ext}{$affix}";
    }

    return $names;
}

function expand_filenames($input)
{
    $nbBrackets = substr_count($input, '(');

    // Start with the last pair.
    $sets = expand_bracket_pair($input);

    // Now work backwards and recurse for each generated filename set.
    for ($i = 0; $i < $nbBrackets; $i++)
    {
        foreach ($sets as $k => $set)
        {
            $sets = array_merge(
                $sets,
                expand_bracket_pair($set)
            );
        }
    }

    // Clean up.
    foreach ($sets as $k => $set)
    {
        if (false !== strpos($set, '('))
        {
            unset($sets[$k]);
        }
    }
    $sets = array_unique($sets);
    sort($sets);

    return $sets;
}

var_dump(expand_filenames('/(a|b)/var/(www|www2)/index.(htm|html|php|shtml)'));
Coronatus
Very nice work - Kudos to you. *But* it can't deal with multiple bracket pairs as I did *not* mention in my question - I will correct that straight away - but *did* in my challenge to you. :) I think this approach is hard to extend so it can deal with multiple bracket pairs. Or am I mistaken?
Pekka
Okay, I'm convinced. I will split the multiple bracket pairs using a simple regex, and then run your function on them. This works too nicely not to use :)
Pekka
Does multiple bracket pairs mean like `(html|php(4|5))` ? I'm not sure I understand but will update the code if you can confirm this. The code currently only matches the very last bracket pair.
Coronatus
@Coronatus see my update, there's an example there. If you want, feel free to try out whether that can be achieved as well - it will be useful to me, but I can work with this already.
Pekka
Fixed to do unlimited pairs of brackets.
Coronatus
+2  A: 

Maybe I'm still not getting the question, but my assumption is you are running through the filesystem until you hit one of the files, in which case you could do

$files = glob("$path/index.{htm,html,php,shtml}", GLOB_BRACE);

The resulting array will contain any file matching your extensions in $path or none. If you need to include files by a specific extension order, you can foreach over the array with an ordered list of extensions, e.g.

foreach(array('htm','html','php','shtml') as $ext) {
    foreach($files as $file) {
        if(pathinfo($file, PATHINFO_EXTENSION) === $ext) {
            // do something
        }
    }
}

Edit: and yes, you can have multiple curly braces in glob.
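
If the pattern arrives in the `(a|b)` notation from the question, it could be rewritten into brace syntax first. This is a rough sketch: `to_brace_pattern` is a hypothetical helper, it assumes the alternatives contain no commas, braces or parentheses, and any other glob metacharacters in the path (such as `*` or `[`) would still be interpreted by glob().

```php
<?php
// Hypothetical helper: turns "(a|b)" groups into "{a,b}" for glob().
function to_brace_pattern($pattern)
{
    // Rewrite only groups that contain at least one "|".
    return preg_replace_callback(
        '/\(([^()|]+(?:\|[^()|]+)+)\)/',
        function ($m) {
            return '{' . str_replace('|', ',', $m[1]) . '}';
        },
        $pattern
    );
}

$brace = to_brace_pattern('/var/(www|www2)/index.(htm|html|php|shtml)');
echo $brace, "\n";

// GLOB_BRACE is undefined on platforms that lack it, so check first.
$files = defined('GLOB_BRACE') ? glob($brace, GLOB_BRACE) : array();
```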

Gordon
It was *that* easy. Thanks Gordon. I had no idea Glob could do such things. I can't in good conscience unaccept the answer given, as I was asking specifically for how to parse the string, but I'll put a note about your answer into the question.
Pekka
For future reference, more info on `GLOB_BRACE`, with examples, here: http://de.php.net/manual/en/function.glob.php#88250
Pekka
Minor caveat: `GLOB_BRACE` is not available on some non-GNU systems, including Solaris (but is supported on Windows). I'll try to find out which ones exactly: http://stackoverflow.com/questions/2536924/glob-brace-portability
Pekka
+1  A: 

The answer is given, but it's a funny puzzle and I just couldn't resist

function expand_filenames2($str) {
    $r = array($str);
    $n = 0;
    while(preg_match('~(.*?) \( ( \w+ \| [\w|]+ ) \) (.*) ~x', $r[$n++], $m)) {
        foreach(explode('|', $m[2]) as $e)
            $r[] = $m[1] . $e . $m[3];
    }
    return array_slice($r, $n - 1);
}  



print_r(expand_filenames2('/(a|b)/var/(ignore)/(www|www2)/index.(htm|html|php|shtml)!'));

maybe this explains a bit why we like regexps that much ;)
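
For file names that contain dots or dashes inside the alternatives, the same loop might be used with a wider character class. This is only a sketch of a variation: `expand_filenames3` is a hypothetical name, and the pattern assumes alternatives never contain parentheses or pipes.

```php
<?php
// Hypothetical variant: same queue-based loop, but an alternative may be
// any run of characters other than "(", ")" and "|".
function expand_filenames3($str)
{
    $queue = array($str);
    $n = 0;

    // Expand the first "(a|b|...)" group of each queued string in turn.
    // The loop stops at the first string that has no such group left.
    while (preg_match('~^(.*?)\(([^()|]+(?:\|[^()|]+)+)\)(.*)$~', $queue[$n++], $m)) {
        foreach (explode('|', $m[2]) as $alternative) {
            $queue[] = $m[1] . $alternative . $m[3];
        }
    }

    // Everything from the first fully expanded string onwards is a result.
    return array_slice($queue, $n - 1);
}

print_r(expand_filenames3('/var/(www|www2)/index.(htm|html)'));
```

Because strings are processed in the order they were queued, all strings that still contain a group sit in front of the fully expanded ones, which is why slicing at the first non-matching entry is enough.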

stereofrog
@stereofrog sweet!!! +1.
Pekka
@stereofrog however, the `\w` would need to be expanded to something like `\w\d.` to match any conceivable (standard) file name.
Pekka