I am trying to parse my URL into different parts using this RegExp:

([\w\\.-]*)

Given the example URL http://www.foo.com/bar/baz, I get these results from preg_match_all():

Array
(
    [0] => Array
        (
            [0] => http
            [1] => 
            [2] => 
            [3] => 
            [4] => www.foo.com
            [5] => 
            [6] => bar
            [7] => 
            [8] => baz
            [9] => 
        )

)

It seems that it parses any invalid character into an empty item.
How do I solve this?

A: 

Are you sure you want \\. ?

In other words, from what you've posted, it looks like you've escaped a backslash instead of the period, as you likely intended. EDIT: For tidiness, there's no harm in removing the redundant escaping, but this isn't the actual problem [as pointed out by blixt, thanks].

I highly recommend The Regulator as a regex debugging tool. [It's based on .NET regexes, so it isn't ideal for PHP work, but the general point stands: there are tools that let you see exactly how a pattern is matching.]

I still don't understand what you want with the backslashes in the character class. Can you post the final regex you use in the question, please? And sorry for the distraction this answer has been!

EDIT: As blixt pointed out, a period inside a character class doesn't act as a metacharacter, contrary to what I suggested.
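
To illustrate that correction, here is a minimal PHP sketch, assuming PCRE via preg_match_all() as in the question. The sample string 'a.b' is made up for illustration and is not from the original post.

```php
<?php
// Hypothetical sample input, not from the question.
$subject = 'a.b';

// Inside [...] a period is a literal, with or without a backslash.
preg_match_all('/[.]/', $subject, $plain);
preg_match_all('/[\.]/', $subject, $escaped);

// Outside a class, an unescaped period matches any character (except newline).
preg_match_all('/./', $subject, $any);

var_dump($plain[0] === $escaped[0]); // bool(true): both are ['.']
var_dump(count($any[0]));            // int(3): 'a', '.', 'b'
```

So escaping the period inside the brackets is harmless but redundant.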

Ruben Bartelink
Yeah, this is probably the problem.
Paul McMillan
-1: In character classes, periods have no special meaning.
Blixt
Are you sure that it does so inside brackets?
streetpc
@Blixt, good point.
Ruben Bartelink
+6  A: 

By using * you're capturing empty groups - use + instead:

([\w\.-]+)

I assume the extra \ in your RE is because you have it inside a quoted string.
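
A quick way to check both points, as a rough sketch: it assumes the pattern was written in a single-quoted PHP string (the question doesn't show the surrounding code) and uses the example URL from the question. The variable names are just for illustration.

```php
<?php
$url = 'http://www.foo.com/bar/baz';

// In a single-quoted PHP string, "\\" collapses to one backslash, while "\w"
// is left alone, so both spellings produce the same regex text.
var_dump('([\w\\.-]*)' === '([\w\.-]*)'); // bool(true)

// "*" allows zero-length matches, so positions where the class cannot match
// still produce an empty string.
preg_match_all('/([\w.-]*)/', $url, $star);

// "+" requires at least one character, so no empty matches are produced.
preg_match_all('/([\w.-]+)/', $url, $plus);

var_dump(in_array('', $star[1], true)); // bool(true)
echo implode(', ', $plus[1]), "\n";     // http, www.foo.com, bar, baz
```
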

Greg
Re explaining the \... If that is the case, why isn't there one on the \w ?
Ruben Bartelink
+1 `*` will match any count of the preceding expression (including 0), while `+` is "1 or more"
streetpc
+1: When using `*`, the character class will match 0 or more times. That means that even if the character class fails, the expression will match an empty string. That's why `:`, `/` and `/` after `http` matched as three empty strings.
Blixt
The regex engine avoids an infinite loop after matching an empty string by advancing one character.
Aftershock
@Greg: Thanks a lot
the_drow
A: 

This may do what you want: ([\w.-]+|.) This will match all parts of the address.
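
For reference, a small sketch of what that pattern produces on the URL from the question, assuming preg_match_all() is used as in the original post:

```php
<?php
$url = 'http://www.foo.com/bar/baz';

// The first alternative grabs runs of word characters, periods and dashes;
// the second alternative picks up any other single character.
preg_match_all('/([\w.-]+|.)/', $url, $matches);

echo implode(' | ', $matches[1]), "\n";
// http | : | / | / | www.foo.com | / | bar | / | baz
```

Every character of the input ends up in some match, so the separators appear as their own items.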

Aftershock
I think all he wants to do is to match any string with one or more letters, periods or dashes in it. So the appropriate regex would be: `([\w.-]+)` Adding `|.` would stop the strings from being empty (they would be `":"`, `"/"`, `"/"`, etc.), but they would still be there.
Blixt
He wants to break up the URL into parts. That is what I read.
Aftershock
True, although since he says the empty strings are for "invalid characters", I assumed he didn't want them included in the list of matches.
Blixt
Another conclusion could be that all he wants is the domain.
Aftershock
Correct. The accepted answer provides the right results.
the_drow
No, because it matches the directory names too.
Aftershock
@Aftershock: What you suggest is not what I need.
the_drow