I have a space-delimited list of file names, where spaces within a file name are escaped with a backslash ('\').

e.g. "first\ file second\ file"

How can I get my regex to match each file name?

+6  A: 
(\\ |[^ ])+

Everything except spaces, except when they're escaped. Should work, sorry for misunderstanding your question initially.
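
For illustration, here is a minimal Python sketch of applying this pattern to the example from the question. The choice of Python and the names `split_file_names` and `unescape` are assumptions made for the example, not part of the answer. Because the group is repeated, the sketch reads the whole match rather than group 1.

```python
import re

# The answer's pattern: an escaped space or any single non-space character, repeated.
FILE_NAME = re.compile(r"(\\ |[^ ])+")

def split_file_names(text):
    """Return the file names in text, with the backslash escapes still in place."""
    # group(0) is the whole match; group(1) would hold only the last repetition.
    return [m.group(0) for m in FILE_NAME.finditer(text)]

def unescape(name):
    """Hypothetical helper: turn each escaped space back into a plain space."""
    return name.replace("\\ ", " ")

names = split_file_names(r"first\ file second\ file")
print([unescape(n) for n in names])  # ['first file', 'second file']
```

Note that `[^ ]` excludes only the space character, so a tab is treated as part of a file name.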

Stefan Mai
Shouldn't that be "\\"?
Aaron Digulla
This will match an empty line as well as a[TAB]b.
phihag
Tomalak
If you are using this regex in .NET, be sure not to turn on RegexOptions.IgnorePatternWhitespace.
Martin Brown
`(\\[ ]|[^ ])+` might get around IgnorePatternWhitespace
Brad Gilbert
+4  A: 
(\S|(?<=\\) )+

Explanation:

You are looking for either a non-whitespace character (\S) or a space preceded by a backslash, repeated one or more times.

Each file name is one overall match (match group 1 only retains the last repetition); apply the pattern globally to get all matches in the string.

EDIT

Thinking about it, you would not even need capturing to a sub-group. The match alone will be enough, so this could be a tiny bit more efficient (the ?: switches to a non-capturing group):

(?:\S|(?<=\\) )+
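
As a rough sketch of how the non-capturing form behaves, the following Python snippet applies it globally with `re.findall`. Python and the function name `split_file_names` are assumptions for the example; any engine with lookbehind support should behave similarly.

```python
import re

# Non-capturing form from the answer: a non-whitespace character,
# or a space that is directly preceded by a backslash.
FILE_NAME = re.compile(r"(?:\S|(?<=\\) )+")

def split_file_names(text):
    """Return every match; with no capturing group, findall yields whole matches."""
    return FILE_NAME.findall(text)

print(split_file_names(r"first\ file second\ file"))
print(split_file_names("a\tb"))
```

Since `\S` rejects all whitespace, a tab separates names here just as an unescaped space does.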
Tomalak
I picked your answer at first because of the comment about Stefan's matching a[TAB]b, but yours does too. Both answers are great and solve my problem, but to be fair, Stefan's was earlier.
David Sykes
I changed the pattern so it does not match TAB or other white space anymore.
Tomalak
+1  A: 

I would do it like this:

/[^ \\]*(?:\\ [^\\ ]*)*/

This is Friedl's "unrolled loop" idiom. There will probably be very few escaped spaces in the target string relative to the other characters, so you gobble up as many of the other characters as you can each time you get a chance. This is much more efficient than an alternation matching one character at a time.
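
A small Python sketch of the unrolled-loop pattern in use, assuming Python's `re` module and a hypothetical function name `split_file_names`. Every part of the pattern is optional, so it can also match the empty string; the sketch simply discards empty matches.

```python
import re

# Unrolled loop: a run of ordinary characters, then zero or more
# (escaped space + run of ordinary characters) repetitions.
FILE_NAME = re.compile(r"[^ \\]*(?:\\ [^\\ ]*)*")

def split_file_names(text):
    """Return the non-empty matches; the pattern also matches empty strings."""
    return [name for name in FILE_NAME.findall(text) if name]

print(split_file_names(r"first\ file second\ file"))
```

One consequence of allowing a backslash only before a space: a name such as `ab\cd` is split at the backslash into `ab` and `cd`.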

Edit: (Tomalak) I put slashes around the regex because the syntax highlighter seems to recognize them and paints the whole regex in one color. Without them, it can pick up on other characters, like quotation marks, and incorrectly (and confusingly) paint parts of the regex in different colors.

(Brad) The OP only mentioned spaces, so I only allowed for quoting them, but you're right. The original unrolled-loop example in the book was for double-quoted strings, which may contain any of several escape sequences, one of which is an escaped quotation mark. Here's the regex:

/"[^\\"]*(?:\\.[^\\"]*)*"/
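
To show the double-quoted variant at work, here is a short Python sketch. The sample text and the function name `find_quoted` are made up for the example.

```python
import re

# Unrolled loop for double-quoted strings: ordinary characters,
# then zero or more (backslash + any character + ordinary characters).
QUOTED = re.compile(r'"[^\\"]*(?:\\.[^\\"]*)*"')

def find_quoted(text):
    """Return each double-quoted string, including its quotes and escapes."""
    return QUOTED.findall(text)

sample = r'say "a \"quoted\" word" and "plain"'
for literal in find_quoted(sample):
    print(literal)
```

The escaped quotation marks inside the first literal do not end the match, because `\\.` consumes the backslash together with the character after it.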

(Tomalak) I don't know what you mean when you say that it doesn't match "the file name at the start of the string." It seems to match both of the file names in the OP's example. However, it also matches an empty string, which isn't good. That can be fixed, but unless efficiency is proved to be a problem, it isn't worth the effort. Stefan's solution works fine.

Alan Moore
Actually, in Java I would write a more straightforward regex using possessive quantifiers and/or atomic groups. But the unrolled loop will work in any language.
Alan Moore
This does not match the file name at the start of the string.
Tomalak
`(?:\\ [^ ]*?)*` might be better, if for example 'ab\cd' would be a valid match.
Brad Gilbert