I would like to build a regexp in Java that would be passed in a FilenameFilter to filter the files in a dir.

The problem is that I can't get the hang of the regexp "mental model" :)

This is the regexp that I came up with to select the files that I would like to exclude

((ABC|XYZ))+\w*Test.xml

What I would like to do is to select all the files that end with Test.xml but do not start with ABC or XYZ.

Could you please also point me to any resources that could help me in my battle with regexps?

Thanks

The following resource explains a lot about regexps: regular-expressions.info

+7  A: 

This stuff is easier, faster and more readable without regexes.

if (str.endsWith("Test.xml") && !str.startsWith("ABC") && !str.startsWith("XYZ"))
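
A minimal sketch of how that check could be wrapped in a FilenameFilter, with both prefixes from the question excluded. The class name TestXmlFilter and the choice of the current directory as a default are assumptions made for illustration, not something from the question.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical class name; the question does not name its filter.
class TestXmlFilter implements FilenameFilter {
    @Override
    public boolean accept(File dir, String name) {
        // Keep names ending in Test.xml unless they start with ABC or XYZ.
        return name.endsWith("Test.xml")
                && !name.startsWith("ABC")
                && !name.startsWith("XYZ");
    }

    public static void main(String[] args) {
        // Assumption: list the directory given as first argument, else ".".
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list(new TestXmlFilter());
        if (names != null) { // list() returns null if dir is not a readable directory
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}
```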
Yoni Roit
+1  A: 

Just for the fun of the regex:

(?ms)^([^\r\n]{3}(?<!ABC|XYZ)[^\r\n]*?)?Test\.xml$

Even if this is not the most readable solution, it should work, and it would spare you from defining your own custom file filter.

(?<!ABC|XYZ) is a negative look-behind. Placed right after the first three characters, it rejects the match when those three characters are ABC or XYZ.
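
To see the look-behind in isolation, here is a small sketch that applies only the "three characters, then look back" part to a few names. The class and method names are hypothetical, and the pattern is a simplified fragment of the regex above, not the full expression.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; isolates the look-behind part of the regex.
class LookBehindDemo {
    // Consume three characters from the start, then look back over them:
    // the match fails if those three characters were ABC or XYZ.
    static final Pattern PREFIX_OK = Pattern.compile("^.{3}(?<!ABC|XYZ)");

    static boolean prefixAllowed(String name) {
        // find() is enough here because the pattern is anchored with ^.
        return PREFIX_OK.matcher(name).find();
    }

    public static void main(String[] args) {
        String[] names = {"ABC_foo_Test.xml", "XYZTest.xml", "DEF_foo_Test.xml"};
        for (String name : names) {
            System.out.println(name + " -> " + prefixAllowed(name));
        }
    }
}
```

Note that a name shorter than three characters never reaches the look-behind at all, which is why the full regex wraps this part in an optional group.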

VonC
This does not work for "DEFTest.xml" or "Test.xml".
Tomalak
@Tomalak: Thank you, I just fixed my regex, and +1 to yours (even though I prefer [^\r\n] to '.').
VonC
But in this case, filenames cannot contain line breaks. And besides, in normal mode the dot does not match newline characters, so using [^\r\n] seems like overkill to me.
Tomalak
Agreed. I just had some bad experiences with '.' in the past (from presuming my entries would not contain any newline). This way, I avoid any surprise.
VonC
Whether I use ".", "[^\r\n]" or single-line mode depends on the situation, I don't generalize here. But *always* being explicit is also fine. :) I think the bottom line is: regex is really bad at matching "everything except this or that". Positive matching is easier to accomplish in any case.
Tomalak
+2  A: 

What I would like to do is to select all the files that end with Test.xml but do not start with ABC or XYZ.

Either you match all your files with this regex:

^(?:(?:...)(?<!ABC|XYZ).*?)?Test\.xml$

or you do the opposite, and take every file that does not match:

^(?:ABC|XYZ).*?Test\.xml$

Personally, I find the second alternative much simpler.

ABC_foo_Test.xml   // #2 matches
XYZ_foo_Test.xml   // #2 matches
ABCTest.xml        // #2 matches 
XYZTest.xml        // #2 matches
DEF_foo_Test.xml   // #1 matches
DEFTest.xml        // #1 matches
Test.xml           // #1 matches
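
A rough Java sketch for checking both alternatives against the names listed above. The class and method names are hypothetical, and for #2 the sketch assumes the name is separately required to end in Test.xml, since "does not match #2" on its own would also let through unrelated files.

```java
import java.util.regex.Pattern;

// Hypothetical class; compares the two regexes from this answer.
class RegexAlternatives {
    // #1: matches the names to keep.
    static final Pattern KEEP =
            Pattern.compile("^(?:(?:...)(?<!ABC|XYZ).*?)?Test\\.xml$");
    // #2: matches the names to exclude.
    static final Pattern EXCLUDE =
            Pattern.compile("^(?:ABC|XYZ).*?Test\\.xml$");

    static boolean keptByFirst(String name) {
        return KEEP.matcher(name).matches();
    }

    static boolean keptBySecond(String name) {
        // Assumption: the suffix is checked separately, then #2 is negated.
        return name.endsWith("Test.xml") && !EXCLUDE.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "ABC_foo_Test.xml", "XYZ_foo_Test.xml", "ABCTest.xml", "XYZTest.xml",
            "DEF_foo_Test.xml", "DEFTest.xml", "Test.xml"
        };
        for (String name : names) {
            System.out.println(name + "  #1 keeps: " + keptByFirst(name)
                    + "  #2 keeps: " + keptBySecond(name));
        }
    }
}
```

One thing this kind of check makes visible: because #1 consumes exactly three characters before the look-behind, a name with a one- or two-character prefix such as ABTest.xml does not match #1, while the negated #2 still keeps it.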
Tomalak
+1 for your two regexes, I fixed mine (see comments)
VonC
A: 

This will select files that do not begin with A, B, C, X, Y, or Z, and that end in Test.xml:

"[^ABCXYZ].*Test\\.xml\\z"

  • [^ABCXYZ]: Any character not in the set A, B, C, X, Y, Z.
  • .*: Any character, zero or more times
  • Test: The exact text "Test"
  • \\.: The dot character (need to escape using backslash, and if you're in a string, that backslash needs to be escaped... by a backslash!)
  • xml: The exact text "xml"
  • \\z: The end of the input
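
The escaping point can be illustrated with a short sketch: the Java string literal "Test\\.xml" holds the nine-character regex Test\.xml, whereas an unescaped dot matches any character. The class and method names are hypothetical and only cover the dot-escaping part of the explanation above.

```java
// Hypothetical demo class for the backslash-escaping point only.
class EscapeDemo {
    // Regex with an unescaped dot: the dot matches any single character.
    static boolean looseMatch(String name) {
        return name.matches("Test.xml");
    }

    // Regex with an escaped dot: in Java source the backslash is doubled,
    // so the regex engine receives Test\.xml and requires a literal dot.
    static boolean strictMatch(String name) {
        return name.matches("Test\\.xml");
    }

    public static void main(String[] args) {
        // The literal contains one backslash, so this prints Test\.xml
        System.out.println("Test\\.xml");
        System.out.println(looseMatch("Testaxml"));
        System.out.println(strictMatch("Testaxml"));
        System.out.println(strictMatch("Test.xml"));
    }
}
```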
That was not the question, I'm afraid. This does not match "ACD_Test.xml", though it should. Also, the double backslashes are not part of the regex itself; they are a programming language requirement.
Tomalak
The OP did say this is a Java regex, and in Java string literals, backslashes in regex escape sequences have to be doubled. However, the negated character class at the beginning is definitely wrong.
Alan Moore
A: 

The regexes provided by Tomalak and VonC are more complicated than they need to be. Putting a negative lookahead at the beginning of the regex is much clearer than matching three characters and doing a negative lookbehind. And if you use the matches() method, you don't even have to use anchors (^, $, \z).

public boolean accept(File dir, String name) {
    return name.matches("(?!ABC|XYZ).*Test\\.xml");
}
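
As a rough sketch of how the accept method above might sit inside a complete filter: the class name LookaheadFilter, the precompiled Pattern, and the directory-listing main method are assumptions added for illustration. Precompiling is optional; name.matches(...) as written above behaves the same but recompiles the regex on every call.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.regex.Pattern;

// Hypothetical class name; wraps the lookahead regex in a FilenameFilter.
class LookaheadFilter implements FilenameFilter {
    // (?!ABC|XYZ) fails at the start if the name begins with ABC or XYZ.
    private static final Pattern WANTED =
            Pattern.compile("(?!ABC|XYZ).*Test\\.xml");

    @Override
    public boolean accept(File dir, String name) {
        // matches() must consume the whole name, so no anchors are needed.
        return WANTED.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Assumption: list the directory given as first argument, else ".".
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list(new LookaheadFilter());
        if (names == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```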
Alan Moore