Hello,

Before asking this question I Googled the problem and looked through all the related StackOverflow questions.

The problem is pretty simple.

I have a string "North Atlantic Treaty Organization".

I have a pattern "a.*z". At the moment it matches:

north ATLANTIC TREATY ORGANIZation

But I need it to match complete words only (orgANIZation, for example).

I have tried "\ba*z\b" and "\Ba*z\B" as patterns, but I don't think I quite get it.

How should I change my pattern so that it matches complete words that the string contains (without matching across multiple words)?

The patterns are generated on the fly: the user enters a*z and my application translates it into a pattern that matches parts of complete words in the string.

My problem is that I don't know what the user is going to search for. Ideally I would prepend some regexp to the user's expression.

Thank You!

+4  A: 

ANIZ in orgANIZation is not a complete word -- it's a part of a word. Your pattern btw is not what you wrote -- a*z would not match as you describe; you're probably using a.*z instead, which would. So, try a[^ ]*z so it won't match spaces. If there are other characters besides spaces that you don't want to match, e.g. some kinds of punctuation, stick them in the [^...] construct as well, of course.
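
The difference between the two patterns can be checked directly. This is a minimal sketch in Python rather than the asker's own code (the two patterns are written the same way there), and it assumes case-insensitive matching, which the question's example output suggests:

```python
import re

text = "North Atlantic Treaty Organization"

# Greedy ".*" is free to cross spaces, so the match spans three words.
greedy = re.search(r"a.*z", text, re.IGNORECASE).group()

# "[^ ]*" cannot cross a space, so the match stays inside one word.
bounded = re.findall(r"a[^ ]*z", text, re.IGNORECASE)

print(greedy)   # Atlantic Treaty Organiz
print(bounded)  # ['aniz']
```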

Alex Martelli
You are right. Please look at my edit...Thank You :)
Daniil Harik
+3  A: 
"a[^\s]*z"

This means an 'a' followed by any number of non-whitespace characters, followed by a 'z'.

EDIT: You seem to want '*' to be interpreted as a wildcard character. The user is thus not to enter a regex, but a string with certain wildcards. You can translate these wildcard characters to regex by reasoning over the intended meaning. Let's say that '*' should mean "zero or more characters that are not whitespace". You replace this character, then, with the corresponding regex:

                       [^\s]*
                       `-.-´|
     Character class-----´  `---Zero or more of these

     '\s': "Whitespace"
     Inside Character class: if it starts with '^': "not"

You might also want to define '?' as matching exactly a single non-whitespace character. This is the same character class, but you omit the '*' at the end.

So, what you do is regex-replace "*" with "[^\s]*" and "?" with "[^\s]".
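
The replacement described above could be sketched as follows. This is an illustration in Python, not the asker's application code; the function name `wildcard_to_regex` is hypothetical. It also escapes every other character, which the answer does not mention but which is assumed to be wanted here so that input such as "a.z" is treated literally (in .NET, `Regex.Escape` plays the same role as `re.escape`):

```python
import re

def wildcard_to_regex(user_input):
    """Translate '*' and '?' wildcards into a regex; escape everything else."""
    parts = []
    for ch in user_input:
        if ch == "*":
            parts.append(r"[^\s]*")      # zero or more non-whitespace characters
        elif ch == "?":
            parts.append(r"[^\s]")       # exactly one non-whitespace character
        else:
            parts.append(re.escape(ch))  # literal, even if it is a regex metacharacter
    return "".join(parts)

pattern = wildcard_to_regex("a*z")
matches = re.findall(pattern, "North Atlantic Treaty Organization", re.IGNORECASE)
print(pattern)  # a[^\s]*z
print(matches)  # ['aniz']
```

Translating character by character, rather than running two global replacements, avoids escaping the wildcards themselves or re-processing text that was already translated.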

Svante
Please look at my edit..Thank You
Daniil Harik
+1  A: 

That is what you are looking for:

new Regex( @"\b[^ ]*a[^ ]*z[^ ]*\b" );

It matches only a single word (no spaces are allowed), but the whole one. You can translate your user's input into such a regex: just replace * with [^ ]*. It works even with more than one wildcard.
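
A rough sketch of that translation, written in Python for illustration (the pattern syntax is the same). The helper name `whole_word_regex` is hypothetical, and the literal pieces are escaped as an extra precaution that the answer itself does not describe:

```python
import re

def whole_word_regex(wildcard):
    """Wrap a '*' wildcard term so the match covers the entire word."""
    # Escape the literal pieces, then join them with the "[^ ]*" wildcard.
    body = "[^ ]*".join(re.escape(piece) for piece in wildcard.split("*"))
    return r"\b[^ ]*" + body + r"[^ ]*\b"

text = "North Atlantic Treaty Organization"
one = re.findall(whole_word_regex("a*z"), text, re.IGNORECASE)
two = re.findall(whole_word_regex("o*a*n"), text, re.IGNORECASE)
print(one)  # ['Organization']
print(two)  # ['Organization']
```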

tanascius
+1  A: 

Not related to your question directly, but you may want to check out a RegEx visualization tool which shows you the captured results based on text input and a given regular expression.

Such a tool is very helpful for finding the right pattern, which can be quite tricky. A nice tool specialized for .NET RegEx is RegExLab; it is a bit older but does a good job of showing exactly what your regular expression matches. Since the page is in German, just click on the regexlab.006.zip link. Source code is also included.

Olli
I have been using http://regexplib.com/RETester.aspx , but your tool is easier to use. Thank You.
Daniil Harik
+1  A: 
Regex reWord = new Regex("\\b[A-Za-z]*?(a.*z)[A-Za-z]*\\b");

... this will return "Atlantic Treaty Organization", with the capture from a.*z being "antic Treaty Organiz".

The problem is inherent in your method - unless you parse the user-supplied "regex" of a*z (or a.*z, that's not quite clear from your post) by modifying * to [^\s]*? as Svante suggests (or perhaps \w*?), you're going to gobble up far more characters than you'd like.

".*" is, generally speaking, a bad idea when you're trying to be specific. It'll match everything but a newline, and there's nothing you can append to it that will stop that.

Regex reWord = new Regex("\\b\\w*?(a\\w*?z)\\w*\\b");

...will return just "Organization".
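
To compare the two expressions side by side, here is a small sketch in Python (used only for illustration; both patterns are copied unchanged, and matching is case-sensitive as in the snippets above):

```python
import re

text = "North Atlantic Treaty Organization"

# ".*" inside the group runs across spaces up to the last "z".
wide = re.search(r"\b[A-Za-z]*?(a.*z)[A-Za-z]*\b", text)
# "\w*?" inside the group cannot leave the current word.
narrow = re.search(r"\b\w*?(a\w*?z)\w*\b", text)

print(wide.group(0), "|", wide.group(1))      # Atlantic Treaty Organization | antic Treaty Organiz
print(narrow.group(0), "|", narrow.group(1))  # Organization | aniz
```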

Alternatively, if you absolutely must, for whatever reason, avoid modifying the user-supplied regex, perhaps try splitting your string into an array of words and testing each word individually against the regex.
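
That word-by-word approach might look roughly like this. It is a sketch in Python under two assumptions: words are separated by whitespace, and matching is case-insensitive. The function name `words_matching` is made up for the example:

```python
import re

def words_matching(user_regex, text):
    """Return the words of text in which the unmodified user regex matches."""
    compiled = re.compile(user_regex, re.IGNORECASE)
    # Each word is searched separately, so even ".*" cannot span two words.
    return [word for word in text.split() if compiled.search(word)]

hits = words_matching("a.*z", "North Atlantic Treaty Organization")
print(hits)  # ['Organization']
```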

Ultimately, it's GIGO - garbage in, garbage out. Feed your system a bad regex and if you don't fix it appropriately, you won't get what you're looking for.

patjbs