tags:

views:

800

answers:

5

I would like a regular expression to match given a partial or camel cased string. For example, if the search set contains the string "MyPossibleResultString" I want to be able to match it with the likes of the following:

  • MyPossibleResultString
  • MPRS
  • MPRString
  • MyPosResStr
  • M

I'd also like to include wildcard matching, e.g.:

  • MyP*RString
  • *PosResString
  • My*String

If it's not clear what I mean, the only example I can think of is Eclipse's "Open Type" dialog which is pretty much the exact behaviour I'm looking for. I'm not too up on the use of regexes, so I'm not sure if it matters if I'm looking for a solution in Java.

+1  A: 

It is not possible to do this with a single regular expression. You will have to build a regular expression based on the input and use this to search. It is easy to see that you cannot use a single regex - the user can search for any (cammel cased) string and so your regex needs to match any (cammel cased) string but than it is not a search anymore.

Daniel Brückner
So do you mean, for example, the input string "MyPosRS" would be parsed to generate the regex for "My*Pos*R*S*", which could then be matched on?
Grundlefleck
Yes, you have to generate a regex based on the user input and use this to search. You cannot build one now, because you do not know the input hence not what the regex must match.
Daniel Brückner
Okay, that makes sense. Seems pretty obvious now...+1
Grundlefleck
+2  A: 

As danbruc said, you have to generate a new regex for each new query. This code should do what you want.

public Pattern queryToPattern(String query) {
 StringBuilder sb = new StringBuilder();
 char[] chars = query.toCharArray();
 boolean incamel = false;
 for (int i=0; i < chars.length; i++) {
  if (chars[i] == '*') {
                            if (!incamel)
           sb.append(".*");
  } else if (Character.isUpperCase(chars[i])) {
   if (incamel) {
    sb.append(".*");
   }
   sb.append(chars[i]);
   incamel = true;
  } else {
   sb.append(chars[i]);
  }

 }
 sb.append(".*");
 return Pattern.compile(sb.toString());
}

A query of: MyP*RString

Creates a pattern of: My.* P.* R.* String.*

John Ellinwood
This worked a treat, +1 for now, but I will let time mature to see if another answer is somehow better (in my ignorance I can't really judge this at the moment). Thanks.
Grundlefleck
+2  A: 

Ok so I can't really see why you would need the wildcard feature if you can already support the matching described in the first example. This is what I put together. Given a query string query, you use a regular expression to create a regular expression:

String re = "\\b(" + query.replaceAll("([A-Z][^A-Z]*)", "$1[^A-Z]*") + ".*?)\\b";

For example the query MyPosResStr will become the regex:

\\b(My[^A-Z]*Pos[^A-Z]*Res[^A-Z]*Str[^A-Z]*.*?)\\b

You then use this regex for your matching using the Matcher.find method to get something like this:

public static String matchCamelCase(String query, String str) {
    query = query.replaceAll("\\*", ".*?");
    String re = "\\b(" + query.replaceAll("([A-Z][^A-Z]*)", "$1[^A-Z]*") + ".*?)\\b";

    System.out.println(re);
    Pattern regex = Pattern.compile(re);

    Matcher m = regex.matcher(str);

    if  (m.find()) {
     return m.group();
    } else return null;
}

This will return the first match to your camel case query in the string str.

EDIT: I have added a line to handle wildcards since in my tired stupor I didn't appreciate the need for them

Il-Bhima
The first example does not include the case where I want "*PRS" to match, where "PRS" should not.
Grundlefleck
Sorry, check again
Il-Bhima
I was going to say it doesn't work for a few patterns before you edited, now the only example that doesn't match is the likes of 'M'/'MP'/'MPR' without the explicit dangling wildcard. I'm sure it's just a small factor, I do like that it's more concise (I'm sure I'll be able to read it soon :-D) +1
Grundlefleck
Thank for catching that! You are correct, thats another case I missed. Coming right up.
Il-Bhima
A: 

You could try something like:

class RegexTransformer {
    public String fromQuery(String query) {
        StringBuilder sb = new StringBuilder();
        sb.append("^");
        sb.append(query.replaceAll("(\\p{Upper}(?!\\p{Lower}))",
                "$1\\\\p{Alpha}*?"));
        sb.append("$");
        return sb.toString();
    }
}

See Pattern API for description of negative lookahead assertions (?!pat), POSIX character classes \p{class}, and reluctant quantifiers *?.

Example test case:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class RegexTransformerTest {

    private RegexTransformer rt = new RegexTransformer();

    @Test
    public void testQueries() {
        String in = "MyPossibleResultString";
        String q1 = "MyPossibleResultString";
        String q2 = "MPRS";
        String q3 = "MPRString";
        String q4 = "MyPosResStr"; // this wont work
        String q5 = "M";

        test(in, q1, "^MyPossibleResultString$");
        test(in, q2, "^M\\p{Alpha}*?P\\p{Alpha}*?R\\p{Alpha}*?S\\p{Alpha}*?$");
        test(in, q3, "^M\\p{Alpha}*?P\\p{Alpha}*?R\\p{Alpha}*?String$");
        test(in, q5, "^M\\p{Alpha}*?$");
    }

    private void test(String in, String query, String expected) {
        assertEquals("transform", expected, rt.fromQuery(query));
        assertTrue("match", in.matches(rt.fromQuery(query)));
    }
}
toolkit
A: 

Il-Bhima's answear is great, but I find that this code works better for me (pardon my C#, but it's the same):

pattern = Regex.Escape(pattern);
pattern = pattern.Replace(@"\*", ".*?");
pattern = Regex.Replace(pattern, "([A-Z][^A-Z]*)", "$1[^A-Z]*?") + ".*";

Notice the ".* " at the end that allows incomplete "startof" phrases (also allows not specifying all of the capital letters) Also, the asterisk after the "[^A-Z]*" matcher fixes problems like q4 in toolkit's answer where small letters were provided after the capital ones (they should appear directly after the capital letter, not before the next one).

ewino