Hi, I am writing a program that parses text with a regular expression. The regular expression should be obtained from the user. I decided to use glob syntax for user input and convert the glob string to a regular expression internally. For example:

"foo.? bar*"

should be converted to

"^.*foo\.\w\bar\w+.*"

Somehow I need to escape all regex-meaningful characters in the string, and then replace the glob * and ? characters with the appropriate regexp syntax. What is the most convenient way to do this?

+1  A: 

Jakarta ORO has an implementation in Java.

orip
+3  A: 

Try this link: match globbing patterns against text

Dror
A: 

I wrote my own function, using C++ and boost::regex:

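#include <iterator>
#include <sstream>
#include <string>
#include <boost/algorithm/string.hpp>
#include <boost/regex.hpp>

std::string glob_to_regex(std::string val)
{
    boost::trim(val);
    // Groups: 1 = '*', 2 = '?', 3 = blank, 4 = regex metacharacter to escape ('|' added to the list).
    const char* expression = "(\\*)|(\\?)|([[:blank:]])|(\\.|\\+|\\^|\\$|\\||\\[|\\]|\\(|\\)|\\{|\\}|\\\\)";
    // '*' -> \w+, '?' -> ., blank -> \s*, metacharacter -> same character preceded by a backslash.
    const char* format = "(?1\\\\w+)(?2\\.)(?3\\\\s*)(?4\\\\$&)";
    std::stringstream final;
    final << "^.*";
    std::ostream_iterator<char, char> oi(final);
    boost::regex re;
    re.assign(expression);
    boost::regex_replace(oi, val.begin(), val.end(), re, format, boost::match_default | boost::format_all);
    final << ".*";  // no std::ends here: it would append a stray '\0' to the returned string
    return final.str();
}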

It looks like everything works fine.

Lazin
+1  A: 

I'm not sure I fully understand the requirements. If I assume the users want to find text "entries" that their search matches, then I think this brute-force approach would work as a start.

First escape everything regex-meaningful. Then use non-regex replaces for replacing the (now escaped) glob characters and build the regular expression. Like so in Python:

regexp = re.escape(search_string).replace(r'\?', '.').replace(r'\*', '.*?')

For the search string in the question, this builds a regexp that looks like so (raw):

foo\..\ bar.*?

Used in a Python snippet:

import re

search = "foo.? bar*"
text1 = 'foo bar'
text2 = 'gazonk foo.c bar.m m.bar'

# Escape everything, then turn the (now escaped) glob characters into regex syntax.
searcher = re.compile(re.escape(search).replace(r'\?', '.').replace(r'\*', '.*?'))

for text in (text1, text2):
  if searcher.search(text):
    print('Match: "%s"' % text)

Produces:

Match: "gazonk foo.c bar.m m.bar"

Note that if you examine the match object you can find out more about the match and use it for highlighting or whatever.

Of course, there might be more to it, but it should be a start.
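
As a rough illustration of the highlighting idea, here is a minimal sketch that uses the match object's span() to wrap the first hit in markers. The helper name highlight and the bracket markers are made up for this example; the regex is built exactly as in the snippet above.

```python
import re

def highlight(glob, text, left='[', right=']'):
    """Wrap the first match of `glob` in `text` with markers (hypothetical helper)."""
    # Same conversion as above: escape everything, then restore the glob wildcards.
    regex = re.compile(re.escape(glob).replace(r'\?', '.').replace(r'\*', '.*?'))
    m = regex.search(text)
    if m is None:
        return text  # nothing to highlight
    start, end = m.span()
    return text[:start] + left + text[start:end] + right + text[end:]

print(highlight('foo.? bar*', 'gazonk foo.c bar.m m.bar'))
# prints: gazonk [foo.c bar].m m.bar
```

Note that the trailing .*? is lazy, so at the end of the pattern it matches as little as possible (here, nothing). If the highlight should extend further, a greedy .* or a character class such as \S* might fit better, depending on what "*" is supposed to mean for your users.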

PEZ
That's right, but you also need to escape ()|\ [] and other meaningful characters in the search string.
Lazin
Thanks for pointing that out. Now fixed.
PEZ
+8  A: 

No need for incomplete or unreliable hacks. There's a function included with Python for this:

>>> import fnmatch
>>> fnmatch.translate( '*.foo' )
'.*\\.foo$'
>>> fnmatch.translate( '[a-z]*.txt' )
'[a-z].*\\.txt$'
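
A small sketch of how this could be applied to the question's "match anywhere in the text" case. The exact string returned by fnmatch.translate differs between Python versions (newer ones wrap the pattern in a group and anchor it differently), so the code below only compiles the result and does not depend on its exact form. The helper name glob_search is made up for this example.

```python
import fnmatch
import re

def glob_search(pattern, text):
    """Return True if the glob `pattern` matches somewhere inside `text` (hypothetical helper)."""
    # fnmatch.translate produces a regex for the whole string, so surround the
    # glob with '*' to allow arbitrary text before and after the match.
    regex = re.compile(fnmatch.translate('*' + pattern + '*'))
    return regex.match(text) is not None

print(glob_search('foo.? bar*', 'gazonk foo.c bar.m m.bar'))  # True
print(glob_search('foo.? bar*', 'foo bar'))                   # False
```

Keep in mind that fnmatch follows shell-style rules, so [seq] and [!seq] character classes are also interpreted, which may or may not be what you want for user input.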
Chadrik