I was hoping someone could help me write a regex for C++ that matches words in a search phrase, and explain it bit by bit for learning purposes.

What I need is a regex that matches a string within " " like "Hello you all", and single words that start or end with * like *ack / overfl*.

For the quote part I have \"[^\\s][^\"]*\" but I can't figure out the wildcard (*) part, or how I should combine it with the quote regex.

A: 

As long as there is no quote nesting (nesting in general is something regex is bad at):

"(?:(?<=\\)"|[^"])*"|\*[^\s]+|[^\s]+\*

This regex also allows for escaped double quotes ('\"'), in case you need that, and the match includes the enclosing double quotes.

This regex matches:

  • "A string in quotes, possibly containing \"escaped quotes\""
  • *a_search_word_beginning_with_a_star
  • a_search_word_ending_with_a_star*
  • *a_search_word_enclosed_in_stars*
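For C++ specifically, here is a rough sketch of how the matches could be collected with std::regex. One assumption to flag loudly: the default ECMAScript grammar of std::regex has no lookbehind, so the sketch uses a lookbehind-free rewrite of the quoted part (\\" tried before [^"]) rather than the exact pattern above. The helper name find_tokens is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every quoted phrase or starred word in `phrase`.
std::vector<std::string> find_tokens(const std::string& phrase) {
    // Assumption: lookbehind-free variant of the pattern, because the default
    // std::regex grammar (ECMAScript) does not support (?<=...).
    // The raw string literal avoids doubling every backslash.
    static const std::regex token_re(R"re("(?:\\"|[^"])*"|\*\S+|\S+\*)re");

    std::vector<std::string> tokens;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(phrase.begin(), phrase.end(), token_re); it != end; ++it) {
        tokens.push_back(it->str());  // quoted matches keep their enclosing quotes
    }
    return tokens;
}
```

Calling find_tokens on the text find "Hello you all" *ack overfl* plain should yield the quoted phrase, *ack and overfl*, and skip the unstarred words.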

Be aware that it will break at strings like this:

  • A broken \"string "with the quotes all \"mangled up\""

If you expect (read: can't entirely rule out the possibility) to get these, please don't use regex, but write a small quote-aware parser. For a one-shot search and replace activity or input in a guaranteed format, the regex is okay to use.

For validating/parsing user input, it is not okay to use. That's where I would recommend a parser. Knowing the difference is the key.
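A small quote-aware parser of the kind recommended here could look roughly like the following. It is only a sketch under stated assumptions: the name tokenize, the bool return value, and the rule that a backslash inside quotes escapes the next character are choices made for illustration, not something the question specifies.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `input` into quoted phrases (returned without
// their quotes) and whitespace-separated bare words such as *ack or overfl*.
// Returns false when a quoted phrase is never closed.
bool tokenize(const std::string& input, std::vector<std::string>& tokens) {
    tokens.clear();
    std::size_t i = 0;
    while (i < input.size()) {
        if (std::isspace(static_cast<unsigned char>(input[i]))) {
            ++i;  // skip whitespace between tokens
            continue;
        }
        std::string token;
        if (input[i] == '"') {
            ++i;  // skip the opening quote
            bool closed = false;
            while (i < input.size()) {
                const char c = input[i++];
                if (c == '\\' && i < input.size()) {
                    token += input[i++];  // assumption: backslash escapes the next character
                } else if (c == '"') {
                    closed = true;
                    break;
                } else {
                    token += c;
                }
            }
            if (!closed) {
                return false;  // mangled input is reported instead of silently mis-split
            }
        } else {
            while (i < input.size() && !std::isspace(static_cast<unsigned char>(input[i]))) {
                token += input[i++];
            }
        }
        tokens.push_back(token);
    }
    return true;
}
```

Unlike the regex, this version can tell the caller that the quotes do not balance, which is the point when the input comes from a user.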

Tomalak
A: 

Try this regular expression:

(?:\*?\w+\*?|"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*")+

For readability I replaced the backslash characters by \x5C.

The expression "(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*" will also match "foo \"bar\"" and other properly escaped quote sequences (but only the " character can be escaped).
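To try the quoted sub-expression on its own in C++, something like this sketch may help. It assumes std::regex with its default ECMAScript grammar and writes \x5C back as a plain backslash inside a raw string literal; the helper name is_quoted_phrase is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `text` as a whole is one quoted phrase
// according to the sub-expression above.
bool is_quoted_phrase(const std::string& text) {
    // \x5C is written as a real backslash here; the raw string literal keeps
    // the C++ source free of an extra layer of escaping.
    static const std::regex quoted_re(R"re("(?:[^\\"]+|\\(?:\\\\)*")*")re");
    return std::regex_match(text, quoted_re);
}
```

With this, "foo \"bar\"" is accepted, while a quoted string containing a backslash followed by something other than a quote (for example \n typed as two characters) is rejected, since only the quote itself can be escaped.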

So foo* bar *baz *quux* "foo \"bar\"" should be split into:

  • foo*
  • bar
  • *baz
  • *quux*
  • "foo \"bar\""
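The split listed above can be reproduced in C++ along these lines. This is a sketch, assuming std::regex with the default ECMAScript grammar and \x5C written back as a backslash; the function name split_search_phrase is invented for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns each match of the first expression in `phrase`.
std::vector<std::string> split_search_phrase(const std::string& phrase) {
    // Same expression as above with \x5C replaced by a backslash,
    // wrapped in a raw string literal so no C++ escaping is needed.
    static const std::regex token_re(
        R"re((?:\*?\w+\*?|"(?:[^\\"]+|\\(?:\\\\)*")*")+)re");

    std::vector<std::string> tokens;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(phrase.begin(), phrase.end(), token_re); it != end; ++it) {
        tokens.push_back(it->str());  // one entry per match, quotes included
    }
    return tokens;
}
```

Passing the example phrase should give the five entries in the list, with the quoted phrase returned as a single token.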

If you don’t want to match bar in the example above, use this:

(?:\*\w+\*?|\w+\*|"(?:[^\x5C"]+|\x5C(?:\x5C\x5C)*")*")+
Gumbo
I'm sorry to say it, but your first regex does not work. It seems to match every single word in: 'This is a "test string"', though it ought to match '"test string"' only.
Tomalak
I had that before but thought he would want to match those words too. Let’s see what Qwark says.
Gumbo
Well, it's Monday morning and I got to test the regex at work, and it worked better than I was hoping for =) Thanks.
Qwark