How would I go about using Regex to match Unicode strings? I'm loading a couple of keywords from a text file and using them with Regex on another file. Both keywords contain Unicode characters (such as á). I'm not sure where the problem is. Is there some option I have to set?


Code:

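foreach (string currWord in _keywordList)
{
    // Count whole-word, case-insensitive occurrences of the keyword.
    MatchCollection mCount = Regex.Matches(
        nSearch.InnerHtml, "\\b" + currWord + "\\b", RegexOptions.IgnoreCase);

    if (mCount.Count > 0)
    {
        wordFound.Add(currWord);
        // Show the number of matches as the caption, not the collection's type name.
        MessageBox.Show(currWord, mCount.Count.ToString());
    }
}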

And reading the keywords to a list:

string compSplit;
using (var rdComp = new StreamReader(opnDiag.FileName))
{
    compSplit = rdComp.ReadToEnd()
                      .Replace("\r\n", "\n")
                      .Replace("\n\r", "\n");
}
string[] compList = compSplit.Split(new[] {'\n'});

Then I change the array to a list.
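
For comparison, here is a minimal sketch of the same loading step with the file encoding passed explicitly. It assumes the keyword file holds one keyword per line. `KeywordLoader` and `Load` are hypothetical names, not part of the code above. Whether the encoding is actually the cause here is only a guess: `StreamReader` reads as UTF-8 unless told otherwise, so a file saved in another encoding could turn á into a different character before the regex ever sees it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class KeywordLoader
{
    // Hypothetical helper: reads one keyword per line with an explicit encoding,
    // trimming whitespace and skipping blank lines.
    public static List<string> Load(string path, Encoding encoding)
    {
        var keywords = new List<string>();

        // File.ReadAllLines splits on \r\n, \n and \r, so no manual Replace is needed.
        foreach (string line in File.ReadAllLines(path, encoding))
        {
            string word = line.Trim();
            if (word.Length > 0)
            {
                keywords.Add(word);
            }
        }

        return keywords;
    }
}
```

It could be called as `KeywordLoader.Load(opnDiag.FileName, Encoding.UTF8)`, swapping in a different `Encoding` if the file was saved in a legacy code page.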

A: 

When matching a specific character, I believe regular expressions only support literals for the ASCII character set. Beyond that, you can use \uxxxx, where xxxx is the four-digit hexadecimal code point, to match a Unicode character.

See here.

mbeckish
I'm not sure that's the problem. She/he isn't using character classes but verbatim strings, surrounded by word boundaries.
Tim Pietzcker
@Pietzcker - That's the problem. S/he needs to parse the string and add each character as a Unicode code point.
mbeckish
Well, I've just tried using Unicode literals in a regex in C#, and it worked perfectly. `Console.WriteLine(Regex.Replace("It BӦЯӁڀ!", @"\bBӦЯӁڀ\b", "works"));` returns `It works!`
Tim Pietzcker
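
The check in the last comment can be turned into a small self-contained sketch. It assumes the keyword should be matched literally, so it is passed through `Regex.Escape` before the `\b` anchors are added (the question's code does not do this). `UnicodeKeywordMatch`, `CountWholeWord` and `LiteralAndEscapeAgree` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnicodeKeywordMatch
{
    // Hypothetical helper: counts whole-word, case-insensitive matches of a keyword.
    // Regex.Escape keeps characters such as '.' or '+' in the keyword from being
    // treated as regex syntax.
    public static int CountWholeWord(string text, string keyword)
    {
        string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    // Compares the two ways of writing U+00E1 (á) in a pattern:
    // "\u00E1" is resolved by the C# compiler, so the regex engine sees the literal character;
    // @"\u00E1" is passed through as text and resolved by the regex engine itself.
    public static bool LiteralAndEscapeAgree(string text)
    {
        return Regex.IsMatch(text, "\u00E1") && Regex.IsMatch(text, @"\u00E1");
    }
}
```

If both forms match the same input, the pattern syntax is unlikely to be the culprit, which would point back at how the keywords or the searched text are read in.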