I want to extract words from a text, where a word consists only of characters from a preselected set.

My approach is to split the text with a regex on every character that is not in the set, and then discard the empty entries.

Like this:

string inp = @"~T!@#e$мудак%š^t<>is69&.,;((טעראָר))_+}{{男子}[죽은]ที่เดิน:?/Ök\|`'+*-¤=";
string[] reg = {"[^A-Za-zšžõäöüŠŽÕÄÖÜ]"};

foreach (string word in inp.Split(reg, StringSplitOptions.RemoveEmptyEntries))
    Console.Write(word + " ");

The output I am trying to get is:

 T e š t is Ök
+5  A: 

You want Regex.Split(String, String) instead of String.Split(String[], StringSplitOptions): the latter treats each separator as a literal string and does no regex matching.

Kind of like the following (tested):

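// needs: using System; using System.Text.RegularExpressions;
string inp = @"~T!@#e$мудак%š^t<>is69&.,;((טעראָר))_+}{{男子}[죽은]ที่เดิน:?/Ök\|`'+*-¤=";
string reg = "[^A-Za-zšžõäöüŠŽÕÄÖÜ]";

foreach (string word in Regex.Split(inp, reg))
    if (word != string.Empty)   // skip the empty entries between adjacent separators
        Console.Write(word + " ");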

PowerShell test:

PS> $inp = '~T!@#e$мудак%š^t<>is69&.,;((טעראָר))_+}{{男子}[죽은]ที่เดิน:?/Ök\|`''+*-¤='
PS> $inp -split '[^A-Za-zšžõäöüŠŽÕÄÖÜ]' -join ' '
 T   e š t  is                                      Ök

Obviously you need to filter out the empty strings, so either

PS> $inp -split '[^A-Za-zšžõäöüŠŽÕÄÖÜ]' -ne '' -join ' '
T e š t is Ök

or

PS> $inp -split '[^A-Za-zšžõäöüŠŽÕÄÖÜ]+' -join ' '
 T e š t is Ök

(although the latter still contains an empty item at the start ... ah well, I'll leave that to you.)
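
One way to sidestep the empty entries altogether would be to match runs of allowed characters instead of splitting on the disallowed ones. This is only a minimal sketch, assuming the same character set as above; the `Program` wrapper and the variable name `allowed` are mine, not part of the question:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string inp = @"~T!@#e$мудак%š^t<>is69&.,;((טעראָר))_+}{{男子}[죽은]ที่เดิน:?/Ök\|`'+*-¤=";

        // Positive character class: one or more allowed characters in a row.
        string allowed = "[A-Za-zšžõäöüŠŽÕÄÖÜ]+";

        // Each match is a whole word, so there is nothing empty to filter out.
        var words = Regex.Matches(inp, allowed)
                         .Cast<Match>()
                         .Select(m => m.Value);

        Console.WriteLine(string.Join(" ", words));
    }
}
```

Because every match contains at least one allowed character, a separator at the very start or end of the input cannot produce an empty item here.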

Joey
+1  A: 

This is what you want (tested):

string inp = @"~T!@#e$мудак%š^t<>is69&.,;((טעראָר))_+}{{男子}[죽은]ที่เดิน:?/Ök\|`'+*-¤=";
Regex reg = new Regex("[^A-Za-zšžõäöüŠŽÕÄÖÜ]");

foreach (string s in reg.Split(inp))
{
    if (String.IsNullOrEmpty(s))
        continue;

    Console.Write(s + " ");
}
Richard J. Ross III
That “(tested)” was a challenge, wasn't it? ;-)
Joey
Of course, I had to make it better than yours :)
Richard J. Ross III