Is there a regular expression pattern that can change this string

This is a mix string of üößñ and English. üößñ üößñ are Unicode words.

to this?

This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.

In other words, I want to separate the English words from the non-English words with commas.

Thanks.

+1  A: 

No regular expression can detect which language a string is written in, but you can certainly match characters inside (or outside) a range of code points by using Unicode escapes, such as

/[\u0900-\u097F]+/

which matches a sequence of Devanagari characters.

Remember that a Script (a collection of characters) can be used by many languages.
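As a minimal sketch of the range-based approach above, the snippet below collects every run of characters from the Devanagari block. The helper name findDevanagariRuns and the sample sentence are made up for illustration; only the character class comes from the answer.

```javascript
// Hypothetical helper: collect every run of Devanagari characters in a string.
function findDevanagariRuns(text) {
  // U+0900..U+097F is the Devanagari block; the g flag makes match() return all runs.
  const runs = text.match(/[\u0900-\u097F]+/g);
  // match() returns null when nothing matches, so normalize that to an empty array.
  return runs === null ? [] : runs;
}

console.log(findDevanagariRuns("Hello नमस्ते world हिन्दी"));
```

Note that this only tells you the text is written in Devanagari. It cannot tell Hindi from Marathi or Nepali, which is the point about scripts being shared by several languages.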

Jonathan Feinberg
+2  A: 

Sure, you can use \x escapes to match characters outside the ASCII range.

For example (in JavaScript):

var x = "This is a mix string of üößñ and English. üößñ üößñ are Unicode characters.";
// [^\x00-\x7F] matches any character outside the ASCII range (0-127)
var result = x.replace(/([^\x00-\x7F]+\s)+/g, function (match) { return match.slice(0, -1) + ", "; });
console.log(result);

Output:

This is a mix string of üößñ, and English. üößñ üößñ, are Unicode characters.

I'm sure someone more regex-savvy can optimize this further, but it's the best I can think of while half-awake :)
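The output above puts a comma only after each non-ASCII run, while the question asks for one on both sides. A possible variation on the same ASCII-range idea is sketched below; the function name wrapNonAscii is an assumption for illustration, not part of the original answer.

```javascript
// Hypothetical helper: put a comma before and after each run of non-ASCII words.
function wrapNonAscii(text) {
  // ( ?[^\x00-\x7F]+)+ : one or more non-ASCII words, each optionally
  // preceded by a single space, captured as one group so that words
  // separated only by spaces stay together.
  return text.replace(/((?: ?[^\x00-\x7F]+)+)/g, ",$1,");
}

var sample = "This is a mix string of üößñ and English. üößñ üößñ are Unicode characters.";
console.log(wrapNonAscii(sample));
```

Because the leading space is captured inside the group, the first comma lands directly after the preceding English word, which is the shape the question asks for.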

Matt
+1  A: 

In JavaScript, \w only matches ASCII word characters, so [^\w\s]+ picks up the non-ASCII words:

/((?: [^\w\s]+)+)/g

'This is a mix string of üößñ and English. üößñ üößñ are Unicode words.'.replace(/((?: [^\w\s]+)+)/g, ',$1,')

This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.

Mark

S.Mark
+1  A: 
 String s = "This is a mix string of üößñ and English. üößñ üößñ are Unicode words.";
 System.out.println(s.replaceAll("((?: ?[\\p{L}&&[^A-Za-z]]+)+)", ",$1,"));

Unicode defines dozens of distinct scripts, so listing them one by one is impractical. The pattern above sidesteps that: it simply matches any Unicode letter (\p{L}) that is not an ASCII letter.
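For readers working in JavaScript rather than Java, a rough equivalent is sketched below. JavaScript has no && class intersection in this form, so the sketch assumes an engine that supports Unicode property escapes with the u flag and uses a negative lookahead to exclude ASCII letters. The function name wrapNonAsciiLetters is hypothetical.

```javascript
// Hypothetical helper: comma-wrap runs of letters that are not ASCII letters.
function wrapNonAsciiLetters(text) {
  // (?![A-Za-z])\p{L} : any Unicode letter except an ASCII letter.
  // The outer group allows several such words separated by single spaces.
  // The u flag is required for \p{L} to be recognized.
  return text.replace(/((?: ?(?:(?![A-Za-z])\p{L})+)+)/gu, ",$1,");
}

var s = "This is a mix string of üößñ and English. üößñ üößñ are Unicode words.";
console.log(wrapNonAsciiLetters(s));
```

Unlike a plain non-ASCII range, this version ignores non-ASCII punctuation such as curly quotes, because \p{L} covers letters only.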

brianegge