How can I extract only the characters of a particular language from a file that contains that language's characters, alphanumeric characters, and English letters?

A: 

This depends on a few factors:

  1. Is the string encoded with UTF-8?

  2. Do you want all non-English characters, including things like symbols and punctuation marks, or only non-symbol characters from written languages?

  3. Do you want to capture characters that are non-English or non-Latin? That is, would you want characters like é and ç, or only characters from outside the alphabets of the Romance and Germanic languages?

and finally,

  4. What programming language do you want to do this in?
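To make the second and third questions concrete, here is a rough PHP sketch of how the different answers could map onto PCRE's Unicode property classes. The sample text and variable names are made up for illustration, and it assumes the source file and the input are both UTF-8.

```php
<?php
// Hypothetical sample: English, accented Latin, a symbol, Cyrillic, punctuation.
$text = "Café №5: Привет, мир!";

// Letters only, from any script other than Latin (drops é as well as a-z).
// The negated class excludes Latin characters and all non-letters.
preg_match_all('/[^\p{Latin}\P{L}]/u', $text, $m);
$nonLatinLetters = implode('', $m[0]);   // "Приветмир"

// Everything outside ASCII, symbols included (keeps é and №).
preg_match_all('/[^\x00-\x7F]/u', $text, $m);
$nonAscii = implode('', $m[0]);          // "é№Приветмир"

// Characters of one specific script only, here Cyrillic.
preg_match_all('/\p{Cyrillic}/u', $text, $m);
$cyrillic = implode('', $m[0]);          // "Приветмир"
```

Swapping `Cyrillic` for another script name that PCRE knows (for example `Greek`, `Han`, or `Devanagari`) should target that script instead. Note that a script is not the same thing as a language: several languages can share one script.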

Assuming that you are using UTF-8, that you don't want basic punctuation but are okay with other symbols, and that you don't want any standard Latin characters but are okay with accented characters and the like, you could use a regular expression function in whatever language you are using to keep only the non-ASCII characters. This would eliminate most of what you are probably trying to weed out.

In PHP it would be:

$string2 = preg_replace('/[\x00-\x7F]+/', '', $string1);

However, this would remove line endings, which you may or may not want.
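If the line endings should survive, one option is to leave the carriage return and line feed bytes out of the range being stripped. This is only a sketch under the same UTF-8 assumption, with a made-up input string:

```php
<?php
// Hypothetical input: two lines mixing ASCII and Greek.
$string1 = "abc αβγ 123\r\nxyz δεζ\n";

// Strip every ASCII byte except LF (\x0A) and CR (\x0D).
// In UTF-8, all bytes of a multibyte character are >= 0x80, so they are untouched.
$string2 = preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x7F]+/', '', $string1);
// $string2 is now "αβγ\r\nδεζ\n"
```

Spaces are still removed here, so words in the remaining text run together. Adding \x20 to the excluded bytes would keep them.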

Anthony