Can someone help me? I have a file containing the following:

a                       // true
тодорхойгүй гишүүн\n    // false
ямар нэг                // false
нэгэн                   // false
a good deal             // true
нэлээн                  // false
a long face             // true
уруу царай              // false
...

My Java code

while ((strLine = br.readLine()) != null) {
    // strLine is one line read from the file
    Pattern pattern = Pattern.compile("[\\sa-zA-Z]{1,}");
    Matcher matcher = pattern.matcher(strLine);
    if (matcher.matches()) {
        System.out.print(true + "\n");
    } else {
        System.out.print(false + "\n");
    }
}

Output

false // this is the problem: this line should be true
false
false
false
true
false
true
false

Why does the first line not match?

When I insert a blank line at the start of the file, the output becomes

false
true   // this line was false before I inserted the blank line
false
false
false
true
false
true
false
A: 

Have you tried [\sa-zA-Z]+ ?

Peter Lawrey
This is exactly equivalent to the regex he has. It shouldn't change anything.
Avi
Yes. Java reports "illegal escape character" if the backslash is not doubled in the string literal. I don't think the problem is in the regex.
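
For anyone who wants to check the equivalence claim from the comments, here is a minimal sketch that runs both quantifiers over a few sample strings. The class name QuantifierCheck, the method sameResult and the sample inputs are made up for illustration; note that the backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class QuantifierCheck {
    // Returns true when {1,} and + give the same full-match result for the input.
    static boolean sameResult(String input) {
        boolean braces = Pattern.matches("[\\sa-zA-Z]{1,}", input);
        boolean plus = Pattern.matches("[\\sa-zA-Z]+", input);
        return braces == plus;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs, including the problem case with a leading U+FEFF.
        String[] samples = {"a", "a good deal", "", "нэгэн", "\uFEFFa"};
        for (String s : samples) {
            System.out.println(sameResult(s));
        }
    }
}
```

Both patterns agree on every sample, so switching the quantifier cannot change the output in the question.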
+2  A: 

It is strange. You might want to try to carefully examine the first couple lines of the file with hexdump:

head -2 file | hexdump -C

This should tell you exactly what bytes are at the beginning of the line.

Avi
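
If hexdump is not available (for example on Windows), roughly the same check can be done from Java. This is only a sketch: the class name HeadHex and the helper toHex are invented here, and "file" is a placeholder path just as in the command above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class HeadHex {
    // Formats the first `length` bytes as space-separated two-digit hex.
    static String toHex(byte[] bytes, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "file" is a placeholder; pass the real path as the first argument.
        String path = args.length > 0 ? args[0] : "file";
        byte[] buf = new byte[16];
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            // read() may return fewer than 16 bytes, or -1 for an empty file.
            int n = in.read(buf);
            System.out.println(toHex(buf, Math.max(n, 0)));
        }
    }
}
```

Reading raw bytes rather than going through a Reader matters here, because the point is to see what is on disk before any character decoding happens.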
A: 

First two lines

a
тодорхойгүй гишүүн

Hexdump

0000-0010:  ef bb bf 61-0d 0a d1 82-d0 be d0 b4-d0 be d1 80  ...a.... ........
0000-0020:  d1 85 d0 be-d0 b9 d0 b3-d2 af d0 b9-20 d0 b3 d0  ........ ........
0000-0029:  b8 d1 88 d2-af d2 af d0-bd                       ........ .
The first three characters are not ASCII. Are you sure this is really a simple text file? How are you creating it?
Stephen C
Those three bytes are the UTF-8 BOM [*]. Its use is discouraged by the Unicode Consortium, but many editors insert it anyway when they save a file as UTF-8 (Windows Notepad being the most notorious example). [*] http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark
Alan Moore
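
To make the effect of those three bytes concrete: Java's UTF-8 decoder does not drop the BOM, it hands it over as the character U+FEFF, so the first line is really U+FEFF followed by "a", and U+FEFF is not in [\sa-zA-Z]. The sketch below decodes the first four bytes from the hexdump and applies the regex from the question; the class name BomMatchDemo and the helper matches are made up for the demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class BomMatchDemo {
    // Same pattern as in the question.
    static final Pattern LATIN = Pattern.compile("[\\sa-zA-Z]{1,}");

    static boolean matches(String line) {
        return LATIN.matcher(line).matches();
    }

    public static void main(String[] args) {
        // First four bytes from the hexdump: ef bb bf 61
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        String firstLine = new String(raw, StandardCharsets.UTF_8);

        System.out.println(firstLine.length());               // 2
        System.out.println(firstLine.charAt(0) == '\uFEFF');  // true
        System.out.println(matches(firstLine));               // false
        System.out.println(matches("a"));                     // true
        System.out.println(matches("\uFEFF"));                // false
    }
}
```

This would also explain the blank-line experiment, assuming the editor kept the BOM at the very start of the file: the first line then consists of U+FEFF alone (false) and the "a" line is clean (true).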
A: 

Thanks for all the answers. I removed the first three non-ASCII bytes and that solved the problem. :)

That's not really a solution, although it may be all you can do. If you're creating the file, see if you can elect to save it as UTF-8 *without a BOM* (or signature, as some apps call it).
Alan Moore
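
If the file cannot be re-saved without a BOM, another option is to drop the U+FEFF character in code before matching. The following is a rough sketch, not a drop-in fix: it assumes the reader decodes the file as UTF-8, the class name BomStrip and the helper stripBom are hypothetical, and a StringReader stands in for the real file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class BomStrip {
    // Removes one leading U+FEFF if present; otherwise returns the line unchanged.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the question's file: a BOM, then two ASCII lines.
        BufferedReader br = new BufferedReader(
                new StringReader("\uFEFFa\r\na good deal\r\n"));
        String strLine;
        boolean first = true;
        while ((strLine = br.readLine()) != null) {
            if (first) {
                // Only the first line of a file can carry the BOM.
                strLine = stripBom(strLine);
                first = false;
            }
            System.out.println(strLine.matches("[\\sa-zA-Z]+"));
        }
    }
}
```

With the BOM stripped, both sample lines print true. Checking only the first line keeps a U+FEFF that appears later in the text (where it is a zero-width no-break space) untouched.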