String a ="the STRING TOKENIZER CLASS ALLOWS an APPLICATION to BREAK a STRING into TOKENS.  ";

StringTokenizer st = new StringTokenizer(a);
while (st.hasMoreTokens()) {
  System.out.println(st.nextToken());
}

Given the code above, the output is as follows:

the
STRING TOKENIZER CLASS
ALLOWS
an
APPLICATION
to
BREAK
a
STRING
into
TOKENS. 

My only question is: why has "STRING TOKENIZER CLASS" been combined into one token?

When I run this code,

System.out.println("STRING TOKENIZER CLASS".contains(" "));

It prints a surprising result:

false

That doesn't seem logical, does it? I have no idea what went wrong.

I found out the reason: Java does not treat that character as an ordinary space. But I don't know how it ended up that way somewhere between the earlier processing and the code I've posted.

Guys, I should point out that the code below runs before the code above.

    if (!suspectedContentCollector.isEmpty()) {
        // Typed iterator so next() returns a String (assumes the collector holds Strings).
        Iterator<String> i = suspectedContentCollector.iterator();
        String temp = "";
        while (i.hasNext()) {
            temp += i.next().toLowerCase() + " ";
        }
        StringTokenizer st = new StringTokenizer(temp);

        while (st.hasMoreTokens()) {
            temp = st.nextToken();
            temp = StopWordsRemover.remove(temp);
            analyzedSentence = analyzedSentence.replace(temp, temp.toUpperCase());
        }
    }

Hence, once the words have been changed to uppercase, something seems to have gone wrong somewhere, and I realized that only certain spaces are not recognized. Could the cause be the way the text is retrieved from the document?

The following code,

String a ="the STRING TOKENIZER CLASS ALLOWS an APPLICATION to BREAK a STRING into TOKENS.  "; for (int i : a.toCharArray()) { System.out.print(i + " "); }

produced the following output,

116 104 101 32 83 84 82 73 78 71 160 84 79 75 69 78 73 90 69 82 160 67 76 65 83 83 32 65 76 76 79 87 83 32 97 110 32 65 80 80 76 73 67 65 84 73 79 78 32 116 111 32 66 82 69 65 75 32 97 32 83 84 82 73 78 71 32 105 110 116 111 32 84 79 75 69 78 83 46 160 32

+3  A: 

Is it possible that you're using something other than normal ASCII blanks in "STRING TOKENIZER CLASS"? Maybe you held down the shift key and got a shifted-space in there instead?

Paul Tomblin
I was thinking the same as you. But the original String was all in lowercase, and I changed some of the words to uppercase. After that change, some of the spaces seem to go undetected, which is very weird to me. Any idea why?
Mr CooL
Did you change them to uppercase by hitting "caps lock" or by holding down the "shift" key as you typed? If the latter, Paul's point would seem right.
Jim Kiley
+6  A: 

There -- the answer is in the snippet that you added. The integers listed show that the space after the word STRING is character 160 (U+00A0), which is &nbsp;, the non-breaking space, instead of character 32, which is the ordinary space. Edit your original string, replacing the spaces within STRING TOKENIZER CLASS with actual spaces instead of non-breaking spaces.

Just a side comment, from the 1.4.2 Javadoc:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
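
As a rough illustration of the split approach the Javadoc recommends, here is a minimal sketch. The class name SplitOnAnySpace and the helper tokens are made up for this example; the regex simply adds character 160 to the usual whitespace class, since \s does not match it by default.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SplitOnAnySpace {
    // Hypothetical helper: splits on runs of ordinary whitespace or U+00A0.
    static List<String> tokens(String text) {
        // Drop leading delimiters first, otherwise split() yields a leading empty token.
        String cleaned = text.replaceFirst("^[\\s\\u00A0]+", "");
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        // Trailing empty tokens are discarded by split() itself.
        return Arrays.asList(cleaned.split("[\\s\\u00A0]+"));
    }

    public static void main(String[] args) {
        // Character 160 is written as an explicit escape so it stays visible in the source.
        String a = "the STRING\u00A0TOKENIZER\u00A0CLASS ALLOWS";
        System.out.println(tokens(a)); // [the, STRING, TOKENIZER, CLASS, ALLOWS]
    }
}
```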

Jim Kiley
It's the same....the space was not recognized...
Mr CooL
Thanks Jim Kiley
Mr CooL
+1  A: 

If you copy/pasted the sentence from a web page or a Word document, chances are you got some special characters instead of spaces (ex: non-breaking spaces, etc.). Try again by typing the sentence in your Java editor.

Olivier Croisier
Yeah, if I type it, there's no problem. It only has this problem when the text comes out of some processing.
Mr CooL
+2  A: 

Do us all a favor and copy and paste the output of this snippet:

    for (int i : a.toCharArray()) {
        System.out.print(i + " ");
    }

OK, now looking at the output, it confirms what we've all been suspecting: those "spaces" are character 160 (U+00A0), the &nbsp; non-breaking space. It's a different character from the ASCII 32 regular space.
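
To see the difference between the two characters from Java's point of view, a small sketch along these lines may help. The class name SpaceKinds and the helper describe are invented for illustration; Character.isWhitespace and Character.isSpaceChar are the standard library methods.

```java
public class SpaceKinds {
    // Hypothetical helper: reports the code and how Java classifies the character.
    static String describe(char c) {
        return (int) c + " isWhitespace=" + Character.isWhitespace(c)
                + " isSpaceChar=" + Character.isSpaceChar(c);
    }

    public static void main(String[] args) {
        System.out.println(describe(' '));      // 32 isWhitespace=true isSpaceChar=true
        System.out.println(describe('\u00A0')); // 160 isWhitespace=false isSpaceChar=true
        // The same check as in the question, with the non-breaking space made explicit.
        System.out.println("STRING\u00A0TOKENIZER\u00A0CLASS".contains(" ")); // false
    }
}
```

This is also why contains(" ") came back false: the string holds character 160, not character 32.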

You can have the tokenizer (which is obsolete, as others have said) include character 160 as a delimiter, or you can filter it out of the input string if it's not supposed to be there in the first place.

For now, a = a.replace((char) 160, (char) 32); before tokenizing is a quick-fix.
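
Put together, the quick-fix might look like the sketch below. The class name ReplaceNbsp and the helper tokenize are made up for this example, and the input string is shortened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ReplaceNbsp {
    // Hypothetical helper: normalizes character 160 to character 32, then tokenizes.
    static List<String> tokenize(String text) {
        String cleaned = text.replace((char) 160, (char) 32);
        StringTokenizer st = new StringTokenizer(cleaned);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "the STRING\u00A0TOKENIZER\u00A0CLASS ALLOWS";
        System.out.println(tokenize(a)); // [the, STRING, TOKENIZER, CLASS, ALLOWS]
    }
}
```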

polygenelubricants
Okay, thanks.
Mr CooL
Sorry polygenelubricants, how do I actually replace character 160 with the ASCII 32 regular space? The code you pasted, a = a.replace(160, 32);, didn't work.
Mr CooL
Sorry, I forgot to add the cast `(char)`.
polygenelubricants
Thanks polygenelubricants~! ;)
Mr CooL
+3  A: 

Looking at the character codes, the 'space' in question is 0xA0, which is intended to be a non-breaking space. My guess is that it was entered deliberately so that 'STRING TOKENIZER CLASS' is treated as one word.

The solution (if you indeed deem it correct to break up 'STRING TOKENIZER CLASS' into three words) would be to add the non-breaking space as a delimiter for the StringTokenizer class (or the String.split() method, respectively). E.g.

  new StringTokenizer(string, " \t\n\r\f\240")
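
A small sketch comparing the default delimiters with the extended set. The class name NbspDelimiter and its two helpers are invented for this example; countTokens() is the standard StringTokenizer method.

```java
import java.util.StringTokenizer;

public class NbspDelimiter {
    // The default delimiter set plus octal 240 (decimal 160, the non-breaking space).
    static final String DELIMS = " \t\n\r\f\240";

    // Token count with StringTokenizer's default delimiters.
    static int countDefault(String text) {
        return new StringTokenizer(text).countTokens();
    }

    // Token count when character 160 is also treated as a delimiter.
    static int countWithNbsp(String text) {
        return new StringTokenizer(text, DELIMS).countTokens();
    }

    public static void main(String[] args) {
        String s = "STRING\u00A0TOKENIZER\u00A0CLASS";
        System.out.println(countDefault(s));  // 1
        System.out.println(countWithNbsp(s)); // 3
    }
}
```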
Lars
Thanks man....the code works to remove the funny space!
Mr CooL