You might want to start by looking at the BreakIterator
class (java.text.BreakIterator).
From the JavaDoc:
The BreakIterator class implements
methods for finding the location of
boundaries in text. Instances of
BreakIterator maintain a current
position and scan over text returning
the index of characters where
boundaries occur. Internally,
BreakIterator scans text using a
CharacterIterator, and is thus able to
scan text held by any object
implementing that protocol. A
StringCharacterIterator is used to
scan String objects passed to setText.
You use the factory methods provided
by this class to create instances of
various types of break iterators. In
particular, use getWordInstance,
getLineInstance, getSentenceInstance,
and getCharacterInstance to create
BreakIterators that perform word,
line, sentence, and character boundary
analysis respectively. A single
BreakIterator can work only on one
unit (word, line, sentence, and so
on). You must use a different iterator
for each unit boundary analysis you
wish to perform.
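
As a rough illustration of the factory methods and the scan loop described above (not part of the JavaDoc), here is a minimal sketch. The class name `BreakIteratorBasics` and the helper `boundaries` are made up for this example; the `BreakIterator` calls themselves are the standard `java.text` API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorBasics {
    // Hypothetical helper: collects every boundary index the iterator reports for the text.
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        // first() returns the first boundary; next() returns DONE when the text is exhausted.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you?";
        // One iterator per kind of analysis, as the JavaDoc requires.
        System.out.println(boundaries(BreakIterator.getWordInstance(Locale.ENGLISH), text));
        System.out.println(boundaries(BreakIterator.getSentenceInstance(Locale.ENGLISH), text));
    }
}
```

Each pair of consecutive indices delimits one unit, so `text.substring(a, b)` gives you the piece between two boundaries.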
Line boundary analysis determines
where a text string can be broken when
line-wrapping. The mechanism correctly
handles punctuation and hyphenated
words.
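
To make the line-wrapping use case concrete, here is a small sketch of a greedy wrapper built on the line iterator. The class `SimpleWrap` and method `wrap` are hypothetical names, and the width check is deliberately naive (it counts a trailing space toward the width), so treat it as a starting point rather than a finished layout routine.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleWrap {
    // Hypothetical helper: breaks text only at positions the line iterator allows.
    static List<String> wrap(String text, int width) {
        BreakIterator it = BreakIterator.getLineInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> lines = new ArrayList<>();
        int lineStart = it.first();
        int lastBreak = lineStart;
        for (int pos = it.next(); pos != BreakIterator.DONE; pos = it.next()) {
            // If extending to this break would overflow, cut at the previous break.
            if (pos - lineStart > width && lastBreak > lineStart) {
                lines.add(text.substring(lineStart, lastBreak).trim());
                lineStart = lastBreak;
            }
            lastBreak = pos;
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart).trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("Hello big world", 10)) {
            System.out.println(line);
        }
    }
}
```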
Sentence boundary analysis allows
selection with correct interpretation
of periods within numbers and
abbreviations, and trailing
punctuation marks such as quotation
marks and parentheses.
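
For sentence splitting, something along these lines should work. `SentenceSplit` and `sentences` are names invented for this sketch; how well abbreviations are handled depends on the locale rules shipped with your JDK, so test with your own text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplit {
    // Hypothetical helper: returns the trimmed sentences found in the text.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Hello world. How are you?")) {
            System.out.println(s);
        }
    }
}
```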
Word boundary analysis is used by
search and replace functions, as well
as within text editing applications
that allow the user to select words
with a double click. Word selection
provides correct interpretation of
punctuation marks within and following
words. Characters that are not part of
a word, such as symbols or punctuation
marks, have word-breaks on both sides.
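
Because punctuation and whitespace come back as their own segments, extracting just the words means filtering. A minimal sketch follows; `WordSplit` and `words` are hypothetical names, and keeping only segments that start with a letter or digit is my own simplifying assumption, not something the JavaDoc prescribes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordSplit {
    // Hypothetical helper: returns the segments that look like words.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Skip segments made of spaces, symbols or punctuation.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```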
Character boundary analysis allows
users to interact with characters as
they expect to, for example, when
moving the cursor through a text
string. Character boundary analysis
provides correct navigation through
character strings, regardless of how
the character is stored. For example,
an accented character might be stored
as a base character and a diacritical
mark. What users consider to be a
character can differ between
languages.
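
The base-character-plus-diacritic case can be checked with a short sketch like this one. `GraphemeCount` and `countUserCharacters` are made-up names; the point is only that `String.length()` counts UTF-16 code units while the character iterator counts what a user would see.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeCount {
    // Hypothetical helper: counts user-perceived characters rather than chars.
    static int countUserCharacters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ENGLISH);
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT: two chars, one visible character.
        String text = "cafe\u0301";
        System.out.println(text.length());
        System.out.println(countUserCharacters(text));
    }
}
```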
BreakIterator is intended for use with
natural languages only. Do not use
this class to tokenize a programming
language.