Pseudocode would look like this:
create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right
The Python code would be something like this, with text holding the input string:
words = text.split()
words = [word.strip(PUNCTUATION) for word in words]
where
PUNCTUATION = ",. \n\t\\\"'][#*:"
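or any other characters you want to remove. Note that strip() only trims the ends of each word, so internal characters (the apostrophe in "Gutenberg's", the hyphen in "re-use") are kept, and a token made up entirely of these characters ends up as an empty string.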
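Put together as a self-contained function, the two steps might look like the sketch below. The name tokenize, the sample string, and the choice of string.punctuation plus string.whitespace as the strip set are assumptions for illustration, not part of the snippet above; the sketch also drops tokens that end up empty.

```python
import string

# Assumed strip set: all ASCII punctuation plus whitespace.
# Swap in your own PUNCTUATION string if you want a narrower set.
STRIP_CHARS = string.punctuation + string.whitespace


def tokenize(text, strip_chars=STRIP_CHARS):
    """Split text on whitespace, then trim strip_chars from both ends of each token."""
    stripped = (word.strip(strip_chars) for word in text.split())
    # Tokens made up only of strip characters become "" and are dropped.
    return [word for word in stripped if word]


# Hypothetical sample input, just to exercise the function.
sample = 'He said: "copy it, give it away or re-use it" ... [etc.]'
print(tokenize(sample))
```

Only the ends of each token are trimmed, so a hyphen or apostrophe inside a word survives.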
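Java has a rough equivalent for the first step in the String class: String.split(), which takes a regular expression such as "\\s+". As far as I know, String has no direct counterpart to strip(chars), so the trimming step would need something like replaceAll() with a regular expression.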
Output of running this code on the text you provided in your link:
>>> print(words[:100])
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis',
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for',
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and',
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may',
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under',
... etc etc.