ansaurus

Question

Extract words out of a text file

Answer 1

A:

You could try a regex: write a pattern that matches a word, then count the number of times that pattern is found.

dotnetdev 2008-11-09 22:11:35
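
A minimal sketch of the match-and-count idea, in Python to stay consistent with the code in Answer 2. The sample text and the pattern are assumptions for illustration; the answer above does not say which pattern to use.

```python
import re

# Hypothetical sample text; in practice this would be read from the file.
text = "The cat sat on the mat. The end."

# Assumed pattern: runs of ASCII letters, optionally joined by apostrophes.
word_pattern = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

words = word_pattern.findall(text)
print(len(words))  # how many times the pattern was found
print(words)
```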

Answer 2

+2 A:

Pseudocode would look like this:

create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right

The python code would be something like this:

words = text.split()  # text holds the file contents; split() with no arguments splits on any whitespace
words = [word.strip(PUNCTUATION) for word in words]

where

PUNCTUATION = ",. \n\t\\\"'][#*:"

or any other characters you want to remove.
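
One thing to watch: a token made up entirely of punctuation becomes an empty string after stripping. The sketch below drops those. It swaps in the standard library constant string.punctuation for the hand-written PUNCTUATION above, and extract_words is a hypothetical helper name.

```python
import string

def extract_words(text, punctuation=string.punctuation):
    """Split on whitespace, strip punctuation from both ends, drop empty tokens."""
    stripped = (token.strip(punctuation) for token in text.split())
    return [word for word in stripped if word]

# Hypothetical sample; "..." is stripped to an empty string and then dropped.
sample = 'He said: "re-use it" ... or don\'t!'
print(extract_words(sample))
```

Punctuation inside a word, such as the hyphen in re-use or the apostrophe in don't, is kept because strip() only touches the two ends.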

I believe Java has an equivalent function in the String class: String.split().

Output of running this code on the text you provided in your link:

>>> print(words[:100])
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis', 
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for', 
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 
... etc etc.

Claudiu 2008-11-09 22:16:11

The advantage of this code over regular expressions is that it can be done simply in a single pass.

Tom Leys 2008-11-09 22:43:05

Yes, Java has a 'split' method, but it doesn't have an equivalent of the 'strip' method.

nute 2008-11-09 22:43:30

Answer 3

A:

Basically, you want to match

([A-Za-z])+('([A-Za-z])*)?

right?

Ed Marty 2008-11-09 22:20:06
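
A quick way to try that pattern, sketched in Python to match Answer 2. The sample sentence is made up. Because the pattern contains capture groups, re.findall would return the groups rather than the whole words, so this uses finditer and group(0).

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(r"([A-Za-z])+('([A-Za-z])*)?")

# Hypothetical sample text.
sample = "It's Gutenberg's manual, re-use it."

# group(0) is the whole match, regardless of the capture groups.
words = [m.group(0) for m in pattern.finditer(sample)]
print(words)
```

Note that a hyphenated word such as re-use comes out as two separate words with this pattern.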

Answer 4

+3 A:

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

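import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
Pattern p = Pattern.compile("[\\w']+");  // the backslash must be escaped inside a Java string literal
Matcher m = p.matcher(input);

while (m.find()) {
    System.out.println(m.group());  // the text of the current match
}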

The pattern [\w']+ matches one or more consecutive word characters (letters, digits, and the underscore) or apostrophes. The example string is printed word by word. See the Java Pattern class documentation to read more.

Tomalak 2008-11-09 22:20:45

I had to change the regexp slightly to exclude digits and underscores, and to reject words that start with a quote, but otherwise, good!

nute 2008-11-09 23:10:30
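
The modified regexp is not shown in the comment above. One pattern that meets the three stated requirements (no digits, no underscores, no leading quote) is sketched here in Python; it is an assumption, not the commenter's actual pattern.

```python
import re

# Assumed pattern (the commenter's regex is not given): a word must start with
# an ASCII letter and may continue with letters or apostrophes.
pattern = re.compile(r"[A-Za-z][A-Za-z']*")

# Hypothetical sample with a leading quote, a digit, and an underscore.
sample = "'Tis 1 of Gutenberg's e_books, isn't it?"
words = pattern.findall(sample)
print(words)
```

Digits and underscores act as separators here, so e_books splits into two words. A trailing apostrophe is still kept as part of the word.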
