Form what you are doing, I suspect that the following would suit you near perfectly:
from collections import defaultdict
text = ( "cat", "dog", "rat", "bat", "rat", "dog",
"man", "woman", "child", "child") #
d = defaultdict(list)
for lineno, word in enumerate(text):
d[word].append(lineno)
print d
This gives you an output of:
defaultdict(<type 'list'>, {'bat': [3], 'woman': [7], 'dog': [1, 5],
'cat': [0], 'rat': [2, 4], 'child': [8, 9],
'man': [6]})
This simply sets up an empty default dictionary containing a list for each item you access, so that you don't need to worry about creating the entry, and then enumerates it's way over the list of words, so you don't need to keep track of the line number.
As you don't have a list of correct spellings, this doesn't actually check if the words are correctly spelled, just builds a dictionary of all the words in the text file.
To convert the dictionary to a set of words, try:
all_words = set(d.keys())
print all_words
Which produces:
set(['bat', 'woman', 'dog', 'cat', 'rat', 'child', 'man'])
Or, just to print the words:
for word in d.keys():
print word
Edit 3:
I think this might be the final version:
It's a (deliberately) very crude, but almost complete spell checker.
from collections import defaultdict
# Build a set of all the words we know, assuming they're one word per line
good_words = set() # Use a set, as this will have the fastest look-up time.
with open("words.txt", "rt") as f:
for word in f.readlines():
good_words.add(word.strip())
bad_words = defaultdict(list)
with open("text_to_check.txt", "rt") as f:
# For every line of text, get the line number, and the text.
for line_no, line in enumerate(f):
# Split into seperate words - note there is an issue with punctuation,
# case sensitivitey, etc..
for word in line.split():
# If the word is not recognised, record the line where it occurred.
if word not in good_words:
bad_words[word].append(line_no)
At the end, bad_words
will be a dictionary with the unrecognised words as the key, and the line numbers where the words were as the matching value entry.