a friend of mine wrote this little progam.
the textFile
is 1.2GB in size (7 years worth of newspapers).
He successfully manages to create the dictionary but he cannot write it to a file using pickle(program hangs).
import sys
import string
import cPickle as pickle
biGramDict = {}
textFile = open(str(sys.argv[1]), 'r')
biGramDictFile = open(str(sys.argv[2]), 'w')
for line in textFile:
if (line.find('<s>')!=-1):
old = None
for line2 in textFile:
if (line2.find('</s>')!=-1):
break
else:
line2=line2.strip()
if line2 not in string.punctuation:
if old != None:
if old not in biGramDict:
biGramDict[old] = {}
if line2 not in biGramDict[old]:
biGramDict[old][line2] = 0
biGramDict[old][line2]+=1
old=line2
textFile.close()
print "going to pickle..."
pickle.dump(biGramDict, biGramDictFile,2)
print "pickle done. now load it..."
biGramDictFile.close()
biGramDictFile = open(str(sys.argv[2]), 'r')
newBiGramDict = pickle.load(biGramDictFile)
thanks in advance.
EDIT
for anyone interested i will briefly explain what this program does.
assuming you have a file formated roughly like this:
<s>
Hello
,
World
!
</s>
<s>
Hello
,
munde
!
</s>
<s>
World
domination
.
</s>
<s>
Total
World
domination
!
</s>
<s>
are sentences separators.- one word per line.
a biGramDictionary is generated for later use.
something like this:
{
"Hello": {"World": 1, "munde": 1},
"World": {"domination": 2},
"Total": {"World": 1},
}
hope this helps. right now the strategy changed to using mysql because sqlite just wasn't working (probably because of the size)