ansaurus

Question

Answer 1

+3 A:

For sparse matrices, do not use an actual matrix or list.

Use a dictionary keyed by word and annotation. Much simpler.

matrix[ (word,annotation) ] += 1
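
On a plain dict, `+=` raises a KeyError for a key that has not been seen yet, so the line above needs a default for first-time pairs. A minimal sketch of the idea, assuming a hypothetical list named `pairs` that stands in for the (word, annotation) pairs extracted from the corpus:

```python
# Hypothetical sample data; the real pairs would come from the corpus.
pairs = [
    ("cat", "animal"),
    ("dog", "animal"),
    ("cat", "animal"),
    ("python", "language"),
]

matrix = {}
for word, annotation in pairs:
    key = (word, annotation)
    # dict.get supplies 0 the first time a pair is seen.
    matrix[key] = matrix.get(key, 0) + 1
```

Only pairs that actually occur end up as keys, which is what keeps the structure sparse.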

S.Lott 2010-07-13 10:26:52

S. Lott, thanks for the response. Are you referring to Python's "dictionary" data structure? The number of unique words in my corpus is 11788567.

Denzil 2010-07-13 11:14:21

@Denzil, a dictionary with millions of keys is no problem for Python.

gnibbler 2010-07-13 11:56:18

Answer 2

A:

In Python 2.7+ you can use a Counter:

>>> from collections import Counter
>>> matrix = Counter()
>>> matrix[(word,annotation)]+=1

For older Python versions, use a defaultdict:

>>> from collections import defaultdict
>>> matrix = defaultdict(int)
>>> matrix[(word,annotation)]+=1
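
Both containers store only the pairs that are actually counted, but they differ when an unseen pair is read. The sketch below illustrates that difference; the `observations` list is hypothetical sample data, not taken from the question:

```python
from collections import Counter, defaultdict

# Hypothetical (word, annotation) observations standing in for the real corpus.
observations = [("cat", "animal"), ("cat", "animal"), ("java", "language")]

# Counter can be built directly from an iterable of hashable keys.
counter_matrix = Counter(observations)

default_matrix = defaultdict(int)
for pair in observations:
    default_matrix[pair] += 1

# Reading an unseen pair from a Counter returns 0 and stores nothing.
missing_from_counter = counter_matrix[("dog", "animal")]
counter_size = len(counter_matrix)

# Indexing an unseen pair in a defaultdict would insert it with value 0,
# so .get() is used here to look it up without growing the dictionary.
missing_from_default = default_matrix.get(("dog", "animal"), 0)
default_size = len(default_matrix)
```

For lookups over a very large vocabulary, reading through `.get()` (or using a Counter) avoids silently adding zero-valued entries.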

gnibbler 2010-07-13 12:08:00

Gnibbler, thanks for the response. If I understand correctly, you are suggesting that I create a dictionary of 11788567 * 318k key-value pairs?

Denzil 2010-07-13 12:14:51

@Denzil, then it wouldn't be a sparse matrix, would it? Don't create entries for the word/annotation pairs that are zero. How many word/annotation pairs do you have?

gnibbler 2010-07-13 12:23:20

Gnibbler, I currently don't know the number of "non-zero" word/annotation pairs. I am waiting for suggestions on the "search" part of my query. The text is pretty noisy, so handling search is itself a challenge.

Denzil 2010-07-13 12:48:50

Term-Topic Matrix for a huge file