Hi,

I have a list of 17 million sentences in a text file. Each sentence contains at most 200 characters and is accompanied by one or more annotations. I also have a list of the unique annotations and a list of the unique words obtained from the 17 million sentences. I have to create a sparse matrix with the unique words as rows and the annotations (318k) as columns. Each value in the matrix would be the number of times a word has appeared with an annotation.

Matrix Data Structure
The matrix is obviously going to be very large. Any pointers on handling a matrix of this size? My first thought was to use a CSV file.

Co-occurrence word search
Each sentence may contain one or more annotations. I would appreciate pointers on how to speed up my search and on things to watch out for.

  • I am fine with Python/Java. If there's something else like a Shell Script/Perl etc. which would ease my task, I would be glad to use it
  • I am thinking about using Lucene for the search. I am NOT sure if Lucene is required as all my sentences are indexed in a DB
  • I apologize for posting no code, but this ain't no homework! An idea/suggestion/pointer will work for me.
+3  A: 

For sparse matrices, do not use an actual matrix or list.

Use a dictionary keyed by the (word, annotation) pair. Much simpler.

matrix[(word, annotation)] = matrix.get((word, annotation), 0) + 1
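
As a rough illustration of the idea, here is a minimal sketch that counts word/annotation pairs with a plain dictionary. The sample `records`, the helper name `count_cooccurrences`, and the whitespace tokenization are assumptions made for the example, not part of the question; `dict.get` is used so that a missing key starts from zero.

```python
# Hypothetical sample data: each record is (sentence, list of annotations).
records = [
    ("the cat sat", ["ANIMAL"]),
    ("the dog sat", ["ANIMAL", "PET"]),
]

def count_cooccurrences(records):
    """Count how often each word appears together with each annotation."""
    matrix = {}
    for sentence, annotations in records:
        # Naive whitespace tokenization (assumption); a word repeated
        # within one sentence is counted once per occurrence.
        for word in sentence.split():
            for annotation in annotations:
                key = (word, annotation)
                matrix[key] = matrix.get(key, 0) + 1
    return matrix

matrix = count_cooccurrences(records)
print(matrix[("the", "ANIMAL")])  # 2
print(len(matrix))                # 7 non-zero entries are stored
```

Only pairs that actually occur get a key, so memory grows with the number of non-zero entries rather than with words times annotations.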
S.Lott
S. Lott, thanks for the response. Are you referring to Python's "dictionary" data structure? The number of unique words in my corpus is 11,788,567.
Denzil
@Denzil, a dictionary with millions of keys is no problem for Python
gnibbler
A: 

In Python 2.7+ you can use a Counter:

>>> from collections import Counter
>>> matrix = Counter()
>>> matrix[(word,annotation)]+=1

For older Python versions, use a defaultdict:

>>> from collections import defaultdict
>>> matrix = defaultdict(int)
>>> matrix[(word,annotation)]+=1
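
The same counting idiom can be applied while streaming the input one line at a time, so the 17 million sentences never have to sit in memory at once. The sketch below is only an illustration: the tab-separated line layout (`sentence<TAB>ann1,ann2`), the helper names `count_pairs` and `write_triples`, and the whitespace tokenization are all assumptions, since the question does not describe the file format. It writes the non-zero entries as (word, annotation, count) triples, which is one way the CSV idea from the question could be made to work for a sparse matrix.

```python
import csv
import io
from collections import Counter

def count_pairs(lines):
    """Stream lines of the assumed form 'sentence<TAB>ann1,ann2' and count pairs."""
    matrix = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sentence, _, ann_field = line.partition("\t")
        annotations = [a for a in ann_field.split(",") if a]
        for word in sentence.split():  # naive tokenization (assumption)
            for annotation in annotations:
                matrix[(word, annotation)] += 1
    return matrix

def write_triples(matrix, out):
    """Write only the non-zero entries as word,annotation,count rows."""
    writer = csv.writer(out, lineterminator="\n")
    for (word, annotation), count in sorted(matrix.items()):
        writer.writerow([word, annotation, count])

# In-memory stand-in for the real text file (hypothetical content).
sample = io.StringIO("the cat sat\tANIMAL\nthe dog sat\tANIMAL,PET\n")
matrix = count_pairs(sample)

out = io.StringIO()
write_triples(matrix, out)
print(out.getvalue())
```

With a real file, `open(path)` would take the place of the `io.StringIO` stand-in, since a file object also yields one line at a time.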
gnibbler
Gnibbler, thanks for the response. If I understand correctly, you are suggesting that I should create a dictionary of 11,788,567 * 318k key-value pairs?
Denzil
@Denzil, then it wouldn't be a sparse matrix, would it? Don't create entries for the word/annotation pairs that are zero. How many word/annotation pairs do you have?
gnibbler
Gnibbler, I currently don't know the number of "non-zero" word/annotation pairs. I am waiting for suggestions on the "search" part of my query. The text is pretty noisy, so handling the search is itself a challenge.
Denzil