Hi,
I have a list of 17 million sentences in a text file. Each sentence contains at most 200 characters and is accompanied by one or more annotations. I also have a list of the unique annotations and a list of the unique words obtained from the 17 million sentences. I have to create a sparse matrix with the unique words as rows and the annotations (318k) as columns. Each value in the matrix would be the number of times that word has appeared with that annotation.
Matrix Data Structure
The matrix is obviously going to be very large. Any pointers on handling matrices of this size? My first thought was to use a CSV file.
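One way to keep a CSV file manageable is to store only the nonzero cells as (row, column, count) triplets instead of a dense grid with 318k columns. The following is a minimal sketch of that idea using only the Python standard library; the function names `write_triplets` and `read_triplets`, the integer ids, and the toy counts are hypothetical and not taken from any real data.

```python
import csv
import io
from collections import Counter

def write_triplets(counts, out):
    """Write only the nonzero cells as 'word_id,annotation_id,count' rows."""
    writer = csv.writer(out)
    for (word_id, ann_id), n in sorted(counts.items()):
        writer.writerow([word_id, ann_id, n])

def read_triplets(src):
    """Rebuild the sparse counts from the triplet rows."""
    counts = Counter()
    for word_id, ann_id, n in csv.reader(src):
        counts[(int(word_id), int(ann_id))] = int(n)
    return counts

# Toy data (made up): keys are (word_id, annotation_id), values are counts.
counts = Counter({(0, 0): 3, (0, 2): 1, (5, 1): 7})

# An in-memory buffer stands in for a real file on disk.
buf = io.StringIO()
write_triplets(counts, buf)
buf.seek(0)
restored = read_triplets(buf)
```

The file then grows with the number of word/annotation pairs that actually occur, not with rows times columns. The same triplets could probably be loaded into a dedicated sparse matrix library later, but that step is not shown here.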
Co-occurrence word search
Each sentence may contain one or more annotations. I would appreciate pointers on how to speed up the search and on anything I should watch out for.
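One possible way to avoid a per-word search altogether is to stream through the sentences once and increment a counter for every (word, annotation) pair seen. The sketch below assumes a hypothetical input format, one sentence per line followed by a tab and a comma-separated list of annotations, and assumes plain whitespace tokenization where a word repeated within a sentence is counted each time. The function name `count_cooccurrences` and the sample lines are made up for illustration.

```python
from collections import Counter

def count_cooccurrences(lines):
    """Single pass: map words/annotations to integer ids and count pairs."""
    word_ids, ann_ids = {}, {}
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: "<sentence>\t<ann1>,<ann2>,..."
        sentence, _, ann_field = line.partition("\t")
        annotations = [a for a in ann_field.split(",") if a]
        for word in sentence.split():  # assumed whitespace tokenization
            w = word_ids.setdefault(word, len(word_ids))
            for ann in annotations:
                a = ann_ids.setdefault(ann, len(ann_ids))
                counts[(w, a)] += 1
    return counts, word_ids, ann_ids

# Made-up sample input; a real run would iterate over the open text file.
lines = ["the cat sat\tANIMAL,PET", "the dog\tANIMAL"]
counts, word_ids, ann_ids = count_cooccurrences(lines)
```

Because the file is read line by line, only the id dictionaries and the nonzero counts are held in memory. Whether that fits for 17 million sentences depends on how many distinct pairs occur, which the sketch does not address.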
- I am fine with Python/Java. If something else, such as a shell script or Perl, would make the task easier, I would be glad to use it.
- I am thinking about using Lucene for the search, but I am NOT sure Lucene is required, as all my sentences are already indexed in a DB.
- I apologize for posting no code, but this ain't no homework! An idea/suggestion/pointer will work for me.