How to store and retrieve 3,000,000+ words dynamically without using SQL?

Take a word from a document, then check whether that word has already been seen.

If it has, increment its count for the corresponding document.

If it has not (i.e. it is a new word), create a new column, increment the count for that document, and set the count to zero for all other documents.

For example:

I have 93,000 documents, each containing roughly 5,000 words. Whenever a new word appears, a new column is added. In this way 960,000 words have accumulated.

               Word1  Word2  Word3  Word4  Word5  ...  New Word  ...  Word96000
Document1          2     19     45     16      7  ...         0  ...        ...
Document2          4      6      3     56      3  ...         0  ...        ...
Document3         56     34      1     67      4  ...         0  ...        ...
Document4          7     45      9     45      6  ...         0  ...        ...
Document5         56     43    234     87     46  ...         0  ...        ...
Document6         56      6      2      5     23  ...         0  ...        ...
...
Document1000       5      9      9     89     34  ...         1  ...        ...

The counts of newly added words are updated dynamically in the corresponding document's entry.

+2  A: 

Such a sparse matrix is often best implemented as a dictionary of dictionaries.

Dictionary<string, Dictionary<string, int>> index;

But the question is missing too many details to give more specific advice.
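A minimal sketch of how such a dictionary of dictionaries might be used, assuming the outer key is a document name and the inner key is a word. The helpers `AddWord` and `GetCount` and the sample names are hypothetical, introduced only for illustration:

```csharp
using System;
using System.Collections.Generic;

// Outer key: document name; inner key: word; value: occurrences of that word in that document.
var index = new Dictionary<string, Dictionary<string, int>>();

// Hypothetical helper: record one occurrence of `word` in `document`.
void AddWord(string document, string word)
{
    if (!index.TryGetValue(document, out var counts))
    {
        counts = new Dictionary<string, int>();
        index[document] = counts;
    }
    counts.TryGetValue(word, out var current); // current is 0 when the word is new
    counts[word] = current + 1;
}

// Hypothetical helper: a missing entry means the word never occurred, so no zeros are stored.
int GetCount(string document, string word) =>
    index.TryGetValue(document, out var counts) && counts.TryGetValue(word, out var n) ? n : 0;

AddWord("Document1", "apple");
AddWord("Document1", "apple");
AddWord("Document2", "pear");

Console.WriteLine(GetCount("Document1", "apple")); // 2
Console.WriteLine(GetCount("Document2", "apple")); // 0
```

Because only non-zero counts are stored, a new word does not require adding a zero entry for every other document.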

0xA3
A: 

To avoid wasting memory, I would suggest the following:

class Document {
   // Indices of the words that occur in this document.
   public List<int> words = new List<int>();
}
List<Document> documents;

If you have 1000 documents, create List<Document> documents = new List<Document>(1000); and add one Document object per document (the constructor argument only reserves capacity).
Now, if document1 contains the words word2, word19 and word45, add the indices of these words to that document:

documents[0].words.Add(2);
documents[0].words.Add(19);
documents[0].words.Add(45);

You can modify the code to store the words themselves.
To see how many documents contain the word word2, go through the entire list of documents and check whether each one contains the word index:

int count = 0;
foreach (Document d in documents) {
   if (d.words.Contains(2)) {
      count++;
   }
}
sh_kamalh
A: 
// Requires: using System.IO; using System.Linq; using System.Text.RegularExpressions;
var nWords = (from Match m in Regex.Matches(File.ReadAllText("big.txt").ToLower(), "[a-z]+")
              group m.Value by m.Value)
             .ToDictionary(gr => gr.Key, gr => gr.Count());

This gives you a dictionary keyed by word, with the number of occurrences as the value. You could probably save this information as each file is read in and then build up your final lists from it.
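One way the per-file idea could look, as a rough sketch only: count the words of each document separately, then merge the results into overall totals. The helper `CountWords`, the sample texts and the document names are made up for illustration; in practice each string would come from something like File.ReadAllText.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: word counts for one document's text, using the same regex as above.
Dictionary<string, int> CountWords(string text) =>
    Regex.Matches(text.ToLower(), "[a-z]+")
         .Cast<Match>()
         .GroupBy(m => m.Value)
         .ToDictionary(g => g.Key, g => g.Count());

// Made-up sample texts standing in for the contents of real files.
var texts = new Dictionary<string, string>
{
    ["Document1"] = "The cat saw the other cat",
    ["Document2"] = "A dog"
};

// One count dictionary per document.
var perDocument = texts.ToDictionary(pair => pair.Key, pair => CountWords(pair.Value));

// Totals across all documents, built by merging the per-document counts.
var totals = new Dictionary<string, int>();
foreach (var counts in perDocument.Values)
{
    foreach (var entry in counts)
    {
        totals.TryGetValue(entry.Key, out var soFar); // soFar is 0 for a new word
        totals[entry.Key] = soFar + entry.Value;
    }
}

Console.WriteLine(perDocument["Document1"]["cat"]); // 2
Console.WriteLine(totals["the"]);                   // 2
```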

SomeMiscGuy