views: 143
answers: 6
Basically I have a dictionary containing all the words of my vocabulary as keys, each with 0 as its value.

To process a document into a bag-of-words representation, I copy that dictionary (with the appropriate IEqualityComparer), check whether it contains each word in the document, and increment that word's value.

To get the array for the bag-of-words representation I simply used the ToArray method.

This seemed to work fine, but I was just told that the dictionary doesn't guarantee the same key order, so the resulting arrays might represent the words in a different order, making them useless.

My current idea to solve this is to copy all the keys of the word dictionary into an ArrayList, create an array of the proper size, and then use the ArrayList's IndexOf method to fill the array (see the sketch below).

So my question is: is there a better way to solve this? Mine seems kinda crude... and won't I have issues because of the IEqualityComparer?
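
For illustration, a rough sketch of that idea, assuming the vocabulary is a Dictionary<string, int> (vocabulary, comparer and documentWords are just placeholders):

    // vocabulary: every known word mapped to 0; comparer: the IEqualityComparer in use.
    Dictionary<string, int> counts = new Dictionary<string, int>(vocabulary, comparer);
    foreach (string word in documentWords)
    {
        if (counts.ContainsKey(word))
            counts[word]++;
    }

    // Fix a key order once, then place each count at its key's index.
    // Note: List<T>.IndexOf uses default string equality, not the custom comparer.
    List<string> keyOrder = new List<string>(vocabulary.Keys);
    int[] bagOfWords = new int[keyOrder.Count];
    foreach (KeyValuePair<string, int> pair in counts)
        bagOfWords[keyOrder.IndexOf(pair.Key)] = pair.Value;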

+1  A: 

There is also an OrderedDictionary.

Represents a collection of key/value pairs that are accessible by the key or index.
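
For illustration, a minimal sketch (OrderedDictionary lives in System.Collections.Specialized, is non-generic, and preserves insertion order, so values need casting):

    using System.Collections.Specialized;

    OrderedDictionary counts = new OrderedDictionary();
    counts.Add("cat", 2);
    counts.Add("dog", 2);
    counts.Add("wolf", 1);

    int firstValue = (int)counts[0];      // access by index -> 2 (the value for "cat")
    int dogValue   = (int)counts["dog"];  // access by key   -> 2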

Mitch Wheat
Which is not generic.
gWiz
@gWiz: which wasn't a requirement.
Mitch Wheat
@gWiz - There are versions out there: http://www.codeproject.com/KB/recipes/GenericOrderedDictionary.aspx
Nick Craver
I want to convert it to an array to be able to use the cosine similarity later.
brokencoding
True, not a stated requirement. I'm just noting it since he referred to "dictionary" which is generic.
gWiz
A: 

Something like this might work, although it is definitely ugly, and I believe it is similar to what you were suggesting. GetWordCount() does the work.

using System.Collections.Generic;
using System.Linq;

class WordCounter
{
    // Word -> number of occurrences, in the order each word was first seen.
    public Dictionary<string, int> dictionary = new Dictionary<string, int>();

    public void CountWords(string text)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.ToLower();
            string[] words = text.Split(' ');
            if (dictionary.ContainsKey(words[0]))
            {
                // Already counted on first encounter; just move past this word.
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
            else
            {
                // Count every occurrence of this word in the remaining text.
                int count = words.Count(s => s == words[0]);
                dictionary.Add(words[0], count);
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
        }
    }

    public int[] GetWordCount(string text)
    {
        CountWords(text);
        return dictionary.Values.ToArray();
    }
}
awright18
No, I have no problems parsing the text. The idea is to represent the text like this: Text = "Cat dog wolf cat horse dog"; I would have a dictionary like this: [cat, 2] [dog, 2] [wolf, 1] [horse, 1], and the bag-of-words representation would be simply [2 2 1 1]. However, if the dictionary doesn't maintain order, I could end up with stuff like [2 1 2 1], which defeats the purpose.
brokencoding
+1  A: 

If I understand correctly, you want to split a document into words and count each word's frequency.

You could take the document and run a Regex over it to split out the words:

var words = Regex
    .Matches(input, @"\w+")
    .Cast<Match>()
    .Where(m => m.Success)
    .Select(m => m.Value);

To make the frequency map:

var map = words.GroupBy(w => w).Select(g => new { word = g.Key, frequency = g.Count() });

There are overloads of the GroupBy method that allow you to supply an alternative IEqualityComparer if this is important.
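
For example, grouping case-insensitively by passing a StringComparer (shown here only as a sketch):

    var map = words
        .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
        .Select(g => new { word = g.Key, frequency = g.Count() });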

Reading your comments, to create a corresponding sequence of only frequencies:

map.Select(a => a.frequency)

This sequence will be in exactly the same order as the sequence map above.

Is this any help at all?

spender
+3  A: 

Let me see if I understand the problem. You have two documents D1 and D2 each containing a sequence of words drawn from a known vocabulary {W1, W2... Wn}. You wish to obtain two mappings indicating the number of occurrences of each word in each document. So for D1, you might have

W1 --> 0
W2 --> 1
W3 --> 4

indicating that D1 was perhaps "W3 W2 W3 W3 W3". Perhaps D2 is "W2 W1 W2", so its mapping is

W1 --> 1
W2 --> 2
W3 --> 0

You wish to take both mappings and determine the vectors [0, 1, 4] and [1, 2, 0] and then compute the angle between those vectors as a way of determining how similar or different the two documents are.
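
(For these example vectors, that angle computation, i.e. the cosine similarity, would work out to (0·1 + 1·2 + 4·0) / (√(0²+1²+4²) · √(1²+2²+0²)) = 2 / (√17 · √5) ≈ 0.217.)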

Your problem is that the dictionary does not guarantee that the key/value pairs are enumerated in any particular order.

OK, so order them.

var vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray();
var vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray();

and you're done.

Does that solve your problem, or am I misunderstanding the scenario?

Eric Lippert
That's the scenario, but for a dictionary with, let's say, 20k words, isn't doing a sort every time I convert a document computationally heavy?
brokencoding
Eric, I don't understand how "W2 W1 W2" corresponds to the second mapping.
spender
@brokencoding: If you're only comparing two documents then why perform this operation on a dictionary with all 20k words? How big are the documents? You only need to include counts of the words that appear in at least one document.
Aaronaught
It's not just 2 documents; let's say I have to represent a directory of documents as vectors like in your example, and then I could do stuff like cluster them by similarity.
brokencoding
@spender: true, the second mapping should be [1, 2, 0], didn't even notice
brokencoding
So is there a corpus of words against which document words must match, or does that corpus get built while processing the documents?
spender
@brokencoding: So you are basically trying to do some level of natural-language processing on an entire directory of documents... and you're worrying about the performance of individual sorts? Speculation obviously isn't a substitute for profiling but I have a feeling that whatever post-processing you do, not to mention the I/O associated with opening and reading the files, is going to be the critical path. You can easily parallelize the sorts. If you're in doubt, test it.
Aaronaught
@brokencoding: Performance questions have to be answered by profiling, not by guessing. If you want to know the computational burden of sorting a list of 20K strings, write a program that sorts a list of 20K strings a hundred thousand times and tells you the average of the results, and then you'll know. If it turns out to not be fast enough then consider other options, like writing a dictionary that *does* guarantee maintaining the insertion order. Such a dictionary is not hard to write.
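
For illustration, a rough timing sketch along those lines (the data and iteration count are placeholders, not measurements):

    Random random = new Random(42);
    List<string> keys = Enumerable.Range(0, 20000)
        .Select(i => "word" + random.Next(1000000))
        .ToList();

    var watch = System.Diagnostics.Stopwatch.StartNew();
    int iterations = 1000;
    for (int i = 0; i < iterations; i++)
    {
        string[] sorted = keys.OrderBy(k => k).ToArray();
    }
    watch.Stop();
    Console.WriteLine("Average sort: {0} ms", watch.ElapsedMilliseconds / (double)iterations);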
Eric Lippert
A: 

Would this be helpful to you:

SortedDictionary<string, int> dic = new SortedDictionary<string, int>();

for (int i = 0; i < 10; i++)
{
    if (dic.ContainsKey("Word" + i))
        dic["Word" + i]++;
    else
        dic.Add("Word" + i, 1);   // first occurrence
}

// to get the array of words:
List<string> wordsList = new List<string>(dic.Keys);
string[] wordsArr = wordsList.ToArray();

// to get the array of values:
List<int> valuesList = new List<int>(dic.Values);
int[] valuesArr = valuesList.ToArray();
Sameh Serag
A: 

If all you're trying to do is calculate cosine similarity, you don't need to convert your data to 20,000-length arrays, especially considering the data would likely be sparse with most entries being zero.

While processing the files, store each file's output data in a Dictionary keyed on the word. Then, to calculate the dot product and magnitudes, iterate through the words in the full word list, look up each word in each file's output data, and use the found value if it exists and zero if it doesn't.
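
For illustration, a minimal sketch of that lookup-based computation over sparse per-document dictionaries (the names and signature are placeholders):

    static double SparseCosine(IEnumerable<string> vocabulary,
                               Dictionary<string, int> doc1,
                               Dictionary<string, int> doc2)
    {
        double dot = 0, norm1 = 0, norm2 = 0;
        foreach (string word in vocabulary)
        {
            int a, b;
            doc1.TryGetValue(word, out a);   // a stays 0 if the word is absent
            doc2.TryGetValue(word, out b);
            dot   += (double)a * b;
            norm1 += (double)a * a;
            norm2 += (double)b * b;
        }
        return dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));
    }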

gWiz