views: 143
answers: 6
Basically I have a dictionary containing all the words of my vocabulary as keys, each with 0 as its value.

To process a document into a bag-of-words representation, I copy that dictionary (with the appropriate IEqualityComparer), check whether it contains each word in the document, and increment that word's value.

To get the array for the bag-of-words representation I simply used the ToArray method.

This seemed to work fine, but I was just told that the dictionary doesn't guarantee the same key order, so the resulting arrays might represent the words in a different order, making them useless.

My current idea to solve this is to copy all the keys of the word dictionary into an ArrayList, create an array of the proper size, and then use the ArrayList's IndexOf method to fill the array (see the sketch below).

So my question is: is there a better way to solve this? Mine seems kinda crude... and won't I have issues because of the IEqualityComparer?
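
For illustration, a rough sketch of that idea, assuming the vocabulary is a Dictionary<string, int> (vocabulary, comparer and documentWords are just placeholders):

    // vocabulary: every known word mapped to 0; comparer: the IEqualityComparer in use.
    Dictionary<string, int> counts = new Dictionary<string, int>(vocabulary, comparer);
    foreach (string word in documentWords)
    {
        if (counts.ContainsKey(word))
            counts[word]++;
    }

    // Fix a key order once, then place each count at its key's index.
    // Note: List<T>.IndexOf uses default string equality, not the custom comparer.
    List<string> keyOrder = new List<string>(vocabulary.Keys);
    int[] bagOfWords = new int[keyOrder.Count];
    foreach (KeyValuePair<string, int> pair in counts)
        bagOfWords[keyOrder.IndexOf(pair.Key)] = pair.Value;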

+1  A: 

There is also an OrderedDictionary.

Represents a collection of key/value pairs that are accessible by the key or index.
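
For illustration, a minimal sketch (OrderedDictionary lives in System.Collections.Specialized, is non-generic, and preserves insertion order, so values need casting):

    using System.Collections.Specialized;

    OrderedDictionary counts = new OrderedDictionary();
    counts.Add("cat", 2);
    counts.Add("dog", 2);
    counts.Add("wolf", 1);

    int firstValue = (int)counts[0];      // access by index -> 2 (the value for "cat")
    int dogValue   = (int)counts["dog"];  // access by key   -> 2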

Mitch Wheat
Which is not generic.
gWiz
@gWiz: which wasn't a requirement.
Mitch Wheat
@gWiz - There are versions out there: http://www.codeproject.com/KB/recipes/GenericOrderedDictionary.aspx
Nick Craver
I want to convert it to an array to be able to use the cosine similarity later.
brokencoding
True, not a stated requirement. I'm just noting it since he referred to "dictionary" which is generic.
gWiz
A: 

Something like this might work, although it is definitely ugly, and I believe it is similar to what you were suggesting. GetWordCount() does the work.

using System.Collections.Generic;
using System.Linq;

class WordCounter
{
    // Word -> number of occurrences, in the order each word was first seen.
    public Dictionary<string, int> dictionary = new Dictionary<string, int>();

    public void CountWords(string text)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.ToLower();
            string[] words = text.Split(' ');
            if (dictionary.ContainsKey(words[0]))
            {
                // Already counted on first encounter; just move past this word.
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
            else
            {
                // Count every occurrence of this word in the remaining text.
                int count = words.Count(s => s == words[0]);
                dictionary.Add(words[0], count);
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }
            }
        }
    }

    public int[] GetWordCount(string text)
    {
        CountWords(text);
        return dictionary.Values.ToArray();
    }
}
awright18
No, I have no problems parsing the text. The idea is to represent the text like this: Text = "Cat dog wolf cat horse dog"; I would have a dictionary like this: [cat, 2] [dog, 2] [wolf, 1] [horse, 1], and the bag-of-words representation would be simply [2 2 1 1]. However, if the dictionary doesn't maintain order, I could end up with stuff like [2 1 2 1], which defeats the purpose.
brokencoding
+1  A: 

If I understand correctly, you want to split a document into words and count each word's frequency.

You could take the document and run a Regex over it to split out the words:

var words = Regex
    .Matches(input, @"\w+")
    .Cast<Match>()
    .Where(m => m.Success)
    .Select(m => m.Value);

To make the frequency map:

var map = words.GroupBy(w => w).Select(g => new { word = g.Key, frequency = g.Count() });

There are overloads of the GroupBy method that allow you to supply an alternative IEqualityComparer if this is important.
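
For example, grouping case-insensitively by passing a StringComparer (shown here only as a sketch):

    var map = words
        .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
        .Select(g => new { word = g.Key, frequency = g.Count() });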

Reading your comments, to create a corresponding sequence of only frequencies:

map.Select(a => a.frequency)

This sequence will be in exactly the same order as the sequence map above.

Is this any help at all?

spender
+3  A: 

Let me see if I understand the problem. You have two documents D1 and D2 each containing a sequence of words drawn from a known vocabulary {W1, W2... Wn}. You wish to obtain two mappings indicating the number of occurrences of each word in each document. So for D1, you might have

W1 --> 0
W2 --> 1
W3 --> 4

indicating that D1 was perhaps "W3 W2 W3 W3 W3". Perhaps D2 is "W2 W1 W2", so its mapping is

W1 --> 1
W2 --> 2
W3 --> 0

You wish to take both mappings and determine the vectors [0, 1, 4] and [1, 2, 0] and then compute the angle between those vectors as a way of determining how similar or different the two documents are.
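
(For these example vectors, that angle computation, i.e. the cosine similarity, would work out to (0·1 + 1·2 + 4·0) / (√(0²+1²+4²) · √(1²+2²+0²)) = 2 / (√17 · √5) ≈ 0.217.)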

Your problem is that the dictionary does not guarantee that the key/value pairs are enumerated in any particular order.

OK, so order them.

var vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray();
var vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray();

and you're done.

Does that solve your problem, or am I misunderstanding the scenario?

Eric Lippert
That's the scenario, but for a dictionary with, let's say, 20k words, isn't doing a sort every time I convert a document computationally heavy?
brokencoding
Eric, I don't understand how "W2 W1 W2" corresponds to the second mapping.
spender
@brokencoding: If you're only comparing two documents then why perform this operation on a dictionary with all 20k words? How big are the documents? You only need to include counts of the words that appear in at least one document.
Aaronaught
It's not just 2 documents; let's say I have to represent a directory of documents as vectors like in your example, and then I could do stuff like cluster them by similarity.
brokencoding
@spender: true, the second mapping should be [1, 2, 0], didn't even notice
brokencoding
So is there a corpus of words against which document words must match, or does that corpus get built while processing the documents?
spender
@brokencoding: So you are basically trying to do some level of natural-language processing on an entire directory of documents... and you're worrying about the performance of individual sorts? Speculation obviously isn't a substitute for profiling but I have a feeling that whatever post-processing you do, not to mention the I/O associated with opening and reading the files, is going to be the critical path. You can easily parallelize the sorts. If you're in doubt, test it.
Aaronaught
@brokencoding: Performance questions have to be answered by profiling, not by guessing. If you want to know the computational burden of sorting a list of 20K strings, write a program that sorts a list of 20K strings a hundred thousand times and tells you the average of the results, and then you'll know. If it turns out to not be fast enough then consider other options, like writing a dictionary that *does* guarantee maintaining the insertion order. Such a dictionary is not hard to write.
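
For illustration, a rough timing sketch along those lines (the data and iteration count are placeholders, not measurements):

    Random random = new Random(42);
    List<string> keys = Enumerable.Range(0, 20000)
        .Select(i => "word" + random.Next(1000000))
        .ToList();

    var watch = System.Diagnostics.Stopwatch.StartNew();
    int iterations = 1000;
    for (int i = 0; i < iterations; i++)
    {
        string[] sorted = keys.OrderBy(k => k).ToArray();
    }
    watch.Stop();
    Console.WriteLine("Average sort: {0} ms", watch.ElapsedMilliseconds / (double)iterations);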
Eric Lippert
A: 

Would this be helpful to you:

SortedDictionary<string, int> dic = new SortedDictionary<string, int>();

for (int i = 0; i < 10; i++)
{
    if (dic.ContainsKey("Word" + i))
        dic["Word" + i]++;
    else
        dic.Add("Word" + i, 1);   // first occurrence
}

// to get the array of words:
List<string> wordsList = new List<string>(dic.Keys);
string[] wordsArr = wordsList.ToArray();

// to get the array of values:
List<int> valuesList = new List<int>(dic.Values);
int[] valuesArr = valuesList.ToArray();
Sameh Serag
A: 

If all you're trying to do is calculate cosine similarity, you don't need to convert your data to 20,000-length arrays, especially considering the data would likely be sparse with most entries being zero.

While processing the files, store each file's output data in a Dictionary keyed on the word. Then, to calculate the dot product and magnitudes, iterate through the words in the full word list, look up each word in each file's output data, and use the found value if it exists and zero if it doesn't.
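
For illustration, a minimal sketch of that lookup-based computation over sparse per-document dictionaries (the names and signature are placeholders):

    static double SparseCosine(IEnumerable<string> vocabulary,
                               Dictionary<string, int> doc1,
                               Dictionary<string, int> doc2)
    {
        double dot = 0, norm1 = 0, norm2 = 0;
        foreach (string word in vocabulary)
        {
            int a, b;
            doc1.TryGetValue(word, out a);   // a stays 0 if the word is absent
            doc2.TryGetValue(word, out b);
            dot   += (double)a * b;
            norm1 += (double)a * a;
            norm2 += (double)b * b;
        }
        return dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));
    }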

gWiz