views:

374

answers:

2

Hi! I recently found out about n-grams and the cool possibility to compare frequency of phrases in a text body with it. Now I'm trying to make an vb.net app that simply gets an text body and returns a list of the most frequently used phrases (where n >= 2).

I found an C# example of how to generate a n-gram from a text body so I started out with converting the code to VB. The problem is that this code does create one gram per character instead of one per word. The delimiters I want to use for the words is: VbCrLf (new line), vbTab (tabs) and the following characters: !@#$%^&*()_+-={}|\:\"'?¿/.,<>’¡º×÷‘;«»[]

Does anyone have an idea how I can rewrite the following function for this purpose:

   Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
    If text Is Nothing OrElse text.Length = 0 Then
        Return Nothing
    End If

    Dim grams As New ArrayList()
    Dim length As Integer = text.Length
    If length < gramLength Then
        Dim gram As String
        For i As Integer = 1 To length
            gram = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        gram = text.Substring(length - 1, (length) - (length - 1))
        If grams.IndexOf(gram) = -1 Then
            grams.Add(gram)

        End If
    Else
        For i As Integer = 1 To gramLength - 1
            Dim gram As String = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)

            End If
        Next

        For i As Integer = 0 To (length - gramLength)
            Dim gram As String = text.Substring(i, (i + gramLength) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        For i As Integer = (length - gramLength) + 1 To length - 1
            Dim gram As String = text.Substring(i, (length) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
    End If
    Return Tokeniser.ArrayListToArray(grams)
End Function
+1  A: 

An n-gram for words is just a list of length n that stores these words. A list of n-grams is then simply a list of list of words. If you want to store frequency then you need a dictionary that is indexed by these n-grams. For your special case of 2-grams, you can imagine something like this:

Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators as String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
                             ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

For i As Integer = 0 To words.Length - 2
    Dim ngram = New String() { words(i), words(i + 1) }
    Dim oldValue As Integer = 0
    frequencies.TryGetValue(ngram, oldValue)
    frequencies(ngram) = oldValue + 1
Next

frequencies should now contain a dictionary with all two consecutive word pairs contained in the text, and the frequency with which they appear (as a consecutive pair).

This code requires the ArrayComparer class:

Public Class ArrayComparer(Of T)
    Implements IEqualityComparer(Of T())

    Private ReadOnly comparer As IEqualityComparer(Of T)

    Public Sub New()
        Me.New(EqualityComparer(Of T).Default)
    End Sub

    Public Sub New(ByVal comparer As IEqualityComparer(Of T))
        Me.comparer = comparer
    End Sub

    Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
            Implements IEqualityComparer(Of T()).Equals
        System.Diagnostics.Debug.Assert(a.Length = b.Length)
        For i As Integer = 0 to a.Length - 1
            If Not comparer.Equals(a(i), b(i)) Then Return False
        Next

        Return True
    End Function

    Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
            Implements IEqualityComparer(Of T()).GetHashCode
        Dim hashCode As Integer = 17
        For Each obj As T In arr
            hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
        Next

        Return hashCode
    End Function
End Class

Unfortunately, this code doesn’t compile on Mono because the VB compiler has problems finding the generic EqualityComparer class. I’m therefore unable to test whether the GetHashCode implementationw works as expected but it should be fine.

Konrad Rudolph
Super thanks! Full answer and upcoming questions in post below:
Majgel
I have some problem with getting this code to run. When I tried it in VS2008 I first got the warning that the functions Equals and GetHashCode "shadows an overridable method in the base class 'Object'. To override the base method, this method must be declared 'Overrides'". At run the code breaks at "frequencies.Add(ngram, oldValue + 1)" with the exception "An item with same key has already been added" when trying to insert the second occurrence of the same ngram.
Majgel
@Majgel: the shadowing is indeed an issue. As I’ve written I can’t test the code. To correct the error, just add `Overloads`. I’ve corrected the answer. The other error has also been corrected.
Konrad Rudolph
Superthanks again for your fast answer! Now the VS2008 warning is gone. The code break described above still exists however. It only happen when an n-gram which already exists in the dictionary is trying to be inserted. Both "frequencies.TryGetValue(ngram, oldValue)" and "frequencies.Add(ngram, oldValue + 1)" runs the class Equals() function and returns True as supposed. Then the exception "An item with same key has already been added" appears. Any ideas?
Majgel
@Majgel: I had also changed that piece of code: instead of `Add` the method now uses the indexed accessor, i.e. `frequencies(ngram) = oldValue + 1`.
Konrad Rudolph
Super, all works now :)
Majgel
A: 

Thank you Konrad very much for this beginning of an solution!

I tried your code and got the following result:

Text = "Hello I am a test Also I am a test"
(I also included whitespace as a separator)

frequencies now has 9 items:
---------------------
Keys: "Hello", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
Keys: "test", "Also"
Value: 1
---------------------
Keys: "Also", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------

My first question: shouldn't the 3 last key pairs get a value of 2 as they're found two times in the text?

Second: The reason I got into the n-gram approach is that I don't want to limit the word count (n) to a specific length. Is there a way to make a dynamic approach that tries to find the longest phrase match first and then step down to the last wordcount of 2?

My goal result for the sample query above is:

---------------------
Match: "I am a test"
Frequency: 2
---------------------
Match: "I am a"
Frequency: 2
---------------------
Match: "am a test"
Frequency: 2
---------------------
Match: "I am"
Frequency: 2
---------------------
Match: "am a"
Frequency: 2
---------------------
Match: "a test"
Frequency: 2
---------------------

There is an similar C++ approach for this written by Hatem Mostafa over at codeproject.com: N-gram and Fast Pattern Extraction Algorithm

Sadly I'm no C++ expert and have no idea how to convert this bit of code as it includes a lot of memory handling that .Net doesn't. The only problem with this example is that you have to specify the minimum word pattern length and I want it to be dynamic from 2 to max found.

Majgel
First off, to address the error: DAMN! Apparently, the `GetHashCode` method of arrays doesn’t work. Unfortunately, this method is used internally by the dictionary. This is the reason for the weird result. See my updated answer for a workaround: we need to define a class that does the correct equality comparison of arrays.
Konrad Rudolph
About your other problem: I’m not sure what you expect as the output. The longest possible n-gram that occurs more than once? I think this will be some variation of the longest common subsequence problem, and as such, it will be NP hard, i.e. not efficiently solvable since it doesn’t just entail two distinct sequences but rather all possible subsequences from the original sequence, of which there are exponentially many.
Konrad Rudolph
One final note: please create a new question thread for this new problem. Stack Overflow unfortunately isn’t well suited for this kind of answering-in-comments and asking additional questions.
Konrad Rudolph
Ok, I'll start an new thread with my a more specific question for this. First I want to get this code running, se my problem in the comment of your updated answer
Majgel