There are some good examples on how to calculate word frequencies in C#, but none of them are comprehensive and I really need one in VB.NET.

My current approach is limited to one word per frequency count. What is the best way to change this so that I can get a completely accurate word frequency listing?

wordFreq = New Hashtable()

Dim words As String() = Regex.Split(inputText, "(\W)")
For i As Integer = 0 To words.Length - 1
    If words(i) <> "" Then
        Dim realWord As Boolean = True
        For j As Integer = 0 To words(i).Length - 1
            If Char.IsLetter(words(i).Chars(j)) = False Then
                realWord = False
            End If
        Next j

        If realWord = True Then
            If wordFreq.Contains(words(i).ToLower()) Then
                wordFreq(words(i).ToLower()) += 1
            Else
                wordFreq.Add(words(i).ToLower, 1)
            End If
        End If
    End If
Next

Me.wordCount = New SortedList

For Each de As DictionaryEntry In wordFreq
    If wordCount.ContainsKey(de.Value) = False Then
        wordCount.Add(de.Value, de.Key)
    End If
Next

I'd prefer an actual code snippet, but generic 'oh yeah...use this and run that' would work as well.

+1  A: 

This might be helpful:

Word frequency algorithm for natural language processing

CMS
I've already looked at that - everything either uses LINQ or isn't in .net
amdfan
+3  A: 

This might be what you're looking for:

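    Dim Words = "Hello World ))))) This is a test Hello World"
    ' Keep only tokens that start with a letter, then count each distinct token.
    Dim CountTheWords = From str In Words.Split(" "c) _
                        Where str.Length > 0 AndAlso Char.IsLetter(str(0)) _
                        Group By str Into Count()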

I have just tested it and it does work

EDIT! I have added code to make sure that it counts only letters and not symbols.

FYI: I found an article on how to use LINQ and target 2.0. It feels a bit dirty, but it might help someone: http://weblogs.asp.net/fmarguerie/archive/2007/09/05/linq-support-on-net-2-0.aspx

Nathan W
I'm using .net 2.0, so unfortunately I can't use LINQ.
amdfan
Awww, that totally just stuffs it all up.
Nathan W
It would have been so easy for you.
Nathan W
Use the new compiler and target 2.0 framework. Copy enumerable.cs from Mono, and presto.
MichaelGG
@MichaelGG Does that really work?
Nathan W
I just found this, which may help: http://weblogs.asp.net/fmarguerie/archive/2007/09/05/linq-support-on-net-2-0.aspx
Nathan W
+1  A: 
Public Class CountWords

    Public Function WordCount(ByVal str As String) As Dictionary(Of String, Integer)
        Dim ret As New Dictionary(Of String, Integer)
        Dim word As String = ""

        str = str.ToLower()
        ' Run one position past the end so the last word is counted too.
        For index As Integer = 0 To str.Length
            If index < str.Length AndAlso Char.IsLetter(str(index)) Then
                ' Still inside a word: append the letter.
                word &= str(index)
            ElseIf word.Length > 0 Then
                ' A non-letter (or the end of the text) closes the current word.
                If Not ret.ContainsKey(word) Then
                    ret(word) = 1
                Else
                    ret(word) += 1
                End If
                word = ""
            End If
        Next

        Return ret
    End Function

End Class

Then for a quick demo application, create a winforms app with one multiline textbox called InputBox, one listview called OutputList and one button called CountBtn. In the list view create two columns - "Word" and "Freq." Select the "details" list type. Add an event handler for CountBtn. Then use this code:

Imports System.Windows.Forms.ListViewItem

Public Class MainForm

    Private WordCounts As CountWords = New CountWords

    Private Sub CountBtn_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles CountBtn.Click
        OutputList.Items.Clear()
        Dim ret As Dictionary(Of String, Integer) = Me.WordCounts.WordCount(InputBox.Text)
        For Each item As String In ret.Keys
            ' One row per word: the word in the first column, its count in the second.
            Dim litem As ListViewItem = New ListViewItem
            litem.Text = item
            Dim csitem As ListViewSubItem = New ListViewSubItem(litem, ret.Item(item).ToString())

            litem.SubItems.Add(csitem)
            OutputList.Items.Add(litem)
        Next

        ' A width of -1 sizes a column to its longest item, so set it once after the list is filled.
        Word.Width = -1
        Freq.Width = -1
    End Sub
End Class

You did a terrible terrible thing to make me write this in VB and I will never forgive you.

:p

Good luck!

EDIT

Fixed blank string bug and case bug

nlaq
index As Integer = 1 should be = 0, otherwise the first character of the first word is missed. Otherwise this is great, thanks. And congrats on 2,000 points!
amdfan
+1  A: 

Pretty close, but \w+ is a good regex to match with (it matches runs of word characters only: letters, digits and underscores).

Public Function CountWords(ByVal inputText As String) As Dictionary(Of String, Integer)
    Dim frequency As New Dictionary(Of String, Integer)

    ' Regex.Matches returns every match; Regex.Match would return only the first one.
    For Each wordMatch As Match In Regex.Matches(inputText, "\w+")
        Dim key As String = wordMatch.Value.ToLower()
        If frequency.ContainsKey(key) Then
            frequency(key) += 1
        Else
            frequency.Add(key, 1)
        End If
    Next
    Return frequency
End Function
gregmac
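
The question's code only counts tokens made up entirely of letters, while \w+ also accepts digits and underscores. If letters-only counting is what's wanted, a pattern such as \p{L}+ (one or more Unicode letters) may be closer. The sketch below is only an illustration: the module name, CountLetterWords and the sample text in Main are made up for this example, and it is not identical to the question's filter, since "abc123" is counted as "abc" here rather than skipped.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Module LetterWordCounter

    ' Hypothetical helper: counts runs of letters, ignoring case.
    Public Function CountLetterWords(ByVal inputText As String) As Dictionary(Of String, Integer)
        Dim frequency As New Dictionary(Of String, Integer)

        ' \p{L}+ matches one or more letters, so digits and punctuation end a word.
        For Each wordMatch As Match In Regex.Matches(inputText, "\p{L}+")
            Dim key As String = wordMatch.Value.ToLower()
            If frequency.ContainsKey(key) Then
                frequency(key) += 1
            Else
                frequency.Add(key, 1)
            End If
        Next

        Return frequency
    End Function

    Public Sub Main()
        ' Sample text borrowed from the LINQ answer earlier in the thread.
        Dim counts As Dictionary(Of String, Integer) = _
            CountLetterWords("Hello World ))))) This is a test Hello World")

        For Each pair As KeyValuePair(Of String, Integer) In counts
            Console.WriteLine(pair.Key & ": " & pair.Value.ToString())
        Next
    End Sub

End Module
```

This uses nothing newer than .NET 2.0 (no LINQ, no lambdas), so it should fit the constraint mentioned in the comments.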
You and your fancy regular expressions. I just got back from writing a Lexer for one of my projects and was in the "lexing" mode there. Your solution is better though... Perhaps not as fast? I would have to do research. +1
nlaq
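
On the original "one word per frequency count" problem: the SortedList in the question is keyed by the count, so a second word with the same count is never added. One way around that on .NET 2.0, sketched here as an assumption rather than as anyone's posted answer, is to copy the Dictionary(Of String, Integer) returned by either function above into a List and sort the pairs by count. The names FrequencySorting, SortByFrequency and CompareByCountDescending are invented for this example.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module FrequencySorting

    ' Orders pairs by count, highest first; equal counts fall back to ordinal word order.
    Private Function CompareByCountDescending( _
            ByVal x As KeyValuePair(Of String, Integer), _
            ByVal y As KeyValuePair(Of String, Integer)) As Integer
        Dim byCount As Integer = y.Value.CompareTo(x.Value)
        If byCount <> 0 Then Return byCount
        Return String.CompareOrdinal(x.Key, y.Key)
    End Function

    ' Copies the dictionary into a list and sorts it, so every word is kept
    ' even when several words share the same count.
    Public Function SortByFrequency( _
            ByVal counts As Dictionary(Of String, Integer)) _
            As List(Of KeyValuePair(Of String, Integer))
        Dim sorted As New List(Of KeyValuePair(Of String, Integer))(counts)
        sorted.Sort(AddressOf CompareByCountDescending)
        Return sorted
    End Function

    Public Sub Main()
        ' Made-up sample counts: two words share the count 1.
        Dim counts As New Dictionary(Of String, Integer)
        counts.Add("sat", 1)
        counts.Add("the", 3)
        counts.Add("cat", 1)

        For Each pair As KeyValuePair(Of String, Integer) In SortByFrequency(counts)
            Console.WriteLine(pair.Key & ": " & pair.Value.ToString())
        Next
    End Sub

End Module
```

Because the list holds word/count pairs instead of using the count as a key, ties are no longer dropped. AddressOf with a named comparison function is used instead of a lambda so this should compile with the VB 2005 compiler.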