views:

101

answers:

4

I need to calculate how many times each keyword is reoccurring in a string, with sorting by highest number. What's the fastest algorithm available in .NET code for this purpose?

+1  A: 

You could break the string into a collection of strings, one for each word, and then do a LINQ query on the collection. While I doubt it would be the fastest, it would probably be faster than regex.

wllmsaccnt
I've implemented single pass string readers before that check for occurances of words/characters as the reading progresses. You see this type of code functions for CSV parsing.
wllmsaccnt
+4  A: 

Dunno about fastest, but Linq is probably the most understandable:

var myListOfKeywords = new [] {"struct", "public", ...};

var keywordCount = from keyword in myProgramText.Split(new []{" ","(", ...})
   group by keyword into g
   where myListOfKeywords.Contains(g.Key)
   select new {g.Key, g.Count()}

foreach(var element in keywordCount)
   Console.WriteLine(String.Format("Keyword: {0}, Count: {1}", element.Key, element.Count));

You can write this in a non-Linq-y way, but the basic premise is the same; split the string up into words, and count the occurrences of each word of interest.

KeithS
+2  A: 

Simple algorithm: Split the string into an array of words, iterate over this array, and store the count of each word in a hash table. Sort by count when done.

Bernard
+4  A: 

EDIT: code below groups unique tokens with count

string[] target = src.Split(new char[] { ' ' });

var results = target.GroupBy(t => new
{
    str = t,
    count = target.Count(sub => sub.Equals(t))
});

This is finally starting to make more sense to me...

EDIT: code below results in count correlated with target substring:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select((t, index) => new {str = t, 
    count = src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))});

Results is now:

+       [0] { str = "string", count = 4 }   <Anonymous Type>
+       [1] { str = "the", count = 4 }  <Anonymous Type>
+       [2] { str = "in", count = 6 }   <Anonymous Type>

Original code below:

string src = "for each character in the string, take the rest of the " +
    "string starting from that character " +
    "as a substring; count it if it starts with the target string";
string[] target = {"string", "the", "in"};

var results = target.Select(t => src.Select((c, i) => src.Substring(i)).
    Count(sub => sub.StartsWith(t))).OrderByDescending(t => t);

with grateful acknowledgement to this previous response.

Results from debugger (which need extra logic to include the matching string with its count):

-       results {System.Linq.OrderedEnumerable<int,int>}    
-       Results View    Expanding the Results View will enumerate the IEnumerable   
        [0] 6   int
        [1] 4   int
        [2] 4   int
Steve Townsend
Now that is pretty cool.
jamietre
Yes, I need to go back and upvote my source.
Steve Townsend
I wonder how it would perform compared to a brute force method (e.g. just loop through the keywords you're seeking, use IndexOf to find occurrences, and count them to a collector array)? In no way do I mean to take away from the awesomeness of this solution, I am just curious since I don't have a good sense for the efficiency of linq.
jamietre
@jamietre - no idea. Any input from LINQ experts?
Steve Townsend
Could you suggest how to make it retrieve string as well?
Sphynx
@Sphynx - see edit - you can order by count if you so wish
Steve Townsend
This is all so beautiful it makes me want to cry. I really need to use LINQ more.
jamietre
I know, I know next to nothing about it so I am using the qs here as a classroom - standing on the shoulders of giants.
Steve Townsend
Thanks Steve, another quick question - currently it looks for specific keywords ("string", "the", "in"), can we modify it to retrieve ALL keywords from a string?
Sphynx
@Sphynx - you mean count the tokens separated by space? You can probably use `Split` and then `GroupBy` per @KeithS, just do not filter on input string list. Take out the `where` clause in his code and try it.
Steve Townsend
I removed the "where" clause, but his code gives few errors, such as "Invalid expression term" in "by" keyword and "The best overloaded method match for 'string.Split(params char[])' has some invalid arguments"
Sphynx
@Sphynx - the `...` in that code calling `Split()` marks it as just a model, intended for you to add more separators if you wish. Just make this into a well-formed list of chars and it should work. Just use " " and remove the rest for now if you like.
Steve Townsend
@Sphynx. That code won't compile for me at all. Let me sort this out.
Steve Townsend
Thanks, I actually removed the placeholder before trying it.
Sphynx
@Sphynx - see latest edit
Steve Townsend
@Sphynx - fyi there is a more elegant variation for this at http://stackoverflow.com/questions/4038836/minimize-linq-string-token-counter
Steve Townsend
I have upvoted your comment as well. Just one final dumb question :) How to enumerate values of "results"? (I want to load them into NameValueCollection)
Sphynx