Hi!
I'm making a small C# application and would like to extract a tag cloud from plain text.
I'm not sure if there is a function that could do that for me...
Any tips? (Or any other suggestions?)
Thank you!
I'm not sure if this is exactly what you're looking for, but it may help you get started:
LINQ that counts word frequency (in VB, but I'm converting it to C# now):
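Dim Words = "Hello World ))))) This is a test Hello World"
' Split on spaces, keep tokens that start with a letter, then count each distinct word.
' Checking str(0) explicitly also compiles with Option Strict On.
Dim CountTheWords = From str In Words.Split(" "c) _
                    Where str.Length > 0 AndAlso Char.IsLetter(str(0)) _
                    Group By str Into Count()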
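A rough C# equivalent of the VB query above might look like the following. This is only a sketch, not the poster's own conversion; the names WordFrequency and CountWords are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFrequency
{
    // Counts how often each space-separated token occurs.
    // Tokens that do not start with a letter (such as ")))))") are skipped.
    public static Dictionary<string, int> CountWords(string text)
    {
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => char.IsLetter(word[0]))
            .GroupBy(word => word)
            .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

Calling WordFrequency.CountWords(Words) gives a dictionary keyed by word. Like the VB version, it is case-sensitive, so "Hello" and "hello" are counted separately.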
Here is an ASP.NET Cloud Control that might help you at least get started; full source is included.
Building a tag cloud is, as I see it, a two-part process:
First, you need to split and count your tokens. Depending on how the document is structured, as well as the language it is written in, this could be as easy as counting the space-separated words. However, this is a very naive approach, as words like "the", "of", "a", etc. will have the highest counts and are not very useful as tags. I would suggest implementing some sort of word blacklist (a stop-word list) to exclude the most common and meaningless tags.
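The first step could be sketched roughly like this. The stop list here is a tiny made-up sample (a real one would be much longer), TagExtractor and Extract are hypothetical names, and treating every run of letters as a word is an assumption that will split something like "don't" into two tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Illustrative stop list only; extend it for real text.
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "the", "of", "a", "an", "and", "is", "in", "to" },
        StringComparer.OrdinalIgnoreCase);

    // Returns (tag, count) pairs, most frequent first; ties are sorted by tag.
    public static List<KeyValuePair<string, int>> Extract(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\p{L}+")
            .Cast<Match>()
            .Select(match => match.Value)
            .Where(word => !StopWords.Contains(word))
            .GroupBy(word => word)
            .Select(group => new KeyValuePair<string, int>(group.Key, group.Count()))
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .ToList();
    }
}
```

Lower-casing before counting means "Cat" and "cat" end up as one tag, which is usually what you want for a cloud.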
Once you have the result as (tag, count) pairs, you could use something similar to the following code:
(Searches is a list of SearchRecordEntity; SearchRecordEntity holds the tag and its count; SearchTagElement is a subclass of SearchRecordEntity that adds the TagCategory attribute; and processedTags is a List of SearchTagElement that holds the result.)
// Scale every count against the largest one.
double max = Searches.Max(x => (double)x.Count);
List<SearchTagElement> processedTags = new List<SearchTagElement>();
foreach (SearchRecordEntity sd in Searches)
{
    var element = new SearchTagElement();
    // Copy the tag and its count so each result carries its own data.
    // (The property name Tag and a settable Count are assumptions.)
    element.Tag = sd.Tag;
    element.Count = sd.Count;
    double count = (double)sd.Count;
    double percent = (count / max) * 100;
    // Bucket the percentage into one of five CSS classes.
    if (percent < 20)
    {
        element.TagCategory = "smallestTag";
    }
    else if (percent < 40)
    {
        element.TagCategory = "smallTag";
    }
    else if (percent < 60)
    {
        element.TagCategory = "mediumTag";
    }
    else if (percent < 80)
    {
        element.TagCategory = "largeTag";
    }
    else
    {
        element.TagCategory = "largestTag";
    }
    processedTags.Add(element);
}
You could store each category and the number of items it has in some sort of collection or database table.
From that, you can look up the count for a given category and compare it against a set of bounds. So your parameter is the category, and your return value is a count.
So if the count is greater than 10 and less than 20, apply a CSS class to the link that gives it a certain size.
You can store these count ranges as keys in a collection, and then get the value (the CSS class) whose key matches the count returned above.
I haven't got source code at hand for this process, but you won't find a simple function that does all this for you either. A control, yes (as above).
This is a very conventional approach and, from what I've seen in magazine tutorials and the like, the standard way of doing it. It is the first approach I would think of, though not necessarily the best.
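For what it's worth, the bounds lookup described above might be sketched as follows. The thresholds and class names (tag-xs, tag-s and so on) are invented for illustration, as are the names TagStyle and ClassFor; pick whatever bounds suit your data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagStyle
{
    // Hypothetical bounds: each key is the smallest count that earns the class.
    private static readonly SortedDictionary<int, string> ClassByMinCount =
        new SortedDictionary<int, string>
        {
            { 0, "tag-xs" },
            { 10, "tag-s" },
            { 20, "tag-m" },
            { 50, "tag-l" }
        };

    // Returns the class of the highest bound that does not exceed the count.
    // Negative counts are treated as zero.
    public static string ClassFor(int count)
    {
        int clamped = Math.Max(count, 0);
        return ClassByMinCount.Last(entry => entry.Key <= clamped).Value;
    }
}
```

With these sample bounds, a category with 15 items gets "tag-s" and one with 20 gets "tag-m". Keeping the bounds in one collection means you can tune the sizes without touching the lookup code.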
You may want to take a look at WordCloud, a project on CodeProject. It includes 430 stop words (like the, an, a, etc.) and uses the Porter stemming algorithm, which reduces words to their root form so that "stemmed stemming stem" are all counted as occurrences of the same word.
It's all in C# - the only thing you would have to do is modify it to output HTML instead of the visualization it creates.
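As a rough idea of what the HTML output step could look like, here is a minimal sketch. It is not taken from the WordCloud project; TagCloudHtml, Render, the tag-cloud class and the 12 to 36 pixel range are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;

public static class TagCloudHtml
{
    private const int MinPx = 12; // assumed size for the rarest tags
    private const int MaxPx = 36; // assumed size for the most frequent tag

    // Renders (tag, count) pairs as spans whose font size grows with the count.
    public static string Render(ICollection<KeyValuePair<string, int>> tags)
    {
        var html = new StringBuilder("<div class=\"tag-cloud\">");
        int max = tags.Count == 0 ? 1 : Math.Max(tags.Max(tag => tag.Value), 1);
        foreach (var tag in tags)
        {
            // Linear scale: the largest count gets MaxPx, smaller counts shrink towards MinPx.
            int size = MinPx + (MaxPx - MinPx) * tag.Value / max;
            html.AppendFormat(
                "<span style=\"font-size:{0}px\">{1}</span> ",
                size,
                WebUtility.HtmlEncode(tag.Key));
        }
        return html.Append("</div>").ToString();
    }
}
```

The tags are HTML-encoded because they come straight from arbitrary text. In a real page you would probably emit links and CSS classes rather than inline font sizes.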
I would really recommend using http://thetagcloud.codeplex.com/. It is a very clean implementation that takes care of grouping, counting and rendering of tags. It also provides filtering capabilities.