How can I count the occurrences of a word in a database text field with LINQ?

Sample keyword: ASP.NET

EDIT 4 :

Database Records :

Record 1 : [TextField] = "Blah blah blah ASP.NET bli bli bli ASP.NET blu ASP.NET yop yop ASP.NET"

Record 2 : [TextField] = "Blah blah blah bli bli bli blu ASP.NET yop yop ASP.NET"

Record 3 : [TextField] = "Blah ASP.NET blah ASP.NET blah ASP.NET bli ASP.NET bli bli ASP.NET blu ASP.NET yop yop ASP.NET"

So:

Record 1 contains 4 occurrences of the "ASP.NET" keyword

Record 2 contains 2 occurrences of the "ASP.NET" keyword

Record 3 contains 7 occurrences of the "ASP.NET" keyword

Extracted collection: IList<RecordModel> (ordered by keyword count, descending)

  • Record 3
  • Record 1
  • Record 2

LINQ to SQL would be best, but LINQ to Objects is fine too :)

NB: The "." in the ASP.NET keyword is not an issue (it is not the point of this question).

A: 

You could use Regex.Matches(input, pattern).Count, or you could do the following:

int count = 0; int startIndex = input.IndexOf(word);
while (startIndex != -1) { ++count; startIndex = input.IndexOf(word, startIndex + 1); }

Using LINQ for the counting itself would be ugly.
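The counting loop can still be combined with LINQ for the ordering the question asks about. The following is only a sketch: the class name `OccurrenceCounter` and both method names are made up for illustration, and it assumes case-sensitive matching and non-overlapping occurrences (it advances by the keyword length rather than by 1).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class OccurrenceCounter
{
    // Counts non-overlapping, case-sensitive occurrences of word in input.
    public static int Count(string input, string word)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(word))
            return 0;

        int count = 0;
        int index = input.IndexOf(word, StringComparison.Ordinal);
        while (index != -1)
        {
            count++;
            // Skip past the whole match so occurrences cannot overlap.
            index = input.IndexOf(word, index + word.Length, StringComparison.Ordinal);
        }
        return count;
    }

    // Orders the texts by how often they contain word, most frequent first.
    public static List<string> OrderByOccurrences(IEnumerable<string> texts, string word)
    {
        return texts.OrderByDescending(t => Count(t, word)).ToList();
    }
}
```

Calling `OccurrenceCounter.OrderByOccurrences(texts, "ASP.NET")` on the three sample records from the question should put Record 3 first, then Record 1, then Record 2.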

Yuriy Faktorovich
+1  A: 

Use String.Split() to turn the string into an array of words, then use LINQ to filter this list returning only the words you want, and then check the count of the result, like this:

myDbText.Split(' ').Where(token => token.Equals(word)).Count();
Dylan Beattie
The word could be followed by a period, or have a capital letter.
Yuriy Faktorovich
+2  A: 

Edit 2: I see you updated the question, which changes things a bit: you want a count for each word, eh? Try this:

string input = "some random text: how many times does each word appear in some random text, or not so random in this case";
char[] separators = new char[]{ ' ', ',', ':', ';', '?', '!', '\n', '\r', '\t' };

var query = from s in input.Split( separators )
            where s.Length > 0
            group s by s into g
            let count = g.Count()
            orderby count descending
            select new {
                Word = g.Key,
                Count = count
            };

Since you want words that might contain a "." (e.g. "ASP.NET"), I've left it out of the separator list. Unfortunately that pollutes some words: a sentence like "Blah blah blah. Blah blah." would show "blah" with a count of 1, plus "Blah", "blah." (with the trailing period) and "Blah" variants each counted separately, because the split is case-sensitive and keeps the period. You'll need to decide on a cleaning strategy here, e.g. if the "." has a letter on either side it counts as part of the word, otherwise it is treated as whitespace. That kind of logic is best done with a regex.
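One way that cleaning rule could look is sketched below. The pattern, the class name `WordFrequency` and the choice to group case-insensitively are all assumptions made for illustration, not something the question specifies: a word is taken to be a run of word characters in which a single "." or "'" is allowed only when it has word characters on both sides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordFrequency
{
    // A run of word characters; '.' or '\'' is kept only between word characters,
    // so "ASP.NET" and "haven't" stay whole while a trailing "." is dropped.
    private static readonly Regex WordRegex = new Regex(@"\w+(?:[.']\w+)*");

    // Returns each distinct word with its count, most frequent first.
    // Grouping ignores case (an assumption); the key is the first spelling seen.
    public static List<KeyValuePair<string, int>> Count(string input)
    {
        return WordRegex.Matches(input)
            .Cast<Match>()
            .GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(pair => pair.Value)
            .ToList();
    }
}
```

With this pattern, "Blah blah blah. Blah blah." yields a single entry with a count of 5, and no where clause is needed because the regex never produces empty matches.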

Timothy Walters
What if the word is "have" and you have "haven't" in your text? Whether your solution works depends on the requirements.
Yuriy Faktorovich
I don't really need the count of a specific word on its own; I need the records extracted and ordered by how many times a specific word occurs in each record, highest count first.
Yoann. B
Same issues with [.] would also apply to ['], assuming you want to have quote marks excluded except when they're part of a word. This issue is probably best split into another question since you'll want the best regex to extract words (if there isn't already a question answering this).
Timothy Walters
Once you have a nice regex that matches words with [.] and/or ['] in a way you like, simply replace "input.Split( separators )" with "Regex.Matches( input, wordFindingRegEx )"; I think "s" (our string) would then have to become "match.Value" in 4 places. With the correct regex you could also remove the where clause.
Timothy Walters
Timothy > That's not exactly what I want. I've updated my question; I hope it is clearer now ...
Yoann. B
+3  A: 

A regex handles this nicely. You can use the \b metacharacter to anchor the word boundary, and escape the keyword to avoid unintended use of special regex characters. It also handles the cases of trailing periods, commas, etc.

string[] records =
{
    "foo ASP.NET bar", "foo bar",
    "foo ASP.NET? bar ASP.NET",
    "ASP.NET foo ASP.NET! bar ASP.NET",
    "ASP.NET, ASP.NET ASP.NET, ASP.NET"
};
string keyword = "ASP.NET";
string pattern = @"\b" + Regex.Escape(keyword) + @"\b";
var query = records.Select((t, i) => new
            {
                Index = i,
                Text = t,
                Count = Regex.Matches(t, pattern).Count
            })
            .OrderByDescending(item => item.Count);

foreach (var item in query)
{
    Console.WriteLine("Record {0}: {1} occurrences - {2}",
        item.Index, item.Count, item.Text);
}

Voila! :)
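One caveat with \b: it only matches between a word character and a non-word character, so it works for "ASP.NET" but not for a keyword that begins or ends with a symbol, such as "C#". A possible variation, sketched here with a made-up class name `KeywordMatcher`, swaps \b for lookarounds that only require that no word character touches the keyword on either side.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class KeywordMatcher
{
    // Counts occurrences of keyword that are not directly preceded or
    // followed by a word character (letter, digit or underscore).
    public static int CountWholeWord(string text, string keyword)
    {
        string pattern = @"(?<!\w)" + Regex.Escape(keyword) + @"(?!\w)";
        return Regex.Matches(text, pattern).Count;
    }
}
```

For "ASP.NET" this behaves like the \b version in the sample above; the difference only shows up for keywords with a leading or trailing non-word character.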

Ahmad Mageed
A: 

I know this is more than the original question asked, but it still matches the subject and I'm including it for others who search on this question later. This does not require that the whole word be matched in the strings that are searched, however it can be easily modified to do so with code from Ahmad's post.

//use this method to order objects and keep the existing type
class Program
{
  static void Main(string[] args)
  {
    List<TwoFields> tfList = new List<TwoFields>();
    tfList.Add(new TwoFields { one = "foo ASP.NET barfoo bar", two = "bar" });
    tfList.Add(new TwoFields { one = "foo bar foo", two = "bar" });
    tfList.Add(new TwoFields { one = "", two = "barbarbarbarbar" });

    string keyword = "bar";
    string pattern = Regex.Escape(keyword);
    // join the two fields with a space so a match cannot span the boundary between them
    tfList = tfList.OrderByDescending(t => Regex.Matches(string.Format("{0} {1}", t.one, t.two), pattern).Count).ToList();

    foreach (TwoFields tf in tfList)
    {
      Console.WriteLine(string.Format("{0} : {1}", tf.one, tf.two));
    }

    Console.Read();
  }
}


//a class with two string fields to be searched on
public class TwoFields
{
  public string one { get; set; }
  public string two { get; set; }
}


//same as above, but uses multiple keywords
class Program
{
  static void Main(string[] args)
  {
    List<TwoFields> tfList = new List<TwoFields>();
    tfList.Add(new TwoFields { one = "one one, two; three four five", two = "bar" });
    tfList.Add(new TwoFields { one = "one one two three", two = "bar" });
    tfList.Add(new TwoFields { one = "one two three four five five", two = "bar" });

    string keywords = " five one    ";
    string keywordsClean = Regex.Replace(keywords, @"\s+", " ").Trim(); //replace multiple spaces with one space

    string pattern = Regex.Escape(keywordsClean).Replace("\\ ","|"); //escape special chars and replace spaces with "or"
    // join the two fields with a space so a match cannot span the boundary between them
    tfList = tfList.OrderByDescending(t => Regex.Matches(string.Format("{0} {1}", t.one, t.two), pattern).Count).ToList();

    foreach (TwoFields tf in tfList)
    {
      Console.WriteLine(string.Format("{0} : {1}", tf.one, tf.two));
    }

    Console.Read();
  }
}

public class TwoFields
{
  public string one { get; set; }
  public string two { get; set; }
}
Micah Burnett