In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft SQL Server's full-text search engine instead of something more robust like Lucene.Net.

One of the features I would like to have, though, is Google-esque relevant document snippets. I quickly found that determining "relevant" snippets is more difficult than I realized.

I want to choose snippets based on search term density in the found text. So, essentially, I need to find the most search-term-dense passage in the text, where a passage is some arbitrary number of characters (say 200 -- the exact number doesn't really matter).

My first thought is to use .IndexOf() in a loop and build an array of term distances (subtracting the index of each found term from the index of the previously found term), then... what? Add up any two, three, four, or five sequential array elements and use the ones with the smallest sum (hence, the smallest distance between search terms).

That seems messy.

Is there an established, better, or more obvious way to do this than what I have come up with?

A: 

Well, here's the hacked-together version I made using the algorithm I described above. I don't think it's all that great. It uses three (count 'em, three!) loops, an array, and two lists. But, well, it's better than nothing. I also hardcoded the maximum length instead of turning it into a parameter.

// Requires: using System; using System.Collections.Generic; using System.Linq;
private static string FindRelevantSnippets(string infoText, string[] searchTerms)
{
    // Collect the starting index of every occurrence of every search term.
    List<int> termLocations = new List<int>();
    foreach (string term in searchTerms)
    {
        int termStart = infoText.IndexOf(term);
        while (termStart >= 0) // >= 0, so a term at the very start of the text isn't missed
        {
            termLocations.Add(termStart);
            termStart = infoText.IndexOf(term, termStart + 1);
        }
    }

    // No term found at all: fall back to the start of the document.
    if (termLocations.Count == 0)
    {
        if (infoText.Length > 250)
            return infoText.Substring(0, 250);
        else
            return infoText;
    }

    termLocations.Sort();

    // Distances between consecutive term occurrences.
    List<int> termDistances = new List<int>();
    for (int i = 0; i < termLocations.Count; i++)
    {
        if (i == 0)
        {
            termDistances.Add(0);
            continue;
        }
        termDistances.Add(termLocations[i] - termLocations[i - 1]);
    }

    // Find the run of up to five consecutive distances with the smallest
    // sum -- i.e., the densest cluster of term occurrences.
    int smallestSum = int.MaxValue;
    int smallestSumIndex = 0;
    for (int i = 0; i < termDistances.Count; i++)
    {
        int sum = termDistances.Skip(i).Take(5).Sum();
        if (sum < smallestSum)
        {
            smallestSum = sum;
            smallestSumIndex = i;
        }
    }

    // Back up 128 characters before the densest cluster for context and
    // take up to 250 characters (the hardcoded maximum length).
    int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
    int len = Math.Min(250, infoText.Length - start);
    return infoText.Substring(start, len);
}

One improvement I can think of would be to return multiple, shorter snippets that together add up to the longer length -- that way, multiple parts of the document get sampled.

Clever Human
Note that in practice you have to tokenize the input string and compare the tokens -- otherwise your users may start finding SEX in Sussex and Essex :-) (a problem with some web filtering programs!)
MZB
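
To illustrate MZB's point: whole-word matching can be approximated with a regular expression and word boundaries. This is a minimal sketch, not code from either answer, assuming one term per call:

// Matches the term only as a whole word, so "sex" no longer hits
// "Sussex" or "Essex". Regex.Escape guards against terms that
// contain regex metacharacters.
// Requires: using System.Text.RegularExpressions;
private static MatchCollection FindWholeWordMatches(string text, string term)
{
    string pattern = @"\b" + Regex.Escape(term) + @"\b";
    return Regex.Matches(text, pattern, RegexOptions.IgnoreCase);
}
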
A: 

If you use CONTAINSTABLE you will get a RANK back, which is in essence a density value -- the higher the RANK value, the higher the density. This way, you just run a query to get the results you want and don't have to resort to massaging the data when it's returned.
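
For illustration, here is a minimal sketch of calling CONTAINSTABLE from C# and reading the RANK column back. The Documents table, its Id and Body columns, and the connection string are assumptions for the example, not something this answer specifies:

// Requires: using System; using System.Data.SqlClient;
private static void PrintRankedResults(string connectionString, string searchCondition)
{
    const string sql =
        "SELECT d.Id, k.RANK " +
        "FROM Documents AS d " +
        "INNER JOIN CONTAINSTABLE(Documents, Body, @condition) AS k " +
        "    ON d.Id = k.[KEY] " +
        "ORDER BY k.RANK DESC";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@condition", searchCondition);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                Console.WriteLine("Doc {0}: RANK {1}", reader.GetInt32(0), reader.GetInt32(1));
        }
    }
}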

Coolcoder
This is for displaying the results of the search. They are displayed in RANK order. But in order to display them, I need a Google-esque snippet of relevant text from the document -- not the whole document.
Clever Human
+1  A: 

This is a nice problem :)

I think I'd create an index vector: for each word, an entry of 1 if it is a search term and 0 otherwise. Then find the i such that sum(indexvector[i : i + maxlength]) is maximized.

This can actually be done rather efficiently: start with the number of search terms in the first maxlength words. Then, as you move the window along, decrease your counter if indexvector[i] = 1 (i.e., you're about to lose that search term as you increase i) and increase it if indexvector[i + maxlength] = 1 (that word just entered the window). As you go, keep track of the i with the highest counter value.

Once you've got your favourite i, you can still do fine-tuning, like seeing whether you can reduce the actual window size without lowering your counter (e.g. to snap to sentence boundaries), or picking the best i out of a number of i's with equivalent counter values.
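
As a concrete illustration, here is a minimal C# sketch of that sliding window (assuming the text has already been tokenized into words and that maxLength counts words rather than characters):

// Returns the starting word index of the maxLength-word window that
// contains the most search-term hits.
// Requires: using System; using System.Collections.Generic; using System.Linq;
private static int FindDensestWindowStart(string[] words, string[] searchTerms, int maxLength)
{
    var terms = new HashSet<string>(searchTerms, StringComparer.OrdinalIgnoreCase);

    // indexVector[i] is 1 when words[i] is a search term, 0 otherwise.
    int[] indexVector = words.Select(w => terms.Contains(w) ? 1 : 0).ToArray();

    if (words.Length <= maxLength)
        return 0;

    // Count the hits in the first window of maxLength words...
    int count = indexVector.Take(maxLength).Sum();
    int bestCount = count;
    int bestStart = 0;

    // ...then slide one word at a time: drop the word leaving on the
    // left, add the word entering on the right.
    for (int i = 1; i + maxLength <= words.Length; i++)
    {
        count += indexVector[i + maxLength - 1] - indexVector[i - 1];
        if (count > bestCount)
        {
            bestCount = count;
            bestStart = i;
        }
    }
    return bestStart;
}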

Not sure if this is a better approach than yours - it's a different one.

You might also want to check out this paper on the topic, which comes with yet another baseline: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4357&rep=rep1&type=pdf

Nicolas78
+2  A: 

Although it is implemented in Java, you can see one approach to this problem here: http://rcrezende.blogspot.com/2010/08/smallest-relevant-text-snippet-for.html

Rodes
Good overview of the different algorithms.
Clever Human