views:

197

answers:

1

I'm using dtSearch to highlight text search matches within a document. The code to do this, minus some details and cleanup, is roughly along these lines:

SearchJob sj = new SearchJob();
sj.Request = "\"audit trail\""; // the user query
sj.FoldersToSearch.Add(path_to_src_document);
sj.Execute();
FileConverter fileConverter = new FileConverter();
fileConverter.SetInputItem(sj.Results, 0);
fileConvert.BeforeHit = "<a name=\"HH_%%ThisHit%%\"/><b>";
fileConverter.AfterHit = "</b>";
fileConverter.Execute();
string myHighlightedDoc = fileConverter.OutputString;

If I give dtSearch a quoted phrase query like

"audit trail"

then dtSearch will do hit highlighting like this:

An <a name="HH_0"/><b>audit</b> <a name="HH_1"/><b>trail</b> is a fun thing to have an <a name="HH_2"/><b>audit</b> <a name="HH_last"/><b>trail</b> about!

Note that each word of the phrase is highlighted separately. Instead, I would like phrases to get highlighted as whole units, like this:

An <a name="HH_0"/><b>audit trail</b> is a fun thing to have an <a name="HH_last"/><b>audit trail</b> about!

This would A) make highlighting look better, B) improve behavior of my javascript that helps users navigate from hit to hit, and C) give more accurate counts of total # hits.

Is there good ways to make dtSearch highlight phrases this way?

A: 

Note: I think the text and code here could use some more work. If people want to help revise the answer or the code, this can probably become community wiki.

I asked dtSearch about this (4/26/2010). Their response was two-part:

First, it's not possible get the desired highlighting behavior just by, say, altering a flag.

Second, it is possible to get some lower-level hit information where phrase matches are treated as wholes. In particular if you set both the dtsSearchWantHitsByWord and the dtsSearchWantHitsArray flags in your SearchJob, then your search results will be annotated with the word offsets of where each word or phrase in your query matches. For example, if your input document is

An audit trail is a fun thing to have an audit trail about!

and your query is

"audit trail"

then (in the .NET API), sj.Results.CurrentItem.HitsByWord[0] will contain a string like this:

audit trail (2 11 )

indicating that the phrase "audit trail" is found starting at the 2nd word and the 11th word in the document.

One thing you can do with this information is to create a "skip list" indicating which of the dtSearch highlights are insignificant (i.e. which ones are phrase continuations, rather than being the start of a word or phrase). For example, if your skip list was [4, 7, 9], that might mean that the 4th, 7th and 9th hits were insignificant, whereas the other hits were legit. A "skip list" of this sort could be used in at least two ways:

  1. You can change your code that navigates from hit to hit, so that it skips over hit number i iff skipList.contains(i).
  2. Depending on requirements, you may also be able to rewrite the HTML generated by the dtSearch FileConverter. In my case I have dtSearch annotate hits with something like <name="HH_1"/><span class="highlight">hitword</span>, and use the A tags (and the fact that they're numbered sequentially -- HH_1, HH_2, HH_3, etc.) as the basis of hit navigation. So what I've tried, with some success, is walking the HTML, and stripping out all the A tags where the i in HH_i is featured in the skip list. Depending on your hit navigation code, you probably need to renumber the A tags so you don't have any gaps between, say, HH_1 and HH_3.

Supposing these "skip lists" are indeed useful, how would you generate them? Well here's some code that mostly works:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using NUnit.Framework;

public class DtSearchUtil
{
    /// <summary>
    /// Makes a "skip list" for the dtSearch result document with the specified
    /// WordArray data. The skip list indicates which hits in the dtSearch markup
    /// should be skipped during hit navigation. The reason to skip some hits
    /// is to allow navigation to be phrase aware, rather than forcing the user
    /// to visit each word in the phrase as if it were an independent hit.
    /// The skip list consists of 1-indexed hit offsets. 2, for example, would
    /// mean that the second hit should be skipped during hit navigation.
    /// </summary>
    /// <param name="dtsHitsByWordArray">dtSearch HitsByWord data. You'll get this from SearchResultItem.HitsByWord
    /// if you did your search with the dtsSearchWantHitsByWord and dtsSearchWantHitsArray
    /// SearchFlags.</param>
    /// <param name="userHitCount">How many total hits there are, if phrases are counted
    /// as one hit each.</param>
    /// <returns></returns>
    public static List<int> MakeHitSkipList(string[] dtsHitsByWordArray, out int userHitCount)
    {
        List<int> skipList = new List<int>();
        userHitCount = 0;

        int curHitNum = 0; // like the dtSearch doc-level highlights, this counts hits word-by-word, rather than phrase by phrase
        List<PhraseRecord> hitRecords = new List<PhraseRecord>();
        foreach (string dtsHitsByWordString in dtsHitsByWordArray)
        {
            hitRecords.Add(PhraseRecord.ParseHitsByWordString(dtsHitsByWordString));
        }
        int prevEndOffset = -1;

        while (true)
        {
            int nextOffset = int.MaxValue;
            foreach (PhraseRecord rec in hitRecords)
            {
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;

                nextOffset = Math.Min(nextOffset, rec.OffsetList[rec.CurOffset]);
            }
            if (nextOffset == int.MaxValue)
                break;

            userHitCount++;

            PhraseRecord longestMatch = null;
            for (int i = 0; i < hitRecords.Count; i++)
            {
                PhraseRecord rec = hitRecords[i];
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;
                if (nextOffset == rec.OffsetList[rec.CurOffset])
                {
                    if (longestMatch == null ||
                        longestMatch.LengthInWords < rec.LengthInWords)
                    {
                        longestMatch = rec;
                    }
                }
            }

            // skip subsequent words in the phrase
            for (int i = 1; i < longestMatch.LengthInWords; i++)
            {
                skipList.Add(curHitNum + i);
            }

            prevEndOffset = longestMatch.OffsetList[longestMatch.CurOffset] +
                (longestMatch.LengthInWords - 1);

            longestMatch.CurOffset++;

            curHitNum += longestMatch.LengthInWords;

            // skip over any unneeded, overlapping matches (i.e. at the same offset)
            for (int i = 0; i < hitRecords.Count; i++)
            {
                while (hitRecords[i].CurOffset < hitRecords[i].OffsetList.Count &&
                    hitRecords[i].OffsetList[hitRecords[i].CurOffset] <= prevEndOffset)
                {
                    hitRecords[i].CurOffset++;
                }
            }
        }

        return skipList;
    }

    // Parsed form of the phrase-aware hit offset stuff that dtSearch can give you 
    private class PhraseRecord
    {
        public string PhraseText;

        /// <summary>
        /// Offsets into the source text at which this phrase matches. For example,
        /// offset 300 would mean that one of the places the phrase matches is
        /// starting at the 300th word in the document. (Words are counted according
        /// to dtSearch's internal word breaking algorithm.)
        /// See also:
        /// http://support.dtsearch.com/webhelp/dtSearchNetApi2/frames.html?frmname=topic&amp;frmfile=dtSearch__Engine__SearchFlags.html
        /// </summary>
        public List<int> OffsetList;

        // BUG: We calculate this with a whitespace tokenizer. This will probably
        // cause bad results in some places. (Better to figure out how to count
        // the way dtSearch would.)
        public int LengthInWords
        {
            get
            {
                return Regex.Matches(PhraseText, @"[^\s]+").Count;
            }
        }

        public int CurOffset = 0;

        public static PhraseRecord ParseHitsByWordString(string dtsHitsByWordString)
        {
            Match m = Regex.Match(dtsHitsByWordString, @"^([^,]*),\s*\d*\s*\(([^)]*)\).*");
            if (!m.Success)
                throw new ArgumentException("Bad dtsHitsByWordString. Did you forget to set dtsHitsByWordString in dtSearch?");

            string phraseText = m.Groups[1].Value;
            string parenStuff = m.Groups[2].Value;

            PhraseRecord hitRecord = new PhraseRecord();
            hitRecord.PhraseText = phraseText;
            hitRecord.OffsetList = GetMatchOffsetsFromParenGroupString(parenStuff);
            return hitRecord;
        }

        static List<int> GetMatchOffsetsFromParenGroupString(string parenGroupString)
        {
            List<int> res = new List<int>();
            MatchCollection matchCollection = Regex.Matches(parenGroupString, @"\d+");
            foreach (Match match in matchCollection)
            {
                string digitString = match.Groups[0].Value;
                res.Add(int.Parse(digitString));
            }
            return res;
        }
    }
}


[TestFixture]
public class DtSearchUtilTests
{
    [Test]
    public void TestMultiPhrasesWithoutFieldName()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 );",
            @"bana*, 4 (490 505 689 713 )"
            };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: banana-something@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: banana-something@505
        // 6: apple@552
        // 7: pie@553 [should skip]
        // 8: apple@578
        // 9: pie@579 [should skip]
        // 10: apple@589
        // 11: pie@590 [should skip]
        // 12: apple@683
        // 13: pie@684 [skip]
        // 14: banana-something@689
        // 15: apple@706
        // 16: pie@707 [skip]
        // 17: banana-something@713

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(11, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(7, skipList[2]);
        Assert.AreEqual(9, skipList[3]);
        Assert.AreEqual(11, skipList[4]);
        Assert.AreEqual(13, skipList[5]);
        Assert.AreEqual(16, skipList[6]);
        Assert.AreEqual(7, skipList.Count);
    }

    [Test]
    public void TestPhraseOveralap1()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 );",
            @"apple, 4 (482 490 499 552)"
            };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: apple@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    }

    [Test]
    public void TestPhraseOveralap2()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 );",
@"pie, 4 (483 490 500 553)"
    };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: pie@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    }

    // TODO: test "apple pie" and "apple", plus "apple pie" and "pie"

    // "subject" should not freak it out
    [Test]
    public void TestSinglePhraseWithFieldName()
    {
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 ), subject" };

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(7, userHitCount);

        Assert.AreEqual(7, skipList.Count);
        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(3, skipList[1]);
        Assert.AreEqual(5, skipList[2]);
        Assert.AreEqual(7, skipList[3]);
        Assert.AreEqual(9, skipList[4]);
        Assert.AreEqual(11, skipList[5]);
        Assert.AreEqual(13, skipList[6]);
    }
}
Chris