views:

46

answers:

2

I have this parent-child relationship

Paragraph
---------
ParagraphID   PK
// other attributes ...


Sentence
--------
SentenceID    PK
ParagraphID   FK -> Paragraph.ParagraphID
Text         nvarchar(4000)
Offset       int
Score        int
// other attributes ...

I'd like to find paragraphs that are equivalent; that is paragraphs that contain the same set of sentences. Two sentences are considered the same if they have the same Text, Offset and Score - SentenceID/ParagraphID is not part of the comparison, and two paragraphs are equivalent if they contain an equal set of sentences.

Could someone show me what a query to find equal paragraphs would look like?

EDIT: There are ca. 150K paragraphs, and 1.5M sentences. The output should include the ParagraphID, and the lowest paragraph ID that is equivalent to this one. E.g. if paragraph1 and paragraph2 are equal, then output would be

ParagraphID  EquivParagraphID
1            1
2            1
+1  A: 

In short, you need a signature for each paragraph and then compare the signatures. You did not mention the nature of the output itself. Here, I"m returning a row of comma-delimited ParagraphId values for each identical paragraph signature.

With ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.Text 
                        '|' + Cast(S1.Offset As varchar(10)) 
                        '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphId = P.ParagraphId
                    Order By S1.SentenceId
                    For Xml Path('')
                    )) As Signature
    From Paragraph As P
    )
Select Stuff(
            (
            Select ', ' + Cast(PS1.ParagraphId As varchar(10))
            From ParagraphSigs As PS1
            Where PS1.Signature = PS.Signature
            For Xml Path('')
            ), 1, 2, '') As Paragraph
From ParagraphSigs As PS
Group By PS.Signature

Given you addition about the desired output, you can change the query like so:

With ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.Text 
                        '|' + Cast(S1.Offset As varchar(10)) 
                        '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphId = P.ParagraphId
                    Order By S1.SentenceId
                    For Xml Path('')
                    )) As Signature
    From Paragraph As P
    )
Select P1.ParagraphId, P2.ParagraphId As EquivParagraphId
From ParagraphSigs As P1
    Left Join ParagraphSigs As P2
        On P2.Signature = P1.Signature
            And P2.ParagraphId <> P1.ParagraphId

Obviously, it might be possible that three or four paragraphs share the same signature, so be warned that the above results will give you a cartesian product of matching paragraphs. (e.g. (P1,P2), (P1,P3), (P2,P1), (P2,P3), (P3,P1), (P3,P2)).

In comments you asked about effectively searching on sentence last. Since you have two other parameters, you could reduce the number of signatures generated by doing by comparing on the two int columns first:

With ParagraphsNeedingSigs As
    (
    Select P1.ParagraphId
    From Paragraph As P1
    Where Exists    (
                    Select 1
                    From Paragraph As P2
                    Where P2.ParagraphId <> P1.ParagraphId
                        And P2.Offset = P1.Offet
                        And P2.Score = P1.Score
                    )
    )
    , ParagraphSigs As
    (
    Select P.ParagraphId
        , Hashbytes('SHA1'
                ,   (
                    Select '|' + S1.Text 
                        '|' + Cast(S1.Offset As varchar(10)) 
                        '|' + Cast(S1.Score As varchar(10))
                    From Sentence As S1
                    Where S1.ParagraphId = P.ParagraphId
                    Order By S1.SentenceId
                    For Xml Path('')
                    )) As Signature
    From ParagraphsNeedingSigs As P
    )
Select P.ParagraphId, P2.ParagraphId As EquivParagraphId
From Paragraph As P
    Left Join ParagraphSigs As P1
        On P1.ParagraphId = P.ParagraphId
    Left Join ParagraphSigs As P2
        On P2.Signature = P1.Signature
            And P2.ParagraphId <> P1.ParagraphId
Thomas
Thank you for this. I'd not seen the hash function before, so that was an eye opener. Even though a false match it's statistically insignificant, is there a way use the key as a hint to equal paragraphs, but still use sentence comparison to be completely sure? Please also see my update about the desired output.
mdma
@mdma - I've updated my post. In short, you can filter out the paragraphs that have no potential match based on the two int columns and then calculate the hashes for those that have a potential match.
Thomas
My final query was very similar to this. I tried also combining this with INTERSECT to weed out hash collisions and be sure the sets of sentences were really equal, but the overhead took this from a few seconds to many minutes (I never let it complete.) Instead, I changed the hash to reference the actual data so that it was unique.
mdma
+1  A: 

Since you have SQL 2008 listed (I'm not sure if this syntax was available in 2005), you might be able to use the EXCEPT or INTERSECT comparisons. It involves correlated subqueries, so performance may be an issue.

SELECT
    *
FROM
    Paragraph P
WHERE
    (SELECT COUNT(*) FROM 
(
    SELECT
        S1.[Text],
        S1.Offset,
        S1.Score
    FROM
        Paragraph P1
    INNER JOIN Sentence S1 ON
        S1.ParagraphID = P1.ParagraphID
    WHERE
        P1.ParagraphID = P.ParagraphID
    INTERSECT
    SELECT
        S2.[Text],
        S2.Offset,
        S2.Score
    FROM
        Paragraph P2
    INNER JOIN Sentence S2 ON
        S2.ParagraphID = P2.ParagraphID
    WHERE
        P2.ParagraphID > P.ParagraphID
) SQ
) = (SELECT COUNT(*) FROM Sentence P3 WHERE P3.ParagraphID = P.ParagraphID)
Tom H.
Thank you for this. I tried running the query, but stopped it after 5 minutes. Is there a way of improving the performnace? I've added table sizes and desired output to the question.
mdma