Database Administration
sql-server t-sql performance optimization query-performance
Updated Sun, 17 Jul 2022 15:02:46 GMT

Is there an efficient way to match multiples headers that have the same set of details?


Here is an example of what I'm asking:

Detail Table:

HeaderId | DetailId
    1          100
    1          101
    2          100
    2          101
    3          101
    3          102
    3          103

I'm looking for a query strategy that will get me all header ids with the same set of detail ids for each.

So in the example, I would want HeaderId 1 and 2 to join together because they have the same two detail records, but 103 would not match because it has a third item in the set.

The strategy I'm taking so far is to use the STUFF technique to create a comma separate string of detail values, checksum that string, then join on the checksum result. It's seems to be working, but I'm not sure how to optimize it. Over a set of about 7000 headers, it returns in about 6-7 seconds.

Here is the query:

with Details as 
(   
    select distinct t2.HeaderId, 
    checksum(stuff((
        select
            ',' + convert(varchar(15), t2.DetailId)
        from
            DetailTable t2
        where
            t2.HeaderId = t2.HeaderId
        for xml path('')
    ),1,1,'')) as ChkSum
from
      DetailTable t1
)
select
    *
from
    Details t1
        join Details t2
            on t2.ChkSum = t1.ChkSum
            and t2.HeaderId <> t1.HeaderId -- To avoid matching the same record

So - is this the right approach? And if it is, how can I optimize? Query plan doesn't have anything jumping out at me. The most weight is given to a table spool. Also, I'm trying to make this a function or proc if that helps.

Edit: I began researching relational division, and I think that is relevant here, but perhaps not in the context I am thinking. To give this more context, here is the business case I am trying to solve.

I have a set of Promotions that can have any number of UPCs in them. I'm trying to find promotions that have exactly the same set of UPCs in them. A lot of the solutions I'm seeing rely on using count(*). So - just some context for anyone looking at this. Thanks!




Solution

So @Neil McGuigan's comment about relational division lead me to this article.

I found the example of "Todd's Division - Dwain.C 1" to be performant and give me the results I was looking for.

This is the example from the article that I pretty much used verbatim except for field/table names:

-- Todd's Division - Dwain.C 1
SELECT j.ProjectID, s.ResourceID
FROM #ProjectTasks j
JOIN #ResourceTasks s ON j.TaskID = s.TaskID
JOIN
(
    SELECT ProjectID, c_res=COUNT(*)
    FROM #ProjectTasks
    GROUP BY ProjectID
) c ON j.ProjectID = c.ProjectID
GROUP BY j.ProjectID, ResourceID
HAVING COUNT(*) = MAX(c_res)
ORDER BY j.ProjectID, ResourceID;

Thank you for the rest of the suggestions, they didn't quite get me there, but ultimately I found a solution.