Here is an example of what I'm asking:
Detail Table:
HeaderId | DetailId
1 100
1 101
2 100
2 101
3 101
3 102
3 103
I'm looking for a query strategy that will get me all header ids with the same set of detail ids for each.
So in the example, I would want HeaderId 1 and 2 to join together because they have the same two detail records, but 103 would not match because it has a third item in the set.
The strategy I'm taking so far is to use the STUFF
technique to create a comma separate string of detail values, checksum that string, then join on the checksum result. It's seems to be working, but I'm not sure how to optimize it. Over a set of about 7000 headers, it returns in about 6-7 seconds.
Here is the query:
with Details as
(
select distinct t2.HeaderId,
checksum(stuff((
select
',' + convert(varchar(15), t2.DetailId)
from
DetailTable t2
where
t2.HeaderId = t2.HeaderId
for xml path('')
),1,1,'')) as ChkSum
from
DetailTable t1
)
select
*
from
Details t1
join Details t2
on t2.ChkSum = t1.ChkSum
and t2.HeaderId <> t1.HeaderId -- To avoid matching the same record
So - is this the right approach? And if it is, how can I optimize? Query plan doesn't have anything jumping out at me. The most weight is given to a table spool. Also, I'm trying to make this a function or proc if that helps.
Edit: I began researching relational division, and I think that is relevant here, but perhaps not in the context I am thinking. To give this more context, here is the business case I am trying to solve.
I have a set of Promotions that can have any number of UPCs in them. I'm trying to find promotions that have exactly the same set of UPCs in them. A lot of the solutions I'm seeing rely on using count(*)
. So - just some context for anyone looking at this. Thanks!
So @Neil McGuigan's comment about relational division lead me to this article.
I found the example of "Todd's Division - Dwain.C 1" to be performant and give me the results I was looking for.
This is the example from the article that I pretty much used verbatim except for field/table names:
-- Todd's Division - Dwain.C 1
SELECT j.ProjectID, s.ResourceID
FROM #ProjectTasks j
JOIN #ResourceTasks s ON j.TaskID = s.TaskID
JOIN
(
SELECT ProjectID, c_res=COUNT(*)
FROM #ProjectTasks
GROUP BY ProjectID
) c ON j.ProjectID = c.ProjectID
GROUP BY j.ProjectID, ResourceID
HAVING COUNT(*) = MAX(c_res)
ORDER BY j.ProjectID, ResourceID;
Thank you for the rest of the suggestions, they didn't quite get me there, but ultimately I found a solution.
External links referenced by this document: