Here is an example of what I'm asking:
HeaderId | DetailId 1 100 1 101 2 100 2 101 3 101 3 102 3 103
I'm looking for a query strategy that will get me all header ids with the same set of detail ids for each.
So in the example, I would want HeaderId 1 and 2 to join together because they have the same two detail records, but 103 would not match because it has a third item in the set.
The strategy I'm taking so far is to use the
STUFF technique to create a comma separate string of detail values, checksum that string, then join on the checksum result. It's seems to be working, but I'm not sure how to optimize it. Over a set of about 7000 headers, it returns in about 6-7 seconds.
Here is the query:
with Details as ( select distinct t2.HeaderId, checksum(stuff(( select ',' + convert(varchar(15), t2.DetailId) from DetailTable t2 where t2.HeaderId = t2.HeaderId for xml path('') ),1,1,'')) as ChkSum from DetailTable t1 ) select * from Details t1 join Details t2 on t2.ChkSum = t1.ChkSum and t2.HeaderId <> t1.HeaderId -- To avoid matching the same record
So - is this the right approach? And if it is, how can I optimize? Query plan doesn't have anything jumping out at me. The most weight is given to a table spool. Also, I'm trying to make this a function or proc if that helps.
Edit: I began researching relational division, and I think that is relevant here, but perhaps not in the context I am thinking. To give this more context, here is the business case I am trying to solve.
I have a set of Promotions that can have any number of UPCs in them. I'm trying to find promotions that have exactly the same set of UPCs in them. A lot of the solutions I'm seeing rely on using
count(*). So - just some context for anyone looking at this. Thanks!
So @Neil McGuigan's comment about relational division lead me to this article.
I found the example of "Todd's Division - Dwain.C 1" to be performant and give me the results I was looking for.
This is the example from the article that I pretty much used verbatim except for field/table names:
-- Todd's Division - Dwain.C 1 SELECT j.ProjectID, s.ResourceID FROM #ProjectTasks j JOIN #ResourceTasks s ON j.TaskID = s.TaskID JOIN ( SELECT ProjectID, c_res=COUNT(*) FROM #ProjectTasks GROUP BY ProjectID ) c ON j.ProjectID = c.ProjectID GROUP BY j.ProjectID, ResourceID HAVING COUNT(*) = MAX(c_res) ORDER BY j.ProjectID, ResourceID;
Thank you for the rest of the suggestions, they didn't quite get me there, but ultimately I found a solution.
External links referenced by this document: