Multi-target Query Assistance

I ran into a scenario with a database and I'm hoping the smart people here could help me out with a "best" way to handle the lookup.

The question: What would be the best way to query across different "target" tables based on a field in the source table?

I've got a few tables that look something like this:

Content Table (the source)

ContentId                             ContentTypeId                         IsEnabled
E454820D-A695-47CC-9AFA-02FE581DCDD1  9262536B-49A3-4494-802F-04DFF10424ED  1
9E2D29B7-D8EA-4369-830C-0906A63A1B99  9262536B-49A3-4494-802F-04DFF10424ED  1
3FBE607E-31A8-4D4D-A309-0BED544F6199  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1
EBF10469-83F0-4666-9477-1234BE10A841  F586769B-0822-468A-B7F3-A94D480ED9B0  1
3274AE49-0B3B-45AF-99EF-1AF71412A4EF  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1
5308ABB1-03E3-484A-943F-1BBEC44119C2  F586769B-0822-468A-B7F3-A94D480ED9B0  0
622B3BEC-8A74-4789-BC6E-213C4D14047F  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1
D1908CF8-1EF2-4F8E-B314-2A037557BEDE  9262536B-49A3-4494-802F-04DFF10424ED  0
6AF53B6C-1273-4154-9538-33B7E6C92BA8  A0753CFB-923B-4975-AD2A-42E5282A6D5D  0
0F814B78-7246-466B-9255-34F6BB2ADD48  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1

Content1 (one of the targets)

ContentID                             ContentTypeID                         UserId  CreatedDateUtc
622B3BEC-8A74-4789-BC6E-213C4D14047F  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1234    2023-01-01 16:38:32.111
3FBE607E-31A8-4D4D-A309-0BED544F6199  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  2345    2022-04-27 09:14:01.978

Content2 (one of the targets)

ContentID                             ContentTypeID                         UserId  CreatedDateUtc
0F814B78-7246-466B-9255-34F6BB2ADD48  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  91478   2012-09-27 22:45:51.521
3274AE49-0B3B-45AF-99EF-1AF71412A4EF  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  4567    2014-04-27 11:22:33.444

The first "Content" table has millions upon millions of records.  The Content1 and Content2 tables have a subset, but contain the details I need.

I'm trying to find the most efficient way to connect the Content table to both the Content1 AND Content2 tables so my output would look something like:

Desired Result Set

ContentId                             ContentTypeId                         IsEnabled  UserID  CreatedDateUtc
E454820D-A695-47CC-9AFA-02FE581DCDD1  9262536B-49A3-4494-802F-04DFF10424ED  1          1234    2012-01-06 12:00:00.00
9E2D29B7-D8EA-4369-830C-0906A63A1B99  9262536B-49A3-4494-802F-04DFF10424ED  1          5678    etc...
3FBE607E-31A8-4D4D-A309-0BED544F6199  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1          etc...
EBF10469-83F0-4666-9477-1234BE10A841  F586769B-0822-468A-B7F3-A94D480ED9B0  1
3274AE49-0B3B-45AF-99EF-1AF71412A4EF  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1
5308ABB1-03E3-484A-943F-1BBEC44119C2  F586769B-0822-468A-B7F3-A94D480ED9B0  0
622B3BEC-8A74-4789-BC6E-213C4D14047F  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1
D1908CF8-1EF2-4F8E-B314-2A037557BEDE  9262536B-49A3-4494-802F-04DFF10424ED  0
6AF53B6C-1273-4154-9538-33B7E6C92BA8  A0753CFB-923B-4975-AD2A-42E5282A6D5D  0
0F814B78-7246-466B-9255-34F6BB2ADD48  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1

There are 28 possible connections from the Content table to other tables.

I've tried a host of things, but the performance is abysmal.  I'm asking this clever bunch: What would be the ideal way to do this work?
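For reference, the general shape of my current attempt is a pile of LEFT JOINs with COALESCE to merge the target columns. A simplified sketch with only two of the targets (table names mirror the samples; the real query repeats this pattern for all 28 targets):

```sql
-- Simplified sketch of my current attempt: LEFT JOIN each target table
-- and COALESCE the detail columns together (only 2 of the 28 targets shown).
SELECT c.ContentId
     , c.ContentTypeId
     , c.IsEnabled
     , COALESCE(t1.UserId, t2.UserId)                 AS UserId
     , COALESCE(t1.CreatedDateUtc, t2.CreatedDateUtc) AS CreatedDateUtc
FROM [ContentsSourceTable] AS c
LEFT JOIN [custom_Content1] AS t1 ON t1.ContentId = c.ContentId
LEFT JOIN [custom_Content2] AS t2 ON t2.ContentId = c.ContentId;
```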

  • Is there a certain structure to the breakdown of data in Content1 and Content2, and what is the reasoning for them being two separate tables (they appear to have the same definition)?  For example, are newer records (by CreatedDateUtc) in Content2 and older records in Content1?  Also, what was the performance bottleneck in your attempts so far?  What do the queries look like?  Understanding this context should help with finding a solution.

    Perhaps there is a reasonable partitioning scheme that would sufficiently optimize the lookup?  There are some potential drawbacks, but an indexed view might be worth considering in this scenario as well.
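If the indexed view route is worth exploring, a sketch for one target might look like the following (names are hypothetical; note that SQL Server indexed views require SCHEMABINDING and two-part names, and cannot contain OUTER JOINs or UNION, so this would be one view per target table):

```sql
-- Hypothetical sketch of an indexed view materializing one source->target join.
CREATE VIEW dbo.vw_ContentWithContent1
WITH SCHEMABINDING
AS
SELECT c.ContentId, c.ContentTypeId, c.IsEnabled,
       t.UserId, t.CreatedDateUtc
FROM dbo.ContentsSourceTable AS c
INNER JOIN dbo.custom_Content1 AS t
        ON t.ContentId = c.ContentId;
GO

-- The unique clustered index is what actually materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vw_ContentWithContent1
    ON dbo.vw_ContentWithContent1 (ContentId);
```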

  • I should have prefaced this with "This is a vendor database and we do not own the schema."  But that 100% would have been my first question if I saw this question in the wild.

  • Thanks for the clarification and the sample query.  The LEFT JOINs in your query retrieve ALL rows from the big table, and only the matches from the smaller tables.  Since you are only looking for the rows with matching keys (eliminating NULLs from non-matches with COALESCE), I think an INNER JOIN approach with UNIONs might be a lot more performant.  Not sure if this is what you're looking for, but hopefully it is a step in the right direction.  Let me know if this helps.

    SELECT [Contents].[ContentId]
         , [custom_Content1].UserId
         , [custom_Content1].CreatedDateUtc
    FROM [ContentsSourceTable] AS [Contents] WITH (NOLOCK)
    INNER JOIN [custom_Content1] WITH (NOLOCK)
         ON [Contents].ContentId = [custom_Content1].ContentId
         
    UNION
    
    SELECT [Contents].[ContentId]
         , [custom_Content2].UserId
         , [custom_Content2].CreatedDateUtc
    FROM [ContentsSourceTable] AS [Contents] WITH (NOLOCK)
    INNER JOIN [custom_Content2] WITH (NOLOCK)
         ON [Contents].ContentId = [custom_Content2].ContentId
    -- add UNIONs here for the INNER JOIN queries to other 26 tables

  • The trouble here is that, given a particularly inefficient table design, there isn't a lot you can do in your query to make it more efficient.
    Is there currently an index on each of the relevant tables covering the keys and WHERE conditions you need to use?  Without indexes in the right places, query performance is pretty much doomed.

    Is your key on all these tables effectively the combined ContentId + ContentTypeId?  I've seen people make a lot of progress with silly large composite primary keys by creating a composite index with a hash.  The downside of hashes is that they are no good if you need to ORDER BY or use range conditions in the filter, but looking at these GUID-style keys I'm not sure you are going to be doing that.  If a hash is not a good fit, then I think the indexed view built out of joins mentioned above makes good sense.

    If the data set you actually intend to produce really is millions of rows, and no combination of clever indexes and WHERE conditions can reduce that, you also just have to be prepared for the performance to be what it is.  At that point your main bottleneck is likely to be read speed, RAM, and caching.  How slow are we talking?  A minute or two, or walk-away-and-make-dinner slow?

  • Yeah - speeds are bad.

    • For 100 rows: < 1 sec
    • For 1,000 rows: < 8 sec
    • For 10,000 rows: < 72 seconds
    • For all rows, I'll let you know when it finishes. :-P

    Sounds like I just need to deal with this inefficiency at the moment.

    My understanding (which is complete conjecture) is that the application layer "above" this database knows about the contextual connections and pivots to the correct table as necessary.

  • Wanted to test this out when I had fresh eyes.  Today, I ran an experiment on the LEFT vs. INNER JOINs with my data set:

    • Returning 1,000,000 rows with the LEFT JOIN: 12 seconds
    • Returning 1,000,000 rows with the INNER JOIN: 0 seconds, but no rows returned. :-(

    Sadly, my initial hunch was correct, and I need the LEFT JOINs to keep the rows that have no match in the target tables.

    But thanks for the recommendation.

    And oddly, the speed seems way up this morning.  Might be the reindexing, might be the underlying IO, might be divine intervention, but I'll take whatever improvement I can get.

  • I've seen people make a lot of progress with silly large composite primary keys by creating a composite index with a hash.

    Talk to me about how this would look.  I know composite keys, but I don't think I've ever boiled them down to a hash before.

  • I've done this before - it doesn't work in every situation, but it can be a lifesaver when it does.  Essentially you concatenate all of the fields in the composite key and feed the result to the HASHBYTES function.  Picking a common datatype for all of the fields is usually the hardest part.  So for a table with a composite key of name, streetnumber, streetname, you'd end up with something like:

    select hashbytes('<algorithm>', name + convert(varchar, streetnumber) + streetname) as hashkey from 

    If the values for the columns of a given row were 'Big Bird', 123, 'Sesame Street' and the algorithm were SHA2_256, the statement would translate to

    hashbytes('SHA2_256', 'Big Bird123Sesame Street')

    which yields a 32-byte value along the lines of 0x737CFF2ABACBD32B5C2244561DB735AC29EFB623E053983E84F6D8DEA7FE589D.

    Comparing binary hash values should nearly always be faster than comparing the original strings outright; the big question is whether the cost of converting to the common datatype and computing the hashes offsets the gains.
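If it helps to see it end to end, the usual way to make the hash joinable is a persisted computed column with an index on it.  A sketch using the hypothetical name/streetnumber/streetname table from above (a vendor-owned schema may not allow the ALTER, in which case the hash could live in a separate lookup table instead):

```sql
-- Hypothetical sketch: persist the composite-key hash and index it.
ALTER TABLE dbo.Addresses
    ADD HashKey AS CAST(
            hashbytes('SHA2_256',
                      name + convert(varchar(30), streetnumber) + streetname)
            AS binary(32)) PERSISTED;  -- SHA2_256 output is 32 bytes

CREATE INDEX IX_Addresses_HashKey ON dbo.Addresses (HashKey);
```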

  • When you tested the INNER JOINs, did you update the query to use the UNIONs as well (a separate SELECT per target table)?  Simply substituting INNER for LEFT in the original multi-join query would have a very different effect: any row without a match in every joined target table would be dropped, which would explain the zero rows.

  • Does anyone know (simply for curiosity) how the comparisons of hash bytes would compare to unique identifiers?

    I'd assume hash vs. GUID equality comparisons would be similarly performant, because neither is technically a string (varchar/nvarchar/text/ntext).

    Or am I totally off base with this assumption about them not being a primitive string type?

  • Great question!  I honestly don't know the answer here.  Your assumption is intuitive and feels correct - both are technically binary values.  Assuming they are both treated as binary for the purposes of a join, a GUID may actually be slightly faster than hashbytes depending on the output length of the hashbytes function.  I'm hesitant to commit to that stance though because I know that GUIDs are considered strings when doing comparisons against character strings; usage in the where clause for instance always hits performance hard.  For a straight join between GUIDs?  Not sure.
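One small, concrete data point on size (a quick check, not a benchmark): a uniqueidentifier stores 16 bytes, while a SHA2_256 hash is 32, so a GUID equality comparison has half as many bytes to examine:

```sql
-- uniqueidentifier stores 16 bytes; hashbytes with SHA2_256 returns 32 bytes.
DECLARE @g uniqueidentifier = NEWID();
SELECT DATALENGTH(@g)                           AS GuidBytes,  -- 16
       DATALENGTH(hashbytes('SHA2_256', 'x'))   AS HashBytes;  -- 32
```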
