Multi-target Query Assistance

I ran into a scenario with a database and I'm hoping the smart people here could help me out with a "best" way to handle the lookup.

The question: What would be the best way to query across different "target" tables based on a field from the source table?

I've got a few tables that look something like this:

Content Table (the source)

ContentId                             ContentTypeId                         IsEnabled
E454820D-A695-47CC-9AFA-02FE581DCDD1  9262536B-49A3-4494-802F-04DFF10424ED  1
9E2D29B7-D8EA-4369-830C-0906A63A1B99  9262536B-49A3-4494-802F-04DFF10424ED  1
3FBE607E-31A8-4D4D-A309-0BED544F6199  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1
EBF10469-83F0-4666-9477-1234BE10A841  F586769B-0822-468A-B7F3-A94D480ED9B0  1
3274AE49-0B3B-45AF-99EF-1AF71412A4EF  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1
5308ABB1-03E3-484A-943F-1BBEC44119C2  F586769B-0822-468A-B7F3-A94D480ED9B0  0
622B3BEC-8A74-4789-BC6E-213C4D14047F  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1
D1908CF8-1EF2-4F8E-B314-2A037557BEDE  9262536B-49A3-4494-802F-04DFF10424ED  0
6AF53B6C-1273-4154-9538-33B7E6C92BA8  A0753CFB-923B-4975-AD2A-42E5282A6D5D  0
0F814B78-7246-466B-9255-34F6BB2ADD48  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1

Content1 (one of the targets)

ContentID                             ContentTypeID                         UserId  CreatedDateUtc
622B3BEC-8A74-4789-BC6E-213C4D14047F  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1234    2023-01-01 16:38:32.111
3FBE607E-31A8-4D4D-A309-0BED544F6199  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  2345    2022-04-27 09:14:01.978

Content2 (one of the targets)

ContentID                             ContentTypeID                         UserId  CreatedDateUtc
0F814B78-7246-466B-9255-34F6BB2ADD48  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  91478   2012-09-27 22:45:51.521
3274AE49-0B3B-45AF-99EF-1AF71412A4EF  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  4567    2014-04-27 11:22:33.444

The first "Content" table has millions upon millions of records.  The Content1 and Content2 tables have a subset, but contain the details I need.

I'm trying to find the most efficient way to connect the Content table to both the Content1 AND Content2 tables so my output would look something like:

Desired Result Set

ContentId                             ContentTypeId                         IsEnabled  UserID  CreatedDateUtc
E454820D-A695-47CC-9AFA-02FE581DCDD1  9262536B-49A3-4494-802F-04DFF10424ED  1          1234    2012-01-06 12:00:00.00
9E2D29B7-D8EA-4369-830C-0906A63A1B99  9262536B-49A3-4494-802F-04DFF10424ED  1          5678    etc...
3FBE607E-31A8-4D4D-A309-0BED544F6199  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1          etc...
EBF10469-83F0-4666-9477-1234BE10A841  F586769B-0822-468A-B7F3-A94D480ED9B0  1
3274AE49-0B3B-45AF-99EF-1AF71412A4EF  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1
5308ABB1-03E3-484A-943F-1BBEC44119C2  F586769B-0822-468A-B7F3-A94D480ED9B0  0
622B3BEC-8A74-4789-BC6E-213C4D14047F  F7D226AB-D59F-475C-9D22-4A79E3F0EC07  1
D1908CF8-1EF2-4F8E-B314-2A037557BEDE  9262536B-49A3-4494-802F-04DFF10424ED  0
6AF53B6C-1273-4154-9538-33B7E6C92BA8  A0753CFB-923B-4975-AD2A-42E5282A6D5D  0
0F814B78-7246-466B-9255-34F6BB2ADD48  46448885-D0E6-4133-BBFB-F0CD7B0FD6F7  1

There are 28 possible connections from the Content table to other tables.

I've tried a host of things, but the performance is abysmal.  I'm asking this clever bunch: What would be the ideal way to do this work?

Parents
  • Is there a certain structure to the breakdown of data in Content1 and Content2, and what is the reasoning of them being two separate tables (they appear to have the same definition)?  For example, are newer records (by CreatedDateUtc) in Content2 and older records in Content1?  Also, what was the bottleneck with performance in your attempts so far?  What do the queries look like?  Understanding this context should help with solutioning.

    Perhaps there is a reasonable partitioning scheme that would sufficiently optimize the lookup?  There are some potential drawbacks, but maybe an indexed view would be worth consideration in this scenario as well.
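
    For example, a partitioned-view style arrangement might look something like this sketch (table and column names taken from the sample data above; partition elimination only kicks in if each target table can carry a trusted CHECK constraint on ContentTypeID, which may not be possible here):

    CREATE VIEW dbo.vw_AllContentDetails
    AS
    SELECT ContentID, ContentTypeID, UserId, CreatedDateUtc FROM dbo.Content1
    UNION ALL
    SELECT ContentID, ContentTypeID, UserId, CreatedDateUtc FROM dbo.Content2;
    -- ...UNION ALL the remaining target tables here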

Children
  • I should have prefaced this with "This is a vendor database and we do not own the schema."  But that 100% would have been my first question if I saw this question in the wild.

  • I should have prefaced this with "This is a vendor database and we do not own the schema."

    Ah, fun!  While you don't own the schema, are you free to add indexes or views as long as there aren't any drastic changes that would impact the application?  Or are you limited to coming up with the most efficient lookup query possible with exactly what you've been handed?  With lookups over millions and millions of rows, the limited flexibility definitely hampers optimization attempts. 

  • We can add anything as long as we don't touch the original table content.  So yes: indexes, views, and functions are all allowable.

    I've tried a very ugly view (all 28 connections as LEFT JOINs), a multi-statement table-valued function used with CROSS APPLY, and a handful of other things.  None of them seems particularly performant.

  • How often are these tables updated?  Perhaps the indexed view option could still be worth a shot?  Are the join column(s) indexed?  Would it be possible to show the attempted view query and associated query plan for additional context?

  • That's a long story, but for the sake of argument let's say it's updated once a day.  Like I said, we can add to the schema, but we can't touch the existing data or columns.  But we can add indexes.

  • The indexed view option doesn't modify the data or fields of the underlying tables, but it does introduce schemabinding and prevents structural changes to the underlying tables that may impact the indexed view unless you drop the view first.  Some additional info with sample queries in this article:  https://www.sqlshack.com/sql-server-indexed-views/

    The advantage is that the data in the view's index will be maintained for you, and if that is structured in the way you want to see your results then it should result in a simple and performant query plan.

    There may be ways to optimize with indexes alone, but I'd have an easier time assessing further with example queries/indexes/plans.
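
    As a rough sketch (object names assumed from the sample data; an indexed view needs SCHEMABINDING and two-part names, and it cannot contain UNION or OUTER JOIN, so you would likely need one view per target table and then UNION ALL the views in an outer query):

    CREATE VIEW dbo.vw_ContentDetail1
    WITH SCHEMABINDING
    AS
    SELECT c.ContentId, c.ContentTypeId, c.IsEnabled,
           t.UserId, t.CreatedDateUtc
    FROM dbo.Content AS c
    INNER JOIN dbo.Content1 AS t
        ON t.ContentID = c.ContentId;
    GO

    -- The unique clustered index is what actually materializes the view.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_ContentDetail1
        ON dbo.vw_ContentDetail1 (ContentId);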

    A few clarifying questions...

    1. Can you elaborate on the 28 possible connections between source and targets?
      1. You mentioned an ugly view using these with a LEFT join...maybe I'm missing some context regarding the complexity of these.
    2. Are the necessary JOINs already indexed well?
    3. If you just need to pull in the additional column values from the target tables where you have a matching key with the source table, have you tried two INNER JOIN queries (one for each target table) with a UNION between them (potentially inside a view definition)?  With indexes on the join columns, I'm curious if you'd get the performance you need with this approach.
  • Yes, the ContentID fields in each of the "connected" tables are indexed.

    This is the framework I've done with that crazy join list...

    SELECT [Contents].[ContentId]
         , COALESCE(
             [Content1].UserId
           , [Content2].UserId
           -- Additional fields here for the coalesce
           ) AS [UserId]
         , COALESCE(
             [Content1].CreatedDateUtc
           , [Content2].CreatedDateUtc
           -- Additional fields here for the coalesce
           ) AS [CreatedDateUtc]
    FROM [ContentsSourceTable] AS [Contents] WITH (NOLOCK)
    LEFT JOIN [custom_Content1] AS [Content1] WITH (NOLOCK)
         ON [Contents].ContentId = [Content1].ContentId
    LEFT JOIN [custom_Content2] AS [Content2] WITH (NOLOCK)
         ON [Contents].ContentId = [Content2].ContentId
    -- add a whole bunch of other joins here for the 26 other types
    
    

    I should note: I only play a DBA on television.  I'm a novice in many ways.

  • Thanks for the clarification and the sample query.  The LEFT JOINs in your query retrieve ALL rows from the big table and only the matches from the smaller tables.  Since you only want rows with matching keys (the COALESCE is just masking NULLs from the non-matches), I think the INNER JOIN approach with UNIONs could be a lot more performant.  Not sure if this is exactly what you're looking for, but hopefully it's a step in the right direction.  Let me know if this helps.

    SELECT [Contents].[ContentId]
         , [custom_Content1].UserId
         , [custom_Content1].CreatedDateUtc
    FROM [ContentsSourceTable] AS [Contents] WITH (NOLOCK)
    INNER JOIN [custom_Content1] WITH (NOLOCK)
         ON [Contents].ContentId = [custom_Content1].ContentId
         
    UNION
    
    SELECT [Contents].[ContentId]
         , [custom_Content2].UserId
         , [custom_Content2].CreatedDateUtc
    FROM [ContentsSourceTable] AS [Contents] WITH (NOLOCK)
    INNER JOIN [custom_Content2] WITH (NOLOCK)
         ON [Contents].ContentId = [custom_Content2].ContentId
    -- add UNIONs here for the INNER JOIN queries to the other 26 tables
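
    One note on that sketch: plain UNION adds a duplicate-eliminating sort across the whole combined result.  If a given ContentId can only ever match one target table, UNION ALL should give the same output while skipping that sort, which can matter a lot at these row counts.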

  • The trouble here is that, given a particularly inefficient table design, there isn't a lot you can do in your query to make it more efficient.

    Is there currently an index on all the relevant tables for the keys and WHERE conditions you need to use?  Without indexes in the right places, query performance is pretty much doomed.
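
    For example (index name hypothetical), a covering index on each target table would let the join seek on ContentId and return the detail columns without key lookups:

    CREATE NONCLUSTERED INDEX IX_custom_Content1_ContentId
        ON dbo.custom_Content1 (ContentId)
        INCLUDE (UserId, CreatedDateUtc);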

    Is your key on all these tables effectively the combination of ContentId + ContentTypeId?  I've seen people make a lot of progress with silly-large composite primary keys by creating a composite index with a hash.  The downside of hashes is that they're no good if you need ORDER BY or range conditions in the filter, but looking at these GUID-style keys, I'm not sure you'll be doing that.  If a hash isn't a good fit, then I think the indexed view built out of joins mentioned above makes good sense.
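
    A sketch of that hash idea, purely illustrative (it requires adding a computed column, so on a locked-down vendor table the same idea would have to live in a side table you maintain yourself):

    ALTER TABLE dbo.custom_Content1
        ADD KeyHash AS CHECKSUM(ContentID, ContentTypeID) PERSISTED;

    CREATE NONCLUSTERED INDEX IX_custom_Content1_KeyHash
        ON dbo.custom_Content1 (KeyHash);

    -- CHECKSUM can collide, so a lookup must still re-check the real columns:
    -- WHERE KeyHash = CHECKSUM(@ContentId, @ContentTypeId)
    --   AND ContentID = @ContentId AND ContentTypeID = @ContentTypeId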

    If the data set you actually intend to produce really is in the millions of rows, and no combination of clever indexes and WHERE conditions can reduce that, you also just have to be prepared for the performance to be what it is.  At that point your main bottleneck is likely to be read speed, RAM, and caching.  How slow are we talking: a minute or two, or walk away and make dinner?

  • Yeah - speeds are bad.

    • For 100 rows < 1 sec
    • For 1000 rows < 8 sec
    • For 10000 rows < 72 seconds
    • For all rows, I'll let you know when it finishes.

    Sounds like I just need to deal with this inefficiency at the moment.

    My understanding (which is complete conjecture) is that the application layer "above" this database knows about the contextual connections and pivots to the correct table as necessary.