Everything to Know About SQL INNER JOINs
As a follow-up to my introduction to SQL Server joins, in this T-SQL tutorial, I’m going to take a closer look at the most used type of joins in SQL Server – the INNER JOIN.
An INNER JOIN is a join type used to return rows from two inputs (tables or views) where there is generally an exact match on one or more columns between the two inputs. This just type is most common because of how relational databases are designed. The normalization process models business entities into groups of related tables, so returning related data between those tables is a hugely important aspect of application development, and that is done using INNER JOIN.
Let’s start with the two different ways to write an INNER JOIN clause. One is the older ANSI SQL-89 syntax, and the other is the newer ANSI SQL-92 syntax. I prefer the newer syntax because I think it looks better and leads to fewer coding accidents when writing the join statement (more on this later).
To start with, I’ll create two temporary tables and store some number values in them. The table #Numbers1 will contain values 1 through 10 while #Numbers2 will contain the value 1 and then all even numbers between 2 and 10.
IF OBJECT_ID ('tempdb..#Numbers1') IS NOT NULL DROP TABLE #Numbers1; GO IF OBJECT_ID ('tempdb..#Numbers2') IS NOT NULL DROP TABLE #Numbers2; GO CREATE TABLE #Numbers1 ( NumberCol INT ); GO CREATE TABLE #Numbers2 ( NumberCol INT ); GO INSERT INTO #Numbers1(NumberCol) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); GO INSERT INTO #Numbers2(NumberCol) VALUES (1), (2), (4), (6), (8), (10); GO"@ += @" SELECT * FROM #Numbers1 n1, #Numbers2 n2 WHERE n1.NumberCol = n2.NumberCol;
First let’s look at the older ANSI SQL-89 syntax, and there are two things to notice with this syntax. The first is how the table names and aliases are listed in the FROM clause separated by commas. If you were to forget to add the WHERE clause, this would still be perfectly valid syntax, but it would cause a CROSS JOIN rather than an INNER JOIN, resulting in an expensive Cartesian product. In other words, leaving out the WHERE clause would cause every row in the #Numbers1 table to be multiplied by every row in the #Numbers2 table – not our intention here. This syntax, while perfectly valid for INNER JOINs (not allowed for OUTER JOINs), can lead to accidental issues. You can learn more about the differences between INNER JOIN and OUTER JOIN here.
SELECT * FROM #Numbers1 n1 JOIN #Numbers3 n2 ON n1.NumberCol = n2.NumberCol;
Next is the ANSI SQL-92 syntax. With this syntax you use the JOIN keyword between the tables that you want to join, and the INNER keyword is not required in SQL Server. The ON clause is required with the INNER JOIN syntax, and this is where you specify the columns to compare for the join.
SELECT * FROM #Numbers1 n1 JOIN #Numbers3 n2 ON n1.NumberCol = n2.NumberCol;
One thing I want to note here is how the comparisons are evaluated for JOIN statements. In the SQL standard syntax, the FROM clause is evaluated before the WHERE clause. This means that the filter set in the FROM clause is evaluated before the WHERE clause – an important, yet subtle, distinction for when filters are applied for returning data in our queries.
One of the most common problems I see is how often customers will over-normalize their database design. Normalizing to third normal form is an excellent theory in terms of design. Still, the gain in reducing data redundancy is often heavily outweighed by the cost of the additional required joins to return data.
The SQL Server optimizer is generally reliable at making the correct decisions in terms of join order, but sometimes it gets it wrong. Including too many tables as part of your SQL statements can often lead to poor performance – especially if the SQL statement doesn’t require all of the tables listed in the join clause (which is more common than you might think!)
Below, I’ll discuss some other INNER JOIN issues and provide best practices and tips to help you optimize performance.
So often, I see customer code with SQL statements that involve six or more inner joins between tables. This is because the database design is so normalized that it’s necessary to have that number of joins, OR the customer is just trying to do too much with a single query. Either way, the more tables you INNER JOIN in a single SQL statement, the more likely the query optimizer will not choose the optimal join path. As a rule, I tend to tell customers to start considering refactoring inner joins when a select statement involves six inner joins or more.
When faced with a very complex SQL statement involving several inner joins between tables, the query optimizer has a clever way to handle such situations. To ensure timely query execution, the optimizer gives itself a time limit on how much optimization it will go through. Once it hits that limit, it will execute the best plan it has found up to that point in time. Consider the following INNER JOIN example, which involves 6 tables inner joined together in the AdventureWorks sample database:
SELECT s.BusinessEntityID, s.Name,at.Name AS AddressType, a.AddressLine1, a.AddressLine2, a.City, sp.Name AS StateProvinceName, a.PostalCode, cr.Name AS CountryRegionName FROM Sales.Store s INNER JOIN Person.BusinessEntityAddress bea ON bea.BusinessEntityID = s.BusinessEntityID INNER JOIN Person.Address a ON a.AddressID = bea.AddressID INNER JOIN Person.StateProvince sp ON sp.StateProvinceID = a.StateProvinceID INNER JOIN Person.CountryRegion cr ON cr.CountryRegionCode = sp.CountryRegionCode INNER JOIN Person.AddressType at ON at.AddressTypeID = bea.AddressTypeID;
In the execution plan, I can look at the plan property of ‘Reason For Early Termination of Statement Optimization’, which in this case has a value of ‘Time Out’:
This means that during query optimization, there were so many plan options that the time limit for optimization was hit and the “best” plan of the explored plan alternatives was chosen as the execution plan. This happens quite often, and if you have complex T-SQL statements and aren’t getting the execution plan you’d like, it may be time to refactor those T-SQL statements so they involve fewer joins, or break the statements up into separate statements.
There are cases when the SQL Server query optimizer can eliminate tables included in INNER JOIN statements if the tables are truly unnecessary. Consider this join statement between the Sales.SalesOrderHeader, Sales.SalesOrderDetail and Production.Product tables:
SELECT d.* FROM Sales.SalesOrderHeader AS h JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID JOIN Production.Product p ON d.ProductID = p.ProductID WHERE d.ModifiedDate >= '01/01/2022';
Looking at the execution plan for this statement we see there is a reference to the Sales.SalesOrderDetail table and the Production.Product table but no reference to the Sales.SalesOrderHeader table:
The reason is because there is a foreign key relationship between the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables. There is no filter in the WHERE clause against the Sales.SalesOrderHeader table and no columns from this table are referenced in the SELECT list. The trusted foreign key lets SQL Server know that for every record in the Sales.SalesOrderDetail table there must be a related record (with the same SalesOrderID) in the Sales.SalesOrderHeader table. This is a one-to-many relationship between these two tables. This foreign key constraint, in conjunction with no references to the Sales.SalesOrderHeader table, lets SQL Server decide there is no reason to include the Sales.SalesOrderHeader table in the query.
There is no foreign key relationship between the Sales.SalesOrderDetail table and the Production.Product table. Because of this the joining of these two tables is required, even though no information from the Production.Product table is required in the query. If I add a foreign key and re-run the query, we can see that the physical join between the two tables is eliminated:
ALTER TABLE Sales.SalesOrderDetail WITH CHECK ADD CONSTRAINT fk_SOD_Prod FOREIGN KEY (ProductID) REFERENCES Production.Product (ProductID); GO SELECT d.* FROM Sales.SalesOrderHeader AS h JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID JOIN Production.Product p ON d.ProductID = p.ProductID WHERE d.ModifiedDate >= '01/01/2022';
Let’s remove this constraint for the rest of the demos.
ALTER TABLE Sales.SalesOrderDetail DROP CONSTRAINT fk_SOD_Prod;
Inequality (any comparison besides the equality (=) comparison operator) operators can be used in comparing column values for INNER JOIN statements, but often the comparison doesn’t make a lot of sense if the inequality join condition is the only comparison being done. Consider the following INNER JOIN example:
SELECT * FROM #Numbers1 n1 JOIN #Numbers2 n2 ON n1.NumberCol > n2.NumberCol;
This query results in 29 rows. How can a table with 10 rows joined with a table with 6 rows return a result set with 29 rows? What is happening is the Cartesian product between the two tables is first calculated and then the comparison occurs. So, every value from #Numbers1 is compared with every value from #Numbers2 and when the value from #Numbers1 is greater than the value from #Numbers2 it is returned. A more sensible query would include an equality comparison and an inequality comparison. For example, the following join query returns those sales from Sales.SalesOrderDetail where the price for a given ProductID is less than the ListPrice for that product from the Production.Product table:
SELECT * FROM Sales.SalesOrderDetail d JOIN Production.Product P ON d.ProductID = p.ProductID AND (UnitPrice / OrderQty) < p.ListPrice;
One area where I see customers really make bad choices is when it comes to using query hints in their code. This is almost always done with the best of intentions – usually because the query optimizer makes a certain choice that doesn’t work well for a specific point in time. A few causes for this are out-of-date statistics, local variable cardinality mis-estimation, and parameter sensitivity. Don’t get me wrong – there certainly are occasions where a query hint CAN help the execution of a query. However, I always urge customers to use them sparingly as they are most always honored, and we often do not notice the side effects of their use.
Consider again a variation of the query we’ve been using in this article. The join between the Sales.SalesOrderDetail table and the Production.Product table produces a ‘Hash Match’ physical join:
SELECT d.* FROM Sales.SalesOrderDetail d JOIN Production.Product p ON d.ProductID = p.ProductID WHERE d.ModifiedDate >= '01/01/2022'
A HASH JOIN is generally great for dealing with large amounts of data. It first builds a data structure in memory, hashing the keys involved in the join operation. This is known as the “Build” phase, and it is a blocking phase (blocking in the sense of pausing the query, not blocking other operations via locks) because it will often need to request additional workspace memory. Ideally this “outer” table is the smaller of the two tables. Then the rows for the other input are scanned, hashed, and compared to values in the build phase. This is a great choice for joining two large inputs.
It may be tempting to use a hint to ensure that a HASH JOIN is always used. However, this can cause some unforeseen issues if done without testing. The reason for this is when a physical join hint is used in a query, the ordering of how the tables is written in the statement is enforced. Simply adding the keyword HASH to the INNER JOIN statement will always cause the HASH JOIN to occur, as shown below:
SELECT d.* FROM Sales.SalesOrderDetail d INNER HASH JOIN Production.Product p ON d.ProductID = p.ProductID WHERE d.ModifiedDate >= '01/01/2022';
In this case, however, the larger of the two tables, Sales.SalesOrderDetail, is the build input, causing the overall execution of the query to be much slower than the original query without the hint.
Executing the two queries in the same batch, we can see the query without the hint is twice as fast (according to costs in the query plan – and the CPU time of the execution) as the query with the hint and the forced ordering:
Understanding how INNER JOIN works is a critical skill for any database developer. Use this join when you want only to return those rows where there is a match between tables. However, there are a couple of caveats to remember when writing these SQL statements. Firstly, you must be careful when writing INNER JOIN clauses and ensure you do not overcomplicate them – including too many tables in a join will likely lead to less-than-ideal performance because of the complexity involved for the query optimizer to generate a good execution plan. Secondly, be careful when using hints in your SQL statements – these statements are strictly followed and can have unforeseen side effects.
Are you spending too much time troubleshooting database performance and fixing one thing only to have two more things break? SolarWinds SQL Sentry is built to give you an all-encompassing performance view across SQL Server, the OS, virtualization layer, and SQL Server Analysis Services (SSAS). Learn more about how SQL Sentry is designed with the frustrations of a DBA in mind.