What's the Difference Between INNER JOIN, the OUTER JOINs, and CROSS JOIN?
Over the course of my career, when interviewing people, I tend to ask questions that locate the edge of a candidate’s knowledge. One of the questions I often ask is about the different types of JOIN, and it’s still relatively rare that someone can completely explain all the join types—even though they haven’t changed for a very long time. So, here’s a refresher on how the different types of JOIN are related.
One thing I want to make clear about this post is that we’re just talking about logical JOINs—i.e. we are trying to clearly define what result you would expect when you issue a query that contains a JOIN. We are not covering how joins are navigated physically—i.e. how the database engine actually fulfils the request and produces the correct result. More on that at the end!
A long time ago, I was interviewing someone and asked them about what the phrase “cartesian product” meant. The reply was, “This is getting silly now. I’m interviewing for a database developer position and you’re asking me about hair gel?” I wasn’t asking about hair gel, and that particular interviewee did not get the position.
The cartesian product is a mathematical term in which two sets of items are combined such that every item in the first set is combined with every item in the second set. Let’s demonstrate that with a simple example. Consider set L that has the items [1, 2, 3] and set R that has the items [A, B, C]. The cartesian product would be 9 pairs:
You can see that for each item in the set L we are paired up with each of the items from set R. In databases, the cartesian product is exactly the same thing—except for the fact that the sets are tables and contain rows rather than items.
So, let’s consider a set which contains some basic numbers in SQL:
DECLARE @Left TABLE (ID [int], Value1 [varchar](10), Value2 [varchar](10)) DECLARE @Right TABLE (ID [int], Value1 [varchar](10), Value2 [varchar](10)) INSERT INTO @Left ([ID], [Value1], [Value2]) VALUES (1, 'One', 'First'), (2, 'Two', 'Second'), (3, 'Three', 'Third') INSERT INTO @Right ([ID], [Value1], [Value2]) VALUES (2, 'Deux', 'Deuxième'), (3, 'Trois', 'Troisième'), (4, 'Quatre', 'Quatrième') SELECT * FROM @Left L CROSS JOIN @Right R
The result of this query is, in fact, the cartesian product:
Hence, we can say that the CROSS JOIN produces the cartesian product—i.e. for every row in the left-hand set, our result contains a row for every row in the right-hand set.
The first thing to notice about the CROSS JOIN syntax is that there is no ‘ON’ clause—we are not specifying any way to “match” the rows from the left- and right-hand sets. All other join types do have an ON clause and so we can reduce the result set based on intent. Let’s consider that we want to match the ID columns from our left and right tables—and so we might want to have a statement like this:
SELECT * FROM @Left L INNER JOIN @Right R ON L.ID = R.ID
If we consider the ON clause, and highlight the ID values that match in our CROSS JOIN result set, we can see two rows fit the bill:
And if we actually execute the query, we can see that we do get those two rows returned:
We can say, therefore, that the INNER JOIN returns only the rows where both sides of the join can provide a matching key.
Again, I’d like to point out that this is not how the INNER JOIN result set is created physically—no database engine would create the cartesian product and then select rows from that.
Going from the cartesian product to the result of an INNER JOIN is fairly simple. We can visualize it as certain combinations being ‘picked’ from the cartesian product to produce a result. Getting to the results of an OUTER JOIN is trickier. Let’s look at why.
If we think of the definition of an OUTER JOIN, it includes the rows that aren’t matched by a predicate. Because of the fact that rows are repeated in the cartesian product, it’s no longer natural to think of the JOIN as a filter.
Considering our query again, but this time using an outer join:
SELECT * FROM @Left L LEFT OUTER JOIN @Right R ON L.ID = R.ID
Let’s just highlight the keys that aren’t matched against that predicate in our diagram:
We’ve highlighted the rows from the left and right sets that wouldn’t form a part of our INNER JOIN result. However, we have three of each, which isn’t hugely helpful for our visualization because an OUTER JOIN result would not include multiple rows. Also, in an OUTER JOIN, the unmatched side is represented by NULL values—and there aren’t any pairs matched with NULL in our cartesian product.
A typical flow for this sort of operation is to get the cartesian product, reduce it, and then add outer rows. I’m not sure that adding outer rows makes as much sense in terms of explanation, so it’s perhaps easier to think of OUTER JOINs as starting with a set that includes both the matches and the unmatched rows:
With OUTER JOIN, we are simply specifying which unmatched rows we would like to keep. In a LEFT OUTER JOIN we are keeping the unmatched rows from the left only. Similarly, a RIGHT OUTER JOIN keeps those on the right only. A FULL OUTER JOIN will keep the unmatched rows from both sides.
Let’s have a look at the actual results:
SELECT * FROM @Left L LEFT OUTER JOIN @Right R ON L.ID = R.ID
SELECT * FROM @Left L RIGHT OUTER JOIN @Right R ON L.ID = R.ID
SELECT * FROM @Left L FULL OUTER JOIN @Right R ON L.ID = R.ID
We can see that the columns from the non-matching sides have NULL values, indicating that a match is missing.
In this post, we’ve talked about logical joins—the “what” rather than the “how” of querying. It’s worth taking a look at the syntax that T-SQL offers because join hints provide a way that the logical intent is mixed in with the physical execution. Looking at an example from the linked resource:
SELECT p.Name, pr.ProductReviewID
FROM Production.Product AS p
LEFT OUTER HASH JOIN Production.ProductReview AS pr
ON p.ProductID = pr.ProductID
ORDER BY ProductReviewID DESC;
In this query, we’re interested in the ‘LEFT OUTER HASH JOIN’ phrase—it contains a mixture of logical intent and physical instruction. The ‘LEFT OUTER JOIN’ is the join we want to perform. When we add the ‘HASH’ we’re adding an instruction about how we want to perform that join. In this post we’ve covered the logical operation only—the physical operation will be a great blog topic in the future!
Let’s summarise what we’ve covered:
- CROSS JOIN is the full cartesian product of the two sides of a JOIN.
- INNER JOIN is a reduction of the cartesian product—we specify a predicate and get a result where the predicate matches.
- OUTER JOINs are more than a simple reduction—because the cartesian product contains non-matching rows multiple times and does not contain any pairs that have one NULL side.
- LEFT, RIGHT and FULL OUTER JOINs simply change the side on which unmatched rows are kept.
So, we’ve looked at the logical implications of INNER, OUTER, and CROSS JOINs. There’s a lot of depth that we haven’t covered, including semi-joins, joins with table-valued function output using the APPLY operator, the physical execution of various JOIN types, and more. Those may well be the topic for another day!