T-SQL UNION Operator: A How-To Guide
What exactly is a
UNION operator? A
UNION operation is defined as concatenating the results of two or more queries into a single result set.
I remember a client struggling to bring data from several tables into a single result set using a complex
OR query. I asked whether there was any overlap in the data in the various tables, and he said the most perfect five words, "Oh, no, they're mutually exclusive."
You may be wondering why I described his response as "perfect." Well, having mutually exclusive data meant there was no possibility of duplicates for him to be concerned about, and a simple
UNION ALL with a straightforward and efficient query plan would work.
UNION ALL may not be the best option in some cases, which is why it's crucial to understand your data and how query operators work under the covers, so your query has the most optimal execution plan. In this post, I discuss the
UNION ALL set operators in detail, including use cases for each and common performance considerations.
Here's a simple T-SQL
SELECT FirstName, LastName, AddressLine1, City, State, Zip FROM Customers UNION SELECT FName, LName, Address, CityName, StateName, ZipCode FROM Employees;
There are a couple of restrictions for using a
- All queries in the
UNIONmust return the same number of columns.
- The columns for each result set must have compatible data types. For example, combining an int column and a bigint column is fine, but attempting to combine a uniqueidentifier column with a decimal column would result in an error. Also, the order you define the column in your
SELECTstatement matters as the results are concatenated ordinally.
The queries in the above
UNION statement are simple
SELECT statements to show the overall syntax. However, the queries being concatenated together can be quite complex, including things like subqueries, aggregates, GROUP BY, and HAVING clauses. ORDER BY is a special case for the
When combining results using
UNION, it’s important to understand how duplicate values should be treated. A
UNION will remove any duplicate values returned from concatenating various result sets. The following query returns a unique list of numbers, from 1 to 10. While the values 1 and 7 are included in both result sets, the duplicates are filtered out before the data is returned, as shown below:
SELECT Number FROM (VALUES (1), (3), (5), (7), (8), (9)) AS Set1 (Number) UNION SELECT Number FROM (VALUES (1), (2), (4), (6), (7), (10)) AS Set2 (Number);
If duplicate rows aren’t a concern, or you know there are no duplicates, consider using the T-SQL
UNION ALL instead of
UNION ALL doesn’t attempt to filter out any duplicate rows. The following query is the same as the previous
UNION query but uses
UNION ALL instead of
SELECT Number FROM (VALUES (1), (3), (5), (7), (8), (9)) AS Set1 (Number) UNION ALL SELECT Number FROM (VALUES (1), (2), (4), (6), (7), (10)) AS Set2 (Number);
You can see the results are returned in the order in which they’re defined in the row constructor, and duplicates aren’t removed. The reason I mention the order in which the rows are returned here compared to the
UNION example is a
UNION must perform a sort to remove duplicate rows. Result sets from a
UNION statement are often ordered when returned, but you cannot depend on this. The only way to guarantee a result set is ordered in a specific manner is to include the ORDER BY clause as part of the statement.
One of the biggest performance problems I've seen with using the
UNION clause is when an expensive sort operation must be performed. So, as a rule, only use a
UNION if you need to remove duplicate values or use a
When a sort operation occurs, a request is made for some memory to perform the operation, known as a memory grant. Memory grants have an initial requirement for how much memory is required to begin the sort operation, and there is additional memory needed to perform the operation entirely in memory. Performing a sort in memory is much faster than spilling the data to tempdb and performing the sort operation there. However, sometimes cardinality estimations are incorrect, and SQL Server guesses it can fit all rows in memory, but it cannot, leading to an expensive sort spill to tempdb.
The following is a warning you might see in an execution plan for a sort operation if it has to spill to tempdb:
By definition, any result set is unordered, but there’s often a need to order the results from a
UNION. To order results from a
UNION, use an ORDER BY clause placed after the last statement of the
UNION. Consider the following query:
SELECT AddressLine1, City FROM Person.Address WHERE City = 'Redmond' UNION SELECT a.AddressLine1, City FROM Sales.SalesOrderHeader h JOIN Person.Address a ON h.BillToAddressID = a.AddressID JOIN Person.StateProvince sp ON a.StateProvinceID = sp.StateProvinceID WHERE sp.name = 'Washington AND City = 'Redmond' ORDER BY AddressLine1 ASC;
The ORDER BY clause is always evaluated last in a SQL statement. In the above example, the two queries are combined, and then ordering occurs. However, it’s important to understand the column name/alias you must reference in the ORDER BY clause.
The column names returned in a
UNION operation are always based on the result set of the first
SELECT statement, so you must define column aliases in the first
SELECT statement. You must also refer to these column names if you use an ORDER BY clause. Here’s an example of this:
SELECT pp.LastName AS LName, pp.FirstName AS FName, 'SalesPerson' AS OfficalJobTitle FROM Person.Person AS pp JOIN Sales.SalesPerson AS e ON e.BusinessEntityID = pp.BusinessEntityID UNION ALL SELECT pp.LastName, pp.FirstName, e.JobTitle FROM Person.Person AS pp JOIN HumanResources.Employee AS e ON e.BusinessEntityID = pp.BusinessEntityID ORDER BY LName ASC;
UNION clause combines the results of two or more queries. The data from the queries don’t need to be related as long as each query has the same number of columns the data types are compatible
JOIN operation combines rows from different tables where there’s a relationship between those tables. Depending on the type of
JOIN, the rows returned may be the same between the tables being joined (e.g.
INNER JOIN), or one table may be preserved. If there are matching rows in other tables, they’ll be included (e.g. OUTER
One practical application of the
UNION operator is to combine similar queries from the same set of tables together to form a new result set. It's been my experience to see application developers use the pattern of having similar queries with equality comparisons in the predicates because it's easier to write or generate through a tool and seems more efficient than one single query including an
OR comparison in the predicate. A simple example of this would be as follows, with the associated query plan:
SELECT ProductID, SalesOrderID, UnitPrice FROM Sales.SalesOrderDetail WHERE ProductID = 712 UNION SELECT ProductID, SalesOrderID, UnitPrice FROM Sales.SalesOrderDetail WHERE ProductID = 870;
You'll notice in the plan the
SalesOrderDetail table is referenced twice; one scan for the lookup of the
712 and another scan for the
ProductID value of
UNION these, SQL Server must run the same scan operation twice. Using the output of
SET STATISTICS IO ON, you can see 2492 pages from the buffer pool were accessed to return the results:
A more efficient way to refactor the
UNION statement is to simply use the IN clause (a shorthand way of writing an
OR clause), so the work of filtering out the
ProductID values occurs in a single scan of the table, as shown below:
SELECT ProductID, SalesOrderID, UnitPrice FROM Sales.SalesOrderDetail WHERE ProductID IN (712, 870);
You can see from the execution plan the table is now only scanned once, and the number of pages read from the buffer pool is cut in half:
It's worth noting there’s a common misconception about using the
OR operator in a
WHERE clause. Using
OR doesn’t always require scanning all the data in a table to return the rows being searched. Factors such as column cardinality, the number of values being searched, and the columns returned in the
SELECT list are considered. If there were a covering nonclustered index on the
ProductID column (which included
UnitPrice), you could use a single seek operation instead of scanning the table.
A Common Table Expression (CTE) is a type of temporary named result set to allow you to define a query once and then refer to the query by name in subsequent statements in the same batch. CTEs include the ability to do recursive queries, which is a way for a query to refer to itself until a base case is met. The structure of a recursive CTE is defined by an anchor query and a recursive query. The recursive portion of the query references the name of the CTE and will recurse until its base condition is met.
The critical thing to take away here is a recursive CTE requires a
UNION ALL statement to combine the results from the anchor query and all subsequent recursive queries into a final result set.
CTEs are great for hierarchical queries, such as a company organization structure. Here's an example:
;WITH DirectReportsCTE ( ManagerID, ManagerName, EmployeeID, EmployeeName, Title, Level ) AS ( --base/anchor case SELECT e.ManagerID, CAST (NULL AS VARCHAR (60)), e.EmployeeID, e.FullName, e.Title, 0 AS Level FROM dbo.CompanyEmployees AS e WHERE ManagerID IS NULL UNION ALL --recurse SELECT e.ManagerID, CAST (d.EmployeeName AS VARCHAR (60)), e.EmployeeID, e.FullName, e.Title, Level + 1 FROM dbo.SQLSkillsEmployees AS e INNER JOIN DirectReportsCTE AS d ON e.ManagerID = d.EmployeeID ) SELECT ManagerID, ManagerName, EmployeeID, EmployeeName, Title, Level FROM DirectReportsCTE;
While the SQL Server partitioning function can be great from a data manageability perspective, it’s sometimes advantageous to design your own partitioning strategy for tables. Designing a partition gives you more flexibility for data movement and allows you to index each table as needed. You can also take advantage of the statistics for individual tables (vs. a single statistic object for a partitioned table in SQL Server).
Consider the following table definition:
CREATE TABLE SalesQ12021 ( SalesOrderID int NOT NULL, SalesOrderDetailID int NOT NULL, OrderQty smallint NOT NULL, ProductID int NOT NULL, UnitPrice money NOT NULL, UnitPriceDiscount money NOT NULL, SaleDate datetime NOT NULL CHECK (SaleDate >= '1/1/2021' AND SaleDate < '4/1/2021'), CONSTRAINT pk_SalesQ12021 PRIMARY KEY CLUSTERED (SalesOrderID, SalesOrderDetailID) );
The important concept here is the constraint on the SaleDate column – this feature will ensure only records for a given quarter of 2021 can be inserted into the table. The real power of this design comes when there are multiple tables for certain date ranges, each with its own date constraints.
Consider the following view using the
UNION ALL operator to produce a dataset combining the data from these four tables together:
CREATE VIEW vwSales AS SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ12021 UNION ALL SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ22021 UNION ALL SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ32021 UNION ALL SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ42021; A query against this view with a filter on the SaleDate column would produce the following execution plan: SELECT COUNT (*) AS RowCount FROM vwSales WHERE SaleDate >= '10/11/2021' AND SaleDate < '10/12/2021';
Notice only the SalesQ42021 table is being referenced. Since each table has constraints in place, the query optimizer knows the query only needs to access a single table to execute this query.
UNION ALL T-SQL query operators are useful for concatenating result sets from two or more
SELECT queries. There’s a lot of flexibility with these statements, but you must ensure each
SELECT statement returns the same number and compatibility of the columns.
Like in my "perfect" example I opened the post with, try to use the
UNION ALL statement when possible, as there can be much overhead involved in sorting the data to return a distinct list for
UNION. Knowing your data and operator options can help you achieve the most efficient query execution plan.
Maintaining and improving query performance can also be crucial to the overall health of your SQL Server environment. With SolarWinds SQL Sentry, you can use the integrated Environment Health Overview (EHO) dashboard for an at-a-glance view designed to allow you to prioritize troubleshooting events causing the most pressing performance issues.