An Introduction to Database Sharding with SQL Server
One of my favorite consulting questions to ask when dealing with a scalability problem is, “If you could change the system design with the knowledge you have today, what would/wouldn’t you change and why?” Sometimes it’s best to ask this question on a one-on-one basis with different developers, DBAs, report writers, and architects to get honest answers that aren’t intimidated by the other people sitting in a meeting or to avoid potential debating and arguing over a solution. This kind of information is incredibly important to know and understand before starting down the path of sharding data with SQL Server—primarily because sharding isn’t a simple venture involving changing a configuration option or flipping a switch. In some cases, it can be a total re-architecture of how the data is being accessed and stored, so we might as well fix any technical debt along the way.
Click here to learn more about how SolarWinds database solutions are designed to provide deeper insights into the health of your database, so you can quickly identify performance problems - typically within minutes. If you want to learn more about monitoring Citus clusters, check out this blog.
Analysis of the Problem
As a consultant, I also love to ask, “What problem are we trying to solve?” Often a client will come to us with a solution they want implemented, and they either don’t know how, or they have tried and failed and merely want it to work the way they think it should. Success rarely comes from trying to make a technology do a task it was never intended to perform or solve a problem it wasn’t designed to solve. This is definitely true when it comes to designing a sharding architecture, as there are several different ways to do it depending on the specific problem being solved.
Building a scalable solution with SQL Server is similar to building a skyscraper. The end product can only be as good as the foundation on which it was built. There are lots of different foundations in architecture—what you would use for a shed is different than for a house, which doesn’t compare to what you need for a skyscraper. And, as you can safely put a shed on top of the foundation for a skyscraper, you can’t safely put a skyscraper on the foundation built for a shed. Trying to scale out a flawed design won’t be successful. Instead, it can be incredibly expensive, so it may be advisable to bring in an outside expert to review the database architecture and where you need it to go.
Albert Einstein once said, “Problems cannot be solved with the same mindset that created them.” Within a business where there’s relatively little employee turnover, this can become a handicap to change. We’re all limited by our experience and training. The architect for a bridge considers many different things than the architect of a house. An architect in Florida would design a gradually sloping roof with no gabled ends to sustain the high winds of a hurricane, while a Vermont architect would design a roof with a steep pitch to prevent snow from building up on it. Which one’s right? They both are for the problem they’re trying to solve.
Application Aware or Not?
In an ideal world, sharding would be understood not only at the data tier of an application but also by the application itself. However, it is possible to implement range-based sharding (essentially horizontal partitioning) in a manner somewhat transparent to the application. Typically, in SQL Server, this is through a partitioned view, but it doesn’t necessarily have to be. Technology is constantly changing, and there are newer features outside of the SQL Server stack to help with a sharding implementation.
For example, Citrix offers the NetScaler Application Delivery Controller (ADC), which is commonly used for load balancing web applications but can also read the Tabular Data Stream (TDS) for SQL Server traffic and handle content switching to segment the traffic to the relevant database server based on the information in a SQL query. This can be incredibly useful for multi-tenant designs using a different database for each tenant’s data since the DataStream Content Switch can be performed based on the database name being accessed.
Within SQL Server, a partitioned view can provide a common access point to smaller chunks of data that abstracts the actual underlying physical implementation away from the application. A partitioned view can be implemented in one of three different ways:
- Local partitioned view over multiple tables in the same database—e.g., divide the overall Sales dataset across yearly subset tables Sales2020, Sales2021, Sales2022, all with the same column_list and schema:
CREATE VIEW dbo.Sales AS SELECT <column list> FROM Sales2020 UNION ALL SELECT SELECT <column list> FROM Sales2021 UNION ALL SELECT SELECT <column list> FROM Sales2022;
- Each table must have check constraints enforced on the date in distinct boundaries to create the partitioning key for table elimination during query optimization.
- Referencing the partitioned view in a stored procedure requires the OPTION (RECOMPILE) hint to optimize for table elimination during execution.
- Cross-database partitioned view over multiple databases in the same server (essentially database partitioning)—same implementation as local partitioned view except using three-part naming for the database tables:
CREATE VIEW dbo.Sales AS SELECT <column list> FROM Sales2020.dbo.Sales UNION ALL SELECT SELECT <column list> FROM Sales2021.dbo.Sales UNION ALL SELECT SELECT <column list> FROM Sales2022.dbo.Sales;
- Distributed partitioned view over multiple servers/instances and databases—requires linked servers to be created and requires much stricter rules for querying data to prevent non-remotable query executions which copy tables of data from remote database nodes to the local node to perform join operations:
CREATE VIEW dbo.Sales AS SELECT <column list> FROM SalesDB2020.dbo.Sales UNION ALL SELECT <column list> FROM ServerB.SalesDB2021.dbo.Sales UNION ALL SELECT <column list> FROM ServerC.SalesDB2022.dbo.Sales;
Analysis of Data Access Patterns
This is probably the single most important aspect of partitioning data using any method but also the least considered in initial attempts. How the data is going to be accessed determines how fast, or slow, the required amount of data can be fetched when it’s broken down into different shards. Every partitioning strategy will have a partitioning key (think of it as the shard key) used to determine in which of the single shards the data resides. Therefore, it’s necessary to choose a partitioning key for use as a filtering criterion in most if not ALL of the queries accessing the data. Otherwise, you ultimately have to scan every shard to find the row(s) being accessed since it’s impossible to eliminate one or multiple shards when the partitioning key isn’t used.
Different types of data might need to be partitioned differently based on their use cases in the application. For example, a global application might shard the database by geographical region first to allow routing of requests to the respective continent/country as a part of meeting data storage compliance laws. Accounting data for a business might be sharded into a single database for each year to be queried independently based on the date range required for reporting purposes. A multi-tenant architecture might have a separate database for each customer, only storing their specific data, and then a centralized database with configuration and application-specific data.
Depending on the problem needing to be solved, a hash-based partitioning strategy might be used where a given key is hashed. SQL Server doesn’t support native hash-based partitioning, but it can be implemented using a persisted computed column (based on some kind of hash function/algorithm) that becomes the partitioning key. However, keep in mind unless you can determine the hash value before submitting a query, you won’t have good performance for reading data back out of the solution.
So Much to Consider
While sharding a single monolithic design into smaller partitions can allow for almost infinite horizontal scaling, there are many considerations before you begin down this road to get the benefits of sharding. As an athlete can be limited by poor form and technique, a database design can be limited by poor schema design, indexing strategies, and query writing.
Before taking on a project as big as sharding, it’s often best to first ensure you have a sound foundation upon which you plan to extend the architecture. When was the last time you made a list of technical debt, fast fixes, and band-aids pushed out in expedience to solve a problem but were never intended to be permanent fixes? Is your business aware of technical debts festering in the corner as a part of the application needing to be re-architected? Have you already gone down the path of sharding data in SQL Server, and was it successful?
Remember, the current limitation you see in front of you might not be the root of the issue, so it’s imperative to define and understand the problem needing to be solved accurately upfront; otherwise, you may build a new solution having the same performance due to not addressing an existing inherent limitation. Or worse, now queries slow down, and throughput creeps to a trickle causing additional problems as the data access and usage were never correctly understood before a partitioning strategy was chosen.