As organizations adopt cloud technologies and modernize their applications, the data they generate and ingest often grows exponentially, leaving them with difficult choices about how to store and use it. Many customers have begun exploring alternatives to traditional relational database management systems (RDBMS), because a single-server RDBMS often cannot keep up with the volume of data being ingested.
Horizontally partitioning a relational database workload onto multiple servers, also called "sharding," is a common solution; however, it can be a challenging problem to solve. The initial engineering effort to shard a system can be steep: it requires not only strong engineers to make the effort successful but also a high degree of buy-in from management. Sadly, we've seen many of these projects refactored or abandoned outright because of the difficulty.
Fortunately, Microsoft has a product that automates much of this work and solves many of these technical challenges for you: Azure Database for PostgreSQL Hyperscale. In this guide, I'll discuss what PostgreSQL Hyperscale is, how it works, and how to get started configuring it, with insights and best practices for designing a distributed database system with Hyperscale.
In early 2019, Microsoft acquired a company named Citus Data, which had created an extension for the PostgreSQL database engine to horizontally distribute database tables. Since renamed Azure Database for PostgreSQL Hyperscale, this technology enables database developers to design tables so portions of the data are distributed among different worker nodes in the topology, allowing for more processing capability than would be possible on a single server.
Because Azure Database for PostgreSQL Hyperscale is a managed service, it lets you scale the number of horizontal partitions associated with your data up and down and easily mark tables as distributed. It's so useful because it automates what has traditionally been the hardest challenge of sharding a database. It not only handles data distribution but also seamlessly pulls data back from each of the horizontal partitions for reporting purposes.
Years ago, I worked with a customer who had designed a scale-out database. The design of the system was complicated but overall good. The data was segmented by customer, and the application was designed around the data scale-out: it knew exactly which database server to go to for each customer. The problem I was brought in to help with was the reporting. Running aggregate queries against all customers without a data warehouse was almost impossible to do efficiently. If the customer had used a technology like Azure Database for PostgreSQL Hyperscale, which handles horizontal partitioning and data movement automatically, it would have been a huge win for them.
For each table that is to be distributed, a single distribution column must be selected; it determines how values from the table deterministically map to the different nodes. Because this distribution column must be specified per table, migrating from an on-premises or cloud version of PostgreSQL, or from Oracle, to Azure Database for PostgreSQL Hyperscale is no trivial task.
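To see this deterministic mapping in action, Citus ships a helper function that reports which shard a given distribution-column value hashes to. A minimal sketch, assuming a table named orders that has already been distributed on a customer_id column (both names are hypothetical):

```sql
-- Returns the shard ID that the distribution-column value 42 maps to.
-- The same value always maps to the same shard, which is why the
-- distribution column must be chosen carefully, per table, up front.
SELECT get_shard_id_for_distribution_column('orders', 42);
```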
There are two types of servers (nodes) involved in the Hyperscale configuration (also known as a server group or cluster):
- Coordinator Node
The coordinator node is the entry point into the Hyperscale cluster. It accepts application connections and relays queries sent from applications to the different worker nodes, from which it takes the results and sends them back to the end user.
- Worker Nodes
Worker nodes are the nodes in the server group responsible for storing the distributed data. The query engine knows how the data is distributed among the different worker nodes and will pull the necessary information from each node to the coordinator node to satisfy the issued queries. You can scale resources on existing worker nodes as demand increases or add worker nodes as necessary. Through the Azure portal, you can scale your workload up to 20 nodes, possibly expanding to even more nodes if you work with Microsoft directly.
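Once connected to the coordinator, you can list the workers currently participating in the server group. A quick check, run against the coordinator's citus database:

```sql
-- Lists the active worker nodes (host name and port) in the server group.
SELECT * FROM master_get_active_worker_nodes();
```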
High availability (HA) for Hyperscale is provided through warm-standby nodes. When you enable HA, each node (the coordinator and every worker) receives a warm standby, and data replication to it is handled automatically. An important note: because this doubles the number of servers in the topology, the cost also doubles, so that's a trade-off against downtime.
There are three different table taxonomies in Azure Database for PostgreSQL Hyperscale, and it is vital to understand when to use which table type.
- Reference Tables
Reference tables are generally smaller lookup tables that are frequently joined to larger tables. These tables are copied in full to each of the worker nodes. While the data is duplicated, the advantage is that any join against a reference table is a local join requiring no data from any other worker node.
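Marking a table as a reference table is a one-line call. A sketch, assuming a small lookup table named countries has already been created (the table name is hypothetical):

```sql
-- Copies the countries table in full to every worker node, so joins
-- against it never need to move data between workers.
SELECT create_reference_table('countries');
```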
- Local Tables
Local tables exist solely on the coordinator node and not on any worker node. Examples are small administrative tables that are never joined to other tables.
- Distributed Tables
Distributed tables are horizontally partitioned among the different worker nodes, with each node holding a subset of the table's data. To distribute a table, choose a distribution column when defining the table; values from the distribution column map to different worker nodes. Take care when choosing the distribution column: a poorly chosen column can result in poor colocation of joins, meaning a lot of data movement must occur whenever tables are joined.
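To illustrate why the distribution column matters for joins, here's a sketch with two hypothetical tables, orders and order_items, both distributed on customer_id:

```sql
-- Distributing both tables on the same column colocates matching rows
-- on the same worker node.
SELECT create_distributed_table('orders', 'customer_id');
SELECT create_distributed_table('order_items', 'customer_id');

-- Because the join includes the distribution column, each worker can
-- join its own shards locally, with no cross-node data movement.
SELECT o.customer_id, count(*)
FROM orders o
JOIN order_items i ON o.customer_id = i.customer_id
GROUP BY o.customer_id;
```

Had order_items been distributed on a different column, this same join would have forced data to be repartitioned across the network at query time.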
Let’s build a PostgreSQL Hyperscale server, connect to it, and load some sample data. To begin, I’ll navigate to the Azure portal and “Create a Resource.” From there, I will choose “Azure Database for PostgreSQL.”
Azure Database for PostgreSQL has several options, and I will choose the “Hyperscale (Citus) server group” option. It’s called a server group because of the number of nodes working together to horizontally distribute the data.
Next, I'll configure the server group. There are two tiers to choose from: Basic and Standard. Basic should be used for testing purposes only, as the coordinator and worker nodes are not separated; you won't get any real benefit from the Citus extension with this tier, so don't use it in production. The real power of the Citus extension comes from having a separate coordinator node and several worker nodes, which requires the Standard tier. For our purposes today, though, I'll use the Basic tier, since I'm only showing a simple script to distribute a table and want to avoid the costs of the Standard tier (which can be quite high with many worker nodes).
When moving your workload to production, you should use the Standard tier so you can have multiple worker nodes. When you do this, you'll need to determine how much compute and storage capacity the coordinator node and each worker node should have, as well as how many worker nodes will be part of your server group. Twenty is the default maximum number of worker nodes, but you can contact Microsoft to increase it if you absolutely need more. There is only a single coordinator node, so you'll just assign the vCores and storage for it. You can also enable high availability, but remember that will double your costs.
Next, I’ll give my server group a name and make sure it’s assigned to a Resource Group. I’ll also choose the version of PostgreSQL I want to be available on the servers. One interesting thing about using PostgreSQL Hyperscale is there is a single database for your data to reside in, and this database is always named “citus.”
Finally, I'll add a firewall exception for my IP address, as public traffic isn't allowed by default. In a production environment, you'll want to be a lot more sophisticated than allowing any IP to connect to your server. However, since this is simply a demo, I'll allow it.
To connect to my newly created Hyperscale server group, I must first register the server. To do this in pgadmin, right-click the “Servers” server group, choose “Create” and then select “Server.” You’ll need to create a “Server Group” first if one doesn’t already exist.
To connect to this server group, I’ll need to grab the full name of the coordinator node from the Azure Portal. To do this, select the “Connection strings” blade. From this screen, each of the Connection Strings has the server’s name. I have my server group name outlined in red below.
On the “Create Server” General tab, I’ll give this server a friendly name of “SQLSkillsPGHS.”
On the Connections tab, I’ll enter the server's address I pulled from the Azure portal. I’ll change the “Maintenance database” to citus and then enter the password for the citus user I entered when creating the citus server group.
Here’s a look at the objects listed under the Citus database:
Now that I've created a PostgreSQL Hyperscale server, I'll create a distributed table. First, I'll create a table named distributedTableTest with two columns: a monotonically increasing integer column named IDColumn, and a column named DistributionColumn that will hold an even distribution of values between 1 and 10.
CREATE TABLE distributedTableTest
(
    IDColumn integer not null,
    DistributionColumn integer not null
);
I want this table to be distributed among worker nodes. To do this, I’ll need to call the function create_distributed_table and provide the table name and the name of the distribution column.
SELECT create_distributed_table('distributedTableTest', 'DistributionColumn');
Next, I’ll use the PostgreSQL function generate_series to generate ten thousand rows to insert into the distributedTableTest table.
INSERT INTO distributedTableTest (IDColumn, DistributionColumn)
SELECT generate_series, (generate_series % 10) + 1
FROM generate_series(1, 10000);
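With the rows inserted, a quick aggregate confirms the values are spread evenly: since the insert computed (generate_series % 10) + 1 over 10,000 rows, each of the ten values should appear exactly 1,000 times.

```sql
-- Each of the ten distribution values should hold exactly 1,000 rows.
SELECT DistributionColumn, count(*)
FROM distributedTableTest
GROUP BY DistributionColumn
ORDER BY DistributionColumn;
```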
I can query the citus_tables metadata view to verify that the table is now distributed.
SELECT * FROM citus_tables;
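To go one level deeper and see where the individual shards landed, newer Citus versions also expose a citus_shards view (this sketch assumes Citus 10 or later):

```sql
-- Shows each shard of the table and the worker node hosting it.
SELECT shardid, nodename, nodeport
FROM citus_shards
WHERE table_name = 'distributedtabletest'::regclass;
```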
Microsoft Azure Database for PostgreSQL Hyperscale is a Platform as a Service (PaaS) offering that can make it easier to design your database system to be distributed among worker nodes, so you can quickly provide scalability for your database workload. You saw above how easy it is to configure PostgreSQL Hyperscale and now know about the different types of tables available in the service.
For deeper insights into database performance monitoring, learn more about SolarWinds PostgreSQL solutions, built to help with query tuning, I/O tuning, and more.
Are there applications inside your organization that would benefit from horizontal partitioning? If so, what’s preventing you from designing these applications using multiple nodes today? I’d love to hear from you! Please leave me a comment below.