Tales From Building a SQL Server DW in Azure

On June 19th,   and I presented a webinar called "Tales from Building a SQL Server in Azure." In this session, we shared experiences and lessons learned from a recent migration of our business intelligence and data warehousing environment to Azure. Key discussion topics in the webinar:

Azure Webinar

  • Deciding between SQL Server in a VM, Azure SQLDB, and Azure SQLDW
  • Naming conventions
  • Structuring Development, Test, and Production environments
  • Key decisions for storage, encryption, and backups
  • Automation with Azure Resource Manager templates and PowerShell
  • Techniques for monitoring and alerting

The recording can be found here: SentryOne Webinar Video Recording

The interest level in this topic was amazing. We received a whopping 125 questions before and during the webinar. Below are the most relevant questions and responses.

Azure SQL Database (PaaS)

Question Response
Did I see on the slides that you cannot access your Azure SQLDB through VPN? Correct. However, the new managed instance option for Azure SQLDB will have VNET (virtual network) capability which should open up VPN as well. Info: https://azure.microsoft.com/en-us/blog/new-options-to-modernize-your-application-with-azure-sql-database/
What sort of instance features do you mean for managed instance? Instance-level features include items such as SQL Agent, database mail, cross-database queries or transactions, change data capture, and CLR. Nearing 100% full compatibility with SQL Server in a managed Azure SQLDB instance will be helpful to customers who currently have compatibility issues. Info: https://azure.microsoft.com/en-us/blog/new-options-to-modernize-your-application-with-azure-sql-database/
You mentioned managed Azure SQL and other components possibly coming. Ths includes SQL Agent? Yes, for managed instances this is true. Info: https://azure.microsoft.com/en-us/blog/new-options-to-modernize-your-application-with-azure-sql-database/
How do you transfer and migrate the CLR assembly object from on-premises to SQL Azure? CLR is not currently supported for Azure SQLDB. However, CLR can be run inside of a virtual machine, and will be supported for managed instances of SQLDB. How the migration process will work for Azure SQLDB managed instances is currently unknown since managed instances were just announced publicly.
What can you say about Azure SQLDB Elastic Pools? Since elastic pools are not suitable for data warehousing workloads, this isn't a feature the Analytics Team has experience with. Elastic pools can share resources among multiple Azure SQLDBs which have varying and inconsistent workloads. Info: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-elastic-pool/
Are we able to take a native SQL backup and restore that to Azure? Yes, if you are running SQL Server within a virtual machine. You may want to look into exporting schema and data to a BACPAC then importing to Azure SQLDB: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-cloud-migrate. (The Azure SQLDB managed instance will likely support backup and restore operations when it is released.)

Azure SQL Data Warehouse (PaaS)

Question Response
Can the upscale or downscale of Azure SQLDW be scripted? ie. at 1am step up compute to handle load without user logging in and doing it directly? Absolutely. The ability to scale DWUs (data warehouse units) up and down to meet demands on a schedule is very beneficial. This can be done with T-SQL, PowerShell, or CLI (in addition to the portal interface). The technique for T-SQL can be found here: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-manage-compute-tsql.
Have you dug into the nitty gritty of tuning a star schema's distribution hashes? And if so, what strategies did you use to minimize data movement without the ability to use Distribution=Replicate on dimension tables? Fortunately, replicated tables appears to be coming as a feature (work has started according to the feedback site: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10638774-replicated-table-support). In the meantime, you may want to initially test distribution of the data on a column that is used in join operations (this would typically be a surrogate key in a star schema / denormalized data warehouse design that relates to the foreign key in fact tables). Avoiding data movement entirely is not realistic in an MPP system. Besides good table design, be sure that your statistics are up to date (there are no automatic statistics updates in Azure SQLDW) to minimize the impact of data movement.
In the absence of an Azure SQLDW project type in Visual Studio, have you had any thoughts about how you will manage dev/deploy lifecycle? Maintaining the DDL in a source-controlled database project is so important. Either you can wrangle with an SSDT database project (in which case many of the Azure SQLDW-related scripts will need to be set to *not* build to avoid build errors since some Azure SQLDW syntax doesn't align with SQL Server or Azure SQLDB). Two additional alternatives exist.

(1) You might look into is VSCode. If you are ok with working with folders and files instead of solutions and projects, VS Code has an MSSQL extension which supports Azure SQLDW: https://azure.microsoft.com/en-gb/updates/azure-sql-data-warehouse-support-for-mssql-extension-for-vs-code/ (the first major non-preview release went out early May 2017).

(2) You might also consider mssql-scripter which supports Azure SQL Data Warehouse as of June 2017: https://azure.microsoft.com/en-us/updates/azure-sql-data-warehouse-support-for-mssql-scripter/.

Best approach for deciding on your distribution key? An optimal distribution column has these characteristics:
  • Not updateable
  • Non-nullable (or a very small # of nulls)
  • Has a minimum of 60 distinct values (so each of the 60 distributions contains some data and can perform some work)
  • Has values which are evenly spread among the rows (minimal skew)
  • Is used for joins and/or group by operations
  • Is *not* commonly used as a predicate (i.e., in a where statement)

Additional info: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute

When using Azure SQLDW, are there integration advantages if we also want to use some of the NoSQL Microsoft products? Azure SQLDW integrates with Azure Data Lake and Azure Blob Storage, primarily for purposes of loading Azure SQLDW (federated queries do work as well, but the technology isn't yet optimized for federated query scenarios). This integration occurs via PolyBase, and we can expect these types of 'schema on read' integration points to continue to expand. Info: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-load-from-azure-data-lake-store
What things should be consider when creating a star schema in Azure DW? And what design patterns and considerations should we take into account to keep the cost of Azure SQLDW (DWU) low? Because Azure SQLDW relies heavily on columnstore indexes by default, a star schema (i.e., denormalized data model in facts and dimensions) is an important design pattern in order to achieve the intended benefits of columnstore. Carefully monitoring the scale level, programmatically reducing DWUs during non-peak hours, can help control cost. Azure SQLDW can even be paused during times in which there are no data loads or queries being performed, in which case the only cost is data storage. Also, if you provision your database via T-SQL instead of the portal you can specify the storage level (10 TB of premium storage is provisioned unless otherwise specified) which can save cost on premium storage: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-database-azure-sql-data-warehouse
How best tune performance of Azure SQLDW, and key things to be mindful of? Data movement and data skew are both key items which can adversely impact performance of data loads and/or user queries across the distributions. DW Sentry can assist not only with monitoring Azure SQLDW, but also with testing choices made during the development phase: https://www.sentryone.com/platform/azure-sql-dw-performance-monitoring
What size DW is the tipping point for switching from Azure SQLDB to Azure SQLDW? This is a difficult question as there is no fixed threshold. Generally, 1-4TB is considered the minimum in terms of data size requirements to justify an MPP system. There are other considerations as well, such as elastic scalability and PolyBase integration.

Azure Virtual Machines (IaaS)

Question Response
So the standards of installing SQL binaries on the D: drive separate from the OS binaries needs to be changed? Since D: is reserved as the temporary drive in an Azure virtual machine, you definitely want to use another drive for data files. We have opted to use G: for data, L: for logs, and T: for TempDB.
Can you take an on-premises VM and use in Azure? You can use an existing VHD as a starting point for an Azure virtual machine. Info: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/upload-generalized-managed
When you shut down a VM with the auto shutdown, do you run the risk of shutting down an active ETL job or a report or other operation a user is running? Yes, that is definitely a risk. The shutdown that qualifies for suspending billing for the VM is the one within Azure (not the internal Windows shutdown). The Azure shutdown does not check for actively running processes before it proceeds.
Will the VM Backup cause a "freeze" in the environment when it is being backed up? My VM backups are scheduled for after business hours, so I do not have not experienced any performance impacts.


Question Response
Can you get near real-time cost impact information for everything in a resource group so you can quickly see the impact of a sizing or scaling decision or do you have to wait for the bill to arrive? Check the Azure billing API. Info: https://docs.microsoft.com/en-us/azure/billing/billing-usage-rate-card-overview
If one chooses resource groups early on is it easy to move resources between groups? Many resources can be moved to another resource group, but not all. The list of what can and cannot be relocated can be found here: https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-move-resources#services-that-enable-move
Does a development subscription not give you any cost savings? For VM environments: We have Dev and Test shut down at night (or if no activity is actively occurring). In this situation, the only cost incurred is the data storage which saves on compute cost. With regard to licensing of SQL Server, we are using the Developer edition of SQL Server in our non-production VMs. In that case, we are paying for the Windows license and VM compute cost only. Conversely, in Production our VM cost does also include the SQL Server licensing.

For Azure SQL DB or Azure SQL DW: It's likely you could have those resources scaled lower than Production, thus minimizing cost.

Have you used Data Factory pipelines to move data around azure data platform? Yes, we do have an Azure Data Factory implementation in our Analytics environment. However, usage of ADF is out of scope for this webinar (i.e., it's unrelated to our SQL-related infrastructure). If you are planning to use ADF, be sure to plan time for learning it as it's a very different paradigm. The use cases for what ADF is expected to do well are very different than SSIS. For our SQL-related infrastructure discussed in this webinar, we are currently relying on SSIS running in a VM. If you have straightforward data movement needs related to the cloud, with minimal transformations, ADF can be a viable option.
If you put your data into Azure can you also have Azure host as SaaS or PaaS the tools for doing ETL into and out of the data and doing reporting with SSRS from the data? It appears you have to create and maintain those services in an IaaS virtual machine farm? For an Analysis Services tabular model, you could host that in Azure Analysis Services which is a PaaS service. However, for SSAS multidimensional, SSRS, or SSIS, we currently would need to run those from within a virtual machine (IaaS) in Azure since PaaS equivalents are not currently available.
Would you recommend optimizing BEFORE migrating to Azure from on-prem? To the extent that you know there are inefficiencies, absolutely. For instance, perhaps you have a data load process that does a full truncate and reload. Tuning its performance may have never been a priority before, but in the cloud we want to minimize data transfer volumes where possible. It is true that existing performance issues can become more prevalent in a cloud environment.

The Webinar

THWACK - Symbolize TM, R, and C