Monitoring Central

3 Posts authored by: john.maxwell

Background

 

Fire-fighting mode for DBAs can be stressful when they have co-workers and managers breathing down their necks due to application slow-downs and/or outages. Logic says something changed, but what? In a worst-case scenario, the database instance itself looks fine, nothing changed within the database and the SQL being executed was running fine before. Of course, the SysAdmin says nothing is wrong with the physical server or storage which makes it even more questionable. Hmm, could you be running in a virtual machine (VM)? Is your VM resource starved and competing with other VMs?

 

According to Gartner’s Market Guide for Server Virtualization[1], “Hypervisor-based server virtualization is now mature, with 80% to 90% of server workloads running in a virtual machine (VM) for most midsize to large enterprises.” Additionally, anecdotal evidence states 70% of all databases are virtualized. In fact, here at SolarWinds, 50% of our database instances run in a VM. For all the benefits of virtualization like cost savings and ease of migrating workloads, the abstraction of the virtual layer from the physical hardware can introduce some challenges.

 

And let’s not forget the elephant in the room, snapshots. Many DBAs I’ve talked to are at a loss as to why SysAdmins and IT ops perform snapshots of their database instance VMs, which in turn can cause performance issues, especially if a memory snapshot is invoked which renders the VM inactive while the memory is written to disk. Database backups are best left to DBAs who ensure referential integrity is maintained to recover a database.

 

Which Metrics Matter?

 

If you find yourself running your database instances in a VMware VM, what do you need to look for to see if the VM your database is running in has problems? There are many metrics available, so let’s review the usual suspects.

 

CPU Ready

 

  • This metric indicates the VM (and the database trying to run inside it) was ready to run but instead sat idle waiting behind other VMs contending to control the same shared resources such as physical CPUs or memory.

    For example, a vSphere host has six physical CPUs, and two VMs are configured to each require four virtual CPUs (vCPUs) before they can run. This situation means only one VM can run at a time. You can eliminate the VMs queueing behind each other by either moving a VM to another host or configuring both VMs to require three or fewer virtual CPUs.

 

    • The term “oversubscription” simply means you’ve assigned more virtual resources than what physical resources exist to run all VMs concurrently. It may seem a bit strange but reducing the number of vCPUs may dramatically increase its performance. Generally, oversubscription should not go above 5%.With the SolarWinds®Database Performance Analyzer (DPA) VM Option, an easy way to see how many physical CPUs your host server has is to view the Host tab on the VM CONFIG page.

 

VM CPU Usage

  • Actively used CPU as a percent of total available virtual CPU in the virtual machine.

Host CPU Usage

 

  • Actively used CPU as a percent of total available CPU on the machine. If this number is high you might see VMs with high CPU ready and/or co-stop.
    • Active CPU is approximately equal to the ratio of the used CPU to the available CPU where: Available CPU = # of physical CPUs x clock rate.
    • When your database instance is running in a VM, with the VM Option, DPA automatically expands the data in the CPU tab to include this information along with other VM specific metrics.

 

Co-Stop

 

  • The time a VM waits for a vCPU is due to scheduling (lack of resources). So basically, your VM can be waiting on physical CPU resources in use by other VMs. If you see high Host CPU Usage this is probably a sign there are too many VMs on this host and/or you need more physical CPU resources.

 

VM Memory Swap Rate

 

  • The “swap in” and “swap out” rates generally mean you have a shortage of physical memory on the host, so the memory is swapped out and in from disk.

 

VM Active Memory Usage

  • This is the memory in use as a percent of the memory configured for the VM.

 

Host Memory Usage

  • This is the memory usage on the host (consumed memory / total machine memory). If this is high (e.g., GT 90%) this could indicate host memory over-commit which could lead to high VM swap rates.

VM Memory Overhead

  • This is simply the amount of memory used to run the VM. Over-configuring memory (or excess vCPU for that matter) will unnecessarily increase overhead. That said, there’s memory needed by ESXi itself and the virtual machine (virtual machine frame buffer).

 

VM Memory Balloon

  • The balloon driver reclaims pages on the server considered less valuable. The crux of this VMware proprietary technique is to match the behavior of a guest OS. You should only see this when the host is running low or out of physical memory.
  • If you see the virtual machine your database instance is running in has a certain percent of memory claimed by the balloon driver, look for memory swapping which could affect your VM’s performance. However, if you don’t see any swapping issues you don’t and won’t necessarily have a performance problem.

 

VM Disk Commands

  • Number of disk commands executed is an indication of how busy the disks are. That said, unless you see large queues developing and commands start to be aborted there isn’t a problem.
  • If you see aborted disk commands, then your storage is severely overloaded and can lead to serious application response issues.

 

VM Disk Usage

  • Available if you aren’t using a NFS datastore, it will show the average disk I/O rates across all virtual disks on the VM.

 

VM Read / Write Rates

  • VM disk read rate is the average amount of data read from the disk each second during the collection interval. For a VM, this is the rate at which data is read from each virtual disk to the virtual machine.
  • VM disk write rate is the average amount of data written to disk each second during the collection interval—simply the rate data is written to each virtual disk on the VM.

 

Host Disk Device Read / Write Rates

  • The host disk read-and-write rate is the average read/write rate across all disks/LUNs on the host. The rate represents the read/write throughput at the host level across all disks/LUNs and VMs running on the host.
    • If the database instance has I/O performance issues, you may have another VM on the same host causing the delays. Compare this metric to the physical I/O rate from the database instance. If the Host rate is higher, then it’s likely another VM is the problem. Otherwise, the VM your instance is running in may be causing too much of a demand on the underlying physical storage.

Host Max Disk Latency

  • This is the highest latency value across all disks used by this host.

 

Host Disk Latency

  • Read latency is the average amount of time to process a read command to a disk to the host (across all VMs). High disk latency indicates storage may be slow or overloaded.
  • Write latency is similar to read and is the average amount of time to process a write command from the specific disk across all VMs.
    • Disk Write Latency = Kernel Write Latency + Device Write Latency
  • Expected disk latencies will depend on the nature of the storage like read/write mix, randomness and I/O size along with the capability of the storage subsystem.

 

In addition to these metrics being found in DPA, you can execute the “esxtop” command from your VMware ESXi host or look at various utilization metrics from the VMware ESXi console. SolarWinds Virtualization Manager also reports on all of these metrics and more in a friendlier format with both historical and real-time data.

 

 

 

Sample Nightmare Scenario Avoided

 

As I mentioned when I started off, a nightmare scenario could be when everything associated with the database instance seems fine—nothing changed. Since we’ve covered the essential VM metrics you should be monitoring, let’s walk through a hard-to-find problem for a database instance running in a VM using SolarWinds Database Performance Analyzer (DPA) with the VM Option. In the 2019.4 release of DPA, we expanded the VM option to go beyond the basic resource metrics to include additional HOST metrics and to make note of events, as seen in the DPA CPU tab in RESOURCES.

 

* Example of event logging in DPA 2019.4

 

 

Let’s walk through our sample “nightmare” scenario.

 

  • Problem ticket open for poor application performance response time
    • Users complained the morning of Monday, December 2 “around 8 a.m.” they experienced abnormally long wait times.

 

  • No outages were recorded from the IT Ops group

 

  • You go to DPA to look at the Database instance supporting the application
    • You notice a longer than normal wait occurrence on December 2, and the machine learning anomaly detection flags this time as a critical wait time delta from what is normally expected at this time of day.

 

 

  • You then look at the tab ADVISORS for additional data for this day.

    1. As it turns out, a specific query accounts for the top amount of execution time.


  • You select this query to find out more about it and what occurred at the time. From the QUERY DETAIL page, you see the longest wait time was for memory/CPU from which you click on the green bar for memory/CPU to explore further by going down to the hour.

 

  • Once you get down to the hourly view, you see a noticeable spike in wait time in the morning hours when the application response time issue was occurring.
  • As you scroll down the page to the end where VM metrics are shown, you see the new co-stop metric where there’s a corresponding spike. By hovering over the annotation dots, you see during this time the VM was being moved via vMotion from one host to another.


  • Just as with snapshots, vMotion events can have a negative impact on the performance of the VM the database instance is running in. Without visibility into the virtualized infrastructure, it can be time consuming to find the culprit of poor performance. 
    With DPA, you can easily line up all of resources for a specific time to pinpoint the problem as seen below.

 

Summary

 

With VMware’s 500,000 customers and tens of millions of VMs, virtualization is here to stay. Since many database on-premises to cloud migrations involve virtualization, e.g., Azure VM, many of the same challenges existing on-premises will exist in IaaS environments. DBA’s don’t have to be virtual admins, but they do need to be aware of the environment their database instances run in and the impact those environments have on database performance.

 

That said, I’ve discovered many DPA customers have no idea there’s a purpose-built option for VMware that can be added to the product. It’s easy to see if you have the option by looking for the VIRTUALIZATION tab on the home page.

 

*  This all-in-one view lets you line up all your resources in a single view to look for problems on a specific date and time.

 

Our goal at SolarWinds is to listen to our customers which is why we’ve enhanced the VM option for DPA. If you are a DPA customer, be sure to utilize our THWACK® feature request page to request and vote on feature enhancements. Lastly, if you are currently running your database instances in a VM, we'd appreciate you taking this 60 second survey (and reward you with THWACK points).

 

 

 

 


[1] Gartner Market Guide for Server Virtualization, Published 24 April 2019, ID G00350674

Background

 

This blog initially started out as an examination of how SolarWinds uses Database Performance Analyzer (DPA) within our own production environment. It now includes not only how our DBA uses DPA, but how other business units within SolarWinds use it and why. It isn’t surprising to find people in IT operations and application development using DPA, since our own customer studies have shown a high number of people outside of the DBA role use it, too.

 

Recent product-specific studies for DPA showed a high number of DevOps/IT Ops and AppDev roles using the product and an eye-opening, broad customer census exposed even more. In the 2019 THWACK Member Census, we asked over 2,200 IT professionals to select their primary job role and only 2.4% selected DBA. Interestingly, when we asked respondents if they managed or monitored databases, 42.7% said yes.

 

This brings to light the discussion of the “accidental DBA” and some interesting changes in IT organizations. First is the growth in number of DevOps people who handle database-related tasks. Second is the importance of databases as the platform for most mission-critical applications and why everyone has a keen interest in their availability and performance. And last, but not least, the number of DBAs is going down according to Computer Economics, who has seen the percentage of DBAs relative to total IT staff drop to 2.8% in 2017 from 3.3% in 2013. Our own Head Geek, Thomas LaRock, wrote an article pointing out the number of DBA jobs has stagnated for almost 20 years. On the flipside, Gartner pointed out that DBMS (Database Management Systems) revenue grew an astounding 18.4% to $46 billion in 2018.

 

Armed with this information, I decided I’d investigate the SolarWinds DBA team and see if any of these trends held true.

 

Let’s Start With the DBA

 

As I mentioned, I initially thought I’d interview the DBA team here at SolarWinds to see how we “drink our own champagne,” since I knew DPA was used by our internal IT team. As it turns out, the “DBA Team” is one person. I guess for a company that did $833 million in revenue in 2018 I expected an entire DBA organization, not just one hardworking DBA. But maybe this isn’t the exception?

 

I learned a lot from our DBA about how she can keep track of over 250 Microsoft SQL Server databases running on a mix of physical and virtual machines. My biggest takeaway from talking to her was that DBA’s don’t “monitor databases.” They want to be alerted when there are problems and they need a product to help them quickly find and resolve problems when they arise. They also want a product to help them optimize their databases proactively.

 

The first thing we discussed was “what’s important and who is it important to?” Here are the top things SolarWinds uses DPA for and the primary users:

 

  • - Overall database health: DBA and IT Ops
  • - Debugging after deployment: AppDev and DBA
  • - Ad-hoc trouble shooting: DBA and AppDev
  • - Capacity planning: DBA

After I learned about the overall database environment (250+ SQL Server databases), I wanted to understand specific, real-world use cases of DPA in action.

 

DBA Usage Scenarios

 

So how does the DBA at SolarWinds use DPA? First, she sets up alerts, so she can immediately be sent text notifications from DPA if something goes awry. DPA has had alert notification for a while, but the 2019.4 release made it even easier via a “drag and drop” interface, making alert customization simple. Second, DPA is the first place she goes to when she gets notified about something going wrong, whether it’s an alert, phone call, email, or a help desk ticket opened and assigned to her.

 

Scenario 1 of 2

 

In this first real-life scenario, our DBA was alerted to an “assertion check fail” pointing to possible corruption. The SQL Server instance itself created a hard-to-decipher stack dump and the only noticeable thing she could pick out of was the process ID.

With this in hand, she went into DPA to the specific time the event occurred in the SQL Server instance. Since DPA provides both real-time and historical data, she was able to drill down to find 1) the session ID executing this query, and 2) the SQL script running and the database. After speaking with the developer who ran the query, she determined it was a problem with SQL Server itself and asked the developer to refrain from running the query until they got the problem resolved by Microsoft.

 

*Screenshot the SolarWinds DBA used to find the culprit of the stack dump SQL Server generated.

 

Scenario 2 of 2

 

This second use case brings to light how important DPA is for establishing the overall health of a database and for capacity planning. Our DBA could not stress enough how important it was for her to know the baseline of a database instance and associated queries. From the baselines DPA develops, with the help of machine learning, she can know what a typical day looks like and the behavior of typical database activity. This allows her to spot both anomalies and trends.

 

Regarding capacity planning, she uses DPA to monitor the utilization and performance of applications and make note of trends she uses for future capacity requirements such as new or additional servers. Luckily, SolarWinds does a quarterly two-week freeze on new applications and changes, and this two-week period gives her a chance to go through DPA reports and proactively tune the environment. DPA’s anomaly detection powered by machine learning is a great way to graphically see the biggest opportunities for proactive optimization.

 

*This resource tab in DPA is a favorite of our DBA because it gives her a good overview of server resources being used.

 

Our DBA believes DPA will be even more useful as SolarWinds starts to migrate databases to Azure PaaS. As she stated, being on top of performance issues like poorly written SQL and poor performing tables doesn’t go away, and the cost of making mistakes, especially those consuming resources, can lead to spikes in usage charges.

 

Application Development and DPA

 

As I mentioned at the beginning, I learned a lot about how DPA is used at SolarWinds and the various people and departments using it. The application development (AppDev) team is one of the bigger teams in need of the data DPA provides. Why? Because they, along with our DBA, are constantly deploying changes and want to see the difference.

 

For example, is the SQL query running slower or faster than before? As previously mentioned, some people are “accidental DBAs,” so if the query they implemented ran fine on a QA instance but in production performs poorly, they need to know why. Case in point, this exact scenario happened recently and was due to a missing index DPA quickly pointed out. As our DBA stressed, for someone not very experienced with index recommendations, the tuning advisors in DPA can be a life-saver.

 

*One of the most popular DPA pages used for before and after is also the one used to look at overall waits and is great for seeing changes in before and after performance.

 

Finally, IT Operations

 

At SolarWinds, IT Operations (IT Ops) is where the buck stops for overall system availably, and just like our DBA, they make extensive use of alerts. Depending on the alert, they may send a priority 3 email when something has reached a certain threshold. But if SQL Server were down, they would send an email as well as page Opsgenie, which then goes to the primary person on call and posts a message on Microsoft Teams. The IT Ops group also has certain alerts integrated with SolarWinds® Service Desk to automatically open tickets.

 

But what about databases and their health…does IT Ops care? The answer is yes because they rely on the DPA integration with SolarWinds Server & Application Monitor (SAM) to find the cause of performance issues on servers or when someone complains about application performance. Since DPA and SAM integrate with the SolarWinds Orion® Platform, you can navigate seamlessly between the products.

 

For example, they used the SAM integration to track a CPU spike on a server to a SQL Server database instance in a critical state. In this case, they immediately reached out to the SolarWinds DBA because they could tell the issue with the server was related to the database. However, if the DBA is unavailable, they rely on the suggestions and recommendations in DPA to diagnose the problem and take action or provide further documentation for either our DBA or AppDev.

 

Just as DBA and AppDev look for signs of abnormality, IT Ops looks at historical trends to find issues that may correlate to database issues. The integration of SAM and DPA makes this simple.

 

*IT Ops uses this page in Server & Application Monitor to see trends and then drill down and isolate the root cause. SAM’s integration with DPA makes this simple.

 

Summary

 

As stated in the introduction, the role of the DBA is changing and many people without a DBA title are involved with the performance of database applications. With the movement of database instances to IaaS and PaaS implementations, the ability to optimize, find, and resolve performance issues doesn’t go away. In some ways it becomes more important due to the potential impact on OpEx (aka your monthly Azure bill).

PASS Summit 2019 is here and SolarWinds will be at the conference, booth #416, November 5 – 8 in Seattle, Washington.

 

We’ll be showcasing the latest release of SolarWinds® Database Performance Analyzer (DPA), including our just announced support for Azure® SQL Managed Database Instance and SQL Server® 2019. No matter if you’re currently using DPA for your cross-platform database performance monitoring, or if you’re a “casual DBA,” stop by our booth for a demo to see the great new features we’ve added to this release.

 

And if you’re currently using any other SolarWinds products capable of integrating with the Orion® Platform, such as Server and Application Manager (SAM) or Virtualization Manager (VMAN), ask us how DPA seamlessly integrates with other products to give you a complete end-to-end view of your database applications.

 

Lastly, we’ve got two SolarWinds-sponsored events at the conference you should put on your calendar. One is the first timer’s reception we’re sponsoring Tuesday, November 5, from 4:45 – 6 p.m. in ballroom 6E at the convention center. The second is our presentation “SQL Server Performance Analysis Powered by Machine Learning,” Wednesday, November 6, at 1:30 p.m. in room 618 at the convention center.

 

Stop by and say hi. In addition to product demos, we’ll be giving away some cool swag.

SolarWinds uses cookies on its websites to make your online experience easier and better. By using our website, you consent to our use of cookies. For more information on cookies, see our cookie policy.