RDS Gotchas: How to Identify Possible Throttling on RDS by Leveraging CloudWatch Metrics Through DPM

While testing different query workloads against AWS RDS for MySQL, I ran into an interesting issue: my top query by total time began running slower than it had before, with no change in the query workload. In this article, I'll walk you through how query performance can be affected by the size of your gp2 volume.

In the screenshot above, I looked at the period when my top query slowed down and compared it to the previous hour, when latency was lower. The Change column showed an 86% increase in total time. The Count and Average Latency columns were the next interesting find: the count was down significantly, yet the average latency had tripled.

Next, I looked at CPU utilization for the entire two-hour period. CPU utilization decreased when latency increased. I had expected either no change or an increase in CPU utilization; the decrease was unexpected.

Since the affected query was an UPDATE, I looked at Write IOPS next. Again, I expected to see an increase; instead, there was a significant drop-off. Before the drop-off there was some variability in the number of IOPS, but afterward Write IOPS plateaued at 100.

Given the drop in Write IOPS, I wanted to see whether there was a bottleneck on disk. The Disk Queue Depth chart showed an obvious jump in I/Os waiting on disk. So what changed to cause an increase in Disk Queue Depth and query latency alongside a decrease in Write IOPS and CPU utilization?
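If you want to confirm these numbers outside of DPM, here's a minimal sketch of pulling the same CloudWatch series (CPU utilization, Write IOPS, and Disk Queue Depth) with boto3. The instance identifier, region, and time window are assumptions for illustration.

```python
# Pull the RDS CloudWatch metrics discussed above for a two-hour window.
# "my-rds-instance" and the region are hypothetical placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=2)

for metric in ("CPUUtilization", "WriteIOPS", "DiskQueueDepth"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-rds-instance"}],
        StartTime=start,
        EndTime=end,
        Period=300,          # 5-minute buckets
        Statistics=["Average"],
    )
    # Print each series oldest-first so the drop-off is easy to spot.
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], round(point["Average"], 2))
```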

Scanning the dashboards in DPM for a pattern that might explain this, I noticed a chart called CloudWatch Disk Burst Balance. The value decreased steadily over time, and once it reached zero, my issues began.
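BurstBalance is the CloudWatch metric behind that chart, so one way to avoid being surprised by it again is an alarm that fires before the balance hits zero. A hedged sketch, with the alarm name, threshold, and SNS topic as illustrative assumptions:

```python
# Create a CloudWatch alarm on the RDS BurstBalance metric (percent of I/O
# credits remaining). Names, threshold, and the SNS topic ARN are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="rds-gp2-burst-balance-low",
    AlarmDescription="gp2 I/O credits nearly exhausted; IOPS will drop to baseline",
    Namespace="AWS/RDS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-rds-instance"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20.0,                      # warn at 20% of credits remaining
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:rds-alerts"],  # hypothetical topic
)
```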


This is a logical root cause of the issues I'm seeing. So what exactly is Disk Burst Balance? The RDS instance I'm testing against uses a 20 GB gp2 volume instead of Provisioned IOPS. It turns out AWS determines the baseline IOPS on gp2 from the size of the volume (3 IOPS per GB), with a minimum of 100 IOPS. This explains why Write IOPS plateaued at 100. The reason performance was better before the issue is that AWS starts every gp2 volume off with 5.4 million I/O credits, regardless of volume size, which allow it to burst above its baseline. Once the credits (Disk Burst Balance) are exhausted, the volume only gets the baseline IOPS for its size until its credits are replenished.
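To make the arithmetic concrete, here's a small worked example based on AWS's published gp2 behavior: a baseline of 3 IOPS per GiB with a 100 IOPS floor, a 3,000 IOPS burst ceiling, and a 5.4 million credit bucket.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume: 3 IOPS per GiB, with a 100 IOPS floor."""
    return max(100, 3 * size_gib)

def gp2_rough_burst_minutes(credits: float = 5_400_000, burst_iops: int = 3000) -> float:
    """Rough time to drain a full credit bucket at the burst ceiling,
    ignoring the small amount of credit that accrues during the burst."""
    return credits / burst_iops / 60

print(gp2_baseline_iops(20))        # 100 -> matches the plateau in the Write IOPS chart
print(gp2_rough_burst_minutes())    # ~30 minutes of sustained 3,000 IOPS
```

For the 20 GB volume in this test, that works out to a 100 IOPS baseline and roughly half an hour of full-speed burst before the credits run dry.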


Understanding this limitation is crucial when monitoring AWS RDS instances that use gp2 instead of Provisioned IOPS. If you're load testing to compare gp2 to Provisioned IOPS, make sure the test reproduces your real workload; otherwise the credit exhaustion may not show up until production. If you do hit this issue in production, you can increase the volume size to get better I/O performance from RDS.
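If growing the volume is the route you take, the change itself is small. A sketch under the assumption of a hypothetical instance name and target size (keep in mind that RDS storage can be grown but not shrunk, and the resize takes time to complete):

```python
# Raise allocated storage so the gp2 baseline (3 IOPS per GiB) clears the
# workload's needs. Instance identifier and target size are hypothetical.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="my-rds-instance",
    AllocatedStorage=334,        # ~334 GiB * 3 IOPS/GiB ≈ 1,000 baseline IOPS
    ApplyImmediately=True,
)
```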

Anonymous