Little Database of Horrors - Please Don't Feed The Monsters!

brianflynn over 8 years ago 8 minute read time

FEED ME SEYMOUR! That's what I hear when anyone jumps to a conclusion that the database needs more CPU, memory, or faster disks. Why? Because I'm a DBA who has seen this too many times. The database seems to be the bottleneck and there's no shortage of people suggesting more system resources. I usually caution them, "don't mistake being busy with being productive!" Are we sure the workload is cleverly organized? Are we applying smart leverage with proper indexing, partitioning and carefully designed access patterns or are we simply pouring data into a container, doing brute force heavy lifting and having concurrency issues?

Now if you're a SysAdmin, you might be thinking, this is DBA stuff, why do I care? The reason you should care is that I've seen too many times where resources were added and it only produced a bigger monster!

So for crying out loud, please don't feed the monsters. THEY BITE!

To demonstrate my point, I'll tell you about an interesting discovery I explored with Tom LaRock AKA @SQLRockstar while creating demo data for the release of Database Performance Analyzer 9.2. I set out to do the opposite of my usual job. I set out to create problems instead of solving them. It was fun! :-)

My primary goal was to generate specific wait types for:

Memory/CPU - Not a real wait type. We put this in the wait type field when you're working, rather than waiting because the only thing you're waiting on then is the CPU and Memory to complete whatever task the CPU was tasked with.
ASYNC_NETWORK_IO - Ironically, this is seldom truly a network problem, but it could be and may be interesting to a SysAdmin.
PAGEIOLATCH_XX - These are significant signs that you're waiting on storage.
LCK_M_X - This is a locking wait type and locking can harm performance in ways that adding system resources can't help.

I knew that table scans cause all sorts of pressure, so I created 1 process that used explicit transactions to insert batches into a table in a loop while 4 processes ran SELECT queries in infinite loops. To maximize the pain, I ensured that they'd always force full table scans on the same table by using a LIKE comparison in the WHERE clause and comparing it to a string with wild cards. There's no index in the world that can help this! In each pass of their respective loops, they each wait a different amount of time between table scans. 1 second, 2 seconds, 3 seconds, and 4 seconds respectively. Three of the processes use the NOLOCK hint while one of them does not. This created a pattern of alternating conflicts for the database to resolve.

So I got the wait types I targeted, but LATCH_EX just sort of happened. And I'm glad it did! Because I also noticed how many signal waits I’d generated and that the CPU was only at 50%. If *Signal Waits accounting for more than 20% our waits is cause for concern, and it is…then why does the server say the CPU is only around 50% utilized? I found very little online to explain this directly so I couldn't help myself. I dug in!

My first suspect was the LATCH_EX waits because I'd produced an abundance of them and they were the queries with the most wait time. But I wasn't sure why this would cause signal waits because having high signal waits is like having more customers calling in than staff to answer the phones. I really didn't have much running so I was puzzled.

The theory I developed was that when SQL Server experiences significant LATCH_EX contention, it may require SQL Server to spawn additional threads to manage the overhead which may contribute toward signal waits. So I asked some colleagues with lots of SQL Server experience and connections with other experienced SQL Server pros. One of my colleagues had a contact deep within Microsoft that was able to say with confidence that my guess was wrong. Back to the guessing game…

With my first hypothesis dead on arrival, I turned back to Google to brush up on LATCH_EX. I found this Stack Exchange post, where the chosen correct answer stated that,

There are many reasons that can lead to exhaustion of worker threads :

Extensive long blocking chains causing SQL Server to run out of worker threads.
Extensive parallelism also leading to exhaustion of worker threads.
Extensive wait for any type of "lock" - spinlocks, latches. An orphaned spinlock is an example.

Well I didn't have any long blocking chains and I didn't see any CXPACKET waits. But I did see latches! So I developed hope that I wasn't crazy about this connection from latches to signal wait. I kept searching…

I found this sqlserverfaq.net link. It provided the query I used to identify my latch wait class was ACCESS_METHODS_DATASET_PARENT. It also broke down latches into 3 categories and identified that mine was, a non-buffer latch. So I had a new keyword and a new search phrase, ACCESS_METHODS_DATASET_PAREN and "non-buffer latch".

SELECT latch_class, wait_time_ms / 1000.0 AS [Wait In sec],

waiting_requests_count AS [Count of wait],

100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage

FROM sys.dm_os_latch_stats

WHERE latch_class NOT IN ('BUFFER')

AND wait_time_ms > 0

Then I found this MSDN post. About half way in, the author writes this about ACCESS_METHODS_DATASET_PARENT: "Although CXPacket waits are perhaps not our main concern, knowing our main latch class is used to synchronize child dataset access to the parent dataset during parallel operations, we can see we are facing a parallelism issue".

Then I found another blog post not only supporting the new theory, but also referencing a post by Paul Randal from SQLskills.com, one of the most reputable organizations regarding SQL Server Performance. It states, "ACCESS_METHODS_DATASET_PARENT...This particular wait is created by parallelism...."

And for the icing on the cake, I found this tweet from SQLskills.com. It may have been posted by Paul Randal himself.

So now I know that LATCH_EX shows when SQL Server parallelizes table scans. So instead of one thread doing a table scan, I had several threads working together on each table scan. So it started to make sense. I had ruled out parallelization because I didn't see any CXPACKET waits, which many DBAs think of as THE parallelism wait. And now THIS DBA (me) knows it's not the only parallelism wait! #LearnSomethingNewEveryDay

So now I feel confident I can explain how an abundance of LATCH_EX waits can result in high CPU signal waits. But I'm still left wondering why signal waits can be over 20% and the CPU is only showing 50% utilization. I'd like to tell you that I have an answer, or even a theory, but for now, I have a couple of hypotheses.

It may be similar to comparing bandwidth and latency. It seems server CPU utilization is like bandwidth i.e. how much work can be done vs what is getting done, while signal waits is like latency i.e. how long does a piece of work wait before work begins. Both contribute to throughput but in very different ways. If this is true, then perhaps the CPU workload for a query with LATCH_EX is not so much work but rather time consuming and annoying. Like answering kids in the back seat that continually ask, "are we there yet?" Not hard. Just annoying me and causing me to miss my exit.
It may simply be that I had such little load on the server, that the little amount that was signal waits, accounted for a larger percentage of the work. In other words, I may have had 8 threads experiencing signal wait at any time. Not a lot, but 8 of 35 threads is over 20%. So in other words, "these are not the droids I was looking for."

Maybe you have a hypothesis? Or maybe you know that one or both of mine are wrong. I welcome the discussion and I think other Thwack users would love to hear from you as well.

Related resources:

Article: Hardware or code? SQL Server Performance Examined — Most database performance issues result not from hardware constraint, but rather from poorly written queries and inefficiently designed indexes. In this article, database experts share their thoughts on the true cause of most database performance issues.

Whitepaper: Stop Throwing Hardware at SQL Server Performance — In this paper, Microsoft MVP Jason Strate and colleagues from Pragmatic Works discuss some ways to identify and improve performance problems without adding new CPUs, memory or storage.

Infographic: 8 Tips for Faster SQL Server Performance — Learn 8 things you can do to speed SQL Server performance without provisioning new hardware.

Top Comments

goodzhere over 8 years ago +3

Very very good read. Bookmarking this!
Jfrazier over 8 years ago +2

Interesting stuff. I am going to bookmark it and re-read it later....after I have had some more coffee and bacon.
clubjuggle over 8 years ago +2

I'm with you on this. I'm not a DBA so I read this type of thing mainly in hopes of gathering little nuggets of useful information to come back to later. I'll be interested to see the theories that people…

network_defender over 8 years ago

Good write up. Food for thought
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
brianflynn over 8 years ago

That's great Richard. Let me know if I can help!
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
rschroeder over 8 years ago

brianflynn‌, I know what you mean. As a Network Analyst it seems most everything is sent to my group first, where we have to prove a negative (no, it's not the network, but only because I'm not seeing any errors, and I AM seeing packets flowing between the source & dest defined, over the defined ports) before we can triage it and hand it on to the right party.
Quickly reducing the MTTI (Mean Time To Innocence) means wasting cycles that could be better identified if we had SAM and DPA in our organization. I'm hoping our team can find time to evaluate those products and determine if they think the tools can make their lives more simple and easy.
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
brianflynn over 8 years ago

Thanks Richard. I had fun writing this and am glad others appreciate this real world problem. It's NOT always the database, but you are guilty until proven innocent. :-)
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
rschroeder over 8 years ago

Some of the most useful words I've seen about data base troubleshooting and indications, condensed into one spot.
Thank you!
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel