FEED ME SEYMOUR! That's what I hear when anyone jumps to a conclusion that the database needs more CPU, memory, or faster disks. Why? Because I'm a DBA who has seen this too many times. The database seems to be the bottleneck and there's no shortage of people suggesting more system resources. I usually caution them, "don't mistake being busy with being productive!" Are we sure the workload is cleverly organized? Are we applying smart leverage with proper indexing, partitioning and carefully designed access patterns or are we simply pouring data into a container, doing brute force heavy lifting and having concurrency issues?
Now if you're a SysAdmin, you might be thinking, this is DBA stuff, why do I care? The reason you should care is that I've seen too many times where resources were added and it only produced a bigger monster!
So for crying out loud, please don't feed the monsters. THEY BITE!
To demonstrate my point, I'll tell you about an interesting discovery I explored with Tom LaRock AKA @SQLRockstar while creating demo data for the release of Database Performance Analyzer 9.2. I set out to do the opposite of my usual job. I set out to create problems instead of solving them. It was fun! :-)
My primary goal was to generate specific wait types for:
I knew that table scans cause all sorts of pressure, so I created 1 process that used explicit transactions to insert batches into a table in a loop while 4 processes ran SELECT queries in infinite loops. To maximize the pain, I ensured that they'd always force full table scans on the same table by using a LIKE comparison in the WHERE clause and comparing it to a string with wild cards. There's no index in the world that can help this! In each pass of their respective loops, they each wait a different amount of time between table scans. 1 second, 2 seconds, 3 seconds, and 4 seconds respectively. Three of the processes use the NOLOCK hint while one of them does not. This created a pattern of alternating conflicts for the database to resolve.
So I got the wait types I targeted, but LATCH_EX just sort of happened. And I'm glad it did! Because I also noticed how many signal waits I’d generated and that the CPU was only at 50%. If *Signal Waits accounting for more than 20% our waits is cause for concern, and it is…then why does the server say the CPU is only around 50% utilized? I found very little online to explain this directly so I couldn't help myself. I dug in!
My first suspect was the LATCH_EX waits because I'd produced an abundance of them and they were the queries with the most wait time. But I wasn't sure why this would cause signal waits because having high signal waits is like having more customers calling in than staff to answer the phones. I really didn't have much running so I was puzzled.
The theory I developed was that when SQL Server experiences significant LATCH_EX contention, it may require SQL Server to spawn additional threads to manage the overhead which may contribute toward signal waits. So I asked some colleagues with lots of SQL Server experience and connections with other experienced SQL Server pros. One of my colleagues had a contact deep within Microsoft that was able to say with confidence that my guess was wrong. Back to the guessing game…
With my first hypothesis dead on arrival, I turned back to Google to brush up on LATCH_EX. I found this Stack Exchange post, where the chosen correct answer stated that,
There are many reasons that can lead to exhaustion of worker threads :
Well I didn't have any long blocking chains and I didn't see any CXPACKET waits. But I did see latches! So I developed hope that I wasn't crazy about this connection from latches to signal wait. I kept searching…
I found this sqlserverfaq.net link. It provided the query I used to identify my latch wait class was ACCESS_METHODS_DATASET_PARENT. It also broke down latches into 3 categories and identified that mine was, a non-buffer latch. So I had a new keyword and a new search phrase, ACCESS_METHODS_DATASET_PAREN and "non-buffer latch".
SELECT latch_class, wait_time_ms / 1000.0 AS [Wait In sec],
waiting_requests_count AS [Count of wait],
100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage
WHERE latch_class NOT IN ('BUFFER')
AND wait_time_ms > 0
Then I found this MSDN post. About half way in, the author writes this about ACCESS_METHODS_DATASET_PARENT: "Although CXPacket waits are perhaps not our main concern, knowing our main latch class is used to synchronize child dataset access to the parent dataset during parallel operations, we can see we are facing a parallelism issue".
Then I found another blog post not only supporting the new theory, but also referencing a post by Paul Randal from SQLskills.com, one of the most reputable organizations regarding SQL Server Performance. It states, "ACCESS_METHODS_DATASET_PARENT...This particular wait is created by parallelism...."
So now I know that LATCH_EX shows when SQL Server parallelizes table scans. So instead of one thread doing a table scan, I had several threads working together on each table scan. So it started to make sense. I had ruled out parallelization because I didn't see any CXPACKET waits, which many DBAs think of as THE parallelism wait. And now THIS DBA (me) knows it's not the only parallelism wait! #LearnSomethingNewEveryDay
So now I feel confident I can explain how an abundance of LATCH_EX waits can result in high CPU signal waits. But I'm still left wondering why signal waits can be over 20% and the CPU is only showing 50% utilization. I'd like to tell you that I have an answer, or even a theory, but for now, I have a couple of hypotheses.
Maybe you have a hypothesis? Or maybe you know that one or both of mine are wrong. I welcome the discussion and I think other Thwack users would love to hear from you as well.
Article: Hardware or code? SQL Server Performance Examined — Most database performance issues result not from hardware constraint, but rather from poorly written queries and inefficiently designed indexes. In this article, database experts share their thoughts on the true cause of most database performance issues.
Whitepaper: Stop Throwing Hardware at SQL Server Performance — In this paper, Microsoft MVP Jason Strate and colleagues from Pragmatic Works discuss some ways to identify and improve performance problems without adding new CPUs, memory or storage.
Infographic: 8 Tips for Faster SQL Server Performance — Learn 8 things you can do to speed SQL Server performance without provisioning new hardware.