cancel
Showing results for 
Search instead for 
Did you mean: 

Little Database of Horrors - Please Don't Feed The Monsters!

Level 12



FeedMeSeymour.jpgDontFeedAnimals.pngFEED ME SEYMOUR! That's what I hear when anyone jumps to a conclusion that the database needs more CPU, memory, or faster disks.  Why? Because I'm a DBA who has seen this too many times. The database seems to be the bottleneck and there's no shortage of people suggesting more system resources. I usually caution them, "don't mistake being busy with being productive!" Are we sure the workload is cleverly organized? Are we applying smart leverage with proper indexing, partitioning and carefully designed access patterns or are we simply pouring data into a container, doing brute force heavy lifting and having concurrency issues?

Now if you're a SysAdmin, you might be thinking, this is DBA stuff, why do I care? The reason you should care is that I've seen too many times where resources were added and it only produced a bigger monster!

So for crying out loud, please don't feed the monsters.  THEY BITE!

To demonstrate my point, I'll tell you about an interesting discovery I explored with Tom LaRock AKA @SQLRockstar while creating demo data for the release of Database Performance Analyzer 9.2. I set out to do the opposite of my usual job. I set out to create problems instead of solving them.  It was fun!  :-)

LittleDBofHorrorsWaits.png

Latch_EX_Tip.JPG

My primary goal was to generate specific wait types for:

  1. Memory/CPU - Not a real wait type. We put this in the wait type field when you're working, rather than waiting because the only thing you're waiting on then is the CPU and Memory to complete whatever task the CPU was tasked with.
  2. ASYNC_NETWORK_IO - Ironically, this is seldom truly a network problem, but it could be and may be interesting to a SysAdmin.
  3. PAGEIOLATCH_XX - These are significant signs that you're waiting on storage.
  4. LCK_M_X - This is a locking wait type and locking can harm performance in ways that adding system resources can't help.

I knew that table scans cause all sorts of pressure, so I created 1 process that used explicit transactions to insert batches into a table in a loop while 4 processes ran SELECT queries in infinite loops. To maximize the pain, I ensured that they'd always force full table scans on the same table by using a LIKE comparison in the WHERE clause and comparing it to a string with wild cards. There's no index in the world that can help this! In each pass of their respective loops, they each wait a different amount of time between table scans. 1 second, 2 seconds, 3 seconds, and 4 seconds respectively. Three of the processes use the NOLOCK hint while one of them does not. This created a pattern of alternating conflicts for the database to resolve.


SignalWaitMystery.pngSo I got the wait types I targeted, but LATCH_EX just sort of happened. And I'm glad it did! Because I also noticed how many signal waits I’d generated and that the CPU was only at 50%. If *Signal Waits accounting for more than 20% our waits is cause for concern, and it is…then why does the server say the CPU is only around 50% utilized? I found very little online to explain this directly so I couldn't help myself. I dug in!


My first suspect was the LATCH_EX waits because I'd produced an abundance of them and they were the queries with the most wait time. But I wasn't sure why this would cause signal waits because having high signal waits is like having more customers calling in than staff to answer the phones. I really didn't have much running so I was puzzled. 

The theory I developed was that when SQL Server experiences significant LATCH_EX contention, it may require SQL Server to spawn additional threads to manage the overhead which may contribute toward signal waits. So I asked some colleagues with lots of SQL Server experience and connections with other experienced SQL Server pros. One of my colleagues had a contact deep within Microsoft that was able to say with confidence that my guess was wrong. Back to the guessing game…



With my first hypothesis dead on arrival, I turned back to Google to brush up on LATCH_EX. I found this Stack Exchange post, where the chosen correct answer stated that,

There are many reasons that can lead to exhaustion of worker threads :

  • Extensive long blocking chains causing SQL Server to run out of worker threads.
  • Extensive parallelism also leading to exhaustion of worker threads.
  • Extensive wait for any type of "lock" - spinlocks, latches. An orphaned spinlock is an example.


Well I didn't have any long blocking chains and I didn't see any CXPACKET waits. But I did see latches! So I developed hope that I wasn't crazy about this connection from latches to signal wait. I kept searching…

LatchClass.pngI found this sqlserverfaq.net linkIt provided the query I used to identify my latch wait class was ACCESS_METHODS_DATASET_PARENT.  It also broke down latches into 3 categories and identified that mine was, a non-buffer latch.  So I had a new keyword and a new search phrase, ACCESS_METHODS_DATASET_PAREN and "non-buffer latch".


SELECT  latch_class, wait_time_ms / 1000.0 AS [Wait In sec],

waiting_requests_count AS [Count of wait],

100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage

FROM sys.dm_os_latch_stats

WHERE latch_class NOT IN ('BUFFER')

AND wait_time_ms > 0

Then I found this MSDN postAbout half way in, the author writes this about ACCESS_METHODS_DATASET_PARENT: "Although CXPacket waits are perhaps not our main concern, knowing our main latch class is used to synchronize child dataset access to the parent dataset during parallel operations, we can see we are facing a parallelism issue".


SqlSkillsTweet.pngThen I found another blog post not only supporting the new theory, but also referencing a post by Paul Randal from SQLskills.com, one of the most reputable organizations regarding SQL Server Performance.  It states, "ACCESS_METHODS_DATASET_PARENT...This particular wait is created by parallelism...."


And for the icing on the cake, I found this tweet from SQLskills.com.  It may have been posted by Paul Randal himself.



So now I know that LATCH_EX shows when SQL Server parallelizes table scans.  So instead of one thread doing a table scan, I had several threads working together on each table scan.  So it started to make sense.  I had ruled out parallelization because I didn't see any CXPACKET waits, which many DBAs think of as THE parallelism wait.  And now THIS DBA (me) knows it's not the only parallelism wait!  #LearnSomethingNewEveryDay

So now I feel confident I can explain how an abundance of LATCH_EX waits can result in high CPU signal waits.  But I'm still left wondering why signal waits can be over 20% and the CPU is only showing 50% utilization.  I'd like to tell you that I have an answer, or even a theory, but for now, I have a couple of hypotheses.

  1. It may be similar to comparing bandwidth and latency.  It seems server CPU utilization is like bandwidth i.e. how much work can be done vs what is getting done, while signal waits is like latency i.e. how long does a piece of work wait before work begins.  Both contribute to throughput but in very different ways.  If this is true, then perhaps the CPU workload for a query with LATCH_EX is not so much work but rather time consuming and annoying.  Like answering kids in the back seat that continually ask, "are we there yet?"  Not hard.  Just annoying me and causing me to miss my exit.

  2. It may simply be that I had such little load on the server, that the little amount that was signal waits, accounted for a larger percentage of the work.  In other words, I may have had 8 threads experiencing signal wait at any time.  Not a lot, but 8 of 35 threads is over 20%.  So in other words, "these are not the droids I was looking for."

Maybe you have a hypothesis?  Or maybe you know that one or both of mine are wrong.  I welcome the discussion and I think other Thwack users would love to hear from you as well.


Related resources:

Article: Hardware or code? SQL Server Performance Examined — Most database performance issues result not from hardware constraint, but rather from poorly written queries and inefficiently designed indexes. In this article, database experts share their thoughts on the true cause of most database performance issues.

Whitepaper: Stop Throwing Hardware at SQL Server Performance — In this paper, Microsoft MVP Jason Strate and colleagues from Pragmatic Works discuss some ways to identify and improve performance problems without adding new CPUs, memory or storage.

Infographic: 8 Tips for Faster SQL Server Performance — Learn 8 things you can do to speed SQL Server performance without provisioning new hardware.


19 Comments
Jfrazier
Level 18

Interesting stuff.  I am going to bookmark it and re-read it later....after I have had some more coffee and bacon.

jkump
Level 15

These are the kinds of dig and get to the root cause details that I wake up each morning and absorb.  Particularly with several pieces of bacon.  Keep up the good work!

mr.e
Level 14

Wow, that's definitely food for thought... I too will need some time to re-read this and process it.  Thanks!

contactjt
Level 10

You could try MAXDOP and Processor Affinity.

In short most servers now have a ton of real CPU cores and then a bunch of hyperthreaded cores.

Lets talk about hyperthreading first. The jury is still out on if hyperthreading hurts or helps. In short I see it as a problem because with parallelism you potentially are sending two threads from the same query to the same physical core. Yikes! YMMV, some report hyperthreading helps their DB. This could depending on any number of factors. There are two ways to deal with this, 1) disable hyperthreading in the BIOS, or 2) assign fixed affinity on each processor cores. See this article on how to select the cores, you want ONLY 1 hyper core per processor, the other should be left off. https://support.microsoft.com/en-us/kb/322385

I however have not done either of those options related to hyperthreading, You do those are your own risk. Actually if you do any of these this is at your own risk. I think they would be useful.

Now separate from the processor affinity, there is another way to deal with parallelism. This is called MAXDOP, or MAXIMUM DEGREE OF PARALLELISM and it's related factor COST THRESHOLD FOR PARALLELISM.

I will refer you to Brent Ozar (http://www.brentozar.com/archive/2013/09/five-sql-server-settings-to-change/)

I know first hand that changing the MAXDOP can be reconfigured on the fly, and subsequently changed back as needed.

What does this do, well MAXDOP default of 0, is unlimited parallelism. Often people set this to 1, meaning one query <-> one core, no parallelism at each query. It's a little complicated so I can't say no parallelism at all but for our concerns this will take your expansive parallel query and drop it from unlimited cores to 1 core. You are likely to see the waits disappear, just like that. Poof!

If you did that test you would likely be shocked at how drastic the waits went down but is 1 the correct value? It works for most data sets, is it optimal? well there could be other settings. Best practices say you should set it to half of the total physical cores in the system, and no higher than 8. Even if you have a massive 32 physical core server, it's recommended to set it to 8.

Higher values tend to help larger reports. Think data warehouse.

Lower values tend to reduce waits for quick queries. Like interactive databases.

This is not a one size fits all setting. Some argue that Microsoft isn't really making this easy on us because SQL server was written first in a time when a 2 cpu box was practically unheard of. I can't really believe Microsoft to be so short sighted but my experiences clearly show a MAXDOP of 0 is just bad.

Once you have confirmed parallelism is the issue you are seeing and that you have some control here we can try to factor in the Cost Threshold for Parallelism. I haven't really tested this setting yet. It essentially allows you to adjust how difficult a query must be before it tries to split it up for parallel executions. This could help allow you to run a hybrid system with a data warehouse/reporting functions, and some interactive pages. Again there is no one size fits all. I don't have recommended values currently.

cahunt
Level 17

mmm... Bacon!

cahunt
Level 17

Serious read here, though I think the best thing for a good running system is a DB admin who knows what they are doing. These are massive DB's and even just finding some lines to adjust or customize setup can take several minutes. Administering from a novice standpoint is not the way to do this. At least setup the app monitoring for the DB - if you can't hire a good admin - so you have a place to start when trying to troubleshoot. We have huge DB's and our systems eat resources due to that. New DB structure in post 11.5 will help with this - but we are no there yet. Any other system I would fret when seeing all the added RAM used up (100% for what is allocated as usable by the SQL DB) - but having SW  Egineers impressed at our DB management when we have such a  large enterprise system lets me know we have who we need at the helm. If your having issues with this I feel bad for you mon! I have 99 problems but a DB Admin ain't one!

clubjuggle
Level 13

I'm with you on this. I'm not a DBA so I read this type of thing mainly in hopes of gathering little nuggets of useful information to come back to later. I'll be interested to see the theories that people post.

goodzhere
Level 14

Very very good read.  Bookmarking this!

mr.e
Level 14

ditto. 

brianflynn
Level 12

I agree with you Charles.  A good DBA is worth their weight in gold.  Sadly, many organizations don't realize that value until they've accumulated technical debt standing in their way of solving performance problems.  Those situations can be too little and too late.  I once had a DBA job where they'd never had a DBA before.  The first 3 months was a lot of fun cleaning up the low hanging fruit to yield big gains.  After that the solutions frequently required other people to be involved, and were often more complicated than anyone had time for, resulting in "solutions" that were more workaround than fix.  It is my opinion that organizations who feel they don't need a DBA are 80% likely to simply not realize that they do need a DBA to prevent a problem they'll discover in the future, when it is much harder to solve.  Glad you have good DBA coverage!  For those of you without sufficient DBA coverage, the right tools can make a difference.  I've seen novice DBAs or jack-of-all-trades perform above seasoned DBAs when they had good tools.  So if budget is a problem, I suggest organizations consider a less experienced DBA with good tools as a compromise to having seasoned DBAs.

muwale
Level 12

nice presentation of DB. I think the best thing for a good running system is a DB admin who knows what they are doing.

muwale
Level 12

nice

wwfsu2
Level 8

Thanks . It was very good

mosesmac
Level 9

nice read. I am sharing this with our DBA.

rschroeder
Level 21

Some of the most useful words I've seen about data base troubleshooting and indications, condensed into one spot.

Thank you!

brianflynn
Level 12

Thanks Richard.  I had fun writing this and am glad others appreciate this real world problem.  It's NOT always the database, but you are guilty until proven innocent.  :-)

rschroeder
Level 21

brianflynn‌, I know what you mean.  As a Network Analyst it seems most everything is sent to my group first, where we have to prove a negative (no, it's not the network, but only because I'm not seeing any errors, and I AM seeing packets flowing between the source & dest defined, over the defined ports) before we can triage it and hand it on to the right party.

Quickly reducing the MTTI (Mean Time To Innocence) means wasting cycles that could be better identified if we had SAM and DPA in our organization.  I'm hoping our team can find time to evaluate those products and determine if they think the tools can make their lives more simple and easy.

brianflynn
Level 12

That's great Richard.  Let me know if I can help!

network_defender
Level 14

Good write up.  Food for thought

About the Author
I'm an IT nomad whose accumulated a well rounded collection of skills. I've functioned as technical contributor and as a manager. I'm creative and like to innovate. Web Designer turned Developer turned DBA, turned System Engineer, turned Manager. I enjoy new challenges. I'm interested in the burgeoning fields of Big Data and Cloud Technology. I possess a boundless desire to learn then teach. I'm passionate about complex problem solving, enjoy forensic investigation & presenting discoveries. I've got experience analyzing distributed systems to identify performance bottlenecks and optimization opportunities.