Ever since ciulei joined our team in January we've become a little obsessed with the performance of our Orion database. With 12,000+ nodes, 85,000+ elements, and 20,000 application monitors made up of 75,000+ components, you can imagine that there is an awful lot of data flowing into our database on a daily basis. Add to that 450+ enabled alerts for both infrastructure and application monitoring, along with data sharing to our incident management platform and CMDB, and there is just a whole lot going on.
Earlier this year we upgraded our server to include a RAID 10 array of SSDs and then did some significant tuning (including updates detailed in The Ultimate CPU Alert for Large Environments), as well as building some custom indexes where we thought they would be needed. This past weekend we embarked on an upgrade from NPM 11.0.1 to 11.5.2 (including the Orion 2015.1 upgrade). Granted, we did run out of time during our 12-hour maintenance window and were not able to upgrade our SAM modules. And because we aren't upgrading our VMAN environment until this weekend, we had to temporarily break the integration with VMAN for the week. Even so, we are pleasantly surprised by the decrease in wait that we are seeing, as well as the change in the overall profile of the waits.
Thanks to DPA we have some great insights into our DB. There was a significant dip on Sept 19/20 as we stopped the services on our polling engines, but as we got back into things this week (remembering that we aren't sending data from Virtualization Manager this week) we saw 302 hours of total wait on our *worst* day so far. Contrast that with a more typical day on the pre-Orion 2015 schema, where we saw between 390 and 450 hours of wait per day.
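To put those numbers in perspective, here is a quick back-of-the-envelope calculation of the drop in daily wait. It is only a rough sketch: it uses the three figures quoted above (302, 390 and 450 hours), the function and variable names are made up for illustration, and it compares our *worst* post-upgrade day against the typical pre-upgrade range, so it probably understates the improvement. Keep in mind that VMAN data was not flowing during the post-upgrade week.

```python
def percent_reduction(before_hours, after_hours):
    """Return the percentage drop from before_hours to after_hours."""
    return (before_hours - after_hours) / before_hours * 100


# Figures quoted in the post (hours of total wait per day).
PRE_LOW, PRE_HIGH = 390, 450   # typical pre-upgrade range
POST_WORST = 302               # worst post-upgrade day so far

low = percent_reduction(PRE_LOW, POST_WORST)
high = percent_reduction(PRE_HIGH, POST_WORST)

print(f"Reduction vs. {PRE_LOW} h/day: {low:.1f}%")
print(f"Reduction vs. {PRE_HIGH} h/day: {high:.1f}%")
```

In other words, even our worst day this week was roughly 23% to 33% below a typical pre-upgrade day.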
Notice the change in the overall mix of lock waits as well. Pre-upgrade we saw large blocks of LCK_M_IX (Intent Exclusive) and LCK_M_S (Shared). CPU is a constant and not something that we have a whole lot of control over, as it is driven by application-based SQL queries. (In case you are wondering, our overall CPU utilization is < 25%, with an effective 1-minute average CPU queue length of 0.)
Post-change we effectively eliminated the LCK_M_S waits and cut the LCK_M_IX waits by 50-70%. To sqlrockstar and the rest of the team I say "Job well done!" We'll definitely keep an eye on things when we do our VMAN upgrade (and re-integration) this weekend and our SAM upgrade during our next maintenance window, but needless to say, I am impressed!
EDIT: We reintegrated VMAN this past weekend after applying a fix for an issue that was blowing up one of our Orion polling engine services when polling our largest vCenters and, I am happy to report, we have a negligible increase in wait time. In fact, comparing last Sunday to the previous Sunday, we actually saw *lower* total wait than the week before. I suspect this is caused by moving the data collection and mangling from IVIM (Orion DB-centric) to VMAN (MongoDB-centric) and only requiring that data be transferred from VMAN to Orion. Just my partially educated guess though.