The Trouble Of VDI

Posted by mindthevirt May 29, 2015

In my last post, How To Right-Size Your VDI Environment, I gave some insight into how you can right-size your VDI environment and validate your VDI master template.

As VDI is gaining a bigger market share and SAN/NAS prices are dropping every year, there are also some problems on the horizon.

The benefits are often obvious and include, but are not limited to:


  • Central management for your entire desktop infrastructure, whether your end-users are located in the U.S. or in Europe.
  • Backup and recovery capabilities are endless, since you can use built-in snapshots to take quick backups of your desktops and restore them within minutes rather than hours or days.
  • Save money by reducing your carbon footprint and cutting your costs for expensive workstations.
  • Easy and inexpensive way to migrate OS’s. You can easily deploy new desktops for your users with new Operating Systems.


With all the benefits of VDI, there are also some problems associated and you should be aware of them:


  • Not every Antivirus software is optimized for VDI desktops and often Antivirus scans & updates can cause severe performance impacts.
  • Windows license management becomes a nightmare, especially if you dynamically provision desktops.
  • Troubleshooting becomes more difficult since several components are involved which means you’ll often coordinate with other teams e.g. network admin, storage admin, VMware admin.
  • Not every storage solution is optimized or advisable for VDI use cases. Just because you have a SAN sitting in your basement doesn’t mean it will be a great solution for your VDI environment. As a rule of thumb, plan for ~10 IOPS per active user.
  • Not all applications can be virtualized.
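
The ~10 IOPS per active user rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below is only that: the function name and the 20% headroom default are assumptions for illustration, not a vendor sizing formula.

```python
def required_iops(active_users, iops_per_user=10, headroom_pct=20):
    """Estimate the IOPS a storage system should sustain for VDI.

    iops_per_user defaults to the ~10 IOPS rule of thumb.
    headroom_pct is an assumed safety margin, not a figure from the post.
    """
    if active_users < 0:
        raise ValueError("active_users must be non-negative")
    base = active_users * iops_per_user
    # Integer ceiling division avoids floating-point rounding surprises.
    return (base * (100 + headroom_pct) + 99) // 100


if __name__ == "__main__":
    # 500 active users at ~10 IOPS each, plus the assumed 20% headroom
    print(required_iops(500))
```

Comparing the result with what your SAN actually sustains under a realistic workload is a first sanity check before committing to a design.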


Hopefully this list of benefits and disadvantages helps you make a better decision as you and your company consider investing in and expanding your VDI environment. If you have any questions regarding VDI benchmarking, things to consider, or best practices, please post them in the comments below and I will make sure to address them quickly.

As of this past week, it is officially the 'beginning' of Summer here in North America. That's pretty exciting: more people will be out of school or work and potentially going on vacation, which is all very good and, by the same token, can be very bad. What this means for you and me is:


Fewer people are watching the hen house, and those few may be pent up and wanting to be outside! Yet by the same token, you have bored hackers who are looking to compromise our networks!

Okay, maybe this is true, and maybe it isn't, but here is one thing that is true!


... When was the last time you did some Spring Cleaning of your rulesets? Have you gone through and validated that your Syslog, NTP, and who-knows-what-else policies are set correctly, communicating to the right places, and so on and so forth?! Do we have scripts which help automate this?!
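
One way to automate that kind of check is to compare each host's reported settings against a known-good baseline. The sketch below does this over plain dictionaries. The host names, setting keys, and values are made up for illustration, and collecting the real settings (for example with PowerCLI) is deliberately left out.

```python
def find_drift(baseline, hosts):
    """Return {host: {setting: (expected, actual)}} for every mismatch."""
    drift = {}
    for host, settings in hosts.items():
        diffs = {
            key: (expected, settings.get(key))
            for key, expected in baseline.items()
            if settings.get(key) != expected
        }
        if diffs:
            drift[host] = diffs
    return drift


# Hypothetical baseline and inventory, for illustration only
BASELINE = {"syslog": "udp://syslog.example.com:514", "ntp": "ntp.example.com"}
HOSTS = {
    "esx01": {"syslog": "udp://syslog.example.com:514", "ntp": "ntp.example.com"},
    "esx02": {"syslog": "udp://old-syslog.example.com:514", "ntp": "ntp.example.com"},
    "esx03": {"syslog": "udp://syslog.example.com:514"},  # NTP not set at all
}

if __name__ == "__main__":
    for host, diffs in find_drift(BASELINE, HOSTS).items():
        for key, (expected, actual) in diffs.items():
            print(f"{host}: {key} expected {expected!r}, found {actual!r}")
```

Run on a schedule, a report like this shows whether the environment has changed since the last cleanup.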



Share and share alike!


There is no better time than the present to go through and clean up your rules, but we shouldn't have to operate in a vacuum. We are a community!


A few years ago, I wrote a few blog posts on VMware PowerCLI one-liners because I thought they'd be useful:

PowerCLI One-Liners to make your VMware environment rock out!

Using PowerCLI to dump your permission structure in vCenter


The one-liners are in both the posts themselves and, even more so, in the comments (which I updated after the fact). Not all of them are about security-related matters; there's insight into setting your Syslog, NTP, and so much more. But I'm sure you guys have something you find very useful, something that you either wrote a script for once or regularly run to make sure your environment hasn't changed.


If you have any of these to share, I bet I know some fellow Thwack users who would be ecstatic to use them!

So let's get into the Spring Cleaning spirit through the power of automation, or as I like to say, "Let's not do things more than once, for the first time!" Okay, I'm not sure I like to say it in that particular context... But I digress significantly.


If you have any particular scripts, things to check, one-liners or powerful life changing experiences to share, we all look forward to it!

Happy Summer everyone! Let's make it an awesome one!

Capturing lightning in a bottle seems apropos for achieving disruptive innovation. In fact, disrupting IT is a small price to pay if the business can make it happen. “It” doesn’t happen often, but when it does, new industries are created and new markets emerge to eclipse the old ones. Realistically, the value to the business shows up as more efficient IT ops, lower IT OPEX, and shorter app development cycles.


In part one of this series, I covered aspects of old & new IT management. Interestingly, new and old IT management are intertwined in their generational relationship. Old methods shape policies that construct the workflows that new methods use to manage the data center environments. But what really matters to the IT pro? At the end of the day, the only thing that matters is the application.

So understanding the application stack in its entirety, with context around all of its connections and a single point of truth, is paramount to successful IT management. Ideally, this single point of truth is where old and new IT management converge.

The right IT management mix is a function of organizations aligning business operations with IT management to maximize efficiency and effectiveness of application delivery, while minimizing disruption to IT and downstream to the business. Unfortunately, politics and inertia have to be factored in and they tend to have critical mass in organization silos.

Is your organization converging to a single point of truth for your applications?

Unlike most application support professionals, or even system administrators, as database professionals, you have the ability to look under the hood of nearly every application that you support. I know in my fifteen plus years of being a DBA, I have seen it all. I’ve seen bad practices, best practices, and worked with vendors who didn’t care that what they were doing was wrong, and others with whom I worked closely to improve the performance of their systems.


One of my favorite stories was an environmental monitoring application—I was working for a pharmaceutical company, and this was the first new system I helped implement there. The system was up for a week and performance had slowed to a crawl. After running some traces, I confirmed that there was a query without a where clause that was scanning 50,000 rows several times a minute. Mind you, this was many years ago, when my server had 1 GB of RAM—so this was a very expensive operation. The vendor was open to working together, and I helped them design a where clause, an indexing strategy, and a parameter change to better support the use of the index. We were able to quickly implement a patch and get the system moving quickly.


Microsoft takes a lot of grief from DBAs for products like SharePoint and Dynamics, and for some interesting design decisions made within them. I don’t disagree; there are some really bad database designs. However, I’d like to give credit to whoever designed System Center Configuration Manager (SCCM). This database has a very logical data model (it uses integer keys, what a concept!), and I was able to build a reporting system against it.


So what horror stories do you have about vendor databases? Or positives?

The terms EMS, NMS and OSS are often misunderstood and used interchangeably. This can sometimes lead to confusion about which system performs which functions. I am therefore attempting to clarify these terms in a simple way, which may help you make informed decisions when procuring management systems.


But before getting to the terms for management systems, one should understand what FCAPS is. In fact, every management system should perform FCAPS. The five letters in FCAPS stand for the following:


F - Fault Management: reading and reporting faults in a network, for example link failure or node failure.


C - Configuration Management: loading/changing configuration on network elements and configuring services in the network.


A - Accounting Management: collecting usage statistics for the purpose of billing.


P - Performance Management: reading performance-related statistics, for example utilization, error rates, packet loss, and latency.


S - Security Management: controlling access to the network's assets. This includes authentication, encryption and password management.


Ideally, any management system should do all of the FCAPS functions described above. However, some commercial solutions cover only some of the FCAPS functions. In that case, an additional management system will be needed to do the rest. FCAPS applies to all types of management systems, including EMS, NMS and OSS.


Now that we have covered the general functions of management systems, let’s look at the terms EMS, NMS and OSS.


EMS stands for “Element Management System”. It is also called an element manager. An EMS can manage (i.e. perform FCAPS on) a single node/element or a group of similar nodes. For example, it can configure or read alarms on a particular node or group of nodes.


NMS (Network Management System), on the other hand, manages a complete network, i.e. it covers all the functions of an EMS and also performs FCAPS on the communication between different devices.


So the difference between EMS and NMS is that an NMS can understand the inter-relationship between individual devices, which an EMS cannot. Although an EMS can manage a group of devices of the same type, it treats each device in the group as a standalone device and does not recognize how individual devices interact with one another.


So to sum up:

  NMS = EMS + link/connectivity management of all devices + FCAPS on a network basis.


An NMS can manage different types of network elements/technologies from the same vendor.


An example should clarify this. An EMS can raise individual alarms on nodes, but an NMS can correlate the alarms on different nodes; it can thus identify root-cause alarms when a service is disrupted. It can do so because it has a network-wide view and intelligence.
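
To make the correlation idea concrete, here is a minimal sketch of how a network-wide view lets a system separate root-cause alarms from downstream symptoms. It assumes a simple tree topology where every node depends on one upstream node. The node names and the function are hypothetical, and real NMS correlation engines are considerably more sophisticated.

```python
def root_cause_alarms(upstream, alarmed):
    """Return alarmed nodes that have no alarmed node anywhere upstream.

    upstream maps each node to the node it depends on (None for the top).
    """
    alarmed = set(alarmed)
    roots = set()
    for node in alarmed:
        current = upstream.get(node)
        visited = {node}  # guards against accidental loops in the map
        suppressed = False
        while current is not None and current not in visited:
            if current in alarmed:
                suppressed = True  # an upstream alarm explains this one
                break
            visited.add(current)
            current = upstream.get(current)
        if not suppressed:
            roots.add(node)
    return roots


# Hypothetical topology: two aggregation nodes under one core node
TOPOLOGY = {
    "core": None,
    "agg1": "core",
    "agg2": "core",
    "access1": "agg1",
    "access2": "agg1",
    "access3": "agg2",
}

if __name__ == "__main__":
    alarms = ["agg1", "access1", "access2", "access3"]
    print(sorted(root_cause_alarms(TOPOLOGY, alarms)))
```

An EMS looking at each node in isolation would report all four alarms. With the topology in hand, the alarms on access1 and access2 are recognized as consequences of the agg1 failure.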


OSS (Operations Support System) goes a step further. It can manage not only a single vendor but also multiple vendors. An OSS will be needed in addition to vendor-specific NMSs. The OSS will interact with the individual NMSs and provide one dashboard for FCAPS management.


An OSS can thus give a single end-to-end view of the network, including all vendors. An example would be a service provisioning tool that can create an end-to-end service between Cisco and Juniper routers. This would need an OSS that can talk to the NMSs of both vendors for this purpose, or can even configure the network elements directly.


Having explained the terms EMS, NMS and OSS, I would like to end this post by asking:


  • Does your management system do all the FCAPS functions or some?


  • Do you prefer to have one network management system that does all FCAPS or different ones depending on different specialized functions?


Or maybe I should ask: are you using any management system at all?


Would love to hear your opinion!

Email has become the core of every organization. It is the primary source for communication that everybody turns to, regardless of the type of email server they are running. Who doesn’t have email??? When email is down, communication is down, resulting in lost productivity and potentially thousands to millions of lost dollars.


Microsoft Exchange Server is the most widely used on-premises enterprise email system today. When it works…it works great. But, when Exchange is not working correctly, it can be a nightmare to deal with.


For an Exchange administrator, dealing with Exchange issues can be challenging because there are so many variables when it comes to email. Having a user complain about email being “slow” can be the result of many different factors. It could be a network issue, a desktop client issue, or even a poorly performing Exchange Server. So many possibilities for what is wrong, and one unhappy user.


Not only do issues arise in everyday working situations, but if you are preparing for a migration or an Exchange upgrade there are always “gotchas.” These are things that get overlooked until something breaks and then everybody is scrambling around to fix it and do damage control later. These are probably the most annoying to me because often times somebody else has run into the problem first. So, if I had known about the gotchas I could’ve been prepared.


I recently had the opportunity to present a webinar, “The Top Troubleshooting Issues Exchange Admins Face and How to Tackle Them.” One of the great things about the IT community is that it’s there for us to share our knowledge, so others can learn from our experiences. That’s what I hoped to share in this webinar: some tips on solving annoying problems, as well as some tried-and-true lessons from managing Exchange myself. We discussed some of the challenges with Exchange migrations, mailbox management issues (client issues), and even Office 365. You can view a recording of the webinar here.


Since our time was limited, I could not answer all the questions that were asked during the webinar, so I wanted to take an opportunity to answer some of them here:


1. Is there a size limit on Outlook 2010/Exchange 2010? We had a laptop user with a 20GB mailbox with cache mode enabled who had an issue with his offline address book dying within a day of his cache being resynced - we saw it as an issue with him trying to view other users’ calendars.


This type of issue can have many causes and could be a whole blog post in itself, so I will keep it short. Large mailboxes create large OST files locally on the machine, and these can become corrupt. If that is the case, creating a new OST file may resolve your issue. For problems viewing others’ calendars, you can try removing the shared calendar and re-adding it. Also double check calendar permissions!


2. Do you know the issue when installing EX2010 SP3 it fails at 'removing exchange files'? SP3 is needed for upgrading to EX2013.


There is a known issue where Exchange fails to remove setup files with SP3 when the PowerShell script execution policy is defined in Group Policy. For more details on this issue, see Microsoft KB 2810617.

3. Any other resources we should review for Exchange Best Practices &/or Monitoring Tips (Other than Thwack?)


The Microsoft TechNet site has Exchange best practices, as well as monitoring tips that can be helpful. There are also various Microsoft MVP sites that can be helpful as well.







4. Any advice for having Exchange 2003 and Exchange 2010 coexist until migration is completed?


Coexistence periods can be a challenge to manage and are best kept short if possible. Microsoft provides checklists and documentation that can help with coexistence, which can be found on their TechNet site.


5. Is it possible to add more than 10 shared mailboxes to outlook 2010 client?


Yes, it is possible to have more than 10 shared mailboxes. Outlook 2010 and Outlook 2013 have a default limit of 10 mailboxes, with a supported maximum of 9999 accounts. To configure Outlook for more than the default limit, you will need to edit registry settings or apply a Group Policy.


6. Is there a way we can enable "On behalf of" for the shared mailbox, so the end user who receives the email knows who sent it?


To enable send on behalf for a shared mailbox, you can configure delegates for the shared mailbox. You can also apply the send on behalf of setting under the mail flow settings in the mailbox account properties.



I had a great time participating in the webinar with Leon and hope our viewers were able to take some tips back with them to help within their Exchange environment. Exchange is the core of most businesses, and managing an Exchange environment can be challenging; I know that from personal experience. However, with the right tools by your side, you can tame the beast and keep the nightmare events to a minimum.


Watching David Letterman sign off was a reminder that old shows are still great, but they are often replaced by new shows that resonate better with current times. It’s a vicious cycle driven by the nature of business. The same can be said of IT management with its constant struggle of old vs new management methods.


The old IT management principles rely on tried-and-true and trust-but-verify mantras. If it isn’t broke, don’t go breaking it. These processes are built on experience and born of IT feats of strength. Old IT management collects, alerts, and visualizes the data streams. The decision-making and actions taken rest in the hands of IT pros, and trust is earned over a long period of time.

Outdated is how new IT management characterizes the old management ways. Too slow, too restrictive, and too dumb for the onset of new technologies and processes. New IT management is all about the policies and analytics engine that remove the middle layer—IT pros. Decisions are automatically made with the analytics engine while remediation actions leverage automation and orchestration workflows. Ideally, these engines learn over time as well so they become self-sustaining and self-optimizing.

The driver for the new management techniques lies in the business needs. The business needs agility, availability, and scalability for its applications. Whether they are developing the application or consuming the application, the business wants these features to drive and deliver differentiated value. So applications are fundamental to the business initiatives and bottom line.

Where does your organization sit on the IT management curve – more old, less new or less old, more new or balanced? Stay tuned for Part 2 of 2015 IT Management Realities.

Look, almost all of us have been there: you’re slogging through Monday morning after staying up late watching The Walking Dead and Game of Thrones.


That new hot shot application owner is guaranteeing that his new application solution is the best thing since sliced bread, and of course it’s more bulletproof than anything else you’ve ever heard of.


As monitoring geeks we know deep down that we can help make that application everything someone else wants it to be. We’re going to monitor it, not just for uptime and utilization, but for application performance and reliability. We have the capability and responsibility to help the business deliver on those promises.


So, how can one even hope to perform this dark magic I’m suggesting?

Standing up for standards

Elementary! As monitoring geeks, we have a bevy of tools in the box; but just as important as those tools, we have standards. Standards that we adhere to, advocate, and answer to.

Setting Monitoring Implementation Standards for hardware and application platforms yields a standardized process tailored to each individual’s environment. This can afford a consistent and streamlined monitoring experience; even if the platforms and applications are diverse - the process can remain the same.

Eliminating the impossible

For starters, we’re going to eliminate some of the guesswork by closely examining our scheduled discovery results, keeping an eye out for any wayward hardware platforms that require further inquiry, and ensuring they’re being attributed with the appropriate custom properties. Then we start asking some critical questions to narrow our focus, which may include:


  • What does the app do and who utilizes it?
  • What is it running on and where?
  • What OS does it require?
  • What database does it use?
  • What languages are running the application?
  • What processes and services are needed to make it function?
  • Does it have a Web portal and ports that need to be monitored?
  • Who needs to know when there is a problem?
  • What alerts are needed? Up and down? Do you want to know when specific components go into warning or critical states?


Consistently asking these standard questions at the onset of every monitoring activity will help build your customized standard, allowing you to target the unknowns quicker and start amassing data points.

As your monitoring system continues to amass those data points, riddle me this, hero: What good is the data if we never look at it?


The devil is in the data


Scheduled data review is incredibly important for trend detection and data integrity. Start digging into that mountain of data you’ve been collecting with canned or custom reports, then schedule them to be sent straight to your inbox. Review them weekly, monthly, and quarterly. You might be surprised at what you find (or what you don’t!).


After consistently reviewing the information, you will be able to start sorting and collating that data into quantifiable metrics to show, for example, the ridiculous availability and uptime of the hot shot’s new application.
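
As a small illustration of turning raw polling data into a quantifiable metric, the sketch below computes an availability percentage from up/down poll results. The function name and the one-minute polling interval in the example are assumptions. Most monitoring platforms calculate this for you, so treat it as a picture of the arithmetic rather than a replacement for built-in reports.

```python
def availability_pct(samples):
    """Percentage of polls in which the application was up.

    samples is an iterable of booleans, one per polling interval.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to summarize")
    return 100.0 * sum(samples) / len(samples)


if __name__ == "__main__":
    # One week of hypothetical one-minute polls with five failed polls
    week = [True] * 10075 + [False] * 5
    print(f"{availability_pct(week):.2f}%")
```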


This charted data is now powerful business intelligence for decision makers when budgets get tighter, or just a good measurement for regulatory reporting.


By standardizing the appropriate level of hardware and application monitoring, scheduling automated reports, and reviewing the data you ensure the business’ applications and services are delivered reliably time and time again.


What monitoring standards do you utilize to consistently deliver applications and services to your constituents?

Career management is one of my favorite topics to write and talk about, because I can directly help people. Something I notice as a consultant going into many organizations is that many IT professionals aren’t thinking proactively about their careers, especially those who work in support roles (supporting an underlying business, not directly contributing to revenue as in a consulting firm or software development organization). One key thing to think about is how your job role fits into your organization. This is a cold, hard, ugly fact that took me a while to figure out.


Let’s use myself as an example. I was a DBA at a $5B/yr medical device company that didn’t have tremendous dependencies on data or databases. The company needed someone in my slot, but frankly it did not matter how good they were at their job beyond a point. Any competent admin would have sufficed. I knew there was a pretty low ceiling on how far my salary and personal success could go at that company. So I moved to a very large cable company. They weren’t a technology company per se, but they were a large enough organization that high-level technologist roles were available, and I got onto a cross-platform architectural team that was treated really well.


I see a lot of tweets from folks who often seem frustrated in their regular jobs. The unemployment rate in database roles is exceedingly low, especially for folks like you who are actively reading and staying on top of technology. Don’t be scared to explore the job market; you might be pleasantly surprised.

Guys, I'm really excited to be part of the Thwack community.


My first post "THERE MUST BE A BETTER WAY TO MANAGE VIRTUALIZED SYSTEMS" reached more than 2550 people and my second post reached over 800 people plus a bunch of people who actively participated. This is a great sign and shows that you all enjoy being part of this community.


In my previous posts, we covered how you are managing your virtualized systems and which features you are using most.


In today's post, I would like to discuss one particular part of your virtual infrastructure, your Virtual Desktop Infrastructure (VDI). Common questions I come across are how to right-size my VDI infrastructure, how many IOPS my users will generate, and whether I should use VMware ThinApp or Citrix XenApp. I found the VDI Calculator by Andre Leibovici very helpful.




Additionally, I found LoginVSI to be a great tool for VDI storage benchmarking and for finding out how many VMs you can actually host on your system. I doubt that many of you are using it since it isn’t cheap, but if you are using it or have used it in the past, you know what I’m talking about. This tool fully simulates real VDI workloads and not just some artificial vdbench/sqlio/fio load. Also, VMware View Planner is supposed to be a great tool for benchmarking and right-sizing your environment, but I haven’t touched it just yet. Have you?

The last tool in my VDI repository is a tool created by the VMware Technical Marketing Group: the VMware OS Optimization Tool. I won't go into too much detail here; my blog post about the VMware OS Optimization Tool covers it. It is a great tool, which can be used to create your golden VDI image.


If you know some useful VDI tools or have used the tools I’ve mentioned above, please comment and share your experience with us. Let’s make this post a great resource for all VDI admins out there.

Fault Management (FM) and Performance management (PM) are two important elements of OAM in layer 2 and layer 3 networks.


FM covers fault management related to the connectivity/communication of end stations, while PM covers monitoring the performance of a link using statistics like packet loss, latency and delay variation (also called jitter).


Here we need to differentiate between layer 2 and layer 3 networks.


For layer 2 networks, FM is usually done using CCMs (continuity check messages, defined in IEEE 802.1ag and ITU-T Y.1731), while PM is done using a standard protocol like Y.1731 that can monitor all the parameters mentioned above.


For layer 3 networks, ping and traceroute are the primary tools for FM and by far the most widely used tools for troubleshooting, while IP SLA is one of the PM tools for Cisco devices. IP SLA can monitor stats including loss, latency and delay variation at the IP layer (it can also do so at layer 2), in addition to helpful stats for VoIP such as MOS. (Please note that Cisco uses the term IP SLA for both layer 3 and layer 2 links, even though the stats at layer 2 are on the Ethernet layer.)
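
To show what these PM statistics mean in practice, here is a simplified sketch that derives loss, average latency, and jitter from a series of probe results. It is not how IP SLA or Y.1731 compute their figures: jitter is approximated here as the mean absolute difference between consecutive replies, and the function name and sample values are made up for illustration.

```python
def probe_stats(rtts_ms):
    """Summarize probe results.

    rtts_ms holds one round-trip time in milliseconds per probe,
    or None for a probe that received no reply.
    """
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("no probes")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent
    avg_latency = sum(replies) / len(replies) if replies else None
    # Simplified jitter: mean absolute change between consecutive replies
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter = sum(deltas) / len(deltas) if deltas else None
    return {"loss_pct": loss_pct, "avg_latency_ms": avg_latency, "jitter_ms": jitter}


if __name__ == "__main__":
    # Five hypothetical probes, one of which was lost
    print(probe_stats([20, 24, None, 20, 24]))
```

Whether numbers like these pass or fail depends on the thresholds you choose for the service being carried.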


Coming from a carrier Ethernet background in my last job, when I look back, I can say that the tools, especially the PM tools at layer 2, were not used very often. That may be because many people were not aware of them, or because the pass/fail thresholds for performance measurements were not very well defined. Recently, the Metro Ethernet Forum (MEF) has done a great job of standardizing the thresholds and limits for jitter, delay and packet loss. As a result, the PM tools have started gaining acceptance industry-wide and are being rolled out in layer 2 service provider networks more actively.


However, I am quite curious about how often OAM tools are used in IP networks.


Fault management tools like Ping/traceroute are the bread and butter of an IP engineer when it comes to troubleshooting networks but I am especially interested to know more about the IP SLA and its use in the networks.


So my questions to you would be:


  • How often do you use IP SLA (or any similar tool) in your network? Do you use it for specific applications like VoIP?


  • Do you use it for both layer 2 and layer 3 networks, in enterprise as well as service provider environments?


  • Are the thresholds of the PMs (delay, jitter and packet loss) well defined, by Cisco or any standards body?


Would love to hear your opinion here!


Trust... But Verify

Posted by cxi May 18, 2015

Let me start by saying, wow and thank you to everyone who maintains such a high activity in this community. While I may occasionally share some jibber jabber with all of you, you are all the real champions of this community and I cannot thank you enough for your contributions, feedback and more!


This leads me to my segment this week... One I welcome your contributions as always...


Trust; But Verify!




This line of thought isn't limited to Authentication, but it certainly shines as a major element of a trust model.

How many times are we put into a position of, "Oh yea, it's all good, no one has access to our systems without two factor authentication!" "What about service accounts?" "...crickets"


I've been there. My account to log in to look at files, personal email, etc. has such a high level of restraint and restriction that it requires everything under the sun: username, password, secret PIN, blood sample, DNA matrix... Yet the Admins themselves, either directly for elevated accounts or indirectly through Service Accounts or other credentials, are 'secured' through a simple password. "Oh, we can't change the password on the account because it takes too long, so it goes unchanged for 60, 90, 180, never?"


Now, not every organization operates this way. I remember having tokens back in the 90s for authentication and connectivity for Unix systems, but that is truly few and far between.


I won't even go into the model whereby people 'verify' and 'validate' the individual who is hired to protect and operate in the network, as that's VERY much outside the scope of this little blog, but it does raise the question... How far do we go?


What do you feel is an appropriate authentication strategy? One form (password), two form (Password+something else) or even a more complex intermixing of multiple methods?

And forget what we 'think' vs what you actually see implemented.


What do you prefer? Love, Hate, Other!


FEED ME SEYMOUR! That's what I hear when anyone jumps to a conclusion that the database needs more CPU, memory, or faster disks.  Why? Because I'm a DBA who has seen this too many times. The database seems to be the bottleneck and there's no shortage of people suggesting more system resources. I usually caution them, "don't mistake being busy with being productive!" Are we sure the workload is cleverly organized? Are we applying smart leverage with proper indexing, partitioning and carefully designed access patterns or are we simply pouring data into a container, doing brute force heavy lifting and having concurrency issues?


Now if you're a SysAdmin, you might be thinking, this is DBA stuff, why do I care? The reason you should care is that I've seen too many times where resources were added and it only produced a bigger monster!


So for crying out loud, please don't feed the monsters.  THEY BITE!


To demonstrate my point, I'll tell you about an interesting discovery I explored with Tom LaRock AKA @SQLRockstar while creating demo data for the release of Database Performance Analyzer 9.2. I set out to do the opposite of my usual job. I set out to create problems instead of solving them.  It was fun!  :-)



My primary goal was to generate specific wait types for:

  1. Memory/CPU - Not a real wait type. We put this in the wait type field when you're working, rather than waiting because the only thing you're waiting on then is the CPU and Memory to complete whatever task the CPU was tasked with.
  2. ASYNC_NETWORK_IO - Ironically, this is seldom truly a network problem, but it could be and may be interesting to a SysAdmin.
  3. PAGEIOLATCH_XX - These are significant signs that you're waiting on storage.
  4. LCK_M_X - This is a locking wait type and locking can harm performance in ways that adding system resources can't help.


I knew that table scans cause all sorts of pressure, so I created 1 process that used explicit transactions to insert batches into a table in a loop while 4 processes ran SELECT queries in infinite loops. To maximize the pain, I ensured that they'd always force full table scans on the same table by using a LIKE comparison in the WHERE clause and comparing it to a string with wild cards. There's no index in the world that can help this! In each pass of their respective loops, they each wait a different amount of time between table scans. 1 second, 2 seconds, 3 seconds, and 4 seconds respectively. Three of the processes use the NOLOCK hint while one of them does not. This created a pattern of alternating conflicts for the database to resolve.
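
The reason a wildcard LIKE defeats indexing can be seen in miniature with any database that exposes its query plan. The sketch below uses SQLite from Python purely as a stand-in: it is not SQL Server, the table and index names are invented, and the plan text differs between engines and versions. The point is only that a pattern with a leading wildcard leaves the engine nothing to seek on, while an equality match on the indexed column does.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, note TEXT)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
conn.executemany(
    "INSERT INTO orders (customer, note) VALUES (?, ?)",
    [(f"customer{i}", "x") for i in range(1000)],
)

# Leading wildcard: the index on customer cannot narrow the search.
scan_plan = query_plan(conn, "SELECT * FROM orders WHERE customer LIKE ?", ("%mer42%",))
# Equality on the indexed column: the index can be used to seek.
seek_plan = query_plan(conn, "SELECT * FROM orders WHERE customer = ?", ("customer42",))

if __name__ == "__main__":
    print(scan_plan)
    print(seek_plan)
```

Run several of the scanning queries in loops alongside a writer and you have a small-scale version of the contention described above.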

So I got the wait types I targeted, but LATCH_EX just sort of happened. And I'm glad it did! Because I also noticed how many signal waits I’d generated and that the CPU was only at 50%. If signal waits accounting for more than 20% of our waits is cause for concern, and it is, then why does the server say the CPU is only around 50% utilized? I found very little online to explain this directly, so I couldn't help myself. I dug in!

My first suspect was the LATCH_EX waits because I'd produced an abundance of them and they were the queries with the most wait time. But I wasn't sure why this would cause signal waits because having high signal waits is like having more customers calling in than staff to answer the phones. I really didn't have much running so I was puzzled. 


The theory I developed was that when SQL Server experiences significant LATCH_EX contention, it may need to spawn additional threads to manage the overhead, which may contribute toward signal waits. So I asked some colleagues with lots of SQL Server experience and connections with other experienced SQL Server pros. One of my colleagues had a contact deep within Microsoft who was able to say with confidence that my guess was wrong. Back to the guessing game…

With my first hypothesis dead on arrival, I turned back to Google to brush up on LATCH_EX. I found this Stack Exchange post, where the chosen correct answer stated that,


There are many reasons that can lead to exhaustion of worker threads :

  • Extensive long blocking chains causing SQL Server to run out of worker threads.
  • Extensive parallelism also leading to exhaustion of worker threads.
  • Extensive wait for any type of "lock" - spinlocks, latches. An orphaned spinlock is an example.

Well I didn't have any long blocking chains and I didn't see any CXPACKET waits. But I did see latches! So I developed hope that I wasn't crazy about this connection from latches to signal wait. I kept searching…

I found this sqlserverfaq.net link. It provided the query I used to identify that my latch wait class was ACCESS_METHODS_DATASET_PARENT.  It also broke down latches into 3 categories and identified that mine was a non-buffer latch.  So I had a new keyword and a new search phrase: ACCESS_METHODS_DATASET_PARENT and "non-buffer latch".

-- Wait time, wait count, and share of total wait time for each non-buffer latch class
SELECT  latch_class,
        wait_time_ms / 1000.0 AS [Wait In sec],
        waiting_requests_count AS [Count of wait],
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage
FROM sys.dm_os_latch_stats
WHERE latch_class NOT IN ('BUFFER')
AND wait_time_ms > 0

Then I found this MSDN post. About halfway in, the author writes this about ACCESS_METHODS_DATASET_PARENT: "Although CXPacket waits are perhaps not our main concern, knowing our main latch class is used to synchronize child dataset access to the parent dataset during parallel operations, we can see we are facing a parallelism issue".

Then I found another blog post not only supporting the new theory, but also referencing a post by Paul Randal from SQLskills.com, one of the most reputable organizations regarding SQL Server performance.  It states, "ACCESS_METHODS_DATASET_PARENT...This particular wait is created by parallelism...."

And for the icing on the cake, I found this tweet from SQLskills.com.  It may have been posted by Paul Randal himself.

So now I know that LATCH_EX shows when SQL Server parallelizes table scans.  So instead of one thread doing a table scan, I had several threads working together on each table scan.  So it started to make sense.  I had ruled out parallelization because I didn't see any CXPACKET waits, which many DBAs think of as THE parallelism wait.  And now THIS DBA (me) knows it's not the only parallelism wait!  #LearnSomethingNewEveryDay


So now I feel confident I can explain how an abundance of LATCH_EX waits can result in high CPU signal waits.  But I'm still left wondering why signal waits can be over 20% and the CPU is only showing 50% utilization.  I'd like to tell you that I have an answer, or even a theory, but for now, I have a couple of hypotheses.


  1. It may be similar to comparing bandwidth and latency.  It seems server CPU utilization is like bandwidth i.e. how much work can be done vs what is getting done, while signal waits is like latency i.e. how long does a piece of work wait before work begins.  Both contribute to throughput but in very different ways.  If this is true, then perhaps the CPU workload for a query with LATCH_EX is not so much work but rather time consuming and annoying.  Like answering kids in the back seat that continually ask, "are we there yet?"  Not hard.  Just annoying me and causing me to miss my exit.

  2. It may simply be that I had so little load on the server that the small amount of signal waits accounted for a larger percentage of the work.  In other words, I may have had 8 threads experiencing signal wait at any time.  Not a lot, but 8 of 35 threads is over 20%.  So in other words, "these are not the droids I was looking for."
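The arithmetic behind the second hypothesis can be sketched in a few lines of Python. The thread counts (8 of 35) and the 20% rule of thumb come from the text above; the function and constant names are made up for illustration, and this is not how any monitoring product computes its figures.

```python
def signal_wait_share(threads_in_signal_wait: int, total_threads: int) -> float:
    """Return the percentage of threads that are runnable but waiting for a CPU."""
    if total_threads <= 0:
        raise ValueError("total_threads must be positive")
    return 100.0 * threads_in_signal_wait / total_threads


CONCERN_THRESHOLD_PCT = 20.0  # rule of thumb quoted in the post

share = signal_wait_share(8, 35)
print(f"{share:.1f}% of threads are in signal wait")
print("above threshold" if share > CONCERN_THRESHOLD_PCT else "below threshold")
```

Note that the ratio says nothing about how busy the CPU is overall: a lightly loaded server with few active threads can cross the threshold while utilization stays modest, which is the point of the hypothesis.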


Maybe you have a hypothesis?  Or maybe you know that one or both of mine are wrong.  I welcome the discussion and I think other Thwack users would love to hear from you as well.



Related resources:


Article: Hardware or code? SQL Server Performance Examined — Most database performance issues result not from hardware constraint, but rather from poorly written queries and inefficiently designed indexes. In this article, database experts share their thoughts on the true cause of most database performance issues.


Whitepaper: Stop Throwing Hardware at SQL Server Performance — In this paper, Microsoft MVP Jason Strate and colleagues from Pragmatic Works discuss some ways to identify and improve performance problems without adding new CPUs, memory or storage.


Infographic: 8 Tips for Faster SQL Server Performance — Learn 8 things you can do to speed SQL Server performance without provisioning new hardware.


In a network, whether small or large, spread over one location or many, there are network administrators, system administrators, or network engineers who frequently access the IP address store. While many organizations still use spreadsheets, database programs, and other manual methods for IP address management, the same document or software is accessed and updated by multiple people. Network administrators take on the role of assigning IPs in small networks, as well as when they add new network devices or reconfigure existing ones.  The system administrator takes care of assigning IPs to new users that join the network and adding new devices like printers, servers, VMs, DHCP & DNS services, etc. Larger networks that are spread over multiple locations sometimes have a dedicated person assigned specifically to manage planning, provisioning, and allocation of IP space for the organization. That person also takes care of research, design, and deployment of IPv6 in the network. Delegating IP management tasks to specific groups based on expertise or operations (network & systems teams) allows teams to work independently of each other and meet IP requirements faster.


Again, if the central IP address repository is maintained by a single person, then the problem lies in the delay in meeting these IP address requests. Furthermore, that person could run into human errors and grievances from teams experiencing downtime while they wait to complete their tasks.


What Could Go Wrong When Multiple Users Access the Same Spreadsheet?

Spreadsheets are an easily available and less-expensive option to maintain IP address data. But, it does come with its own downsides when multiple users access the same spreadsheet. Typically, users tend to save a copy to their local drive and then finding the most recently updated version becomes another task! You end up with multiple worksheets with different data on each of them. There is no way to track who changed what. Ultimately, this leads to no accountability for misassignments or IP changes made.


In short, this method is bound to produce errors and obsolete data, and it lacks security controls. There could be situations when an administrator changes the status of an IP address but forgets to communicate the change to the team or person that handles DHCP or DNS services. In turn, chances are higher that duplicate IP addresses are assigned to a large group of users, causing IP conflicts and downtime.
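To make the duplicate-assignment problem concrete, here is a minimal Python sketch that merges two diverged copies of an IP spreadsheet and reports every address claimed by more than one host. The sheet contents, host names, and the `find_conflicts` helper are all hypothetical; a real export would need its own parsing.

```python
import ipaddress
from collections import defaultdict


def find_conflicts(*sheets):
    """Return {ip: sorted hosts} for every IP claimed by more than one host."""
    owners = defaultdict(set)
    for sheet in sheets:
        for ip_text, host in sheet:
            # ip_address() validates the entry and normalizes its spelling
            ip = ipaddress.ip_address(ip_text.strip())
            owners[ip].add(host.strip().lower())
    return {str(ip): sorted(hosts) for ip, hosts in owners.items() if len(hosts) > 1}


# Two locally saved copies that drifted apart (made-up data)
network_team_copy = [("10.0.0.10", "printer-1"), ("10.0.0.11", "sw-core")]
systems_team_copy = [("10.0.0.10", "vm-web01"), ("10.0.0.12", "dhcp-1")]

conflicts = find_conflicts(network_team_copy, systems_team_copy)
print(conflicts)
```

A check like this only finds conflicts after the fact; it does not tell you who made which change, which is the accountability gap described above.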


With all that said, the questions that remain are: Can organizations afford the network downtime? And are the dollars saved by not investing in a good IP address management solution more than those lost due to lost productivity? This post discussed the problems of using manual methods for IP address management. In my next blog, we will look at associated issues and the best practices for roles and permissions that enable task delegation across teams.


Do you face similar difficulties with your IP administration? If yes, how are you tackling them?


Doing IT Remotely!

Posted by vinod.mohan May 14, 2015

Often, as organizations grow and expand, the job gets harder for IT teams. The IT infrastructure may become larger and more complicated, and be distributed across various sites and locations. For example, the end-users to support could be onsite, offsite, or even on the road travelling. There may not be enough admins in all locations, and remote IT management becomes essential.


Even in smaller businesses and start-ups where office space and IT infrastructure is not quite ready yet, and employees are telecommuting from home and elsewhere, the need for remote IT surfaces. A single IT pro wearing a dozen different IT hats will have to make do and support end-users wherever they may be.


Remote IT is generally defined in different ways by solution providers based on the solution they offer. In this blog, I am attempting to cover as many scenarios as possible that could be called remote IT.



  • IT pros in one location managing the infrastructure (network, systems, security, etc.) in a remote location
  • IT pros in one location supporting end-users in a remote location
  • IT pros within the network supporting end-users outside the network
  • IT pros monitoring and troubleshooting infrastructure issues while on the go, on vacation, or after office hours
  • Monitoring the health of remote servers, applications, and infrastructure on the Cloud
  • Remote monitoring and management (RMM) used by IT service providers to manage the IT infrastructure of their clients
  • User experience monitoring of websites and web applications—both real user monitoring and synthetic user monitoring
  • Site-to-site WAN monitoring to track the performance of devices from the perspective of remote locations
  • Certain organizations have their mobile device management (MDM) policies that include remote wiping of data on lost or stolen BYOD devices containing confidential corporate information


This may not be a comprehensive list. Please do add, in the comments below, what else you think fits in the realm of remote IT.


But the primary need for remote IT is that, without having to physically visit in person a remote site or user, we have to make IT work—monitor performance, diagnose faults, troubleshoot issues, support end-users, etc. And, this should be done in a way that is cost-effective and result-effective to the business.


Just as we need a phone or computer (a tool, basically) to communicate with a person situated remotely, making remote IT work comes down to using remote IT tools. When you’re equipped with the right tools and gear to manage IT remotely, you will gain greater control and simplicity to work your IT mojo wherever the IT infrastructure is, the user is, or you, the IT pro, are.


Also, share with us what tools you use for doing IT remotely.

In my last post "THERE MUST BE A BETTER WAY TO MANAGE VIRTUALIZED SYSTEMS", we talked about what systems are out there and which ones everyone is using. Ecklerwr1 posted a nice chart from VMware which compares VMware vRealize Operations to SolarWinds Virtualization Manager and a few others.



Based on the discussion, it seems like many people are using some kind of software to get things sorted in their virtual environment. In my previous job, I was responsible for parts of the lab infrastructure. We hosted 100+ VMs for customer support, so our employees could reproduce customer issues or use them for training.


While managing the lab and making sure we always had enough resources available, I found it difficult to identify which VMs were actively being used and which had been idle for some time. Another day-to-day activity was hunting down snapshots that consumed a massive amount of space.

Back then, we wrote some vSphere CLI scripts to get the job done. Not very efficiently, but done. However, using SolarWinds Virtualization Manager now, I see how easy my life could have been.


My favorite features are the ability to view idle VMs and monitor the VM snapshots disk usage. Both features could have saved me lots of hours in my previous job. 
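Stripped of any tooling, the two checks described above (idle VMs and snapshot disk usage) amount to simple filters over an inventory export. The sketch below is Python over made-up records; the field names, the 30-day idle cutoff, and the 50 GB snapshot limit are assumptions for illustration and do not reflect how Virtualization Manager or the vSphere CLI report this data.

```python
from datetime import date


def idle_vms(vms, today, idle_days=30):
    """Names of VMs with no recorded activity for at least idle_days."""
    return [vm["name"] for vm in vms if (today - vm["last_active"]).days >= idle_days]


def large_snapshots(vms, limit_gb=50):
    """(name, snapshot size) pairs above limit_gb, largest first."""
    hits = [(vm["name"], vm["snapshot_gb"]) for vm in vms if vm["snapshot_gb"] > limit_gb]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical inventory export
inventory = [
    {"name": "lab-01", "last_active": date(2015, 5, 18), "snapshot_gb": 5},
    {"name": "lab-02", "last_active": date(2015, 3, 1), "snapshot_gb": 120},
    {"name": "lab-03", "last_active": date(2015, 4, 10), "snapshot_gb": 60},
]

today = date(2015, 5, 20)
print(idle_vms(inventory, today))
print(large_snapshots(inventory))
```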

I am curious to know which features are saving you time on a regular basis. Or are there any features which we are all missing but just don’t know it yet? As Jfrazier mentioned, maybe Virtual Reality Glasses?

If you are an Oracle DBA and reading this, I am assuming all of your instances run on *nix and you are a shell scripting ninja. For my good friends in the SQL Server community, if you haven’t gotten up to speed on PowerShell, you really need to this time. Last week, Microsoft introduced the latest version of Windows Server 2016, and it does not come with a GUI. Not like, click one thing and you get a GUI, more like run through a complex set of steps on each server and you eventually get a graphical interface. Additionally, Microsoft has introduced an extremely minimal server OS called Nano Server, which will be ideal for high-performing workloads that need to minimize OS resource usage.


One other thing to consider is automation and cloud computing. If you live in a Microsoft shop, this is all done through PowerShell, or maybe DOS (yes, some of us still use DOS for certain tasks).  So my question for you is: how are you learning scripting? In a smaller shop the opportunities can be limited, so I highly recommend the Scripting Guy’s blog. Also, doing small local operating system tasks via the command line is a great way to get started.

I was watching a recent webcast titled “Protecting AD Domain Admins with Logon Restrictions and Windows Security Log” with Randy Franklin Smith, where he discussed and demonstrated at length techniques for protecting and keeping an eye on admin credential usage. As he rightfully pointed out, no matter how many policies and compensating controls you put into place, at some point you really are trusting your fellow IT admins to do their job (but not more) with the level of access we grant and entrust to them.


However, there’s a huge catch-22: as an IT admin I want to know you trust me to do my job, but I also have a level of access that could really do some damage (like the San Francisco admin that changed critical device passwords before he left). On top of that, tools that help me and my fellow admins do our jobs can be turned into tools that help attackers access my network, like the jump box in Randy’s example from the webcast.


Now that I’ve got you all paranoid about your fellow admins (which is part of my job responsibilities as a security person), let’s talk techniques. The name of the game is: “trust, but verify.”


  1. Separation of duties: a classic technique which really sets you up for success down the road. Use dedicated domain admin/root access accounts separate from your normal everyday logon. In addition, use jump boxes and portals rather than flat out providing remote access to sensitive resources.
  2. Change management: our recent survey of federal IT admins showed that the more senior you are, the more you crave change management. Use maintenance windows, create and enforce change approval processes, and leave a “paper” trail of what’s changing.
  3. Monitor, monitor, monitor: here’s your opportunity to “verify.” You’ve got event and system logs, use them! Watch for potential misuse of your separation of duties (accidental OR malicious), unexpected access to your privileged accounts, maintenance outside of expected windows, and changes performed that don’t follow procedure.
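One of the "verify" checks above, privileged activity outside the expected maintenance window, can be sketched in Python as follows. The account names, the 22:00 to 02:00 window, and the event tuples are hypothetical; a real implementation would read these from your event and system logs rather than a hard-coded list.

```python
from datetime import datetime, time

MAINTENANCE_START = time(22, 0)               # assumed window: 22:00 to 02:00
MAINTENANCE_END = time(2, 0)
PRIVILEGED_ACCOUNTS = {"da-alice", "da-bob"}  # hypothetical dedicated admin accounts


def in_window(moment: time, start: time, end: time) -> bool:
    """True if moment falls in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= moment < end
    return moment >= start or moment < end


def suspicious_events(events):
    """Privileged-account events that happened outside the maintenance window."""
    return [
        (stamp, account, action)
        for stamp, account, action in events
        if account in PRIVILEGED_ACCOUNTS
        and not in_window(stamp.time(), MAINTENANCE_START, MAINTENANCE_END)
    ]


events = [
    (datetime(2015, 5, 12, 23, 15), "da-alice", "password change"),
    (datetime(2015, 5, 13, 14, 5), "da-bob", "password change"),
    (datetime(2015, 5, 13, 14, 30), "carol", "logon"),
]

flagged = suspicious_events(events)
for stamp, account, action in flagged:
    print(f"{stamp:%Y-%m-%d %H:%M} {account}: {action}")
```

A flagged event is a prompt to ask a question, not proof of misuse; emergency fixes legitimately happen outside windows, which is why the change-management paper trail matters.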


The age-old battle of security vs. ease of use rages on, but in the real world, it’s crucial to find a middle ground that helps us get our jobs done but still respects the risks at hand.


How do you handle the challenge of dealing with admin privileges in your environment?


Recommended Resources


REVIEW - UltimateWindowsSecurity Review of Log & Event Manager by Randy Franklin Smith


VIDEO – Actively Defending Your Network with SolarWinds Log & Event Manager



Throughout previous blog posts, I talked about thin provisioning, approaches for moving from fat to thin, and the practice of over-committing. Those posts covered how each works, along with the advantages, drawbacks, and methodology. I also talked about the need for constant monitoring of your storage as the solution to many of the drawbacks. This article covers how to apply a storage monitoring tool to your infrastructure to monitor your storage devices. When you select a tool, make sure it has alerting options too.  I will walk you through SolarWinds Storage Resource Monitor (SRM for short), which is one such storage monitoring tool, and along the way I will talk about the features that any storage monitoring tool requires to overcome the weaknesses of thin provisioning.


Introduction to SRM:

SRM is SolarWinds' storage monitoring product. SRM monitors, reports, and alerts on SAN and NAS devices from vendors like Dell, EMC, NetApp, and so on. For a detailed list check here. In addition, SRM helps to manage and troubleshoot storage performance and capacity problems.

You can download SRM from the link below:

Storage Resource Monitor

Once you have installed SRM, next you will need to add your storage device. Adding your storage device is different based on your vendor. Visit the below page for instructions on how to add storage devices from different vendors.

How to add storage devices


Once you have installed SRM and added your storage devices to SRM, you will have instant visibility into all storage layers, extending to virtualization and applications with the Application Stack Environment Dashboard. Using SRM, troubleshooting storage problems across your application infrastructure is a cakewalk. Let’s start with SRM’s dashboard.




The dashboard gives you a bird's-eye view of any issues in your storage infrastructure. Further, the dashboard displays all storage devices monitored by SRM, classified by product, along with the status of each layer of storage, such as storage arrays, storage pools, and LUNs.


SRM and Thin Provisioning:

Moving on to thin provisioning, SRM allows you to manage thin-provisioned LUNs more effectively. When thin provisioning is managed and monitored accurately, over-provisioning or over-committing can be done efficiently. SRM helps you view, analyze, and plan thin provisioning deployments by collecting and reporting detailed information on virtual disks, so you can manage the level of over-commitment on your datastores.




This resource presents a grid of all LUNs using thin provisioning in the environment.

The columns are:

  • LUN: Shows the name of the LUN and its status
  • Storage Pool: Shows which storage pool the LUN belongs to
  • Associated Endpoint: The server volume or the datastore using the LUN
  • Total Size: The total user size of the LUN
  • Provisioned Capacity: Amount of capacity currently provisioned

There are also columns that show the provisioned percentage, File System Used Capacity, and File System Used Capacity percentage for the concerned LUN.


A tool tip will appear when you hover over the LUN or Storage Pool, giving you a quick snapshot of performance and capacity. This helps you decide if you need to take action. When you hover over a storage pool, the tool tip also shows the pool’s usable capacity summary: the total usable capacity (i.e., the total amount of storage capacity that a user can actually use), the remaining capacity (the storage still available to be used), and the over-subscribed capacity (the amount by which this storage pool is over-committed).




A drill-down on a specific storage pool presents important key/value pairs of information for that storage pool, including detailed information on:

  • Total Usable Capacity
  • Total Subscribed Capacity
  • Over-Subscribed Capacity
  • Provisioned Capacity
  • Projected Run-Out Time: the approximate time it will take to fully utilize this storage pool.
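How the figures in this list relate to one another can be sketched with a little arithmetic. The Python below assumes that over-subscription is subscribed capacity minus usable capacity and that run-out is projected from a constant daily growth rate; both are simplifying assumptions for illustration, not a description of SRM's actual calculations, and the numbers are made up.

```python
def pool_summary(usable_tb, subscribed_tb, provisioned_tb, growth_tb_per_day):
    """Derive over-subscription and a linear run-out estimate for one storage pool."""
    over_subscribed_tb = max(0.0, subscribed_tb - usable_tb)
    remaining_tb = max(0.0, usable_tb - provisioned_tb)
    if growth_tb_per_day > 0:
        run_out_days = remaining_tb / growth_tb_per_day
    else:
        run_out_days = float("inf")  # no growth means no projected run-out
    return {
        "over_subscribed_tb": over_subscribed_tb,
        "over_subscribed_pct": 100.0 * over_subscribed_tb / usable_tb,
        "remaining_tb": remaining_tb,
        "run_out_days": run_out_days,
    }


# Hypothetical pool: 100 TB usable, 150 TB promised to LUNs, 70 TB actually written
summary = pool_summary(usable_tb=100.0, subscribed_tb=150.0,
                       provisioned_tb=70.0, growth_tb_per_day=0.5)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Real growth is rarely linear, so a projection like this is best treated as an early warning rather than a deadline.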




In addition, Active Alerts displays the alerts related to this storage pool. It shows the alert name, a short alert message, the name of the LUN for which the alert was triggered, and its time.

Learn how to create an alert in SRM.


Alerting helps proactive monitoring:

Storage performance issues can happen anytime, and you cannot watch how storage is performing every second. This is why you need alerts. They warn you before a problem occurs. By setting up alerts based on criteria, you will gain complete visibility into your storage. Set up each alert to anticipate a particular situation that can cause issues with storage performance.




Provided below is a list of example alerts that you can use for LUNs while doing thin provisioning:

  • Alert when usable space in the LUN goes below a particular % (e.g., 20%)
  • Alert when usable space in a storage pool goes below a particular %
  • Alert when the storage pool's over-subscribed % goes higher than a particular % (e.g., 10%)

The % values can only be decided by you, as they will differ based on your infrastructure. Some organizations can add more storage in days, whereas in many others it might take months to get approval for additional storage. Therefore, the right % setting is a decision only you can make. 
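The three example alerts can be expressed as a small threshold check. This Python sketch is not SRM's alert engine; the `Thresholds` class and `evaluate` function are hypothetical, the 20% and 10% defaults are the example values from the list above, and the storage pool free-space default is an assumption because the list gives no figure for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    lun_free_pct_min: float = 20.0      # example value from the list above
    pool_free_pct_min: float = 20.0     # assumed; no figure is given for pools
    pool_oversub_pct_max: float = 10.0  # example value from the list above


def evaluate(lun_free_pct, pool_free_pct, pool_oversub_pct, limits=Thresholds()):
    """Return the alert messages triggered by the current readings."""
    alerts = []
    if lun_free_pct < limits.lun_free_pct_min:
        alerts.append(f"LUN usable space below {limits.lun_free_pct_min:.0f}%")
    if pool_free_pct < limits.pool_free_pct_min:
        alerts.append(f"Storage pool usable space below {limits.pool_free_pct_min:.0f}%")
    if pool_oversub_pct > limits.pool_oversub_pct_max:
        alerts.append(f"Storage pool over-subscription above {limits.pool_oversub_pct_max:.0f}%")
    return alerts


alerts = evaluate(lun_free_pct=15.0, pool_free_pct=35.0, pool_oversub_pct=12.0)
for message in alerts:
    print(message)
```

An organization where storage approval takes months might pass something like `Thresholds(lun_free_pct_min=30.0)` to get warned earlier; the right numbers depend on your procurement lead time.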


Once you have alerts in place, you can just sit back and relax, and spend the time that you used to spend monitoring thin provisioning and over-committing in storage on other endeavors.


Well, my last blog generated quite an interest and discussion on the use of CLI for box configuration.


As a follow-up, I want to write on a related topic. It may generate some difference of opinion, but that is my goal: to generate a wider discussion on this topic.


OK, in my last post I said that CLI is cumbersome; it takes a while to get used to and the worst thing is that if something goes wrong, the troubleshooting takes ages, sometimes.

I also said that technologies like NETCONF and YANG would really make configuration easier and more intelligent, and move the focus from box configuration to network configuration, in addition to making configuration GUI friendly.


I want to bring a new dimension to this discussion.


Let’s see if Cisco would really like to give you a better user interface and a better configuration tool.


Although I write Cisco here, it can mean any vendor that offers a CLI experience, for example Juniper. (I specifically mean any CLI which is proprietary and vendor specific.)


OK, to start with, let’s agree on a fact: using CLI is a skill, or rather an expert skill. This skill is required to configure a box and additionally to troubleshoot networking issues. Not only do you need to know how to move around with CLI, but you should be able to do it with speed. Isn’t that so?


This skill requires training and certification. If one has expert certification, it means that he is not only intelligent but he is a command guru. Correct?


Cisco certification is a big money-making industry. If not a billion dollars, it must be generating hundreds of millions of dollars of revenue for Cisco (I contacted Cisco to get real figures, but it seems these are not public). Cisco makes money by charging for exams and selling training. Then there is a whole ecosystem of Cisco learning partners, where Cisco makes money by combining its products with training services and selling through them.


It costs to get expert level certifications. There is a cost if one passes, and there is more cost, if one fails.


An engineer may end up paying thousands of dollars for training and exams. We are talking about huge profits for Cisco here, just because of the popularity of certifications. There is one for everyone: from beginner to expert, from operations engineer to architect.


Besides creating experts, Cisco is winning here from three angles:


  1. It is making its customers used to CLI as customers feel at home using the codes they are trained on.
  2. It is creating loyal users and customers as they would recommend products they already know very well.
  3. It is generating big revenue (and big margins, as it is a service).


For sure, it is a win-win for Cisco here.


From my perspective, therefore, a difficult-to-operate switch or router is in the direct interest of Cisco, as Cisco needs experts to run its products and the experts need certifications.

Cisco, therefore, would NOT have much incentive to make networks easy to operate and configure. I have even seen the GUI of one Cisco product; it simply sucks. It seems to me that it is not one of their focuses.


Thus, this raises an important question here:


Why would Cisco take steps to make the network more programmable and easier to operate with newer tools, and take CLI out of its central focus? Wouldn’t it rather stick with difficult-to-operate products and keep on making more money?


Would you agree with me?


I would like to hear whether you agree or disagree, and why.




After publishing this article, the majority of comments focused only on CLI versus GUI. For sure, GUI is more user-friendly, but CLI has delivered well because it has not had good competition, from either a good GUI or SNMP, up till now.


However, the main message was about “vendor-specific CLI,” NOT the command line in general. In the programmable age, tools like NETCONF and YANG offer a standard way to configure network elements. Whether you use them with a GUI or with a command line, the benefits far exceed those of a vendor “CLI”. NETCONF/YANG is a standard way to configure any vendor's equipment. The protocol leaves it to the vendor to determine how to apply configuration instructions, and in what order, within their devices. This puts pressure on vendors to do additional development on their products so that the user's configuration is applied correctly in whatever order the user supplied it. It also removes the pressure on the user to learn configuration for multiple vendors and multiple CLIs. This is the future, NOT CLI.
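To show what "a standard way to configure network elements" can look like, here is a minimal Python sketch that builds a NETCONF `<edit-config>` request using the standard `ietf-interfaces` YANG model. It only constructs the XML with the standard library; it does not open a NETCONF session. The interface name and description are made up, and it assumes the target device implements `ietf-interfaces` and allows writes to the running datastore, which not every device does.

```python
import xml.etree.ElementTree as ET

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
IF_NS = "urn:ietf:params:xml:ns:yang:ietf-interfaces"


def build_edit_config(interface_name: str, description: str, message_id: str = "101") -> str:
    """Build an <edit-config> RPC that sets the description of one interface."""
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": message_id})
    edit = ET.SubElement(rpc, f"{{{NETCONF_NS}}}edit-config")
    target = ET.SubElement(edit, f"{{{NETCONF_NS}}}target")
    ET.SubElement(target, f"{{{NETCONF_NS}}}running")  # assumes a writable running datastore
    config = ET.SubElement(edit, f"{{{NETCONF_NS}}}config")
    interfaces = ET.SubElement(config, f"{{{IF_NS}}}interfaces")
    interface = ET.SubElement(interfaces, f"{{{IF_NS}}}interface")
    ET.SubElement(interface, f"{{{IF_NS}}}name").text = interface_name
    ET.SubElement(interface, f"{{{IF_NS}}}description").text = description
    return ET.tostring(rpc, encoding="unicode")


payload = build_edit_config("eth0", "uplink to core")
print(payload)
```

The same payload would, in principle, be sent unchanged to any device that supports the model, which is the contrast with maintaining one set of CLI commands per vendor.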


The IT Approach to Security

Posted by cxi May 11, 2015

Hello again! Welcome to my next installment, with various slides I've stolen from my own presentations that I deliver at conferences.

If you read last week's installment, Checkbox vs Checkbook Security, you probably know by now that security is an area which is personally important to me.


With that said, let's dive a little deeper into what is often the IT Approach to security...


How many times have you heard someone say, "I'm not a big enough target"? Heck, maybe you've even heard yourself say that.

Certainly, in a world of targeted attacks, where threat actors strike to stop you from publishing what was otherwise a horrible movie (Sony), or where credit card and customer data is stolen for the purpose of stealing money or other uses (JPMC/Chase), or where hundreds of millions of dollars are stolen from hundreds of banks (too many sources to count), that puts you into the landscape of, "I'm not a big enough target, why would anyone bother with me!"


Let's not forget for a moment here, though, that the security landscape is not hard and fast... Attack scripts and threat engines are at times indiscriminate in their assault.  A perfect example (taken from the old war-dialing days): just as we'd dial entire banks of phone numbers looking for modems to connect into, there are attackers who will cycle through entire banks of IP addresses while trying to exploit the latest zero-day vulnerability.   Most WordPress sites that get hacked are hacked not because they were targeted, but because they were vulnerable.


Or, if this analogy helps: more people are likely to take something from a car with its windows open or its top down than from one which is all locked up.



What is it that makes us, irrespective of size, a target?

Screen Shot 2015-05-10 at 8.58.46 PM.png

I included this image here from my own threatmap to give you a sense of just what kinds of things can and do happen.

So the question then arises: what exactly makes something 'targetable'?

You are a target if:

  • You are connected to a network
  • You run a service which is accessible via a network protocol (TCP, IP, UDP, ICMP, Token-Ring...;))
  • You run an application, server, service which has a vulnerability in it, whether known or unknown
    • I just want to mention for a moment... Shellshock, the Bash vulnerability disclosed 24SEP2014, had been present in Bash since September 1989; just food for thought


So you're pretty much a target if you... Exist, Right? Wow that leaves us all warm and fuzzy I imagine...

But it doesn't have to be that way! You don't have to run in terror and shut everything down for fear of it being hacked.  But in the same breath, we need not stick our head in the sand assuming that we are invincible and invulnerable because no one would ever attack us, or steal our data, or whatever other lies we tell ourselves to sleep at night.


Do you see a future with fewer zero-day attacks, or one with more critical vulnerabilities being discovered, whether they had existed for 25 years before discovery (à la Shellshock) or were introduced in the recent past (such as Heartbleed)?

You know I love your insight! So you tell me... How are you a target, or NOT a target!   What other ways do you see people being a target? (I haven't even touched the mobile landscape...)


I look forward to your thoughts on this matter Thwack Community!

Microsoft Ignite 2015 concluded its inaugural event with 20k+ attendees. The SolarWinds team united in the Windy City, Chicago, to provide the single point of truth in IT monitoring for the continuous delivery and integration era with Ignite attendees. SolarWinds also teamed up with Lifeboat Distribution and hosted a Partner Meet and Greet during Microsoft Ignite at Chicago's Smith & Wollensky covering steaks and application stack management. Ignite lit IT from start to finish.


Microsoft Announcements at Ignite

There were plenty of announcements made by Microsoft, and they've been covered extensively, especially on Microsoft's Channel 9 program. The announcements involved SoCoMo (social, cloud, and mobility), with Office 365, Azure, and Windows OS taking the front and center roles. The Edge beat out the Project Spartan name by a brow...ser to become Internet Explorer's named successor. Other notable news included Windows 10 being the last version of Windows and showcase demos of some of the "software defined" roles of Windows Server 2016 aka Windows Server Technical Preview 2, especially Active Directory, Docker containers, RMS, and Hyper-V. And there was something about Office 365 and its E3 subscription, which includes the core Office application suite, plus cloud-based Exchange, SharePoint, and Skype for Business. Exchange, SharePoint, and Unified Communications admins were put on notice, and the consensus was that they had to broaden and deepen their skills in other areas, especially cloud.


From the Expo Floor

The SolarWinds booth saw non-stop traffic throughout Ignite. The conversations ranged from the latest and greatest Microsoft announcements to solutions that were two or three generations behind. But regardless of the environment, whether on-premises, colo, private/public cloud or hybrid, the application was clearly on the minds of IT Ops AND it required monitoring along with database, security, log & patch management. Conversations also included a healthy dose of the Dev side of the DevOps equation. And yes, Devs need monitoring as well. Without baselines and trends, there can be no truth on what "good" should be.


Enjoy some of the moments from SolarWinds' Microsoft Ignite booth.


[Ignite pictures: booth, booth presentation, demo, swag bag, geek out, thwack hammer]


Thank you Ignite

Thank you Ignite attendees for the conversations from those of us who attended and represented the SolarWinds family! Fantastic job SolarWinds team! See you next year.


Pictured: 1st row - Ryan Albert Donovan, Brian Flynn, Troy Lehman, Danielle Higgins, Aaron Searle, Wendy Abbott. 2nd row - Matthew Diotte, Kong Yang, Mario Gomez, Patrick Hubbard, Michael Thompson. 3rd row - Dan Balcauski, Karlo Zatylny, Cara Prystowsky, Ash Recksiedler. Not pictured (because of flight times): Thomas LaRock, Jennifer Kuvlesky, Jon Peters.

Leon Adato

Convention Season

Posted by Leon Adato May 8, 2015

Convention season is upon us. I know that conventions happen throughout the year, but it seems like April is when things kick into high gear.


As anyone who has been in IT for more than a month can tell you, there are so many incredible opportunities to get out there and network, learn, and see what is heading down the pipeline. It can really be overwhelming both to the senses and the budget.


The Head Geeks try very hard to find opportunities to meet up with customers, fellow thwack-izens, and like-minded IT Professionals. But like you, there are only so many days in a month and dollars in the budget.


I took a quick poll of the other Geeks to find out:


  1. Which shows we are GOING to be attending this year.
  2. Which ones we know we SHOULD be attending, but can’t due to other constraints.
  3. Which ones we WISH we could attend, even if it’s a little off the beaten path.


Here’s what I’d like from you: In the comments, let us know which shows YOU are going to be attending, and which ones you would like to see US attend next year. That will help us justify our decisions (and budget!) and (hopefully) meet up with you!



Attending:

Tom: PASS Summit, VMworld, Ignite

Kong: MS Ignite, VMworld, SpiceWorld Austin, Philadelphia VMUG USERCON, and Carolina VMUG USERCON

Patrick: Cisco Live, Ignite

Leon: Cisco Live


Should Attend:

Tom: Spiceworks, VMworld (Barcelona)

Kong: “Are you insane?!?! Did you see what I’m already going to?”

Patrick: VMworld

Leon: Interop, Ignite, SpiceWorld


Wish We Could Attend:

Tom: SXSW, AWS re:Invent 2015

Kong: AWS re:Invent

Patrick: RSA, AWS re:Invent 2015

Leon: Interop, DefCon, RSA


Like I said, let us know in the comments where YOU are going to be, and we’ll start to make plans to be there the next time around.

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here along with information on the first question (Why did I get this alert). You can get the low-down on the second question (Why DIDN'T I get an alert) here. And the third question (What is monitored on my system) is here.


My goal in this post is to give you the tools you need to answer the fourth question: Which of the existing alerts will potentially trigger for my system?


Reader's Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.


Riddle Me This, Batman...

It's 3:00pm. You can't quite see the end of the day over the horizon, but you know it's there. You throw a handful of trail mix into your face to try to avoid the onset of mid-afternoon nap-attack syndrome and hope to slide through the next two hours unmolested.


Which, of course, is why you are pulled into a team meeting. Not your team meeting, mind you. It's the Linux server team. On the one hand, you're flattered. They typically don't invite anyone who can't speak fluent Perl or quote every XKCD comic in chronological order. On the other...well, team meeting.


The manager wrote:

            kill `ps -ef | grep -i talking | awk '{print $2}'`

on the board, eliciting a chorus of laughter from everyone but me. My silence gave the manager the perfect opportunity to focus the conversation on me.


“We have this non-trivial issue, and are hoping you can grep out the solution for us,” he begins. “We're responsible for roughly 4,000 systems...”


Unable to contain herself, a staff member followed by stating, “4,732 systems. Of which 200 are physical and the remainder are virtualized...”


Unimpressed, her manager said, “Ms. Deal, unless I'm off by an order of magnitude, there's no need to correct.”


She replied, “Sorry boss.”


“As I was saying,” he continued. “We have a...significant number of systems. Now how many alerts currently exist in the monitoring system which could generate a ticket?”


“436, with 6 currently in active development,” I respond, eager to show that I'm just as on top of my systems as they are of theirs.


“So how many of those affect our systems?” the manager asked.


Now I'm in my element. I answer, “Well, if you aren't getting tickets, then none. I mean, if nothing has a spiked CPU or RAM or whatever, then it's safe to say all of your systems are stable. You can look at each node's detail page for specifics, although with 4,000 I can see where you would want a summary. We can put something together to show the current statistics, or the average over time, or...”


“You misunderstand,” he cuts me off. “I'm fully cognizant of the fact that our systems are stable. That's not my question. My question is…should one of my systems become unstable, how many of your... what was the number? Oh, right: How many of your 436-soon-to-be-442 alerts WOULD trigger for my systems?”


“As I understand it, your alert logic does two things: it identifies the devices which could trigger the alert (all Windows systems in the 10.199.1 subnet, for example) and at the same time specifies the conditions under which an alert is triggered (say, when the CPU goes over 80% for more than 15 minutes).”


“So what I mean,” he concluded, “is this: can you create a report that shows me the devices which are included in the scope of an alert's logic, irrespective of the trigger condition?”


Your Mission, Should You Choose to Accept it...


As with the other questions we've discussed in this series, the specifics of HOW to answer this question are less critical than knowing you will be asked it.


In this case, it's also important to understand that this question is actually two questions masquerading as one:

  1. For each alert, tell me which machines could potentially trigger it
  2. For each machine, tell me which alerts could potentially be triggered

Why is this such an important question, perhaps the most important of the Four Questions in this series? Because it determines the scale of the potential notifications monitoring may generate. It's one thing if 5 alerts apply to 30 machines. It's entirely another when 30 alerts apply to 4,000 machines.
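To make the two-questions-in-one idea concrete, here is a minimal sketch in Python. It assumes a monitoring tool where each alert's scope can be evaluated separately from its trigger; the node inventory, alert names, and field names below are all made up for illustration and do not come from any particular product.

```python
from ipaddress import ip_address, ip_network

# Hypothetical inventory: node name -> attributes (all values are invented).
NODES = {
    "web01": {"os": "Windows", "ip": "10.199.1.15"},
    "web02": {"os": "Windows", "ip": "10.199.2.20"},
    "db01": {"os": "Linux", "ip": "10.199.1.30"},
}

# Each alert keeps its scope (which nodes it covers) separate from its
# trigger (when it fires). Only the scope is used for this report.
ALERTS = {
    "High CPU on Windows in 10.199.1.x": {
        "scope": lambda n: n["os"] == "Windows"
        and ip_address(n["ip"]) in ip_network("10.199.1.0/24"),
        "trigger": lambda m: m["cpu"] > 80,
    },
    "Low disk on Linux": {
        "scope": lambda n: n["os"] == "Linux",
        "trigger": lambda m: m["disk_free_pct"] < 10,
    },
    "Node down": {
        "scope": lambda n: True,
        "trigger": lambda m: not m["responding"],
    },
}


def scope_report(nodes, alerts):
    """Return (alert -> nodes, node -> alerts), ignoring trigger conditions."""
    nodes_by_alert = {name: [] for name in alerts}
    alerts_by_node = {name: [] for name in nodes}
    for alert_name, alert in alerts.items():
        for node_name, attrs in nodes.items():
            if alert["scope"](attrs):
                nodes_by_alert[alert_name].append(node_name)
                alerts_by_node[node_name].append(alert_name)
    return nodes_by_alert, alerts_by_node


nodes_by_alert, alerts_by_node = scope_report(NODES, ALERTS)
for alert_name, members in nodes_by_alert.items():
    print(f"{alert_name}: {len(members)} node(s) in scope")
```

The same pass over the data answers both versions of the question, and the per-alert counts are the numbers that feed staffing and pager-rotation conversations.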


The answer to this question has implications for staffing, shift allocation, pager rotation, and even the number of alerts a particular team may approve for production.


The way you go about building this information is going to depend heavily on the monitoring solution you are using.


In general, agent-based solutions are better at this because trigger logic (in the form of an alert name) is usually pushed down to the agent on each device, and thus can be queried (both “Hey, node, what alerts are on you?” and “Hey, alert, which nodes have you been pushed to?”).


That's not to say that agentless monitoring solutions are intrinsically unable to get the job done. The more full-featured monitoring tools have options built-in.


Reports that look like this:



Or even resources on the device details page that look like this:




Houston, We Have a Problem...


What if it doesn't, though? What if you have pored over the documentation, opened a ticket with the vendor, visited the online forums and asked the greatest gurus up on the mountain, and come back with a big fat goose egg? What then?


Your choices at this point still depend largely on the specific software, but generally speaking there are 3 options:


  • Reverse-engineer the alert trigger and remove the actual trigger part

Many monitoring solutions use a database back-end for the bulk of their metrics, and alerts are simply a query against this data. The alert trigger queries may exist in the database itself, or in a configuration file. Once you have found them, you will need to go through each one, removing the parts which comprise the actual trigger (e.g., CPU_Utilization > 80%). This will likely necessitate your learning the back-end query language for your tool. Difficult? Probably, yes. Will it increase your street cred with the other users of the tool? Undoubtedly. But once you've done it, running a report for each alert becomes extremely simple.
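As a rough illustration of what "removing the trigger part" looks like, the sketch below uses Python's built-in sqlite3 module with an invented `nodes` table. The table, column names, and queries are assumptions made for the example; your tool's real schema and query language will differ.

```python
import sqlite3

# In-memory database standing in for the monitoring tool's back-end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (name TEXT, vendor TEXT, ip TEXT, cpu_utilization REAL);
    INSERT INTO nodes VALUES
        ('web01', 'Windows', '10.199.1.15', 35.0),
        ('web02', 'Windows', '10.199.1.16', 92.0),
        ('db01',  'Linux',   '10.199.1.30', 88.0);
""")

# The alert as a tool might store it: scope AND trigger in one WHERE clause.
ALERT_QUERY = (
    "SELECT name FROM nodes "
    "WHERE vendor = 'Windows' AND ip LIKE '10.199.1.%' "
    "AND cpu_utilization > 80"
)

# The same query with the trigger clause removed by hand: scope only.
SCOPE_QUERY = (
    "SELECT name FROM nodes "
    "WHERE vendor = 'Windows' AND ip LIKE '10.199.1.%'"
)

firing_now = [row[0] for row in conn.execute(ALERT_QUERY + " ORDER BY name")]
in_scope = [row[0] for row in conn.execute(SCOPE_QUERY + " ORDER BY name")]

print("Would alert right now:", firing_now)
print("Could ever alert:     ", in_scope)
```

The scope-only query is the one the Linux team manager is asking for: it lists every device the alert could fire on, whether or not anything is currently wrong.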


  • Create duplicate alerts with no trigger


If you can't export the alert triggers, another option is to create a duplicate of each alert that has the “scope” portion, but not the trigger elements (so the “Windows machines in the 10.199.1.x subnet” part but not the “CPU_Utilization > 80%” part). The only recipient of that alert will be you, and the alert action should be something like writing to a logfile with a very simple string (“Alert x has triggered for Device y”). Every so often (every month or quarter), fire off those alerts and then tally up the results so that recipient groups can slice and dice them.
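Tallying that logfile is straightforward once the string format is fixed. The following is a small sketch that assumes every line follows the “Alert x has triggered for Device y” pattern described above and that device names contain no spaces; the sample lines are invented.

```python
import re
from collections import Counter

# Matches the simple log string written by the trigger-less duplicate alerts.
LOG_LINE = re.compile(
    r"Alert (?P<alert>.+) has triggered for Device (?P<device>\S+)"
)


def tally(lines):
    """Count scope hits per alert and per device from the log lines."""
    per_alert, per_device = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            per_alert[match["alert"]] += 1
            per_device[match["device"]] += 1
    return per_alert, per_device


# Invented sample data; in practice, read these lines from the logfile.
sample = [
    "Alert High CPU has triggered for Device web01",
    "Alert High CPU has triggered for Device web02",
    "Alert Node down has triggered for Device web01",
    "some unrelated line",
]
per_alert, per_device = tally(sample)
print(per_alert.most_common())
print(per_device.most_common())
```

The two counters correspond to the two halves of the question: how many devices each alert covers, and how many alerts cover each device.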


  • Do it by hand

If all else fails (and the inability to answer this very essential question doesn't cause you to re-evaluate your choice of monitoring tool), you can start documenting by hand. If you know up-front that you are in this situation, then it's simply part of the ongoing documentation process. But most times it's going to be a slog through existing alerts, writing down the trigger information. Hopefully you can take that trigger info and turn it into an automated query against your existing devices. If not, then I would seriously recommend looking at another tool. Because in any decent-sized environment, this is NOT the kind of thing you want to spend your life documenting, and it's also not something you want to live without.


What Time Is It? Beer:o’clock

After that last meeting, not to mention the whole day, you are ready to pack it in. You successfully navigated the four impossible questions that every monitoring expert is asked (on more or less a daily basis): Why did I get that alert, Why didn't I get that alert, What is being monitored on my systems, and What alerts might trigger on my systems? Honestly, if you can do that, there's not much more that life can throw at you.


Of course, the CIO walks up to you on your way to the elevator. “I'm glad I caught up to you,” he says, “I just have a quick question...”


Stay tuned for the bonus question!

Related Resources

SolarWinds Lab Episode 24 - Web-based Alerting + Wireless Heat Maps, Duplex Mismatch Detection & More



Tech Tip:  How To Create Intelligent Alerts Using Network Performance Monitor



New Features & Resources for NPMv11.5



Recommended Download: Network Performance Monitor


Hello Thwack-community,


For the month of May, I will be the Ambassador for the Systems Management Community.


First off, I would like to provide some background about me. My name is Jan Schwoebel and I'm on Twitter as @MindTheVirt and write the blog www.MindTheVirt.com - Mind The Virtualization. I scored my first job in IT back in 2007 starting out as a junior consultant, managing customer systems and providing decision-making support. Over the last 4+ years I have spent time in technical support positions, specializing in virtualization and storage systems.


Today, I would like to start a discussion with you regarding managing virtualized systems. As the years progress, virtualization has become mainstream and today, many servers and applications are virtualized. An increasing number of companies are even starting to run 100% of their systems on VMware ESXi and Microsoft Hyper-V. The reasons to virtualize 100% of servers and applications, or your whole datacenter, range from being green and reducing the carbon footprint to ease of deployment of new servers and systems.


However, as it becomes easier to deploy new servers, switches and applications, it becomes more complex to manage all these systems efficiently and be aware of any issues which might arise. Often, we are not aware of how many snapshots a VM has, if we need to run a snapshot consolidation, how scalable the current infrastructure is, or what application is creating a bottleneck. Every other week a new company appears with a product promising to simplify server and data management.


Since I work in technical support, I only hear from customers once it is too late and they have hit some issue or limitation. As Kevin O’Leary on Shark Tank always says: “There must be a better way”.

Indeed, there must be a better way and I would love to hear from you. What are you doing to avoid support calls? How do you manage your virtualized infrastructure efficiently? What products, workflows and techniques are you using and why?

Last week at its Build Developer Conference and this week at Ignite, Microsoft introduced a broad range of new technologies. In recent years, Microsoft has become a more agile and dynamic company. In order for you and your organization to take advantage of this rapid innovation, your organization needs to keep up with the change, and quickly adapt to new versions of technology, like Windows 10 or SQL Server 2016. Or maybe you work with open source software like Hadoop and are missing out on some of the key new projects like Spark or the newer non-MapReduce solutions. Or perhaps you are using a version of Oracle that doesn’t support online backups. It’s not your fault; it’s what management has decided is best.


As an IT professional it is important to keep your skills up to date. In my career as a consultant, I have the good fortune to be working with software vendors, frequently on pre-release versions, so it is easy for me to stay up to date on new features. However, in past lives, especially when I worked in the heavily regulated health care industry, it was a real challenge to stay on top of new features and versions. I recently spoke with a colleague there and they are still running eight-year-old operating systems and RDBMSs.


So how do you manage these challenges in your environment? Do you do rogue side projects (don’t worry, we won’t share your name)? Or do you just keep your expert knowledge of old software? Do you pay for training on your own? Attend a SQL Saturday or Code Camp? What do your teammates do? Do you have tips to share for everyone on staying current when management thinks “we are fine with old technology”?

Last week, Omri posted a blog titled, What Does APM Mean to You? Personally, I think it means several things, but it really got me thinking about security issues related to APM and how they are of high concern in today’s IT world. Systems and application environments are specifically prone to denial of service attacks, malware, and resource contention issues caused by remote attacks or other miscellaneous security issues.


I've always looked at continuous application or systems monitoring as something that goes hand-in-hand with security monitoring. If SysAdmins are able to provide security insights, along with systems and application performance, it will only benefit the security and operations team.  After all, IT as a whole works best when teams interface and collaborate with each other.


It’s not ideal to rely on application performance monitoring software for IT security, but such tools are certainly designed with some basic features that cover security-related use cases and complement your existing IT security software.


Here are some key security-related use cases you get visibility into using application and systems monitoring software.


Check for important updates that should be applied

Forgetting to install an OS or hardware update may put your servers and apps at risk. Your apps may be prone to attacks from malicious software and other vulnerabilities. OS updates will ensure such vulnerabilities are corrected immediately when they are discovered. In addition, you should report on the number of critical, important, and optional updates that are not yet applied to the server.  Remember, you can also view when updates were last installed and correlate that time period to performance issues.  Sometimes these updates cause unexpected performance impacts.



Keep an eye on your antivirus program

Monitor the status of your antivirus: whether it is installed, and whether key files are out of date. If you fail to check that your antivirus software is up and running, you increase your chances of security issues.


Ensure your patch updates are installed

Collect information related to patch updates, and answer questions like: are they installed, what’s their severity, and by whom and when were they installed? You install patches so that security issues, programs, and system functionality can be fixed and improved. If you fail to apply patches once an issue has been detected and fixed, hackers can leverage this publicly available information and create malware for an attack.



View event logs for unusual changes

Monitor event logs and look for and alert on potential security events of interest. For example, you can look for account lockouts, logon failures, or other unusual changes. If you don’t have other mechanisms for collecting log data, you can simply leverage some basic log collection, such as event logs, syslog, and SNMP traps. You can also use these for troubleshooting.
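As a simple illustration of this kind of check, the Python sketch below flags accounts with repeated logon failures or any lockout. It relies on the Windows Security log event IDs 4625 (failed logon) and 4740 (account locked out); the event dictionaries, account names, and the threshold of three failures are assumptions made for the example, not anything a particular monitoring product prescribes.

```python
from collections import Counter

FAILED_LOGON = 4625     # Windows Security log: an account failed to log on
ACCOUNT_LOCKOUT = 4740  # Windows Security log: a user account was locked out


def find_suspicious(events, threshold=3):
    """Return accounts with >= `threshold` failed logons or any lockout."""
    failures = Counter()
    flagged = set()
    for event in events:
        if event["id"] == FAILED_LOGON:
            failures[event["account"]] += 1
        elif event["id"] == ACCOUNT_LOCKOUT:
            flagged.add(event["account"])
    flagged.update(acct for acct, count in failures.items() if count >= threshold)
    return sorted(flagged)


# Invented, already-parsed events; a real collector would supply these.
events = [
    {"id": 4625, "account": "alice"},
    {"id": 4625, "account": "alice"},
    {"id": 4625, "account": "bob"},
    {"id": 4625, "account": "alice"},
    {"id": 4740, "account": "carol"},
    {"id": 4624, "account": "bob"},  # successful logon, ignored here
]
suspicious = find_suspicious(events)
print(suspicious)
```

In practice the threshold would be tuned to your environment and evaluated over a time window, which this sketch leaves out for brevity.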



Diagnose security issues across your IT infrastructure

Troubleshoot security issues by identifying other systems that may have common applications, services, or operating systems installed. Say a security issue occurs with an application or website: you can quickly identify which systems were in fact affected by searching for all servers that are related to that website or application.



While these are just a few use cases, tell us how you use your APM software. Do you use it to monitor key system and app logs, do you signal your IT security teams when you see something abnormal, or do you rely on an APM tool for basic security monitoring? Whatever the case is, we’re curious to learn from you.

My first experience in the IP domain was a shock!


I had moved from the optical transport domain in an operator to the IP department.


As an optical guy, I used a Network Management System (NMS) for all tasks, including configuration, fault and performance measurements. Above all, I liked the nice Graphical User Interface (GUI) of the NMS.


However, I found that in the IP world, the Command Line Interface (CLI) is used for everything, from provisioning to troubleshooting. CLI rules in the IP domain.


“CLI is the tool for Engineers”, I was told.


OK, fine! Maybe this stuff seemed strange to me because I personally do not like the user interface of the CLI, or because I came from an optical background.


Irrespective of the user interface, and with all the functionality that CLI provides, from my perspective CLI is not the ideal tool for configuration. First, it focuses on a single box, i.e. configuring box by box, which is cumbersome. Second, it is prone to human error, and because of such errors troubleshooting sometimes takes considerable time. And lastly, it is vendor specific, so changing to another vendor's box requires a totally different skill set to configure it.


Therefore, as an operator, in my view, there is a need for a more flexible way of configuring and provisioning services. The focus should move away from “box configuration” towards “network configuration”. Also, in this age of emerging technologies like SDN and NFV, where the NMS is the primary focus, CLI will simply block innovation.


Network configuration is a major part of the operators' OPEX. Studies put it around 45% of the total TCO of the network.


CLI has a place today because the management protocol, SNMP, is itself not ideal for service provisioning. That is why operators use SNMP primarily for monitoring, not for configuration.


Both CLI and SNMP also fail to support another important requirement for large, complex service provider networks: they do not support a transactional mode for network configuration.


A transaction enables multiple configurations to take effect as one unit or to fail completely (all or none). To clarify this very important point, take the example of an IPTV service that involves configuring one router, two switches, two firewalls and a billing system. A transactional protocol applies the configuration on all involved network elements or on NONE. This is beneficial because if configuration validation fails on even one network element, the configuration is not applied on any of the others. This means that a configuration would never be implemented partially on some network elements. This is the essence of “network configuration” as we discussed earlier.
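The all-or-none behavior can be sketched in a few lines of Python. This is a toy model only: the `Element` class, its methods, and the `iptv_vlan` setting are hypothetical stand-ins, not any real device API or protocol implementation.

```python
class Element:
    """Hypothetical stand-in for a router, switch, firewall, or billing system."""

    def __init__(self, name, accepts=True):
        self.name = name
        self.accepts = accepts  # whether validation succeeds on this element
        self.running = {}       # configuration currently in effect

    def validate(self, change):
        return self.accepts

    def apply(self, change):
        previous = dict(self.running)
        self.running.update(change)
        return previous

    def restore(self, previous):
        self.running = previous


def configure_all_or_none(elements, change):
    """Apply `change` to every element, or leave every element untouched."""
    # Phase 1: validate everywhere before touching anything.
    if not all(e.validate(change) for e in elements):
        return False
    # Phase 2: apply, rolling back already-changed elements on any error.
    undo = []
    try:
        for e in elements:
            undo.append((e, e.apply(change)))
    except Exception:
        for e, previous in reversed(undo):
            e.restore(previous)
        return False
    return True


# The IPTV example: one router, two switches, two firewalls, a billing system.
iptv = [Element("router1"), Element("switch1"), Element("switch2"),
        Element("fw1"), Element("fw2", accepts=False), Element("billing")]
ok = configure_all_or_none(iptv, {"iptv_vlan": 300})
print(ok, [e.running for e in iptv])
```

Because one firewall rejects the change during validation, nothing is applied anywhere, which is exactly the partial-configuration problem a transactional mode avoids.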


So do we have to live with SNMP and CLI for network configuration, forever?




NETCONF/YANG, developed by the IETF for network management, has a single focus: making network configuration as easy as possible. The IETF learned from its experience with SNMP what could be improved and approached the new protocol from the ground up. It is purpose-built for configuring networks.


NETCONF is the management protocol, primarily for network configuration, while YANG is a text-based data modeling language designed to be used with NETCONF. Both are needed for complete, flexible service provisioning in IP networks.


There are FOUR main features of NETCONF YANG:


  1. Support for transactionality: Configurations can be applied to multiple network elements as one transaction that either succeeds everywhere or is applied nowhere.
  2. Get configuration feature. This is a distinct advantage compared to SNMP. With SNMP a backup configuration is available, but it is polluted with operational data (alarms, statistics); with NETCONF one can retrieve just the configuration data.
  3. Vendor device independence. NETCONF can be used as a standard configuration protocol for any vendor. The vendor’s box will sequence the configurations and execute them. This sequencing is internal to the vendor’s box and NETCONF does not need to be aware of it.
  4. Multiple network elements can be configured at one time, thus saving time when configuring the network.
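To give a feel for the "get configuration" feature, here is a minimal Python sketch that builds a NETCONF `<get-config>` request using only the standard library. The base namespace and element names follow RFC 6241; the helper name `get_config_rpc` and the message-id value are arbitrary choices for this example, and actually sending the request over an SSH session to a device is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Base NETCONF namespace defined in RFC 6241.
NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"


def get_config_rpc(message_id, source="running"):
    """Build a NETCONF <get-config> request for the given datastore."""
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": str(message_id)})
    get_config = ET.SubElement(rpc, f"{{{NETCONF_NS}}}get-config")
    src = ET.SubElement(get_config, f"{{{NETCONF_NS}}}source")
    ET.SubElement(src, f"{{{NETCONF_NS}}}{source}")
    return rpc


# Serialize with the NETCONF namespace as the default (unprefixed) namespace.
ET.register_namespace("", NETCONF_NS)
request = ET.tostring(get_config_rpc(101), encoding="unicode")
print(request)
```

The request asks only for the contents of the `running` configuration datastore, which is the point of feature 2: configuration data comes back without alarms or statistics mixed in.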


Therefore, in summary, NETCONF is the right solution for solving network management issues in a standard way. It is the next generation of network management protocol, which will reduce the time it takes an operator to provision services and will help in applying multiple configurations to multiple network elements at one time.


Now it is your turn to tell me:


  1. How do you feel about CLI as a network configuration tool? Would you like to live with it forever?
  2. What issues, if any, do you face using CLI?
  3. Do you think NETCONF can deliver better than SNMP/CLI?


Would love to hear your opinion!

Happy Month of May everyone!

I wanted to talk to you about a larger topic in the realm of IT Security, Network Security, or the general purpose 'security' space as it were...

The image below was a slide I stole from myself (thanks me!) from a presentation I've delivered at some conferences over the past few months, titled, "Is your IT Department Practicing Security Theater"

You might remember a similarly titled post I did back in January "Are you Practicing Security Theater in IT"

And while that post itself was not a panacea for all matters of security, it certainly did inspire both the presentation I delivered and some of the points contained here.


So, let's discuss for a moment...


Screen Shot 2015-05-01 at 10.38.25 AM.png


What exactly is Checkbox vs Checkbook Security?


The way I was looking at it initially is that most organizations, especially budget-constrained or regulatory-driven ones, are faced with the delicate decision to 'check a box', whether the answer solves their problem or not.


An example of that is organizations which are required to implement logging and monitoring solutions. Oftentimes they'll just get some run-of-the-mill Syslog server, have it collect all of the data and then archive it. Someone will pretend to go and review the logs every now and then, and they can officially check the box saying WE HAVE LOGGING AND MONITORING!

Sure, they TECHNICALLY do, but do they really? Will they be able to provide a backtrack history should an event occur and correlate it? Perhaps. Will they be able to detect something happening in flight and mitigate it? Yea, no. Does that make it right? It does not, but does it check the box technically? Absolutely 'sort of', depending upon the rules they're required to follow.


But what does that mean for you and me? Say I checked the box within a reasonable budget, even though merely checking the box doesn't provide any real value to the organization. What is the long-term impact?

The rub there is exactly that...  A checkbox without efficacy will definitely require you to open your Checkbook later on, whether to really resolve the problem, or due to loss of business, money or otherwise.


That's why I broke this list down in this scenario as a series of the 'checkbox' vs the 'checkbook'. That's not to say that adopting something in the Checkbook column will cost more than something in the checkbox column (sometimes it MAY, but it doesn't have to).

It really comes down to figuring out a strategy that works best for you and your business.


But just as none of this is a panacea, this is also not an exhaustive list of 'vice versa' possibilities. I'd love your insight into whether you agree with these approaches, and to hear about situations where you've seen this be effective (I love personal stories! I have a fair share of my own). Also let me know if there are other situations which aren't included here but should be addressed.

Share the love, spread the knowledge, let's all be smarter together!


Great to be back, Thwack Community!


Ambassador @cxi signing off! <3
