
Whose Fault Is It Anyway?

MVP

Hi there! I’m michael stump, a technology consultant with a keen focus on virtualization and a strong background in systems and application monitoring. I hope to spark some discussion this month on these topics and more.


Last month, I published a post on my personal blog about the importance of end-to-end monitoring. To summarize, monitoring all of the individual pieces of a virtualization infrastructure is important, but it does not give you all of the information you need to identify and correct performance and capacity problems. Just because each individual resource is performing well doesn’t mean that the solution as a whole is functioning properly.


This is where end-to-end monitoring comes in. You’re likely familiar with all of the technical benefits of e2e monitoring. But let’s talk about the operational benefits of this type of monitoring: reducing finger-pointing.


In the old days of technology, the battle lines between server and network engineers were well-understood and never crossed. But with virtualization, it’s no longer clear where the network engineer’s job ends and the virtualization engineer’s job begins. And the storage engineer’s work is now directly involved in both network and compute. When a VM starts to exhibit trouble, the finger-pointing begins.


“I checked the SAN, it’s fine.”

“I checked the network, it’s fine.”

“I checked vSphere, it’s fine.”


Does this sound familiar? Do you run into this type of finger-pointing at work? If so, share a story with us. How did you handle the situation? Does end-to-end monitoring help this problem?

63 Comments
MVP

Hi there,

Great topic. I just want to point out, or rather add, that true end-to-end monitoring includes monitoring the object between the monitor and the chair. Now, if the object is dumb and does not include an API, then that object needs to be replaced.

MVP

Yup, it sure does (end-to-end monitoring would play a great role). Sorry, I can't share an instance of the same.

Level 13

We've had exactly those finger-pointing conversations many times.

So far, our server/virtualization admins control vSphere and all of its components, including the vSwitch. All we can say from the network side is that we (network operations) see no errors on the trunk we are providing to the hosts and that the requested VLANs are present on the specified ports. We can't do any troubleshooting or make any statement about whether the correct VLAN is being presented to the guest.

End-to-end monitoring couldn't have done any good in some of these instances, as the issue was caused by an incorrect configuration brought about by incorrect knowledge.

MVP

I see this happen often. Have you looked into the Cisco Nexus 1000V? Network operations teams like the 1000V because it gives them control over the virtual networking and avoids the situation you described, where you lose visibility into the networking once it leaves your physical switch.

MVP

Upgrading users is always an option!

Level 14

wbrown just became one of my heroes! "I'm sorry, Mr. Customer, I can't tell you why your application doesn't perform well. I don't have access to my entire network."

Level 14

The "other" team doesn't want to give up control...so we're left in the dark.

MVP

The only problem here is the "other" mentality. In the not-too-distant future, you won't be able to tell a network engineer from a virtualization engineer from a server engineer. Drawing battle lines on that old organizational structure is inherently bad.

Level 13

We may be moving to UCS chassis, and we've been given logins to UCS Manager for the few already installed, so hopefully we can eliminate some of those situations.

MVP

UCS makes things slightly better, since it's RBAC-friendly. But it also reinforces the server/network/storage team isolation that I think is holding many organizations back from modernization. And aside from controlling traffic through the Fabric Interconnects down to the vNICs, you still can't see what's happening when a vSS or vDS is in play.

Level 13

Hmmm, good to know.

Thanks.

Level 14

Not saying I disagree (especially as a network/security engineer with a Linux cert), but there are two sides to the story as well. One IT pro can't know everything (jack of all trades, master of none), but an organization of specialists can be costly. There is a fine line between integrated teams and true silos, and I really don't think people in general have fully grasped that concept. Management wants a black-and-white strategy for moving forward, but all they have to work with is a canvas of varying shades of gray.

And the same is true on the business side: you don't want someone in Accounts Receivable also handling Accounts Payable...money starts to disappear...with no way of knowing where.

But, human nature being what it is, people want it now...and they don't stop and think the problem through thoroughly enough to understand that what they are actually asking for is not what they are going to get.

Just my humble opinion...

Level 10

I think we have all been there. It seems to fall back on the network guys more than anyone. If you don't know what the issue is, it's either a) the network or b) the firewall.

One thing that I have run into in the past is that server people, network people, and apps people all want different tools that provide different things. So you run into the situation where the server/VM guys are watching SolarWinds for issues, the network team is watching Nagios, and the apps team is looking at Windows logs. Everyone sees the same situation, just from a different pane of glass.

MVP

That's a great point. It's exactly why I like to see a combination of tools in use for complex infrastructures. Monitor your storage, compute, and network individually with whatever tools the engineers prefer. But having a single tool that can collect performance stats on the entire solution is beneficial because it reveals how each component affects the infrastructure.

Level 17

Interesting topic for discussion. My question back to you is this: who is seeing signs of a problem?

In other words, if everyone checks their dashboards and nothing is showing up as a potential bottleneck...who is lodging a complaint in the first place? Are we talking about an end user that simply calls to complain that things are "running slow"?

I'd also add that if the end user is having issues with an application that is supported by a database (SQL Server, Sybase, DB2, Oracle, etc.) then you are going to want to have insight into what is happening inside of the engine. Without that insight everyone is going to be guessing as to the root cause and possible fix. There are plenty of scenarios that can happen inside of a database engine (the simplest example is a blocking transaction) that won't trigger any alert for SAN, network, or VM admins.

In our shop we recognized this need to have a clear view of as many layers as possible, specifically with regards to database performance. I didn't want to spend one minute of my day trying to tune a query if I could see that a VM host was over-committed for CPU. I'd work with the VM admins to get the host issues corrected first before trying to alter or tune the database code or schema.

Again, great topic!
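
As a rough illustration of the blocking-transaction example a couple of paragraphs up, here is a minimal sketch of a check that would catch the kind of problem no SAN, network, or VM dashboard will ever flag. It assumes SQL Server and the pyodbc library; the connection string is a hypothetical placeholder, not any particular product's monitor.

```python
# Minimal sketch: surface blocked sessions in SQL Server that component-level
# monitoring (SAN, network, VM) would never flag. Assumes pyodbc and a SQL
# Server instance; the connection string below is a placeholder.
import pyodbc

BLOCKING_QUERY = """
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
"""

def report_blocking(conn_str: str) -> None:
    """Print any sessions currently blocked by another session."""
    with pyodbc.connect(conn_str) as conn:
        rows = conn.cursor().execute(BLOCKING_QUERY).fetchall()
        if not rows:
            print("No blocked sessions.")
            return
        for row in rows:
            print(f"session {row.session_id} blocked by {row.blocking_session_id} "
                  f"({row.wait_type}, {row.wait_time} ms): {(row.sql_text or '')[:80]}")

if __name__ == "__main__":
    # Hypothetical DSN; replace with your own.
    report_blocking("DSN=prod_sql;UID=monitor;PWD=secret")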

MVP

The interfaces between storage, compute, and network (and DBMS and applications...) are the bits that are left out most often. The storage system might be fine, and the blade servers might be fine. But then you have a failure at the fabric level and realize that you weren't monitoring your FC switches. That's just an example; I'm sure you could contribute others.

I like what you said about having a clear view of as many layers as possible. Does that help to avoid the blame game? Does it speed up the time to resolution for your issues?

Level 10

We have the unfortunate burden with our group that if it stops working, no matter the situation, it's still our fault, so we have to get it fixed.

.::Example::.

     Boss: "I understand our remote site has a T1 down, why haven't you fixed it yet?"
     Me: "But... but... there's a fiber cut I can't do anything about that..AT&T has to run new lines."
     Boss: "Mmm hmm... well then why did we not expect this to be something that could have happened and implemented a back up system like wireless. Sounds like you have some work to do."
     Me: "Aye aye captain..."

We've actually been discussing end-to-end monitoring anyway, just as one more thing we can do to make sure that it's actually working, and not just that the components themselves are in good shape.

Level 10

Truly, a user upgrade is always an option.

MVP

Pretty sure we should all be running user 3.0 at this point. 1.0 is EoL.

Level 11

Ah, the age-old IT debate...

So many problems start elsewhere but migrate their way to the network team because either the server or the applications team couldn't figure out the problem. Once it arrives on my desk, I toss in a bit of Wireshark/Observer here and a bit of common sense there, and generally find the issue is related to a misconfigured application or server. As they say, the packets don't lie.

Generally though, if the requestor can quantify their complaint, we can come to a root cause (or trash the ticket) before having to point too many fingers. Amazingly, our IT staff works well together, and we can generally come to a solution without too much finger-pointing.

What is great is when you are called into your CIO's office and he rails about why you have not implemented end-to-end monitoring and demands a reason, and you can either lie or say, "Your refusal to issue a mandate and provide resources."

MVP

"Why don't you know everything that you don't know?"

                                                                          - mgmt

MVP

I think it's easy to blame the network first because the network is one of those rare resources that EVERYTHING consumes. If a device isn't on the network, what good is it? The network becomes the lowest common denominator, and therefore the "obvious" source of the problem. I don't agree with this at all; I'm just wondering if it might be part of the reason people are so quick to interrogate the network engineers when it's really a storage problem (for example).

Level 15

Funny enough, the only place we had these 'discussions' was one where our network team was geographically separated from the server team(s) by several states... and we had FULL end-to-end monitoring.

Communication is key!

Level 15

I try to think about the end-to-end and user experience wherever it's possible to measure. An example is a NAS share that had an outage: all green on our storage monitoring, and it turned out to be a DNS issue. Still, we needed a way to show that legitimate user outage, or our monitoring becomes a limited tool. Hence we now have dont_delete_me.txt files on each share, so file existence monitors can report DNS or any other connection issues, at least from our poller (see the sketch below). Anything else is likely between the keyboard and the user.

Same with IP SLA, NetFlow, WPM, etc.
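
For anyone curious what the canary-file check described above might look like outside of a monitoring product, here is a minimal sketch under the same idea; the UNC paths below are hypothetical placeholders.

```python
# Minimal sketch of the canary-file idea: poll a known file on each share so
# that DNS, connectivity, or permission problems surface even when the storage
# array itself reports green. Paths below are hypothetical.
from pathlib import Path

CANARY_FILES = [
    r"\\nas01\projects\dont_delete_me.txt",
    r"\\nas01\home\dont_delete_me.txt",
]

def check_shares(paths: list[str]) -> list[str]:
    """Return the list of canary files that could not be read."""
    failures = []
    for p in paths:
        try:
            if not Path(p).is_file():
                failures.append(p)
        except OSError:          # name resolution failure, timeout, permissions, etc.
            failures.append(p)
    return failures

if __name__ == "__main__":
    for path in check_shares(CANARY_FILES):
        print(f"ALERT: cannot reach {path} from this poller")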

Level 21

Problem is, user 3.0 has a bug and there isn't a hotfix yet!

Level 21

It seems no matter how much monitoring we add, we still frequently encounter problems that are not detected by the monitoring system until they cause something to go down. Ultimately, it always goes back to the logs!

MVP

Yup. Lately I've seen problems where engineers needed to go back to the logs, but by the time they got around to reviewing them, the logs had been overwritten. Log monitoring is now being discussed with great passion at this particular shop.
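
One minimal sketch of a way around the overwritten-logs problem is to ship log lines off the box before local rotation eats them, assuming a central syslog collector is available; the log path and collector address below are placeholders.

```python
# Minimal sketch: tail an application log and forward each new line to a
# central syslog collector so it survives local rotation/overwriting.
# The log path and collector address are hypothetical.
import logging
import logging.handlers
import time

LOG_PATH = "/var/log/app/app.log"          # hypothetical application log
COLLECTOR = ("syslog.example.com", 514)    # hypothetical central collector

forwarder = logging.getLogger("log_forwarder")
forwarder.setLevel(logging.INFO)
forwarder.addHandler(logging.handlers.SysLogHandler(address=COLLECTOR))

def follow(path: str):
    """Yield new lines appended to a file, similar to `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)                        # start at the current end of file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(1)

if __name__ == "__main__":
    for line in follow(LOG_PATH):
        forwarder.info(line)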

MVP

This is awesome. Sounds like you're more interested in monitoring the services you provide, not just the pieces of the infrastructure. That's really what I'm driving at here.

Level 14

The hotfix for user 3.0 is their children... anytime user 3.0 has an issue, I ask to speak with the 10-year-old in the house, because they understand what I'm telling them to do.


D

Level 14

@michael stump,

I just wanted to put a thank you out here...this has been a great post for sparking communication among peers!

D

Level 10

This was common until we deployed SolarWinds and it showed us where to fix our networking, storage, and alerting issues.

MVP

Hey, thanks!

MVP

Tell me more. What SW products are you using, and how? I always like to hear about success stories (mainly so I can steal borrow your ideas).

Level 9

SolarWinds (usually NPM and SAM) has been my attempted solution to this problem on a few occasions too.

For me it tended to be more along the lines of Application Support teams blaming servers, networking or storage for Apps issues, without having done any real investigation of their own. This led me down the path of custom reports, alerts and views in SolarWinds - this would let them check the status of the components under their control more easily before escalating to other teams.

This usually leads to fewer telephone calls telling me the server or network is down, when it was simply a case of restarting a database or application service.

Level 8

Hmm, great information. Thanks.

Level 15

Roy,

Do you have a sample of how you are presenting this information to your application teams? I am creating Application Summary pages now, and it's good to get others' hard-learned lessons in early.

Level 7

Cool Stuff

Level 12

To exacerbate the finger-pointing even more, throw in separate companies. I used to work for an outsourcing company as a server/virtualization engineer. However, the company I was contracted to kept the networking and storage teams in-house. So, any time a questionable issue came up, it was no longer finger-pointing between teams, but finger-pointing between companies. As a former employee of the "parent" company, I always tried to treat us as one team and work together instead of pointing fingers. Ultimately, that is going to be the best way for both organizations to be successful, in my opinion. However, for this to work, all parties need to buy into this philosophy, which was not always the case.

Level 9

Sorry bluefunelemental, I'm not currently on either of the sites I had in mind when I wrote my previous post and don't have anything useful to hand.

Most typically, I'd build a custom view and use SAM component monitors to expose the status of critical Windows services, along with more typical stuff like server and network status. I might even visualize it somehow with Network Atlas. Just as critically, I'd create alerts based on the status of the components and send those to application-specific distribution lists, and even include instructions on how to fix the common issues in the alert emails (the general idea is sketched after this comment).

If I'm back on either site anytime soon, I'll try to take some screenshots, etc., and get back to you.
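
For readers without SAM to hand, here is a minimal, generic sketch of the same idea: check a few critical Windows services and email the application team's distribution list with fix instructions when one is down. The service names, SMTP host, and addresses are hypothetical placeholders, and this is not the SolarWinds implementation.

```python
# Minimal sketch: check critical Windows services via `sc query` and mail the
# application team with remediation instructions when one is down.
# Service names, SMTP host, and addresses are hypothetical placeholders.
import smtplib
import subprocess
from email.message import EmailMessage

CRITICAL_SERVICES = {
    "MSSQLSERVER": "Restart via services.msc or: net start MSSQLSERVER",
    "W3SVC": "Restart IIS with: iisreset /start",
}

def service_running(name: str) -> bool:
    """Return True if `sc query` reports the service as RUNNING (Windows only)."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return "RUNNING" in result.stdout

def alert(down: dict[str, str]) -> None:
    """Email the application team's distribution list with fix instructions."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(down)} critical service(s) down"
    msg["From"] = "monitoring@example.com"
    msg["To"] = "app-team@example.com"
    msg.set_content("\n".join(f"{svc} is down. Fix: {fix}" for svc, fix in down.items()))
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    down = {svc: fix for svc, fix in CRITICAL_SERVICES.items()
            if not service_running(svc)}
    if down:
        alert(down)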

Level 13

I'm glad this is finally a topic. There are many end-to-end monitoring products out there, none of which are from SolarWinds... yet, I hope. You really need more information than what can be gleaned from SNMP and WMI in order to piece together the entirety of the story. I have run many such systems in the past for various companies, using various vendors' software, all of which were a big pain to keep up and running smoothly. There was never a set-it-and-forget-it attitude towards those systems.

Also, being a network engineer: IT'S NOT THE NETWORK!

Level 9

Yeah... We're still too small to have separate Network Engineers and Virtualization Engineers. We're down to a department of just two people now -- me (Senior IT Technician) and my boss (IT Director) -- as our other two technicians quit a few weeks back. So we don't really have the problem of finger pointing because that would be me pointing the finger back at myself. However, it is still common that I switch back and forth between the two roles as if playing chess with myself just because I miss pointing fingers.

Level 11

We (the communications and security team) work closely with the storage and systems administrators, so there usually isn't a problem with us getting together and working a problem to resolution. Typically, we run into the most problems when working with the developers, who seem to have a great deal of knowledge about their particular process but exhibit very little understanding of the e2e workflow. Around here, the joke is that any problem is always conveyed as a communications (network) problem, even though we are mostly able to locate the issue somewhere else.

MVP

It's funny to think that people (and yes, sometimes that means network engineers) spend their entire careers connecting heterogeneous devices and networks to enable communication, but can't walk down the hall to talk to their counterparts.

MVP

Totally agree! This is where IT shops can mature from monitoring just things, to monitoring services. I've seen Exchange outages, for example, that knocked people offline. Meanwhile, all of the servers, storage, and network resources that support Exchange have nice green lights in the monitoring tools. If the lights are green, but the service is down, what are you really monitoring?
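
To make the "green lights, dead service" point concrete, here is a minimal sketch of a service-level check for the Exchange example: speak SMTP to the mail service itself rather than trusting the component dashboards. The hostname below is a hypothetical placeholder.

```python
# Minimal sketch of a service-level check: the mail service is only "up" if it
# actually answers an SMTP EHLO, regardless of how green the server, storage,
# and network dashboards look. Hostname is a hypothetical placeholder.
import smtplib
import sys

def mail_service_up(host: str, port: int = 25, timeout: int = 10) -> bool:
    """Return True only if the SMTP service answers an EHLO successfully."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            code, _ = smtp.ehlo()
            return 200 <= code < 300
    except (OSError, smtplib.SMTPException):
        return False

if __name__ == "__main__":
    host = "mail.example.com"
    if mail_service_up(host):
        print(f"{host}: mail service answering")
    else:
        print(f"{host}: components may be green, but the service is down")
        sys.exit(1)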

MVP

Without giving too much away, this is along the lines of next week's post: hybrid engineers. Can anyone survive in today's IT environment with stove-piped skillsets? Can you be an effective network engineer if you're completely unfamiliar with server, storage, and virtualization concepts and practices?

What's that saying about how pointing a finger at someone means three fingers pointing back at you?

MVP

Sounds to me like you've got a great relationship with your peers, and therefore a great work environment. But those developers, man. They're the worst.

Level 12

I concur... michael stump

Level 17

Oh yes, lots of examples of where you find out there was something you missed monitoring. And then you add that piece to your monitoring toolbox, and before you know it you are monitoring everything possible (and still likely to miss something!) You can also get to a point where you find you are spending more time administering the monitoring application(s) than you are administering the systems themselves. Having insight into as many layers as possible certainly does speed up the time to resolution.

There is only one way to avoid the "blame game," and that is to build trust. When your customers trust in you and your abilities, then blame isn't a concern. Sure, things will break. And sure, those things may very well be your responsibility. But if your customers have trust in your abilities, then they aren't likely to want to blame you. And if you have trust in them, then you aren't likely to blame others.

It sounds like Nirvana, I know, but when there is that much trust, everyone is likely to understand that with thousands of moving parts there is always a chance something doesn't function at 100%. These things happen, blame isn't a concern, and working together to find the root cause becomes the top priority, not CYA by placing blame.

Level 9

And this is why I am working towards a software development job.