Application Troubleshooting: Making Molehills Out of Mountains

Level 12

When:           August 2005

Where:          Any random organization experimenting with e-commerce

Employee:         I can’t access the CRM! We might lose a deal today if I can’t send that quote.

Admin:               Okay, check the WAN, check the LAN, and the server crm1.primary.

Junior:                All fine.

Admin:               Great. Restart the application service. That should solve it.

Junior:                Yes, Boss! The app service was down. I restarted it and now CRM is back up.

When:             August 2015

Where:            Any random organization that depends on e-commerce

Sales Rep:           Hey! The CRM is down! I can’t see my data. Where are my leads?! I’m losing deals!

Sales Director:     Quick! Raise a ticket, call the help desk, email them, and cc me!

Help desk:           The CRM is down again! Let me assign it to the application team.

App team:            Hmm. I’ll reassign the ticket to the server guys and see what they say.

SysAdmin:           Should we check the physical server? Or the VM instance? Maybe the database is down.

DB Admin:           Array, disk, and LUN are okay. There are no issues with queries. I think we might be fine.

Systems team:     Alright, time to blame the network!

Net Admin:           No! It’s not the network. It’s never the network. And it never will be the network!

Systems team:     Okay, where do we start? Server? VM? OS? Apache®? App?

See the difference?

App deployment today

Today’s networks have changed a lot. There are no longer a few well-established points of failure like there were when networks were flat. Today’s enterprise networks are bigger, faster, and more complex than ever before. While these networks deliver more services to more users more efficiently, the added complexity has also increased the time it takes to pinpoint the cause of a failure, let alone resolve it.

For example, let’s say a user complains about failed transactions. Where would you begin troubleshooting? Keep in mind that you’ll need to check the Web transaction failures, make sure the server is not misbehaving, and confirm that the database is healthy. Don’t forget the hypervisors, VMs, OS, and the network. Add to that the switching between multiple monitoring tools, windows, and tabs, trying to correlate the information, working out what depends on what, coordinating with various teams, and more. All of this increases the mean time to repair (MTTR), which means increased service downtime and lost revenue for the enterprise.

Troubleshoot the stack

Applications are no longer standalone entities installed on a single Windows® server. Application deployment relies on a system of components that must perform in unison for the application to run optimally. A typical app deployment in most organizations looks like this:

[Image: a typical application stack diagram]

When an application fails to function, any of these components could be to blame. If a hypervisor fails, you must troubleshoot every VM it hosts and every application running on those VMs that may have failed along with it. Where would troubleshooting begin under these circumstances?

Ideally, the process would start with finding out which entity in the application stack has failed or is in a critical state. Next, you determine that entity’s dependencies on the other components in the stack. For example, let’s say a Web-based application is slow. A savvy admin would begin troubleshooting by tracking Web performance, then move to the database, and on to the hosting environment, which includes the VM, the hypervisor, and the rest of the physical infrastructure.
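To make that walk concrete, here is a minimal sketch of stack-aware troubleshooting in Python. It models each layer as a node with a health check and a dependency, then walks from the application downward to find the deepest failing layer. All component names and check results here are hypothetical, and a real stack is a graph of dependencies rather than the simple chain used in this sketch:

```python
from typing import Callable, Optional

class Component:
    """One layer of the application stack, with a health check
    and an optional dependency on a deeper layer."""
    def __init__(self, name: str, check: Callable[[], bool],
                 depends_on: Optional["Component"] = None):
        self.name = name
        self.check = check          # returns True if this layer is healthy
        self.depends_on = depends_on

def find_root_cause(component: Component) -> Optional[str]:
    """Walk the dependency chain and report the deepest unhealthy layer."""
    culprit = None
    node = component
    while node is not None:
        if not node.check():
            culprit = node.name     # keep walking: a deeper failure wins
        node = node.depends_on
    return culprit

# Hypothetical stack: web app -> database -> VM -> hypervisor -> hardware.
# The lambdas stand in for real health checks; here we simulate a VM failure
# that cascades up through the database and the web app.
hardware   = Component("physical server", lambda: True)
hypervisor = Component("hypervisor",      lambda: True, hardware)
vm         = Component("VM",              lambda: False, hypervisor)
database   = Component("database",        lambda: False, vm)
web_app    = Component("web app",         lambda: False, database)

print(find_root_cause(web_app))  # -> "VM"
```

The point of the sketch: even though the user-visible symptom is at the web app, walking the dependencies fingers the VM, so the first restart attempt lands on the right layer instead of the loudest one.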

To greatly reduce MTTR, begin troubleshooting at the level of your application stack. This will help move your organization closer to the magic three nines for availability. To make stack-based troubleshooting easier, admins can adopt monitoring tools that support correlation and dependency mapping, also known as the AppStack model of troubleshooting.
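For a sense of scale, "three nines" (99.9% availability) allows only about 8.76 hours of downtime per year; a quick back-of-the-envelope check:

```python
# Allowed downtime per year for a given availability target.
hours_per_year = 24 * 365            # 8,760 hours in a (non-leap) year
availability = 0.999                 # "three nines"
downtime_hours = hours_per_year * (1 - availability)
print(f"{downtime_hours:.2f} hours of allowed downtime per year")  # ~8.76
```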

Learn more at Microsoft Convergence 2015.

If you would like to learn more, or see an AppStack demo, visit SolarWinds at booth 116 at Microsoft Convergence in Barcelona.


9 Comments
MVP

Between the documentation required today (ticketing) and the complexity of the service (multiple servers hosting the "application"), troubleshooting is most certainly not the same.

Level 12

Add to that all the separate silos within IT departments. Some have duplicate monitoring and logging systems: one area uses one tool while another uses a different one. Some couldn't care less about tools that can monitor other areas of IT, even if those tools could take the blame away from their own area.

This is why I am all for the App stack.

Level 12

True. More than a single pane of glass, AppStack is a single tab for everything in your network.

Level 21

Yeah, I definitely second what Jfrazier says about the complexity today. 15 years ago, when I started working with HP OpenView, we had just a small handful of physical servers with only one application per server, connected to a switch and then a few routers, and that was about it. Now I need a map to find any one specific physical device in our infrastructure, and that doesn't even touch the complexity of the storage and virtual infrastructures. Even our firewall and storage systems have virtual firewalls and virtual filers running inside them; it's layers upon layers of complexity.

Level 12

And I think there is still a lot more network evolution to come.

Level 10

Molehills out of mountains indeed. Physically identifying the device is sometimes all you need to do as a form of troubleshooting. Passed-down information often doesn't help, especially when the reporter is a novice.

As a Net Admin, I agree with the troubleshooting flow. It seems other departments may not be as well informed as mine about their products. Maybe because I use Orion and they don't?

But it's often the case that the network gets blamed. Later, when we've proven the network is good (and wasted too much time proving it), we may find out about a dead SAN drive, a hung service, database issues, human error in some management solution, etc.

That's where having multiple Orion modules breaks the blame game and gets folks out of the habit of wasting cycles to see who can get to MTTI (Mean Time To Innocence) first.

Level 8

These comments are quite interesting. As a Net Admin, it is important to know where to start troubleshooting so we don't end up in a loop.

Level 20

Ricky is right about having enough modules to complete the AppStack to really start getting a good picture.