Application Troubleshooting: Making Molehills Out of Mountains

Level 12

When:           August 2005

Where:          Any random organization experimenting with e-commerce

Employee:         I can’t access the CRM! We might lose a deal today if I can’t send that quote.

Admin:               Okay, check the WAN, check the LAN, and the server crm1.primary.

Junior:                All fine.

Admin:               Great. Restart the application service. That should solve it.

Junior:                Yes, Boss! The app service was down. I restarted it and now CRM is back up.

When:             August 2015

Where:            Any random organization that depends on e-commerce

Sales Rep:           Hey! The CRM is down! I can’t see my data. Where are my leads?! I’m losing deals!

Sales Director:     Quick! Raise a ticket, call the help desk, email them, and cc me!

Help desk:           The CRM is down again! Let me assign it to the application team.

App team:            Hmm. I’ll reassign the ticket to the server guys and see what they say.

SysAdmin:           Should we check the physical server? Or the VM instance? Maybe the database is down.

DB Admin:           Array, disk, and LUN are okay. There are no issues with queries. I think we might be fine.

Systems team:     Alright, time to blame the network!

Net Admin:           No! It’s not the network. It’s never the network. And it never will be the network!

Systems team:     Okay, where do we start? Server? VM? OS? Apache®? App?

See the difference?

App deployment today

Today’s networks have changed a lot. There are no longer a few well-established points of failure like there were when networks were flat. Today’s enterprise networks are bigger, faster, and more complex than ever before. While these networks deliver more services to more users more efficiently, the added complexity has also increased the time it takes to pinpoint the cause of a failure, let alone resolve it.

For example, let’s say a user complains about failed transactions. Where would you begin troubleshooting? Keep in mind that you’ll need to check the Web transaction failures, make sure the server is not misbehaving, and confirm that the database is healthy. Don’t forget the hypervisors, VMs, OS, and the network. Add to that the switching between multiple monitoring tools, windows, and tabs, trying to correlate the information, working out what depends on what, coordinating with various teams, and more. All of this increases the mean time to repair (MTTR), which means increased service downtime and lost revenue for the enterprise.

Troubleshoot the stack

Applications are no longer standalone entities installed on a single Windows® server. Application deployment relies on a system of components that must perform in unison for the application to run optimally. A typical app deployment in most organizations looks like this:

[Image: a typical application stack diagram]

When an application fails to function, any of these components could be to blame. If a hypervisor fails, you must troubleshoot every VM it hosts and every application running on those VMs that may have failed along with it. Where would troubleshooting begin under these circumstances?

Ideally, the process would start with finding out which entity in the application stack has failed or is in a critical state. Next, you determine that entity’s dependencies on the other components in the stack. For example, let’s say a Web-based application is slow. A savvy admin would begin troubleshooting by tracking Web performance, then move to the database, and on to the hosting environment, which includes the VM, the hypervisor, and the rest of the physical infrastructure.
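To make that walk concrete, here is a minimal sketch of stack-aware troubleshooting in Python. It models each layer as a node with a health check and a dependency, then walks from the application downward to find the deepest failing layer. All component names and check results here are hypothetical, and a real stack is a graph of dependencies rather than the simple chain used in this sketch:

```python
from typing import Callable, Optional

class Component:
    """One layer of the application stack, with a health check
    and an optional dependency on a deeper layer."""
    def __init__(self, name: str, check: Callable[[], bool],
                 depends_on: Optional["Component"] = None):
        self.name = name
        self.check = check          # returns True if this layer is healthy
        self.depends_on = depends_on

def find_root_cause(component: Component) -> Optional[str]:
    """Walk the dependency chain and report the deepest unhealthy layer."""
    culprit = None
    node = component
    while node is not None:
        if not node.check():
            culprit = node.name     # keep walking: a deeper failure wins
        node = node.depends_on
    return culprit

# Hypothetical stack: web app -> database -> VM -> hypervisor -> hardware.
# The lambdas stand in for real health checks; here we simulate a VM failure
# that cascades up through the database and the web app.
hardware   = Component("physical server", lambda: True)
hypervisor = Component("hypervisor",      lambda: True, hardware)
vm         = Component("VM",              lambda: False, hypervisor)
database   = Component("database",        lambda: False, vm)
web_app    = Component("web app",         lambda: False, database)

print(find_root_cause(web_app))  # -> "VM"
```

The point of the sketch: even though the user-visible symptom is at the web app, walking the dependencies fingers the VM, so the first restart attempt lands on the right layer instead of the loudest one.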

To greatly reduce MTTR, begin troubleshooting at the level of your application stack. This will help move your organization closer to the magic three nines for availability. To make stack-based troubleshooting easier, admins can adopt monitoring tools that support correlation and dependency mapping, also known as the AppStack model of troubleshooting.
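For a sense of scale, "three nines" (99.9% availability) allows only about 8.76 hours of downtime per year; a quick back-of-the-envelope check:

```python
# Allowed downtime per year for a given availability target.
hours_per_year = 24 * 365            # 8,760 hours in a (non-leap) year
availability = 0.999                 # "three nines"
downtime_hours = hours_per_year * (1 - availability)
print(f"{downtime_hours:.2f} hours of allowed downtime per year")  # ~8.76
```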

Learn more at Microsoft Convergence 2015.

If you would like to learn more, or see an AppStack demo, visit SolarWinds at booth 116 at Microsoft Convergence in Barcelona.


9 Comments
MVP

Between the documentation required today (ticketing) and the complexity of the service (multiple servers hosting the "application"), troubleshooting is most certainly not the same.

Level 12

Add to that all the separate silos within IT departments. Some have duplicate monitoring and logging systems: one area uses one tool while another uses a different one. Some couldn't care less about tools that can monitor other areas of IT, even if those tools could take the blame away from their own area.

This is why I am all for the App stack.

Level 12

True. More than a single pane of glass, AppStack is a single tab for everything in your network.

Level 21

Yeah, I definitely second what Jfrazier says about the complexity today. 15 years ago, when I started working with HP OpenView, we had just a small handful of physical servers with only one application per server, connected to a switch and then a few routers, and that was about it. Now I need a map to find any one specific physical device in our infrastructure, and that doesn't even touch the complexity of the storage and virtual infrastructures. Even our firewall and storage systems have virtual firewalls and virtual filers running inside them; it's layers upon layers of complexity.

Level 12

And I think there is still a lot more network evolution to come.

Level 10

Molehills out of mountains indeed. Physically identifying the device is sometimes all you need to do as a form of troubleshooting. Passed-down information often doesn't help, especially when the reporter is a novice.

As a Net Admin, I agree with the troubleshooting flow. It seems other departments may not be as well informed as mine about their products. Maybe because I use Orion and they don't?

But it's often the case that the network gets blamed. Later, when we've proven the network is good (and wasted too much time proving it), we may find out about a dead SAN drive, a hung service, database issues, human error in some management solution, etc.

That's where having multiple Orion modules breaks the blame game and gets folks out of the habit of wasting cycles to see who can get to MTTI (Mean Time To Innocence) first.

Level 8

These comments are quite interesting. As a Net Admin, it is important to know where to start troubleshooting so we don't end up in a loop.

Level 20

Ricky is right about having enough modules to complete the AppStack to really start getting a good picture.