I will say this: our organization is still in its infancy regarding monitoring. With that said, our current state is understandable: we more or less monitor each application ad hoc as it becomes important or a pain point. Maybe we monitor a couple of important services, or add an HTTP monitor for a service page, an IP SLA from a specific point, etc. We are even monitoring the collective memory of one particular 3rd party application because it has a memory leak, so when it hits a customizable threshold we kick off alerts to the appropriate personnel/departments. You can clearly see the memory climb as the leak progresses when pulling historical information on the application monitor itself.
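To make the memory-leak monitor a bit more concrete, here is a minimal sketch of the idea: add up the memory of every process that belongs to the app and compare the total against warning and critical thresholds. The process names, sample values, and thresholds are all made up for illustration, and in a real setup the samples would come from WMI or SNMP polling rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class ProcessSample:
    """One polled process: its image name and memory use in MB (hypothetical shape)."""
    name: str
    memory_mb: float


def collective_memory_mb(samples, process_names):
    """Sum memory for every sampled process whose name belongs to the app."""
    wanted = {n.lower() for n in process_names}
    return sum(s.memory_mb for s in samples if s.name.lower() in wanted)


def evaluate(total_mb, warn_mb, crit_mb):
    """Map the collective total onto a status using customizable thresholds."""
    if total_mb >= crit_mb:
        return "CRITICAL"
    if total_mb >= warn_mb:
        return "WARNING"
    return "OK"


# Hypothetical poll result; explorer.exe is not part of the app and is ignored.
samples = [
    ProcessSample("vendorapp.exe", 900.0),
    ProcessSample("vendorsvc.exe", 350.0),
    ProcessSample("explorer.exe", 120.0),
]
total = collective_memory_mb(samples, ["vendorapp.exe", "vendorsvc.exe"])
status = evaluate(total, warn_mb=1000.0, crit_mb=1500.0)
print(total, status)
```

Keeping the thresholds as parameters is what lets them be tuned as you learn how fast the leak grows between restarts.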
So as I help my organization mature its monitoring, I do have an end goal in mind:
- custom application web pages built specifically for individual applications
- each custom page includes running processes (.exe's), memory specific to the app, response times where possible, dependencies, etc.
- custom app pages available not just to everyone, but tailored to the specific departments that are responsible for and/or support that specific application
I can see us (foresight, vision) maturing to the point where there is an actual EMS (Enterprise Monitoring Systems) Team that services most if not all departments within the organization. That team would build specific areas (collections of web pages) for each area of responsibility so that those departments can participate in and contribute to the support of those key areas. It would work closely with, if not within, the Infrastructure Team or Network Operations (NOC), and very closely with the Support Desks and Help Desks, the forward-facing departments for the whole Enterprise. Enabling business at the speed of business!
Our application monitoring falls into two camps: customer-facing and internal-facing.
On the customer-facing side, I feel fairly comfortable saying that we have a very mature application monitoring system, but it has come from years of fine-tuning and development work, and a lot of collaboration between the development team and the IT team. The main advantage we have is that we are a software shop to begin with, and our business is writing both a client application and the supporting server infrastructure pieces. So very early on, our development team dedicated a person to writing a check piece for each server component we developed. We use Nagios for all of our critical application infrastructure monitoring, and it has served us well. We have weekly meetings to discuss false alerts, and our daily operations live and die based on the alerts we see from Nagios. Obviously this approach has significant costs (basically a full-time programmer who focuses on monitoring), which not everyone can do. But for our business needs, it was/is the right decision to ensure a quality experience for our customers.
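For anyone who has not written a Nagios check before, this is roughly the shape of one, as a hedged sketch and not a description of the checks mentioned above. It relies only on the standard Nagios plugin conventions: exit code 0 is OK, 1 is WARNING, 2 is CRITICAL, 3 is UNKNOWN, and the one-line output may carry performance data after a `|`. The `probe` callable, the thresholds, and the decision to treat a failed probe as CRITICAL are assumptions made for the example.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(probe, warn_s, crit_s):
    """Run a probe that returns a response time in seconds.

    Returns (exit_code, output_line). A probe that raises is reported as
    CRITICAL here; some teams would prefer UNKNOWN for that case.
    """
    try:
        elapsed = probe()
    except Exception as exc:
        return CRITICAL, f"CRITICAL - probe failed: {exc}"

    if elapsed >= crit_s:
        code = CRITICAL
    elif elapsed >= warn_s:
        code = WARNING
    else:
        code = OK

    # Perfdata format: label=value[UOM];warn;crit
    line = (
        f"{LABELS[code]} - response time {elapsed:.3f}s"
        f" | time={elapsed:.3f}s;{warn_s};{crit_s}"
    )
    return code, line


def broken_probe():
    """Stand-in for a server component that refuses connections."""
    raise ConnectionError("connection refused")


# Stand-in probes; a real check would time a request to the server component.
fast_code, fast_line = run_check(lambda: 0.25, warn_s=1.0, crit_s=3.0)
slow_code, slow_line = run_check(lambda: 1.5, warn_s=1.0, crit_s=3.0)
down_code, down_line = run_check(broken_probe, warn_s=1.0, crit_s=3.0)
print(fast_line)
print(slow_line)
print(down_line)
```

A real plugin would finish by printing the line and calling `sys.exit(code)`, since Nagios reads the status from the process exit code.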
On the internal-facing side, we definitely don't devote the kind of resources we do for customer-facing processes. We tend to keep things as generic as possible, unless we have a specific pain point that we need to keep a closer eye on. At that point we customize, optimize, etc., until we've got a good handle on that particular issue.
So your customer-facing monitoring is sold as part of the complete package your company sells? Then I understand the amount of effort you put into this type of monitoring; it adds to the value and quality of the product you're selling.
Having a developer write a check piece along with the server component he's developed is a very smart move in my view; it enables 'effortless' monitoring, as the templates to monitor the application are very detailed and specific to begin with.
However, I also like your 'we monitor for specific pain points' remark on the internal-facing side of things: it might be the best of both worlds. Use generic monitoring if there isn't any specific problem (or only a low count of issues); dive into the problem with custom-built monitoring as soon as the pain point becomes more obvious and important.
We are fairly immature in our process currently. Our focus is mainly on service uptime. I like the idea of weekly meetings to analyze the alerts of the week and hone them until they are bulletproof.
Looking forward, we are looking at really identifying our supply chain of information and using it to track performance horizontally as opposed to vertically (if that makes sense).
I am very curious to see what others are doing in this space.
Working in a Network Operations Center for a company that provided Video, VoIP, and Internet services to customers in several states, I found that the primary focus almost always went to hardware and link monitoring, but application monitoring was a whole different story. In the later years application monitoring slowly began gaining respect, but it was a very slow process. As bsciencefiction.tv mentioned, the very basics were about as far as it went for us as well.
It is interesting to watch the App Performance industry take off, then kind of slow down, then speed up again. I have read some articles talking about how expensive App Performance monitoring is and how difficult or complex it is to implement & maintain. There are a lot of products out there, and as the space matures and applications, network, and servers converge under one pane of glass, it has proven difficult, in my experience, to build dashboards that make sense to the end users. That is where the complexity and cost come into play. Not to cheerlead for SolarWinds, but I truly have found this product suite very nice to work with in contrast to a few other top systems that I have had the luxury of being exposed to. I don't feel like I need to be a major programming or scripting guru in order to pull this information together for my user base. However, should I choose to go the heavier development route with VB programming, tricky HTML, and/or significant scripting or PowerShell, I have all those options available to me as well.
There are templates for Exchange, Websense, Cisco ACS, etc. IT shops use so many common tools, and being able to collaborate with other like-minded experts to get these templates has proven to be a pleasure. These forums are very active with a lot of very smart individuals, so there is great value in that as well! Going into 2013, I am looking forward to getting started with custom web pages: simple yet clear pages that merge not just hardware health and network health, but also application health onto a powerful Application Details page with information that is applicable to that specific app. I am not even limited to the more common app templates mentioned; I can build an app page for a totally in-house custom app that my company may have built. Put together a custom page and roll it out to, say, the development department that was responsible for developing that custom application, and I will have some very satisfied business partners. Build alerts applicable to that specific app and route them to that team, and I now have the ability to partner with my other departments to help them succeed at their own business objectives, not just mine in my private silo.
For example, say the application resides on an AppServer (which is multiple servers within a VM cluster) and is load balanced.
The App Page has the following:
- VM Host with health metrics
- VM Guest machines (each server in the cluster) and their applicable health metrics
- Network health: metrics for where the host machine connects to the network (switch port or EtherChannel/LACP link), plus the health of the first-hop switch
- Load balancer hardware health as well as application pool metrics/status
- WMI or SNMP: running .exe processes and history for each service the app needs in order to run properly
- Custom Map Image: similar to a detailed Visio drawing showing how all the applicable pieces fit together for a completed transaction from the application or information flow (this shows live status as well) with all hardware/virtual devices along the path
- Historical graphs/charts for all measurable metrics being polled, on the page itself and/or on individual tabs, to show baseline activity so that anomalies can be understood
- SNMP Trap information or Syslogs as well as events written to the Orion event log, as applicable to the app
This provides quite a powerful page, where any health problem affecting the overall functioning of an application is only 1 or 2 clicks away.
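One way to think about how such a page arrives at a single "is the app healthy" answer is a worst-status rollup: each section takes the worst status of its members, and the page takes the worst status of its sections. The sketch below is only an illustration of that idea, not how any particular product computes it; the section names mirror the list above, and the status ordering (UNKNOWN ranked between UP and WARNING) is an assumption.

```python
from enum import IntEnum


class Status(IntEnum):
    """Health states ordered from best to worst (ordering is an assumption)."""
    UP = 0
    UNKNOWN = 1
    WARNING = 2
    CRITICAL = 3


def rollup(statuses):
    """Return the worst status in a group; an empty group counts as UNKNOWN."""
    return max(statuses, default=Status.UNKNOWN)


# Hypothetical poll results for each section of the app page.
page = {
    "VM host": [Status.UP],
    "VM guests": [Status.UP, Status.WARNING],
    "Network": [Status.UP],
    "Load balancer": [Status.UP],
    "Processes": [Status.UP, Status.UP],
}

section_status = {name: rollup(members) for name, members in page.items()}
overall = rollup(section_status.values())

for name, status in section_status.items():
    print(f"{name}: {status.name}")
print(f"Application: {overall.name}")
```

The per-section results are what make the "1 or 2 clicks" part work: the overall status says something is wrong, and the section that carries the same status says where to look.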
The normally steep up-front cost of monitoring (purchasing a solution, customizing and optimizing it to your needs) and the merely indirect benefits it usually provides make it hard for management to invest (unless the value of monitoring has been proven by failing services and applications, obviously).
The sheer number of applications usually found in enterprise environments makes any system that ties all those applications together in a dashboard a pain to manage, thus increasing complexity and cost. Every application has a different way it's best monitored, and every application has different metrics to monitor and different thresholds to trigger on.
It would almost seem that we need some sort of standard way of interacting with applications; a sort of standardized API to gather monitoring data. Oh, wait, I think we have had a protocol for that since the late 1980s, called SNMP. So why is it still so hard? SNMP should be easy to implement, use and customize, and nearly every networked device out there is SNMP-compatible.
We are a service provider, and we offer monitoring as part of several of our services. When we provide application monitoring we generally begin with a canned application template (both ones included with the product and ones we have created on our own) and then modify it as needed for the specific customer's needs. We also create new templates for customers when necessary; it all depends on the needs of the customer.
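The "start from a canned template, then modify per customer" workflow can be sketched as a base definition plus a set of overrides. This is only an illustration of the pattern: the template fields, monitor names, and threshold values below are hypothetical and do not reflect any product's actual template format.

```python
import copy

# Hypothetical canned template; field names are invented for this example.
CANNED_HTTP_TEMPLATE = {
    "name": "Generic HTTP service",
    "poll_interval_s": 300,
    "monitors": {
        "http_response": {"warn_ms": 1000, "crit_ms": 3000},
        "service_process": {"process": "httpd", "min_instances": 1},
    },
}


def customize(template, overrides):
    """Return a new template with overrides merged in, leaving the original untouched.

    Nested dictionaries are merged key by key; any other value is replaced.
    """
    result = copy.deepcopy(template)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = customize(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# One customer needs a tighter critical threshold and an extra monitor.
customer_template = customize(
    CANNED_HTTP_TEMPLATE,
    {
        "name": "Customer A web portal",
        "monitors": {
            "http_response": {"crit_ms": 2000},
            "tls_expiry": {"warn_days": 30},
        },
    },
)
print(customer_template["monitors"]["http_response"])
```

Because the canned template is copied and never edited in place, improvements to it can still be rolled out later to customers who have not overridden the affected settings.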
For our internal stuff we apply the same process: use a canned template and modify as necessary. Internally we also use the application monitoring functionality for some interesting things, such as monitoring our facilities for HVAC status, UPS status, data-center temperature, data-center humidity and, eventually, per-cabinet power usage as well.
Overall, application monitoring is an ever-evolving process. As we learn more about current applications or need to monitor new applications, we are constantly changing current templates and creating new templates.