In a previous article I discussed using an alert system that supports escalation as a way to ensure that a small team (1-6 people) efficiently handles multiple issues concurrently.
That model works well within a single operations team. However, when an entire datacenter is impacted, each IT team follows its own triage process but must also communicate with other teams to coordinate work on shared resources. Sequencing work between teams becomes the critical task.
Your network operations center (NOC) plays a crucial role in triage when three conditions hold: your deployment is large enough to require multiple IT teams; your systems must be highly available to support customer-facing services that the business depends on; and power or another infrastructure resource goes down in your primary datacenter. The NOC staff tracks each alert that lights up on their wall display of all systems at your site.
While each IT team might have its own tool-driven escalation process, the NOC watches all escalations, verifies that the appropriate on-call staff respond within the expected interval, and leads a conference call to discuss progress on each task related to the current operations issues. Simultaneously, the NOC convenes a conference call with all IT team managers to discuss the unfolding crisis and decide at specific intervals whether to switch over support for specific services to a secondary datacenter.
Any business-critical production system ultimately reckons downtime in dollars lost. To expedite triage and switchover decisions, your NOC needs an intelligent notification system that can automatically bring the right people into their conference calls. The sooner the NOC gets the IT teams talking to each other, the sooner they can make informed decisions about what actions need to be taken and in what order. One team completing a certain task before another team is ready can actually set back rather than advance resolution.
Important features of a scalable intelligent notification system are:
- Calendar awareness: Looks up the on-call schedule for each IT team to find the engineer for the current shift.
- Escalation awareness: Looks up each team’s points of escalation to find the decision-making manager for switchover decisions.
- Automatic notification on multiple devices: Targets a specific contact on phone(s), email, and SMS in a specific order.
- Automatic conferencing: Routes the engineer or manager into the relevant conference bridge where the NOC is already waiting.
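
To make these features concrete, here is a minimal Python sketch of how they might fit together. It is an illustration under stated assumptions, not a description of any particular product: the `Contact`, `Shift`, and `Team` types, the `on_call_engineer` and `notification_plan` functions, the sample names, and the bridge identifiers are all hypothetical. The sketch only builds an ordered plan of who to reach, on which device, and for which conference bridge; actually placing calls or sending messages is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Contact:
    name: str
    # Devices in the order they should be tried, e.g. ["phone", "sms", "email"].
    devices: list[str]


@dataclass
class Shift:
    start: time
    end: time
    engineer: Contact


@dataclass
class Team:
    name: str
    shifts: list[Shift]   # on-call calendar (hypothetical, one day repeating)
    manager: Contact      # escalation point for switchover decisions


def on_call_engineer(team: Team, now: datetime) -> Contact:
    """Calendar awareness: find the engineer whose shift covers `now`."""
    t = now.time()
    for shift in team.shifts:
        if shift.start <= shift.end:
            active = shift.start <= t < shift.end
        else:
            # Shift wraps past midnight, e.g. 20:00 to 08:00.
            active = t >= shift.start or t < shift.end
        if active:
            return shift.engineer
    raise LookupError(f"no on-call shift covers {t} for team {team.name}")


def notification_plan(team: Team, now: datetime, bridge: str,
                      role: str = "engineer") -> list[tuple[str, str, str]]:
    """Return (device, contact name, bridge) tuples in the order to try them."""
    # Escalation awareness: managers are looked up separately from engineers.
    contact = team.manager if role == "manager" else on_call_engineer(team, now)
    # Multi-device notification plus conferencing: every attempt carries
    # the bridge the contact should be routed into.
    return [(device, contact.name, bridge) for device in contact.devices]


# Example with made-up data.
network = Team(
    name="network",
    shifts=[
        Shift(time(8), time(20), Contact("Ana", ["phone", "sms", "email"])),
        Shift(time(20), time(8), Contact("Ben", ["sms", "phone"])),
    ],
    manager=Contact("Chris", ["phone", "email"]),
)
plan = notification_plan(network, datetime(2024, 1, 1, 23, 30), "bridge-ops")
for device, name, bridge in plan:
    print(f"notify {name} via {device}, route to {bridge}")
```

In this sketch the NOC would request one plan per affected team for the engineers' bridge and a second plan with `role="manager"` for the managers' bridge, so both calls fill up in parallel.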