
Application Apocalypse

Level 14

Apologies for the doomsday reference, but I think it’s important to draw attention to the fact that business-critical application failures are creating apocalyptic scenarios in many organizations today. As businesses become increasingly reliant on IT infrastructure to host applications and Web services, tolerance for downtime and degraded performance has dropped to almost nil. Everyone wants 100% uptime and superior website performance. Whether applications are hosted on-premises or in the cloud, and whether they are managed by internal IT teams or outsourced to managed service providers, maintaining high availability and application performance is a top priority for every organization.

Amazon Web Services® (AWS) experienced a massive disruption of its services in Australia last week. Across the country, websites and platforms powered by AWS, including some major banks and streaming services, were affected. Even Apple® experienced a widespread outage in the United States last week, causing popular services, including iCloud®, iTunes®, and the iOS® App Store, to go offline for several hours.

I’m not making a case against hosting applications in the cloud versus hosting on-premises. Regardless of where applications are running—on private, public, or hybrid cloud, or in co-location facilities—it is important to understand the impact and aftermath of downtime. Take a look at these statistics, which give a glimpse of the dire impact of downtime and poor application performance:

  • The average hourly cost of downtime is estimated to be $212,000.
  • 51% of customers say slow site performance is the main reason they would abandon a purchase online.
  • 79% of shoppers who are dissatisfied with website performance are less likely to buy from the same site again.
  • 78% of consumers worry about security if site performance is sluggish.
  • A one-second delay in page response can result in a 7% reduction of conversions.

*All statistics above sourced from Kissmetrics, Brand Perfect, Ponemon Institute, Aberdeen Group, Forrester Research, and IDG.
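To put the first figure in perspective, here is a back-of-the-envelope calculation (a sketch only, using the $212,000/hour average cited above; the availability targets are illustrative) of what common uptime levels still cost per year:

```python
# Rough annual downtime cost at a given availability target.
# Uses the $212,000/hour average cited above; targets are illustrative.

HOURLY_COST_USD = 212_000
HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_downtime_cost(uptime_pct: float) -> float:
    """Expected yearly downtime cost at a given availability level."""
    downtime_hours = HOURS_PER_YEAR * (1 - uptime_pct / 100)
    return downtime_hours * HOURLY_COST_USD

# Even "three nines" (99.9%) allows ~8.76 hours of downtime per year,
# which at the cited rate is roughly $1.86 million.
```

The point is not the exact figures but the shape of the curve: each extra "nine" of availability removes an order of magnitude of downtime, and at six figures per hour, that difference dominates most monitoring budgets.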


Understanding the Cost of Application Downtime

  • Financial losses: As the stats above show, customer-facing applications that perform poorly affect online business and potential purchases, often resulting in customers taking their business to competitors.
  • Productivity loss: Overall productivity will be impacted when applications are down and employees are not able to perform their job or provide customer service.
  • Cost of fixing problems and restoring services: IT departments spend hours or even days identifying and resolving application issues, which adds labor costs to the time and effort spent restoring services.
  • Dent in brand reputation: When there is a significant application failure, customers develop a negative perception of your organization and its services, and lose trust in your brand.
  • Penalty for non-compliance: MSPs with penalty clauses included in service level agreements will incur additional financial losses.

Identifying and Mitigating Application Problems

Applications are the backbone of most businesses. Having them run at peak performance is vital to the smooth execution of business transactions and service delivery. Every organization has to implement an IT policy and strategy to:

  1. Implement continuous monitoring to proactively identify performance problems and indicators.
  2. Identify the root cause of application problems, apply a fix, and restore services as soon as possible, while minimizing the magnitude of damage.

It is important to have visibility into application health (on-premises and in the cloud) and into the end-user experience on your websites, but it is equally important to monitor the infrastructure that supports applications, such as servers, virtual machines, and storage systems. Applications often perform sluggishly because of server resource congestion or storage IOPS bottlenecks.
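As a concrete illustration of step 1 above, here is a minimal sketch of a continuous availability and latency probe. This is not any particular product's approach; the endpoint URL, latency budget, and check interval are hypothetical placeholders:

```python
# Minimal availability/latency probe: check an endpoint on a fixed
# interval, compare response time against a budget, and report.
import time
import urllib.request

URL = "https://example.com/health"  # hypothetical endpoint
LATENCY_BUDGET_S = 1.0              # illustrative performance threshold
CHECK_INTERVAL_S = 60

def probe(url: str, timeout: float = 5.0) -> tuple[bool, float]:
    """One availability check: returns (healthy, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            healthy = resp.status == 200
    except OSError:  # URLError and socket timeouts are OSError subclasses
        healthy = False
    return healthy, time.monotonic() - start

def classify(healthy: bool, elapsed_s: float,
             budget_s: float = LATENCY_BUDGET_S) -> str:
    """Map a probe result to an alert status."""
    if not healthy:
        return "DOWN"   # availability failure: escalate immediately
    if elapsed_s > budget_s:
        return "SLOW"   # degraded performance: investigate
    return "OK"

def monitor() -> None:
    """Continuous loop: probe, classify, and report on a fixed interval."""
    while True:
        healthy, elapsed = probe(URL)
        status = classify(healthy, elapsed)
        if status != "OK":
            print(f"{status}: {URL} ({elapsed:.2f}s)")
        time.sleep(CHECK_INTERVAL_S)
```

In practice this is the job of a dedicated monitoring platform that also watches the supporting infrastructure, but the basic shape is the same: probe, compare against a budget, escalate.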

Share your experience of dealing with application failures and how you found and fixed them.

Check out this ebook "Making Performance Pay - How To Turn The Application Management Blame Game Into A New Revenue Stream" by NEXTGEN, an Australia-based IT facilitator, in collaboration with SolarWinds.



Thank you for posting.


One of my first experiences with VMAN while in a POC: our storage team came to us saying they were getting alerts about high I/O on one of the clusters, but they couldn't identify the source. I started out in VMAN at the cluster level and began drilling into the hosts. Once I identified the host, I drilled into the VMs until I found the one with the high I/O. They took this information back to our NOC and found that a VM migration was being performed.

It was a very good use case for the product since we were trying to convince management of the value of it.

I like the thoughts here, and will inject one of my own, which others find trite:  Those who ignore history are doomed to repeat it.

Putting our trust in WAN providers, ISPs, and especially in ASPs referred to as "the cloud" is an extremely risky proposition, given how vulnerable those services are to things completely out of our control.

We can defend against hackers, malware, and human error.  We can build in redundancy to account for natural disasters like earthquakes and hurricanes.  All at a huge price.

But we've done nothing to protect our users and data from space.  I know--"this guy's off his rocker--aliens?  Give me a break."

No, I'm referring to events like the Carrington Flare (Solar storm of 1859 - Wikipedia, the free encyclopedia) that burned telegraph operators and their transmission lines and utility poles.  (Wiki quote:  "Telegraph systems all over Europe and North America failed, in some cases giving telegraph operators electric shocks.[12] Telegraph pylons threw sparks.[13] Some telegraph operators could continue to send and receive messages despite having disconnected their power supplies.[14]")

In 2011, National Geographic published thoughts about the impact of such a solar storm happening today.



Lloyd's of London reportedly estimated that a single Carrington-class flare in 2014 would cost $2.6 trillion, and that it would take ten years to recover from it.

Happily, the United States government is working to coordinate the creation of a power grid that could recover more quickly than ten years.




Even as we are concerned about dollars lost for minutes or a single hour of network downtime, we must also be planning for a serious Disaster Recovery--one that makes a Hurricane Katrina sized impact on electric resources and satellites and data transfer.

Are you prepared?  What would it cost to become prepared?

Would such a flare damage backed-up files on a thumb drive in your desk?  Would copper lines carry induced currents that damage NICs and switch ports?

Do SATA and SSD media have risk of data loss or physical damage?  How about tape backups stored off-site--are they safe, or would they be erased/corrupted?

I'm not advocating putting on your aluminum foil helmet and heading for the 1950's fallout shelter buried in the back yard.  But being aware of actual risks can put you ahead of the game.  Even as our governments advocate having an earthquake kit if you live in that kind of area, or a survival kit if you live where tornadoes or hurricanes or ice storms can cause extended electrical outages, so too should we have a "data survival mode" that is a "down-time-procedure on steroids."

This happened to North American telegraphs in 1859.  It happened to Hydro-Québec's power grid in 1989.  We're foolish not to learn from history, and we're ridiculous if we build an electronic IT infrastructure that remains vulnerable to something history has proven happens repeatedly.

Level 14

Thanks for your comment with the detailed research and references. Very interesting!


Well said, rschroeder!

Level 17

Very nice! Silos and the lack of cross-link and connectivity monitoring cause some of the issues in enterprise environments. Everyone using their own monitoring tool keeps collaboration and information sharing slightly at bay.

Folks have their data, but without the correlating links and the associated nodes, neighbors, servers, and services, everyone starts finger-pointing or demanding proof that every other environment and aspect of connectivity is the issue.

If the data resides in the same platform, and if you have a knack for referencing the app stack, there is insight that points to problem areas where you can start your troubleshooting efforts.

Then consider how many different systems provide amenities and modern living conditions to the inhabitants of a city or county. How long before electronic pumps come back to life? How much longer before ETend is available (I'm thinking most of you don't carry enough cash on hand to survive without ATM access for a few days or more)? What about the lack of contingency for the store clerk who hasn't used a calculator since high school and, with the register no longer showing the amount of change due, can't expeditiously work out your change? What about the receipt?

Municipalities that purchase electricity and water from surrounding governing entities will be out until their suppliers can bring systems back online and verify working order and proper dispersion.

I am pretty sure most cities or counties aren't using SolarWinds to monitor their device connectivity and working order... maybe it's something we should all start writing our city council members to consider and budget for.

A consultant came in, looked at our old CIO, and said, "What's your money-making app?"

Our old CIO told him.

The consultant looked at him and said, "How many users are using the app at this moment, and what do you lose per minute while it's down?"

Our old CIO told him, "I do not know."

Our old CIO became our Old CIO.

As a Business Continuity Professional, a trend that I have seen for a long, long, long time... especially with ERP systems, is that mid-size-to-global businesses invest heavily in high availability and DR-capable infrastructure. These businesses will even do the due diligence of testing on a regular basis. But when it comes to a crisis, many of these businesses will wait until the last possible minute to fail over and operate in Production in DR. They don't have faith.

Cloud Computing is the exciting alternative but I am seeing control issues coming to light when making the decision to move the ERP to it.

Cloud computing is exciting, but it's also a risk.  Proving the failover, the data security, the backups and data integrity . . .  One hopes the ASPs are as secure, reliable, and robust as one assumes they are.

Unfortunately, after a natural disaster or security event, the true vulnerabilities (or strengths) of ASPs are often revealed.


We are in Australia and use AWS. I still don't know if hosting in the cloud is actually cheaper than hosting internally. Most of our data centres have shrunk significantly, but has it all been for the better? And when it fails, you have to sit and wait for someone who doesn't work for you to fix it, whereas if it were your own employee, you could tell them to fix it now!

Level 14

Excellent write-up.  How can we protect against this?  I know we had extensive EMP protections aboard ship while I was in the Navy.  Would those protections be cost-effective in the commercial arena?

Level 20

Also, it's pretty hard to use anything cloud for some kinds of data... you just can't control the security. I know the DoD has been pushing for some certified gov clouds, but I can just see when the first huge leak happens.

Level 20

One of the reasons we have TEMPEST facilities.  Also, some of the newest systems are being built to withstand GPS-contested environments, and the new GPS constellation will soon be going up... Also, we aren't only on the defensive side of this, btw:  Secret California tests to black out GPS for six hours a day | Daily Mail Online

Interesting stuff!  GPS Interference Notam For Southwest - AVweb flash Article

I'm interested to see research into this--thank you for sharing.

My pessimistic side suspects the information learned from these tests will be used for "defensive" purposes in future conflicts, rather than solely seeing how we can survive a major solar flare event.

But I've been wrong before, and I hope I'm wrong here, too.

Level 14

Great thread....

ASPs are a fine way to extend your footprint without the start-up costs and back-end support. Unfortunately, even the strictest due-diligence efforts cannot prevent EMPs, earthquakes, floods, or Mr. Murphy paying an unscheduled visit. Look around... most companies use outsourced payroll providers, retirement (401K) providers, health care portals, even your company's own website. Any of these is a tick away from being catastrophic to your business in the event of a disaster. The best we can do is trust but verify... look at their SOC 1 and SOC 2 reports and ask the hard questions during selection, renewal, and impromptu vendor health checks.

The 21st century is the age where our destiny is controlled by others and yet we are responsible as if we had total control....

A final thought.... all clouds are not fluffy and white.....

Best comment about cloud computing yet:  ".... all clouds are not fluffy and white....."


George_S, you've achieved an "Ehhh!" on your work.  4.0--excellent work!

Level 14

Thank you sir....


Any chance we can just rearrange things so we can leave the mission critical systems alone, and only lose the social media outlets (minus Thwack, of course) and cruddy TV lineups...?

I suppose, when it does happen, we will all have new jobs that day... lol

Level 11

$212,000 an hour.... that does not sound like a good time!

Level 21

This is a great post!

You mention the importance of continuous monitoring, which I agree is key.  It's also important to view the monitoring practice itself as one of continuous improvement.  With each failure we can usually find ways to improve our monitoring to better identify the issue before it becomes an apocalypse.  You have item #2 being to identify the root cause and fix it; item #3 should be to take what you learned in item #2 and improve monitoring.

You're likely better off recommending solar-flare-proof storage, good fire-suppression systems, quick-trip circuit breakers, and insulated/isolated UPSes for when that day comes.

At least if you recommend them, you'll be a hero for being right, even if management chooses to ignore your best advice.

And if the flare doesn't happen while you're there, you'll still have given the best advice you could.

I'd rather see a way for the electrical grid to receive and store, or even bleed off, all the excess juice that will be generated through the lines by a massive flare.  If it were only affordable and safe to do, who knows how long a city or state or nation or continent could run off the stored energy from a single massive flare?

About the Author
Vinod Mohan is a Senior Product Marketing Manager at DataCore Software. He has over a decade of experience in product, technology and solution marketing of IT software and services spanning application performance management, network, systems, virtualization, storage, IT security and IT service management (ITSM). In his current capacity at DataCore, Vinod focuses on communicating the value proposition of software-defined storage to IT teams helping them benefit from infrastructure cost savings, storage efficiency, performance acceleration, and ultimate flexibility for storing and managing data. Prior to DataCore, Vinod held product marketing positions at eG Innovations and SolarWinds, focusing on IT performance monitoring solutions. An avid technology enthusiast, he is a contributing author to many popular sites including APMdigest, VMblog, Cyber Defense Magazine, Citrix Blog, The Hacker News, NetworkDataPedia, IT Briefcase, IT Pro Portal, and more.