
Continuity of Operations in Healthcare IT

Level 10

In my previous blog, I discussed the somewhat unique expectations of high availability as they exist within a healthcare IT environment. It was no surprise to hear the budget approval challenges that my peers in the industry are facing regarding technology solutions. It also came as no surprise to hear that I’m not alone in working with businesses that demand extreme levels of availability of services. I intentionally asked some loaded questions, and made some loaded statements to inspire some creative dialogue, and I’m thrilled with the results!

In this post, I’m going to talk about another area in healthcare IT that I think is going to hit home for a lot of people involved in this industry: continuity of operations. Call it what you want. Disaster recovery, backup and recovery, business continuity, it all revolves around the key concept of getting the business back up and running after something unexpected happens, and then sustaining it into the future. Hurricane Irma just ripped through Florida, and you can bet the folks supporting healthcare IT (and IT and business, in general) in those areas are implementing certain courses of action right now. Let’s hope they’ve planned and are ready to execute.

If your experiences with continuity of operations planning are anything like mine, they evolved in a sequence. In my organization (healthcare on the insurance side of the house), the first thing we thought about was disaster recovery. We made plans to rebuild from the ashes in the event of a catastrophic business impact. We mainly focused on getting back up and running. We spent time looking at solutions like tape backup and offline file storage. We spent most of our time talking about factors such as recovery-point objective (to what point in time are you going to recover?) and recovery-time objective (how quickly can you recover back to that pre-determined state?). We wrote processes to rebuild business systems, and we drilled and practiced every couple of months to make sure we were prepared to execute the plan successfully. It worked. We learned a lot about our business systems in the process, and ultimately developed the skills to bring them back online in a fairly short period of time. In the end, while this approach might work for some IT organizations, we realized pretty quickly that it wasn’t going to cut it long term as the business continued to scale. So, we decided to pivot.
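To make those two terms concrete, here is a minimal sketch of how you might sanity-check an outage against RPO and RTO targets. Every timestamp and target below is hypothetical, not a figure from our environment.

```python
from datetime import datetime, timedelta

# Hypothetical illustration: check an outage against RPO/RTO targets.
# All numbers here are made up for the example.

RPO = timedelta(hours=4)     # max acceptable data loss (time since last good backup)
RTO = timedelta(hours=2)     # max acceptable time to restore service

last_good_backup  = datetime(2017, 9, 10, 6, 0)   # most recent verified backup
outage_started    = datetime(2017, 9, 10, 9, 30)  # when the system went down
estimated_restore = timedelta(hours=3)            # measured in the last restore drill

data_loss_window = outage_started - last_good_backup

print(f"Data at risk: {data_loss_window}  (RPO target: {RPO}) "
      f"-> {'OK' if data_loss_window <= RPO else 'RPO MISSED'}")
print(f"Restore time: {estimated_restore}  (RTO target: {RTO}) "
      f"-> {'OK' if estimated_restore <= RTO else 'RTO MISSED'}")
```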

Then we started talking about the next evolution in our IT operational plan: business continuity. So, what’s the difference, you ask? Well, in short, everything. With business continuity planning, we’re not so much focused on how to get back to some point in time within a given window; instead, we’re focused on keeping the systems running at all costs, through any event. A business continuity strategy costs a whole lot more, but it can be done. Rather than spending our time learning how to reinstall and reconfigure software applications, we spent our time analyzing single points of failure in our systems: the software applications, the processes, and the infrastructure itself. As those single points of failure were identified, we started to design around them. We figured out how to travel a second path in the event the first path failed, to the extreme of building a completely redundant secondary data center a few states away so that a localized event would never impact both sites at once. We looked at leveraging telecommuting to put certain staff offsite, so that if a site became uninhabitable, we had people who could keep the business running. As a result, we largely stopped having to do our drills, because we were no longer restoring systems. We just kept the business survivable.
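The "travel a second path" idea boils down to a health check plus a failover decision. Here is a minimal sketch of that pattern; the hostnames and ports are hypothetical placeholders, not our actual sites.

```python
import socket

# Minimal sketch of "travel a second path if the first fails": try the primary
# site first and fall back to the secondary. Hostnames and ports are hypothetical.

SITES = [
    ("primary-dc.example.org", 443),    # main data center
    ("secondary-dc.example.org", 443),  # redundant data center a few states away
]

def first_reachable(sites, timeout=2.0):
    """Return the first site that accepts a TCP connection, or None."""
    for host, port in sites:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # this path failed; try the next one
    return None

active = first_reachable(SITES)
print(f"Routing traffic to: {active}" if active else "Both paths down - invoke DR plan")
```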

While some of what we did was specific to our environment, many of these concepts can be applied to the greater IT community. I’d love to hear what disaster recovery or business continuity conversations are taking place within your organization. Are you rebuilding systems when they fail, or are you building the business to survive (there is certainly a place for both, I think)?

What other approaches have you taken to address the topic of continuity of operations that I haven’t mentioned here? I can’t wait to see the commentary and dialogue in the forum!

12 Comments
MVP

Good points. So often, people think disaster recovery and business continuity are the same thing - until something happens.

Level 12

We are using both disaster recovery and business continuity combined in our environment.

We have the standard daily/weekly/monthly/yearly tape backups going on. We also have bare-metal disaster recovery built into our backup routine. We are starting to make use of disk-based backups as well, for faster backups and recoveries.
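That daily/weekly/monthly/yearly cadence is essentially a grandfather-father-son rotation. For anyone sketching one out, here is a rough illustration of how a backup date might map to a retention tier; the tier rules and retention notes are assumptions for the example, not this commenter's actual schedule.

```python
from datetime import date

# Sketch of a daily/weekly/monthly/yearly (grandfather-father-son style) rotation.
# The tier rules and retention periods below are illustrative assumptions only.

def backup_tier(d: date) -> str:
    """Classify the backup taken on date d into a retention tier."""
    if d.month == 1 and d.day == 1:
        return "yearly"        # keep e.g. 7 years
    if d.day == 1:
        return "monthly"       # keep e.g. 12 months
    if d.weekday() == 6:       # Sunday
        return "weekly"        # keep e.g. 5 weeks
    return "daily"             # keep e.g. 14 days

for d in [date(2017, 1, 1), date(2017, 9, 1), date(2017, 9, 10), date(2017, 9, 13)]:
    print(d, "->", backup_tier(d))
```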

While it may not be a legitimate business continuity measure, I believe that virtualizing servers helps with this a lot. We have enough capacity in both our main data center and our off-site disaster recovery center to run our entire virtual environment. Granted, it would be cozy with not much room to breathe, but we would maintain full business functions, maybe just a little bit slower. Our remaining business-critical servers are set up to be split between the two centers, and while each one on its own would not do a great job of running the entire load, it would stay up and running, just at a fairly reduced capacity. In essence, if our main data center suddenly turned into a fish tank (we have numerous water and steam pipes running through it, so this is a valid concern), we would be at a reduced functioning capacity, but we would still be functional for the most part. Obviously not every system is redundant in this respect, but those systems are not business/patient critical.

We were running a business continuity system on our business-critical systems that would literally allow us to roll a server back to any point in time in the last few days to a few weeks, but we stopped using it recently. We never had to use it, and never actually tested it. We found out recently that we were not using it properly, and if we had needed it, it would likely have failed miserably and spectacularly.

MVP

Good article

We still do backups of data AND of systems.  Health care records have to be created and stored and made available for varying lengths of time, depending on the health service.  Children's records may need to be stored until they're 21 years old, some patients' procedures have to be available for seven years following the work performed, some records must be available for the entire life of the patient--and maybe even for some years beyond their death!
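As a rough illustration of how varied those retention rules get, here is a small sketch that maps a record type to a purge-eligible date. The rule set and the post-death retention period are assumptions made up for the example, not actual regulatory requirements.

```python
from datetime import date
from typing import Optional

# Illustrative sketch of retention rules like those described above. The rules
# and the post-death retention period are simplified assumptions, not policy
# or legal guidance.

def retention_end(record_type: str, patient_dob: date, service_date: date,
                  date_of_death: Optional[date] = None) -> Optional[date]:
    """Return the earliest date a record could be purged, or None for 'keep for now'."""
    if record_type == "pediatric":
        # keep until the patient turns 21
        return patient_dob.replace(year=patient_dob.year + 21)
    if record_type == "procedure":
        # keep for seven years following the work performed
        return service_date.replace(year=service_date.year + 7)
    if record_type == "lifetime":
        # keep for the patient's entire life, plus an assumed 10 years after death
        if date_of_death is None:
            return None  # patient still living: cannot purge yet
        return date_of_death.replace(year=date_of_death.year + 10)
    raise ValueError(f"unknown record type: {record_type}")

print(retention_end("pediatric", patient_dob=date(2010, 5, 2), service_date=date(2017, 9, 1)))
print(retention_end("procedure", patient_dob=date(1980, 1, 15), service_date=date(2017, 9, 1)))
```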

We have six data centers in a couple of different states--that keeps us geographically diverse enough that we aren't at risk of a single event (flood, fire, tornado, plane crash into a data center, etc.)

More and more storage is moving from tape and SATA to SSD, which surprises me.  But the archival property of SATA drives is poor, and tape can be damaged by heat, magnetism, pressure, and time.  So maybe those dollars-per-GB of SSD space are a good investment.

Off-site storage and rotation of storage materials has always been in play.

We've moved towards the cloud for some services, although PHI isn't out there yet, and I'm not sure it ever will be (due to concerns about security).  More O365 at present, and it's not as reliable or as fast as when we hosted it internally.  But it saves us $6M annually in Microsoft licensing, so we're trying it out.

Our first priority is to be able to immediately help someone who needs it, so DR takes a back seat to Continuity of Operations.  But DR still must happen.

In your second and third paragraphs, I thought about what has been observed during and after Harvey and Irma (and Katrina and Andrew, etc.).  It's plain that DR and Operational Continuity are only on the minds of a few people, and that cutting taxes and lowering costs are on the minds of many others.  I'll mention the elephant in the room and say that when we elect officials based on their claimed intentions of cutting taxes, we often end up with lost services and lowered standards.  I'm thinking of Floridians who opted not to vote for people who could have brought new electrical codes and standards to nursing homes (because those new rules might increase taxes or the costs of building & operating those nursing homes), and who instead may have lowered taxes (or at least not increased them to achieve the better safety standards for nursing homes).  Well, hindsight is 20/20, but will we learn from the tragedy of our oldest relatives' lost lives due to a shell company not having any responsibility for their nursing home's lack of generators for keeping these folks from overheating?  It seems a profit-oriented endeavor, one which I hope won't be continued anywhere else at the expense of safety for the sake of dollars.

Remember:  WE are going to be in those nursing homes one day.  I sure hope they're safer, instead of just more profitable, by the time I get into one.

When you wrote "we drilled and practiced every couple of months" I was impressed.  That's great!  Not ever having to implement your emergency procedures doesn't mean they were wasted.  Just as schools practice fire drills every month, so should businesses.  Fire drills, backups and restores (your backups are only as good as your last successful restore!), etc.  This is insurance, an investment against risk.

When you wrote "we largely stopped having to do our drills" I was disappointed. Practicing restores, practicing emergency procedures, practicing down time procedures--these are still valuable.  I try to design, build, and maintain great networks, but I still tell our care providers that they must regularly practice down time procedures.  Despite my teams' best efforts, the network WILL go down some time. Maybe it's scheduled, maybe it's a disaster, maybe someone just had a little boo-boo . . .  but data will be unavailable, and perhaps lost, if backups and down time procedures aren't understood and fresh in one's mind (and in the servers/tapes!).  Patient safety and care must NOT be impacted by not practicing down time procedures.  Staying familiar with the processes remains important.

It used to be a lot to worry about.  For eighteen months I was the ONLY Network Analyst for dozens of hospitals and scores of clinics.  Lots of times I couldn't sleep through the night.

But then came Network Configuration Manager.

Using NCM to ensure we have the ability to quickly restore/rebuild switches and routers and firewalls is one reason why I can sleep better at night.  I no longer worry about extended downtimes due to fire, storms, theft, or vandalism.  I know I can have a replacement piece of hardware configured within a minute or two of having the hardware in hand.  Deploying one standard switch, router, and firewall hardware platform makes it easy and cheap to keep the minimum amount of spares on the shelf.  If one in production were totally lost, and it isn't in a location where HA / standby spares are already deployed for Business Continuity, it's a short path between discovered outage (thank you NPM!) and restored services (thank you NCM!).
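For readers who don't have NCM, the underlying idea (keep an automated archive of running configs so a spare box can be rebuilt in minutes) can be roughed out in a short script. This sketch uses the third-party Netmiko library, and the device names and credentials are hypothetical; it illustrates the concept, not how NCM itself works.

```python
from datetime import date
from pathlib import Path
from netmiko import ConnectHandler  # third-party library: pip install netmiko

# Rough sketch of the idea NCM automates: archive running configs so a spare
# switch can be rebuilt quickly. Device details and credentials are hypothetical.

DEVICES = [
    {"device_type": "cisco_ios", "host": "core-sw01.example.org",
     "username": "backup", "password": "changeme"},
]

archive = Path("config-archive") / str(date.today())
archive.mkdir(parents=True, exist_ok=True)

for dev in DEVICES:
    with ConnectHandler(**dev) as conn:
        running = conn.send_command("show running-config")
    (archive / f"{dev['host']}.cfg").write_text(running)
    print(f"Saved {dev['host']} config to {archive}")
```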

MVP

I agree with your points, rschroeder, as the healthcare industry has some unique challenges.

I'd throw out there that emergency services (Fire, EMS, Police) have many similar and some unique challenges, and on the continuity side, the ability to fail over to an alternate site on the fly is crucial. Having an alternate dispatch center with the same software/data and radio availability immediately available is not easy.  Think 911 call centers.  These can be for a small city, a large city, a large city with many suburban cities, and/or the county level.  Many of these have to communicate with other entities or call centers to route calls...whether or not they have moved to a backup site needs to be transparent.  In quite a few cases, call transfers are still made via the telephone, especially to adjoining municipalities. Case in point: two fire departments that run mutual aid with us on many calls are dispatched by two other agencies, so it may take an extra five minutes for them to be dispatched because those are individual phone calls to the other alarm centers.

Coordination of local services on an increasingly large scale seems really hard.  Locals may not want to have "higher-ups" interfering or requiring that all locals get identical equipment, radios, procedures, etc. so that they can be coordinated by higher-ups in broader disasters.

Worse, local emergency services folks might know the right equipment to get, and not have budget.

Worse still is when local elected officials have to choose between raising taxes to keep emergency services adequate to the need, or being kicked out of office at election time.

That's what's happening here in Duluth right now. The mayor is great, but she's practical.  She's just shown the public a new budget that has $2.1M in cuts, including cuts to fire and police departments: Duluth Mayor's budget features $2.1 million in cuts - KBJR 6 Your Weather Authority: News, Weather &...

The TV reporters have shown how fire department staffing has remained at 1997 levels while calls for their services have increased dramatically over that time.  It's just about the same for the police department--far fewer staff are available per incident today compared to 1997.

The fire chief's not happy, and I don't blame him.  Duluth fire chief responds to proposed budget cuts - KBJR 6 Your Weather Authority: News, Weather & ...

Level 21

One of the most important pieces of BC/DR is making sure you practice it... on a regular basis.  It is of course important to practice it after it's first set up, but the environment will change over time, people will change over time, and people will forget.  Make sure you practice the BC/DR process on a regular (at least annual) schedule to confirm it all still works.  Also make sure that when making changes to the environment over time, you keep the BC/DR pieces in mind for whatever it is you are changing, so they don't get left out.

MVP

Welcome to my other life...

Level 12

This is a wonderful read, ciscovoicedude

Continuity of Business . . .    It seems that phrase MUST include environmental influences that have already been discovered and documented by best practices for data centers.  Those best practices certainly cover items like:

  • Earthquakes
  • Tsunamis
  • Hurricanes
  • Tornadoes
  • Terrorism
  • Construction / excavation (the leading cause of WAN outages in my environment)
  • Power loss
  • ISP Diversity

If your customers (not to mention your employees) rely on the continuity of your services, surely all of the above items must be considered, and the cost of the prevention/remediation/diversity needed to achieve the desired uptime must be calculated and evaluated.

Compensating for easily identified risk factors, like those above, is a prerequisite to moving forward with business processes.

Once all of the most likely causes of outages/impact are identified, teams can move on to identifying broad regional service areas of impact and their causes, and then down to localized sources of problems.  Obviously that list is going to stay alive and fluid, and it needs to be updated regularly to add new clients and new sources, and to remove those that are no longer present.

Level 12

Very informative

Level 12

I agree, very informative

About the Author
I'm a Unified Communications engineer by trade, but I've got a background in (and passion for) systems management technologies of all kinds.