Recently, on hearing of the AWS outage caused by weather at the Sydney data center, I began thinking about High Availability (HA) and the whole concept of “Build for Failure.” It made me wonder about the true meaning of HA. In the case of AWS, as Ben Kepes correctly stated on a recent Speaking in Tech podcast, a second data center in, for example, Melbourne would have provided failover capacity, which would have alleviated a high degree of consternation.
High Availability is a multi-level conversation, so I thought that I’d break it up into sections: server level, storage level, and cloud data center level.
Remember, Fault Tolerance (FT) is not HA. FT means that the application being hosted remains at 100% uptime, regardless of the types of faults experienced. HA means that the application can endure a fault at some level and recover rapidly, with potentially little to no downtime. FT, particularly in networking and virtual environments, involves a mirrored device that always sits in standby, actively receiving the same changes to the app, storage, etc. at the same time as the primary, and that takes over should the primary device encounter a fault of some sort.
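To make the mirrored-standby idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the concept, not how any vendor actually implements FT, and the names `Node` and `MirroredPair` are hypothetical. Every change is applied to both devices, so the standby can take over without losing anything.

```python
class Node:
    """One device in the pair (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.state = []      # every change this node has received
        self.alive = True

    def apply(self, change):
        self.state.append(change)


class MirroredPair:
    """A primary plus a standby that receives the same changes."""

    def __init__(self):
        self.active = Node("primary")
        self.mirror = Node("standby")

    def write(self, change):
        # If the active node has faulted, the standby takes over first.
        if not self.active.alive:
            self._take_over()
        self.active.apply(change)
        # The standby receives every change alongside the active node.
        if self.mirror is not None and self.mirror.alive:
            self.mirror.apply(change)

    def _take_over(self):
        if self.mirror is None or not self.mirror.alive:
            raise RuntimeError("both nodes have failed")
        # Promote the standby; the pair now runs unprotected.
        self.active, self.mirror = self.mirror, None


pair = MirroredPair()
pair.write("a")
pair.write("b")
pair.active.alive = False    # simulate a fault on the primary
pair.write("c")              # served by the promoted standby
```

After the simulated fault, the promoted standby already holds every earlier change, which is the property that separates FT from a restart-based HA recovery. Note that the pair is unprotected until a new mirror is brought up.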
Server level HA, which is certainly the oldest IT segment in which we’ve struggled with availability, has been addressed in a number of ways. Initially, when we realized that a single server was never going to meet the requirement (and typically this referred to a mission critical app or database), we decided clustering would be our first approach. By building systems in which a pair (or a larger number) of servers ran as tandem devices, we enhanced uptime and granted a level of stability to the application being serviced. That addressed some of the initial issues on platforms vulnerable to patching and other kinds of downtime.
Issues in a basic cluster had to do with things like high availability in the storage, networking failover from a pair to a single host, etc. For example, what would happen in a scenario in which a pair of servers were tied together in a cluster, each with its own internal storage, and one went down? If a single host went down unexpectedly, the two copies of the storage could fall “out of sync,” and data loss could ensue. This “Split Brain” condition is precisely what we’re hoping to avoid. If you lose consistency in your transactional database, oftentimes a rebuild can fix it, but that of course takes precious resources away from day-to-day operations. Even worse, there could be unrecoverable data loss, which can only be repaired with a restore. Assuming that the restore is flawless, how many transactions and how much time were lost during the restore, and from what recovery point were the backups made? So many potential losses here. Microsoft introduced the “Quorum Drive” concept into their clustering software, which offered the ability to avoid “Split Brain” data and brought some cache coherency to an X86 SQL cluster. That helped quite a bit, but still didn’t really resolve the issue.
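The idea behind a quorum drive can be sketched as a simple majority vote. The Python below is a simplified illustration under my own assumptions (two nodes plus the shared drive, one vote each), not Microsoft’s actual arbitration logic, and the function names are hypothetical. A node keeps serving only if it can account for more than half of the votes, so two isolated nodes can never both believe they own the data.

```python
TOTAL_VOTES = 3  # assumption: node A, node B, and the shared quorum drive


def has_quorum(votes_held: int, total_votes: int = TOTAL_VOTES) -> bool:
    """A strict majority is required to keep serving."""
    return votes_held > total_votes // 2


def may_serve(node_up: bool, sees_peer: bool, holds_quorum_drive: bool) -> bool:
    """Decide whether one node is allowed to keep the application online."""
    votes = int(node_up) + int(sees_peer) + int(holds_quorum_drive)
    return node_up and has_quorum(votes)


# The link between the nodes breaks; only node A still holds the drive.
node_a_serves = may_serve(node_up=True, sees_peer=False, holds_quorum_drive=True)
node_b_serves = may_serve(node_up=True, sees_peer=False, holds_quorum_drive=False)
```

In this scenario node A holds two of the three votes and carries on, while node B holds one and stands down, so only one copy of the data keeps taking writes.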
To me, it’s no wonder that so much time passed before the many applications that could easily have been placed onto X86 platforms actually moved there. Mainframes and robust Unix systems, which do cost a great deal to stand up and maintain, had so much more viability in the enterprise, particularly for mission critical and high transaction apps. Note that there are, of course, other clustering tools, for example Veritas Cluster manager, which made clustering an application a more consistent and actually quite a bit more robust process.
Along came virtualization at the X86 level. Clustering happened in its own way, HA was achieved through features like Distributed Resource Scheduling, and because the data typically sat on shared disk, consistency within the data could be ensured. We were also afforded a far more robust way to stand up much larger and more discrete applications, with tasks like adding processor, disk, and memory requiring no more than a reboot of the individual virtual machines within the cluster that made up the application.
This was by no means a panacea, but for the first time, we’d been given the ability to address inherent stability issues on X86. The flexibility of vMotion allowed the backing infrastructure to handle the higher availability of the VMs within the app cluster itself, and removed the internal cluster’s sheer reliance on hardware in network, compute, and storage. Initially, the quorum drive, which needed to be a raw device mapping in VMware, disappeared, which made pure Microsoft SQL clusters more difficult, but as versions of vSphere moved on, these Microsoft clusters became truly viable.
VMware can also support a Fault Tolerant environment for truly mission critical applications. FT has specific requirements, along the lines of doubling the storage onto a different storage volume and doubling the CPU, memory, and VM count onto a different host, because FT involves mirrored devices, whereas HA doesn’t actually follow that paradigm.
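As a rough back-of-the-envelope illustration of that doubling, the arithmetic can be sketched in Python. The `VmSize` type and the sample sizes are hypothetical, and the sketch assumes a straight one-to-one mirror with no overhead, which real environments will not match exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    vcpus: int
    memory_gb: int
    storage_gb: int


def ft_footprint(vm: VmSize) -> VmSize:
    """Total resources for a VM plus its mirror (assumes an exact 2x copy)."""
    return VmSize(vm.vcpus * 2, vm.memory_gb * 2, vm.storage_gb * 2)


# Hypothetical mission critical VM: 4 vCPUs, 16 GB memory, 200 GB storage.
protected = ft_footprint(VmSize(vcpus=4, memory_gb=16, storage_gb=200))
```

The mirrored half has to live on a different host and a different storage volume, so the doubled total is a capacity planning number rather than something a single host absorbs.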
In my next posting, I plan to address Storage as it relates to HA, storage methodologies, replication, etc.