Every company has certain applications that must be managed for high availability. Whether driven by business or regulatory mandates, these applications must be available 24/7 or as close to that as possible. Keeping these applications up and running while performing all of their necessary administration and maintenance is a real challenge. Products that meet recovery requirements for both data and apps are starting to appear; however, they're currently geared toward small- and medium-sized businesses (SMBs) and concentrate mainly on Microsoft Exchange.
In the last five years, data recovery technologies have been introduced that let users recover data to almost any desired previous point in time, shorten recovery times to minutes for even very large data sets and minimize the amount of storage capacity required for data protection tasks. But there's a class of apps for which data recovery by itself isn't sufficient. They require recovery of data and of the app itself in the most automated way possible.
The ultimate objective for any recovery operation is to keep the business running; in the case of an application outage, lessening its impact on revenue or customer service is critical. The key recovery metrics are recovery point objective (RPO), the maximum amount of data loss that can be tolerated, measured in time, and recovery time objective (RTO), the maximum time allowed to restore service. Because these targets differ from one application to the next, it's important to understand the recovery requirements for each app before an appropriate solution can be chosen.
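To make the two metrics concrete, the following minimal sketch checks a protection scheme against an application's RPO and RTO. It assumes that worst-case data loss equals the interval between recovery points and that worst-case downtime is the data restore time plus the application restart time. The class names, field names and figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    rpo_minutes: float  # maximum tolerable data loss, expressed as time
    rto_minutes: float  # maximum tolerable time to restore service


@dataclass
class ProtectionScheme:
    copy_interval_minutes: float  # time between successive recovery points
    restore_minutes: float        # time needed to restore the data
    app_restart_minutes: float    # time needed to bring the app back online


def meets_objective(scheme: ProtectionScheme, objective: RecoveryObjective):
    """Return (meets_rpo, meets_rto) under the simplifying assumptions above."""
    # Worst case: the failure happens just before the next recovery point.
    worst_case_loss = scheme.copy_interval_minutes
    # Downtime covers both data recovery and application recovery.
    worst_case_downtime = scheme.restore_minutes + scheme.app_restart_minutes
    return (
        worst_case_loss <= objective.rpo_minutes,
        worst_case_downtime <= objective.rto_minutes,
    )


# Hypothetical figures: a nightly backup versus near-continuous protection,
# both measured against a 15-minute RPO and a 60-minute RTO.
objective = RecoveryObjective(rpo_minutes=15, rto_minutes=60)
nightly = ProtectionScheme(copy_interval_minutes=1440, restore_minutes=120, app_restart_minutes=30)
continuous = ProtectionScheme(copy_interval_minutes=1, restore_minutes=5, app_restart_minutes=10)
```

With these illustrative numbers, the nightly scheme misses both targets and the near-continuous scheme meets both. Note that the restart time counts against the RTO, which is why data recovery alone may not be enough.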
RTO/RPO requirements
Applications with high-availability requirements usually have very short RPOs and RTOs. For those environments, some form of automated application recovery is often deployed. Two distinct architectures have been available in this space: fault-tolerant computing (FTC) and clustered computing for availability (CCA). FTC uses fully redundant servers with specialized operating systems that run a mirror copy of the application across both "halves" of the server, providing instant and fully transparent recovery in the event of a hardware failure. CCA uses separate server "nodes" connected across a network, along with fault-management software that monitors for failures and can switch an application workload (including an application, its clients and data) to another node in the cluster. Both approaches do a good job of addressing planned and unplanned downtime, allowing companies to manage application availability to very high levels, although FTC platforms usually provide higher overall application availability than CCA architectures.
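The monitor-and-switch behavior of CCA fault-management software can be sketched roughly as follows. This is a toy model, not any vendor's implementation: the `Node` and `Cluster` classes, the `missed_limit` threshold and the heartbeat logic are all assumptions made for illustration. Real cluster software also has to move clients and data along with the application.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    nodes: list            # all Node objects in the cluster
    active: str            # name of the node currently running the workload
    missed_limit: int = 3  # hypothetical: missed heartbeats tolerated before failover
    missed: int = 0        # consecutive heartbeats missed by the active node

    def heartbeat(self) -> str:
        """Process one heartbeat interval and return the active node's name."""
        current = next(n for n in self.nodes if n.name == self.active)
        if current.healthy:
            self.missed = 0
            return self.active
        self.missed += 1
        if self.missed >= self.missed_limit:
            # Switch the workload to the first healthy standby, if one exists.
            standby = next(
                (n for n in self.nodes if n.healthy and n.name != self.active),
                None,
            )
            if standby is not None:
                self.active = standby.name
                self.missed = 0
        return self.active
```

Waiting for several missed heartbeats before switching is a common way to avoid failing over on a momentary network glitch, at the cost of a slightly longer recovery time.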
CCA comes in two flavors: shared disk and shared nothing. In the shared-disk model, two or more nodes share a set of physical disks, which limits these clusters to local configurations. In the shared-nothing model, each node has its own storage, and some form of storage-centric replication is used to keep two physically separate data stores (the source and the target) in sync across a network.
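As a rough illustration of the shared-nothing model, the sketch below keeps a source and a target store in sync by shipping block writes in order. It assumes asynchronous replication, which the description above doesn't specify, and models the network as a simple in-memory queue. The `ReplicatedStore` class and its method names are hypothetical.

```python
from collections import deque


class ReplicatedStore:
    """Toy model of block-level replication between two separate data stores."""

    def __init__(self):
        self.source = {}        # block number -> data on the source node
        self.target = {}        # block number -> data on the target node
        self.pending = deque()  # writes not yet shipped across the network

    def write(self, block, data):
        """Apply a write on the source and queue it for replication."""
        self.source[block] = data
        self.pending.append((block, data))

    def ship(self, max_writes=None):
        """Apply queued writes to the target in order; return how many were shipped."""
        shipped = 0
        while self.pending and (max_writes is None or shipped < max_writes):
            block, data = self.pending.popleft()
            self.target[block] = data
            shipped += 1
        return shipped

    def in_sync(self):
        """True when nothing is queued and both stores hold the same blocks."""
        return not self.pending and self.source == self.target
```

In this model, any writes still in the queue when the source fails would be lost on the target, so the depth of the queue is a rough measure of exposure against the RPO.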
While FTC and CCA solutions do a good job of managing high-availability requirements, both approaches have had drawbacks. Early FTC designs from companies like Tandem Computers, Sequoia Systems and Stratus Technologies were proprietary and didn't support mainstream apps. More recently, FTC companies like Hewlett-Packard (HP) Co. and Stratus built systems using commodity hardware and software, but with a proprietary software layer to connect the mainstream operating system to the redundant hardware architecture. This results in higher costs and lengthier development cycles for new releases. CCA is generally more complex than FTC because it requires custom scripting, strict change control and sophisticated administrators. It does use off-the-shelf hardware and software, which makes it more applicable to mainstream applications, but the complexity of clusters makes them less attractive, especially to smaller shops.
New technologies for application recovery
Traditional replication products have been storage centric, replicating at either the block or volume level. They create a physical copy of the data and require the target server to be in standby mode. In addition, these products don't offer any way to monitor and detect logical and/or physical corruption in application objects.
Recent developments have bolstered a new high-availability computing model that comes closer to managing applications for continuity, rather than just very rapid recovery.