When people talk about High Availability (HA) they are usually referring to the ability to use a computing service without noticeable interruption. HA generally provides a significant increase in the end-to-end availability of system functionality when compared to non-HA systems. However, HA systems are often much more complicated to deploy and maintain. The question we need to answer is which systems need to be protected with HA techniques, and at what level.
The concept of high availability can be found in all sorts of things besides computing services. For example, we want our automobiles to be highly available and we take steps to protect some of the features that could fail, such as tires. This is why most vehicles carry a spare tire, but since there are costs to carrying these spares we generally only have one. High Availability always has costs in time and money.
It would be fabulous if computers never crashed or suffered outages from hard disk failures. But they do. Even if they didn't, there would still be downtime from routine maintenance, human error, hackers, network connectivity, natural disasters, etc.
This is why considering which systems to protect, at what level, is important. We don't want the equivalent in our computing services of carrying six spare tires when one will do quite nicely.
High Availability protects resources so that in the event of a computer service outage, your business can continue with minimal interruption or decrease in throughput. By reducing the risk of downtime, we improve business continuity.
One way that HA reduces the risk of downtime is by eliminating single points of failure. If your entire PaperCut system runs on one server, then that server is a single point of failure. And, by the way, so are the components of that server like the hard drive, power supply and network interface card. Even software like driver updates and operating system patches can bring down the system if they contain serious defects.
HA is a set of methodologies to eliminate these single points of failure, and thereby reduce the risk of downtime, which increases business continuity. And that's why it's a good thing.
Before we can achieve a Highly Available system we need to identify where the weak links are in the computing services. Where are the single points of failure? This is a complicated question because it depends on many factors, such as:
First, we have to know what parts of the business need to continue, and what their value is to the overall organization. If we are running a hospital and the snack vending machine can't process payments, it's probably not as vital as ICU patient monitoring. However, if you're a university during finals week that same vending system could be mission critical.
Another factor in proper HA planning is user behavior. Are there peak usage times which stress certain components of the computing system? Do some applications create higher system load? Which users will need system access even in the event of a disaster?
Are there multiple types of devices that need to be protected? Have we identified all the pieces of the system? What if we lose power to the primary datacenter?
Textbooks are written on techniques to protect computing services. Whether it's the servers, operating systems, databases or power sources, we need to understand business goals and any single points of failure in the computing systems that will put them at risk.
Achieving HA is accomplished primarily through two methods: redundancy and recovery. Both give the computing system resilience (i.e. an ability to return to full operation).
Redundancy solves the problem of a potential failure by having a duplicate standing by. It uses technologies like RAID, virtual machine images, clusters and network load balancers (NLBs). If you're using RAID and a hard disk crashes, no problem: the data has been redundantly written to other disks. If you are using virtual machines and the whole VM crashes, no problem: just spin up the latest VM image on a new VM server and you're back in business. Clusters and NLBs have multiple servers running and can divert computing requests away from a failed server to one that is still up. Theoretically, you can have extremely low risk of downtime by implementing redundancy.
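The effect of single points of failure and of redundancy can be illustrated with some simple arithmetic. The sketch below is only an illustration: the 99% component figures are hypothetical, the function names are ours, and it assumes that failures are independent of each other, which real-world events (such as a fire in a shared datacenter) can violate.

```python
from math import prod


def series_availability(parts):
    """Availability of a chain where every part must be up (single points of failure)."""
    return prod(parts)


def redundant_availability(single, copies):
    """Availability when at least one of `copies` independent duplicates must be up."""
    return 1 - (1 - single) ** copies


# Hypothetical figures: a server whose disk, power supply and NIC are each 99% available.
single_server = series_availability([0.99, 0.99, 0.99])

# Two such servers standing in for each other, assuming they fail independently.
redundant_pair = redundant_availability(single_server, 2)

print(f"single server:  {single_server:.4%}")
print(f"redundant pair: {redundant_pair:.4%}")
```

Chaining components lowers availability below that of any single part, while duplicating the whole server raises it well above the single-server figure. The independence assumption is what makes the second number optimistic.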
HA can also be accomplished with a good disaster recovery plan. This is the most basic form of high availability and avoids most of the complexity of many HA techniques. Good disaster recovery plans will have procedures to minimize downtime for key systems. One such procedure could be taking a database backup every night, and writing daily transactions to an offsite server. This should give you the ability to have the full database back online within a short time even if the main database server crashes and burns.
Speaking of burning, we had another PaperCut customer with a robust installation that included clustered Application Servers, clustered print servers, clustered database servers and all of them pointed to a SAN. From a system point of view it was on the high end of HA. Until a fire tore through the datacenter, and then it was zero-available. Being forced into a redesign, when they reconstructed their PaperCut installation they opted for more modern virtual machine technologies to provide the same level of HA.
This leads us to a primary consideration. What is the right level of HA? There is a significant difference in TCO between providing 99% uptime and 99.99% uptime. Is the difference necessary and worth the cost? Even if HA is a good thing for your business, and the ways to achieve it are well understood, the question still needs to be considered: what level of HA is necessary? You will have different answers to this question for various functions in your organization. Your mission-critical systems need more HA techniques for more parts of the infrastructure if the cost of an outage would be greater than the cost of providing HA.
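To make the gap between uptime targets concrete, the following sketch converts an uptime percentage into the downtime it allows per year, and compares the cost of an outage with the cost of HA. The function names and all cost figures are hypothetical and only serve as an illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent):
    """Hours of downtime per year permitted by an uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def ha_pays_off(outage_cost_per_hour, downtime_hours_avoided, annual_ha_cost):
    """True when the outage cost avoided exceeds what the HA measures cost."""
    return outage_cost_per_hour * downtime_hours_avoided > annual_ha_cost


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")

# Hypothetical example: moving from 99% to 99.99% for a system that costs
# 500 per hour of outage, with HA measures costing 20,000 per year.
avoided = allowed_downtime_hours(99.0) - allowed_downtime_hours(99.99)
print("HA pays off:", ha_pays_off(500, avoided, 20_000))
```

A 99% target tolerates roughly 87.6 hours of downtime per year, while 99.99% tolerates under one hour, which is why the cost of reaching the higher target rises so steeply.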
For example, vital systems such as order placement, user authentication and database may need 99.99% uptime. This might be enabled with multiple HA techniques like virtual machine snapshots, clustering, synchronous replication, hot sites and off-site backups. However, the print system might be just fine with an hour of downtime.
Our focus when considering HA should be on two objectives: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).
RPO is the maximum length of time between recovery points, such as snapshots of your data. It's a measure of how far back in time you must go to get a recovery point. It's also the amount of data loss, measured in time, that the business process can cope with.
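RTO is the maximum tolerable time from point of failure back to normal operation. It is sometimes equated with Mean Time To Recovery (MTTR), but MTTR is an average measured after the fact, whereas RTO is a target set in advance.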
RPO and RTO need to be carefully considered, because together they determine the cost to recovery. The smaller the times for RPO and RTO, the larger the cost to recovery. If you want recovery points measured in seconds and recovery time in minutes, then expect a very high cost to recovery.
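As a rough illustration of how these two objectives can be checked against a recovery plan, the sketch below compares a plan's recovery point interval and estimated restore time with the chosen RPO and RTO. The function names and all the durations are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(recovery_point_interval):
    """A failure just before the next recovery point loses the whole interval."""
    return recovery_point_interval


def meets_objectives(recovery_point_interval, estimated_restore_time, rpo, rto):
    """Return (rpo_met, rto_met) for a recovery plan."""
    rpo_met = worst_case_data_loss(recovery_point_interval) <= rpo
    rto_met = estimated_restore_time <= rto
    return rpo_met, rto_met


# Hypothetical plan: nightly backups only, with a two-hour restore procedure,
# checked against an RPO of one hour and an RTO of four hours.
rpo_met, rto_met = meets_objectives(
    recovery_point_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(hours=2),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print("RPO met:", rpo_met)
print("RTO met:", rto_met)
```

In this example the restore time is acceptable but the nightly backup alone cannot satisfy a one-hour RPO. Shortening the recovery point interval would close that gap at additional cost.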
Now let's apply all this to a PaperCut deployment. We don't want to start with the assumption that printing systems need to be protected at the same level as other business functions. We should start with the business objective questions. What is the maximum amount of time that print jobs could be lost and need to be reprinted (RPO)? What is the maximum time allowable from failure of printing systems to full recovery (RTO)? And third, what is the expected total cost to recovery?
It is possible with PaperCut to improve overall system resilience at multiple points in the infrastructure. PaperCut recommends using HA technologies and methods that are most familiar to the customer, and where they have trained personnel to support them. This reduces overall system complexity by not introducing new tools and procedures to learn and implement in the event of an outage. This is a primary reason why PaperCut does not mandate HA methodologies or create HA products of our own.
Remember our clustered Linux HA customer mentioned earlier? It turns out that their single point of failure was the person who set up the complex environment using multiple HA technologies that no one else understood. PaperCut does not want to force customers to add yet another tool on the HA stack.
HA technologies have evolved to the point where very favorable RPO and RTO can be achieved without adding this cost and complexity into the PaperCut products. The customer should not have to learn our way of providing HA; they should be able to use what they already know. Even if we wanted to build all this into PaperCut we couldn't fully protect against the most common sources of downtime: full hard disks, dead NICs and human error. The full printing system, including PaperCut, can be protected against downtime with off-the-shelf HA technologies.
Therefore, the discussion to add HA for PaperCut should start with, "How does the customer defend against downtime on other servers, and in particular the printing systems?" For example, it would do us little good to defend the PaperCut server and not other print servers as well. Anywhere that PaperCut is running, or where it utilizes a resource, should be considered in the HA plan. Simply put, the main pieces that we want to evaluate protecting are the PaperCut Application Server, Site Server, Print Provider (i.e. print servers), database and the multi-function devices (MFDs). If a customer already employs HA methods on other servers and resources, they should be able to include PaperCut in the same manner with very little additional training or expense.
Whatever methods you employ for PaperCut HA, "The general principle is start light, and build over time" (Chris Dance, PaperCut CEO).
There are several proven techniques to provide HA, and each has differences that may affect your decision about which one(s) to use. There are also technical constraints that may make some techniques inadvisable (e.g. using an NLB for SQL databases). The overarching principles are:
- Use the tools where you have depth of experience
- Provide the level of HA that meets business objectives
- Start light and build over time
The App Server can be protected with clustering, virtual machines or simple backup and restore.
The Site Server provides resilience to the printing system if the connection to the App Server becomes unavailable. The Site Server itself should also be protected with clustering or a virtual machine.
The PaperCut Print Provider runs on print servers and can employ all the methods mentioned above and Load Balancers. A Load Balancer distributes application traffic (print jobs in this case) across a number of servers. This not only load balances print jobs, but the load balancer will bypass a server that is not responding thus adding protection for a failed print server.
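The bypass behavior described above can be sketched in a few lines. This is a simplified round-robin illustration, not how PaperCut or any particular load balancer product is implemented; the class name, the server names and the health check are all hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that fail the health check."""

    def __init__(self, servers, is_healthy):
        self._servers = list(servers)
        self._is_healthy = is_healthy  # callable: server name -> bool
        self._ring = cycle(self._servers)

    def next_server(self):
        # Try each server at most once per request before giving up.
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if self._is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy print servers available")


# Hypothetical print servers, one of which is not responding.
down = {"print-02"}
balancer = RoundRobinBalancer(
    ["print-01", "print-02", "print-03"],
    is_healthy=lambda server: server not in down,
)

for job in ("job-a", "job-b", "job-c", "job-d"):
    print(job, "->", balancer.next_server())
```

Jobs alternate between the two responsive servers and never reach the failed one. When every server is down the balancer raises an error, which is the point at which recovery procedures rather than redundancy have to take over.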
An external database can be protected with clustering and VMs, as well as some database-specific techniques such as synchronous replication and off-site transaction logs. Check with the customer and the database manufacturer for the best choice.
You can also check out our Ultimate guide to high availability methods for MS SQL Server over on our blog - definitely worth a read!
Keywords: High Availability, Resilience, Redundancy, Cluster, Virtual Machine, Failover