High Availability

What is High Availability (HA)?

Simply put, High Availability provides how a service is available to the user without any noticeable interruption and is achieved using the following principles:

  1. Redundancy
  2. Monitoring
  3. Failover

Implementing a highly available solution provides the benefits of reducing incurred downtime, whether due to loss of service, a site outage, human error, or scheduled maintenance, and improving application and service performance through scalability.

Another advantage of implementing high availability is that there are a plethora of technologies to support the needs of any organization. These can range from network load balancing, service clustering at the hypervisor or the database level to service backups and replication, all depending on budget and existing infrastructure resources.

The caveat is that in most if not all cases, someone needs to implement and support the solution and technology of choice. The systems administrator will require expertise in the technology or someone else within the organization, i.e., a network load balancer expert or a paid support contract from the technology vendor. HA always has costs in time and money, all determined by the solution and technology of choice.

Why HA is a Good Thing

HA protects resources so that in the event of a computer service outage, your business can continue with minimal interruption or a decrease in throughput. By reducing the risk of downtime, we improve business continuity.

One way that HA reduces the risk of downtime is by eliminating single points of failure. If your entire PaperCut system runs on one server, then that server is a single point of failure – and so are the components of that server like the hard drive, power supply, and network interface card, by the way. Even software such as driver updates and operating system patches can bring down the system if they contain severe defects. Therefore, it is essential to highlight the benefits of testing before going live in a production environment supported by standardized change control processes thereafter.

Making any system Highly Available is not just about how it is architected, but also the processes and procedures that are implemented, which help reduce the risk of a system failure. All in all, HA gives you methodologies to eliminate single points of failure, reducing the risk of downtime, and increased business continuity. That’s why it’s a good thing.

Business Continuity Planning

Having a business continuity plan is paramount when we look to implement high availability within an organization. It provides the business with defined guidelines of how operations will continue during an unanticipated disruption of service through the means of prevention and recovery systems while maintaining service levels for the end-users.

When creating a business continuity plan, it’s good practice to conduct a business impact analysis and recovery strategy initiative. This helps set expectations, define processes and guide resolutions.

Business Impact Analysis (BIA)

A (BIA) is used to identify the time-sensitive and business-critical functions that need to be maintained, the current processes, and the resources required to support them. This will allow the business to understand the operational ramifications further should the print service fail and indicate how long the company can operate without print?

Producing questionnaires, running workshops, and formulating interviews with the relevant stakeholders are some of the ways to ascertain the required information.

Recovery strategies

When generating a recovery strategy, we must - using the (BIA), identify and document the resource requirements of the business. We can recognize the gaps between recovery requirements and existing business capabilities by conducting a gap analysis exercise.

With this information, we can answer questions such as:

  1. What current technologies and processes exist within the organization?
  2. What in-house organization resources can we utilize to support the business requirements?
  3. What teams or individuals can we contact when a failure occurs?

Implementing a business continuity plan is hugely beneficial. It provides the organization with a roadmap and a set of processes that support making the right decisions if and when unexpected operational failures occur.

Organizational Factors

Before we can achieve a Highly Available system we need to identify where the weak links are in the computing services. Where are the single points of failure? This is a complicated question because it depends on many factors, such as:

Business value

First, we have to know what parts of the business need to continue, and what their value is to the overall organization. If we are running a hospital and the snack vending machine can’t process payments, it’s probably not as vital as ICU patient monitoring. However, if you’re a university during finals week that same vending system could be mission critical.

User behavior

Another impact to proper HA planning is user behavior. Are there peak usage times which stress certain components of the computing system? Do some applications create higher system load? Which users will need system access even in the event of a disaster?

System infrastructure

Are there multiple types of devices that need to be protected? Have we identified all the pieces of the system? What if we lose power to the primary datacenter?

Textbooks are written on techniques to protect computing services. Whether it’s the servers, operating systems, databases or power sources, we need to understand business goals and any single points of failure in the computing systems that will put them at risk.

Achieving HA

Achieving HA is accomplished primarily through two methods: redundancy and recovery. Both give the computing system resilience (i.e. an ability to return to full operation).

Redundancy

Redundancy solves the problem of a potential failure by having a duplicate standing by. It uses technologies like RAID, virtual machine images, clusters and network load balancers (NLBs). If you’re using RAID and a hard disk crashes, no problem, the data has been redundantly written to other disks. If you are using virtual machines and the whole VM crashes, no problem just spin up the latest VM image on a new VM server and you’re back in business. Clusters and NLBs have multiple servers running and can divert computing requests away from a failed server to one that is still up. Theoretically, you can have an extremely low risk of downtime by implementing redundancy.

However, the cost is usually at least double for a redundant system. And there is added time and complexity to build and maintain these systems, which also require highly trained personnel.

Serious consideration should be given to determine the type and level of redundancy. One PaperCut customer had a very sophisticated Linux HA system with clustered DNS servers and clustered database servers. It was bullet-proof. However, when their system administrator left the organization, no one knew how to maintain it. System upgrades were painful and time-intensive. Eventually, they scrapped the complex, over-engineered system and implemented VM servers that could be spun up on new hardware at a moment’s notice. It provided the same level of redundancy and much less complexity.

Recovery

HA can also be accomplished with a good disaster recovery plan. This is the most basic form of high availability and avoids most of the complexity of many HA techniques. Good disaster recovery plans will have procedures to minimize downtime for key systems. One such procedure could be taking a database backup every night, and writing daily transactions to an offsite server. This should give you the ability to have the full database back online within a short time even if the main database server crashes and burns.

Resilience

System resilience is the ability to ensure the continuous availability of operational services such as printing within an organization. Utilizing the PaperCut Site-server component can protect the print infrastructure from unexpected network outages or unreliable network links across multiple remote office locations.

In the event of a connectivity failure between the head office and a remote site, the PaperCut Site server will automatically take over the role of the PaperCut Primary Application server. The failover process is seamless and transparent, protecting print tracking, copy, authentication, and Find-Me printing from going offline. Once network connectivity to the Primary Application server at the head office has resumed, the PaperCut Site server hands back the responsibility to the Primary Application server.

For more information on what the PaperCut Site server protects, please see our Offline Operations KB.

What is the right level of High Availability?

Once upon a time, we had another PaperCut customer with a robust installation that included clustered application servers, clustered print servers, and clustered database servers – all of which pointed to a SAN. From a system point of view, it was on the high end of HA… Until a fire tore through the data center, and then it wasn’t on the high end of anything, let alone available. Forced into a redesign, they reconstructed their PaperCut installation and opted for more modern virtual machine technologies to provide the same level of HA.

This leads us to a primary consideration: what is the right level of HA? There is a significant difference in TCO between providing 99% uptime and 99.99% uptime. Is the difference necessary and worth the cost? Even if HA is a good thing for your business, and the ways to achieve it are well understood, the question still needs to be considered: what level of HA is necessary? You’ll have different answers to this question for various functions in your organization. Your mission-critical systems need more HA techniques for more parts of the infrastructure if the cost of an outage would be greater than the cost of providing HA.

For example, vital systems such as order placement, user authentication, and database may need 99.99% uptime. This might be enabled with multiple HA techniques like virtual machine snapshots, clustering, synchronous replication, hot sites, and off-site backups. However, the print system might be just fine with an hour of downtime.

Our focus when considering HA should be on two objectives

Recovery Point Objective (RPO)
RPO is the length of time between taking snapshots of your data. It’s a measure of how far back in time you must go to get a recovery point. It’s also the amount of time where the business process can cope with a loss of data.
Recovery Time Objective (RTO)
RTO (aka Mean Time To Recovery) is the maximum tolerable time from point of failure back to normal operation.


RPO and RTO need to be carefully considered, because together they determine the cost to recovery. The smaller the times for RPO and RTO, the larger the cost to recovery. If you want recovery points measured in seconds and recovery time in minutes, then expect a very high cost to recovery.

Uptime and how is it calculated?

Living in the current era of the “always-on, always available” service, there is the ever-increasing requirement to obtain 100% uptime of business-critical services. Whether it is your print management solution or the broader supporting infrastructure, in reality, things are bound to fail no matter how hard we prepare. Therefore we must align the expectation to something more achievable, 99% uptime, by utilizing the tools available to create a HA environment.

As such, uptime is determined by the availability of resources over a year, defined using the number of minutes or seconds. Calculations start at 99% and are measured in “nines,” representing the ratio of how many minutes out of the total minutes in a year service is up and running.

Depending on the organization’s requirements will determine what level of uptime is required versus the level of downtime that is acceptable.

Protecting PaperCut with HA

Now let’s apply all this to a PaperCut deployment. We don’t want to start with the assumption that printing systems need to be protected at the same level as other business functions. We should start with the business objective questions. What is the maximum amount of time that print jobs could be lost and need to be reprinted (RPO)? What is the maximum time allowable from failure of printing systems to full recovery (RTO)? And third, what is the expected total cost to recovery?

It is possible with PaperCut to improve overall system resilience at multiple points in the infrastructure. PaperCut recommends using HA technologies and methods that are most familiar to the customer, and where they have trained personnel to support them. This reduces overall system complexity by not introducing new tools and procedures to learn and implement in the event of an outage.

Remember our clustered Linux HA customer mentioned earlier? It turns out that their single point of failure was the person who set up the complex environment using multiple HA technologies that no one else understood. PaperCut does not want to force customers to add yet another tool on the HA stack unless they feel it necessary.

However, HA technologies have evolved to the point where very favorable RPO and RTO can be achieved. The full printing system, including PaperCut, can be protected against downtime with off-the-shelf HA technologies. The customer does not have to learn our way of providing HA, they should be able to use what they already know. With whatever option they do go for, you can’t fully protect against the most common sources of downtime: full hard disks, dead NICs and human error.

Therefore, the discussion to add HA for PaperCut should start with, “How does the customer defend against downtime on other servers, and in particular the printing systems?” For example, it would do us little good to defend the PaperCut server and not other print servers as well. Anywhere that PaperCut is running, or where it utilizes a resource should be considered in the HA plan. Simply put, the main pieces that we want to evaluate protecting are the PaperCut Application Server, Site Server, Print Provider (i.e. print servers), database and the multi-function devices (MFDs). If a customer already employs HA methods on other servers and resources, they should be able to include PaperCut in the same manner with very little additional training or expense.

Whatever methods you employ for PaperCut HA, “The general principle is start light, and build over time” (Chris Dance, PaperCut CEO).

PaperCut Examples

There are several proven techniques to provide HA, and each has its differences that may impact your decision which one(s) to use. There are also technical constraints that may make some techniques unadvisable (e.g. using an NLB for SQL databases). The overarching principles are,

  • Use the tools where you have depth of experience
  • Provide the level of HA that meets business objectives
  • Start light and build over time
PaperCut’s scalable architecture

PaperCut is designed to be a scalable system. This means that critical components (e.g., application server, database) can be located on separate servers to allow for growth and performance improvements. This also has the advantage of making these components more easily protected for HA. The diagram below shows a very common deployment with a PaperCut Application Server, a user directory, a database server, and multiple print servers. In this example, it’d be almost trivial to protect the PaperCut app server since the critical data is in the database.

A common PaperCut example deployment

PaperCut Application Server

The PaperCut Application Server (app server) can be protected with an NLB using an Active/Passive configuration, clustering, virtual machines, simple backup, and restore.

Active/Passive Load Balancing


Operating System-Level Clustering


Virtual Machine High Availability


Backup and Restore

PaperCut Site Server

The PaperCut Site Server (site server) provides resilience to the printing system if the connection to the app server becomes unavailable. The site server itself should also be protected with clustering or a virtual machine. We will discuss the PaperCut Site Server later in detail.

Print Providers and Print Servers

The PaperCut Print Provider runs on print servers and can employ all the aforementioned methods and Network Load Balancers (NLB). An NLB distributes application traffic (print jobs in this case) across several servers. On top of load-balancing print jobs, an NLB will bypass a server that isn’t responding, which adds protection for a failed print server.

External Database
An external database can be protected with clustering and VMs, as well as some database-specific techniques such as synchronous replication and off-site transaction logs. Check with the customer and the database manufacturer for the best choice.


You can also check out our “Ultimate guide to high availability methods for MS SQL Server” over on our blog - it’s definitely worth a read!


Related articles Application Server Failover Clustering Cluster Server VM Clusters Network Load Balancing Site Server High Availability Whitepaper


Categories: Reference Articles, Architecture


Keywords: High Availability, Resilience, Redundancy, Cluster, Virtual Machine, Failover

Comments