The Requirement for High Availability in an NMS
Performance, particularly the ability of an NMS to collect a high number of traps, was the subject of my previous blog post. In this post, I’ll discuss another aspect of performance: High Availability.
Many of the management systems deployed with our WebNMS framework, for example, are designed to achieve “five nines” reliability, which means the management system is designed to be available 99.999 percent of the time. If you do the math, that equates to no more than 5.26 minutes of downtime per year. By contrast, a “three nines” system (99.9 percent) can be down for nearly nine hours over the course of a year, and a “two nines” system (99 percent) for three to four days. Each additional nine brings a dramatic difference in cost.
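To make the arithmetic concrete, here is a minimal Python sketch that converts an availability target into the maximum downtime it allows per year. The function name and the use of an average 365.25-day year are my own assumptions for illustration; they are not part of WebNMS.

```python
# Average year length, including leap days (an assumption for this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime, in minutes, allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    targets = [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]
    for label, pct in targets:
        minutes = downtime_minutes_per_year(pct)
        print(f"{label:12s} {pct:7.3f}%  {minutes:10.2f} min/year  ({minutes / 60:.2f} h)")
```

Run as a script, this reports roughly 5.26 minutes per year for five nines, a little under nine hours for three nines, and about 3.65 days for two nines.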
True High Availability is a pretty stringent requirement, and quite a bit of architecting, development and testing is needed to guarantee that level of reliability. But it’s absolutely required in telecom. Some of our OEMs who deploy in very large carriers get plenty of pressure from customers to demonstrate high-availability capability. The customer wants to see proof and feel very comfortable.
High availability is all about redundancy – at both the hardware and software level. There can’t be a single point of failure – if something crashes, you need to quickly roll over from the primary to the failover resource.
When you think about it, the role of a management system in the first place is to ensure some level of availability and reliability to consumers and businesses. And from experience, we know that different corners of the world have different reliability expectations. For example, a few dropped calls in the U.S. market put pressure on customer loyalty. However, customers in emerging markets are more forgiving because they experience downtime more often. Even so, a premium service in any country requires a robust management system.
From the consumer point of view, you just redial your cell phone. But a carrier needs a hardened, fully redundant system because it can’t afford a large-scale outage.
High availability is accomplished at multiple layers – OS, hardware, database, and application. In the case of hardware, you have a primary and a failover server. In the case of the database, there should be a secondary database for data replication. WebNMS, for instance, is designed to work with database clusters such as those available in Oracle or MySQL. Getting database availability right is usually a matter of training customers on how to leverage the framework in a specific environment: Oracle, MySQL, Red Hat, etc.
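A rough way to see why every layer needs a backup is to multiply the availabilities together. The sketch below is a back-of-the-envelope model only: it assumes failures are independent, ignores the time the cutover itself takes, and uses made-up figures of 99.9 percent per layer rather than measurements from any real deployment. The function names are hypothetical.

```python
from math import prod


def serial_availability(layers):
    """Availability of a stack where every layer must be up (independent failures assumed)."""
    return prod(layers)


def redundant_availability(single, copies=2):
    """Availability of `copies` independent replicas where any one of them is enough."""
    return 1 - (1 - single) ** copies


# Illustrative figures only: each layer is assumed to be 99.9% available on its own.
layers = {"os": 0.999, "hardware": 0.999, "database": 0.999, "application": 0.999}

# No redundancy: the stack is weaker than any single layer.
unprotected = serial_availability(layers.values())

# Every layer duplicated as a primary/failover pair.
protected = serial_availability(redundant_availability(a) for a in layers.values())

if __name__ == "__main__":
    print(f"without redundancy: {unprotected:.6f}")
    print(f"with failover pairs: {protected:.6f}")
```

Under these assumptions, four unprotected layers together land below three nines, while pairing each layer pushes the stack to roughly five nines. Real systems fall short of that ideal because failures are rarely fully independent and failover is never free.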
If there’s a requirement for database replication, then the skill set of a DBA may be necessary. Or if the system needs to support 100 concurrent user/operator seats, load balancing at the presentation layer will be required. Dozens of questions must be addressed to tune the system.
- How will the system know when to cut over?
- What are the failover listeners?
- What is the heartbeat interval?
- What is the lag time for the new system to come up?
- Which functional modules come up first in sequence?
- Should it be instantaneous, or is there a grace period of a minute or two? As you get closer and closer to instantaneous failover, the cost to get there increases dramatically.
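Several of these questions (heartbeat interval, grace period, when to cut over) come down to a few tunable numbers. The following Python sketch shows one simple way a standby could make the cutover decision. It is not the WebNMS failover mechanism; the class, its field names, and the example values are hypothetical and chosen only to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides when a standby should take over from a silent primary (hypothetical helper)."""

    heartbeat_interval: float     # seconds between expected heartbeats
    missed_allowed: int           # consecutive missed heartbeats tolerated
    last_heartbeat: float = 0.0   # timestamp of the most recent heartbeat

    def record_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat arrives from the primary."""
        self.last_heartbeat = now

    def grace_period(self) -> float:
        """Silence, in seconds, tolerated before the primary is declared down."""
        return self.heartbeat_interval * self.missed_allowed

    def should_cut_over(self, now: float) -> bool:
        """True once the primary has been silent for longer than the grace period."""
        return (now - self.last_heartbeat) > self.grace_period()


if __name__ == "__main__":
    # Example values only: a 5-second heartbeat with 3 misses tolerated.
    monitor = FailoverMonitor(heartbeat_interval=5.0, missed_allowed=3)
    monitor.record_heartbeat(now=100.0)
    print(monitor.should_cut_over(now=110.0))  # 10 s of silence, within the 15 s grace period
    print(monitor.should_cut_over(now=116.0))  # 16 s of silence, past the grace period
```

Shrinking the interval or the number of tolerated misses brings the cutover closer to instantaneous, but it also makes a false failover more likely during a brief network hiccup, which is part of why the cost climbs as the grace period shrinks.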
Now a point of clarification: neither WebNMS nor any other framework supplier sells a five-nines system. We sell a management framework that’s capable of five-nines availability. In other words, it’s the job of the NMS developer or EMS manufacturer to harden the system by testing it thoroughly with their equipment and by hardening the server, database, and OS environment. Much of the cost is hardware-related; however, testing costs are high as well.
High availability is one of the most challenging aspects of management system design. And unless your NMS framework is designed with the right availability hooks, you could go through a lot of effort and expense working around that deficiency.
Eric Wegner is a 20-year veteran of the industry and has 10 years of experience with ZOHO Corp. (formerly AdventNet) working on large and complex network management infrastructures for network equipment manufacturers, service providers and military contractors. Eric joined the company as the first salesperson and is now business development manager leading the WebNMS division in North America.