Billing and OSS World

Fault Management Gets a Grip on Service Impact

Susana Schwartz
06/01/2002
For providers to offer superior customer service, fault management must go beyond identifying problem network elements to finding performance violations.

As providers move to broadband multimedia services, SLAs will grow in importance, and so will the need for assurance that operators deliver on QoS promises. That assurance will require the fault management software to do more than just sound alarms when network elements (NEs) such as ports, cards or nodes go down. It will need to identify anomalies and performance violations, as well as contact customer-facing applications to proactively issue credits or special services to placate customers, a step that is increasingly necessary as wireless and long-distance operators choke on customer churn.

"The industry is moving into a time when carriers are increasing their 'keeping a customer' versus just 'getting a customer' strategy," says Mark Nicholson, CTO and vice president of product management at Syndesis, which provides provisioning services, including order entry, service activation and resource managers.

Marrying Fulfillment and Service Assurance

The logical topology of how the services run over the physical equipment is not usually found in traditional fault management systems. In the past, operators didn't need it there. A fiber was cut, a truck would go out to fix it, and customers would call in to complain until it was fixed. But that is no longer acceptable as the lines blur between service fulfillment (comprising provisioning and customer care, including order entry and negotiation) and service assurance (comprising mainly fault management, including capacity/performance management and SLA management).

Many major ILECs, CLECs and ISVs have talked about integrating back-office fault management, provisioning, performance and network management systems with front-office customer-facing and customer care systems. Consequently, a great deal of integration among OSS, CRM, and billing and rating vendors is taking place.

Most agree that tight integration with CRM and service analytics will grow in importance with SLA management. "When you get a credible CRM solution to feed into fault management, you can do some predictive modeling of what customers will do next," says Eric Chen, senior director of communications industry strategy at PeopleSoft. As part of its push for multivendor solutions, the company has an ongoing partnership with Micromuse for automatic trouble ticket issuance.

Integration between trouble management and SLA management also is expected. For example, MetaSolv, which has developed trouble management that analyzes service problems and correlates them according to customer SLAs, is in the midst of integrating with the SLA management assets it acquired from Nortel. Its system is designed to govern SLA contract management processes, from the definition of SLAs to the capture of detail about customers and services. "Within the system, there is assessment analysis and reporting for obtaining data from networks, OSS and BSS that pertain to the SLA itself, which is put into compliance reports and violation reports," says Denise MacDonell, senior manager in product marketing at MetaSolv.

She says that MetaSolv is now developing modules to handle data attributes and rules to support IP VPNs and MPLS, in addition to the Ethernet, Frame Relay and ATM modules already in place. "We are working with vendors like Micromuse to allow us to obtain information when conducting service impact analysis on faults," she says. Micromuse's Netcool indicates faults, then Service Analytics, the software MetaSolv acquired from Nortel, obtains the service-affecting events from Netcool and publishes them into a customer-care application detailing which SLAs are affected. "It focuses resources on SLA-affecting problems, rather than generically fixing faults on the network," MacDonell says.

"If SLAs guarantee a certain level of QoS for availability and consistency, then operators have to be able to tie their service and fault management in with SLAs," says Beth Meeks, director of product management at MetaSolv. "A train commuter working on a laptop with the expectation to be able to access corporate networks will be very disenchanted if the service is unavailable, and revenue lost if the service is discontinued. Once connected, the customer expects consistency in the transmission of data, sans variations due to bit errors or disparity in download times."

Because it's unrealistic to promise 100 percent performance, operators must strike a balance between what they can provide and what customers readily accept. That balance will lead to revenue assurance: keeping availability and performance at levels that benefit the bottom line.

Out With the Old

As part of that balance, service providers must move toward self-healing networks, which will automatically find the root cause of alarms, issue tickets and deploy engineering or technical teams to the problem, while notifying customer-facing systems that affected subscribers must be informed and reimbursed for any SLA violations. Manual correlation of alarms, antiquated inventory and element management systems, maps and personal know-how must be augmented or replaced by automated systems.

Traditionally, in point-to-point connections in Layer 2 networks (such as Frame Relay and ATM), fault management's fundamental purpose was to monitor alarms and outages, and proactively filter messages to resolve problems expeditiously in an effort to minimize downtime. However, 3G services based on granular IP-based definitions and applications running on top of transmission infrastructure will require fault management to wear a much bigger hat.

Fault management used to be easier. Each NE was only capable of raising one alarm at a time, and each alarm was simple: it would just indicate whether there was trouble on that particular element. Because NEs didn't manage more than one network entity (circuit) in traditional types of transmission equipment, an operator could usually handle problems with a mix of automated and manual processes.

The challenge now is with second-level correlation of physical and logical topologies: the NEs and services, respectively. "When a facility is cut, carriers must somehow know if their Frame Relay or ATM networks are affected," says Syndesis' Nicholson, "but so, too, [they must know about] other Layer 2 or Layer 3 services traversing over the broken port or transport."
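As a rough illustration of that second-level correlation, the sketch below maps a failed physical element to the logical services that ride over it. The element names, service names and data layout are hypothetical and are not taken from any vendor's product; a real system would draw them from an inventory database.

```python
# Hypothetical inventory: each logical service lists the physical
# elements (nodes, fibers) it traverses.
SERVICE_PATHS = {
    "frame-relay-pvc-17": ["node-A", "fiber-12", "node-B"],
    "atm-pvc-4": ["node-A", "fiber-12", "node-C"],
    "mpls-vpn-acme": ["node-D", "fiber-30", "node-B"],
}


def build_reverse_index(service_paths):
    """Map each physical element to the set of services that use it."""
    index = {}
    for service, elements in service_paths.items():
        for element in elements:
            index.setdefault(element, set()).add(service)
    return index


def affected_services(failed_element, index):
    """Return the services traversing a failed element, sorted by name."""
    return sorted(index.get(failed_element, set()))


index = build_reverse_index(SERVICE_PATHS)
print(affected_services("fiber-12", index))
```

Building the reverse index once keeps each lookup cheap, which matters when a single cut touches many services at the same moment.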

"While most service providers claim they can see the faults in their network, most cannot associate physical faults with logical topology in something like a Layer 3 MPLS VPN service," explains Kurt Dahm, director of business and market development for OSS at Cisco Systems. "That requires a significant amount of data consolidation so that major carriers and billing companies don't have multiple representations of the same person. You can have someone that is in the database as a mobile wireless user, as a Cisco employee, and as a home operator."

Providers traditionally had to define very restrictive filters in interfaces between NE/EM (element manager) and the presentation layer. "For instance, we had to tell the system to only present alarms classified as 'critical' at their origin, invoke delays to reduce short-term alarms or group alarms based on the time of detection," says Johan Rolandsson, an OSS consultant to Telia on behalf of Barret Consulting, which has helped Telia upgrade its fault management products.
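The three kinds of restrictive filter Rolandsson describes (severity, delay and time-based grouping) might look something like the sketch below. The `Alarm` record, the 30-second hold-down and the 10-second grouping window are all assumptions made for illustration, not details of Telia's configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alarm:
    element: str
    severity: str                       # e.g. "critical", "major", "minor"
    raised_at: float                    # seconds since a reference time
    cleared_at: Optional[float] = None  # None while the alarm is active


def passes_filter(alarm, now, hold_seconds=30.0):
    """Present only critical alarms that have outlived the hold-down delay."""
    if alarm.severity != "critical":
        return False
    # Short-term alarm: it cleared before the hold-down expired.
    if alarm.cleared_at is not None and alarm.cleared_at - alarm.raised_at < hold_seconds:
        return False
    # Still inside the hold-down window: too early to present.
    if now - alarm.raised_at < hold_seconds:
        return False
    return True


def group_by_detection_time(alarms, window_seconds=10.0):
    """Group alarms whose detection times fall in the same fixed window."""
    groups = {}
    for alarm in alarms:
        bucket = int(alarm.raised_at // window_seconds)
        groups.setdefault(bucket, []).append(alarm)
    return [groups[key] for key in sorted(groups)]


alarms = [
    Alarm("port-1", "critical", 0.0),
    Alarm("port-2", "minor", 1.0),
    Alarm("port-3", "critical", 2.0, cleared_at=5.0),
    Alarm("port-4", "critical", 45.0),
]
presented = [a.element for a in alarms if passes_filter(a, now=60.0)]
print(presented)
```

Only one of the four alarms reaches the operator in this example, which shows both the appeal of such filters and the risk of losing information that they carry.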

"Traditionally, when something happened in or around the surroundings of a network element, it signaled a status change to the fault management system, and predefined text populated a screen notifying a network engineer to generate a trouble ticket," he says.

"Network elements and element managers had to raise an alarm for each probable cause in each affected entity, whether ports, channels or cross-connection points," says Rolandsson. "While today's network elements and EMs do provide the necessary level of details, the drawback is that it's arduous to sort through the prodigious amounts of data necessary and the time necessary to conduct root cause analysis."

Often, he explains, critical alarms originate outside the scope of a provider's monitoring area, for instance in the network of a customer or another operator. "Then operators would have to draw conclusions based on a subset of the information, not to mention that sometimes critical alarms could get lost, or that even with highly restrictive policies and tight filters, it is a time-consuming process for even the most skilled people," Rolandsson adds.

"The bottom line is that with convergent networks and services, network domains and technologies cannot be managed as individual entities any longer," he says. "We have an element management center identifying the faults, a network management center for rerouting, and a service management center now emerging to notify customers when there are outages or when QoS isn't delivered at expected levels." Telia has used the system to build a customer and services impact layer on top of the network layer.

"At the core of our strategy is the hope for automatic correlation among elements that are down, and the corresponding services and SLAs affected by that," Rolandsson says. But a balance must be struck between automation and maintaining control. "You have to go beyond making a nice GUI," he says, "but automation projects of this magnitude are often slowed by the political process."

In the meantime, Telia is eliminating the need to define correlation rules as it grows. "Because rules are applied in a generic way within the [TTI Telecom] Netrac system, we have a solution that is open enough to handle all types of equipment," Rolandsson asserts. As long as equipment is ITU-compliant, Telia can expand the use of Netrac, regardless of whether it involves Marconi, Alcatel or Cisco elements.

In With the New

In a modern transmission network such as SDH or SONET, each network element might involve several hundred circuits, so a single fiber cut could generate hundreds of alarms. With Layer 3 networks, a cut to a heavy optical cable with multiple DWDM systems can generate thousands of alarms. If there is a cut in a transmission cable carrying millions of conversations over optical fiber, several dozen network elements, multiplexers and cross-connects will eventually trigger a massive alarm storm.

Such alarm traffic can cause delays that affect service levels and customer perception. For that reason Telia implemented TTI Telecom's Netrac system to automate root cause analysis and to collect and interpret all information provided by the network, automatically drawing an image of how the network is configured. "We knew we needed something different than the expert systems that were prevalent at this project's inception three years ago," says Rolandsson. Back then, predefining if-then situations required engineers to think through every possible scenario. "With a system built on probability," he says, "we are not required to guess every theoretical result for an alarm."
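The article does not describe how Netrac's probability-based analysis works internally, so the sketch below should not be read as its method. It shows only a simple topology heuristic in the same spirit: score each shared resource by the share of its dependent elements currently in alarm, and rank the likeliest culprit first. The dependency map and element names are hypothetical.

```python
# Hypothetical dependency map: each element lists the resources it relies on.
DEPENDS_ON = {
    "mux-1": ["fiber-A"],
    "mux-2": ["fiber-A"],
    "xconnect-1": ["fiber-A", "fiber-B"],
    "mux-3": ["fiber-B"],
}


def rank_root_causes(alarming, depends_on):
    """Rank shared resources by the share of their dependents in alarm."""
    dependents = {}
    for element, resources in depends_on.items():
        for resource in resources:
            dependents.setdefault(resource, set()).add(element)

    scores = []
    for resource, elements in dependents.items():
        in_alarm = elements & alarming
        if in_alarm:
            share = len(in_alarm) / len(elements)
            scores.append((share, len(in_alarm), resource))
    # Highest share first, then most alarms explained, then name for stability.
    scores.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [(resource, share) for share, _, resource in scores]


alarming = {"mux-1", "mux-2", "xconnect-1"}
ranking = rank_root_causes(alarming, DEPENDS_ON)
print(ranking)
```

Here three alarms collapse into one probable cause, fiber-A, without anyone having written an if-then rule for that specific fiber.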

Changing Alarm Prioritization

To avoid being overwhelmed, the alarm structure must be designed with not only the network in mind, but also its effect on service. To assess impact quickly, providers must be able to filter out innocuous alarms and home in on the key ones.

"You have to first define alarm structure," says Danny Itzigaohan, head of cellular wireless marketing at TTI. "It seems simple, but it is not." First he recommends a migration: "Fault management must become the basis for service management and impact assessment. That is where companies are going."

Whether a broken air conditioner, a burned-out cable card, fried power supply or flooding, assigning priority to alarms before deploying a technician or team of engineers will become increasingly important with QoS-based services. Though a problem could be relatively severe in terms of network operation, it might not be as urgent as, say, an alarm that directly affects provisioning.

To ensure that whatever has been provisioned is modeled in the fault management system, Dahm at Cisco says the correlation of fault, configuration, accounting, performance and security (FCAPS) will be key to staving off performance problems: "They all have to be linked if carriers want service assurance to talk to billing and rating when SLAs are violated." He cites Bell Canada as a company with excellent FCAPS coverage (see "Bell Canada Customers Drive Activation").

"Provisioning needs to be able to pull SLAs from the SLA management systems to present a unified view of customers. Based on that, providers can reroute services according to the list of customers associated with certain downed links," says Syndesis' Nicholson. "The thresholds for service on a video stream service utilized by a Wal-Mart would maybe be less stringent than those for Bank of America, which would have real-time needs."

"Even if the network isn't working perfectly, you can still ensure the most important customers' services are operating well before repairing the network," says Itzigaohan.

Consider, for example, an enterprise customer who wants to provision enough bandwidth for an important videoconference; but at the same time, an alarm for a faulty SMSC goes off, indicating a failure of goal-on-demand video service, by which video clips are sent to individuals whenever a favorite team has scored a goal. In such a case the operator will have to automatically know what is guaranteed in the SLAs for each service, and which group gets priority, to avoid revenue loss and churn. If one group is highly profitable, the decision might be to answer its alarms first.
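A minimal sketch of that kind of decision follows, assuming each alarm has already been tied to a service, an SLA tier and a revenue figure. The tier ranking and the numbers are invented for illustration; an operator would take them from its own contracts and billing data.

```python
from dataclasses import dataclass

# Assumed ranking of SLA tiers; higher means more urgent.
TIER_RANK = {"platinum": 3, "gold": 2, "silver": 1}


@dataclass
class ServiceAlarm:
    alarm_id: str
    service: str
    sla_tier: str
    monthly_revenue: float  # illustrative figure only


def work_queue(alarms):
    """Order alarms by SLA tier first, then by revenue at risk."""
    return sorted(
        alarms,
        key=lambda a: (-TIER_RANK.get(a.sla_tier, 0), -a.monthly_revenue, a.alarm_id),
    )


alarms = [
    ServiceAlarm("A-1", "goal-on-demand video", "silver", 20000.0),
    ServiceAlarm("A-2", "enterprise videoconference", "platinum", 50000.0),
    ServiceAlarm("A-3", "consumer messaging", "silver", 35000.0),
]
queue = [a.alarm_id for a in work_queue(alarms)]
print(queue)
```

Sorting on a tuple keeps the policy explicit: the SLA tier decides first, and revenue only breaks ties within a tier.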

If ATM cells and packets are going the same route, carriers must decipher which services can easily be rerouted during an outage and which cannot. For example, a fixed route such as that for an ATM PVC or SPVC could be assigned for automatic reroute to a detour.

The problem with automatic rerouting is guaranteeing the same level of service as was promised over the original network. Depending on the break or fault, hundreds of services could automatically be rerouted, saving engineers from having to handle each EM, regardless of its priority.

Fault Management for Soft Alarms

To guarantee a certain grade of service, operators have to define key performance indicators (KPIs) and performance alarms that interpret not only hard alarms, but "soft" alarms as well.

Thresholds for service level, fault and performance indicators can include mean time back to service, mean time between service failures, CPU utilization and the time required to download or present data to end users. Answer-seizure rates, drop rates and handover success rates can be set in SLAs for gold, platinum or silver customers, as can packet loss rate, utilization per interface and hit rate.
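Two of those indicators can be computed directly from outage records, as the sketch below suggests. Definitions vary between operators; here mean time back to service is taken as the average outage duration, and mean time between service failures as the average uptime between consecutive outages. The outage times are made up, and the functions assume at least one outage and at least two outages respectively.

```python
def mean_time_to_restore(outages):
    """Average outage duration; outages is a list of (start, end) in hours."""
    return sum(end - start for start, end in outages) / len(outages)


def mean_time_between_failures(outages):
    """Average uptime from the end of one outage to the start of the next."""
    ordered = sorted(outages)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical outage log, in hours since the start of the reporting period.
outages = [(10.0, 12.0), (112.0, 113.0), (313.0, 316.0)]
mttr = mean_time_to_restore(outages)
mtbf = mean_time_between_failures(outages)
print(mttr, mtbf)
```

Values like these can then be compared against the figures written into each customer's SLA.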

More broadly, operators can implement a threshold-crossing alert, which would be depicted as a soft alarm warning of behavioral anomalies in the network. "If you have a server providing gaming or streaming video or music services, you can have performance alarms that go off when a CPU load goes above 70 percent. But if you have a four-CPU unit for gaming, you can set the rules to say an alarm should be issued only if three of the four CPUs go above 70 percent," says TTI's Itzigaohan.
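Itzigaohan's four-CPU rule can be expressed in a few lines. The function name and parameters below are hypothetical; the 70 percent threshold and the three-of-four requirement come from his example.

```python
def cpu_soft_alarm(cpu_loads, threshold=70.0, min_over=3):
    """Raise a soft alarm only when at least min_over CPUs exceed threshold."""
    over = sum(1 for load in cpu_loads if load > threshold)
    return over >= min_over


print(cpu_soft_alarm([95.0, 80.0, 40.0, 30.0]))  # two CPUs over the threshold
print(cpu_soft_alarm([95.0, 80.0, 75.0, 30.0]))  # three CPUs over the threshold
```

With two busy CPUs the rule stays quiet, and it fires only once a third crosses the line.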

Operators must be careful to consider more than just percentages. "If you determine drop rate is an important KPI for wireless calls, you could have a statistical flop on your hands if you set a threshold of 50 percent, and then alarms go off at 2 a.m. because one of two calls over a network was dropped," explains Itzigaohan. "Make sure the rules say to issue an alarm when the drop rate is more than 50 percent at times when there are more than 100 calls, for example."
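The guard he describes amounts to checking call volume before checking the rate. A minimal sketch, with a hypothetical function name and the 50 percent and 100-call figures taken from his example:

```python
def drop_rate_alarm(dropped, total, rate_threshold=0.5, min_calls=100):
    """Alarm only when volume exceeds min_calls and the drop rate is too high."""
    if total <= min_calls:
        return False  # too few calls for the percentage to be meaningful
    return dropped / total > rate_threshold


print(drop_rate_alarm(1, 2))     # the 2 a.m. case: one of two calls dropped
print(drop_rate_alarm(61, 120))  # busy period with more than half dropped
```

The 2 a.m. case no longer triggers an alarm, while a genuine problem during a busy period still does.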

Traditionally, KPIs were managed by hardware that detected drop rates in cell calls, but 3G networks will demand that service providers implement analytical software that detects behavioral anomalies of the network and compares thresholds. That will be the only way to guarantee that certain grades of service are being delivered.


Bell Canada Customers Drive Activation

Bell Canada’s VPN-e service offering for large enterprises is an MPLS VPN, which offers flow-through provisioning with SLA, customer care and automated order management.

Michael Cole, Bell Canada’s vice president of information systems and technology, has worked to build an OSS infrastructure that will enable on-the-fly provisioning of bandwidth and class-of-service offerings for multiple locations through its Web portal.

“To offer this service, we first ensured that in the core network, the MPLS systematically knows how to label and address packets so latency requests are respected. That way, if a customer provisions the VPN at a rate of 20 Mbps with latency set at an email tolerance, they can crank the bandwidth up to 30 Mbps and up the latency for a better image during an important videoconference,” says Cole. He concedes that allowing customers to drive activation and scheduling on the Web is complex, as it entails an understanding of the end-to-end process needed to closely monitor provisioning, activation, SLAs, performance and fault management.

Although RBOCs use traditional solutions and in-house development, Cole says Bell Canada prefers using third-party products to create what he calls a “comprehensive but evolving solution” from companies Bell Canada felt confident could scale up to Tier 1 business. “While a lot of solutions are scalable enough for CLEC and greenfield environments, we needed great scalability for the hundreds of enterprise customers we needed to support,” he says. “We also needed to get mileage out of investments we already made in inventory databases and secure portal infrastructure.”

Pulling the end-to-end process into a comprehensive system was challenging, but Cole believes the key to success lies in the integration of key components: a Web portal built with a BEA application, the framework built by IT integrator CGI, and an EAI middleware bus from Vitria, used for order management and workflow. Accenture provided integration work and Syndesis provided sophisticated activation, all within Cisco Layer 3 core and edge technology. Fault management/surveillance for Layer 2 and Layer 3 comes from HP OpenView. Smarts, a fault management vendor that combines historical reporting with real-time fault management for automated network analysis, is used for correlation. An in-house SLA and performance reporting piece has been built on an Oracle database, which is accessed through the portal infrastructure to enable customers to get online network status and SLA information securely.

For Cole, the next challenge will be integration with CRM and SLA management. “That will give us an element of speed in delivering the solution to the market,” he says.
