Root Cause Analysis in IP: Digging Deep for QoS Data

IP services are application-driven, which means breaches of QoS thresholds and degradation in service must be recognized and resolved proactively. The key lies in data already residing in operators' networks; the challenge is to find the mechanisms and procedures for extracting the necessary data.

Carriers will need a QoS-based approach to root cause analysis (RCA) to meet customer expectations for services. Although the industry has pretty much mastered the "bottom-up" approach to discovering and mitigating network-related alarms—following an analysis "tree" to determine at some point the affected customers—a "top-down" view is what is truly needed in IP. This entails a view of the business processes and application databases touching performance, fault and other management systems, so that customer segments experiencing a degradation of quality or breach of SLAs are noticed immediately for proactive user quality management.

Rather than continuing to waste many man-hours—as much as 60 percent of their time—looking for alarms that ultimately have no impact on customers, it will be important for ILECs, CLECs, ISPs, cable companies and wireless providers to have some sort of mechanism that will diagnose degradation in service quality and help communicate the root of the problem. With converged services, that is challenging. MPLS, voice and other services are all engineered differently, so root cause analysis of degradation is difficult.

"Post-mortem diagnosis will not be acceptable, which is why we need real-time traffic analysis, systemic trending or traffic analysis, and fraud detection. Data needs to be visible not only post-collection, but as it is streaming toward collectors," says Phil Jacobs, VP for engineering at LDMI Telecommunications, which offers local and long-distance phone service, data services, security, web hosting and network services to business customers in Ohio and Michigan. "You need something to smack you in the head and say, 'Check me, check me.'"

With Real-time Transport Protocol (RTP) streams, it is possible to capture chunks of conversation to conduct test scoring between devices on the customer premises and voice gateways. Real-time coordination, however, is made difficult by the amount of SS7 signaling and the variety of components and devices involved (core routers, core switches, session border controllers, call agents, softswitch gateways), as well as by data coordination across multiple protocols and architectures, such as MGCP, H.323, RTSP, SAP, SIP and IMS (IP Multimedia Subsystem).

Traditional performance and fault management systems allow companies to "peek" at certain facets of pertinent data, but carriers need the "gestalt," according to Jacobs, referring to a configuration so unified as a whole that its properties cannot be derived from its parts. "You look at bring-your-own-broadband type services like Vonage, and you see why knowing the anatomy of a call is going to be critical."

Traditional systems built for monitoring TDM and circuit-switched networks and services cannot deliver a comprehensive picture of how calls traverse not only physical hardware, but also servers, software and databases in IP. "Monitoring tools need to map the anatomy of each transaction, whether embedded or encapsulated in RTP streams. Data coordination across multiple systems must be a part of management systems," says Jacobs, who wants to be able to look in parallel at all events occurring on his network. He currently is employing solutions from Telecom Analysis Software for monitoring all OM data and AMA data on all his DMS500s. "Real-time trending will be necessary as things happen in network, so I can monitor when something breaks a threshold and affects quality," Jacobs says. He is watching closely as vendors and consortia try to resolve queuing, latency and traffic shaping issues.

"As we extend our IP network to handle quadruple play, the management of wireless devices will make it that much more difficult," admits Jacobs. The key will be the ability to somehow generate a "mean opinion score" (MOS) to gauge voice, download and video quality. The most effective way to make that possible would be to have software residing on devices to monitor quality of voice, video, ring tone or downloads.

That type of specialized software may enable carriers to extract, format and analyze raw data so that "soft alarms" (see below) can be created to warn engineering and operations personnel when total calls presented fall below an acceptable threshold, and to activate deployment of technicians and proactive calls to affected customers.

Micromuse, through its acquisition of performance management company Quallaby, is attempting to do that very thing. "We are developing software that will reside on cell phones and other devices to monitor voice, data and video quality through algorithms," says Scott Sobers, director of solutions marketing. He explains that algorithms will measure an ideal voice, download or video session against the actual quality on a device to produce an MOS.

He admits, however, that a "gray area" still exists, because edge testers provide a limited view of service quality, as they are "fixed." In other words, SS7 signaling explains when a call is dropped, but tells nothing about the call quality that the consumer is actually experiencing. "When something leaves the edge testers, looking at SS7 signaling can tell you when signaling is dropped, but tells you nothing about call quality," says Sobers. "It's like a chef who controls the quality of an entrée in his kitchen, but once it goes into the dining room, it could be dropped, get cold, not even reach the table; the chef does not know what the consumer's experience is unless the consumer complains. And this is what carriers want to avoid at all costs."

To help carriers notice quality degradation and home in on whether it's the tower, handset or application that is the cause, tools like Micromuse's Netcool suite should provide a visualization of not only the network, but the entire service itself. "IP services are not just made up of physical devices at layers 1 to 3, but also databases and applications," says Sobers. He notes that application monitoring at layer 7 and business services management at layer 8 are critical. "Migrating to a converged IP NGN … is an evolution," he says, "as carriers should be able to leverage their existing investments and manage across silos, domains and services."

Paramount will be analysis of capacity utilization, which involves trending network capacity over time. Because services today reside in silos, alarms in one service don't really impact others. That will change with converged networks, as broadband services like VoD, VoIP and high-speed Internet will ride over one pipe, ultimately interacting with and affecting one another. That means a spike in usage caused by IPTV or VoD could affect VoIP quality, even if physical alarms aren't invoked. "If a user experiences poor video quality during a viewing of 'Nemo,' that is viewed by the content provider as brand erosion," says Sobers.

That's why content providers will not offer their immense libraries unless QoS can be guaranteed—not just across the network, but down to the TV or appliance in customers' homes.

Soft Alarms
In IP, raw root-level data will take on a more important role, as different data types that once had little meaning will become the nucleus for soft alarms.

In today's networks, hundreds of alarms are triggered when something blows a fuse or when a trunk card goes bad. Knowing about the outage or error, however, doesn't tell a carrier that availability has suddenly fallen below 95 percent, in violation of an SLA with an important customer. Nor does it tell a carrier when average holding times change slightly, or when early disconnects become more prevalent.

Soft alarms are necessary to monitor those service quality variables. Manufacturers have provided registers that track events, but automatic data collection and soft alarm capabilities have been lacking. Root cause analysis, today largely an SS7 monitoring technology for "hard" alarms on network elements (NEs) and equipment, will have to expand to embrace "soft" alarms that flag actual degradations in service, such as increases in average holding times or in disconnect instances, slow data downloads, or jitter and latency in video.
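A soft alarm of this kind can be thought of as a threshold rule applied to a service metric rather than to a piece of equipment. The following Python sketch is illustrative only: the metric names, the threshold values other than the 95 percent availability figure, and the `SoftAlarmRule` structure are assumptions, not any vendor's implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SoftAlarmRule:
    metric: str        # hypothetical metric name
    threshold: float   # limit agreed in the SLA or set by operations
    breach_when: str   # "below" or "above"


def evaluate_soft_alarms(samples, rules):
    """Return (metric, average, threshold) for every rule whose average breaches its limit."""
    alarms = []
    for rule in rules:
        values = samples.get(rule.metric)
        if not values:
            continue  # no data collected for this metric in the window
        avg = mean(values)
        if rule.breach_when == "below":
            breached = avg < rule.threshold
        else:
            breached = avg > rule.threshold
        if breached:
            alarms.append((rule.metric, avg, rule.threshold))
    return alarms


# Illustrative rules: only the 95 percent availability figure comes from the text.
rules = [
    SoftAlarmRule("availability_pct", 95.0, "below"),
    SoftAlarmRule("avg_holding_time_s", 240.0, "above"),
    SoftAlarmRule("early_disconnect_pct", 2.0, "above"),
]

# Made-up samples for one measurement window.
samples = {
    "availability_pct": [96.0, 94.0, 92.0],
    "avg_holding_time_s": [180.0, 200.0, 190.0],
    "early_disconnect_pct": [1.0, 3.0, 5.0],
}

alarms = evaluate_soft_alarms(samples, rules)
for metric, avg, threshold in alarms:
    print(f"SOFT ALARM {metric}: average {avg:.1f} breaches threshold {threshold:.1f}")
```

In this made-up window no hardware has failed, yet two rules fire: availability and early disconnects. That is the situation hard alarms alone would miss.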

"The level of service management necessary to make soft alarms a reality does not exist right now," observes CoManage CTO Andy Fraley. "Bad data still permeates most carriers' systems." Even carriers that are "more advanced" need to do more than correlate alarms to affected customers, SLAs and contact information on a weekly or nightly basis. "By next year," he says, "carriers will have to do such correlations automatically and in real time."

One of the big problems is faulty data and data records, which can generate "real-time bad data multipliers," Fraley says. "One bad record or data entry mess-up at a low level can cause multiple identities to spawn thousands of inaccurate correlations higher up."

Such multipliers can create tens of thousands of inaccurate correlations of events, which then lead to inaccurate routing and inventory results for possibly thousands of services.

"Some aggressive carriers are starting to run data integrity management between the network and inventory systems, which means every customer circuit, every cross-connect, every channel assignment is compared and managed between inventory systems and the network," Fraley explains.
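In its simplest form, the comparison Fraley describes amounts to reconciling two views of the same circuits. The Python sketch below assumes each view is a mapping from a circuit ID to an (element, port) assignment; the record layout, identifiers and function name are hypothetical.

```python
def reconcile(inventory, discovered):
    """Compare circuit assignments recorded in inventory with those found on the network.

    Both arguments map a circuit ID to an (element, port) tuple.
    """
    inv_ids = set(inventory)
    net_ids = set(discovered)
    return {
        # Circuits the inventory believes exist but the network does not show.
        "missing_in_network": sorted(inv_ids - net_ids),
        # Circuits live on the network that were never recorded.
        "missing_in_inventory": sorted(net_ids - inv_ids),
        # Circuits known to both, but with conflicting assignments.
        "mismatched": sorted(
            c for c in inv_ids & net_ids if inventory[c] != discovered[c]
        ),
    }


# Made-up records for illustration.
inventory = {
    "CKT-001": ("switch-a", "port-1"),
    "CKT-002": ("switch-a", "port-2"),
    "CKT-003": ("switch-b", "port-7"),
}
discovered = {
    "CKT-001": ("switch-a", "port-1"),
    "CKT-002": ("switch-a", "port-9"),   # cross-connect differs from inventory
    "CKT-004": ("switch-c", "port-3"),   # never entered in inventory
}

report = reconcile(inventory, discovered)
print(report)
```

Each discrepancy found this way is one less bad record available to feed the multiplier effect described above.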

Such practices can, in theory, facilitate higher-speed look-ups when ports go out, so that carriers can see what services are dependent on certain elements, after which point analysis fans out to show how many services and rider services are dependent on failed components for further correlation.

"Carriers will need a bird's-eye view, which means data cannot be maintained in silos, as the information then becomes very inconsistent across the organization," says Noga Confinao, managing director for Ace-Comm's European operations. Ace-Comm focuses on acquiring "best-of-breed information" from a network level, to foster business intelligence for understanding data sitting in silos across organizations.

"Whether efficiency improvement, cost reduction, revenue improvement—[all] are subsets of revenue assurance that tie into SLAs, as underbilling or lost assets ultimately affect a carrier's ability to optimize operations and deliver compelling services," says Confinao.

At the core of optimizing RCA will be data integration. "A single element like a single switch port can be represented in many ways, according to who is looking at it," explains Ace-Comm CTO Jean-Francois Jodouin. "A switch engineer will look at the switch differently than a billing person or a network planner." That can lead to problems, since the same switch port can be represented differently in applications across the network, which greatly complicates RCA.

To prevent the practice of recording the same data under different guises, raw data must be seized and summarized for comparative analysis among dissimilar devices. For example, how a "seizure" (a request for service) is recorded depends on whether the data comes off a SIP or a TDM switch. In other words, the same action could have different names, such as a peg-counter record on a TDM switch or a session initiation in SIP.
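One way to picture that summarization step is a lookup table that maps each source-specific event name onto a single canonical name before counting. In the Python sketch below, `INVITE` and `BYE` are real SIP methods, but the TDM register names, the canonical names and the record layout are assumptions made for illustration.

```python
from collections import Counter

# Maps (source type, native event name) to one canonical event name.
# The TDM names are hypothetical; real switches use vendor-specific registers.
CANONICAL_EVENTS = {
    ("tdm", "TRUNK_SEIZURE"): "service_request",
    ("sip", "INVITE"): "service_request",
    ("tdm", "TRUNK_RELEASE"): "service_release",
    ("sip", "BYE"): "service_release",
}


def summarize(raw_events):
    """Count events by canonical name; unmapped events are tallied as 'unknown'."""
    counts = Counter()
    for source_type, native_name in raw_events:
        counts[CANONICAL_EVENTS.get((source_type, native_name), "unknown")] += 1
    return counts


# Made-up raw events from two dissimilar devices.
raw_events = [
    ("tdm", "TRUNK_SEIZURE"),
    ("sip", "INVITE"),
    ("sip", "INVITE"),
    ("tdm", "TRUNK_RELEASE"),
    ("sip", "BYE"),
    ("sip", "OPTIONS"),  # not mapped in this sketch
]

summary = summarize(raw_events)
print(dict(summary))
```

Once both devices report under the same name, requests for service can be compared across the SIP and TDM sides of the network.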

For that reason, data integration will be at the core of QoS-centric services. "That means a service provider company might have to deem a certain switch the master source, and then correlate all other data from billing, network inventory and various databases and switches to that master source," Confinao says. "If there is an access database with multiple switch ports, the carrier will have to make a decision about which is the master key reference, so that customers can begin to be represented in consistent ways."

Having a single repository of data is the goal, as it will enable operators to compare and contrast data sources and foster better data quality to mitigate revenue leakage, handle fraud issues and enhance marketing capabilities. "You need unified data schema so that you know how all bits and pieces fit," says Dave Walters, manager of product marketing for Smarts/EMC Corp., whose SMART Analysis engine is a correlation technology designed for highly automated, real-time analysis and extraction in triple play services.

"Extraction involves getting above every detail and abstracting layers that are useful to you, whether optical, MPLS, IP, server domains, data centers, application domains and so on," says Walters. All these domains have to be managed cohesively so there is a common data model. "Only then can carriers really know the connectivity and interrelationships of the routers, the chassis blades, ports, switch ports and everything up the food chain to servers, switch ports and applications running on top of that, and onto the business processes and customers behind those processes," he says.

For carriers to build preconfigured databases that enable discovery systems to do RCA of soft alarms, it's important to resolve data integrity issues so that assurance and fault management systems extracting data from inventory do not falter.

Because TDM switches produce prodigious amounts of data within their internal systems, it's the raw, root-level data in those switches that holds all the information needed for the underlying networks to know where connections were provided in and out of the network.

Root-level data is unformatted data within a switch or switch-like element, such as software-programmed control switches and media gateway controllers. To get to that data, every element in the network has to be known, and a method is needed for pooling root-level data from each of those elements.

Mechanisms are necessary for extracting unformatted, root-level data from TDM switches, as data integration will rely on extraction of unformatted data. "Raw, unformatted data—root-level data—is more accurate than the AMA data at the output of mediation or input of billing," says Larry Cornwell, managing partner and CEO of Telecom Analysis Software (TAS), which does "interrogation" of root-level data through a command infrastructure that goes into switch elements.

The root-level data gives a complete "chain of evidence," claims Philip Balevre, managing director and COO of TAS, noting that it is also important for all the data to meet reporting criteria for requirements like Sarbanes-Oxley. He believes in extracting signaling data at a lower level than has traditionally been the case with the AMA data (CDRs) used for C-level reporting and analysis. "As root-level events are recorded," Balevre says, "data is extracted from the various registers and taken up a level to be used for analysis, reporting or trending."

Such data can be seized on the trunk usage registers, which contain operational and maintenance data from switches, as well as the billing data (CDRs, AMAs or newer CDR look-alikes) and signaling data (SS7, SIP, MPLS). "Data can be extracted directly from the network elements, although SIP and MPLS will probably require the use of a probe tool that plugs into elements on LANs, WANs and Ethernets," adds Cornwell.

When carriers stick media gateways or media gateway controllers into converging softswitch architectures, there still exists a need to link to TDM switches. "Those switches still want to view the incoming traffic as an incoming trunk group or as Ethernet. If monitoring that device at a TDM level, then you are tracking circuitry from a media gateway even if it is a soft circuit, because TDM remains the heart of the network," explains Balevre. He notes that carriers have to have control of the core point of their network before worrying about distributing elements.

The point is that even in a virtual network, sooner or later a real circuit is hit. "It has to go to a telephone eventually," says Cornwell. "If a carrier doesn't have TDM switches, and they employ media gateway controllers and other softswitch technology, it still looks the same, because the controller is still a central point of collection; then you can expand data collection from the central point and out to be distributed," says Balevre. One hundred percent of data will go to the core of the network, "regardless of what you call it."

As carriers expand, they can add software modules or probes into signaling networks and compare distributed network elements to core network elements using core data from their management systems. If the elements don't match up, they can then analyze the reason.

Feeding Data to OSS
Until the Micromuses of the world expand into order entry, or the Siebels into fault management, or the Granites into billing, an accurate view of service management will rely on determining where, among the many components comprising IP networks, to insert mechanisms for obtaining necessary QoS data.

That's pushing fault management companies to hash out all of the correlation headaches, and inventory vendors to figure out how to procure all the detailed customer data.

The accuracy of the component configurations fed to fault management and assurance systems from inventory, network, order and CRM systems relies on the ability to find all fully aligned circuits. That means carriers must have end-to-end views of what the service components in inventory should be and compare them with the actual view of the network.

From the discovery side, fault management systems that do high-speed correlations need comprehensive databases containing all the service, circuit structure, associated customer names and contact information. "The discovery side then provides all detailed parameters and logical ports that do not exist in inventory, and a 'tree' of correlations and dependencies can be devised," says Fraley at CoManage. "Then, as events come up, equipment can be tied to links or configuration sets based on the tree of dependencies, so that service providers can apply correlated states up the tree, thus creating a cascade of event flows."
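The "tree" of dependencies lends itself to a simple graph walk: starting from a failed element, follow every "depends on" relationship upward to find the links, services and customers affected. The Python sketch below is a minimal illustration under assumed names; the node identifiers and the shape of the `depends_on` mapping are hypothetical.

```python
from collections import defaultdict, deque


def impacted_by(failed, depends_on):
    """Return everything that directly or indirectly depends on the failed element.

    depends_on maps an item to the list of items it relies on.
    """
    # Invert the mapping so each element lists what rides on top of it.
    dependents = defaultdict(set)
    for item, parents in depends_on.items():
        for parent in parents:
            dependents[parent].add(item)

    # Breadth-first walk up the tree from the failed element.
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Made-up topology: ports carry links, links carry services.
depends_on = {
    "link-A": ["port-1"],
    "vpn-acme": ["link-A"],
    "voip-acme": ["vpn-acme"],
    "link-B": ["port-2"],
    "video-zed": ["link-B"],
}

affected = impacted_by("port-1", depends_on)
print(sorted(affected))
```

A failure on port-1 cascades to the link and both services riding on it, while the services on port-2 are untouched. That is the correlated state a fault management system would push up the tree.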

Once every element and every service that rides on top of network elements is known, carriers can determine which customers subscribe to what services. Only then can carriers decide whether alarms—soft or hard—are really impacting services, customers or ultimately revenues, and whether SLAs are being violated.

For IP-based services such as VoIP, streaming video or videoconferencing to deliver on QoS promises, SLAs can no longer be treated as a generic entity; they must differ according to whether they are based on TDM circuits, IP or VPN networks, or something else.

And to really utilize SLAs, carriers must tie degrading service to a database containing information about contracts. If there's a bandwidth spike that causes a millisecond of down time, root-level data must be looked at as a business process, so that logistics and process flows can zero in on bottlenecks causing degradations. "To build a service-level view of services, carriers must see not only routers, circuits and network boards involved in service delivery, but the servers, databases and applications sitting on top," says Sobers. He points out that in April, Micromuse launched a business services management solution that does modeling and monitoring of applications and databases. "You look at a company like BT, which needs 99.99 [percent] up time in their managed services, and that translates into six minutes of downtime they are allowed each year," he says. "After five seconds of downtime, it's critical they know what customers are down and how to notify them."
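The arithmetic linking an availability target to a yearly downtime budget can be sketched in a few lines of Python. The 365-day year and the function name are assumptions. Under them, 99.99 percent works out to roughly 53 minutes a year, while a budget in the neighborhood of the six minutes quoted corresponds to 99.999 percent.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```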

For the moment, data integrity and integration initiatives are picking up momentum, which means SLAs will continue to be neglected. However, as carriers clean up their data issues, and as vendors and consortia figure out the algorithms and mechanisms for extracting data relating to quality, then thresholds for availability, bandwidth, jitter and latency will be monitored and proactively acted upon according to SLAs.

Without that ability, content providers will continue to hold onto content, since a lack of QoS guarantees could hurt the way their content is perceived. That means major carriers and alternative broadband service providers must step up technology efforts to offer triple and quadruple play.