Usage Collection and Analysis in an IP OSS

Internet service providers (ISPs) are faced with the challenge of evolving their network from one that is optimized for data to one offering advanced, value-added services. This evolution is driven by two key factors. The first and most compelling is the opportunity to offer high-margin network services for video, voice, secure VPN, unified messaging, multimedia-rich gaming, multicast, circuit emulation and e-commerce applications on the same access pipe. The second, and equally important, driver is the increased competition for today's core service, Internet access. With dense wavelength division multiplexing technology and increased fiber deployment, the amount of core bandwidth is expected to explode within the next two years (some experts predict a 200-fold increase). As such, bandwidth will increasingly become a commodity and erode ISP access revenue.

A converged IP network won't happen overnight. Instead, it will evolve over time, with ISPs nipping away at services traditionally provided by dedicated networks. Nonetheless, the writing is on the wall. Even today, some ISPs offer free dialup access and generate revenue through advertising and content. Likewise, computer vendors are bundling free access with the sale of a system. High-speed service is following a similar path. DSL and cable modem providers offer broadband access at less than $60 per month - over $1,500 less than a comparable dedicated T1 line! The message is clear to ISPs: roll out a multiservice network, move to usage-based billing, and do it quickly.

Aside from raw bandwidth and quality of service (QoS) protocols, a critical component of a converged IP-based network is a robust operations support system (OSS). Traditionally, network OSSs assist in providing a range of support services such as provisioning, trouble ticketing, billing, customer care and network operations. In a multiservice IP network, as with any provider, a well-designed OSS is a competitive tool that allows providers to cut costs, scale, increase efficiency, and deploy new services quickly. On the other hand, a poorly designed OSS can paralyze an ISP, greatly reducing the speed at which it can turn increasing network bandwidth and advances in QoS protocols into revenue.

Central to the design of an IP OSS is the usage collection subsystem. This is because usage collection mechanisms determine what users are doing on the network, consolidate usage data and package it in a way that is useful for other operation and business support systems. To illustrate the role of the collection system, consider Figure 1. Here, the various pieces of an IP network are classified in a three-layer architecture. The lowest layer includes the network and service elements. Typically, the network elements consist of devices used to move IP packets from a source to destination system (such as routers, ATM and frame switches, access servers, firewalls and gateways). IP service elements consist of end system devices that provide e-mail, Web, messaging, file and other application services. The highest layer includes the business support systems that manage the network, bill customers, configure the network and service elements, and provision services based on customer requirements.

The middle layer, or mediation layer, is the glue between the network elements and business support systems. It consists of the usage collection component, and a provisioning component. (Refer to the October 1998 issue of Billing World for an in-depth explanation of IP mediation.) Although provisioning is a key issue within the mediation layer, solutions exist for this purpose today (e.g., scripts can easily add or remove an e-mail and Web account). In the future, directory-enabled network (DEN) efforts from Cisco, Sun, Microsoft and Novell along with access standards such as LDAP will better solve this problem.

The challenge today is with the usage collection subsystem. In many ways, that component is the core of the IP OSS since it feeds the business support systems the data required to do their job. For example, it provides data for usage-based billing, network monitoring and fraud detection, as well as for understanding how applications and users consume network resources.

Requirements of a usage collection subsystem

The goal of the data collection system is to provide a layer of abstraction that hides the complexity and heterogeneity of the underlying network elements, and presents a clean application programming interface (API) to the business support systems. Therefore, the amount of work the collection system does is the difference between the requirements of the business support systems and the availability of and ease of access to usage information at the element layer. In terms of usage data requirements, the four typical business support systems (BSSs) are billing and customer care, network operations, marketing, and IT departments. Let's consider each of these systems in turn, and then discuss how to collect the usage data.

Billing and Customer Care: In order for ISPs to bill for service other than basic subscription, they must be able to determine usage per account. An account might be a single residential user, it might be a small company, or possibly an entire department within a large enterprise. In any event, at a minimum, the following data must be collected in order to rate each network transaction: number of bytes, duration, service-level (i.e., specific QoS provisioned) and time of day. Other useful information includes application type (e.g., Web, e-mail or voice), distance (e.g., local, national, or international), and billable content within the IP payload (e.g., copyrighted material requiring a royalty, on-demand use of an application, or an audio clip).
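
As a rough illustration of the minimum data set described above, the following Python sketch models a usage record and rates it. The record layout, the tariff table and the off-peak discount are all hypothetical values chosen for the example; they do not come from any particular billing system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UsageRecord:
    account: str
    start: datetime       # time of day is derived from this timestamp
    duration_s: int       # session duration in seconds (carried, not priced here)
    bytes_total: int      # bytes sent plus bytes received
    service_level: str    # QoS class provisioned for the transaction


# Hypothetical tariff: price per megabyte for each service level.
RATE_PER_MB = {"best_effort": 0.01, "premium": 0.05}
# Hypothetical discount applied outside the 08:00-18:00 peak window.
OFF_PEAK_DISCOUNT = 0.5


def rate(record: UsageRecord) -> float:
    """Return the charge for one record under the hypothetical tariff."""
    megabytes = record.bytes_total / 1_000_000
    charge = megabytes * RATE_PER_MB[record.service_level]
    if not 8 <= record.start.hour < 18:
        charge *= OFF_PEAK_DISCOUNT
    return round(charge, 4)
```

A real rating engine would also price duration, application type, distance and content; they are left out here to keep the sketch short.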

Network Operations: Network engineers are not concerned with billing. Instead, they focus on maintaining the health of the network, capacity planning, server configuration and maintenance (e.g., Web, e-mail and file services), fraud control, trouble ticketing and service monitoring. In this capacity, network support systems look to usage monitoring systems to help identify usage trends, abusive users and service disconnects or performance problems. For example, consider dial-up service. In order to maintain a low blocking rate, network operations centers (NOCs) require accurate dial-up usage patterns and service denial statistics. Likewise, network engineers are interested in identifying the users who are continuously logged on, or those who use their accounts for bulk e-mail distribution. Identifying such users is critical because, by some estimates, they incur over 50 percent of the costs to a provider yet generate less than two percent of the revenue. Many providers may not wish to bill per use. Instead, they might choose to block abusive subscribers when resources are low, or enforce some acceptable use limitations to protect themselves (e.g., no more than 100 hours of use per month, or no more than 1,000 e-mails). This problem is further exacerbated with broadband access, since abusive customers can put more than ten times the amount of pressure on backbone and server resources.
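
To make the acceptable-use idea concrete, here is a minimal Python sketch that flags subscribers who exceed the two example limits mentioned above. The input shapes (lists of per-session and per-day tuples) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

# Limits taken from the example in the text.
MAX_HOURS_PER_MONTH = 100
MAX_EMAILS_PER_MONTH = 1000


def find_violators(sessions, emails):
    """Flag users over the monthly limits.

    sessions: iterable of (user, seconds_online) tuples, one per session.
    emails:   iterable of (user, messages_sent) tuples, one per batch.
    Returns a dict mapping each offending user to the limits exceeded.
    """
    seconds = defaultdict(int)
    sent = defaultdict(int)
    for user, secs in sessions:
        seconds[user] += secs
    for user, count in emails:
        sent[user] += count

    violators = {}
    for user in set(seconds) | set(sent):
        reasons = []
        if seconds[user] > MAX_HOURS_PER_MONTH * 3600:
            reasons.append("hours")
        if sent[user] > MAX_EMAILS_PER_MONTH:
            reasons.append("emails")
        if reasons:
            violators[user] = reasons
    return violators
```

Whether a flagged subscriber is blocked, throttled or simply contacted is a policy decision outside the scope of this sketch.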

Marketing: Whereas network operations folks are concerned with identifying abusive customers, marketing and sales look for high-volume customers for a possible line or service upgrade. In this capacity, the usage collection system can be used to identify customers who are likely to upgrade their service, figure out what services customers are using and customize a usage plan. Alternatively, customer usage data can be used to develop pricing plans and package scenarios that maximize profitability (e.g., plan X which includes 100 minutes of video, 200 minutes of voice, etc.).

IT Departments: Within large corporations, IT departments need usage data to control network expenses, particularly on wide-area links and subscription services. Like a billing system, they often look to charge these costs back to individual departments or possibly the application support team. For example, a large company might spend $50K per month for full T3 access to the Internet, or $10K per month on a frame relay or IP VPN service. CFOs would like to identify which departments are using the links, and appropriately allocate that cost back to the individual departments. Similarly, these costs might be allocated back to the application development team. For example, an enterprise resource planning (ERP) application might consume 50 percent of the network resources, or possibly place heavy demand at a month's end closing. Subscription and content services (e.g., Lexis-Nexis and Reuters) can also incur substantial costs. Here, IT departments need to identify employees who subscribe to the services but do not actively use them.
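
The charge-back arithmetic can be sketched in a few lines of Python. This version assumes the simplest possible policy, splitting a fixed monthly link cost in proportion to bytes transferred; the department names and byte counts in the usage example are invented.

```python
def allocate_cost(monthly_cost, bytes_by_department):
    """Split a fixed monthly cost in proportion to each department's bytes."""
    total = sum(bytes_by_department.values())
    if total == 0:
        # No traffic at all: nothing to apportion.
        return {dept: 0.0 for dept in bytes_by_department}
    return {
        dept: round(monthly_cost * used / total, 2)
        for dept, used in bytes_by_department.items()
    }


# Example: a $50K-per-month T3, with ERP traffic accounting for half the bytes.
shares = allocate_cost(50_000, {"erp": 500, "sales": 300, "engineering": 200})
```

Other policies are equally plausible, such as weighting peak-hour traffic more heavily, but they all depend on the same per-department usage data.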

Design of a usage collection system

If high-resolution usage information is the goal, the question is how do you collect it? Within a circuit-switched network this is straightforward, because Bellcore, TMN, and ANSI standards give clear specifications on each element's functionality and interface (e.g., CMIP, SNMP, Q3 and CORBA). Unfortunately, such standards are not defined for IP networks. Nor can they be ported from circuit switched standards, because the networks have fundamentally different operational and technical characteristics. It is also worth noting that the key standards body (the Internet Engineering Task Force) is not aggressively addressing the problem (as it is too close to industry).

The largest obstacle is that the IP protocols were not originally designed to provide feedback concerning network or application usage, or service-level performance. Instead, the intelligence and management operations were handled by end systems (e.g., TCP). The result is that usage information is collected by the various network and service elements and stored in "log" files. One mediation design approach is to "mine" log files for particular usage data, and then forward it to the business support systems. For example, when a user dials into an IP network, the access server collects a user ID and password. The log-in information is then authenticated at a RADIUS server, which logs the event. At month's end, the amount of time the user spends on the network can be determined by consulting the log files. The same approach can be used to account for e-mail or Web hits.
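
A minimal Python sketch of the log-mining approach follows. It assumes a deliberately simplified, hypothetical log format of one event per line ("timestamp user start|stop"); real RADIUS accounting logs are richer and vary by server.

```python
from collections import defaultdict
from datetime import datetime


def session_seconds(log_lines):
    """Total connected time per user from start/stop events.

    Each line is assumed to look like:
        1999-01-05T20:00:00 alice start
    Sessions that never log a stop event are ignored.
    """
    open_sessions = {}
    totals = defaultdict(float)
    for line in log_lines:
        stamp, user, event = line.split()
        when = datetime.fromisoformat(stamp)
        if event == "start":
            open_sessions[user] = when
        elif event == "stop" and user in open_sessions:
            started = open_sessions.pop(user)
            totals[user] += (when - started).total_seconds()
    return dict(totals)
```

Note that the sketch silently drops any session with a missing stop record, which hints at the completeness problems with log files discussed in the next section.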

What's wrong with mining logs?

The log-file approach essentially leaves usage collection (i.e., metering) up to the various network and service elements. This approach works well in circuit-switched networks because the ingress switch is a natural place to collect usage information. In IP networks, relying on the many network and service elements to capture usage information has the following problems:

Completeness: The biggest limitation is that network and service elements are not capable of capturing all transaction information. For example, two PCs can easily start sending video without using an H.323 gatekeeper to set up the session and record usage information (note that this is also an option within Microsoft NetMeeting). In such cases, there is no device to capture usage information. So how can the transaction be captured and subsequently billed? Some router vendors, such as Cisco, can detect long-running packet flows, such as video, and generate a corresponding detail record (a feature called NetFlow in Cisco gear). To the router, however, the video flow appears like any other data flow (e.g., a file transfer). Thus, the application type is not easily identified without a stateful inspection of the application-level H.245 call setup protocol - something well beyond the capabilities of a router. Further, routers are not the ideal instrument to evaluate the application content (i.e., session information) of an IP packet. For example, a router could not gather usage statistics on subscription services, as that information is contained within the payload of the IP and TCP protocols. Thus, the log-file approach combined with NetFlow does not give complete usage information.

Performance: Enabling usage-capturing features within the network elements can place substantial loads on the device. Since the network elements are generally not designed or optimized for this purpose, such features can degrade packet forwarding or server performance and add to transmission delays. For example, by some estimates, enabling NetFlow on a Cisco router can consume as much as 30 percent of the CPU. Further, accessing the data via SNMP or log files adds to the CPU and I/O load, and can add substantial overhead on network links.

Real-time: Accessing usage data in real time is often critical for many BSS mechanisms such as fraud detection and service authorization. By their nature, log files are not real-time. Thus, this functionality has to be added (which is expensive and proprietary), or the real-time service must be sacrificed.

Central access: Since each network element captures only information relevant to the specific service it provides, generating a complete event requires that the data be aggregated and correlated into a single event. Collecting usage data in this environment is a daunting task because there are dozens of sources, each with a proprietary access and data format, and because the data can be short-lived. Further, the individual data is often meaningless by itself; the log-file approach requires sophisticated refinement and correlation techniques.
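
The correlation step can be illustrated with a small Python sketch that joins two sources: access-server sessions, which know the user, and router flow records, which know only the IP address. The tuple layouts are hypothetical, and the example assumes that addresses are reassigned between dial-up sessions, which is why the time window matters.

```python
def attribute_bytes(sessions, flows):
    """Attribute flow bytes to users by matching IP address and time window.

    sessions: list of (user, ip, start, end) tuples, times in epoch seconds.
    flows:    list of (ip, time, nbytes) tuples from a different element.
    Flows that match no session are left unattributed.
    """
    usage = {}
    for ip, when, nbytes in flows:
        for user, sess_ip, start, end in sessions:
            if sess_ip == ip and start <= when < end:
                usage[user] = usage.get(user, 0) + nbytes
                break
    return usage
```

Even this toy join requires both sources to share a clock and an address format; with dozens of proprietary sources, the refinement work grows quickly.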

Maintenance: Maintaining a log-based solution is a considerable challenge. Disks fill up with log files, network elements sometimes fail to log all transactions, new software releases may require an interface modification, and network upgrades and new installs require integration. Additionally, the log approach may not scale as network and server loads increase with subscriber growth and broadband access deployment.

A different approach: Semantic Traffic Analysis (STA)

A different approach to collecting usage data holds promise. Semantic traffic analysis is a technique in which a special device, called a probe, is strategically placed in the network. The probe is configured to passively capture every IP packet transmitted on the wire, and analyze the header and specific data fields. In real time, the probe extracts relevant usage and accounting information, and forwards it to the appropriate BSS. In some ways, the probe is analogous to a meter measuring electricity or water usage. However, the semantic traffic analyzer is far more flexible and powerful because it is capable of generating both low-resolution usage information (e.g., volume of data sent or received by an end system or computer) as well as high-resolution application-level (i.e., session) data. For example, the STA can trace every packet sent by a particular user, extracting such information as Web sites visited, pages viewed, minutes of video sent, e-mails sent and more.
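
The low-resolution half of this idea, metering volume per end system from packet headers, can be sketched in Python. The code below parses the fixed 20-byte IPv4 header with the standard struct module and totals bytes per source address. It assumes the packets have already been captured as raw bytes starting at the IP header; the capture mechanism and the application-level (semantic) analysis are not shown, and the function names are illustrative.

```python
import socket
import struct
from collections import Counter


def parse_ipv4_header(packet: bytes):
    """Decode the fixed 20-byte portion of an IPv4 header."""
    (version_ihl, _tos, total_length, _ident, _frag,
     _ttl, protocol, _checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20])
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    return {
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "protocol": protocol,            # e.g. 6 = TCP, 17 = UDP
        "length": total_length,          # header plus payload, in bytes
        "header_len": (version_ihl & 0x0F) * 4,
    }


def meter(packets):
    """Total IP bytes sent per source address."""
    bytes_by_source = Counter()
    for packet in packets:
        header = parse_ipv4_header(packet)
        bytes_by_source[header["src"]] += header["length"]
    return dict(bytes_by_source)
```

High-resolution data such as pages viewed or minutes of video would require following the transport and application protocols carried in the payload, which is well beyond this sketch.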

Compared to the log-file approach, semantic traffic analysis has the following key advantages. First, the probe does not rely on network elements to record information. Instead, it analyzes control packets exchanged between the server and end systems. This allows STA to precisely figure out what users are doing, without disturbing the operation of the network elements, servers or end systems. Thus, the performance and maintenance issues found in log-based schemes go away (since the probe is a non-intrusive, passive device). Secondly, since the probe captures everything on the wire, it is a complete solution in that it does not require usage information to be correlated or synchronized. Finally, the usage information is collected in real time, which is critical for fraud, pre-paid billing and network operations.

In an enterprise network, a probe is located at the network exchange point between the carrier and enterprise network (typically, outside the firewall, because firewalls can shape user traffic). The probe captures all inbound and outbound traffic with the public IP carrier. As such, the traffic analysis software can be configured to collect detailed network usage for each department, employee or application within the organization. Such data can be used in the charge-back example given above. Additional probes can be added to account for network usage within the enterprise. For example, branch offices and telecommuters usually access the corporate headquarters via public frame relay service. By locating a probe on the frame interface at the headquarters, the traffic analyzer can capture intra-company usage as well. This approach becomes increasingly attractive as corporations begin to move voice traffic over their frame service as a way to reduce costs. In this case, voice gatekeepers are unnecessary, so probes are the only way to capture usage information for auditing purposes. Finally, STAs can be used by IT departments to audit usage-based content and subscription services. Thus, IT managers can determine whether or not an employee is using a subscription or information service, or if an employee is abusing a service (e.g., auditing a telecommuting service).

Within an ISP, probes can effectively provide detailed usage information for billing as well as network operations. As mentioned earlier, it is critical for an ISP to identify customers who are online 24x7, reselling service, spamming or otherwise abusing service. As ISPs, cable providers and CLECs roll out broadband service, accounting for usage becomes critical. That is, an ISP can price DSL service at $60 per month for as long as the customer sends a comparable amount of data as with dial-in. However, if the customer takes DSL access and constantly sends a full megabit of data per second, that customer is behaving more like a dedicated T1 customer today. Thus, this service should be priced closer to T1 service. In the access scenario, a probe can be located at the ISP's point of presence (POP) or headend, thereby providing a single usage port for that customer base. In addition to volume of data and application type, probing allows ISPs to bill for content such as copyrighted information (contained within an HTTP packet) and items that must be retrieved via an expensive transcontinental link (e.g., pages that are not in a local cache), and gives providers the ability to screen content (as is required by some European governments).

Figure 2 shows a typical ISP point of presence (POP) environment as well as the location of an STA probe. Here, a probe is located at the network exchange point between the backbone and the POP. The probe captures all inbound and outbound traffic sent between the Internet and all the dial-up and dedicated access users (both T1/T3 and xDSL). Depending on data collection requirements, STA probes can be configured to collect information about specific users (by IP address, dial-up username or IP group), collect information for all users, or collect information about all or some specific protocols and applications.

Although Semantic Traffic Analysis is a very powerful approach to usage collection, there is no free lunch. Probes are an additional network element that must be installed and maintained. In most environments (OC-3 or below), probing equipment can be as simple as an Intel-based box. However, a high-speed environment (OC-12 or OC-48) currently requires a more powerful box (e.g., a Sun Solaris box) to handle the massive volumes of data generated, since the data has to be processed in real time, with 100 percent accuracy for any carrier-grade deployment. The second key problem relates to probe location. A large-scale carrier might have hundreds of POPs. This requires the deployment of a network of probes that must be synchronized to remove duplicate events as well as provide redundancy and fault tolerance. This is a tough problem requiring a well thought-out solution, one that minimizes network overhead without sacrificing robust usage collection capabilities.
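
The duplicate-removal requirement can be sketched as follows in Python. The event fields and the choice of key are assumptions for illustration: the sketch treats two events as the same if they share addresses, ports, protocol and start time, which presumes the probes' clocks are perfectly synchronized. A production system would need some tolerance on the timestamp.

```python
def deduplicate(events):
    """Keep the first copy of each event reported by multiple probes.

    events: iterable of dicts with the hypothetical keys
            src, dst, sport, dport, proto, start (plus any others).
    """
    seen = set()
    unique = []
    for event in events:
        key = (event["src"], event["dst"], event["sport"],
               event["dport"], event["proto"], event["start"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Holding the set of seen keys in one place also implies a central collector, which is itself a point of failure that the redundancy requirement has to address.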

Conclusion

From a business development perspective, the key question is how do ISPs evolve their service offerings and bill for them? This is a complex issue. The trouble is that the Internet has been built from the ground up as a flat-rate service. Thus, there is some concern that consumers aren't ready to be billed using fine-granularity metrics such as number of bytes, distance, latency, jitter or other QoS metrics. Likewise, applications are not quite ready since they typically are inefficient and do not provide adequate feedback regarding usage. Marketing doesn't know how to promote, discount or bundle services. Finally, billing systems are lagging behind because most don't support real-time authentication, authorization and accounting (see "Batch Systems for Internet Billing? Think Real-Time!" in Billing World Magazine, Jan 1999).

Nonetheless, there is tremendous momentum in the industry to move to high-speed, packet-based networks. T1-speed (1.5 Mbps) DSL home service in California already costs only $49/month, and cable modem access at these (and higher) speeds is even less expensive. For providers to survive and develop long-term, sustained and profitable business models, they need detailed usage information about their customers.

How providers collect usage information, what they collect, and how they bill for it is a critical design decision that greatly impacts their success. Fortunately, the technology is improving daily and many industrial-strength metering solutions, such as the Semantic Traffic Analysis approach, are starting to emerge.

Matthew Lucas is an IP billing and multicast system consultant. He holds a Ph.D. in computer science from University of Virginia and a B.S. in math/computer science from Carnegie-Mellon University. Matthew can be reached at matt@TeleStrategies.com.

Ori Cohen founded NARUS in November, 1997. Narus is the leading provider of carrier-grade Semantic Traffic Analysis solutions. Prior to NARUS, Ori was VP of Business and Technology Development for VDOnet, a leading provider of Internet video conferencing and video on demand software solutions. Prior to VDOnet, he served as CEO for IntelliCom Ltd. Ori holds a Ph.D. in Physics from Imperial College (London, UK).
