Learning from Past LTE Outages
As today’s network operations groups ramp up rollouts for voice over LTE (VoLTE), they’re haunted by the nationwide LTE data outages of a few years ago. Back then, LTE was a different kind of network upgrade. More than just the next generation of circuit-switched technology, it was a large-scale IP infrastructure that had to coexist and work with older technologies. The complexity and scale of these massive, multivendor systems stretched conventional monitoring approaches and tools beyond their limits.
The resulting outages struck at the core values of mobile operators, and did so in a highly public way. They undermined the key principles of reliability and quality of the end-to-end customer experience. Every time there was an outage, it eroded subscribers’ trust, weakened their connection with the operator and opened the door to customer churn.
Throughout all of this, one thing was certain: there was a real need for a new architecture for infrastructure monitoring.
Rethinking VoLTE Monitoring
Fortunately, today’s network operations groups have a new plan of action. First, they’re investing heavily to create a muscular infrastructure. That means plenty of capacity margin in VoLTE servers in the IMS core, but also in a host of other hardware and software elements, from multiple vendors, stretching from the core to the base station where smartphone users actually connect.
Second, they’re phasing in VoLTE services, gradually expanding availability to new markets and new phones. VoLTE traffic is expected to soar in the latter half of 2015, as more areas get it and more subscribers are added.
Yet, even with all of these positive changes, success actually hinges on a third change as well: rethinking and retooling infrastructure monitoring. Past LTE outages highlighted what was needed: access to metrics and statistical data on a much broader range of third-party equipment; near real-time data collection; a processing infrastructure to handle huge amounts of data; the ability to identify normal behavior for each network element and variations from that (baselining); and speed.
As a result, today’s network operations groups have a new and demanding set of requirements. With the help of their equipment suppliers and monitoring tool vendors, they must be able to:
- Get real-time data from all the elements - hardware and software - in the call path
- Determine baseline performance behavior for each element
- Collect and process a vastly larger amount of performance data
- Get fast, accurate, reliable alerts when behavior changes
- Achieve clear, fast visibility into those changes and their causes
- Create closer, more effective and more nimble relationships with suppliers
The good news is that tools are now emerging to satisfy these demands, and performance monitoring vendors are scrambling to support them.
These days, dashboards display a range of key performance indicators showing, at a glance, the health of the network service. Analytic programs are faster, and able to work with a wider array of data formats, including log data. On a deeper level, this new generation of monitoring tools exploits distributed computing and parallel processing, making it possible to handle the vastly larger amounts of raw data and statistics that end up in reports and dashboard visuals.
Tracking the Life of a Packet
To reach their quality assurance goals, today’s network operations groups have to be able to manage VoLTE as a seamless service. That means tracking the “life of a packet” and its health from one end to another, from the initial call request by the smartphone, to cell tower, through backhaul links, switches, routers and servers, to a staggering array of network services embedded in the IMS core. All of this hardware and software comes from a dozen to two dozen vendors. And all of these elements have to work in microseconds to sustain latency-sensitive call sessions and voice quality.
SNMP is only one source of the data needed to effectively monitor complex services like VoLTE. Monitoring platforms need interfaces or adapters - or the ability to quickly build them - to grab proprietary data formats and raw log data and pull out the key performance indicators of multiple systems, including network probes, policy servers, packet gateways and Ethernet backhaul switches, among others.
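As a rough illustration of what such an adapter layer might look like, the sketch below pulls key performance indicators out of two raw log formats and normalizes them into one record type. Both log formats, the host names and the KPI names are invented for this example; real vendor formats will differ, and each would need its own adapter.

```python
import re
from dataclasses import dataclass


@dataclass
class Kpi:
    element: str  # network element that reported the value
    name: str     # normalized KPI name shared across vendors
    value: float


# Hypothetical log formats, made up for this sketch.
GATEWAY_LINE = re.compile(
    r"^(?P<host>\S+) sessions=(?P<sessions>\d+) drops=(?P<drops>\d+)$"
)
POLICY_LINE = re.compile(
    r"^(?P<host>\S+)\|latency_ms\|(?P<latency>\d+(?:\.\d+)?)$"
)


def parse_gateway(line):
    """Adapter for the assumed packet gateway format."""
    m = GATEWAY_LINE.match(line)
    if not m:
        return []
    host = m.group("host")
    return [
        Kpi(host, "active_sessions", float(m.group("sessions"))),
        Kpi(host, "dropped_sessions", float(m.group("drops"))),
    ]


def parse_policy(line):
    """Adapter for the assumed policy server format."""
    m = POLICY_LINE.match(line)
    if not m:
        return []
    return [Kpi(m.group("host"), "request_latency_ms", float(m.group("latency")))]


# Supporting a new data source means adding one more adapter to this list.
ADAPTERS = [parse_gateway, parse_policy]


def extract_kpis(lines):
    """Run every adapter over every line; unrecognized lines are skipped."""
    kpis = []
    for line in lines:
        for adapter in ADAPTERS:
            kpis.extend(adapter(line.strip()))
    return kpis
```

Keeping each format behind its own small function is one way to get the fast turnaround operators are asking for: a new device type needs a new adapter, not a change to the collection pipeline.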
Because the systems and technologies that make up VoLTE are so new, both operators and their suppliers struggle with a general lack of standards and best practices for monitoring and managing it. For example, CPU utilization might be measured six different ways by six different vendors. Data from real-time, deep packet inspection tools about what’s on the wire needs to be married with performance data from the servers, switches and other gear linked by that wire (that is, with information about “what the equipment reports”) to get an end-to-end view of how the service is really behaving.
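To make the CPU utilization example concrete, here is a minimal sketch of the kind of normalization a monitoring platform might apply before comparing vendors. The four reporting styles and their names are assumptions chosen for illustration, not a catalog of what any particular vendor actually reports.

```python
def normalize_cpu(vendor_style, raw):
    """Convert a vendor-specific CPU reading to percent busy (0 to 100).

    The style names below are hypothetical labels for this sketch.
    """
    if vendor_style == "busy_percent":        # already percent busy
        pct = float(raw)
    elif vendor_style == "idle_percent":      # vendor reports percent idle
        pct = 100.0 - float(raw)
    elif vendor_style == "busy_fraction":     # vendor reports 0.0 to 1.0
        pct = float(raw) * 100.0
    elif vendor_style == "per_core_percent":  # vendor reports one value per core
        cores = list(raw)
        if not cores:
            raise ValueError("per-core reading is empty")
        pct = sum(cores) / len(cores)
    else:
        raise ValueError(f"unknown CPU reporting style: {vendor_style}")
    # Clamp so a slightly out-of-range reading cannot distort a dashboard.
    return max(0.0, min(100.0, pct))
```

Once every element reports on the same scale, a reading of 75 means the same thing on a dashboard regardless of which vendor's equipment produced it.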
What operations teams need now is a status map of a given service like VoLTE, so they can see each element that makes up the service, its relationship with the others, the interactions among them, and their health as revealed in dynamic dashboards via metrics and baselines. The result is a visual summary of the real performance of a nationwide service.
The approach to this kind of data access, collection and integration is also changing. The long-standing vendor practice of leaving the creation and certification of custom SNMP MIBs to the carrier’s operations team is no longer workable. These days, suppliers must be nimble. That means turning around custom MIBs quickly, or even creating new metrics to integrate with other third-party sources, and bringing up the results quickly in dashboard-style displays.
Figuring Out What "Normal" Looks Like
Data collection plays a critical role in identifying normal behavior ranges, or baselines, for each element in the VoLTE call path. These baselines, factoring in time of day and eventually even seasonal shifts, become the bedrock for alerts and alarms that can distinguish between acceptable variations and true key performance indicator anomalies that signal emerging problems.
VoLTE is so new and so limited that operations teams are still figuring out what normal call path behavior is. As a result, monitoring platforms must be sophisticated enough to learn each element’s typical activity from the data and to detect departures from it. That information then becomes the foundation for accurate alerts that minimize false positives and false negatives.
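One simple way to picture a time-of-day baseline is a per-hour mean and standard deviation, with an alert raised when a new reading falls too far from the mean for that hour. The sketch below assumes that approach; the three-standard-deviation threshold and the function names are illustrative choices, and production platforms may use more sophisticated models.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(samples):
    """Build {hour: (mean, stdev)} from (hour_of_day, value) samples.

    Hours with fewer than two samples are left out, because a standard
    deviation cannot be computed for them.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {
        hour: (mean(values), stdev(values))
        for hour, values in by_hour.items()
        if len(values) >= 2
    }


def is_anomalous(baseline, hour, value, threshold=3.0):
    """Return True if value is more than `threshold` deviations from normal."""
    if hour not in baseline:
        return False  # no baseline yet for this hour, so stay quiet
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Illustrative history for one KPI: busy at 9:00, quiet at 3:00.
history = [(9, 100), (9, 110), (9, 90), (9, 105), (9, 95),
           (3, 10), (3, 12), (3, 8)]
baseline = build_baseline(history)
```

With this history, a reading of 100 is unremarkable at 9:00 but is flagged at 3:00, which is the distinction between acceptable variation and a true anomaly that a single fixed threshold would miss.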
Monitoring application layer VoLTE transactions is just one example of how operations groups are rethinking these issues. As the number of VoLTE users grows, so does the volume of transactions, driven by the surge in call flow messaging among the various SIP servers in the IMS Call Session Control Function (CSCF). To monitor effectively, operators need common transaction metrics that apply across an array of different vendors’ products and platforms.
These kinds of synthetic indicators can then be applied in a capacity versus demand analysis against each component in the call path. This shows how many subscribers are using each CSCF server, and how much traffic they’re generating. That information, in turn, lets network operations confirm that those servers are not overloading. It also enables users to measure the actual performance of a system against the vendor’s promised performance. These metrics can then be used to forecast future demand based on subscriber growth, and plan for infrastructure changes to support those higher CSCF transaction rates.
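The arithmetic behind such a capacity versus demand analysis can be sketched in a few lines. The example below assumes that transaction load scales linearly with subscriber count and that operators want to keep each server below a chosen utilization ceiling; both are simplifying assumptions, and the function names and the 70 percent ceiling are placeholders rather than recommended figures.

```python
import math


def utilization(observed_tps, rated_tps):
    """Fraction of the vendor's rated transactions per second in use."""
    if rated_tps <= 0:
        raise ValueError("rated_tps must be positive")
    return observed_tps / rated_tps


def forecast_tps(current_tps, current_subscribers, projected_subscribers):
    """Project demand, assuming each subscriber's transaction rate is constant."""
    if current_subscribers <= 0:
        raise ValueError("current_subscribers must be positive")
    return current_tps * projected_subscribers / current_subscribers


def servers_needed(demand_tps, rated_tps_per_server, max_utilization=0.7):
    """Servers required to carry demand while staying under the ceiling."""
    usable_tps = rated_tps_per_server * max_utilization
    return math.ceil(demand_tps / usable_tps)
```

For instance, a CSCF server rated at 5,000 transactions per second and carrying 3,500 is at 70 percent utilization. If the subscriber base grows from 100,000 to 250,000, the projected demand under these assumptions is 8,750 transactions per second, which would call for three such servers at a 70 percent ceiling.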
Finally, the monitoring platform has to be fast. In the aftermath of the previous LTE data outages, operators discovered that some outages could have been prevented if metrics already on the servers or network equipment had been collected faster. A common complaint from operations teams is that existing monitoring tools are painfully, maddeningly slow, bogging down under the mass of data needed to manage today’s VoLTE networks, or simply unable to monitor the number and breadth of devices that form those networks.
Moving Forward Successfully with the Right Monitoring Platform
Modern VoLTE roll-outs reflect the hard lessons, learned so painfully, from the LTE outages of just a few years ago. Today’s network operations groups are demanding more from their equipment suppliers and monitoring vendors, and the tools are emerging to satisfy those demands. These groups know that the right monitoring platform can provide new operational capabilities for tracking the real-time health of VoLTE services, identifying under-performing components, revealing congestion, validating vendor claims, enforcing service level agreements and optimizing VoLTE as demand for it grows.
And, perhaps most importantly of all, it can help maintain subscribers’ faith in the service.