Today’s service providers are under constant pressure to increase capacity of their networks and other infrastructures to keep up with user demand. At the same time, they’re expected to keep costs down so they can effectively compete in the market.
Given these opposing requirements, they need to make investments in areas that truly warrant upgrades now as opposed to later. In order to consistently make those decisions correctly, they must be adept at a crucial discipline: capacity planning.
Capacity planning involves determining when the demand for a given resource – bandwidth, CPU, disk space, memory, etc. – will outweigh the capacity to deliver it. Done correctly, capacity planning makes it possible to determine which infrastructure upgrades will deliver a return in increased customer business and reduced churn. Make a mistake, and service providers may find themselves over-provisioned in some areas or unable to deliver capacity upgrades when they are needed most.
Currently, most service providers conduct capacity planning by collecting data from various systems and pouring it into spreadsheets on a weekly or monthly basis. This labor-intensive method is no longer fast enough to keep up with the rapid changes that occur as more and more customers turn to service providers for everything from wireless phone service and cable TV to cloud-based IT services.
However, service providers do have another tool at their disposal. If they have a good performance monitoring platform, it’s already collecting detailed data on all crucial components of the infrastructure. In this paper, we’ll outline four ways to take advantage of that data to dramatically improve capacity planning capabilities.
Current State of Capacity Planning
First, let’s look at how most service providers currently tackle the job. A typical scenario has teams extracting metrics from all the important systems and devices. These metrics may show peak usage on WAN and LAN links, percentage of used capacity on a storage array and other relevant statistics depending on the system in question.
Next, the teams import the metrics into spreadsheets and perform calculations to try to determine which ones merit attention. That’s a difficult task because the metrics typically aren’t correlated with one another. For example, they can’t immediately tell if a spike in demand on one WAN link had something to do with a failure on another.
It’s also a labor-intensive process given there are often thousands of systems, devices and variables to monitor, including bandwidth, CPU utilization, storage, memory, etc. What’s more, it’s hard for providers to determine whether they’re using the most appropriate metrics and calculations. And should they decide they want to re-run a given calculation, it may be difficult because the raw data is often gone once it’s extracted from the source device or system.
It can also be tough to extract data from certain devices that don’t support standard management protocols such as SNMP. And it’s hard to ensure each device and system is accounted for, given the rapid pace of change in most service provider environments.
In the end, service providers can’t be sure they’re making the most appropriate capacity planning decisions. That calls into question whether they’re investing in the areas that need it most and that will deliver the best return.
1. Go Beyond TopN Reports
Now let’s discuss the ways performance monitoring platforms can help service providers deliver more accurate capacity planning reports.
Some platforms are not being used to their full potential and can dig far deeper into performance metrics. For example, it’s common for providers to rely on TopN reports that show which resources are used the most, whether it’s servers, WAN trunks or applications.
Such reports are useful to a point, but they do come with some big caveats. For instance, they represent an average of use over a period of time, which isn’t the most useful approach for capacity management. A TopN report may show one WAN link as being most utilized, but it’s not necessarily the one that’s suffering repeated usage peaks over time.
What’s more, TopN reports only tell a part of the story. What’s not included is information that’s crucial for capacity planning, like:
- When do peaks occur?
- How often do the peaks occur?
- How long did each peak last?
A TopN report may show a WAN link averaged 90% of its capacity. But a closer look with the performance monitoring platform can highlight when the link burst over capacity and for how long. If the peak lasted only a second, the provider may feel it’s not an immediate concern. But if it went on for many minutes or longer, that’s a different story. And time of day matters too: If the spike lasted a long time during the normal peak period, then that’s an area that’s ripe for an upgrade.
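To make the three questions above concrete, here is a minimal sketch of how peak episodes might be pulled out of raw utilization samples. It assumes evenly spaced samples and a hypothetical `find_peaks` helper; the names, the 90% threshold and the sample data are illustrative only and are not taken from any particular monitoring platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Peak:
    start: datetime       # when the peak began
    duration: timedelta   # how long utilization stayed at or above the threshold
    max_util: float       # highest utilization seen during the peak


def find_peaks(samples, threshold, interval):
    """Group consecutive samples at or above `threshold` into peak episodes.

    `samples` is a time-ordered list of (timestamp, utilization_percent)
    pairs, assumed to be evenly spaced by `interval`.
    """
    peaks = []
    start, count, max_util = None, 0, 0.0
    for ts, util in samples:
        if util >= threshold:
            if start is None:
                start, count, max_util = ts, 0, util
            count += 1
            max_util = max(max_util, util)
        elif start is not None:
            peaks.append(Peak(start, count * interval, max_util))
            start = None
    if start is not None:  # the series ended while still in a peak
        peaks.append(Peak(start, count * interval, max_util))
    return peaks


# Hypothetical one-minute samples for a single WAN link.
t0 = datetime(2024, 1, 1, 20, 0)
step = timedelta(minutes=1)
utils = [70, 95, 97, 96, 60, 65, 92, 50]
samples = [(t0 + i * step, u) for i, u in enumerate(utils)]

peaks = find_peaks(samples, threshold=90, interval=step)
for p in peaks:
    print(p.start.time(), p.duration, p.max_util)
```

The length of `peaks` answers "how often", each `start` answers "when", and each `duration` answers "how long".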
With the right performance monitoring platform, providers can get to that level of detail using things like percentage reports. For example, the provider might establish a threshold for when capacity is considered too high, maybe 80% or 90%. They can then run reports that show how often a given system reached these thresholds.
Service providers can also run reports that show the percentage of time during which utilization reached their predefined thresholds. That enables them to identify when they’re consistently exceeding a threshold and for how long, so they can quickly home in on problem areas.
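As a rough illustration of such a percentage report, the sketch below computes the share of samples at or above each threshold. The function names, link names and utilization values are assumptions made up for this example.

```python
def percent_time_at_or_above(samples, threshold):
    """Percentage of samples whose utilization is at or above `threshold`."""
    if not samples:
        return 0.0
    over = sum(1 for u in samples if u >= threshold)
    return 100.0 * over / len(samples)


def threshold_report(series_by_resource, thresholds=(80, 90)):
    """Build {resource: {threshold: percent_of_time}} for every resource."""
    return {
        name: {t: percent_time_at_or_above(samples, t) for t in thresholds}
        for name, samples in series_by_resource.items()
    }


# Hypothetical utilization samples (percent), evenly spaced in time.
series = {
    "link_a": [85] * 10,            # steady load, averages 85%
    "link_b": [60] * 6 + [95] * 4,  # averages 74%, but bursts to 95%
}

report = threshold_report(series)
for name, row in report.items():
    print(name, row)
```

In this made-up data, a report ranked by average would put `link_a` first, yet only `link_b` spends any time above the 90% threshold.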
Consider the difference such detailed data can make to a cable company. While a TopN report would show areas that experienced heavy use, what the company is really looking for are the areas that had heavy use for extended periods during prime viewing hours. Those are the areas that would demand immediate attention.
2. Report on Groups of Resources
In some instances, it’s not a single WAN trunk or server that a service provider is concerned with, but how a group of resources is performing as a whole. Examples may include groups of interfaces to a given application, a server farm or an entire customer site.
Maybe a data center hosting provider has five or six lines coming into its facility from a large customer, all attached to a load balancer. The provider will likely be more concerned with the performance of the lines as a whole than with any individual line. With a performance management platform, the provider can create a group object that treats the lines as one, apply a key performance indicator (KPI) that defines what proper performance should look like, and continually monitor to ensure the load balancer is distributing traffic effectively.
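The sketch below shows one way a group object and its KPI could be expressed. The field names, the 70% group limit and the 20-point imbalance limit are all hypothetical values chosen for illustration, not settings from any specific product.

```python
def group_utilization(lines):
    """Utilization of the group as a whole: total used over total capacity."""
    total_used = sum(line["used_mbps"] for line in lines)
    total_cap = sum(line["capacity_mbps"] for line in lines)
    return 100.0 * total_used / total_cap


def imbalance(lines):
    """Spread, in percentage points, between the busiest and idlest line."""
    utils = [100.0 * line["used_mbps"] / line["capacity_mbps"] for line in lines]
    return max(utils) - min(utils)


def check_group(lines, max_group_util=70.0, max_imbalance=20.0):
    """Evaluate the group against two assumed KPI limits."""
    util = group_utilization(lines)
    spread = imbalance(lines)
    return {
        "group_util_pct": util,
        "imbalance_pts": spread,
        "capacity_ok": util <= max_group_util,
        "balance_ok": spread <= max_imbalance,
    }


# Five hypothetical 1 Gbps lines behind one load balancer.
lines = [
    {"name": "line-1", "used_mbps": 400, "capacity_mbps": 1000},
    {"name": "line-2", "used_mbps": 420, "capacity_mbps": 1000},
    {"name": "line-3", "used_mbps": 380, "capacity_mbps": 1000},
    {"name": "line-4", "used_mbps": 410, "capacity_mbps": 1000},
    {"name": "line-5", "used_mbps": 900, "capacity_mbps": 1000},
]

result = check_group(lines)
print(result)
```

Here the group as a whole is comfortably within its capacity KPI, but the imbalance check flags that one line is carrying far more than its share.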
In large service provider networks, it would be unwieldy to create such groups manually. So it’s important that the performance management platform be able to put devices and systems into groups using application programming interfaces (APIs) or service management systems. This makes it possible to automate the grouping process.
A wireless service provider, for example, may use this capability to keep track of the capacity at various points in its network. Perhaps they group the base stations and backhaul lines that aggregate traffic according to region. They may want to place highly stringent KPIs on the core or backbone of the network, because it affects so many customers, while allowing somewhat higher traffic concentrations on the downstream links.
In short, a performance monitoring platform would allow the provider to put a premium on those backhaul connections that serve the most customers, so they never run out of capacity. What’s more, the provider could get alerts in real time should capacity be threatened, such as during a power outage in an area where everyone is using their cellular devices.
Similarly, an Internet Service Provider may want to group the lines that pertain to its various classes of service — one for real-time traffic like voice and video and other groups for less crucial and best-effort traffic. A performance monitoring platform would enable them to look at the sum of each type of traffic and ensure each is within its defined performance parameters.
Both examples illustrate the type of capacity planning that’s beyond the reach of even a weekly spreadsheet report.
3. Incorporate User Activity and Other Data
For capacity planning purposes, it’s good to know not just how much capacity is being used, but what it’s being used for. And it’s also helpful to be able to pull in data from all sorts of devices, including those that don’t support traditional management protocols such as SNMP.
The right performance management platform can help on both fronts. To determine what capacity is being used for, it will use both log data and NetFlow. NetFlow provides deep insight into IP traffic, making it possible to determine where packets are going to and coming from, and enabling the tracking of application usage.
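As a simplified illustration of tracking application usage from flow data, the sketch below totals bytes per application. The record layout and the port-to-application table are assumptions for this example; real NetFlow records carry many more fields, and classifying applications purely by destination port is a deliberate simplification.

```python
from collections import Counter

# Assumed mapping from well-known destination ports to application labels.
PORT_TO_APP = {443: "https", 53: "dns", 5060: "sip"}


def bytes_by_application(flows):
    """Sum bytes per application, labeling unknown ports as 'other'."""
    totals = Counter()
    for flow in flows:
        app = PORT_TO_APP.get(flow["dst_port"], "other")
        totals[app] += flow["bytes"]
    return totals


def traffic_share(totals):
    """Convert byte totals into each application's percentage of all traffic."""
    grand_total = sum(totals.values())
    return {app: 100.0 * count / grand_total for app, count in totals.items()}


# Hypothetical, heavily simplified flow records.
flows = [
    {"src": "10.0.0.1", "dst": "192.0.2.10", "dst_port": 443, "bytes": 6000},
    {"src": "10.0.0.2", "dst": "192.0.2.11", "dst_port": 443, "bytes": 2000},
    {"src": "10.0.0.1", "dst": "192.0.2.53", "dst_port": 53, "bytes": 500},
    {"src": "10.0.0.3", "dst": "192.0.2.60", "dst_port": 5060, "bytes": 1000},
    {"src": "10.0.0.4", "dst": "192.0.2.99", "dst_port": 9999, "bytes": 500},
]

totals = bytes_by_application(flows)
shares = traffic_share(totals)
print(dict(totals))
print(shares)
```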
Service providers can use this data to better plan capacity by type of traffic. For example, a wireless carrier may want to track usage of 4G/LTE traffic on its network to determine when it makes sense to add more 4G capacity and repurpose older spectrum.
Such information can also factor into how companies charge for different services. When a cable company sees rapid growth of a particular feature, they may want to charge more for it. Similarly, if other services aren’t catching on, they could run a promotion to spark interest.
In some instances, companies may also want to pull in data from portions of their infrastructure that don’t support typical IT protocols such as SNMP. A good performance management platform should be able to correlate log data with performance metrics from just about any device. It should also be able to normalize the data so it can be compared to other data to find trends that help in capacity planning.
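Normalization can be as simple as converting every reading to the same unit before comparing. The sketch below assumes hypothetical records that report throughput in different units and reduces each one to a percentage of capacity; the record fields and device names are invented for illustration.

```python
# Scale factors to bits per second for the units assumed in this example.
UNIT_SCALE = {"bps": 1, "kbps": 1_000, "mbps": 1_000_000, "gbps": 1_000_000_000}


def to_bps(value, unit):
    """Convert a throughput value in the given unit to bits per second."""
    return value * UNIT_SCALE[unit.lower()]


def normalize(record):
    """Reduce a raw record to a comparable percent-utilization figure."""
    used = to_bps(record["used"], record["used_unit"])
    capacity = to_bps(record["capacity"], record["capacity_unit"])
    return {
        "resource": record["resource"],
        "source": record["source"],
        "utilization_pct": round(100.0 * used / capacity, 2),
    }


# Hypothetical readings: one polled via SNMP, one parsed from a device log.
raw = [
    {"resource": "cmts-1", "source": "snmp",
     "used": 450, "used_unit": "Mbps", "capacity": 1, "capacity_unit": "Gbps"},
    {"resource": "modem-17", "source": "log",
     "used": 80_000, "used_unit": "kbps", "capacity": 100, "capacity_unit": "Mbps"},
]

normalized = [normalize(r) for r in raw]
for row in normalized:
    print(row)
```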
A cable company, for example, may have to incorporate data from a number of devices to determine the capacity of its network in any given area. That includes cable modems and/or routers in customer homes, cable modem termination systems (CMTS) in the access layer and CATV headend solutions in hub sites.
With a performance management platform that can import data from all of these devices, the company can accurately determine how many additional customers it can add, taking into account their service mix, before capacity runs out. At the same time, they can ensure that as customers are added, service performance doesn’t degrade – causing disgruntled customers, cancellations and lost revenue.
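The headroom calculation described above can be sketched as simple arithmetic. Every figure here is hypothetical: the 10 Gbps segment, the 80% planning ceiling and the per-customer peak demand for each service tier are assumptions chosen only to show the shape of the calculation.

```python
import math


def additional_customers(capacity_mbps, current_peak_mbps, service_mix,
                         target_util=0.8):
    """Estimate how many more customers fit before the planning ceiling.

    `service_mix` is a list of (share_of_new_customers, peak_mbps_per_customer)
    pairs whose shares are assumed to sum to 1.
    """
    usable = capacity_mbps * target_util
    headroom = usable - current_peak_mbps
    if headroom <= 0:
        return 0
    # Weighted average peak demand of one new customer.
    per_customer = sum(share * mbps for share, mbps in service_mix)
    return math.floor(headroom / per_customer)


# Assumed mix: 70% basic tier at 2 Mbps peak, 30% premium tier at 5 Mbps peak.
mix = [(0.7, 2.0), (0.3, 5.0)]
room = additional_customers(capacity_mbps=10_000, current_peak_mbps=6_500,
                            service_mix=mix)
print(room)
```

With these assumed inputs, 1,500 Mbps of headroom at an average of 2.9 Mbps per new customer leaves room for 517 customers.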
4. Run "Days Until Threshold" Reports
Finally, some performance management platforms can run predictive reports. Such reports show how many days until a given resource runs out of capacity or reaches a predefined limit.
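One simple way to produce such a figure is to fit a straight line to recent daily utilization and extrapolate. The sketch below uses an ordinary least-squares fit; this is an assumption for illustration, and commercial platforms may use different or more sophisticated forecasting models. The function name and sample data are hypothetical.

```python
def days_until_threshold(daily_util, threshold):
    """Extrapolate a least-squares trend line to estimate days until `threshold`.

    `daily_util` holds one utilization value per day, oldest first.
    Returns 0.0 if the trend line is already at or above the threshold,
    or None if the trend is flat or declining.
    """
    n = len(daily_util)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_util) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_util))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    current = intercept + slope * (n - 1)  # fitted value on the latest day
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


# Hypothetical 30 days of history growing half a point per day from 60%.
history = [60.0 + 0.5 * day for day in range(30)]
print(days_until_threshold(history, threshold=90.0))
```

A linear fit is only reasonable when growth is roughly steady; seasonal or bursty demand would need a different model.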
That is a powerful capability that greatly simplifies capacity planning. If a provider knows it will take, say, two months to upgrade the capacity of a given resource, it will want to know at least three months in advance when that resource will reach maximum capacity. For example, a wireless provider can’t wait until it hits 90 percent capacity before making the call to add more routers to the core or construct extra cell sites.
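The lead-time reasoning can be written down as a small date calculation. In this sketch the forecast, the 60-day upgrade lead time and the 30-day safety margin are assumed inputs, and `upgrade_start_date` is a hypothetical helper name.

```python
from datetime import date, timedelta


def upgrade_start_date(today, days_until_threshold, lead_time_days,
                       safety_margin_days=0):
    """Latest date on which the upgrade must start to finish in time."""
    exhaustion = today + timedelta(days=days_until_threshold)
    return exhaustion - timedelta(days=lead_time_days + safety_margin_days)


def must_start_now(today, days_until_threshold, lead_time_days,
                   safety_margin_days=0):
    """True when the latest safe start date is today or already past."""
    start_by = upgrade_start_date(today, days_until_threshold,
                                  lead_time_days, safety_margin_days)
    return start_by <= today


today = date(2024, 3, 1)
start_by = upgrade_start_date(today, days_until_threshold=100,
                              lead_time_days=60, safety_margin_days=30)
print(start_by)
print(must_start_now(today, 100, 60, 30))
```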
Such predictive reporting brings a just-in-time infrastructure planning capability. As a result, providers aren’t scrambling to increase capacity after they’ve run out. At the same time, they’re not so far ahead of demand that they’re wasting capacity and capital.
The days of trying to do capacity planning by manually importing data into spreadsheets are coming to an end. This type of process can’t keep pace with the rapid changes driven by customer demand and market forces. What’s more, a spreadsheet-based system makes it too difficult to correlate relevant variables to get the best, most actionable information.
A good performance monitoring platform offers a number of benefits that can bring a new level of sophistication to a service provider’s capacity planning efforts.
A platform with open APIs makes it possible to gather data from a wide range of external devices, not just those that support traditional IT management protocols such as SNMP. It also enables automated data gathering and reporting, which means data collection is no longer a gating factor. Results can be produced in real time and reports can be created anytime, not just on a weekly or monthly schedule.
This type of performance monitoring platform will deliver more accurate capacity planning information, enabling service providers to determine exactly where their most pressing problem areas are – and which ones can wait until later.
It also enables providers to retain raw data for extended periods of time – often a year or more. That’s critical for making accurate capacity forecasts, because projections based on usage data that has been averaged over time are unreliable (e.g., when hourly usage data is rolled up to daily usage views after 30 days). Retaining raw data also matters if the need arises to re-run certain reports, or to run them using different variables.
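A small worked example shows why rolled-up averages can mislead. The hourly values below are invented: a link that idles at 40% for 20 hours and runs at 100% for four prime-time hours.

```python
def daily_rollup(hourly):
    """Average each block of 24 hourly samples into one daily value."""
    return [sum(hourly[i:i + 24]) / 24 for i in range(0, len(hourly), 24)]


# One hypothetical day: 20 quiet hours at 40%, then 4 saturated hours at 100%.
hourly = [40] * 20 + [100] * 4

daily = daily_rollup(hourly)
hours_at_or_above_80 = sum(1 for u in hourly if u >= 80)

print(daily)                  # the rolled-up daily view
print(max(hourly))            # the peak visible only in the raw data
print(hours_at_or_above_80)   # hours a threshold report would flag
```

The daily average comes out at 50%, which looks healthy, while the raw data shows the link was saturated for four hours. Once the hourly samples are discarded, that information cannot be recovered.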
What’s more, the right performance monitoring platform provides better institutional knowledge of seasonal events. For example, in planning for big annual events, providers can look at data from the same or similar events in the prior year. With this data in hand, a wireless service provider or cable company can plan for major events such as the World Cup, Super Bowl or Olympics, or even the annual influx of college students to certain cities and towns. With sound capacity planning data, they can be confident customers will get the performance they expect.
Finally, when evaluating the latest performance monitoring platforms, it’s important to choose one that helps ensure customers won’t encounter bottlenecks that give them a less than satisfactory service experience – and that money isn’t being spent on unnecessary infrastructure upgrades.