Bell Mobility Goes Lean: Using Performance Monitoring to Cut Waste
by Zlatko Zahirovic, Manager of Wireless Network Connectivity Engineering at Bell Mobility.
Reprinted from an article that appeared in Wireless Week on March 25, 2014
Many network teams think of performance monitoring as an insurance policy. At Bell Mobility, we see things differently. In our world, performance monitoring is a way to cut waste from our CDMA, LTE, and HSPA+ based wireless network.
For decades, businesses of all types have adopted the principles of Lean Manufacturing. Bell Mobility, a division of Canada’s largest communications company (BCE), is no different. We embrace lean theory to cut the muda from a wireless network that serves 7.8 million subscribers.
Our performance monitoring platform is a key component of this strategy. Like any company, we use performance monitoring to troubleshoot user experience issues. But we also look for ways to drive waste out of the business and improve our bottom line. Performance monitoring gives us that insight.
There are seven deadly wastes in lean practice. Here are some examples of how performance monitoring has helped us eliminate them:
Following the launch of the iPhone back in 2007, Bell Mobility saw other wireless providers scrambling for capacity. So our gut instinct was to over-provision our entire backhaul network. We provisioned the microwave spectrum for a maximum of 131 Mbps, almost four times higher than the theoretical maximum of what the technology at the time needed. Only now are we reaching 120 Mbps bursts on those same links, so we have been over-invested in bandwidth for some time. While over-provisioning ensures cell sites never bust capacity, this form of waste increases costs. In our case, it has been in the neighborhood of $26 million over the last seven years ($8.1 million for microwave licenses and $18 million for over-provisioned leased circuits).
The greatest benefit performance monitoring delivers is cost avoidance. We can now view actual and projected usage trends. We know exactly what we’re going to need and when we’re going to need it. We can end wasteful production costs. Investment in extra bandwidth or unnecessary cell site construction equates to overproduction in lean terms.
Prior to implementing our current performance monitoring solution, our field services technicians performed “turn-up testing” on every backhaul circuit and microwave link. Each of our 120 technicians was required to carry an EXFO test set at a cost of approximately $10,000 each. Up until a year ago, we tested every single microwave link and fiber circuit when it was put in service, before we put live traffic on it.
With enhanced performance monitoring capabilities, we no longer need this excess inventory. Our network is at five nines (99.999 percent) availability. We eliminated the need for $1.2 million in hardware test sets alone, not to mention another $320,000 in related tech time and labor costs over the years.
Unnecessary movement between processes creates waste. For service providers, this most often relates to truck rolls. Every time we dispatch a truck to troubleshoot a problem at a cell site – like packet loss – we lose money.
Delays caused by unnecessary transportation time increase customer churn. We’re in a hyper-competitive market. Sometimes, by the time we've resolved a remote issue and checked back with the customer, they’ve switched to another provider.
Performance monitoring gives us the visibility to avoid truck rolls and decrease churn. We can detect packet loss and other cell site issues from the network operations center. In the past year, we’ve decreased truck rolls by 44 percent for our backhaul requirements.
How do we view motion as a form of waste? A lot has to do with prioritization of man hours in fixing problems. If a particular cell site goes down, do we need to address it immediately? In many instances, a cell site neighbor can absorb traffic from a down site. You have to understand the customer impact. If a customer goes from five bars to four bars, is their quality of service still the same? Perhaps other issues need more immediate action. You have to address the issues that will keep the most customers happy.
Canada is susceptible to power outages caused by inclement weather. Our performance monitoring platform gives us a clear picture of which cell sites are down and the resulting spikes in use at other sites. Now we know if there is sufficient overlapping capacity to absorb the outage. We can better rank troubleshooting efforts – and better serve the customer.
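To make the ranking idea concrete, here is a minimal sketch of how outages could be ordered by how much traffic the neighboring sites cannot absorb. It is an illustration only, not the logic of our monitoring platform: the function names, the data layout, and the Mbps figures in the usage example are all hypothetical.

```python
def shortfall_mbps(down_site_load_mbps, neighbors):
    """Traffic (Mbps) the neighbors cannot absorb; negative means spare room.

    neighbors: list of (capacity_mbps, current_load_mbps) tuples (assumed layout).
    """
    headroom = sum(max(cap - load, 0.0) for cap, load in neighbors)
    return down_site_load_mbps - headroom


def rank_outages(outages):
    """Sort outages so the sites with the largest unabsorbed traffic come first."""
    return sorted(
        outages,
        key=lambda o: shortfall_mbps(o["load_mbps"], o["neighbors"]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical example data, not real network figures.
    outages = [
        {"site": "A", "load_mbps": 40, "neighbors": [(120, 60), (120, 100)]},
        {"site": "B", "load_mbps": 90, "neighbors": [(120, 100), (120, 110)]},
    ]
    for o in rank_outages(outages):
        print(o["site"], shortfall_mbps(o["load_mbps"], o["neighbors"]))
```

A site with a positive shortfall is one where customers are likely to feel the outage, so it would go to the top of the troubleshooting list.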
The act of maintaining a performance monitoring platform also creates waste. I am a huge proponent of appliance-based deployments. We don't worry about procuring our own hardware, server and security licenses, or a database administrator. Moving away from a software-only solution cuts a tremendous amount of time, effort, and expense. We’re talking full-time employees who now work on revenue-generating projects, not maintenance.
Inaccurate capacity projections have the most adverse effects on wait times for our business. We can’t wait until we hit 90 percent capacity before we make the call to add more routers to the core or construct extra cell sites. At that point, we’d be sitting on our hands while our customers suffer from our lack of just-in-time planning.
We recently received an alert in our SevOne performance management system that a 120 Mbps microwave link between two towers had trended above a 70 percent capacity threshold. We needed to start a four-month process of constructing a higher-capacity 502 Mbps link. Without performance visibility, we would have missed our window.
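The reasoning behind a threshold alert like this can be sketched as a simple trend projection: fit a straight line to recent peak utilization and compare the time left before the link is full with the construction lead time. The sketch below assumes weekly peak samples and a linear trend; the function names and sample values are hypothetical, and it does not represent how SevOne computes its baselines.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until(ys, limit_mbps):
    """Weeks after the last sample until the fitted trend reaches limit_mbps.

    Returns None when the trend is flat or declining.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    x_cross = (limit_mbps - intercept) / slope
    return max(x_cross - (len(ys) - 1), 0.0)


if __name__ == "__main__":
    link_capacity_mbps = 120
    lead_time_weeks = 17  # assumption: roughly four months of construction
    weekly_peaks = [70, 72, 74, 76, 78, 80, 82, 84]  # hypothetical samples
    remaining = weeks_until(weekly_peaks, link_capacity_mbps)
    if remaining is not None and remaining <= lead_time_weeks + 4:
        print(f"Start the upgrade: about {remaining:.0f} weeks of headroom left")
```

The four-week margin added to the lead time in the usage example is an arbitrary safety buffer chosen for illustration.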
With trending baselines, we know exactly where capacity needs to be six months from now. For example, we know the sites serving the universities will blow up with traffic from the influx of new students in September. We need to understand those trends and prepare for them.
With LTE traffic, we can see our investment paying off. We can tell what percentage of our traffic is LTE versus HSPA/EVDO. Every week for us is a record-setting week for LTE. Once we get to the point where 90 percent of our traffic is LTE, we can decide if it’s time to get leaner by eliminating 2G/3G services and antennae from our towers.
Bell Mobility network utilization over time. The largest spike occurred during the 2014 Sochi Olympics when Canada won Gold medals in both the men’s and women’s hockey tournaments.
Most organizations find it easier to uncover an outage than to reveal deterioration of user experience. But both equate to a defect for the wireless network. If you have a cluster of sites experiencing 25 percent packet loss, the quality of the call is awful. It doesn't matter that we can buffer up to 35 percent. The call will sound like a person is in an inner tube with a hairdryer blowing in the background. Instances such as these are difficult to uncover, and without performance monitoring, these defects are impossible to predict.
One interesting cause for call quality deterioration in the past was routers melting or freezing. Canada's weather conditions can be harsh. We had no way of monitoring cell site temperatures in real-time and alerting when conditions worsened. We had cell sites sometimes above 60 degrees Celsius – way higher than the manufacturer specifications. Routers were burning up and we had no way of knowing until we implemented SevOne.
Bell Mobility’s current practices include real-time monitoring of temperatures at the cell sites. Alerts trigger for the network operations center when temperatures rise above or dip below manufacturer recommendations. Out of the gate, we uncovered many sites that had been experiencing problems for years. AC units were either not on, misconfigured, or non-existent. Talk about a defect! We took action and cut another form of waste.
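A temperature alert of this kind boils down to comparing each reading with the manufacturer's operating range. The following is a minimal sketch under stated assumptions: the function name, site name, and the -5 to 45 degree range are hypothetical placeholders, not actual router specifications or the alerting rules configured in our platform.

```python
def temperature_alert(site, temp_c, min_c, max_c):
    """Return an alert message if temp_c is outside [min_c, max_c], else None."""
    if temp_c > max_c:
        return f"{site}: {temp_c:.1f} C is above the {max_c:.1f} C maximum"
    if temp_c < min_c:
        return f"{site}: {temp_c:.1f} C is below the {min_c:.1f} C minimum"
    return None


if __name__ == "__main__":
    # Hypothetical operating range; real limits come from the vendor data sheet.
    MIN_C, MAX_C = -5.0, 45.0
    for site, reading in [("site-001", 61.0), ("site-002", 22.5), ("site-003", -18.0)]:
        message = temperature_alert(site, reading, MIN_C, MAX_C)
        if message:
            print("ALERT", message)
```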
Many businesses have poor processes in place because someone claims, “We’ve always done it that way.” This mindset leads to waste. I challenge all of my team members to review their processes and uncover waste every day.
For example, our previous monitoring tool required a lengthy manual process to add new devices for monitoring. When we migrated to our current platform, we were able to automate this process. This allowed us to direct valuable headcount to more important work.
We can often bypass poor processes by taking a more proactive approach to performance monitoring. I don’t want someone to ask me why something on the network is not working, or hear a customer complain about service quality. I want to uncover and fix issues before they’re recognized by others. You can’t do this without performance monitoring.
For other businesses looking to use performance monitoring to attack waste on their network, I’d offer up the following advice:
- You must trust the data 100 percent. With our current network performance management solution, we never question the integrity of the data. That was not always the case. With our previous vendor, we only trusted 60 percent of the data at best, and we never knew which 60 percent to trust. For instance, a chart of daily cell site activity should look like a sine wave. The low point occurs around 4:00 a.m. and the high point 12 hours later. When I looked at this report from our previous vendor, it was flat for half the day, then it went to some triangular form, then back to flat. When you see something that completely contradicts common sense, you know there’s something wrong. You lose all faith in the data, so you can’t make informed business decisions.
- You need broad access to multi-vendor performance data. It was not long ago that Bell Mobility had zero visibility of about 3,000 devices on our network. Why? Because our legacy performance monitoring vendor was not able to work with certain Juniper MIBs. Juniper makes up about 85 percent of our infrastructure. This was a huge visibility gap that we have since closed. Be sure your performance monitoring solution can collect the data you need.
- You need to deploy fast! Network issues don’t wait for you to deploy a monitoring solution. "Time to value" is another way of saying "time to cut waste." Our previous performance monitoring vendor took more than three years to deploy and still left us with visibility gaps. Migrating to an appliance-based platform allowed us to monitor, report, and alert on our entire network in less than three months.
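The common-sense check described in the first point, that daily cell site activity should bottom out around 4:00 a.m. and peak about 12 hours later, can be turned into a rough automated sanity test. This is only a sketch: the function name, the 24 hourly samples, and the three-hour tolerance are assumptions made for illustration, and a real data-quality check would need to be more thorough.

```python
import math


def looks_diurnal(hourly, low_hour=4, high_hour=16, tolerance_h=3):
    """Check that 24 hourly samples bottom out near low_hour and peak near high_hour.

    hourly: list of 24 values, index = hour of day (assumed layout).
    """
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly samples")
    if max(hourly) == min(hourly):
        return False  # a completely flat day is suspicious

    def circular_distance(a, b):
        d = abs(a - b) % 24
        return min(d, 24 - d)

    lowest = min(range(24), key=lambda h: hourly[h])
    highest = max(range(24), key=lambda h: hourly[h])
    return (circular_distance(lowest, low_hour) <= tolerance_h
            and circular_distance(highest, high_hour) <= tolerance_h)


if __name__ == "__main__":
    # Synthetic sine-like day: trough at 04:00, peak at 16:00.
    healthy = [50 - 40 * math.cos(2 * math.pi * (h - 4) / 24) for h in range(24)]
    print(looks_diurnal(healthy))
    print(looks_diurnal([50] * 24))
```

A report that fails a check like this does not prove the data is wrong, but it is a prompt to question the collection before basing decisions on it.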
Without performance data, you have no information. Without information, you have no insight on how to leverage lean principles to better your business. We have 15 teams and more than 400 users accessing our performance monitoring platform and reports.
Performance monitoring solutions deliver much more than just an insurance policy.