Like many companies around the globe, SevOne has implemented a mandatory work from home (WFH) policy for at least the next two weeks. Our intention with this policy is two-fold: 1.) protecting the health and safety of our employees and their families, and 2.) maintaining business continuity to help our customers and ensure that their network monitoring needs continue to be met.
As the leaders in modern network monitoring and analytics, we have a unique opportunity to share what we’ve learned so far about WFH and its impact on networks. We hope that this blog, along with the operational insights we get from working with our customers and from our own network monitoring expertise, will help organizations and their IT teams in these very challenging times, especially in their efforts to support newly critical WFH-focused network services.
In the coming days and weeks, we will look to continue this blog series with updates on our experiences and those of our business partners.
Lesson #1: Focus on VPN Access - Who, Why, How, When
Let’s start with “The Who”. Who do you need to focus on first?
At SevOne, here’s what we did. We created two general types of users:
- Group A) those who mainly access SevOne internal systems, and
- Group B) those who mainly access cloud-based resources.
We put our initial focus on Group A. Why? For employees who mainly access internal systems, WFH means they will not be able to access their systems like they normally do. In WFH mode, they will need to initiate a VPN connection to the SevOne corporate network. These folks aren’t accustomed to establishing VPN access. In addition, a rapid shift -- and increase -- in VPN activity has the potential to significantly raise the traffic load through our edge routers.
From the perspective of many of our internal users, the VPN app is nothing more than that logo on the screen that they never bother with. To help remind users of the VPN app that is loaded on each of their laptops, we provided these individuals with a straightforward set of instructions on how to initiate and terminate their VPN sessions.
However, estimating the impact of the VPN traffic surge was a much different task.
For a variety of business and security reasons, SevOne does not split-tunnel our VPN traffic. What does that mean? It means that if a WFH user is streaming CNN or Netflix to their laptop while on VPN, ALL the traffic from those streams comes to and through SevOne edge routers. That can dramatically increase the overall traffic load and potentially increase response time.
To help reduce this potentially unnecessary burden on our network, we have guided all of our users to initiate a VPN connection only as needed, rather than having it always on. For the users who do need continuous VPN connections, we suggested that they put their systems in “sleep” mode when they take breaks and when they end their day. That simple task disengages their VPN, helping to reduce overall traffic.
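To get a rough feel for what full tunneling means for edge router load, here is a minimal back-of-the-envelope sketch. Every number and name in it is a hypothetical illustration, not a SevOne measurement: the user count, the fraction of users streaming, and the per-user bitrates are all assumptions you would replace with your own.

```python
def full_tunnel_load_mbps(vpn_users, streaming_fraction, stream_mbps, work_mbps):
    """Estimate VPN-driven load (Mbps) when ALL user traffic is tunneled.

    All inputs are assumptions supplied by the caller:
      vpn_users          - concurrent VPN users
      streaming_fraction - share of those users streaming video (0.0 to 1.0)
      stream_mbps        - average bitrate of one video stream
      work_mbps          - average business traffic per user
    """
    streaming_users = vpn_users * streaming_fraction
    return vpn_users * work_mbps + streaming_users * stream_mbps


def split_tunnel_load_mbps(vpn_users, work_mbps):
    """Same estimate if only business traffic used the tunnel (for comparison)."""
    return vpn_users * work_mbps


# Hypothetical scenario: 200 users, a quarter of them streaming at 5 Mbps.
full = full_tunnel_load_mbps(200, 0.25, 5.0, 1.0)
split = split_tunnel_load_mbps(200, 1.0)
extra = full - split  # load attributable purely to personal streaming
```

Under these assumed inputs, streaming more than doubles the load, which is why asking users to connect only when needed can make a real difference.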
Lesson #2: Focus on Network Monitoring and Operations
To best prepare for increased VPN traffic, we took five main steps. We:
1. Verified our VPN redundancy,
2. Set up high-frequency data collection on our edge routers,
3. Set up slope alerting,
4. Verified alert notifications across our IT team, and
5. Created VPN-specific dashboards for team use.
Let’s look a bit closer at each of these steps.
Step 1 - Verifying VPN Redundancy
This process was pretty straightforward. We initiated a test outside of normal business hours where we failed over the primary VPN system and verified that the redundant system was operating properly.
Step 2 - Set up High Frequency Polling
For many years, the industry standard for polling network infrastructure has been every five minutes. In the present situation, we want more granular data, collected at least every minute, to better understand VPN traffic. In SevOne Data Platform, we set up high-frequency polling at one-minute intervals to increase VPN traffic visibility.
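To illustrate why the polling interval matters, the sketch below converts interface octet counters into bits-per-second rates and compares one-minute polling with five-minute polling over the same period. This is a generic illustration, not how SevOne Data Platform is implemented; the sample values are made up, and the function name is hypothetical.

```python
def rates_bps(samples, counter_bits=64):
    """Convert (timestamp_seconds, octet_counter) samples to bits per second.

    Each rate covers the interval between two consecutive samples.
    The modulo handles a single counter wrap between polls.
    """
    wrap = 2 ** counter_bits
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta_octets = (c1 - c0) % wrap
        rates.append(delta_octets * 8 / (t1 - t0))
    return rates


# Made-up counter readings taken every 60 seconds, with a burst in minute 3.
one_minute = [
    (0, 0),
    (60, 75_000_000),
    (120, 150_000_000),
    (180, 900_000_000),
    (240, 975_000_000),
    (300, 1_050_000_000),
]
# The same period seen by a five-minute poller: first and last reading only.
five_minute = [one_minute[0], one_minute[-1]]

per_minute = rates_bps(one_minute)     # peak of 100 Mbps in the third minute
per_five = rates_bps(five_minute)      # a single 28 Mbps average
```

In this made-up data, the five-minute view reports a steady 28 Mbps, while the one-minute view shows a 100 Mbps burst that the longer interval averages away.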
Step 3 - Set up Slope Alerting
With the anticipated surge in VPN-based traffic, we wanted to know when the growth rate of VPN usage, as well as of total internet bandwidth, is dramatically different than expected.
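The idea behind alerting on a growth rate rather than a raw value can be sketched as follows. This is an assumption-laden illustration, not SevOne's slope alerting implementation: it fits a least-squares line to a recent window of samples and flags the window when the fitted slope differs from an expected slope by more than a tolerance. The function names, window, and thresholds are all hypothetical.

```python
def slope_per_minute(values, interval_minutes=1.0):
    """Least-squares slope of evenly spaced samples, in units per minute."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to compute a slope")
    xs = [i * interval_minutes for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def slope_alert(values, expected_slope, tolerance, interval_minutes=1.0):
    """True when the observed growth rate deviates from the expected one."""
    observed = slope_per_minute(values, interval_minutes)
    return abs(observed - expected_slope) > tolerance


# Hypothetical concurrent-user counts sampled once per minute.
users = [100, 110, 120, 130, 140]
growth = slope_per_minute(users)              # users gained per minute
alarm = slope_alert(users, expected_slope=2.0, tolerance=5.0)
```

A slope check like this can fire while absolute usage is still well below any static threshold, which is the point of watching the rate of change during a surge.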
Step 4 - Verifying Alert Notifications Across our IT Team
This was a simple step. Since we already had alert notifications tied into our IT paging system, we initiated a test alert in SevOne Data Platform to verify the appropriate SevOne team members were notified of potential VPN issues.
Step 5 - Creating VPN Specific Dashboards for Team Use
We set up dashboards to help us visually track:
- VPN Utilization
- Total Concurrent Users
- Bandwidth Hogs
- Edge Router Throughput
VPN Utilization Dashboard
We created a dashboard showing VPN CPU and memory utilization, along with concurrent users and interface statistics from our primary, backup and tertiary VPN servers. The goal of this dashboard is to see whether there is any correlation between the growth in users and the ability of the VPN system to keep up with the increased demand. What we are seeing so far is that our primary VPN server is performing well with the increased demand.
Total Concurrent Users
In this dashboard we are able to visualize total concurrent users (dark blue), along with the normal baseline (dotted blue) and three standard deviations from the baseline (blue shading).
You’ll see the initial rise in concurrent users on the “trial” work from home day on March 13th, and the continued growth on Monday, March 16th, as well as at the start of March 17th. As we collect more data over the next week, we expect to see the baselines begin to adjust to this hopefully temporary “new normal”. This growth of users is generally within three standard deviations from the baseline.
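For readers who want to reproduce a baseline-and-deviation band on their own data, here is a minimal sketch. It assumes the baseline for a time slot is simply the mean of historical values for that same slot, and that the band is k population standard deviations wide; how SevOne Data Platform actually computes its baselines is not described here, and the sample numbers are invented.

```python
from statistics import mean, pstdev


def baseline_band(history, k=3.0):
    """Return (low, baseline, high) for k standard deviations around the mean.

    `history` is assumed to hold past values for the same time slot,
    for example the same hour on previous weekdays.
    """
    center = mean(history)
    spread = pstdev(history)
    return center - k * spread, center, center + k * spread


def outside_band(value, history, k=3.0):
    """True when a new value falls outside the k-sigma band."""
    low, _, high = baseline_band(history, k)
    return value < low or value > high


# Invented history of concurrent VPN users for one time slot.
history = [40, 50, 60]
low, baseline, high = baseline_band(history)

within_three_sigma = not outside_band(70, history, k=3.0)
beyond_one_sigma = outside_band(70, history, k=1.0)
```

With this invented history, a reading of 70 users sits inside the three-sigma band but outside the one-sigma band, which shows how much the choice of k changes what counts as abnormal.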
Bandwidth Hogs
In this dashboard we are using NetFlow records to analyze which applications are taking up the most bandwidth - the “Bandwidth Hogs”. If we were in a live troubleshooting scenario, we could drill down on these to see which users were contributing to the bandwidth usage.
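The aggregation behind a “Bandwidth Hogs” view can be sketched in a few lines. The example below assumes flow records have already been exported and decoded into dictionaries; the field names (`application`, `user`, `bytes`) and the sample values are hypothetical and do not reflect any particular NetFlow collector's schema.

```python
from collections import Counter


def top_talkers(flows, key="application", n=3):
    """Sum bytes per `key` across flow records and return the n largest."""
    totals = Counter()
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return totals.most_common(n)


# Invented, already-decoded flow records.
flows = [
    {"application": "video", "user": "alice", "bytes": 600},
    {"application": "backup", "user": "bob", "bytes": 500},
    {"application": "video", "user": "carol", "bytes": 300},
    {"application": "email", "user": "alice", "bytes": 50},
]

# Which applications use the most bandwidth?
hogs = top_talkers(flows, n=2)

# Drill down: which users contribute to the top application?
top_app = hogs[0][0]
users_of_top_app = top_talkers(
    [f for f in flows if f["application"] == top_app], key="user"
)
```

The drill-down reuses the same function with a different key, mirroring the move from an application-level view to a per-user view during troubleshooting.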
Edge Router Throughput
In this dashboard we are looking at our edge router throughput, with both in and out traffic, including baseline and standard deviation. We typically set alerts to notify our team when traffic exceeds one standard deviation from baseline, but in this WFH era we will be shifting to three standard deviations as baselines adjust.
Lesson #3: Remain Agile & Don’t Get Caught Saying “I Can’t Explain”
This is less of a lesson and more of a reality. Over the coming days and weeks, we recognize the need to remain agile in our approach to maintaining the health and safety of our employees, and business continuity for our customers.
And now, with the steps we’ve taken to monitor and alert on VPN activity, we are looking to avoid using the phrase “I Can’t Explain” when it comes to understanding VPN activity.