Why Performance Alerts Require More Intelligence
When it comes to performance monitoring tools, we hear the same complaint over and over from operations teams: “We get too many false positive alerts.”
The primary culprit? A process that generates alerts based on static, user-defined threshold violations. Don’t get me wrong – there are times when static thresholds serve a purpose. For example, wanting to know when a CPU exceeds 95% utilization for a period of 15 minutes or more, or when the voltage on a UPS falls outside of a specified range. But generally speaking, static, user-defined thresholds suffer from several issues:
- They are often “best guesses” at what acceptable performance should be
- They do not understand the context of your unique infrastructure
- They lack the intelligence to know if a violation is significant or not
When it comes to alerts, what you really want to understand is, “What’s happening in my environment right now that is unusual and that I need to know about?” Static thresholds fail to provide that insight.
For example, how do you determine acceptable upper- and lower-limit threshold values for the following?
- The number of connections to your firewall at any time of day
- The number of failed user logins for a specific application over a 15-minute period
- The number of Apache processes spawned by an application
Maybe you simply want to understand when bandwidth utilization is much higher than normal (users streaming YouTube and Netflix) or much lower (a network backup failed). Again, static thresholds come up short.
Too Much Noise
Let’s consider the following example: a monitored metric exceeds its predetermined, static threshold seven times over a 30-minute period. The result is seven redundant alerts, all contributing to the noise you have to wade through.
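To make that concrete, here is a minimal sketch of naive static-threshold alerting. The metric samples and the 95% threshold are hypothetical values invented for illustration, not output from any particular monitoring product.

```python
# Naive static-threshold alerting: every violation fires its own alert.
# Samples and threshold are hypothetical, for illustration only.

THRESHOLD = 95.0  # e.g. percent CPU utilization

# (minute, value) samples over a 30-minute window
samples = [
    (0, 82.0), (4, 96.1), (8, 97.4), (12, 95.8), (16, 96.5),
    (20, 98.2), (24, 95.3), (27, 97.0), (30, 88.5),
]

alerts = [(minute, value) for minute, value in samples if value > THRESHOLD]
for minute, value in alerts:
    print(f"ALERT: {value:.1f}% at minute {minute} exceeds threshold of {THRESHOLD}%")

print(f"{len(alerts)} separate alerts for what is really one sustained event")
```

Every sample over the line produces its own page, so one sustained event shows up as seven alerts.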
Too Wrong for Too Long
Now let’s consider a more intelligent approach to this scenario. Let’s assume your monitoring platform automatically baselines the performance of every metric it collects. This provides a reference point for any given time of day and day of the week.
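One way such a baseline might be built (a minimal sketch, assuming historical samples keyed by day of week and hour; this is not any vendor’s actual implementation) is to keep a mean and standard deviation per time-of-day/day-of-week bucket:

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history):
    """Compute (mean, stdev) per (weekday, hour) bucket.

    `history` is an iterable of (timestamp, value) pairs, where timestamp
    is a datetime. The structure is hypothetical, for illustration only.
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)

    return {
        key: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for key, values in buckets.items()
    }

def expected_range(baseline, ts, n_sigma=1.0):
    """Return the (low, high) band considered 'normal' for this moment."""
    mu, sigma = baseline[(ts.weekday(), ts.hour)]
    return mu - n_sigma * sigma, mu + n_sigma * sigma
```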
If we compare real-time performance to the historical norm for this time of day, we see that the first four spikes are within an acceptable range. Perhaps one spike exceeded a standard deviation from normal performance, but not for a prolonged period. No need to alert just yet.
But as we move further along the timeline, we notice something significant: the monitored metric exceeds the allowable deviation from its baseline four times within a sliding 15-minute window. Now the monitoring system can be confident it is looking at a genuine anomaly, and it alerts once.
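A sliding-window check along those lines might look like the sketch below, reusing the hypothetical expected_range helper from the previous snippet. The 15-minute window, one-standard-deviation band, and four-violation trigger come from this example, not from any product’s defaults.

```python
from collections import deque
from datetime import timedelta

def detect_anomaly(samples, baseline, window=timedelta(minutes=15),
                   n_sigma=1.0, min_violations=4):
    """Alert once when enough baseline deviations occur within a sliding window.

    `samples` is an iterable of (timestamp, value) pairs in time order;
    `baseline` maps (weekday, hour) -> (mean, stdev), as built above.
    """
    violations = deque()  # timestamps of recent out-of-band samples
    for ts, value in samples:
        low, high = expected_range(baseline, ts, n_sigma)
        if not (low <= value <= high):
            violations.append(ts)
        # Forget violations that have slid out of the window
        while violations and ts - violations[0] > window:
            violations.popleft()
        if len(violations) >= min_violations:
            return ts  # a single alert: the anomaly is sustained, not a blip
    return None  # nothing worth waking anyone up for
```

Isolated spikes age out of the window and never trigger anything; only a sustained deviation produces the one alert.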
This method is a far more reliable predictor of service-impacting events. There’s no noise to sort through and no false positive alerts to wake you up. Instead, you get insight into real-time changes in your infrastructure’s performance.
Looking for more common-sense tips for monitoring the performance of your infrastructure? Download our free whitepaper, 6 Steps to an Effective Performance Monitoring Strategy.