Using Cisco IP SLA to Monitor Network and Application Latency and Meet Service Levels

Another year of Cisco Live London is over. The SevOne team had a terrific show, and we were overwhelmed by the number of engineers from across EMEA who kindly showed so much interest in v5, the latest release of our network performance management solution. Our CTO, the inimitable Vess Bakalov, and I were fortunate to be asked to deliver a session about the power of high-definition IP SLA polling via SNMP.

IP SLA, as a technology, is not particularly new or shiny. It's been around in one form or another for more than a decade, though it started out under a different name, RTTMON. It then went by the SAA tag before being re-branded as IP SLA in 2005. References to both RTTMON and SAA can still, of course, be found in the configuration commands, SNMP MIBs and documentation.

I find the idea of technological archaeology fascinating. The evolutionary origins of our technologies are so often visible in their modern structures. There is a close parallel with the biological world: we humans suffer back problems because we only recently, in evolutionary terms, learned to walk on our hind legs. Design limits which were not seriously considered to be limits at the time (IPv4 address space, ASCII in a global market and 640KB of memory under MS-DOS, to name a few) become limiting all too soon.

Before Professor Richard Dawkins became the polarising arch-sceptic activist we know and love today (or not), he described cultural evolution as an extension of genetic evolution, coining the term "meme" to describe a cultural idea (for example, a song, phrase, image, idiom or invention) which replicates and transmits through the populace in a way which is analogous to genes in biology.

Clear and extreme examples of these memes are the kinds of fads and crazes which spread lightning-fast through school playgrounds. The shifting rules of marbles, girl germs (or ‘cooties’ in the US) and the supposed ill effects of stepping on cracks in the pavement are all examples of childhood memes. Children, with their spongy, curious, inventive minds, seem a particularly fertile ground for enterprising memes. As genetic (particularly human) evolution has slowed, the rate of memetic evolution has seemingly accelerated. It would be difficult to argue that ideas are not communicated, combined and synthesized faster now than ever before, and yet we wonder what drives the ever-increasing thirst for bandwidth. I personally wish there were fewer ‘piano-playing cat’ memes and more ‘new approach to fighting global poverty’ memes floating about, but that’s just me being old and grumpy.

I think I wandered a little off track there. So, bringing it all back to SNMP and IP SLA – we at SevOne like to consider all available network monitoring approaches - not just the newest, freshest ones - to see if our uniquely speedy-scaly architecture opens up new applications. SevOne has a few strengths which really help to get the best out of these (in internet terms) venerable Cisco integrated technologies. SevOne’s strengths are:

  1. Support for sustained high-speed polling, with the data retained for 12 months. Our multi-threaded engine can support polling down to every 1 second, but it’s often more practical (because of round-trip latency, for example) to use a period of perhaps 5 seconds.
  2. Automatic baselining for all collected indicators, with support for 5 minute baseline intervals. Using 5 minute baselines can equate to hundreds of billions of individual baseline points being calculated every week for some of our larger customers. Yes, billions with a B: ~30 indicators x 5M objects x 2016 5-minute-slots-per-week = 302.4B.
  3. An alerting engine which uses the raw data at 5 second collection granularity. Many performance systems only really support alerting using 5 minute data, which can’t be considered very real-time in these days of sub-10ms algorithmic trades.
  4. The ability to support compound static and dynamic baseline thresholds.
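
The "billions with a B" figure in point 2 is easy to check. The snippet below is only a back-of-the-envelope sketch using the numbers quoted above; the variable names are illustrative and not part of any SevOne tooling.

```python
# Figures quoted in the list above; names are illustrative only.
indicators_per_object = 30
objects = 5_000_000

# Number of 5-minute slots in one week: 7 days x 24 hours x 60 minutes / 5.
slots_per_week = 7 * 24 * 60 // 5

baseline_points_per_week = indicators_per_object * objects * slots_per_week

print(slots_per_week)                             # 2016
print(f"{baseline_points_per_week / 1e9:.1f}B")   # 302.4B
```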

All of these points are pretty straightforward, save for number 4. SevOne compound thresholds relate to our ability to create threshold policies which combine threshold conditions using Boolean AND/OR logic. Why is that interesting? Well, consider this situation:

CPE Router -> Aggregation Router -> Internet -> Datacenter Gateway Router -> Distribution Switch -> Application Server

By defining a set of parallel IP SLA probes on the CPE Router, each of which terminates on one of the known hops of the path back to the application server, and building the ‘normal’ baseline picture of their round-trip latencies over a typical week, we can make inferences about which remote hops are responsible for an end-to-end degradation. This is very similar to what you or I would do when we run a traceroute from our local machine to look for slow hops, except it has the advantage of being able to use all the different probe types, such as UDP, Jitter, VoIP, HTTP, DNS and L2ping, which IP SLA gives us.
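
To make the inference concrete, here is a minimal Python sketch of the idea, not SevOne's implementation. It assumes each probe yields a current round-trip time and a baseline for the same time of week, and it attributes a degradation to the segment where the excess over baseline first jumps. The `HopReading` type, the function names and all the millisecond values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HopReading:
    name: str           # hop the probe terminates on
    baseline_ms: float  # 'normal' RTT for this time of week (hypothetical)
    current_ms: float   # latest measured RTT (hypothetical)


def excess_per_segment(hops, source="CPE Router"):
    """Return (segment, added_delay_ms) pairs.

    The added delay of a segment is how much the excess-over-baseline
    grows between the previous hop and this one.
    """
    results = []
    prev_name, prev_excess = source, 0.0
    for hop in hops:
        excess = hop.current_ms - hop.baseline_ms
        results.append((f"{prev_name} -> {hop.name}", excess - prev_excess))
        prev_name, prev_excess = hop.name, excess
    return results


def worst_segment(hops):
    """Pick the segment that contributes the most added delay."""
    return max(excess_per_segment(hops), key=lambda item: item[1])


# Hops in path order. The Internet itself cannot host a probe target,
# so it sits inside the Aggregation -> Datacenter Gateway segment.
hops = [
    HopReading("Aggregation Router", 2.0, 2.1),
    HopReading("Datacenter Gateway Router", 20.0, 44.5),
    HopReading("Distribution Switch", 20.5, 45.0),
    HopReading("Application Server", 21.0, 45.4),
]

segment, added_ms = worst_segment(hops)
print(segment)  # Aggregation Router -> Datacenter Gateway Router
```

With these made-up numbers, nearly all of the extra latency appears between the Aggregation Router and the Datacenter Gateway Router, which is the segment that crosses the Internet.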

Combining these probes with a SevOne compound threshold, we can automatically generate actionable notifications and send them to your network engineering staff. Instead of simply reporting that something is slow, a notification can say, "Your application is 120% slower than normal for this time of day and it’s due to the congestion between your provider’s Datacenter Router and the Internet", all without having management access to either! Neat, huh?
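
For the curious, a compound threshold of this kind can be sketched in a few lines of Python. This is an illustration of combining static and baseline conditions with Boolean AND/OR logic, not SevOne's policy engine or API. Every function name, dictionary key, limit and reading below is an assumption made for the example.

```python
def pct_above(current, baseline):
    """Percentage by which current exceeds baseline."""
    return (current - baseline) / baseline * 100.0


# Condition builders: each returns a function of the readings dict.
def static_above(key, limit_ms):
    return lambda r: r[key]["current_ms"] > limit_ms


def baseline_above(key, pct):
    return lambda r: pct_above(r[key]["current_ms"], r[key]["baseline_ms"]) > pct


def all_of(*conds):   # Boolean AND
    return lambda r: all(c(r) for c in conds)


def any_of(*conds):   # Boolean OR
    return lambda r: any(c(r) for c in conds)


def negate(cond):     # Boolean NOT
    return lambda r: not cond(r)


# Hypothetical policy: the application is slow (by baseline OR by a static
# limit) AND the datacenter gateway is slow AND the aggregation router is not.
policy = all_of(
    any_of(baseline_above("app_server", 100.0), static_above("app_server", 200.0)),
    baseline_above("gateway", 100.0),
    negate(baseline_above("aggregation", 50.0)),
)

# Hypothetical readings in milliseconds.
readings = {
    "aggregation": {"baseline_ms": 2.0, "current_ms": 2.1},
    "gateway": {"baseline_ms": 20.0, "current_ms": 44.0},
    "app_server": {"baseline_ms": 21.0, "current_ms": 46.2},
}

if policy(readings):
    slowdown = pct_above(readings["app_server"]["current_ms"],
                         readings["app_server"]["baseline_ms"])
    print(f"Application is {slowdown:.0f}% slower than normal; "
          "suspect the path between the aggregation router and the datacenter gateway")
```

Building the policy from small composable conditions keeps each test simple, and the AND/OR structure is what lets the alert name a likely culprit rather than just a symptom.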