Why Most Performance Monitoring Architectures Don’t Scale


SevOne's Nic Reid, Sr. Director of Product Design & Management, discusses the history and evolution of performance management. He further explains the detriments of having an outdated platform.



Hi everyone. I'm Nic Reid. I am the head of product management for SevOne, and today I'd like to talk a little bit about the history of network management, particularly performance management, and the evolution, the steps that the various products have taken in order to get where we are today. Then I'd also like to talk a little bit about the risks to your business if you're still invested in some of the old architectures. Let's go back 20 years. In 1995 or thereabouts, it was a different time. Networks were much smaller; we were talking local area networks, generally. Ethernet was really the state of the art, and there were early deployments of TCP/IP in businesses. Generally, network management was a much simpler business, because internet was generally dial-up, the networks were disconnected, and there just weren't that many elements that needed a lot of management.

The systems that were created at that time, or that were in use in that time period, having been developed from the late '80s through to the mid '90s, were generally standalone systems. They were designed to run on a single box, so one box here. That box might have been a Windows box, an early Windows server, or it might have been some sort of Unix box, maybe a Sun box or an HP box of some description. There would have been a database running on that box, probably something like Oracle, or Sybase, or possibly MS SQL. There would have been a little bit of basic software running on there which knew how to do some kind of collection of statistics, and something on here that knew how to generate some reports. The systems that were deployed around that time may not have even been web based, or the web was an afterthought.

That was not part of the core offering; maybe there was a little bit of a web reporting front end that was provided alongside the rest of the platform. It might have been client-server; there might have been some Windows software that you connected into this platform to get some visibility into what was going on. The point is that there was generally one platform, and we did everything that we needed to do with that one platform. The sorts of devices that we were talking to might have been some fairly basic routers, some of the early Cisco routers, the IGSs and AGSs and these kinds of things. There might have been some hubs, or some early switches. There might have been some little workgroup servers or printers, but not very many. There were tens of devices that we were monitoring, and you could do everything you needed to do with some basic SNMP, some SNMPv1, perhaps. For our tens of devices we could keep up; it might have been 10-minute or 5-minute polling, possibly.

Around about 5 minutes, so let's say 5-minute polling, and we would have been pulling the statistics out of these boxes and sticking them in the database, and then there would have been maybe 1 or 2 network managers whose job it was to connect into this platform and have a look at what was happening, with a view to maybe doing some capacity planning on WAN links. There might have been some basic WAN links out of those boxes, but we're not talking about lots of different types of statistics. It was a small number of metrics, and also not updating very frequently. We might call this standalone platform the first version, or architecture 1. Let's call it architecture 1.0 of performance management. Now, fast forward a few years, and networks started getting a little more complex. Maybe there were some more routers in the network. There were some more servers, and maybe it was a bit more of a data center, where maybe the mainframe was introduced into this environment, or there were some AS/400 midrange boxes.

Incrementally, more and more of an organization's infrastructure became TCP/IP, or IP enabled, and that led to the need for greater and greater visibility. We ended up saying, "Well okay, how do we get beyond a single box here?" That leads to the next evolution of the architecture, of performance management as an architecture. We might call that 2.0. How do we recognize version 2.0? Well, that's where we start distributing components or functions out of this single box, into other collection boxes. They might be distributed geographically, but generally speaking, what that meant was, "Okay, I will deploy another box over here, whose job it is just to do the polling of this part of the network." It'll do its polling, and it will send all the statistics back to the same central database. We'll get rid of that there, and we might deploy another box over here, whose job it was to manage this part of the network, or collect statistics from this part of the network.

In other words, we were distributing the collection. Still, because of the history of where this had come from, we were still talking to, generally speaking, a single central database. Some sort of single platform which had a bunch of disks in it, and that's where all the statistics were ending up. Of course, the same sort of evolution occurred across the different verticals that we were addressing, so this might have been everything from WAN devices over this side to data center devices over here. There might be all sorts of branch equipment in between, so as the breadth of types of devices that network "managers" were being asked to manage increased, not only did the need for distributed collection increase, but the need for more people to be able to access the platforms started coming in here as well. We started seeing more pressure from the top as well. Now the server administrator wants access to this performance management platform. Well, that drives another kind of distribution.

Now our little reporting component that was sitting on here, this box is being worked far too hard, because it's responsible for storing all the statistics, but it's also responsible for doing all of the web work to give the reports that these guys, the growing constituency, the audience if you like, are demanding. What we then do is we start distributing the reporting. Now we have these report servers over here, and depending on the exact products that we're talking about at the time, they had different names for these modules, but we might move them out there so these guys are no longer talking to our poor, overworked database server, and instead they're talking to reporting servers, which do most of the reporting work. Now, that gets us so far, so that now we're looking at this architecture 2.0. Architecture 2.0 really is the idea of distributed collection, and perhaps distributed reporting, but still a central database. I think we can see where this is headed. Now this database is and always has been the number one bottleneck.

This is the issue with most of the performance management, and even network management platforms out there at the moment. There's too much of this single central database. I don't care if you're talking about an on-premises deployed solution, or whether it's cloud based. There are still a lot of legacy platforms which use this model out here where you have a central database. Now of course sometimes there are ways of making this a bit more resilient. You might have a 1:1 replica, or you might use something like Oracle RAC to have some kind of database resilience, but really we still have this main problem where everything that's collected from the network ends up in this database, and all of the users that want to get statistics out of that database have to go to the one place always. It's a problem as we scale. That takes us to the next evolution, or the next step in our road to today.

That brings us onto architecture 3.0. How do we re-engineer our collection of network performance information so that we can scale out, and we can handle all these additional users, and all of these additional network devices that we're trying to collect statistics from? Well, the concept is this. We take a box, and we make it have a database, some collection, and some reporting, and we build another one, and we keep on doing this. Always collection, always reporting, and always storage of data. We do that as many times as we need to. Then we create some clever middleware, or a query layer, if you prefer, which is able to take a request from a user, and ask other platforms for the information that's required in order to be able to respond to that request. This is a classic scale-out architecture. It's used in lots of web-scale companies. Every time you do a Google search, it's using this kind of thinking.
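The query-layer idea can be sketched in a few lines. This is a minimal illustration only, not SevOne's implementation: the `Appliance` class, the `federated_query` function, and the device and metric names are all hypothetical, and real appliances would be separate machines reached over the network rather than objects in one process.

```python
from concurrent.futures import ThreadPoolExecutor


class Appliance:
    """One building block: collects, stores, and reports on its own devices."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # (device, metric) -> list of (timestamp, value)

    def collect(self, device, metric, timestamp, value):
        # Collection and storage happen locally; there is no central database.
        self.store.setdefault((device, metric), []).append((timestamp, value))

    def query(self, metric, start, end):
        # Local reporting: return only the samples this appliance holds.
        return [
            (device, ts, value)
            for (device, m), samples in self.store.items()
            if m == metric
            for ts, value in samples
            if start <= ts <= end
        ]


def federated_query(appliances, metric, start, end):
    """Query layer: ask every appliance in parallel, then merge the answers."""
    with ThreadPoolExecutor(max_workers=max(1, len(appliances))) as pool:
        partials = pool.map(lambda a: a.query(metric, start, end), appliances)
        rows = [row for partial in partials for row in partial]
    # Merge into one time-ordered result, as if it came from a single system.
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Hypothetical data: two appliances, each responsible for different devices.
east = Appliance("east")
west = Appliance("west")
east.collect("router-1", "if_util", 0, 41.0)
east.collect("router-1", "if_util", 300, 47.5)
west.collect("switch-9", "if_util", 300, 12.0)
west.collect("switch-9", "cpu", 300, 3.0)

result = federated_query([east, west], "if_util", 0, 600)
for row in result:
    print(row)
```

The user asks one question and gets one merged answer, while each appliance only ever touches its own data. Adding a third appliance means adding one more entry to the list that is queried, with no change to the existing ones.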

Really what architecture 3.0 is all about is taking that web-scale thinking and applying it to network management, and it scales in a really linear, nice way. As more elements are added down here, there are more devices, and there are more servers, and there are more applications, and lots of new infrastructure that we're introducing all the time. Then we can just add in more of these appliances as and when we need them. Really this is where we're at in terms of the state of the art. This kind of model is what makes SevOne great. It's this idea that we can have lots of the same building block, and add them to scale out, insert them at any time without interrupting any of the existing monitoring and collection that's going on.
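To make the "linear" point concrete, here is a rough back-of-the-envelope sketch. Every figure in it is an assumption chosen for illustration (10 metrics per object, 5-minute polling, 100,000 objects per appliance); none of them is a SevOne product specification, and the function names are hypothetical.

```python
import math


def central_db_writes_per_sec(objects, metrics_per_object, poll_interval_s):
    """Architecture 2.0: every sample from every collector lands in one database."""
    return objects * metrics_per_object / poll_interval_s


def appliances_needed(objects, objects_per_appliance):
    """Architecture 3.0: add identical appliances until all objects are covered."""
    return math.ceil(objects / objects_per_appliance)


def per_appliance_writes_per_sec(objects, metrics_per_object, poll_interval_s,
                                 objects_per_appliance):
    """Each appliance stores only its own share, assuming an even spread."""
    total = central_db_writes_per_sec(objects, metrics_per_object, poll_interval_s)
    return total / appliances_needed(objects, objects_per_appliance)


# Assumed figures, for illustration only.
METRICS_PER_OBJECT = 10
POLL_INTERVAL_S = 300
OBJECTS_PER_APPLIANCE = 100_000

for objects in (100_000, 1_000_000, 5_000_000):
    central = central_db_writes_per_sec(objects, METRICS_PER_OBJECT, POLL_INTERVAL_S)
    nodes = appliances_needed(objects, OBJECTS_PER_APPLIANCE)
    local = per_appliance_writes_per_sec(
        objects, METRICS_PER_OBJECT, POLL_INTERVAL_S, OBJECTS_PER_APPLIANCE
    )
    print(f"{objects:>9} objects: central DB {central:>9.0f} writes/s, "
          f"{nodes:>2} appliances at {local:.0f} writes/s each")
```

Under these assumptions the load on a single central database grows in step with the network, while the load on each appliance stays flat because the appliance count grows instead.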

It's a very powerful next step forward in terms of the architecture. Now of course, if you don't have this kind of dynamic, flexible, scalable architecture, if you're still stuck back here in architecture 2.0, then all of the reasons that you have for doing network performance management, all of the alerting, all of the trending, capacity planning, the oversight of the operation of the network can all be very negatively impacted when you try to get beyond the scale of your single central database. The impact really can't be overstated. It's a massive impact. If you don't have this style of architecture or something that looks elastic like this, then as your organization grows and your network grows alongside it, you'll hit a point where you really just can't operate any longer. You can't get the insight that you need, you can't run reports in seconds, the alerts are not generated in real time, so everything slows down. That's why we champion and have built a product around this version 3 of the network performance architecture.

By using this third-generation architecture, SevOne has been able to create a product which can scale to monitor millions of objects. We have many production deployments with more than 5 million monitored objects, where an object is an interface, or a CPU, or a disk, or an application of some kind that you want to collect time series metrics from and do analytics around. We even have some customers who are monitoring more than 25 billion metrics per day, so these are some really very significant numbers, and this is because we monitor the largest networks in the world. None of that would be possible if we didn't use this kind of scale-out approach. We simply couldn't make it work. Indeed, when we look at the industry, Gartner actually predicts that something like 20 to 30% of all IT organizations will be ripping and replacing their legacy monitoring platforms in the next couple of years. Really the motivation for that will be all of the network growth they're expecting.
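To put the 25 billion metrics per day figure in perspective, it helps to convert it to a sustained per-second rate. The short calculation below uses only that figure from the talk; the function name is hypothetical, and the result is an average that says nothing about peaks.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sustained_rate(metrics_per_day):
    """Average number of metrics per second implied by a daily total."""
    return metrics_per_day / SECONDS_PER_DAY


rate = sustained_rate(25_000_000_000)
print(f"{rate:,.0f} metrics per second")  # prints "289,352 metrics per second"
```

That is roughly 290,000 samples arriving every second, around the clock, which is the kind of load that would have to pass through one database in the central model.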

With things like software-defined networks, or software-defined data centers, and things like hybrid cloud infrastructures, there are so many more metrics that are important to the running of those environments that you need a scale-out architecture in order to be able to accurately collect them and analyze them in real time, generate alerts, create reports for thousands of simultaneous users, and do so without any kind of slowdown. You also need to be able to support one-second granularity, day in, day out, and that's really not possible unless you have something that goes along this architectural line. Thanks very much for your time. I hope this has been a useful introduction to network performance management architectures and the evolution of how we got to where we are today. We have tons of great content on our website, videos and white papers, so come and check it out.