Data Center Monitoring

Join Dave Hegenbarth, SE Director of Global Strategic Alliances for SevOne, as he demonstrates the capabilities and general usage of SevOne's technology for data centers. Learn which performance characteristics of the data center should be tracked and closely monitored so that business efficiency can be maximized to its full potential.


Transcription:

Dave:
Today's topic is data center monitoring, and when we talk about data center monitoring, we're really talking about complete and immediate visibility into all the components that are running in the data center. Just a little bit about SevOne, and what makes us different in the marketplace. SevOne employs a peer-to-peer clustered technology. Now, what does that mean? Basically, that means that we can deliver near-real-time reports very, very quickly, across a large number of devices. It also means that we have a scalable solution: we can scale down to as few as several devices, and up to as many as 50,000, 75,000, or 100,000 devices. The great thing is, you get all the same features of SevOne, and all the same reporting performance, whether you have 10 devices or 10,000 devices. It really is this clustered peering technology that allows us to deliver value in the marketplace.

A little bit about the architecture, if you haven't seen this before. What are we collecting? Well, this is very relevant to today's topic of data center monitoring, because there are lots of parts and pieces in the data center. We collect that information in many different ways. We can go from traditional SNMP and ICMP, just pinging things to understand if they're up or down, and polling them with SNMP to understand some performance metrics. But there's a lot more than that in the data center, so we also have the ability to attach to and measure the performance of virtual servers, physical servers, or even devices such as computer room air conditioning, UPSs, and SAN switches. This all plays a part in that total data center monitoring ability. Some of the ways we do that, when not using SNMP: we have plugins that we call deferred data, or xStats. These are methods of getting non-pollable data into SevOne.

Once we've collected all that data, what do we do with it? Well, we have a very open API that allows us to either take in more data or dispense data. For configuration management, we can be tied into any number of different configuration management solutions, such that if a device is provisioned in the data center, it can automatically be provisioned in SevOne as well, for performance monitoring. We have the ability to take fault and event management events in, or we have the ability to have performance events relayed via the API to other fault management systems. For portals and service management, KPIs, key performance indicator metrics, are very key to showing the performance of the data center.

Of course we generate reports, so we have a web-based front end. It's completely web based, so you can hit it from your browser on your phone, on your laptop, on your iPad. We have also released an application, both for Android and iPhone, which allows us to have our event notification in a native app and be able to take that, obviously, wherever we go. So, a lot of different ways to get information in and out: an open API that makes it easy to get information in or out, and the ability to present this in a web interface or in an app for your phone.

This is the marketing slide, and the real point of this slide is not only to introduce you to some of our customers, but also to show you that no matter the vertical, they all have a lot in common: they need to see a broad number of KPIs across many, many devices, in and out of their data centers and remote locations, and be able to retrieve alerting and reporting in near real time.

With that, we're going to move over to the demo, Demo with Dave. What I have here, and I'll go back, I'll actually start at the homepage for you, so we can see this build. This is the homepage, or the welcome screen, by default in SevOne, if you don't configure anything else. On the left here I have a list of reports that I've created; these are my favorite reports that I might want to go to. Today, we're talking about the SevOne Data Center overview, so this is actually the SevOne data center, here in Wilmington, Delaware. I'm tracking many different metrics about the performance characteristics of the data center. The first is, I have two links to the internet, a primary and a backup. A couple of things to notice about this: my backup, here in the center, is running at about 0.21%, so I know that I'm over on my primary, and we can see primary traffic, from Comcast, coming through as a percentage of my internet pipe as a whole.

Some other things to note: we can see here that I'm actually polling my outside firewall every 10 seconds, because I really want a granular understanding of the performance of my internet traffic. If you can see here, I'm also measuring the 95th percentile mark over the time period, which I think is the past 4 hours, as we see there. I want to understand what my traffic level is with the top 5% of my peaks set aside. I'm also running a baseline, so if I zoom in to any part of this graph, and you can see that all of the graphs in a dashboard for SevOne are interactive, if I zoom in we'll see a little better. Right down here, it's hard to see, but you can see that I have this dashed blue line running through here, and that's my baseline, that's my understanding of normal for the traffic for this period in the day. Friday at 11, some odd minutes after the hour here, or actually in this case I zoomed in to 9:30. Friday at 9:30, 9:15 to 9:30, 9:30 to 9:45, we do 15-minute granularity on a Friday. We can see they're running just a little bit above normal, and that's a look at my firewall, my outside interface, in terms of inbound and outbound traffic. Same thing happens for my backup.
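As a rough illustration of the 95th percentile mark mentioned here, the following Python sketch computes a percentile with the nearest-rank method. It is not SevOne's implementation; the function name and the sample utilization values are assumptions made up for the example.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical link utilization samples (percent), one per poll.
utilization = [12, 15, 11, 14, 90, 13, 16, 12, 15, 14,
               13, 12, 17, 15, 14, 13, 12, 16, 15, 14]

p95 = percentile_nearest_rank(utilization, 95)
print(p95)  # 17: the single spike to 90 falls in the top 5% and is set aside
```

With these 20 samples the one burst to 90% does not move the 95th percentile, which is why the measure is often preferred over the raw peak when judging sustained load.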

On the right, here, I actually have a response time graph, so I'm using Cisco IP SLA latency from the core of my network to my first hop into the internet. That's really the only link I control: my first hop to Comcast, my default gateway in Comcast. That's a link that I could call up and complain about. I'm not pinging Google or something else further into the internet, because I really have no control over the internet itself, but I do have that first hop. If we had some issue, I'd want to know about it. I'm very interested in response time alongside bandwidth. I can have 80% usage of my bandwidth; as long as my response time is good and quick, I really don't care. I paid for that pipe, I want to use it.

It's only when my response time goes up that I'm really interested in my bandwidth, short of my bandwidth being at 100%, the pipe being full. What I have in this graph, you'll see a solid line and a faint line, and this is what we call time over time. Now, I can actually understand my response time, my first hop into the internet, as it is today, compared to yesterday. I can know that yesterday was a really slow day, and everybody was complaining, and whatever, or I know yesterday was a really great day, and look at it as it is today. It gives today's data. Instead of a long-term rolling baseline, this is more of a real-time measurement: how am I today, day over day. We can see here that I'm right about the same; you can see the peaks are offset just a little bit, yesterday and today. We can see that my response time averages 2.4 milliseconds, which is a good round trip time to that first hop in the internet. My peak was only 13.6 milliseconds, so we can see the average, and we're doing pretty well there.

The next couple of graphs in my data center: one is NetFlow. SevOne gives us the ability to combine polled data, such as SNMP or statistics from VMware, what have you, with flow data. In this case this is NetFlow coming off my core switch link to the firewall. This is version 9 NetFlow coming in to us. It gives me a way of understanding, for the traffic I was looking at above, what is it composed of? Who's talking to who? What's their next hop? If I have multiple hops out of the network to the internet, I can see which one's being used. Then, I can see a volume of traffic over time. We can see that over the past 4 hours here, this one particular host has been talking out to this particular IP address in the internet. We can see that in the last 4 hours they've used 17.05 gigabits' worth of traffic, and peaked out at almost 40 megabits of traffic right about the time we started at 11 o'clock, Eastern Standard Time here.
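The "who's talking to who" view described above amounts to summing flow volume per conversation. A minimal Python sketch of that idea follows, assuming flow records have already been decoded into (source, destination, bytes) tuples. The record layout, function name, and addresses are hypothetical and are not taken from SevOne or from the NetFlow export format itself.

```python
from collections import defaultdict


def top_talkers(flows, n=3):
    """Sum bytes per (source, destination) pair and return the n largest pairs."""
    totals = defaultdict(int)
    for src, dst, nbytes in flows:
        totals[(src, dst)] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical decoded flow records: (source IP, destination IP, bytes).
flows = [
    ("10.0.0.5", "203.0.113.9", 4000),
    ("10.0.0.7", "198.51.100.2", 1500),
    ("10.0.0.5", "203.0.113.9", 6000),
    ("10.0.0.8", "203.0.113.9", 500),
]

ranked = top_talkers(flows, n=2)
for (src, dst), nbytes in ranked:
    print(f"{src} -> {dst}: {nbytes} bytes")
```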

So it gives me an understanding of the internet traffic up here, and in the very same dashboard I can combine the ability to see who is actually consuming that bandwidth. Here, I have a pie chart. This is the uplinks to some of my switches in the data center; I'm just seeing, from my core switches, who's consuming the most bandwidth. We can see 78% right now is all going to my 13750 closet switch, which is a large number of the users in my network. Next, we have a table which is what we call a TopN, where we rank some metric within SevOne. In this case I've chosen my core switch, and I've asked to rank the most utilized ports in that core switch.
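A TopN ranking of this kind can be sketched in a few lines of Python using the standard library's heapq.nlargest. The port names and utilization figures below are invented for illustration, and the helper name top_n is an assumption, not a SevOne function.

```python
import heapq


def top_n(metric_by_object, n):
    """Return the n (object, value) pairs with the highest metric value."""
    return heapq.nlargest(n, metric_by_object.items(), key=lambda item: item[1])


# Hypothetical per-port utilization (percent) on a core switch.
port_utilization = {
    "Gi1/0/1": 40.83,
    "Gi1/0/2": 12.5,
    "Vlan10": 9.1,
    "Vlan20": 3.4,
    "Gi1/0/3": 0.7,
}

busiest = top_n(port_utilization, 3)
for port, pct in busiest:
    print(f"{port}: {pct}%")
```

Because the ranking only looks at the numeric value, the same helper works for any metric, which matches the point that a TopN is not limited to ports on a switch.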

This first port, at 40.83%, is my SPAN port, out to another device, followed by, it looks like, a host and a couple of VLAN interfaces. So I'm just asking, for my core switch, which ports are consuming the most bandwidth? We can use this type of graph, or chart, for any metric we have, whether it's volts, or bytes, or whatever; it's not limited just to ports on a switch. The next 3 graphs are kind of interesting. This particular graph, the total internet, is an example of how we can actually create what we call synthetic objects, the ability to add up different KPIs and make a single KPI. In this case, I've added all the inbytes of my internet-facing firewall, all the outbytes for my primary, and all the inbytes and outbytes for my secondary, and I've lumped all of those together in a bandwidth chart. The neat thing is, once we have this, what we call a synthetic object, it acts just like a regular object in SevOne, so I can baseline it, I can alert on it, I can trend it, I can do all the good things we do there.
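The idea of a synthetic object, several KPIs added up into a single KPI, can be sketched as a point-by-point sum of time series. This is only an illustration under the assumption that all series are sampled at the same poll times; the function name and the byte counts are hypothetical.

```python
def synthetic_sum(*series):
    """Add several equally sampled time series point by point into one series."""
    lengths = {len(s) for s in series}
    if len(lengths) != 1:
        raise ValueError("need at least one series, all of the same length")
    return [sum(values) for values in zip(*series)]


# Hypothetical byte counts per poll interval for each interface direction.
primary_in = [100, 120, 110]
primary_out = [40, 50, 45]
backup_in = [1, 1, 2]
backup_out = [0, 1, 1]

total_internet = synthetic_sum(primary_in, primary_out, backup_in, backup_out)
print(total_internet)  # [141, 172, 158]
```

The result is an ordinary list of samples, so anything that works on a regular series (baselining, trending, thresholds) can be applied to it unchanged.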

In the center here, another example of some of the flexibility of SevOne is the ability to grab temperature. On the bottom, here, I'm actually measuring temperature off the inlet of module two in my core switch. I'm also measuring the outside temperature here in Wilmington, Delaware, with a zip code of 19808. The way we do that is we have a very small script that goes out to the weather site and says, for 19808 at this time, what is the temperature? We bring that number back and we put it into SevOne. Again, it's a synthetic object. After that, I can trend on it, I can baseline, I can do lots of different things.

I'm also measuring UPS voltage, so I have an APC UPS running in my data center. And I have the ability to understand line voltage in and line voltage out, and a number of different metrics; I just did these. And then, I have the ability to baseline that, as well. So we can see that our average is 207, our baseline is 206.98 over this period of time. So, I have the ability to grab stats on voltage. And, again, alert off those, so if I have a low voltage condition, or I'm running on battery, I would want an alert on that. And, we'll talk about alerts, here, in a minute or two.

Some other stats: I'm looking at my core switch CPU against its baseline; you can see we're running pretty hot today, over what the baseline usually is on a Friday. I'm looking at firewall statistics. A lot of times we get questions around security: what do you guys do around security monitoring? We're not security monitoring focused, but looking at my connection setup rate, if my firewall is getting hammered, I'm gonna know that, because I'm gonna be significantly greater than my normal baseline for a Friday at this time. What we can see here is the shaded area is my baseline, and my darker area is where I am above baseline. But I'm really not that much above baseline and not too concerned. I also included on there active IPsec tunnels, so I have the ability to kind of see how many tunnels have been built today to my remote sites. I want to know that my average is around four, because there's about four or five or six sites that connect, depending on what is needed and what the day is.

I'm also monitoring a wireless infrastructure. So, a lot of us have wireless controllers in the data center, they go to access points via IP across the infrastructure, and we want to understand how many users are using each of those. We also might want to understand the CPU and other things of the wireless controller. But, what we're looking at right here is, again, the number of stations or clients associated with an access point against the baseline. So, the shade is the baseline and the dark is the actuals. So you'll see we have a few more people than normal for a Friday that have shown up at SevOne today.

It is amazing to watch this as users bring their own devices; the number of attached interfaces or clients has gone up dramatically, even here at SevOne, just because everyone has a phone, or two phones, along with a laptop, iPads, and other wireless devices. So, we're very interested in watching the number of users so that we can provide the correct services.

Lastly, I just embedded another NetFlow graph; it tells me a little more about who's using the traffic over the internet. And, again, we can see this 5539, which we saw up top. We'll see we have an internet of NAT users, this is our pool going out, and you can see the volume and the usage of traffic over time. This was provided by NetFlow, coming from the edge router to us.

And then below that, as I mentioned, we'll talk a little bit about alerts. We have a graph here, which we call event summary. And the idea behind this is to understand when a performance alert may have happened for a particular group of devices, in a timed fashion. And what I mean by timed fashion is, I can draw a vertical line at any point and I can see that something was or was not good. So, I can see right in here, Friday at 22:17, whatever that was, that was almost a week ago. I was green for these guys, I had a notice of some sort here, and in my VMware I was green. So you can look at any time slice of the week to understand how things are performing. And, you get a summary at the bottom of this. So, I had 54 performance violations in the last week. They affected 6 of the 20 devices in my Linux group.

And the mean time to repair, the time it took to clear, for one reason or another, was 5 hours and 25 minutes. So, I can see that. In a perfect world... there's a lot of alerts in our environment for the sake of alerts. If this was all green, nobody would ever believe that it was actually working. So my violations, my thresholds, are set very close, and I do get a lot of alerts, but this allows me to see, visually, how I ran last week in the data center. Hopefully, this will be all green. Or, if there was a particular outage that affected the Linux servers, there would be a small red bar. The neat thing is to see that where I had this particular outage, I probably didn't have one here or here.
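Mean time to repair, as used in the event summary above, is the average of the time each alert took to clear. A small Python sketch of that arithmetic follows. The alert timestamps are hypothetical, chosen only so that the result works out to the 5 hours and 25 minutes quoted in the talk, and the handling of still-open alerts (they are skipped) is an assumption.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(alerts):
    """Average the raised-to-cleared duration over alerts that have cleared.

    Each alert is a (raised, cleared) pair; cleared is None while still open.
    Returns None when no alert has cleared yet.
    """
    durations = [cleared - raised for raised, cleared in alerts if cleared is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical alerts: two cleared, one still open.
alerts = [
    (datetime(2024, 1, 8, 8, 0), datetime(2024, 1, 8, 13, 0)),     # 5 h 00 min
    (datetime(2024, 1, 9, 22, 0), datetime(2024, 1, 10, 3, 50)),   # 5 h 50 min
    (datetime(2024, 1, 11, 9, 0), None),                           # still open
]

mttr = mean_time_to_repair(alerts)
print(mttr)  # 5:25:00
```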

That's a lot about different parts and pieces of our data center, wrapped up into one dashboard. And inside of SevOne, you always have the ability to PDF these. So, we can come in here, if my boss asks me for how we ran yesterday, or today, or whatever, I can actually click on the PDF button and up top is the PDF of the data center. It has a number of different pages, but you can see two of my graphs there. So, I can send this to him and he doesn't even need to log in. We also can schedule these to be reported. So, we can have this emailed as a PDF every day, every week, every month, to understand how things are moving in the data center.

As I mentioned the graphs in any given dashboard that we're viewing are live, or up to the last minute, so I refreshed this at 9 something, I can come right back in here and say, you know what, I want to see this graph; actually this guy is behind, I can say, you know what I want to see the past two hours for this guy, that was the graph that I had zoomed in on. And, we can see now, 11:24, I'm as current as my last poll.

We also have the ability to work with these, in terms of drilling down. I can highlight a particular data range, we can see the SNMP traffic, and I can do what we call chaining. So, I can actually take the output of this graph, and make it the input of another graph. So, if I quick chain to what we call a FlowFalcon Report, which is our NetFlow report. And, he's gonna pop toward the bottom of our screen. If I scroll all the way down here to the bottom, what we're gonna see is, now I have a NetFlow report of the traffic during that time period, and who was talking to who, from the perspective of the firewall. So, any of these graphs can be manipulated within the dashboard.

You can also set the dashboards to refresh, so I can have this up on a plasma in the NOC, or in someplace where people are interested in the performance of the data center. And I can set it to refresh every number of seconds. So I get a lot of information about my data center, in a single screen, from polled metrics, coming from things like IP SLA or SNMP. I have flow, which is coming from the routers and flowing into me. And I even have the ability to pull in scripted data, like the outside weather temperatures, or other things that cannot be polled from SNMP, to get those metrics in. And then those metrics are normalized, so I can change the time frame in the dashboard from two hours to, say, I want to see the past four weeks of everything. So, very quickly I can draw four-week graphs of all my data. So I can answer the question, how did that happen last week, or is this week the same as last week, or how did we do whenever. We have the ability to answer those questions very, very quickly.

And with that, I'm gonna open it up to questions. It's 11:25, I talked for quite a while here, but I wanted to open it up to everybody to see if they had questions for me. Data center monitoring, SevOne questions, monitoring, etc. Hopefully, someone has a question today.

Is everybody muted? Hang on, guys, you may be muted. I think now folks are unmuted. All right, questions.

Speaker 2:
Can you hear me?

Dave:
I can.

Speaker 2:
All right, great. I've got a quick question for you. These reports are all very impressive. Are they all a standard set of reports or are they all customized and how difficult are they to produce?

Dave:
That's a great question. They are customized. Each of these objects is kind of unique to here. But, they are very easy to produce. We have a very nice little report wizard kind of guide that goes through and you can build these graphs at your pleasure. I can do things like, if I go here and say I just want a TopN report, and actually open this up. There's a TopN of all my ports. I can do things like chain into a graph of that TopN. So, now I get kind of the data that is associated with these guys. And from there, if there's NetFlow associated, I can do NetFlow graphs. So, there's a lot of different ways to get graphs in there. But it's really easy to click and point and drag. You can make these bigger and smaller and things like that.

Speaker 2:
Great. Okay.

Dave:
Thanks. Other questions?

Speaker 3:
What versions do you need to be on to utilize this dashboard?

Dave:
We have been doing dashboards like this for a really long time. So, four dot something started it. The fun little wizard came in 5.0. So, the wizard part was a 5.0 addition. We are currently shipping 5.2.2, which is the latest version.

Speaker 3:
All right, thanks.

Dave:
Sure.

Speaker 4:
Can you do any 95th percentile calculations?

Dave:
Sure, so I can even add that as a line into this particular graph. So, I can edit this graph and I can say in my settings, somewhere in here, says I want to see my percentiles. And I can say 95th or 98th or 99th and we can add that and then we'll actually show you that for the data set on the graph.

Speaker 4:
Okay, can you export in CSV format?

Dave:
Sure can. Yep. So, I can drop this guy down and say get my CSV for this particular data.

Speaker 4:
Do you have the ability to do scheduled reports?

Dave:
Yeah, these dashboards can all be mailed to you. They end up as a PDF in your inbox, and it can be every hour, every day, every week.

Speaker 4:
Can you do an analysis, so one scheduled report that encompasses multiple interfaces at the same time?

Dave:
Absolutely, absolutely. Well, define multiple interfaces, but I can have this router and that router, and this server and this temperature, I can put all that in a graph or a set of graphs, yes.

Speaker 4:
Can you create kind of like a static group that contains the devices or interfaces of interests, so that if you have to run the same report over and over again, you can just say hey, just include this group of interfaces?

Dave:
I can, yes. So we can do what we call device groups, and we can do object groups, which are sets of interfaces. There's, yes, a number of different ways to do that.

Speaker 4:
And can you describe how you scale? So I don't know what the hardware components are for the solution. If I need to understand what happens as you scale to very large numbers of devices and/or polling instances, I don't know how your system is- what criteria you use.

Dave:
That's a great question. What I will do is I'll kind of go back to the slide here, because it sort of outlines it. This is more of a marketing picture than an architectural picture, but if we look at this, SevOne is sold as an appliance, whether virtual or hardware. So, what that means is that everything you need comes in the box, so to speak. All of our monitoring technologies are included: the software, the database, the hardware, the O.S., it's all there together for you, whether virtually or in a hardware appliance. And, really, our differentiator in the marketplace is that every one of those appliances, hardware or virtual, has a particular capacity that it can monitor. And, when you reach that capacity, you simply add another appliance, or another virtual appliance. They are peered together, so they talk to each other, and each understands the others' workload. And that's how we keep our speed of reporting at scale. We have, basically, a linear scale: as we add more monitored devices, we'll add more SevOne appliances to cover that workload. That's how we scale and, quite frankly, what makes us different in the marketplace.

Speaker 4:
So, what happens if you have multiple appliances and you're trying to generate a report, do you need to know which one monitors a particular device, or how does it work?

Dave:
So that's the magic of SevOne, if you will. With the peering that takes place, they all talk to each other, so they all understand who is monitoring what. And with this global picture, no matter which one of these boxes I would web to, I would have the ability to see all of the devices that I have the role-based access to see. So, if I'm an administrator and I can see all, I will be presented with a graph or a selection of devices, however you want to look at it, of all the devices in the world, and I will be able to report on them just like they were on the appliance that I was webbed into.

Speaker 4:
Okay.

Dave:
So, it's a peer-to-peer architecture, everybody knows about everything. So, I don't have to know who is monitoring what, the system takes care of it for me.

Speaker 4:
Okay. What database do you use?

Dave:
Well, that's a great thing, it's all a part of the appliance. So, it comes with the appliance, there's no database administration you need to do, or anything like that. But, under the covers, we do use MySQL.

Speaker 4:
Okay.

Dave:
Any other questions from anybody?

Speaker 4:
What's your data retention period?

Dave:
365 days. So, we keep all of the as-polled data. We size these appliances with enough disk space to keep all the as-polled data for 365 days.

Speaker 4:
And how many devices can you support per appliance?

Dave:
For our appliances, we break them down in engineering terms by objects. And an object is most easily described like this: if I had a 24-port switch and I wanted to monitor all 24 ports, it would cost me 24 objects. We break our sizing down into objects. Our smallest is 5,000 objects. Our largest is 200,000 objects in a 2RU platform. And, I would say that your average device, if we have to relate devices to objects, is about 50 objects, plus or minus. There are a lot larger and there's a lot smaller. But they average somewhere around 50. You divide that into 200,000 and end up with about 4,000 devices on a single 2RU box.

Speaker 4:
So, your licensing is done based on appliance?

Dave:
Our license is based on an object and an appliance will have a max number of objects. Smallest, being 5,000, largest being 200,000, and then you add the next appliance when you go over that limit.
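The sizing arithmetic in the two answers above can be written out as a short Python sketch. The capacity figures (5,000 and 200,000 objects) and the rough average of 50 objects per device come from the talk; the function names and the 100,000-device example are assumptions for illustration, and real deployments would size against actual object counts rather than an average.

```python
SMALLEST_APPLIANCE_OBJECTS = 5_000    # from the talk
LARGEST_APPLIANCE_OBJECTS = 200_000   # from the talk, in a 2RU platform
AVG_OBJECTS_PER_DEVICE = 50           # rough average quoted in the talk


def devices_per_appliance(capacity_objects, objects_per_device=AVG_OBJECTS_PER_DEVICE):
    """Estimate how many average-sized devices fit on one appliance."""
    return capacity_objects // objects_per_device


def appliances_needed(total_objects, capacity_objects):
    """Number of peered appliances required to cover total_objects."""
    return -(-total_objects // capacity_objects)  # integer ceiling division


print(devices_per_appliance(LARGEST_APPLIANCE_OBJECTS))   # 4000
# Hypothetical estate of 100,000 average devices on the largest appliance.
total = 100_000 * AVG_OBJECTS_PER_DEVICE
print(appliances_needed(total, LARGEST_APPLIANCE_OBJECTS))  # 25
```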

Speaker 4:
Okay, thank you.

Dave:
Sure. Anyone else, on a Friday?

Speaker 5:
If we're doing up/down monitoring, is there a way to, like, acknowledge that something is down, that somebody is working on it, for example?

Dave:
Sure, so in our alerts console we can do a couple of different things. We can acknowledge, or we can do an ignore. So, an acknowledge says, okay, we know about this, we want to clean off the board. The ignore says, we're not sure we fixed this, or hey, we're waiting for Dell to bring us a hard drive, so for the next four hours there's nothing we can do. So, ignore will take it off the board, and when the four-hour timer expires, we'll retest and bring it back. If you acknowledge and it's bad, we'll retest and bring it right back. Ignore kinda gets you around that and says, hey look, we know this is going on for a period of time, and then we're gonna recheck to see if Dell fixed whatever they needed to fix. If they haven't, we'll bring it back on the board. We also certainly give you the ability to auto clear. So, we have thresholds: if CPU goes above 90%, you want an alert. But, if CPU goes back below 90%, it'll take the alert back off the board automatically.

Speaker 6:
In relation to that, alerting for CPU, can you schedule individual monitors to not respond during a certain period? So, for instance, if you're expecting your server CPU to go to 100% every night while it's in maintenance, and you don't want to know about that, but you do want to know if the server stops responding, can you put a time schedule on a monitor?

Dave:
I could, but there's an even better way to handle that. In that particular scenario, remember the baselines I showed in my demo: you can actually base your alerting off of your baseline. And if that CPU is always at 100%, or that bandwidth is always at 100%, because we're doing this nightly backup, your baseline is going to reflect that value. So, I can say, I want an alert not when the CPU's at 100%, but when my CPU is much greater than normal, or baseline. So, if it's always 100% at this point in time, in the middle of the night, there will be no trigger. But if on Sunday afternoon at 4:00 p.m., or whatever, it's 100%, I'll get an error message because, hey, it's not usually that on a Sunday.

So, I could schedule an alert that says don't alert me on this particular thing at this particular time. Or, I can set up my alerting really more around normal. So, I have this rolling 10-week baseline of what is normal, and I want to know when I'm off of it, and I can do that in terms of a delta, 4 ticks off of it, or a percent, 150% greater than baseline. Or, I can do it in terms of standard deviation. So, I'm 3 standard deviations greater than normal right now, and I really want to understand why. So, two ways to do it: you can schedule it, or you could allow our understanding of normal to kind of take care of that.
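To make the baseline-relative alerting described in this answer concrete, here is a hedged Python sketch. It assumes the baseline for a time slot is simply the mean of the same slot over the previous 10 weeks, and it reads "150% greater than baseline" as more than 150% above the baseline value; both are interpretations for the example, not a description of how SevOne computes baselines. The function names and sample values are hypothetical.

```python
from statistics import mean, stdev


def exceeds_sigma(history, current, max_sigma=3.0):
    """True when current is more than max_sigma standard deviations above the mean."""
    return current > mean(history) + max_sigma * stdev(history)


def exceeds_percent(history, current, percent=150.0):
    """True when current is more than percent above the baseline (mean of history)."""
    return current > mean(history) * (1 + percent / 100)


# Hypothetical CPU readings (percent) for the same time slot over the last 10 weeks.
nightly_backup_slot = [100, 99, 100, 98, 100, 100, 99, 100, 100, 99]
sunday_4pm_slot = [10, 12, 11, 9, 10, 13, 11, 10, 12, 11]

# 100% CPU during the nightly backup is normal for that slot: no alert.
print(exceeds_sigma(nightly_backup_slot, 100))   # False
# 100% CPU on a Sunday afternoon is far above that slot's baseline: alert.
print(exceeds_sigma(sunday_4pm_slot, 100))       # True
print(exceeds_percent(sunday_4pm_slot, 100))     # True
```

The same reading of 100% produces opposite outcomes depending on the slot's history, which is the point being made about letting the baseline, rather than a schedule, decide when to stay quiet.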

Speaker 6:
Okay, thanks.

Dave:
Sure. Anybody else on a Friday? All right, thank you everyone, for joining Demo with Dave. As you can see on that screen there, Friday, March 29th, we're gonna go through the SevOne capabilities and the ins and outs of monitoring virtual machines, the relation between guests and hosts and something like that. Thanks a lot for joining, everyone have a great Friday.