Identifying Network Technology & Server Candidates For Virtualization


Vess Bakalov, Founder and CTO of SevOne, discusses pinpointing likely candidates for server virtualization, understanding the impact it has on host CPU, RAM, and IOPS, and the importance of monitoring the performance of both the virtual and underlying physical infrastructure.


Hello. My name is Vess Bakalov, and I am the CTO of SevOne. Today we're going to talk about virtualization and, specifically, how to identify good candidates for virtualization. I expect that basically the entire audience has had some experience with virtualization; you have all dipped a finger into that deep lake and drunk some of its cool water. That being said, let's talk a little bit about the technicalities. What should guide our decisions about why we virtualize, and what happens when we do?

First of all, what drives virtualization? The answer is cost savings. That's why we're willing to shell out tens of thousands, hundreds of thousands, even millions of dollars to our virtualization vendors: so that they provide equivalent savings for us. We have these racks of servers, and there are tons of servers in here, each doing some specific task. In enterprise data centers, we have more or less resigned ourselves to having a dedicated server for each application. That provides, some would say, good firewalling: if one application goes down, it doesn't bring down any co-dependent applications, and it allows us to scale different parts of the infrastructure horizontally without impacting others.

Given that general philosophy, which we've adopted over the last couple of decades, the fact is that many of the applications running on each one of these servers may actually consume very little of its resources. We may be running something like DNS. Unless you are one of the major service providers in the world, you really can't buy a server slow enough that DNS would tax it. Even more so for something like DHCP; the list goes on. There are many very important services that we run, but the fact is they really don't tax our boxes a whole lot.

Furthermore, we now have these big boxes. With CPU architecture evolving toward heavy multi-core designs, a box can easily have 2 sockets with 12 cores each running hyperthreading, so you're very quickly looking at 48 or even 96 logical cores. RAM has also been evolving quite quickly, so it is not uncommon today to see servers with 512 gigabytes. On top of that, this machine is attached to a big SAN array, which consolidates our storage and provides really nice things like backup, etc. There is a ton of storage behind it, to the point where it's a commodity and we don't have to worry about it too much. We'll talk about which parts of it we do need to worry about, but volume is certainly not one of our main concerns.

This entire cabinet may end up using only as many resources as that one server could provide. What else is there? Consolidation is part of it, but the other thing that really drives this whole movement is easier management. This entire rack of servers may mean we're running different generations of hardware; even if we've standardized on a single vendor, there may be differences between generation 5, generation 6, etc., which necessitate different drivers, operating systems, and other things.

With virtualization on top, we get one of our main management benefits: commoditization. Essentially, we no longer care how the underlying server is built; exactly what the RAID is, exactly what the network is, all of these things are abstracted away for us. The next thing it gives us is vMotion (and, of course, the equivalent technologies provided by other vendors). That's really important because when we have to do maintenance, we can move all of the VMs that would be affected on a particular host over to another host, and we can do it during off hours. There is, of course, dynamic vMotion as well, but that hasn't been embraced by the enterprise quite as much. Nonetheless, the maintenance aspect alone is a huge benefit.

We also get fairly good HA capability and, not to be undersold, snapshotting: we can get a backup of a server, or a template of a server, very, very quickly. We also get console access essentially for free. That's another really good benefit that emerges from the use of virtualization.

This, combined with a drive towards applications that scale horizontally, essentially allows us to begin building the backbone of today's data center. What's even more important about horizontal scale is that it's on demand: as we add additional databases and additional web servers, we can put them behind our load balancers as the need arises.

Those are all great benefits. Now the question is: what do we need to watch out for? As always when I speak, it's about performance. How can we understand the performance impact on our enterprise? How do we guarantee the same level of service delivery between the nicely dedicated hosts, which give us a known level of service, and the brave new world of having everything inside these more consolidated environments?

One of the main things people immediately think about is CPU. CPU is honestly, I claim, the easiest bit to manage. CPU can be interleaved between different applications. Say we have an application with a daily pattern; here is a time base from 0:00 to 24:00, and let's say our pattern looks like this. We may have another application that does off-hours batch processing, which may start right at midnight and look like this. It's obviously a very easy choice for us to interleave these two applications.
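The interleaving idea can be sketched as a quick capacity check: take each application's CPU profile as a percentage of one host's capacity and verify that the hour-by-hour sum never exceeds 100%. The profiles and numbers below are illustrative shapes, not real measurements.

```python
# Minimal sketch of the interleaving check: a daytime interactive app
# and an off-hours batch job, each sampled hourly from 0:00 to 24:00
# as % of one host's CPU. All numbers are illustrative.

def combined_peak(profiles):
    """Peak of the hour-by-hour sum of per-VM CPU profiles (% of host)."""
    return max(sum(hour) for hour in zip(*profiles))

daytime = [10, 10, 10, 10, 10, 10, 40, 70, 80, 80, 80, 80,
           80, 80, 80, 80, 70, 50, 30, 20, 10, 10, 10, 10]
batch   = [80, 80, 80, 60, 40, 20, 10, 10, 10, 10, 10, 10,
           10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 40, 60]

peak = combined_peak([daytime, batch])
print(peak)  # -> 90: the combined load never exceeds the host, so they interleave
```

The same check scales to any number of candidate VMs: sum the profiles, and if the peak of the sum stays safely under the host's capacity, the workloads are complementary.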

What doesn't interleave so well, and what we really need to watch out for, is memory. In all likelihood, unless we're doing some of this horizontal scale-on-demand, servers are going to stay powered on; in very few environments do we see people actually powering down their servers. So even though the CPU will look like this, and we'll get better utilization across nearly the full 24 hours, memory is going to look like this: in all likelihood a flat line, because most server applications, once they load their resident memory, stay in resident memory. There isn't much fluctuation. Exceptions to the rule abound, but in 95% of cases this is essentially the graph you will see for RAM.
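Because resident sets stay loaded around the clock, RAM is additive rather than interleavable: the host needs roughly the sum of the guests' resident sizes at all times. A minimal sketch, with illustrative VM sizes and an assumed hypervisor reserve:

```python
# Hypothetical consolidation check for RAM. Unlike CPU, memory does not
# interleave: every powered-on VM holds its resident set 24 hours a day,
# so the requirements simply add up. Sizes in GB are illustrative.

resident_gb = {"dns": 2, "dhcp": 1, "web-01": 12, "db-01": 48}
host_ram_gb = 512
hypervisor_reserve_gb = 8  # assumed overhead for the hypervisor itself

needed = sum(resident_gb.values()) + hypervisor_reserve_gb
fits = needed <= host_ram_gb
print(needed, fits)  # -> 71 True
```

This is also why RAM is the easiest resource to project: a flat line per VM means the host-level requirement is just a sum, with no time-of-day modeling needed.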

There is another graph you are going to see when it comes to storage. As I mentioned earlier, storage has several aspects. There is pure volume, which I claim is fairly commoditized today. We do worry about it; it is something we need to check we don't exceed. But given that we have connectivity to a commoditized SAN, or even NAS, or even on-board disk, volume is fairly easy to manage over time. Storage volume doesn't change surprisingly.

What does change surprisingly are two things: throughput and IOPS. In this day and age, IOPS are probably the more important of the two. Oftentimes, CPU and storage are fairly well aligned: over time, this very same system is probably going to look a lot like this storage-wise as well, in terms of both IOPS and throughput. The scale may be different, though. Certain applications are extremely transaction-bound; others are throughput-bound, limited by the sheer volume of data being written to or read from disk.

This needs to be well understood when we begin to design the system. We need a really good baseline before we ever do anything: we need to look at historical data on how our servers have performed over time on their physical boxes, and then model them on top of what we are projecting to use as our virtualized environment. Once we've made the jump, things change. We continue to evolve our infrastructure. One of the most insidious things that happens in many environments is that the owners of services continue to manage their virtualized hosts as if they were physical. That's fine as far as it goes, and many of the things they do will continue to look very much the same, but the fact is the underlying infrastructure is just as important.

It's very important to look at both of these pieces at the same time. For instance, say I have a VM that reports it's using 80% of its RAM, and it's suddenly performing very slowly. Here's my graph of the VM: it's running at 80% RAM, and yet, all of a sudden, my transaction times begin to grow. It would be extremely useful to immediately correlate this with the host, where we notice a surprising trend: it's using 100% of its RAM and actually swapping. What's happening is that the VM, even though it thinks it's using 80% and not swapping, is being affected by the swapping of the host.
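The VM-plus-host correlation described above can be sketched as a simple two-layer check. The metric names and thresholds here are illustrative assumptions for the sketch, not any particular monitoring product's API:

```python
# Hypothetical sketch: correlate a VM's own memory reading with its host's.
# A VM can report a healthy 80% RAM with no swapping while the host is at
# 100% and swapping; the host-side swap is then the real bottleneck.

def diagnose(vm, host):
    """Return a rough verdict from VM-level and host-level memory metrics."""
    if host["ram_pct"] >= 100 and host["swap_rate"] > 0:
        return "host is swapping: VM slowdowns likely despite healthy VM stats"
    if vm["ram_pct"] >= 95:
        return "VM itself is memory-bound"
    return "memory looks fine at both layers"

vm   = {"ram_pct": 80, "swap_rate": 0}
host = {"ram_pct": 100, "swap_rate": 350}  # swap_rate in pages/sec, illustrative

print(diagnose(vm, host))  # -> host is swapping: VM slowdowns likely ...
```

The point of the sketch is the ordering: check the physical layer first, because a healthy-looking guest metric can mask a saturated host.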

That is not quite as bad with CPU, because with CPU, as the number of megahertz allocated to the VM steps down, we will actually see 100% CPU utilization inside the VM. Memory, in combination with swap, can often become the bottleneck that really chokes your applications.

The other thing we need to look out for is IOPS and IOPS latency. With a big SAN that's massively shared between many users, it is very difficult to figure out exactly what your limit is in terms of IOPS. There is a limit, but determining it in real time is more of an art than a science. It is very important to baseline. So we have CPU here; we also have to understand RAM. Luckily, RAM stacks up nicely, so it's actually fairly easy to project. The next thing is disk IOPS. I would argue the best way to measure this is to look at the average latency of disk reads and disk writes and make sure that it remains constant. The number of IOPS may suddenly decrease as you put your application on the virtualized host; what that really means is that we just can't get them through.
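The latency-as-signal idea can be sketched as a drift check against the pre-migration baseline. The tolerance factor and sample values below are illustrative assumptions:

```python
# Hypothetical sketch: baseline average disk read/write latency and flag
# drift after consolidation. A drop in observed IOPS with rising latency
# usually means the shared array can no longer get the operations through.

def latency_drifted(baseline_ms, samples_ms, tolerance=1.5):
    """True if the recent average latency exceeds baseline * tolerance."""
    recent = sum(samples_ms) / len(samples_ms)
    return recent > baseline_ms * tolerance

baseline_read_ms = 4.0                     # from the pre-migration baseline
recent_reads_ms  = [9.5, 11.0, 8.7, 12.3]  # samples after consolidation

print(latency_drifted(baseline_read_ms, recent_reads_ms))  # -> True
```

Because the SAN's true IOPS ceiling is hard to pin down, watching latency stay flat relative to the baseline is a more practical test than chasing an absolute IOPS number.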

Volume, which I will call "throughput" here, is also a consideration, but the main thing I would again point to is really going to be latency. If latency drifts, there are things you can do: you can get tiered storage, and we can look into how to build these things out.

To go all the way back: what are the best candidates for virtualization? Number 1: underutilized servers that are largely RAM-bound. As long as your application more or less resides in RAM and you have a well-understood CPU utilization pattern, you can easily move it into the virtualized environment with very little consideration.

The next thing is horizontal scaling on demand. Since the application can be easily spread across many servers, as long as you're not bottlenecked on a single storage device, you can essentially check CPU, RAM, and IOPS off your list. Of course, you must continuously manage the underlying infrastructure as well as the virtual.

The final category is the traditional heavy databases that heavily utilize a host. You may end up being able to consolidate only one or two servers that are heavily utilized in their bare-metal form onto a single, even large, virtual host. The trade-off you have to weigh for yourself in that case is, "Am I willing to potentially sacrifice some predictability for the benefits of snapshotting, console access, and vMotion: essentially, easier management?" Once all of these things are put into proper perspective, once you understand where your performance bottlenecks are, I think virtualization can become one of the most important tools in your arsenal. Thank you very much for watching. My name is Vess Bakalov, and I hope to see you soon.