Okay, so now that we’ve tested different OS installations, it’s time to test the real purpose we acquired these blades for: virtualization.
A little info on the hardware: Cisco N20-B6620-1, dual Xeon E5540s, 24GB of RAM, and two 73GB drives.
We’re using VMware ESXi 4.0u1 for our testing, and we’re booting from the SAN. Yes, I know, boot from SAN is still only experimental with vSphere; I don’t like it, but that’s the path I was led down by my superiors.
We’ve got vCenter running and 8 hosts (2 chassis with 4 blades in each), and everything seemed fine, so I laid down Windows 2008 x64 in a guest with 2 vCPUs and 4GB of RAM. It loaded in the expected amount of time; nothing spectacular here.
After I had this lone guest running, I decided to try some load testing from within the guest. I simply googled for PowerShell load generation and stumbled upon this, taken from Here: Measure-Command {$result = 1; foreach ($number in 1..2147483647) {$result = $result * $number}}
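For readability, here’s that same one-liner spread out with a couple of comments; it’s exactly the same command, just easier to see what it’s doing:

# Time a tight, CPU-bound loop: multiply an accumulator by every number from 1 to [int]::MaxValue.
# No I/O at all; it just keeps one logical CPU busy for a very long time.
Measure-Command {
    $result = 1
    foreach ($number in 1..2147483647) {
        $result = $result * $number
    }
}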
Since it isn’t multithreaded, I needed something more to work my second vCPU. Using similar logic, I created this: $calc = 1; foreach ($number in 1..2147483647) {$calc = invoke-item c:\windows\system32\calc.exe}
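For what it’s worth, if PowerShell 2.0 is available in the guest, background jobs are probably a cleaner way to keep every vCPU busy than spawning calc.exe; here’s a rough sketch, one CPU-burner job per logical processor (consider it untested):

# Sketch: start one CPU-bound background job per logical processor (Start-Job needs PowerShell 2.0).
$cpuCount = [Environment]::ProcessorCount
1..$cpuCount | ForEach-Object {
    Start-Job -ScriptBlock {
        $result = 1
        foreach ($number in 1..2147483647) { $result = $result * $number }
    }
}
# Get-Job lists the running jobs; Stop-Job and Remove-Job clean them up when you're done.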
That calc.exe line basically tries to instantiate 2,147,483,647 instances of calc.exe, which I later discovered doesn’t actually work. After testing many times, the most instances of any single process I could get running was 852 (whether it was notepad, paint, or calc).
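If you want to see where your own guest tops out, here’s a quick-and-dirty sketch along the same lines (notepad.exe picked arbitrarily, and 1,000 is just a number comfortably above the 852 ceiling I hit):

# Sketch: try to launch 1,000 copies of notepad, then count how many actually exist.
# Launches attempted past the ceiling will simply error out.
foreach ($i in 1..1000) {
    Invoke-Item c:\windows\system32\notepad.exe
}
(Get-Process notepad).Count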
Okay, so back on topic: I opened two PowerShell windows and ran each of the above in its own window. This quickly pegged the CPU and ate up all the memory. I decided to leave it running overnight to see how it did, but upon returning to work the next morning I discovered something a bit odd: two UCS blades were listed as Not Responding in vCenter.
One host was where my lone VM lived; the other seemed random. After some investigation, I discovered that right around the 8-hour mark, BOTH hosts had dropped offline. Pinging the ESXi hosts showed strange latency numbers and dropped packets. Now that was an odd coincidence, so I power-cycled both blades, made sure they were online again, brought up my lone guest, and started my PowerShell load gens again.
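If you’d rather catch exactly when a host falls over than discover it the next morning, even a dumb timestamped ping loop from another box does the trick; the hostname and log file below are made up:

# Sketch: log a timestamped ping result for an ESXi host every 10 seconds.
while ($true) {
    $reply = ping -n 1 esxi-blade-01 | Select-String "Reply from|Request timed out"
    "{0}  {1}" -f (Get-Date -Format s), $reply | Out-File -Append .\esxi-blade-01-ping.log
    Start-Sleep -Seconds 10
}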
Much to my surprise, 8 hours later I had two hosts offline again, the EXACT same two as before. I brought them back online, moved my lone guest to a different blade, and tried again, with the same results.
Now, why would a single lone guest impact the hardware this way? Cisco couldn’t tell me either. I went back and forth with them on this, and they really didn’t have an answer right away. I mean, srsly, I wasn’t maxing out resources at the host level, only inside my puny 2 vCPU/4GB guest. There was no I/O, neither network nor storage, just simple CPU cycles and RAM utilization.
Now, you’ve got to ask yourself one question: Do I f…wait, wrong question, here we go: Is this something I really want running in my production datacenter? My answer: HELL NO!!!!!
This is a completely meaningless test without also running the exact same test on other vendors’ hardware. Did you by chance do that?
Yes, as a matter of fact I did, thanks for asking.
I compared it with what I had available, which was both a Dell M1000e chassis with an M610 blade and an R710 2U rack-mount server. I was able to replicate my issues on the Cisco UCS hardware over and over, while the Dell M610 ran for weeks without any issues and with little to no impact on the hypervisor.
I talked about that in Part 3 of this topic.