Well, Cisco finally came back with an answer to why I was able to break things like clockwork before, and that answer was firmware. New firmware has been released for the chassis, blades, and FEX (and I’m sure I’ve got either the order or the hardware wrong), but I can’t say I’m excited about it.
We set aside more time to have Cisco come in and upgrade the bits, as if we hadn’t wasted enough time already. This time they sent in the big guns to work on it, or gun, rather, as they sent one of their engineers, Troy. He was a good guy and very knowledgeable, but he can’t help that he works for Cisco; we’ve all gotta eat, right?
After laying down the new firmware in a very specific order (don’t ask me to venture down that road now, it’d be like flying blind), the UCS system is alive and running well. I fire up my lone guest and have him chug away just like before. Twenty-four hours later he’s still running and no servers are offline. REJOICE! So it’s very apparent the new firmware fixed something, but what exactly? Your guess is as good as mine, if not better.
I’m always wanting to up the ante and try harder, so I cloned my guest three times, for a total of four guests, each with 4 vCPUs; two got 8GB of RAM and two got 4GB. The reason for the split is that I tested our four internally supported OSs: 2003 and 2008, in x86 and x64 flavors (the two x86 guests got 4GB because of the 32-bit addressing limitation). Simple math says we’ll be using 16 vCPUs and 24GB of RAM. The Xeons in these blades are quad-core with hyperthreading, for a total of 8 cores and 16 threads per blade, so it should be able to handle it.
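If you’d rather see the layout than do the math, here’s roughly what it looks like in PowerCLI. The VM names and vCenter are made up for the sake of the sketch; the vCPU and memory numbers are the real ones.

```powershell
# Rough PowerCLI sketch of the guest layout -- names/server are hypothetical, sizes are what I used
Connect-VIServer -Server lab-vcenter   # hypothetical vCenter

# 4 guests, 4 vCPUs each; the x64 guests get 8GB, the x86 guests get 4GB (32-bit addressing limit)
$guests = @(
    @{ Name = "lab-2003-x86"; MemMB = 4096 },
    @{ Name = "lab-2003-x64"; MemMB = 8192 },
    @{ Name = "lab-2008-x86"; MemMB = 4096 },
    @{ Name = "lab-2008-x64"; MemMB = 8192 }
)

foreach ($g in $guests) {
    Set-VM -VM (Get-VM -Name $g.Name) -NumCpu 4 -MemoryMB $g.MemMB -Confirm:$false
}

# Totals: 4 x 4 vCPUs = 16 vCPUs, (2 x 8GB) + (2 x 4GB) = 24GB of RAM
```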
Now, since each guest has 4 vCPUs, I needed something more than before. I took my regular two scripts, the factoring one and the mass instantiation of calc.exe, then added two more just like the calc one that launch notepad.exe and mspaint.exe. So now it’s launching 852 calcs, 852 notepads, and 852 paints, on top of the factoring, so yeah, it’s pretty slammed.
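For the curious, here’s the general shape of that load, sketched out. These aren’t my exact scripts, but the counts are real; the factoring target is just a stand-in number.

```powershell
# Sketch of the in-guest load generators -- same idea as my scripts, not a verbatim copy

# 1) Mass instantiation: 852 copies each of calc, notepad, and mspaint
foreach ($app in "calc.exe", "notepad.exe", "mspaint.exe") {
    1..852 | ForEach-Object { Start-Process -FilePath $app }
}

# 2) Factoring burn: crude trial division in an endless loop to keep the vCPUs pegged
#    (the starting number is just an example; anything big and odd will do)
while ($true) {
    $n = 1234567890123
    $d = 3
    $factors = @()
    while ($d * $d -le $n) {
        if ($n % $d -eq 0) { $factors += $d; $n = $n / $d }
        else { $d += 2 }
    }
    $factors += $n   # whatever is left over is prime
}
```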
My results were different this time: my host still went into the Not Responding state, but it was the only one affected, and I was able to kill it after 6 hours, like clockwork. It’s obvious the firmware fixed something, but it’s not ‘golden’, imo. True, my guests were asking for all the RAM the system had available, but don’t all real virtualization environments overprovision? I watched, and it wasn’t fully maxing out the RAM at the host level, but it sure did come close.
To be fair, I decided to try the exact same test on a Dell M610. I couldn’t do it exactly the same, as I don’t have CNAs in the Dell and the storage back end in my lab is iSCSI, but it’s the closest thing I’ve got. I also decided to throw more at the Dell and created one more 2008 x64 guest with 4 vCPUs and 8GB of RAM.
I started the exact same four PowerShell command lines on December 17th, went on vacation for two weeks, came back, and forgot about it for another week. So after about a month, I check on my five guests and they’re all still running like champs! The host is still extremely responsive, and to top it off, one of my guests finished its factoring.
UCS ready for prime time? You be the judge!
One more thought to digest: while my 8 UCS blades were sitting idle with ZERO running guests, one host dropped off and became “Not Responding” again. I power cycled it and kind of ignored it, since I already pretty much loathe UCS, and not even a week later a different host is now “Not Responding”. WTF???? C’mon Cisco, fix yer shit!