Hello! Long time, no scripting! I’ve been blowing through VCF, deploying and redeploying, and I built some scripts to help me with this. Sharing is caring, so read on to see what I’ve done…
At a high level, I need to install five (5) PCIe NVMe SSDs into a homelab server. In this post I cover how the CPU and motherboard both play a role in how and where these PCIe cards can and should be connected. I learned that simply having slots on the motherboard doesn’t mean they’re all capable of the same things. My research was eye-opening and really helped me understand the underlying architecture of the CPU, the chipset, and manufacturer-specific motherboard connectivity. It’s a lot to digest at first, but I hope this provides some insight for others to learn from. Before I forget, the info below applies to server motherboards too, and plays a key role in dual-socket boards when only a single CPU is installed.
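To make the lane math concrete, here’s a rough, hypothetical budget in Python; the lane counts below are illustrative and not taken from any specific CPU or board:

# Rough PCIe lane budget for five NVMe SSDs -- illustrative numbers only.
cpu_lanes = 44          # hypothetical CPU-attached PCIe lanes
chipset_uplink = 4      # hypothetical chipset/DMI uplink width (shared by everything on the chipset)
lanes_per_nvme = 4      # a typical x4 NVMe SSD

drives = 5
needed = drives * lanes_per_nvme   # 20 lanes if every drive gets a full x4 link

print(f"Need {needed} lanes for {drives} drives at x{lanes_per_nvme} each")
print(f"CPU-attached lanes available: {cpu_lanes}")
print(f"Any drive hung off the chipset shares a x{chipset_uplink} uplink with everything else")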
I told one of my nodes to enter maintenance mode and it sat overnight like this:
That screenshot was taken almost exactly 26 hours later. There were no running VMs on the host, nothing on the local datastore, no resyncing or rebuilding objects in vSAN, and nearly zero I/O on the network adapters.
I tried canceling the task, but it would not cancel.
I rebooted the host, and it came back into the cluster with that task still running.
I rebooted my vCenter, and that finally killed the task.
I’ve been intending to deploy NSX-T 2.4 since its release a few months ago to check out what’s new.
In the process, I worked out a repeatable workflow that makes the deployment relatively easy.
This assumes you already have your vCenter deployed with a vSphere cluster and port groups set up. In NSX-T 2.4 (-T hereafter), there are no longer separate controllers from the manager; you deploy a single manager and then add additional managers to make it a cluster. You’ll want 1 or 3 NSX Managers, depending on whether this is a lab, a test environment, or production; and if it’s a cluster, you’ll likely want an additional IP to serve as the cluster VIP. If you’re keeping count, that’s four (4) IPs, which is how I’m going to deploy it.
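For reference, here’s the kind of address plan I mean, sketched as a Python dictionary; the hostnames and IPs are made up:

# Example NSX-T 2.4 manager address plan -- hostnames and IPs are placeholders.
nsxt_plan = {
    "nsx-mgr-01": "10.0.10.11",   # first manager, deployed from the OVA
    "nsx-mgr-02": "10.0.10.12",   # added from the first manager's UI
    "nsx-mgr-03": "10.0.10.13",   # added from the first manager's UI
    "cluster-vip": "10.0.10.10",  # virtual IP for the manager cluster
}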
VMware has exploded into Software Defined Networking (SDN) with NSX, and it’s no secret why it’s their fastest-growing product. Through the use of all the components within NSX, you can be well on your way to a fully Software Defined Data Center (SDDC), accomplishing things like automated deployment of networks, edge devices, NAT rules, firewall rules, and the list goes on.
Over the last year, we’ve been doing a lot of testing with VMware Cloud on AWS (VMC), and it’s pretty slick. In the past, we’ve used our physical perimeter device (a Cisco ASA) to handle the VPN traffic, but yesterday I wanted to set up a VPN to the management gateway, and I wanted it done now. Since I don’t have direct access to the ASA, I’d have to submit a ticket to our NetSec team and wait for them to get to it on top of their own work, so naturally I decided to use an NSX Edge for this.
I pulled up the two interfaces side by side so I could fill out both at the same time, but I noticed the VMC side was missing a few things that I had on the NSX side: Local ID and Peer ID. The VMC side also had options for the IKE and SHA versions, which I didn’t have on the NSX side. Keep those in mind as you step through this. Let’s get started…
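As a rough sketch of what has to line up on both ends (the values here are placeholders, not a recommendation):

# Placeholder IPsec settings -- whatever the two UIs call each field, these
# values have to match on the NSX Edge side and the VMC side.
shared_vpn_settings = {
    "ike_version": "IKEv1",            # the IKE version option on the VMC side
    "encryption": "AES-256",
    "digest": "SHA-1",                 # the "SHA version" option on the VMC side
    "dh_group": "DH14",
    "pfs": True,
    "preshared_key": "<same on both sides>",
}
# Local ID and Peer ID on the NSX Edge side are typically just the public IPs
# of each endpoint unless one side is behind NAT (an assumption for this lab).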
I was reading VCDX56’s post Nutanix AHV VM Reporting Via REST API, authored by Magnus Andersson (@magander3), where, as the title suggests, he discusses a script he wrote to gather information about VMs running on a Nutanix AHV cluster using the REST APIs. At the end of the post, he mentioned that he would like to convert the script from Bash to Python.
I have recently been doing quite a bit of REST API scripting with Python, so I took a crack at it last night.
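Here’s a minimal sketch of the approach, assuming the Prism v2 VMs endpoint and the Python requests library; the cluster address and credentials are placeholders, and you’d pull whatever fields the report actually needs:

import requests
import urllib3

urllib3.disable_warnings()               # lab clusters often use self-signed certs

CLUSTER = "prism.lab.local"              # placeholder Prism/cluster address
AUTH = ("admin", "password")             # placeholder credentials

# Assumed Prism v2 endpoint for listing VMs; adjust the path/version to match your cluster.
url = f"https://{CLUSTER}:9440/PrismGateway/services/rest/v2.0/vms/"

resp = requests.get(url, auth=AUTH, verify=False)
resp.raise_for_status()

for vm in resp.json().get("entities", []):
    print(vm.get("name"), vm.get("power_state"))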
Everyone hears about VMware’s Virtual SAN and how awesome it is. It’s a very compelling offering and is only overshadowed by their software-defined networking solution, NSX.
The biggest hurdle: how to get started.
The truth is it’s extremely simple to enable and start using, but that’s not the “getting started” I’m talking about. I wanted to cover some things to think about once you’ve decided you’re going down the VSAN path.
How do you know how many IOPS to expect, or how much storage you will have or need? Should you go hybrid or all-flash? What resiliency and protection options do you have, and what is their impact?
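As one concrete example of the protection math, with the default RAID-1 mirroring and FTT=1 every gigabyte a VM writes consumes roughly two gigabytes of raw capacity; a quick back-of-the-napkin calculation (numbers are illustrative):

# Illustrative vSAN capacity math for hybrid RAID-1 mirroring.
raw_capacity_tb = 4 * 4.0        # e.g. four hosts with ~4 TB of raw capacity each
ftt = 1                          # failures to tolerate
mirror_copies = ftt + 1          # RAID-1 with FTT=1 keeps two full copies of the data

usable_tb = raw_capacity_tb / mirror_copies
print(f"~{raw_capacity_tb:.1f} TB raw -> ~{usable_tb:.1f} TB usable at FTT={ftt} (mirroring, before overhead)")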
UPDATE
VMware has posted a KB article about this, which I was not aware of at the time of writing: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2146267
We’ve been testing out VSAN here at work and noticed that one of the clusters we rolled out had serious latency issues. We initially blamed the application running on the hosted VMs, but when it continued to get worse we finally opened a case with VMware. Here’s a chart of the kind of stats we were seeing (courtesy of SexiGraf):
Read latency in particular was very high at the datastore level, IOPS weren’t great, and the read cache hit rate was low. We also saw high read and write latency at the VM level. After we opened a ticket with VMware, they discovered an undocumented bug in VSAN 6.2 where deduplication scanning runs even though deduplication is turned off (and is actually unsupported in hybrid-mode VSAN altogether). They provided the following solution:
For each host in the VSAN cluster:
1. Enter maintenance mode
2. SSH to the host and run: "esxcfg-advcfg -s 0 /LSOM/lsomComponentDedupScanType"
3. Reboot the host
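If you have more than a couple of hosts, it’s worth checking which ones still have the old value before going through the maintenance mode and reboot cycle on each. Here’s a rough sketch using Python and paramiko, assuming SSH is enabled on the hosts; hostnames and credentials are placeholders, and esxcfg-advcfg -g just reads the current value:

import paramiko

HOSTS = ["esx01.lab.local", "esx02.lab.local"]   # placeholder hostnames
USER, PASSWORD = "root", "password"              # placeholder credentials

CHECK_CMD = "esxcfg-advcfg -g /LSOM/lsomComponentDedupScanType"

for host in HOSTS:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=USER, password=PASSWORD)
    _, stdout, _ = ssh.exec_command(CHECK_CMD)
    print(host, stdout.read().decode().strip())
    ssh.close()
# The actual fix (-s 0) still requires the maintenance mode and reboot steps above.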
After we applied the fix, the cluster rebalanced for a little while and came back looking much, much better. In the graph below, you can see exactly when the fix was applied: read latency drops, IOPS increase, and the read cache hit rate jumps into the high 90-percent range:
And for good measure, this is how it’s looked since:
So to summarize, if you are running hybrid VSAN 6.2, you should definitely check your latency and read cache hit rate. If you’re experiencing high latency and poor read cache hit rate, go through and change /LSOM/lsomComponentDedupScanType on all your hosts to 0. I can’t take credit for actually discovering this, so thank you to my coworker @per_thorn for tracking it down. And thank you @thephuck for letting me write it up on this blog!