OK - from one day of insider EMC threads on VMware topics - three posts.
2) "What causes VMFS volumes to get resignatured?"
Ah, VMFS volume resignaturing - one of those things that's nice about NFS datastores - you don't have to worry about it :-) But then again, there are pros and cons both ways - thank goodness I can embrace them both (and don't trust people who don't share the downsides of both!)
Ok, what are we talking about?
VMFS-3 volumes have a unique identifier (a UUID) in the Logical Volume Manager (LVM) that "signs" the volume. It's a long 16-byte (128-bit) value rendered in hexadecimal. This is similar to the idea of how Microsoft signs NTFS volumes. And this concept of signatures is very important in any case where LUNs are shared, like WSFC (Windows Server Failover Clustering - aka the new MSCS).
I remoted into my ESX server to show you what I mean. You'll see that in the /vmfs/volumes directory, each VMFS volume has a long string, and the nice readable names (the ones you configure in VI) are symlinks to the obscure UUIDs.
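Since the screenshots don't reproduce here, here's a mock-up of that directory structure - the UUID and datastore name below are made up, but the shape is exactly what you'd see on the host:

```shell
# Mock-up of /vmfs/volumes on an ESX host (made-up UUID and name).
# Each VMFS volume is mounted under its LVM signature (the UUID);
# the friendly datastore name is just a symlink pointing at it.
mkdir -p /tmp/vmfs-demo/volumes/49d22e2e-996a0dea-b555-001f2960aed8
cd /tmp/vmfs-demo/volumes
ln -s 49d22e2e-996a0dea-b555-001f2960aed8 Prod_Datastore_01
ls -l   # Prod_Datastore_01 -> 49d22e2e-996a0dea-b555-001f2960aed8
```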
Answer to the “what is the long string”:
It’s a VMware generated number – the LVM signature, aka the UUID (a long hexadecimal number designed to be unique). The signature itself has little to do with anything presented by the storage subsystem (Host LUN ID, SCSI device type), but a change in either will cause a VMFS volume to get resignatured (the ESX server says “hey, I used to have a LUN with this signature, but its parameters were different, so I’d better resign this”).
Long story short, the things that cause this to change (and this is universal across all vendors):
- LUN ID changes (the modern equivalent of the SCSI device number)
- SCSI device type changes - for example, going from a SCSI-2 to a SCSI-3 device (upgrading from DART 5.5 to DART 5.6 does this, as does changing FA settings on a DMX)
- Changing to be an SPC-2 compliant device
- This also commonly occurs with snapshots/clones/BCVs of a LUN (since they show up with a different LUN ID).
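To make the check concrete, here's a little sketch (not VMware's actual code - the field names and values are made up for illustration) of the comparison ESX effectively does on rescan: the on-disk signature records the parameters the volume was created with, and a mismatch against what the array now presents makes the LUN look like a snapshot.

```python
def needs_resignature(on_disk_sig: dict, presented: dict) -> bool:
    """Return True if the LUN's presented parameters no longer match
    the parameters recorded with the VMFS volume's signature."""
    return (on_disk_sig["lun_id"] != presented["lun_id"]
            or on_disk_sig["scsi_type"] != presented["scsi_type"])

# Example: a BCV/clone of the LUN shows up at a different LUN ID
original = {"lun_id": 12, "scsi_type": "SCSI-3"}
clone    = {"lun_id": 57, "scsi_type": "SCSI-3"}
print(needs_resignature(original, original))  # False - same LUN, untouched
print(needs_resignature(original, clone))     # True - looks like a snapshot
```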
This is covered in section 3.4 of the Symmetrix Solutions Guide for VMware (https://www.emc.com/collateral/hardware/solution-overview/h2529-vmware-esx-svr-w-symmetrix-wp-ldv.pdf). The author, Bala Ganeshan, is our Symmetrix Virtualization CTO/tzar, and it's a must read for customers (it's also where I grabbed some of these screenshots - thanks Bala!). Note that there are similar solutions guides for the other platforms which discuss this as well. AND - you can get these in a nice coffee table hardcover if you want. I always think it's good for both the server team and the storage team at customers - and in some cases that's one person! - to have this. Often, disconnect or "non cross-functional domain expertise" is a problem. Order the books here: https://store.vervante.com/c/v/category.html?base_cat=EMC%3A%20EMC%20TechBooks&pard=emc
Ok - so what happens? With a snapshot, the right behavior (google LVM.EnableResignature and LVM.DisallowSnapshotLun, or just read the EMC guide) is to enable resignaturing and rescan - poof, the LUN appears. With a DR cluster that hasn't seen that LUN before, you allow snapshot LUNs (DisallowSnapshotLun=0) and rescan, and the replica mounts with its original signature.
Here's the flow, as an example so you can see what happens at every step.
But on a production cluster, the old VMFS volume goes away. To get it to reappear, you need to change LVM.EnableResignature to 1, which will resignature the VMFS volume.
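On ESX 3.x you flip those advanced settings from the service console with esxcfg-advcfg (they're also in the VI Client under Advanced Settings > LVM). A sketch - the HBA name vmhba1 is an assumption, substitute your own:

```shell
# Check the current LVM settings (ESX 3.x service console)
esxcfg-advcfg -g /LVM/EnableResignature
esxcfg-advcfg -g /LVM/DisallowSnapshotLun

# Allow the volume to be resignatured, then rescan
esxcfg-advcfg -s 1 /LVM/EnableResignature
esxcfg-rescan vmhba1   # vmhba1 is an assumption - use your HBA

# Good hygiene: set it back to 0 once the volume has reappeared
esxcfg-advcfg -s 0 /LVM/EnableResignature
```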
As soon as a VMFS volume gets resignatured, all the VMs on it need to be re-registered in VirtualCenter.
But wait - what if it's not a snapshot, but the original LUN whose parameters changed? Those VMs are already registered. All of a sudden, it's like the old datastore disappeared and a new datastore arrived with a whole bunch of un-registered VMs, while the old (registered) VMs are greyed out. Uh oh.
Options to avoid an outage:
- First and foremost, most cases like this I’ve seen are because, for one reason or another, the initial best practices were not followed. Then someone realizes it (sometimes because they want to leverage something like SRM, or migrate to a new platform – which requires these to be set properly). Seriously, if you're serious about your VMware environment - and you should be, whether you're on the server side or the storage side - do yourself a favor and read the best practices guides. Learn from your trusted vendors so you don't have to learn it yourself. Server vendors have 'em, storage vendors have 'em, backup vendors have 'em, and while networking vendors in general aren't there yet, Cisco has a great one.
- You can disable all resignaturing and let the volume mount with its existing signature (EnableResignature=0 and DisallowSnapshotLun=0), but this isn’t a good idea as a long-term solution IMHO (you’re just delaying the inevitable).
- Present a NEW LUN with the proper parameters, and then use Storage VMotion (this can be time-consuming - my rule of thumb is about 1GB per minute for Storage VMotion, with some additional IOPS load on the source and target volumes - and remember that svmotion right now has no throttling - it goes full tilt).
- It CAN be possible, using some storage virtualization (like EMC Invista), to hide the back-end change from the ESX host.
- Alternatively, you can plan for the outage and automate the VM registration using a script (EMC Professional Services does this if you want us to).
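As a sketch of what that registration script can look like on an ESX 3.x service console (the datastore name below is made up; vmware-cmd -s register is the service console way to register a .vmx):

```shell
# Hypothetical sketch - after the datastore is resignatured, register
# every VM it contains (run from the ESX 3.x service console).
DATASTORE=/vmfs/volumes/Prod_Datastore_01   # made-up datastore name
find "$DATASTORE" -name '*.vmx' | while read -r vmx; do
    vmware-cmd -s register "$vmx"
done
```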
Moral of that story: Read and try to follow vendor best practices material. Don't take changes to your infrastructure - including storage - that supports VMware lightly. Test before you change in mission-critical cases, just like you do for your other mission-critical IT elements.