Recently saw a little uptick (still a small number) in customers running into a specific issue – and I wanted to share the symptom and resolution. Common behavior:
- They want to remove a LUN from a vSphere 4 cluster
- They move or Storage vMotion the VMs off the datastore that is being removed (otherwise, the VMs would hard crash if you just yank out the datastore)
- After removing the LUN, VMs on OTHER datastores would become periodically unavailable on the network (not crashing, but intermittently unreachable)
- The ESX logs would show a series of errors starting with “NMP”
Examples of the error messages include:
“NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa._______________" - failed to issue command due to Not found (APD)”
“NMP: nmp_DeviceUpdatePathStates: Activated path "NULL" for NMP device "naa.__________________".”
What a weird one… I also found that this was affecting multiple storage vendors (suggesting an ESX-side issue). You can see the VMTN thread on this here.
So, I did some digging, following up on the VMware and EMC case numbers myself.
Here’s what’s happening, and the workaround options:
When a LUN supporting a datastore becomes unavailable, the NMP stack in vSphere 4 attempts to fail over to other paths, and if no paths are available, an APD (All Paths Down) state is assumed for that device (which kicks off a different path-state detection routine). If you then do a rescan, VMs on that ESX host will periodically lose network connectivity and become non-responsive.
This is a bug, and a known bug.
What was commonly happening in these cases was that the customer changed the LUN masking or zoning on the array or in the fabric, removing the LUN from all the ESX hosts before removing the datastore and the disk device in the vSphere client. Note that this can be triggered by anything that makes the LUN inaccessible to the ESX host – an intentional change, an outage, or an accident.
Workaround 1 (the better workaround IMO)
This workaround falls under “operational excellence”. The sequence of operations here is important – the issue only occurs if the LUN is removed while the datastore and disk device are still expected by the ESX host. The correct sequence for removing a LUN backing a datastore is:
- In the vSphere client, vacate the VMs from the datastore being removed (migrate or Storage vMotion)
- In the vSphere client, remove the Datastore
- In the vSphere client, remove the storage device
- Only then, in your array management tool, remove the LUN from the host.
- In the vSphere client, rescan the bus.
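For the command-line route, VMware KB articles 1009449 and 1015084 describe masking the LUN’s paths with the MASK_PATH plugin before the LUN is unpresented on the array. A rough sketch of that flow on ESX/ESXi 4 follows – the rule number, the vmhba/channel/target/LUN values, and the naa ID below are placeholders, so check the KBs for the exact syntax on your build:

# list the current claim rules, then add a MASK_PATH rule covering the paths to the LUN
esxcli corestorage claimrule list
esxcli corestorage claimrule add --rule 120 -t location -A vmhba2 -C 0 -T 1 -L 20 -P MASK_PATH
# load and apply the updated rule set
esxcli corestorage claimrule load
esxcli corestorage claimrule run
# unclaim the device from NMP so the MASK_PATH rule takes effect
esxcli corestorage claiming reclaim -d naa.<device id>
# only after the LUN has been unpresented on the array, rescan
esxcfg-rescan vmhba2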
Workaround 2 (only available in ESX/ESXi 4 u1)
This workaround is available only in Update 1, and it changes what the VMkernel does when it detects the APD state for a storage device – it immediately fails the attempt to open a datastore volume if the device’s state is APD. Since it’s an advanced parameter change, I wouldn’t make it unless instructed to by VMware support.
esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD
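To confirm the change took effect (or to back it out later), esxcfg-advcfg can also read the value back:

# print the current value (0 = disabled, which is the default in U1)
esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD
# revert to the default if needed
esxcfg-advcfg -s 0 /VMFS3/FailVolumeOpenIfAPD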
Some Q&A:
Q: Does this happen if you’re using PowerPath/VE?
A: I’m not sure – but I don’t THINK that this bug would occur for devices owned by PowerPath/VE (since it replaces the bulk of the NMP stack in those cases) – but I need to validate that. This highlights, to me at least, how important these little things (in this case, path state detection) are in the entire storage stack.
In any case, thought people would find it useful to know about this, and it is a bug being tracked for resolution. Hope it helps one customer!
Thank you to a couple customers for letting me poke at their case, and to VMware Escalation Engineering!
Happened here today. The VMware storage tech on the phone picked up on it. However, rescanning did not solve it completely – I had to put the ESX hosts into maintenance mode and reboot them.
Posted by: Phillip Reynolds | December 01, 2009 at 02:06 PM
You wrote
# In the vSphere client, remove the Datastore
# In the vSphere client, remove the storage device
How do you remove the datastore and the storage device?
Posted by: Michael | December 01, 2009 at 04:55 PM
This is a huge bug for ESX 4.0
There are use cases for keeping the datastore and powered-off VMs intact while unmapping the LUN – presenting a replica, perhaps.
Any ETA on when this will be resolved, or is "esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD" the resolution anyway?
Posted by: Barrie | December 01, 2009 at 09:44 PM
I see HP are posting this as an official resolution on their EVA support site:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=3664583&prodTypeId=12169&objectID=c01908147
Posted by: Barrie | December 01, 2009 at 10:07 PM
I've run into this issue as well. In my case, the LUN and Datastore were shared between two different clusters. I had to remove the LUN from one cluster, but still leave it in place for the other cluster.
Long story short, my workaround in that case was to evacuate the host of all VMs, shut it down, remove the LUN from that one host, and bring it back up.
Posted by: Brett Impens | December 02, 2009 at 12:49 AM
It got us here today.. talk about ouch..
Posted by: Ben | December 04, 2009 at 01:07 PM
I am still trying to work out how to follow your workaround #1. What is the proper way to remove the datastore? VMware support sure has been slow on this issue.
Posted by: michael | December 04, 2009 at 07:46 PM
I have had this problem REPEATEDLY and VMware support has been utterly and totally worthless... After Thursday night (the last occurrence), I was almost to the point of:
"Every host is going to get its own CHAP user+pw, and every LUN removal/delete operation is going to be done with a full cluster-wide maint mode + reboot"...
Posted by: Nathan Neulinger | December 06, 2009 at 10:33 AM
I would second the question about what the proper way is to remove the datastore WITHOUT destroying it - i.e. to just remove it from one host out of several. We had a long ticket with VMware about this and their stance was basically "just shut off access to the LUN, it'll work fine", which repeatedly triggered the whole "host drops off the face of the planet" networking problem.
Last Thursday night's change was an example of this... it took an hour or so of production servers dropping off before I was finally able to clear out the problem from just THREE host systems. A few other hosts had lingering issues throughout the next day.
And a nasty side effect of all of this is that it caused repeated vCenter restarts/crashes due to all the hanging disk accesses/etc.
Posted by: Nathan Neulinger | December 06, 2009 at 10:40 AM
Chad,
We hit this same issue, had a case opened with VMware support, and worked through it with them. I would be curious if one would still hit this bug with PPVE in place.
Support has since closed out my case and said the fix is now included in U1, but it's more of a workaround than an actual fix.
Good stuff.
Scott
Posted by: Scott Sauer | December 07, 2009 at 10:04 AM
So this was the response I got when I asked about using FailVolumeOpenIfAPD:
"Thank you for your email.
The option FailVolumeOpenIfAPD is available in ESX 4.0 U1 but is disabled by default. I would still recommend following KB articles 1009449 and 1015084 to mask out the LUN before un-presenting it from the storage array."
Joy, what a dangerous bug!
Posted by: Michael | December 07, 2009 at 04:11 PM
Same – curious as to the meaning of "remove the datastore". Is this a reference to masking it on each host? I've had LUNs removed after deleting the VMFS partition and had no network issues in vSphere yet, so is it intermittent, or is the VMFS partition removal mostly all that's required?
Posted by: Dan C | December 11, 2009 at 09:29 AM
Hi Chad,
One of our customers had a problem last week whereby an NS20 with a CLARiiON CX3-10 backend experienced a failed drive in the vault set. The write cache was disabled and the rebuild priority was set to ASAP.
The net result was that it killed the performance of the VMware environment, giving symptoms like the APD bug. We checked the ESX logs and we did see the NMP/APD error messages listed above.
We changed the rebuild priority to High, but this had very little impact on the poor performance.
What was very strange was that once the rebuild had finished the system performance returned to normal.
We are working with EMC and VMware support on this, but what is currently not clear is whether the problem was caused by the Celerra/CLARiiON or by the vSphere APD bug.
I agree with the other comments here that this is a very serious issue.
Any thoughts would be appreciated.
Posted by: Mark Burgess | December 14, 2009 at 06:33 AM
Thanks Chad (always) for all the awesome info. In your opinion: If we're a shop running ESX4u1 and only seeing "H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0" (SG_ERR_DID_BUS_BUSY [0x02] BUS stayed busy through time out period) should we look at FailVolumeOpenIfAPD or contacting VMware support directly?
Posted by: Andrea | December 15, 2009 at 10:18 PM
Hi
I have seen this issue two or three times; there is no way round it other than to remove the LUNs from vSphere, rescan, and then remove the LUN directly from the storage.
There is also a 30-minute scan from VC that suspends the machine, hence you cannot ping it.
According to support, a fix is in QA and was too late to be in U1, but it will be released soon as a single bug fix...
Posted by: Alfredo | December 28, 2009 at 07:44 PM
Anyone hear if the bug has been fixed yet? Same issue here. Alfredo, do you have a support case I could follow up with?
Posted by: A | January 14, 2010 at 10:00 PM
The information from the following KB can be used to avoid this issue before removing a LUN.
http://kb.vmware.com/kb/1015084
Posted by: Doc Carrots | January 20, 2010 at 12:19 PM
I can validate (the hard way) that having PowerPath/VE installed won't avoid this problem. I'll chime in asking as well: what do you mean by removing the storage device?
Posted by: Dominic Rivera | March 30, 2010 at 06:51 PM
One of the symptoms I saw on both nights I worked on this issue is that the other storage LUNs are not browsable through VC.
So let's say cluster0x-lun-04 is being removed. Check through VC whether you can browse files after your particular LUN is removed. You browse through cluster0x-lun-05 and you see no files at all.
The issue is that once the disk paths fail for certain things, it starts failing them on all of the disks, whether they are being removed or not.
One suggestion VMware gave the first time I saw this issue was an unhandled exception in the SCSI firmware on HP servers; while fixing that might solve the bulk of the issues, it is possible the bug was fixed and then reintroduced recently. I know that in the environment I see and work on, the issue did disappear for a while.
With the HP bug or firmware issue, I noticed that the array had a pop-up message after the server was rebooted following the issue, back in August through October of 2009.
-Travis
Posted by: Travis | April 08, 2010 at 03:06 PM
This week, storage accidentally provisioned to the wrong cluster and then reassigned to the correct one caused this issue for us. A big thank you, as I remembered this post.
Posted by: Justin | April 15, 2010 at 01:44 PM
FYI... I attempted Workaround 2 on a recently patched ESXi 4 host connected to an IBM DS4700 array and it did not resolve the issue. VMware support says they are still working with certain storage vendors on the issue. Waiting impatiently for a fix. This definitely affects SRM test failovers when completing a test and waiting for cleanup to finish, as the SRA removes the snapped replicated test LUNs from the hosts.
Posted by: iridium130m | May 18, 2010 at 12:19 PM
We are experiencing a strange problem where the rescan triggered by formatting a new FC LUN causes our hosts (ESXi 4 build 219382) to disconnect from vCenter, vSphere Client KVM sessions to virtual machines to drop, and RDP sessions to virtual machines to drop. Is this symptom related to this storage bug?
Posted by: aenagy | May 28, 2010 at 03:59 PM
I would agree. My company has had 3 production outages from this issue, and we only recently learned from VMware support that this was the root cause. I have to say it angers me that VMware has known about this for 6 months without releasing a patch. Can anyone say Hyper-V? I also ask how to remove a storage device from the vSphere client. I can guarantee that only removing the VMFS formatting is not good enough, as the cluster will still hang when the LUN is removed from the SAN.
Posted by: Lee | June 10, 2010 at 07:30 PM
We went through some storage reconfiguration last week and the process I followed was:
* Relocate / delete any VMDKs from the volumes.
* Remove the VMFS filesystems from them and rescan the hosts.
* Mask the LUNs off using claim rules (add the mask, load the list, run the list, and then use the reclaim -d command to unclaim the volumes so they're fully masked).
* Then unpresent them from the servers at the SAN end, reconfigure them, re-present them...
* Reverse the above process.
Worked fine and no hiccups I'm pleased to say.
Posted by: Jim Taylor | June 14, 2010 at 08:22 AM
What's the best way to do this, if you have that LUN presented to multiple vSphere hosts? Do we delete the LUN from each host before unpresenting it on the array side?
Posted by: MK | August 20, 2010 at 10:41 AM
We are seeing a very strange issue on ESX 4.0. We have an ESX cluster of at least 20 nodes sharing the same storage, which is a VMAX, and we also have PowerPath/VE. We have noticed the following (and we are not trying to remove any LUN deliberately). We are using Brocade Virtual Connect on an HP blade center, with virtual HBAs.
If we push a zoning change, all the ESX boxes will lose 1, 2, or 3 paths out of 4 at random; this is not a consistent behavior. The only way to correct it is to reboot the ESX server.
If we create new devices on the storage array, all the ESX boxes will lose 1, 2, or 3 paths out of 4 at random; this is a consistent behavior. The only way to correct it, or to rescan all four paths, is to reboot the ESX server.
This sometimes causes a hard crash on a couple of the ESX boxes.
EMC and VMware are of no help – it's been 4 weeks. Yes, ours is not the exact problem described here, as we are not trying to remove datastores, but it causes VMs to fail or production performance to suffer. The closest article I have found to our issue is this one. Any help would be appreciated.
Posted by: Rahul Mehta | September 09, 2010 at 01:27 AM
Has this issue been fixed in ESX 4.0 update 2?
Posted by: Darryl Wilson | September 20, 2010 at 09:20 AM
@Darryl - YES.
Posted by: Chad Sakac | September 20, 2010 at 11:23 PM
Problem with ESX 3.5 and an IBM DS3400: vCenter just stops seeing the datastores, but the virtual machines keep working.
Any comments?
Posted by: Gerardo Rangel | December 11, 2010 at 10:46 PM
@Rahul,
We are having the same type of issue. Did you find the cause?
Posted by: Brian Shepherd | February 16, 2011 at 07:49 PM
I am using ESX 4.0 U2 (build 261974) with EMC CX4-120 storage.
I still see that error.
Has this issue been fixed in ESX 4.0 Update 3?
Posted by: chokopo | May 31, 2011 at 05:29 AM
Sadly this issue still persists on ESXi 5 update 1! Still cannot find a workaround.
Posted by: Nick | October 11, 2012 at 05:55 PM
Concur. Issue is still present in ESXi 5.0u1. Hurt us bad yesterday.
We have three clusters against the same SAN. The largest cluster completely tanked, and the two smaller clusters partially tanked. Over 500 VMs affected. Situation only started to recover after a hard SAN reboot on both heads. Most hosts recovered by themselves, but for three hosts multiple reboots were required before they recovered. The hosts would hang during boot while scanning lun paths.
Posted by: Elmars | December 19, 2012 at 09:29 AM