
December 01, 2009

Comments


Phillip Reynolds

Happened here today. The VMware storage tech on the phone picked up on it. However, rescanning did not resolve it completely. I had to put the ESX hosts into maintenance mode and reboot them.

Michael

You wrote:

1. In the vSphere client, remove the Datastore
2. In the vSphere client, remove the storage device

How do you remove the datastore and the storage device?

Barrie

This is a huge bug for ESX 4.0

There are use cases for keeping the datastore and powered-off VMs intact while unmapping the LUN (presenting a replica, perhaps).

Any ETA on when this will be resolved, or is "esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD" the resolution anyway?
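
For reference, a minimal sketch of checking and enabling that advanced option from the ESX service console (the -g/-s get/set flags are standard esxcfg-advcfg usage, but treat this as an illustration and verify against your own build):

  # Show the current value (0 = disabled, the default)
  esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD

  # Enable it so volume opens fail rather than hang during an APD condition
  esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD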

Barrie

I see HP are posting this as an official resolution on their EVA support site:

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=3664583&prodTypeId=12169&objectID=c01908147

Brett Impens

I've run into this issue as well. In my case, the LUN and Datastore were shared between two different clusters. I had to remove the LUN from one cluster, but still leave it in place for the other cluster.

Long story short, my workaround in that case was to evacuate the host of all VMs, shut it down, remove the LUN from that one host, and bring it back up.

Ben

It got us here today... talk about ouch.

michael

I am still trying to figure out how to follow your workaround #1. What is the proper way to remove the Datastore? VMware support sure has been slow on this issue.

Nathan Neulinger

I have had this problem REPEATEDLY and VMware support has been utterly and totally worthless... After Thursday night (the last occurrence), I was almost to the point of:

"Every host is going to get its own CHAP user+pw, and every LUN removal/delete operation is going to be done with a full cluster-wide maintenance mode + reboot"...

Nathan Neulinger

I would second the question about the proper way to remove the datastore WITHOUT destroying it - i.e. to just remove it from one host out of several. We had a long ticket with VMware about this and their stance was basically "just shut off access to the LUN, it'll work fine", which repeatedly triggered the whole "host drops off the face of the planet" networking problem.

Last Thursday night's change was an example of this... it took an hour or so of production servers dropping off before I was finally able to clear out the problem from just THREE host systems. A few other hosts had lingering issues throughout the next day.

And a nasty side effect of all of this is that it caused repeated vCenter restarts/crashes due to all the hanging disk accesses, etc.

Scott Sauer

Chad,
We hit this same issue, had a case opened with VMware support, and worked through it with them. I would be curious whether one would still hit this bug with PP/VE in place.

Support has since closed out my case and said the fix is now included in U1, but it's more of a workaround than an actual fix.

Good stuff.
Scott

Michael

So this was the response I got when I asked about using FailVolumeOpenIfAPD.

"Thank you for your email.

The option FailVolumeOpenIfAPD is available in ESX 4.0 U1 but is disabled by default. I would still recommend following KB articles 1009449 and 1015084 to mask out the LUN before un-presenting it from the storage array."

Joy, what a dangerous bug!

Dan C

Same here, curious as to the meaning of "remove the datastore". Is this a reference to masking it on each host? I've had LUNs removed after deleting the VMFS partition and had no network issues in vSphere yet, so is it intermittent, or is removing the VMFS partition mostly all that's required?

Mark Burgess

Hi Chad,

One of our customers had a problem last week whereby an NS20 with a CLARiiON CX3-10 backend experienced a failed drive in the vault set. The write cache was disabled and the rebuild priority was set to ASAP.

The net result was that it killed the performance of the VMware environment, giving symptoms like the APD bug. We checked the ESX logs and we did see the NMP/APD error messages listed above.

We changed the rebuild priority to High, but this had very little impact on the poor performance.

What was very strange was that once the rebuild had finished, system performance returned to normal.

We are working with EMC and VMware support on this, but what is currently not clear is whether the problem was caused by the Celerra/CLARiiON or by the vSphere APD bug.

I agree with the other comments here that this is a very serious issue.

Any thoughts would be appreciated.

Andrea

Thanks Chad (as always) for all the awesome info. In your opinion: if we're a shop running ESX 4 U1 and only seeing "H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0" (SG_ERR_DID_BUS_BUSY [0x02], bus stayed busy through the timeout period), should we look at FailVolumeOpenIfAPD or contact VMware support directly?

Alfredo

Hi

I have seen this issue two or three times; there is no way around it other than to remove the LUNs from vSphere, rescan, and then remove the LUN directly from the storage.

There is also a 30-minute scan from VC that suspends the machine, hence you cannot ping it.

According to support, a fix is in QA; it was too late to make it into U1 but will be released soon as a single bug fix...
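
As a point of reference, the rescan step can also be driven from the ESX service console rather than the vSphere client (a sketch only; vmhba1 is a placeholder adapter name, and you would repeat it per HBA on every host sharing the LUN):

  # Rescan a single HBA for added or removed devices
  esxcfg-rescan vmhba1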

A

Has anyone heard if the bug has been fixed yet? Same issue here. Alfredo, do you have a support case I could follow up with?

Doc Carrots

The information from the following KB can be used to avoid this issue before removing a LUN.

http://kb.vmware.com/kb/1015084

Dominic Rivera

I can validate (the hard way) that having PowerPath/VE installed won't avoid this problem. I'll also chime in and ask: what do you mean by removing the storage device?

Travis

One of the symptoms I saw on both nights I worked on this issue is that the other storage LUNs are not browsable through VC.

So let's say cluster0x-lun-04 is being removed. Check through VC whether you can browse files after your particular LUN is removed. You browse cluster0x-lun-05 and you see no files at all.

The issue is that once the disk paths fail for certain operations, it starts to fail them on all of the disks, whether they are being removed or not.

One explanation VMware gave the first time I saw this issue was an unhandled exception in the SCSI firmware on HP servers. While that might account for the bulk of the issues, it is possible the bug was fixed and then reintroduced recently. I know that in the environment I work on, the issue did disappear for a while.

With the HP bug or firmware issue, I noticed that the array showed a pop-up message after the server was rebooted following the issue, back in August through October of 2009.

-Travis

Justin

This week, storage accidentally provisioned to the wrong cluster and then reassigned to the correct one caused this issue for us. A big thank you, as I remembered this post.

iridium130m

FYI... I attempted Workaround 2 on a recently patched ESXi 4 host connected to an IBM DS4700 array and it did not resolve the issue. VMware support says they are still working with certain storage vendors on the problem. Waiting impatiently for a fix. This definitely affects SRM test failovers when completing a test and waiting for cleanup to finish, as the SRA removes the snapped, replicated test LUNs from the hosts.

aenagy

We are experiencing a strange problem where the rescan triggered by formatting a new FC LUN causes our hosts (ESXi 4 build 219382) to disconnect from vCenter, vSphere Client KVM sessions to virtual machines to drop, and RDP sessions to virtual machines to drop. Is this symptom related to this storage bug?

Lee

I would agree. My company has had 3 production outages from this issue, and we only recently learned from VMware support that this was the root cause. I have to say it angers me that VMware has known about this for 6 months without releasing a patch. Can anyone say Hyper-V? I also ask how to remove a storage device from the vSphere client. I can guarantee that only removing the VMFS formatting is not good enough, as the cluster will still hang when the LUN is removed from the SAN.

Jim Taylor

We went through some storage reconfiguration last week and the process I followed was:

* Relocate / delete any VMDKs from the volumes.
* Remove the VMFS filesystems from them and rescan the hosts.
* Mask the LUNs off using claim rules (add the mask, load the list, run the list, and then use the reclaim -d command to unclaim the volumes so they're fully masked - see the sketch below).
* Then unpresent from the servers at the SAN end, reconfigure them, re-present them...
* Reverse the above process.

Worked fine and no hiccups I'm pleased to say.
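
For anyone following Jim's masking step, a minimal sketch of the ESX 4.0 claim-rule commands is below (the rule number 500, the vmhba1 C0:T0:L20 path, and the naa ID are placeholders for your own values; KB 1009449 and KB 1015084 describe the supported procedure):

  # Add a MASK_PATH claim rule for the LUN's path (repeat for every path to the LUN)
  esxcli corestorage claimrule add --rule 500 -t location -A vmhba1 -C 0 -T 0 -L 20 -P MASK_PATH

  # Load the updated rule set into the VMkernel, then apply it
  esxcli corestorage claimrule load
  esxcli corestorage claimrule run

  # Unclaim the device so the MASK_PATH plug-in takes over its paths
  esxcli corestorage claiming reclaim -d naa.xxxxxxxxxxxxxxxx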

MK

What's the best way to do this, if you have that LUN presented to multiple vSphere hosts? Do we delete the LUN from each host before unpresenting it on the array side?

Rahul Mehta

We are seeing a very strange issue on ESX 4.0. We have an ESX cluster of at least 20 nodes sharing the same storage, which is a VMAX, and we also have PowerPath/VE. We have noticed the following (we are not trying to remove any LUN deliberately). We are using Brocade Virtual Connect on an HP blade center, with virtual HBAs.

If we push a zoning change, all the ESX boxes will lose 1, 2, or 3 paths out of 4 at random; this is not consistent behavior. The only way to correct it is to reboot the ESX server.

If we create new devices on the storage array, all the ESX boxes will lose 1, 2, or 3 paths out of 4 at random, and this is consistent behavior. The only way to correct it or rescan all four paths is to reboot the ESX server.

This sometimes causes a hard crash on a couple of the ESX boxes.

EMC and VMware are of no help; it's been 4 weeks. Yes, ours is not the exact problem, as we are not trying to remove datastores... but it causes VMs to fail or production performance to suffer. This is the closest article I have found to our issue. Any help would be appreciated.

Darryl Wilson

Has this issue been fixed in ESX 4.0 update 2?

Chad Sakac

@Darryl - YES.

Gerardo Rangel

Problem with ESX 3.5 and an IBM DS3400: vCenter just stops seeing the datastores, but the virtual machines keep working.

Any comments?

Brian Shepherd

@Rahul,

We are having the same type of issue. Did you find the cause?

chokopo

I am running ESX 4.0.0 U2 (build 261974) with an EMC CX4-120 array and I still get this error.
Has this issue been fixed in ESX 4.0 Update 3?

Nick

Sadly this issue still persists on ESXi 5 update 1! Still cannot find a workaround.

Elmars

Concur. The issue is still present in ESXi 5.0 U1. It hurt us badly yesterday.

We have three clusters against the same SAN. The largest cluster completely tanked, and the two smaller clusters partially tanked. Over 500 VMs were affected. The situation only started to recover after a hard SAN reboot on both heads. Most hosts recovered by themselves, but three hosts required multiple reboots before they recovered. The hosts would hang during boot while scanning LUN paths.

