[updated Oct 1st, 10:32am] – Question/answer and also script to automate, noted Jason Boche’s article
[updated Oct 1st, 5:37pm] – long answer to a posed question, digging even deeper into the “why”
[NOTE – added Oct 2nd, 7:28am] – While there’s a reasonable amount of urgency in this thread, it’s worth pointing out the core resilience in svmotion here. While the recommendations below are material, svmotions never caused a guest outage in any of the tests. When they failed, they failed atomically, at the point where the file deletion stage took too long – ergo no data unavailability, and no risk of data corruption due to this issue.
So… Scott Lowe (EMC vSpecialist CTO, and all-round good dude) made a weird, out-of-the-blue blog post two weeks ago called “Hidden VAAI command” – check it out here.
So… what would make him make a weird post like that?
Well – for the last couple of months VMware and EMC (note this applies to the larger storage vendor community) have been grappling with a tough issue, and we’ve come to an important conclusion (thinking these things through and measuring the positive/negative impacts is not trivial).
On Friday, the call was made and communicated broadly (and Jason B picked it up almost immediately here).
There is a pretty big issue with the current implementation of VAAI TP reclaim. This shouldn’t be viewed as a “VMware issue” or a “Storage vendor issue”, but a “both” issue. This affects the broad vendor community, so while I’m an EMCer, and will reference EMC storage arrays, I strongly suggest talking to whatever vendor you use.
There’s a long version (which includes the why/how) and the short version…
- Short version – no storage vendors are getting certified with the TP reclaim offload, and you should disable this VAAI command.
- Long version – the core issue is that prior to the VAAI TP reclaim, deleting a file in VMFS was near instantaneous (only the VMFS filesystem inodes were updated). With VAAI TP reclaim, even if the individual reclaims are superfast (milliseconds), they add up (many thousands can occur on a large file). Deleting a large VMDK (or any VMFS file for that matter – though the VMDKs will generally be the big ones on a datastore) can take seconds. In some cases, this means svmotions can fail (one of the last steps in the svmotion is to delete the file, and the delete takes longer than that step’s timeout value).
Read on for details, history, KB articles, what is going on, and workarounds…
The VAAI API spec and test harness were missing any “maximum time for an unmap” (and how do you write that, since it’s really “maximum time to delete the maximum file size”), and so everyone was marching forward. BTW – this is changing the way VMware will work with the community on stuff like this – we’ve discussed how to make the specs focus more on the use cases, not just the specific interaction.
So… We started to see svmotions periodically fail as we were in the betas of vSphere 5, and in the betas of the EMC VNX code codenamed Franklin, which added the UNMAP support. We actually accelerated UNMAP in the GA code of Franklin (GA name is VNX OE Block R31) because of some of these things. On the EMC VMAX, they decided to ensure that this never occurred by simply not processing the UNMAP commands in the current code (even though there is UNMAP support).
In the meantime, VMware looked at ways of adding a timing check for this to the certification test (for all storage vendors), but you can’t say “x% faster than 0” – so it’s non-trivial.
The long-term solutions we evaluated:
- Storage vendors could implement the TP reclaim extremely asynchronously (ergo even faster than we already do) – but this is harder than it looks at first glance (there are dangerous conditions where a block is being reclaimed and overwritten at the same time, and it is a material code change). This will need firm, robust testing and hardening under load (we remember lessons learnt from VAAI XCOPY at scale).
- VMware could change svmotion to not wait for the file delete to complete before completing the task.
So – what happens now for EMC customers? (Readers using other vendors – please contact your vendor.)
- Remember, if you are using vSphere 5, you don’t “enable” TP UNMAP, it just happens – so you could run into this problem.
- Remember, this applies to VMFS use cases, not NFS datastore use cases (NFS doesn’t have an ESX-triggered reclaim).
- The VMAX code handles this by simply not processing the UNMAP requests, which is good in a sense (the svmotions aren’t failing). This is because, right now, the VMAX would process UNMAP synchronously, so there would be more latency and svmotions would fail more often. The VNX R31 code does process UNMAP, and processes it asynchronously, but this means that default settings and behavior could cause svmotions to fail if files were large (we can’t think of other scenarios where a slow file delete would cause a timeout, but there may be others). On the EMC side, we’re working to make UNMAP extremely asynchronous and fast across all the use cases.
- VMware will be issuing a patch that disables UNMAP in the near term
- Longer term, VMware will decouple the dependency in svmotion on file delete (but make sure it gets deleted successfully)
- In the meantime, you can use the EMC approaches for space reclaim on the device-wide level on demand (this is also the way you can reclaim on NFS datastores).
VMware on Friday published a KB on this topic (thanks Cormac!) here: http://kb.vmware.com/kb/2007427
So… now back to Scott’s post. Now you know why we started saying “hey, here’s how you can disable TP reclaim…”
- To enable:
  - esxcfg-advcfg -s 1 /VMFS3/EnableBlockDelete
  - esxcfg-advcfg -g /VMFS3/EnableBlockDelete # To check the status
- To disable:
  - esxcfg-advcfg -s 0 /VMFS3/EnableBlockDelete
  - esxcfg-advcfg -g /VMFS3/EnableBlockDelete # To check the status
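If you have more than a handful of hosts, you’ll want to generate those commands rather than type them. Here is a minimal sketch that only builds and prints the command lines for review. The host names are hypothetical, and wrapping the commands in `ssh root@host` assumes you have SSH access enabled on your ESXi hosts, which may not be true in your environment.

```python
# The advanced setting named in the commands above.
ADV_OPTION = "/VMFS3/EnableBlockDelete"


def block_delete_commands(hosts, enable):
    """Build (but do not run) the set-and-check commands for each host.

    hosts  -- iterable of host names (hypothetical here)
    enable -- True to turn TP reclaim on, False to turn it off
    """
    value = 1 if enable else 0
    commands = []
    for host in hosts:
        # Set the value, then read it back to confirm.
        commands.append(f"ssh root@{host} esxcfg-advcfg -s {value} {ADV_OPTION}")
        commands.append(f"ssh root@{host} esxcfg-advcfg -g {ADV_OPTION}")
    return commands


if __name__ == "__main__":
    # Hypothetical host names; replace with your own.
    for line in block_delete_commands(["esx01.example.com", "esx02.example.com"], enable=False):
        print(line)
```

Printing instead of executing is deliberate: you get to eyeball the list before anything touches a host.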
[Update]: I’ve gotten questions about whether this affects VMFS3/5, or just VMFS5, or just VMFS3. The answer is that the UNMAP command is used on both VMFS3 and VMFS5 filesystems (used more with VMFS5, but that’s not material). While the command to rectify uses “VMFS3” in the syntax, it applies across VMFS3 and VMFS5.
[Update]: The always awesome William Lam (if you’re not following him, you should @lamw) has written a post that automates disabling this parameter across a set of hosts here: http://www.virtuallyghetto.com/2011/09/how-to-automate-disabling-of-vaai-unmap.html
[Update]: A very, very smart colleague of mine was scratching his head on this a bit, and I bet his question is one on the mind of many. The question was: “If the task is offloaded even when async - why doesn't the storage acknowledge it and let the hypervisor continue?”.
This answer is going to be a long-winded verbose “Chad” answer, but it’s a surprisingly deep topic.
Imagine for a moment the old VMFS code (this is just meta code for the sake of explanation, and yes, I didn’t put in any error checks, but you get the point…)
- Delete File
  - look up inode
  - remove inode pointer
  - Done – report the file as deleted
Note that this is all a filesystem operation. No direct or indirect blocks are touched. If the filesystem doesn’t use an inode table, it has something equivalent which has pointers to blocks.
To understand this better, read this Wikipedia entry: http://en.wikipedia.org/wiki/Inode_pointer_structure
Now imagine the new vSphere VMFS code:
- Delete File
  - look up inode
  - get direct/indirect block list (logical block address – LBA – which starts at A and goes to Z for this example)
  - SCSI UNMAP list
    - SCSI UNMAP block A
    - Check for ack
    - repeat, going to the next block
  - remove inode pointer
  - Done – report the file as deleted
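The two delete paths can be sketched side by side in Python. This is a toy model to show where the extra work comes from, not the real VMFS code: the `FakeTarget` class, the dictionary standing in for the inode table, and the function names are all made up for illustration.

```python
class FakeTarget:
    """Stand-in for a storage target; counts the UNMAP commands it receives."""

    def __init__(self):
        self.unmap_count = 0

    def unmap(self, lba):
        self.unmap_count += 1
        return True  # the ack


def delete_old(inodes, name):
    """Old path: metadata only, no per-block work."""
    inodes.pop(name)  # remove inode pointer
    return "deleted"


def delete_with_reclaim(inodes, name, target):
    """New path: one UNMAP round trip per block before the delete completes."""
    for lba in inodes[name]:  # direct/indirect block list
        if not target.unmap(lba):  # wait for the ack before moving on
            raise IOError(f"no ack for UNMAP of block {lba}")
    inodes.pop(name)  # remove inode pointer
    return "deleted"


if __name__ == "__main__":
    target = FakeTarget()
    inodes = {"old.vmdk": list(range(1000)), "new.vmdk": list(range(1000))}
    print(delete_old(inodes, "old.vmdk"), "with 0 UNMAPs")
    print(delete_with_reclaim(inodes, "new.vmdk", target), "with", target.unmap_count, "UNMAPs")
```

The old path does the same amount of work regardless of file size; the new path scales with the number of blocks, and the caller waits for every ack.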
To put this in perspective:
- This is why on shows like CSI, they can recover data that has been deleted. When you delete a file in a filesystem, the blocks aren’t actually touched, only marked as free to be used for other things. You can trawl through the blocks, one at a time, and if you know the filesystem structure, you can reconstruct the file. BTW – even if you do a low-level format, there can be a magnetic “shadow” on magnetic media – which is why “writing zeros multiple times” (or total destruction of the device, usually via shredding) is the only way to be sure stuff is deleted.
- This is why a “quick format” is fast (which clears the inode table completely, and creates a clean set of filesystem constructs), but a normal format (which issues a SCSI command to each and every block on the device) takes time.
So – now imagine a 100GB VMDK. That original code would be near instantaneous on the delete. Now, let’s say the storage target is REALLY good at replying fast – let’s say 10ns to ack. Let’s say the block allocation size is 512 bytes – that’s 195,312,500 blocks. You’re going to go through that SCSI UNMAP loop 195,312,500 times, at 10ns each. So – that’s 1,953,125 microseconds, or 1,953 milliseconds, or basically 2 seconds. But, a 10ns ack is REALLY short. You could make the filesystem itself have an asynchronous process to do the UNMAP, but that’s clearly dangerous – as what happens if the process doesn’t properly clear up these unmapped blocks?
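Here is that arithmetic as a small Python sketch, so you can plug in your own numbers. The 100GB file, 512-byte block and 10ns ack come from the example above; the 1 microsecond and 100 microsecond latencies are extra values I’ve assumed purely for comparison, not measurements from any array.

```python
def unmap_delete_time_ns(file_bytes, block_bytes, ack_ns):
    """Return (block_count, total_ns) for a delete that waits on one ack per block."""
    blocks = file_bytes // block_bytes
    return blocks, blocks * ack_ns


FILE_BYTES = 100 * 10**9  # 100GB (decimal), as in the example above
BLOCK_BYTES = 512

if __name__ == "__main__":
    # The first latency is from the text; the other two are illustrative assumptions.
    for label, ack_ns in [("10 ns", 10), ("1 us", 1_000), ("100 us", 100_000)]:
        blocks, total_ns = unmap_delete_time_ns(FILE_BYTES, BLOCK_BYTES, ack_ns)
        print(f"ack {label}: {blocks:,} blocks, {total_ns / 1e9:,.1f} s")
```

With these inputs, the delete takes about 2 seconds at 10ns per ack, a bit over 3 minutes at 1 microsecond, and roughly 5.4 hours at 100 microseconds. The point is how quickly a serialized per-block wait adds up.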
So – why would it take long for the storage subsystem to ack?
The answer to this is the core of storage subsystems – data fidelity and persistence.
The storage subsystem has to maintain a very, VERY correct list of “what blocks are part of the free pool”. After all, if it handed over a block as “free” that wasn’t – oops, we have a very bad day (data just got corrupted, as the block was in use). That list of “what blocks are part of the free pool” itself needs to be persistent (ergo, what happens if midway through releasing a block, there is a partial or catastrophic failure – or heck, a software bug?).
So – even when it’s async (in other words, the block, or more accurately the meta object, which is usually something like a page or other “group of blocks” construct, doesn’t actually need to be returned to the pool before the ack), the request needs to get far enough into the storage stack that the “correctness” of the block being available is persistent.
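To make “far enough into the stack to be persistent” concrete, here is a toy Python model of one way an async reclaim could be structured: journal the intent, ack, and return the block to the free pool later. This is my own illustration under simplifying assumptions, not how VNX, VMAX or any other array actually does it. The `ThinPool` class and its in-memory `journal` list (standing in for a persistent log) are made up for the example.

```python
class ThinPool:
    """Toy thin pool: an UNMAP is acked only after its intent is journaled."""

    def __init__(self, mapped):
        self.mapped = set(mapped)  # blocks currently holding data
        self.free = set()          # blocks safe to hand out again
        self.journal = []          # stands in for a persistent intent log

    def unmap(self, lba):
        # Record the intent first; in a real system this write must be
        # persistent before the ack goes back, and that is where the time goes.
        self.journal.append(lba)
        return True  # ack

    def write(self, lba):
        # A write that races with a pending reclaim cancels the reclaim,
        # so a block in use is never handed out as free.
        if lba in self.journal:
            self.journal.remove(lba)
        self.free.discard(lba)
        self.mapped.add(lba)

    def background_reclaim(self):
        # Runs later, off the host's critical path.
        while self.journal:
            lba = self.journal.pop(0)
            self.mapped.discard(lba)
            self.free.add(lba)


if __name__ == "__main__":
    pool = ThinPool({1, 2, 3})
    pool.unmap(1)
    pool.unmap(2)
    pool.write(2)  # overwrite arrives before the reclaim has run
    pool.background_reclaim()
    print("free:", sorted(pool.free), "mapped:", sorted(pool.mapped))
```

Even in this toy version, the ack can’t go back until the journal entry exists, and the overwrite-during-reclaim case needs explicit handling. Those two things are why “just ack it and carry on” is harder than it sounds.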
So – this is why while the vendor community will work on being ever faster, ever more async – the best way is to recognize that a file being deleted taking a fraction of a second or seconds is usually perfectly OK, and not have a serialized code dependency on checking for the file delete. Again, not pointing the finger at anyone – I call this a “both” problem.
First BTW – I’m sure I will get a few vendors piling on here saying “but the way we do it is different/better!” – for that I welcome them to comment – IN DETAIL. There are differences about how to respond, where in your stack this persistence can be settled, etc – but as you can see, it’s not trivial.
Second BTW – this is why whenever someone tells me they are going to start/join a storage startup, I have to sit them down, and have a good talk (having done one). It’s not that storage startups are a bad idea – innovation is awesome, and startups are fun… It’s just that the storage market is a VERY hard market to penetrate at any reasonable scale (beyond your first 10-100 wins) – for two reasons:
a) persistence of data (which is the definition of storage) means it’s sticky – hard to move, hard to displace at a customer – you better have something more than a marginally better mousetrap – it better rock the world;
b) there’s a certain amount of “non-compressible time” to “harden” block, NAS, and object storage stacks – during which time, you’re prone to having “very bad days” - ergo data that isn’t written when it’s written, or is written wrong.
At EMC we still occasionally have bugs that fall into this worst possible category – ones that cause data fidelity and persistence issues – and heck, we’re on Rev 31 on VNX, and I’ve lost track on Symm. OneFS in Isilon land is on its sixth major generation after almost a decade on the market. When those bugs happen, it’s all hands on deck (and customer ETAs get issued). That’s why when we do major things (like repurpose the mature VNX block and file code into a merged kernel like on VNXe), we do it slow and steady (and hence why we’re not rushing to do it in VNX land, where it’s not a primary customer driver).
Some people ask me why the time to build a storage stack is “non-compressible”. My answer is that the only way to really test is to have lots of customers, with all sorts of workloads, at all sorts of scales. On that second point – I’ve heard timeframes of 3-5 years for a mature block stack and 5-7 years for a mature NAS stack. Of course, you can take “off the shelf” open-source stuff and start from there, but that also means your differentiation will really be rooted in the UI more than anything – and pursuing ideas with low barriers to entry is a bad idea for a startup. Again – that doesn’t mean those are bad; storage is just a difficult startup idea relative to, let’s say, creating a whole new SaaS thing (where you can be up and running with something new in months).
Hope this helps our customers!