It may seem strange to some of you that I'm posting this RIGHT after a post on the VMware View reference architecture :-)
It's not, however. There is no IT panacea. There is only technology to be applied to problems. And, generally, I think it's about know-how, not technology itself. Don't trust people who say otherwise :-)
OK - what are we talking about? I got a question today from a very advanced VMware shop that is looking at this and combing the blogosphere for info. First, some background reading...
First - a good post on this here from Stu at vinternals.
Second - a good rejoinder/ongoing commentary here from Rodney.
Take a read - then come on back and read on...
The core thing they are pointing out is that View Composer doesn't solve all the use cases - and they are right, it doesn't. Their concerns boil down to three points:
1) with a given rate of change against a boot VMDK, the snapshot off the base image will eventually be the same size (worst case) as the base
2) fear that VMFS locking will change the core scaling dynamics
3) fear of I/O response time and workload behavior
So, quick comments:
On point 1. In cases where there is a high degree of change against the boot image, yes, the linked snapshot will grow, and so will space consumption. I've said it before, I'll say it again - every design decision is about trade-offs. All snapshot techniques use little storage to start, and grow as they diverge. The ESX snapshot design makes the following tradeoff: snapshots are free to take (very low I/O and performance cost), free to revert, and expensive (from an I/O and performance standpoint) to delete (and reclaim the storage space).
BTW - when I was out at VMware this week, there was a spirited discussion about how we can integrate our capabilities further. For example, the image management tools of VMware View Composer and View Manager are better than our PowerVDI array-powered snaps, but on the other hand, the performance, scaling, and speed of the array-based snaps are about an order of magnitude better - wouldn't it be cool if we could merge them? Originally VMware View Manager had storage API integration points, but they got cut as GA approached - but there's nothing to say we can't continue to work here, and we are.
Of course, with VMware View there is also the storage overcommit factor and automated desktop refresh (customer use case example here), as Rod points out.
In practice, I think along Stu's lines - statelessness is a goal. The most efficient storage savings will come in cases where it is possible to deconstruct the client into as fixed a boot image as possible plus a user data disk/CIFS share, and to use ThinApp as much as possible. Will there be some savings in hybrids and persistent VMs? Yes, but not as much. It's difficult, even artificially, to create a case where a snapshot gets as big as the source. Rod is right - planning for the worst case isn't needed anymore. Heck, if we always did worst-case planning, there would never be a use case for thin-provisioning technologies in general. Furthermore, production storage de-duplication technologies have a place here too.
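To put rough numbers on the point 1 trade-off, here is a back-of-the-envelope sketch in Python. It is not how View Composer accounts for space; the function names, the 20 GB base image, the 64 desktops, and the divergence fractions are all assumptions picked for illustration. It simply compares N full clones against one shared base image plus N linked-clone deltas, where each delta has grown to some fraction of the base.

```python
def full_clone_gb(base_gb, desktops):
    """Space used when every desktop gets its own full copy of the base image."""
    return base_gb * desktops


def linked_clone_gb(base_gb, desktops, divergence):
    """Space used by one shared base image plus one delta per desktop.

    divergence is the fraction of the base each delta has grown to:
    0.0 is a freshly refreshed desktop, 1.0 is the worst case from point 1.
    """
    if not 0.0 <= divergence <= 1.0:
        raise ValueError("divergence must be between 0.0 and 1.0")
    return base_gb + desktops * base_gb * divergence


if __name__ == "__main__":
    base_gb, desktops = 20, 64  # hypothetical values, not from any test
    full = full_clone_gb(base_gb, desktops)
    for divergence in (0.0, 0.1, 0.5, 1.0):
        linked = linked_clone_gb(base_gb, desktops, divergence)
        print(f"divergence {divergence:.0%}: linked {linked:.0f} GB vs full {full} GB")
```

At full divergence the linked clones actually cost slightly more than full clones (the deltas plus the shared base), which is the worst case being argued about. Anything that pulls divergence back toward zero, such as a regular desktop refresh, moves the total back toward the size of a single base image.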
On point 2. Our testing shows that Rodos is right. The VMFS SCSI-locking-based best practices for VMs per datastore have historically been VERY conservative (I talked about this on the recent VMTN community podcast here) - many customers are happily using 30 VMs per VMFS datastore or more. Yes, NFS datastores don't have this locking mechanism, but of course, like everything else we're discussing, there are pros and cons there too. But that's all a bit of a red herring. In the reference architecture testing, we ran 64 VMs per datastore, which worked without a hitch. This resulted in 7 datastores per cluster for ~500 users per cluster. So, is VMs per VMFS datastore a scaling metric? Yes, but so is the number of VMs per core, the number of VMs per server, the number of servers per cluster, and so on and so forth. Trust me, VMFS locking is way down the list of design metrics. I'd add a great comment from mid-testing from Warren at VMware, who was doing a lot of the testing:
"The best part is end to end things are SMOKING fast!!!!!"
The storage array wasn't even huffing and puffing at the 1000 client mark either - and it's literally one of our smallest (except the Iomega devices - BTW - anyone notice that those itty-bitty Iomega StorCenter desktop boxes have started to appear on the VMware HCL?)
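The datastore arithmetic from point 2 can be sketched in a few lines of Python. The 64 VMs per datastore, 7 datastores per cluster, and 30 VMs per datastore figures come from the discussion above; the function names are made up for illustration, and the sketch treats datastore density as the only limit, which it isn't in a real design.

```python
import math


def cluster_capacity(datastores, vms_per_datastore):
    """Desktops a cluster can hold if datastore density is the only limit."""
    return datastores * vms_per_datastore


def datastores_needed(desktops, vms_per_datastore):
    """Datastores required for a desktop count, rounding up partial ones."""
    return math.ceil(desktops / vms_per_datastore)


if __name__ == "__main__":
    desktops = cluster_capacity(7, 64)  # figures from the point 2 discussion
    print(f"7 datastores x 64 VMs = {desktops} desktops")
    for density in (30, 64):  # 30 is the "happily using" figure, 64 the tested one
        count = datastores_needed(desktops, density)
        print(f"{density} VMs per datastore -> {count} datastores")
```

Seven datastores at 64 VMs each works out to 448 desktops, in the neighborhood of the ~500 per cluster quoted above. The same 448 desktops at 30 per datastore would need 15 datastores, which is why the conservative guidance matters for the layout even if locking itself is low on the list. The other limits (VMs per core, VMs per server, servers per cluster) are deliberately left out here.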
On point 3. There's more to test here. With the workloads we were using, there was no massive I/O centralization, and in the same way that we saw ESX memory pshare stats much higher than what we generally see for server workloads, the cache effect is high. Conversely, when we were doing the 10K client workloads and earlier reference architectures on the Celerra with VDM 2.1 and the EMC PowerVDI tool, we didn't stagger logins at all, but literally booted the clients as fast as we could. There, there was absolutely a "first action" I/O storm. Again, there's more work to do there (which we are committed to doing), but we know there is a possible answer: small amounts of EFD (enterprise flash).
This is an exciting area, because the technology is coming together and at the same time there's more to come.
I will say this - to me, the only IT panacea isn't any technology per se, but rather what happens as we share our learning and best practices and, as your trusted vendors, try to accelerate your success with end-to-end reference architecture work. It's about the know-how more than anything else.
What do you think?