The inimitable MikeD (http://www.mikedipetrillo.com/mikedvirtualization/) pinged me and some of my colleagues this week with a question he was starting to hear, and it sounded like FUD to him. The question is:
"Does the Distributed Power Management power-on/off cycle cause the boot LUNs of the ESX servers to fail prematurely?"
BTW - if you're not a DPM expert, this video highlights its incredible promise (it's currently an experimental feature in VI 3.5).
The answer in a word? NO. DPM does not shorten an ESX server's lifespan through premature drive failure.
Ok - two comments. The first is that FUD is always funny (sometimes even our own - and yes, I know sometimes EMC is the FUDer not just the FUDee - though I try as hard as I can to not do that). The second is that there is more than just the one word answer of course, otherwise it wouldn't be another long-winded Virtualgeek post, would it? :-)
If you're interested.... Read on!
First of all - take this with a grain of salt - our experience is with spin-down of enterprise HDDs in enterprise arrays - but I strongly suspect the same would hold true of the SAS drives most ESX servers use, even with their basic RAID controllers.
So, the EMC Disk Library (EDL) platforms, including the DL3D (which does dedupe), use highly available CLARiiON backend storage, and that backup-to-disk (B2D) case is the most widely deployed CURRENT use of spin-down that I'm aware of - a nice fit of function and form. B2D works well with large SATA drives, and the B2D task tends to be very periodic: spin up for the backup, spin down for the rest of the day, except for the occasional restore, where the drives automatically spin up.
BTW - we have already started on bringing this feature into the mainstream FLARE featureset.
When we started to build spin-down into the EDLs, we asked the same question that is now being asked of VMware - because there is logic to it.
So, we started a two-part project: i) test HDDs against the manufacturer's MTBF claim during spin down/up cycles; ii) figure out what we could do to make them even MORE robust (in case we found stuff that was bad).
- The EMC team that does all our hardware testing conducted extensive start/stop testing specifically designed to answer the spin up/down reliability question. Essentially, the approach was to "trust but verify" the failure metrics the disk vendors provide. In the end, we found the average MTBF was unaffected (there's a quick back-of-the-envelope MTBF sketch after this list). I still think this is an example of the enterprise-level process focus that is one of the reasons many customers choose EMC.
- But why stop there when there's fun engineering to be had? We added all of the following increased-availability features just 'cause (and BTW, in the B2D case, they help):
- Disks that are spun down and not accessed for 24 hours (because no application needed the data) will be spun up and subjected to the normal LUN Sniff Verify operation. What am I talking about? EMC platforms are always looking for "soft errors" - non-fatal read errors where the returned data is wrong (it's checked against an in-block CRC). In a standard environment where disks are always spinning, Sniff Verify runs consistently in the background. With spin-down, naturally, it can't run in the same manner; instead it runs when the disks are spun up, in a "burst" of 30 minutes. Think of it as real-time vs. batch activity - over time, algorithmically, it is the same scan from beginning to end (see the sketch after this list).
- In this use case, RAID 6 is used by default for dual parity, which works well with large RAID sets and this type of workload. The EMC answer on RAID type is in general "it depends" - because it depends on the workload - but for B2D, RAID 6 is a very good choice (there's a dual-parity sketch after this list, too).
- We changed the FLARE logging and added specific new error codes for problems that occur while spinning a disk back up.
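To make the "trust but verify" point concrete: MTBF is just accumulated device-hours divided by failures observed, so a start/stop test boils down to comparing that ratio against the datasheet. Here's a minimal Python sketch - and to be clear, every number in it is made up for illustration, NOT our actual test data:

```python
# All numbers are hypothetical, for illustration only - not EMC test results.
drives_under_test = 500        # population cycled through spin down/up
hours_per_drive = 4000         # accumulated test hours, thousands of cycles each
failures_observed = 2

# MTBF estimate = total device-hours / failures observed
observed_mtbf = (drives_under_test * hours_per_drive) / failures_observed
vendor_claim = 1_000_000       # a typical enterprise HDD datasheet MTBF, in hours

print(f"Observed: {observed_mtbf:,.0f} h vs. claimed: {vendor_claim:,} h")
# If the observed figure holds at or above the claim while the drives are
# being cycled, spin down/up isn't eating into drive life.
```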
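And here's the Sniff Verify idea from the first bullet in code form. This is a toy Python sketch - the block model and the budget numbers are my invention, not FLARE internals - but it shows the key point: the always-spinning "trickle" and the spin-up "burst" are the same scan with different per-pass budgets, so coverage is identical over time:

```python
import zlib
from dataclasses import dataclass

@dataclass
class Block:
    data: bytes
    stored_crc: int  # the in-block CRC written alongside the data

def make_disk(num_blocks, corrupt_at=()):
    """Build a toy 'disk'; blocks in corrupt_at get a CRC mismatch (soft errors)."""
    blocks = []
    for i in range(num_blocks):
        data = bytes([i % 256]) * 512
        crc = zlib.crc32(data)
        blocks.append(Block(data, crc ^ 1 if i in corrupt_at else crc))
    return blocks

def sniff_verify(blocks, start, budget):
    """Scan `budget` blocks from `start`; return (soft_errors, next_start).

    A spinning disk gets a small budget on every background pass (the
    trickle); a spun-down disk gets one big budget when it spins up (the
    30-minute burst). Both resume where they left off.
    """
    soft_errors = 0
    for i in range(start, start + budget):
        b = blocks[i % len(blocks)]
        if zlib.crc32(b.data) != b.stored_crc:
            soft_errors += 1  # data came back wrong with no hard error
    return soft_errors, (start + budget) % len(blocks)

disk = make_disk(10_000, corrupt_at={42, 4242})
errs, _ = sniff_verify(disk, start=0, budget=10_000)  # one full "burst" pass
print(f"soft errors found: {errs}")                   # -> 2
```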
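On the RAID 6 bullet - the reason dual parity fits big SATA RAID sets is that you can lose any TWO drives and still recover, which matters when rebuilds of large drives take a long time. Here's a self-contained Python sketch of the standard P+Q scheme over GF(2^8), one byte per "disk" to keep it readable (this is the textbook math, not the CLARiiON implementation):

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the polynomial commonly used for RAID 6 (0x11D)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1D
    return p

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)  # a^254 = a^-1, since the multiplicative group order is 255

data = [0x12, 0x34, 0x56, 0x78]  # four data "disks", one byte each

# P is plain XOR parity; Q weights each disk's contribution by g^i (g = 2)
P = Q = 0
for i, d in enumerate(data):
    P ^= d
    Q ^= gf_mul(gf_pow(2, i), d)

# Simulate losing TWO data disks at once
x, y = 1, 3
Pr = Qr = 0
for i, d in enumerate(data):
    if i not in (x, y):
        Pr ^= d
        Qr ^= gf_mul(gf_pow(2, i), d)

Pxor, Qxor = P ^ Pr, Q ^ Qr          # combined contributions of the lost disks
gx, gy = gf_pow(2, x), gf_pow(2, y)
Dx = gf_mul(Qxor ^ gf_mul(gy, Pxor), gf_inv(gx ^ gy))
Dy = Pxor ^ Dx
assert (Dx, Dy) == (data[x], data[y])  # both lost disks fully recovered
print("recovered both lost disks")
```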
BTW - those are all features we have globally. The LUN Sniff is why proactive hot sparing is a big deal. Some don't see the difference between that and a plain hot spare, but it's a big difference. While we found spin-down/up to be a non-issue, we do know that RAID rebuilds are the most stressful part of a disk drive's life. When a threshold of soft errors is detected (they generally presage a hard error), the drive starts to migrate itself out in ADVANCE of the failure - which means all the usual parity heavy lifting and the (very random) impact on all the other spindles is eliminated. This shortens the period of exposure, which is the name of the game (see the sketch below). And of course, on top of that, we have double parity if you need it (very large RAID groups).
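To make that difference concrete, here's a minimal Python sketch of the decision logic - the threshold value and the class shapes are hypothetical (FLARE's actual internals aren't public). The point is the two paths: a cheap sequential copy in advance vs. a full parity rebuild after the fact:

```python
from dataclasses import dataclass

SOFT_ERROR_THRESHOLD = 20  # hypothetical trip point, for illustration only

@dataclass
class Drive:
    name: str
    soft_errors: int = 0
    migrating: bool = False

@dataclass
class RaidGroup:
    members: list
    hot_spare: Drive

    def start_proactive_copy(self, src, dst):
        # A straight sequential copy of the suspect drive onto the spare:
        # no parity math, no random reads hammering the other spindles.
        print(f"proactive copy: {src.name} -> {dst.name}")

    def start_rebuild(self, failed, dst):
        # The reactive path: reconstruct from parity, which means reading
        # every surviving member end to end - the stressful, wide-exposure
        # window that proactive sparing is trying to avoid.
        print(f"rebuilding {failed.name} from {len(self.members) - 1} survivors")

def on_soft_error(drive, group):
    """Soft errors generally presage a hard failure - act before it happens."""
    drive.soft_errors += 1
    if drive.soft_errors >= SOFT_ERROR_THRESHOLD and not drive.migrating:
        drive.migrating = True
        group.start_proactive_copy(drive, group.hot_spare)

group = RaidGroup([Drive(f"d{i}") for i in range(8)], hot_spare=Drive("spare"))
for _ in range(SOFT_ERROR_THRESHOLD):
    on_soft_error(group.members[3], group)  # the 20th soft error trips the copy
```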
Now, of course, you can't apply all the stuff we did on our EMC arrays to the JBOD or basic RAID you have in a given ESX server - but simply start with the first finding: spin down/up didn't materially affect the MTBF.
In the end, the other thing that strikes me is that the whole thing is kinda moot. Look, if an ESX server fails, it's not good, but it's not a disaster either - VMware HA will restart the VMs, so you can replace the ESX host at your leisure (and I'm sure your servers are covered under warranty, right?). Personally, for my datacenters, I know the power savings of DPM would be hard to ignore - even IF it shortened the server life by 10% on average - which it doesn't!
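If you want to sanity-check that power claim, here's the back-of-the-envelope in Python - every number below is an assumption, so swap in your own:

```python
# All inputs are illustrative assumptions - measure your own environment.
idle_watts = 250           # what a 2-socket ESX host burns sitting idle
hours_parked_per_day = 10  # time DPM could keep it powered off (nights, weekends)
cooling_multiplier = 2.0   # rough rule of thumb: ~1W of cooling per 1W of IT load
cost_per_kwh = 0.10        # USD

kwh_per_host_year = idle_watts / 1000 * hours_parked_per_day * 365 * cooling_multiplier
print(f"~{kwh_per_host_year:,.0f} kWh/yr, ~${kwh_per_host_year * cost_per_kwh:,.0f}/yr per host")
# ~1,825 kWh and ~$182 per host per year with these inputs - multiply by a
# few hundred hosts and it's real money, before you even count the carbon side.
```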