The inimitable MikeD (http://www.mikedipetrillo.com/mikedvirtualization/) pinged me and some of my colleagues this week with a question he was starting to hear and thought sounded like FUD. The question is:
"Does the Distributed Power Management power-on/off cycle cause the boot LUNs of the ESX servers to fail prematurely?"
BTW - if you're not a DPM expert, this video highlights its incredible promise (it's currently an experimental feature in VI 3.5).
The answer in a word? NO. DPM does not shorten an ESX server's lifespan due to drive failure.
Ok - two comments. The first is that FUD is always funny (sometimes even our own - and yes, I know sometimes EMC is the FUDer not just the FUDee - though I try as hard as I can to not do that). The second is that there is more than just the one word answer of course, otherwise it wouldn't be another long-winded Virtualgeek post, would it? :-)
If you're interested.... Read on!
First of all - take this with a grain of salt - our experience is with spin-down in enterprise arrays with enterprise HDDs - but I strongly suspect the same would hold true of the SAS drives most ESX servers use, even with their basic RAID controllers.
So the EMC Disk Libraries (EDL), including the DL3D platforms (which do dedupe), use highly available CLARiiON backend storage, and the B2D case is the most widely deployed CURRENT use of spin-down that I'm aware of - a nice fit of function and form. B2D works well with large SATA drives, and the B2D task tends to be very periodic (spin up for backup, spin down for the rest of the day, except for the occasional restore task, where the drives automatically spin up).
BTW - we have already started on bringing this feature into the mainstream FLARE feature set.
When we started to build spindown into the EDLs, we asked the same question that is being asked of VMware - because there is logic to it.
So, we started a two-part project: i) test HDDs against the manufacturer's MTBF claim during spin down/up cycles; ii) figure out what we could do to make them even MORE robust (in case we found stuff that was bad).
So....
- The EMC team that does all hardware testing conducted extensive start/stop testing operations specifically designed to answer spin up/down reliability concerns. Essentially the approach was to "trust but verify" the failure metrics the disk vendors provide. In the end, we found the average MTBF was unaffected. I still think this is an example of the enterprise-level process focus which is one of the reasons many customers choose EMC.
- But why stop there when there's fun engineering to be had? We added all the following increased-availability features just because (and BTW, in the B2D case, they help):
- Disks that are spun down and not accessed for 24 hours (because no application needed the data) will be spun up and subjected to the normal LUN Sniff Verify operation. What am I talking about? The EMC platforms are always looking for "soft errors" - a non-fatal read error where the returned data is wrong (it's checked against an in-block CRC). In a standard environment where disks are always spinning, the Sniff Verify runs constantly in the background. With spin-down, naturally the Sniff Verify can't run in the same manner. Instead, it runs when the disks are spun up, in a "burst" of 30 minutes. One can think of it as a real-time vs. a batch activity - over time, algorithmically it is the same as far as running from beginning to end (there's a rough sketch of this right after the list below).
- In this use case, RAID 6 is used by default for dual parity. It works well with large RAID sets and with this type of workload. The EMC answer on RAID type is in general "it depends" - since it depends on the workload - but for B2D, RAID 6 is a very good choice.
- We changed the FLARE logging and added specific new error codes for problems that occur while spinning a disk back up.
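To make that real-time vs. batch point concrete, here's a back-of-envelope sketch - the block counts and scrub rates are made up for illustration, and this is NOT the actual FLARE sniff-verify code - showing that a heavily throttled, always-on background scrub and a once-a-day 30-minute burst can cover the same ground:

TOTAL_BLOCKS = 1_000_000                 # hypothetical LUN size in blocks

def days_for_full_pass(blocks_per_minute, minutes_per_day):
    """Days until every block has been read and CRC-checked once."""
    blocks_per_day = blocks_per_minute * minutes_per_day
    return TOTAL_BLOCKS / blocks_per_day

# "Real-time" mode: drives always spinning, scrub throttled way down so it
# doesn't compete with host I/O.
continuous = days_for_full_pass(blocks_per_minute=50, minutes_per_day=24 * 60)

# "Batch" mode: drives spun down most of the day, scrub runs flat out
# during the 30-minute burst after spin-up.
burst = days_for_full_pass(blocks_per_minute=2_400, minutes_per_day=30)

print(f"continuous scrub: full pass in ~{continuous:.0f} days")   # ~14 days
print(f"30-minute burst:  full pass in ~{burst:.0f} days")        # ~14 days

Either way, every block gets checked end to end - the burst model just batches the work.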
BTW - those are all features we have globally. The LUN Sniff is why proactive hot sparing is a big deal. Some don't see the difference between that and a plain hot spare, but it's a big difference. While we discovered no big deal on spin-down/up, we do know that RAID rebuilds are the most stressful part of a disk drive's life. When a threshold of soft errors is detected (they generally presage a hard error), the drive starts to migrate itself out in ADVANCE of the failure - which means all the usual heavy lifting re: parity, and the impact on all the other spindles (which is very random), is eliminated. This shortens the period of exposure, which is the name of the game. And of course we do that and have double parity if you need it (for very large RAID groups).
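If it helps to picture the proactive hot-sparing logic, here's a trivial sketch of the idea - the threshold value, the Drive class, and check_drive are hypothetical stand-ins, not the actual array firmware:

SOFT_ERROR_THRESHOLD = 25   # made-up trip point for illustration

class Drive:
    def __init__(self, slot):
        self.slot = slot
        self.soft_errors = 0
        self.being_spared = False

    def record_soft_error(self):
        """Called when a sniff verify catches a CRC-mismatch (soft) read error."""
        self.soft_errors += 1

def check_drive(drive, hot_spare):
    # Soft errors tend to presage a hard failure, so once the count crosses
    # the threshold we copy the still-readable drive onto the hot spare.
    # That's a simple sequential copy - far less stressful than the parity
    # rebuild that would hammer every other spindle after a hard failure.
    if drive.soft_errors >= SOFT_ERROR_THRESHOLD and not drive.being_spared:
        drive.being_spared = True
        print(f"drive {drive.slot}: soft-error threshold hit, "
              f"proactively copying to hot spare {hot_spare}")

spindle = Drive(slot=4)
for _ in range(30):          # simulate a drive getting flaky
    spindle.record_soft_error()
check_drive(spindle, hot_spare="slot 15")

The point is simply that the copy starts BEFORE the hard failure, so the window of exposure (and the rebuild load on the rest of the RAID group) shrinks.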
Now, of course, you can't apply all the stuff we did on our EMC arrays to the JBOD or basic RAID you have in a given ESX server, but simply start with the first finding - spin down/up didn't materially affect the MTBF.
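For anyone who wants to put numbers on what an MTBF claim actually implies, here's a quick back-of-envelope - the 1.2M-hour figure is a typical spec-sheet value, not a measured EMC result, and the fleet size is hypothetical:

import math

HOURS_PER_YEAR = 8760

def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate implied by a constant-rate MTBF figure."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def observed_mtbf(device_hours, failures):
    """Point estimate from a fleet test: total device-hours per failure."""
    return device_hours / max(failures, 1)

# A 1.2M-hour spec-sheet MTBF works out to roughly a 0.73% AFR...
print(f"AFR for a 1.2M-hour MTBF: {afr_from_mtbf(1_200_000):.2%}")

# ...and verifying it takes a lot of drive-hours: 1,000 drives cycled for a
# year is ~8.76M device-hours, where you'd expect on the order of 7 failures.
print(f"Implied MTBF from 8.76M device-hours and 7 failures: "
      f"{observed_mtbf(8_760_000, 7):,.0f} hours")

Which is why this kind of testing takes a big facility and a long time - a handful of drives spun up and down on a bench tells you next to nothing.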
In the end, the other thing that strikes me is that the whole thing is kinda moot. Look, if an ESX server fails, it's not good, but it's not a disaster either - the VMware HA response will recover the VMs, so you can replace the ESX host at your leisure (and I'm sure your servers are covered under warranty, right?). Personally, for my datacenters, I know the power savings of DPM would be hard to ignore - even IF it shortened server life by 10% on average - which it doesn't!
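To put some purely hypothetical numbers on that power-savings point (the wattage, hours, overhead, and electricity rate below are all assumptions, not measurements):

IDLE_WATTS = 300          # assumed draw of an idle ESX host
HOURS_OFF_PER_DAY = 10    # assumed time DPM keeps the host powered off
COOLING_OVERHEAD = 1.5    # assumed multiplier for cooling/power distribution
COST_PER_KWH = 0.10       # assumed electricity cost in dollars

kwh_saved = IDLE_WATTS * HOURS_OFF_PER_DAY * 365 / 1000 * COOLING_OVERHEAD
print(f"~{kwh_saved:,.0f} kWh per host per year, "
      f"roughly ${kwh_saved * COST_PER_KWH:,.0f}/year - per host DPM can park.")

Multiply that across a cluster and it adds up fast.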
Sure would like some clarification.
Seagate does not use MTBF anymore, but has opted for a clearer AFR.
http://forums.seagate.com/stx/board/message?board.id=BeforeYouBuyBoard&message.id=3
Why not test to the clear standards the manufacturer uses? MTBF is almost meaningless.
How did EMC find the time to test this and get a statistically accurate idea of the failure rate? It would take a very long time to pull this off.
Also, how did you come to the conclusion that firing up your disk 1 time per day is statistically relevant? I don't know what the answer would be, but since I have a degree in Comp Sci, I think the answer might be "It depends". ;-)
I don't positively *know* that spin-down affects drive life, but I do know that heating and cooling cycles are negative. I do know that spinning up is a stressful condition for any mechanical system.
Frankly, I hope you are correct. But time will be the ingredient needed to prove it all out.
I'd also rather hear this kind of thing coming from Seagate themselves. Agree?
Always enjoy stopping in, Chad! Hope the EMC diaspora finds new homes soon too. :-(
Posted by: Mike Shea | December 04, 2008 at 06:20 PM
Chad,
Interesting post - this is something I raised on VMTN a year ago (http://communities.vmware.com/message/818774#818774), with a few people chipping in that disks nowadays are sensible enough, with preemptive commands, to spin down safely...
Anyone that's performed a migration of physical servers from one datacenter to another knows that you will have some kit that fails, either immediately or over a period of time.
At the end of the day, there is a low possibility of both drives in a RAID 1 set failing at once, and to be honest, with the smart opportunity and cost savings from DPM on offer, in my opinion it's worth the risk!!!!
Posted by: Daniel Eason | December 05, 2008 at 05:00 PM
Mike - thanks for the comment.
Unfortunately, I was in Paris (EMC Forum, meeting with the VMware sales, partner and SE teams for South EMEA, and customers of course) this week, so I didn't have the amount of time I would normally like to put into a post.
So - any error is mine alone. I asked the CX product/engineering team for their testing results, and they didn't use "MTBF" - they used the word "reliability" - so they may be using a different metric now; I'll double check. How did they do it? Well, we HAVE been testing this for a long time (more than a year before we introduced the feature mid-year), and of course you know the facility in Franklin where they do all the mechanical testing... It's BIG.
I also considered putting a qualifier in the heading (where I said "NO"), but decided to put the qualifiers in the body (where I used words like "strongly suspect"). In the end, I agree with Mr. Eason's comment - even if there were a higher chance of ESX server failure, I think the upside outweighs it, because a good VI design is in essence stateless on the server (this gets better in the next version with the Distributed vSwitch or Nexus 1000v/VN-Link).
Re: the statistical significance of periodic vs. constant soft-error checking - I'm an EE with a CS minor, so I've taken my fair share of stats :-) The point here is that over a slightly longer period of time the entire dataset is checked. During normal IO operation (drive spun up for backup or restore, which will happen once a day), all the normal checks occur; the added periodic check (drives get spun up, a quick set of checks runs, then they spin down) ensures that over time every block of the drive is exercised and checked (because otherwise "stale" parts of the drive - particularly in the B2D use case - wouldn't get checked).
I do agree that it would be great to hear the drive manufacturers themselves pipe in on the thread - any Seagate/WD/STEC folks out there? Will be interesting to see how Enterprise Flash drives change the dynamic here over the next few years.
Re: diaspora - I hope that every person finds a place where they enjoy working. Since you are a former EMCer at NetApp, I sincerely hope you're happy there. Life is too short not to enjoy your work - I certainly love mine.
Quoting one of the most recent former NetApp folks now at EMC (who sent this email on 12/2): "To be honest, I felt little uncomfortable adjusting to EMC environment during the first 30 days. I always used to have a nagging question in the back of my mind about my decision to move. The good news is that I don't have it anymore and I really I feel very happy to have made a decision to move to EMC. I would like to take this opportunity to thank you for bringing me here." Moral of the story, IMHO - EMC and NetApp are both great companies, fiercely competitive with each other, and I think it consistently forces us to be the best we can be for the customer (though occasionally brings out the worst in us towards each other).
As a frequent commenter on the EMC blogs, what are your thoughts about prefacing posts on competitor blogs with "Disclosure - I'm a ____ employee"? (For anyone interested, Mike's other comments are here: http://virtualgeek.typepad.com/virtual_geek/2008/09/my-likely-last.html#comments and here: http://virtualgeek.typepad.com/virtual_geek/2008/08/welcome---my-fr.html.) Lately I've been trying to do that when I post on a NetApp blog - you know, just so the people reading the comment can apply their own judgement knowing the dynamic.
Posted by: Chad Sakac | December 06, 2008 at 11:10 PM
The drive vendors all claim a 50K spin-up rating for their enterprise-quality drives. At 10 spin-ups per day, that's over 13 years of use before failure - and 10 per day is probably high. So, even given some exaggeration on the reliability ratings, this is longer than the useful life of most servers.
Your comprehensive response is more than low-quality FUD like this deserves.
Posted by: Charlie Dellacona | December 09, 2008 at 08:44 AM