Ugh - yesterday was a tough day. My team is a global team, and the phone started ringing and emails started pouring in as soon as the clock ticked over to Aug 12th in Australia. The only reason I haven't posted earlier is that I was up to my eyeballs.
The issue is described in this KB here. While it was rough on me as a VMware partner, and on EMC as a VMware customer (you can see from my other posts we have thousands of ESX servers, and while production goes through a change control process and is on 3.0.2 and 3.5u1, our solutions validation and demonstration labs update as fast as possible) - people at VMware are in agony.
I don't want to write a big post on this - great ones have been written at some of my favorite places - here, here (Alessandro at his hyperbolic best), and here.
Before you go any further, get the express patches here, and you can see Paul Maritz's comments here.
I think a lot of people in the blogosphere are extremophiles, defined this way:
"An extremophile is an organism that thrives in and may even require physically or geochemically extreme conditions that are detrimental to the majority of life on Earth."
Many of the comments on VMTN just break your spirit a bit until you harden yourself. Anger/frustration are reasonable, but anonymous posters and fan-boys, I have little patience or respect for you (though I doubt you care). Strong feelings are good - passion is good, but don't hide if you feel strongly about your position, and you think it's reasonable.
I'm getting hardened to it a bit - along with Intel, Cisco, Microsoft and other big players, EMC gets a lot of flak from these kinds of folks - people want the small guy to beat the big guy, and revel in it, and like slinging mud.
VMware is a weird exception - they are the 800 pound gorilla of their space, but it happened so fast they maintain the "startup" halo around them. This inevitably changes as growth happens, they solidify themselves as a key player in the new IT infrastructure space, and new startups work the David/Goliath story humans love.
Here's one person's opinion - no more, no less.
- This sort of happens to all companies - great/weak, small/large. I'm not blasé about it, it's a terrible thing. The real test is how the organization deals with it - are they open, transparent, fast to fix, and afterwards, do they improve their process?
- If it happens with ANY frequency, regardless of the company, there are serious problems - we are all only as good as our products.
- The way to reduce the risk to your IT infrastructure from this, if it really is inevitable, is to implement change control. A classic example is that Enginuity is currently at 5773, but DMXes ship with 5772 code for now. The same holds true with all our platforms - there's a window we call a PPR - "Phased Product Release". Even with long-established processes, we've STILL been hit (there was a similar date-based bug in the Kerberos code used by our CIFS server a while ago). Customers implement their own version control on top of what we do. The only reason customers haven't been behaving that way with VMware is that their software quality has been SO good.
Let's look at this case:
- VMware has clearly been open, transparent, and clearly aggressive with their fix, which was made available late in the same day.
- This is a first for VMware. A bad first no doubt, but still a first. The question is whether it's the first of more to come, or an isolated incident. While I may be biased, I think they deserve the benefit of the doubt based on their history of quality.
- Every customer needs to apply good IT best practices on change control of all parts of their mission-critical infrastructure - VI is no different, and arguably one of the most important ones to do this with - because, like core network switches and storage infrastructure - it's hyper-consolidated.
What was your experience, and what are your thoughts? Extremophiles welcome, but rationalists desired :-)

Change control is, of course, essential for any production environment. But I personally haven't encountered a lab test script that includes rolling forward the clock on lab systems to catch something like this.
In this case the bug manifested itself around 2.5 weeks after release of the patch? I don't know the technical details, but could it just as easily have been 2 months?
So I guess the question is: what is the "right" amount of time to wait before installing an update?
Posted by: Ryan B | August 13, 2008 at 02:15 PM
This is similar to the issue we had with the 59330 code build of ESX 3.5 which was time-bombed back in February. VMware's response to that event was disheartening, but I'm much happier with the stance they have taken with this issue.
In the words of Alfred Pennyworth: "Why do we fall, sir? So that we might better learn to pick ourselves up."
And so they have.
Posted by: Aaron | August 13, 2008 at 03:19 PM
This was a ridiculously stupid thing to let out and I'm on the fence as to whether someone should get the chop as a result of it.
(Not my call, I'm not an employee of VMware)
But are they shipping a crappy product? No. They just did something incredibly dumb.
Load both barrels. Point at feet. Fire.
People either see that it was an example of epic fail and take them at their word it won't happen again or they move off to another vendor.
As an Admin I never had time for the complainers who'll do neither.
Posted by: Storagezilla | August 13, 2008 at 04:19 PM
I suspect that if all ESX users were polled, a very small percentage of companies actually run the very latest version. We tend to be right at the leading edge with ESX and even WE don't have the very latest in production. We do have it in the lab, and it caused very little trouble for us.
However, when you have come to expect the best, as we have with the quality of VMware software, any glitch is glaring and surprising (even if not catastrophic) - especially if you are fortunate to have someone in your group who drinks only Microsoft Kool-Aid.
I believe, as others have stated, that we need to judge VMware on how they react to the mishap before we get too concerned. It's the first chance to see how the new leadership responds to such quality challenges. And, Goodness knows, he had a lot of opportunity to react to shipped software bugs at one of his former employers...
Posted by: Neo Writer | August 13, 2008 at 06:21 PM
If you're going to have time-expiring licences, always offer a grace period before they expire.
i.e. start displaying prominent warnings that the licence is about to expire at least a month before the expiry date. That way, people notice before functionality is lost.
Phil
Posted by: Phil | August 14, 2008 at 05:35 AM
This one caught about half my sockets in production. We had been testing out Update Manager and using DRS to demonstrate zero downtime upgrades. It was a great demo to management but an epic fail a week later when things went wrong.
To add insult to injury, Update Manager decided it didn't want to talk to its database and stopped working halfway through applying the emergency patch. Thankfully I started with these things before Update Manager came around, so I just dropped the patch on the box and fired it off.
Not a good infrastructure week.
Posted by: John | August 15, 2008 at 08:46 AM
- 'Zilla, you and Barry remind me with every post/comment that there's not a "no extremophiles allowed" sign on EMC buildings :-) It's OK, I love you anyway
- Neo, I'm with you, but on the other hand, when you have hundreds of thousands of customers, something that affects 10% is still tens of thousands of customers. This gets hard as you get big, and VMware is now BIG.
- Phil, I hear you, and I agree, but to be clear - this wasn't a time-expired license, it was a time-expiration built into pre-release code that slipped into the GA build. Maintaining build trees is critical, as is regression testing, but the only license timeouts in GA VMware products are on evals, not the real deal (at least as far as I know).
- John - ouch. All I can do is express my sympathy. I'm glad you were able to apply the express patch quickly.
Posted by: Chad Sakac | August 15, 2008 at 10:25 AM
This is a tough issue. In my view it was dealt with by VMware promptly. If anything, my only complaint was the lack of updates; however, I trusted them (as usual) and they delivered a resolution.
Maybe this will change how beta/patches get released in future.
Also, I sincerely hope it doesn't change the way VMware approaches providing constant improvements and additions via update patches, as this is what makes them and the developer team just so damn good!
Posted by: Daniel Eason | August 15, 2008 at 03:25 PM
As a fan of VMware's products, this was disheartening for me also.
As a VMware customer, I wasn't happy about it, but I also hadn't deployed it into production yet. I tend to take a more conservative approach to deploying patches. Heck, let everyone else find the bugs and problems it creates - I'd trade "stable" for "new" any day.
I wouldn't let others get you down. One of the things that helps me maintain the proper context here is remembering that "it's easy to pick on the big guy". This includes all big players in their markets: VMware, EMC, Microsoft, etc. They have millions of customers so their mistakes will become world famous at the speed of the Internet. It's the nature of the beast. Just remember not to lose sleep over the nay-sayers that are just using it as an opportunity to nay-say.
Posted by: Virtual_JTW | August 22, 2008 at 08:42 AM