Some days, my head feels like it's going to spin right off :-) Thank goodness I love my job! I'm personally working on some of the last-minute demos - there's some really cool stuff that's under embargo until Wednesday - but suffice it to say, it will be cool!
My colleague Nick Triantos from NetApp did a good post on what NetApp will be doing, here: http://blogs.netapp.com/storage_nuts_n_bolts/2008/09/netapp-vmworld.html?cid=130026414#comment-130026414
I have to confess, it got my blood boiling a little bit - he accuses EMC of being "All Marketing" and claims NetApp will have the largest presence there. Please, look at my record: I've been kind, even-handed, even complimentary to NetApp, but come on, Nick, is that condescending tone necessary? Tar ME with the marketing brush?!?
Did you know that while we're mudslinging (well, at least I'm getting mud slung at me), the Large Hadron Collider fired up today? Man, it's inspiring, and a reminder that this competitive stuff is SO petty. I'm also pretty inspired by the SVVP "other shoe dropping," with formal ESX 3.5u2 qualification getting done.
Ok, so we have a marketing gimmick (the car!!!), sure, but heck, it's Vegas :-) I like to think of us as "All technical, and some fun too"
You'd think with all this back and forth that it's all we do (well, for me - it's 8:51pm and I've been working all day, so this is a diversion :-)
I will tell you this, it's going to be technical, we're demoing like mad, and there are some MIND BLOWING things there.
We're in booth 500, and I look forward to seeing you there. What you can expect to see:
- Demonstrating Oracle 11g and SAP with instant backup/recovery and Site Recovery Manager, all in action together (BTW - thank you to all who have been so interested in the Oracle piece; it got oversubscribed, and VMware has asked us to schedule a second session of EA1961)
- Demonstrating Exchange 2007 Joint VMware/EMC reference architecture including VSS backup/restore, CCR and LCR use cases
- How all our management tools (storage, network, application dependency, and VMware best-practice compliance analysis) integrate with the VI 2.5 SDK and ESX CIM APIs - TODAY. Our view is that customers need to start virtualizing tier 1 apps - and we want to help - with the joint best practices and with tools to ensure you get application-level SLAs.
- Replication Manager for VMware - instant VM or datastore backup/restore - integrated with the VI and ESX APIs for nice, simple point-in-time consistency. As a company that's not as good at marketing as others, we quietly actually shipped this BEFORE ANYONE ELSE. It wasn't pre-launched; it shipped in June. Here's a quick look if you want to see what it does. BTW - this is a single tool that does application-integrated replicas for Exchange 2003 and 2007, SQL Server, Oracle, and now VMware, across all EMC's platforms. Here's a demo....
- VDI - great to hear that NetApp hit 5K clients. We just broke 10K :-) UPDATE: I regret phrasing this sentence this way - it comes across as snide and condescending. I let frustration get the better of me. NetApp has no intrinsic scaling limit that I know of; I only know what we can do. BTW - that's what this standing lab in Santa Clara is there for: for EMC and VMware to jointly prove these out at scale (target of 40K clients). BTW, we're also doing this at 500- and 1,000-client scales. It's not a chest-beating exercise - customers come in all shapes and sizes, and we're trying to help prove out how this works for them all; it was triggered by the World Bank, which asked us to show how this would work for 20K clients. We have a similar (but different - each with advantages and disadvantages) approach to NetApp's - mass scale with consumption of one - both in how we do this today, and in how we do it with VMware tomorrow. We will be sharing all our performance scaling data and experiences. THIS IS BLEEDING EDGE. Anyone who tries to tell you that this is all figured out is not giving you the whole truth. Anyone who just shows you the array side of the equation is flat-out dangerous. When you push this far, and all the clients boot simultaneously, sometimes VC gets into a weird state on some of the VMs. It's fun, and continues to be fun. BTW - here's a cool screenshot (count up the VMs - there are actually 10,994)
UPDATE: Folks, the NetApp side has made some comments that in their view we're not being clear here: http://blogs.netapp.com/virtualstorageguy/ Vaughn is a good guy, so I'm going to take the personal slag (ouch - used car salesman) in stride. We are replicating the LUN objects in this case, and they are the unit of replication. If you will have more than roughly 200 VDI VMs per cluster, you need to have several VMs per LUN. The PowerVDI tool automates this. These ratios mean it's more accurate to say "you will invariably achieve 100:1 space savings, and may achieve more depending on the scaling goal" (the ratio of source VMs per source LUN needs to be higher at higher scales, and therefore 100:1 is the worst-case model). There is NOTHING intrinsically wrong with NetApp's approach, and there is NOTHING intrinsically wrong with EMC's approach. Each has advantages and disadvantages, and each is a function of the vendor trying to solve the customer problem. The ratio of LUNs to VMs is not 1:1. We don't do this because we love block - while we can take 96 filesystem snapshots, only 16 of those are writeable, so this approach is better for EMC customers. Just setting the technical record straight.
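To illustrate the "several VMs per LUN at scale" point, here's a toy calculation. The ~256 LUN-per-cluster ceiling is my illustrative assumption (roughly the ESX 3.x configuration maximum), not a number taken from PowerVDI itself:

```python
import math

# Assumed per-cluster LUN ceiling (illustrative; roughly the ESX 3.x limit).
ESX_LUN_LIMIT = 256

def vms_per_lun(target_vms: int) -> int:
    """Minimum VMs you must pack per LUN to stay under the LUN ceiling."""
    return math.ceil(target_vms / ESX_LUN_LIMIT)

# Past a couple hundred VMs per cluster, one-VM-per-LUN stops working,
# and the required packing ratio grows with the scaling goal.
for target in (200, 1_000, 10_000):
    print(target, "VMs ->", vms_per_lun(target), "VM(s) per LUN")
```

This is why the ratio of LUNs to VMs can't stay 1:1 as the scaling goal grows.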
- Here are some datapoints from the VDI testing to date:
- there are limits to how fast ESX and VC can register VMs (hostd dies) and how many you can spawn simultaneously - discover what we've done jointly with VMware on this!
- XP booting causes 4,000 IOs spread over 20 seconds. That is 200 IOPS per client, sustained during the boot. Multiply that by 100 clients for a server: 20,000 IOPS. Or multiply it out to a moderate environment of 1,000 desktops: 200,000 IOPS, all directed at the array. How much hits the backend vs. the cache? What is the client experience? Come and find out! (BTW, this will be a big topic in session PO3824 - "Storage Solutions for Enterprise Consolidation with VMware")
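The boot-storm arithmetic above is easy to sanity-check in a few lines (the 4,000-IO/20-second boot profile is the measured figure from the testing; the client counts are just the scaling assumptions):

```python
# Boot-storm back-of-the-envelope: scale one measured XP boot profile
# (4,000 IOs spread over 20 seconds) up to server and site level.

ios_per_boot = 4_000          # measured IOs for a single XP boot
boot_seconds = 20             # window the IOs are spread over

iops_per_client = ios_per_boot / boot_seconds   # sustained per-client rate
iops_per_server = iops_per_client * 100         # assuming 100 clients/server
iops_per_site = iops_per_client * 1_000         # a moderate 1,000-desktop shop

print(iops_per_client, iops_per_server, iops_per_site)
# How much of that hits spindles vs. array cache is the real question.
```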
- BTW - congratulations, Dan, for coming in 3rd in VMware's PowerShell contest with the PowerVDI tool we created for this. The quote was great - here's what Lee Holmes, author of the Windows PowerShell Cookbook, had to say about it: "Automates a very complex scale-out task, and combines a lot of technology (VMWare, Putty, AD, etc) into a single PowerShell task. Offers interactive use, unattended scripting. For high-scale clusters, this would be an enormous time-saver. This script really demonstrates the super-glue nature of PowerShell." Dan, you'll laugh, man, but a large NetApp customer asked me for the script... wait until we unveil what's coming next! For those of you wanting this script, note that we've posted it openly for all.
- Deduplication, and simple file-level recovery for Windows/Linux VMs
- NFS and iSCSI deployment best practices
- Joint EMC/VMware Remote/Branch Office solution
- Several of our Partners (Agilsys, Fusionstorm, ICI and others) will be present in the booth to show their integrated solutions.
- We will be demonstrating several things I can't even allude to as they are under an embargo until the event.
We will have 70 people at the event, including myself, but also our engineering teams directly - including Bala Ganeshan and Sheetal Kochavara, who are the engineering folks who maintain our DMX and CX best practices - along with our field VMware specialists (all VCPs) from around the world.
I don't know what any other vendors are sending equipment-wise, but I've sent 13 CX3s, 4 CX4s, 3 NS20s and 2 DMX4-950s. Those will be in lots of places where you can use them, including the hands-on labs, and in various booths.
I can't wait to talk to YOU, and of course with our competitive brothers and sisters!
Oh, and our keynote (where all are welcome!) is KN EMC on Wednesday at 11:00-12:00. It is a joint VMware/Cisco/EMC session discussing best practices for designing the next-generation datacenter, with Scott Davis from VMware and Ed Bugnion from Cisco. It's three nerds on stage - a recipe for trouble! I asked to do a press release to show what we're doing at the event, but our PR folks told me that unless a senior exec is doing the keynote (like Dave Hitz is for NetApp - and they did a press release), we don't do those. I'm just a technical guy, so I don't qualify :-)
Just to show the ridiculousness of it all, NetApp's press event got scheduled at the same time as our keynote. We intentionally moved ours to avoid conflicting with theirs because we know some people will want to go to both. I'm going to keep pushing the kind agenda, because I don't think pettiness helps any of us.
Nick says:
"For NetApp this will be the largest event of the year and like everybody else we've been preparing for it for some time now. This year, by far, we'll have the largest presence in the show and we'll be ready to showcase various technologies that solve some very real and hard to solve business problems."
UPDATE: Nick has made a fair clarification of his intent of the statement, and I trust him in his clarification.
Nick - looking forward to meeting you there and getting to know you. This is our biggest event of the year also - a bit bigger than EMC World, which had 10K customers. I don't know if we'll have the largest presence at the show, but I can tell you that you're not in a position to claim that you will until after the show ends, right? Statements that are not true? I call that Marketing.
.
.
.
OK, I can't help being a bit petty.
Remember the whole brouhaha about the IDC data back here?
So, on Sept 8th, Goldman Sachs released their annual survey analyzing IT spending - the "Independent Insight." Here's how THEY describe it:
"This is the 42nd issue in our IT Spending Survey series. Our survey panel is made up of 100 managers with strategic decision-making authority at multinational Fortune 1000 companies."
Ok - so note (I don't want to misrepresent!) that this is the Fortune 1000. We're doing very well in all segments, but I don't want to imply anything about other segments - take it for what it is, no more, no less.
First, for my VMware brothers and sisters - congrats:
"VMware’s incumbent position at the top of the rankings demonstrates that server virtualization momentum remains alive and well." (Microsoft - watch out for the wrath of Nick, showing Live Migration when Hyper-V R2 is scheduled for a 2011 release is called Marketing)
Nice to see EMC Software (which excludes VMware) also in the gainer column - our management tools are flying off the shelves, in some part due to their integration with the VI SDK and ESX CIM APIs.
I liked this, of course:
Does that look like it agrees with the IDC data from earlier to you? Look also at who went up and who went down.
Oh, and then here's this hot off the presses too: http://www.idc.com/getdoc.jsp?containerId=prUS21411908 The NAS numbers and growth numbers are stunning.
LET ME BE CLEAR: I THINK NO CUSTOMER SHOULD PICK A SOLUTION PURELY BECAUSE OTHERS DO. Every customer should pick solutions that work for them. What I am saying is you can't produce those results, outgrow those that are smaller, and do it across categories if you're not doing something right for your customers. To believe otherwise is to shrug off reality, and to become one of those "extremophiles".
Look, I want us to focus on helping the customer. Nick - I'm good with that if you are - sincerely......
I will be in some NetApp sessions (and in VMware labs, and in other sessions I'm interested in). I will not heckle - though of course you guys are welcome to in my session (leave my colleagues alone). I'm respectful and want to learn from you and with you. But I won't take mudslinging at EMC lying down - in the same way that you guys shouldn't (and don't) when we do it.
Speaking of mudslinging and a disingenuous show of NetApp "affection," maybe I ought to turn your attention to the following statement of yours.
"WAFL's tradeoff is free snapshots in exchange for non-linear performance under normal workloads and normal utilization. It's your superpower, and your kyptonite. Kickass write perfomance when there are loads of free spaces to write to..."
Now, where have I seen comments like these before, and, more importantly, where have I seen comments like these crumble like a house of cards under the longest-running SPC-1 benchmark published in the industry? Mind you, this was published in January 2008, yet eight months later you're attempting to beat the same drum.
Now that's mudslinging.
Cheers
Posted by: Nick Triantos | September 10, 2008 at 09:44 PM
Nick, watch this - let's play a game, one I sincerely hope is a useful one.
I'm going to play the devil's advocate, and then you do the same.
If you don't I'll play both sides.
It just doesn't compute for me how any technology (any) has only good design tradeoffs and no negative ones.
Ok, me pretending to work for NetApp (but I'm being sincere here, and look forward to you responding in kind):
1) WAFL is good, because it eliminates write penalties under most circumstances
2) WAFL is good because the NVRAM/write model can deliver excellent performance, and even better with RAM-based accelerator cards
3) WAFL is good, because it enables every write (as NVRAM destages during a journal event) to in effect be a snapshot - there are no "snapshots" per se, because nearly everything is a snapshot. This enables WAFL-based systems to have excellent snapshot scaling and performance - a core feature on which many other advanced features are based
4) WAFL and the fundamental design premise of using a filesystem as the underlying storage "container" means that common underlying mechanisms that apply at or below the filesystem can be applied to higher-level functions (i.e. a vehicle for delivering iSCSI and FC while maintaining common management models and functional capabilities), which has simplicity benefits.
All of the above are ABSOLUTELY true, and excellent design points for the early innovation of NetApp's founders.
Ok - now be intellectually honest - argue the other side - engineer to engineer.
I've got to do some more work tonight, but tomorrow, if you haven't, I'll do it. Others are welcome of course (I don't edit comments), but EMC'ers and other competitors, don't pile on - I don't think it would help.
NetApp folks, NetApp customers, you're welcome to add things that you think are design advantages, particularly if you think that they have no downside.
BTW - my intent here isn't to refute the advantages and somehow say that they don't have benefits, but instead refute the position that any given design decision or implementation isn't inherently a tradeoff.
Posted by: Chad Sakac | September 10, 2008 at 09:59 PM
Chad, we don't need to play intellectual games in order to reach a conclusion to a long-running saga regarding perceived and unproven WAFL deficiencies or inefficiencies, if you will.
The verdict has been in for some time now, and so has the supporting data at SPC1.org, which can be viewed by any and all interested parties, including yourself, since you don't seem to have studied it at all. In fact, as the NetApp user that you are, you ought to take a look. It may change your perspective and your assumptions.
Posted by: Nick Triantos | September 10, 2008 at 10:21 PM
Nick, I guess NetApp invented NFS as well? Why don't you guys build a solution that can scale beyond two heads, build a native Fibre Channel solution (maybe your numbers will go up), stop using these silly benchmarks - which have been discredited for years and to me are more of a marketing exercise than anything else - and settle your suit against Sun. Which, by the way, invented NFS ;)
"Stop, NetApp, you're killing me." (Do a Google search on that.)
I can't speak for Chad, but when my filer was in production we did see good performance - but then it hit the dirt within a few months. Who has time for weekend defrags?
Posted by: Terry | September 10, 2008 at 11:20 PM
Hi Terry,
A solution that can scale beyond two heads?
http://findarticles.com/p/articles/mi_m0EIN/is_2006_June_12/ai_n16463837
What does "native Fibre Channel" mean? There's only one way to implement the Fibre Channel protocol, Terry.
The "silly" benchmarks have been discredited mainly by those who dont have courage to run them...
Posted by: Nick Triantos | September 11, 2008 at 01:50 AM
OK, first, three apologies:
Apology 1: Nick, you're right that my comment (the "superpower/kryptonite" comment) was hyperbolic and inflammatory. I'm not being disingenuous; I just stepped over a line I try not to. Stated differently, in a less inflammatory way, my point was that for everything (technologies, companies, people), our strengths are intrinsically our weaknesses. I always get rankled by the "there is no downside to this engineering/technology decision" position.
Apology 2: for this subsequent list itself. I have to complete it - it's a matter of following through on a commitment, and I try to predictably deliver against commitments. I also hate that I can't figure out how to be more concise :-)
Apology 3: Terry, Nick can't see your email the same way I can, and I'm sorry. Terry isn't an EMC person, Nick - I think he is a customer. How about "I'm sorry you had a negative experience; I want to know more to learn how we can do better"? (I say this often, but far less often than I say "thank you for being an EMC and VMware customer".)
I've said it before: trust vendors that tell you where the solution stops working (saying this generally, not about either Nick's company or mine). Trust benchmarks that show where the solution broke. Trust partners who tell you what NOT to do with their own gear. Distrust those who focus on what is bad about others, rather than on what they can do to help you.
Cringe... I'm breaking my self-commandment above, but to follow through on my commitment, here we go.
Know that I will always try to openly state what we find as the limits of our own stuff. If there are factual errors in my commentary, folks, corrections are welcome. **I DO NOT CLAIM TO BE A NETAPP EXPERT**. Can I ask that we try to keep it on the level (i.e. technical corrections welcome, ravings not so welcome)? I'm trying to seal the wound, not reopen it. My point was purely that everything has an up and a down.
1) "WAFL is good, because it eliminate write penalties unde most circumstances"
The downside of this design is twofold. The first is that the strength (reforming random writes into contiguous writes buffered by NVRAM) makes the link between NVRAM and the scale of features and performance intrinsic. The second is that sequential reads after random writes, once the WAFL layer lacks contiguous blocks, can cause non-linearity in cases that are expected to be linear (for example, some databases expect locality of reference and use it as an optimization technique - so in some cases WAFL helps, in others it hinders - the essence of a tradeoff). That is not to say that NetApp filers catch fire and explode as WAFL has less contiguous space, or that this is a bad design choice - rather, these are TRADEOFFS. There is a reason why WAFL Iron and other utilities exist. Again - not bad, just a tradeoff.
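To make the "sequential reads after random writes" point concrete, here's a toy model. This is NOT WAFL's actual allocator - just a crude append-allocated layout showing why logically sequential blocks, written in random order, end up physically scattered:

```python
import random

def physical_layout(write_order):
    """Append-allocate: the i-th block written lands at physical slot i
    (a crude model of a write-anywhere / log-structured layout)."""
    return {logical: slot for slot, logical in enumerate(write_order)}

def seeks_for_sequential_read(layout, nblocks):
    """Count physical discontinuities while reading logical blocks in order."""
    return sum(
        1
        for logical in range(1, nblocks)
        if layout[logical] != layout[logical - 1] + 1
    )

n = 1_000
in_order = list(range(n))          # blocks written sequentially
shuffled = in_order[:]
random.shuffle(shuffled)           # same blocks, written in random order

seeks_clean = seeks_for_sequential_read(physical_layout(in_order), n)
seeks_aged = seeks_for_sequential_read(physical_layout(shuffled), n)
print(seeks_clean, seeks_aged)     # 0 vs. nearly n-1: almost every read seeks
```

The random-write case turns a single streaming read into roughly one seek per block - the non-linearity described above, and the reason reallocation utilities exist.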
2) "WAFL is good because the NVRAM/write model can deliver excellent peformance, and even better with RAM-based accelerator cards"
The role of NVRAM is analogous in some ways, but very different in others, to a write cache. The core function is not as a buffer (though it absolutely does buffer writes), but as an intrinsic mechanism to ensure filesystem consistency in the journaling action. The downside of this is that the NVRAM size is a core limiting factor for many NetApp features and envelopes. System memory is also an important factor (as it is on Celerra and CLARiiON and all arrays, and why we're all rearchitecting for massively multicore 64-bit procs and large addressable RAM) for other features - like some of the dialog you'll find if you follow the search Terry's previous comment suggested. That is not to say that NetApp filers catch fire and explode, or that this is a bad design choice - rather, these are TRADEOFFs.
3) "WAFL is good, because it enables every write as NVRAM destages during a journal event to in effect be a snapshot - there is not "snapshots per se", because nearly everything is a snapshot. This enables WAFL-based systems to have excellent snapshot scaling and performance - a core feature on which many other advanced features are based"
The downside here is the requirement for background reclamation, which NetApp has done great work to make more and more transparent, but which must be done. In a prior company, before EMC acquired us (Allocity), we used a WAFL-like block layout mechanism with a B+ tree pointer table, and it was REALLY hard to reallocate the blocks and restructure the B+ tree. NetApp is clearly better than Allocity at this :-) But we had plenty of smart folks, and it was HARD. This is one of those things that is intrinsic - you pay the piper before or after (you move the blocks at some point or another). The other core issue is that while this core architecture is excellent at snapshots, NetApp's clone capabilities are generally viewed as inferior to other vendors'. The response (SyncMirror - which is also the way they do a RAID 1-type thing that is better characterized as a mirror of a RAID-DP/4 container) when a workload or customer - right or wrong - demands it, is missing some features that most customers expect in that use case (consistency groups across objects spanning containers), because those are very hard to do if the files are in different filesystems. Though consistency is intrinsic if they are in the same filesystem, this can conflict with other best practices. The competitive response ("Clones/BCVs are always bad, snapshots are always right") is as bad as EMC saying "snapshots are always bad" (and sadly, sometimes we do) - a bad response either way. So, once again - strength is intrinsically a weakness - for both approaches. Still, in genuine outreach - kudos to NetApp for driving the use cases of snapshots into the mass mainstream. Replicas of data are good for lots of reasons - period. The EMC view is that sometimes you want snapshots, and sometimes you want clones. Calling a writeable snapshot a "FlexClone" (emphasis on Clone - a word widely used to describe a full copy) was genius marketing. I'm actually a marketing idiot.
If it's not apparent, I can't say anything in a short, clear way :-) I would have likely called it a FlexWriteableSnapshot or something horrific like that.
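For the curious, the "nearly everything is a snapshot / pay the piper later" idea from point 3 can be sketched in a few lines. This is a toy model of copy-on-write snapshots generally, not ONTAP's implementation:

```python
# Toy copy-on-write snapshot model: a volume is a table of
# logical->physical block pointers, and a snapshot is just a frozen
# copy of that table. Writes never overwrite in place; old physical
# blocks stay pinned until no snapshot references them -- the
# "pay the piper later" background reclamation described above.

class ToyVolume:
    def __init__(self):
        self.table = {}        # logical block -> physical block
        self.next_phys = 0     # append-allocation cursor
        self.snapshots = []    # frozen pointer tables

    def write(self, logical):
        self.table[logical] = self.next_phys   # always a fresh block
        self.next_phys += 1

    def snapshot(self):
        self.snapshots.append(dict(self.table))  # pointer copy, no data copy

    def live_blocks(self):
        live = set(self.table.values())
        for snap in self.snapshots:
            live |= set(snap.values())
        return live

v = ToyVolume()
for b in range(4):
    v.write(b)             # initial data lands in physical blocks 0-3
v.snapshot()               # nearly free: just copies the pointer table
for b in range(4):
    v.write(b)             # rewrites land in physical blocks 4-7

pinned = len(v.live_blocks())       # 8 -- the snapshot pins the old blocks
v.snapshots.clear()                 # delete the snapshot...
reclaimable = len(v.live_blocks())  # 4 -- old blocks can now be reclaimed
print(pinned, reclaimable)
```

The snapshot itself costs almost nothing at write time; the cost shows up later, when the pinned blocks have to be found and freed.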
4) WAFL and the fundamental design premise of using a filesystem as the underlying storage "container" means that common underlying mechanisms that apply at or below the filesystem can be applied to higher-level functions (i.e. a vehicle for delivering iSCSI and FC while maintaining common management models and functional capabilities), which has simplicity benefits.
The downside here is twofold.
The first is that the block objects, even if you comply EXACTLY with the FC and iSCSI protocol standards - which NetApp of course does; they have many smart engineers there - inherit the behavior of the underlying filesystem. Many of the filer limits are a function of how filer failover behavior will occur as FlexVol count and capacity increase. EMC's Celerra is similar (Data Mover failover speed is a function of the size and number of filesystems). It is a very hard engineering problem. BTW, this is why we enforce the Celerra "usable capacity limits" as lower than what the backend can actually support (which of course gets used in a competitive context). Likewise, the stated raw capacity behind a NetApp filer is not a reflection of the usable capacity per se, but rather an upper maximum; sets of parameters (like the A-SIS notes you'll find if you do the Google search the previous comment suggests) define the functional limits. And remember - LIMITS ARE NOT BAD, SO LONG AS THE VENDOR DISCLOSES THEM TO YOU - and NetApp states them in their docs, so make sure you see them, just like you should see EMC's. The challenge of accelerating filer failover and increasing its predictability is a long-term project, and one where NetApp has made great strides; I would fully expect them to continue to do so. NetApp has also made great strides in working with application and OS partners to build into applications the ability to extend I/O timeouts, so they sustain a failover with a pause but not a hard I/O failure. Customers should listen to the application vendors and NetApp and follow their best practices. Likewise, you will find these as common best practices with the Celerra. BUT, conversely, not using a filesystem as the core container, with iSCSI and FC LUNs as files in the filesystem, means that "pure" block devices have failover characteristics measured in milliseconds. For example, this is a really, really hard engineering problem on non-open systems.
Those folks dictate the requirements and expect the storage subsystem to comply - they laugh if you talk about extending timeouts - and this is one of the reasons why getting into the ultra high end is hard for NetApp, not because of performance or drive counts. Let me restate: our Celerra iSCSI implementation has the same design tradeoff (i.e. it's an iSCSI target that is a file in an underlying filesystem container) - but when we gave our Celerra FC, we (EMC) decided to trade off a single user interface and local/remote replication model in exchange for the "native" FC characteristics (i.e. the choice of FC isn't one purely of performance, but of other characteristics too). That's what Terry (I think) was referring to. It wasn't that we couldn't, and it wasn't that NetApp made the wrong choice. We simply made a different one. For example, here's the downside of our choice - we now need to invest in a higher-level management construct to make our management model more integrated, though the underlying implementations are different (for the reasons stated). Each customer needs to look at the benefits and decide what's right for them.
The second is that scaling beyond two controllers is harder. Yes, NetApp acquired Spinnaker, and yes, they are shipping ONTAP GX. The duration of the integration effort shows just how hard the engineering problem is. Delivering all the features and benefits of ONTAP Classic and WAFL (i.e. the 7G family) while merging with the Spinnaker model is very hard. I don't underestimate NetApp's ingenuity; I'm sure that they are doing it, and eventually will do it. I can only imagine the difficulty of the decision to merge or keep separate the different code paths and philosophies. Whereas EMC has been that way since day one (for better AND worse), it must be anathema to some NetApp folks. What customers need that sort of scale? It's a narrow use case, but an important one. NetApp, EMC and others have designs aimed at different use cases (ONTAP GX clearly focused on the very high-end NFS single-namespace case, EMC's DMX clearly focused on linear performance, even in degraded-hardware cases, and ultra-high system availability and scale). The market is ultimately the judge of the validity of our choices.
Closing thought - this has taken a lot of energy, for me to write and for you to read. I think the text above is a TOTAL NET ZERO addition to the value of the knowledge on the internet. It's useful for two engineers sitting down in a bar mulling this stuff over a pint. I don't think customers care, except insofar as the tradeoffs we all make express themselves in their use cases.
One thing useful for me.... this dialog has made **me** learn something.
When the SPC-1 stuff first happened, I was one of the voices crying out internally to respond in kind. And yes, I have read it, and looked at it in detail. I was outvoted, and we didn't respond via the SPC. It wasn't a matter of fear, or a lack of courage (hyperbolic/inflammatory?), or of being a "marketing company, not a technology company" (hyperbolic/inflammatory?). It restarted a broad dialog about benchmarking inside the company. We continue to participate, and will continue to participate, in public benchmarks where we see an even hand. Microsoft's ESRP (where we both have postings) and SPECsfs (where we both have postings) are examples. I don't know if these choices are right or wrong, but man - if this dialog is anything, there's something to be said for the argument that was lobbed against me by EMCers in the SPC debate: "Chad, you don't know how much work it will be just to fight the competitive benchmarks, and it's work that does little for anyone, and just spirals into Mutually Assured Destruction logic." Did we do some stuff that was bad in response? Sure - competitive teams everywhere have that as a job (man, I would hate that). That was them, not me.
Re "showing performance data even where it fails", I've posted examples from hundreds of examples availble to EMC, EMC partners, and customers in earlier posts. I'll be showing more at VMworld. I'm going to try to stay above the fray, and post as much useful knowledge as I can.
All this back and forth has made me learn something: I crossed a line I don't want to cross. I want to spend my energies and this blog focused on what we're doing to help customers, not on going back and forth. I'm sure I'll occasionally fall off the wagon, but there it is. I'm going to try.
Posted by: Chad Sakac | September 11, 2008 at 11:32 PM
Glad you hit being able to create 10K clients.
Means nothing. Absolutely nothing. Nada. Zilch. Zero. Zip.
Anyone can do it - as long as you are willing to dedicate the number of hosts required to do it in your *lab*.
It is a smoke screen - nothing more. Living with what you have decided to buy - that is the big thing, and NetApp does it simpler, with far fewer storage objects to manage. Virtualization is in part about moving from many to few. Anything else is silly, expensive and ultimately, not satisfying.
That is why EMC's biggest customers are moving to NetApp architectures. Ask us.
Period.
Posted by: Mike Shea | September 13, 2008 at 09:49 PM
Hey Mike and Nick,
I would love to see less emotion from you NTAP folks and more solid discourse here. As you guys may know, I have a unique perspective, having worked for both NetApp and EMC (now at EMC).
Chad has extended what I believe to be fair, honest and very well substantiated points around the general topic of engineering and design tradeoffs. If you took the time to read it, you should have seen that he included challenges inherent to filesystem-based architectures and mentioned BOTH the EMC Celerra and the NetApp NAS filer. Can you guys honestly say NetApp has no design tradeoffs? Why does it seem that all the posts that come from you guys can easily be summed up as:
"EMC is really, really BAD and NETAPP is really GOOD. Just look at the SPC-1 results we ran for EMC".
A little perspective:
At EMC, I work with roughly 100 customers a quarter, and shockingly, not one of them runs SPC-1 as a line-of-business app. Bottom line: customers simply do not care about marketing papers; all they care about is how the frame will work in their environment, with their own applications.
They also wonder why NetApp is constantly comparing itself to EMC. It reminds me of the Kia and Hyundai commercials where they say they have better features than a Honda Accord or Toyota Camry for a lower price. If EMC has such poor storage systems, why is NetApp always comparing themselves to us?
A final word on performance:
In an attempt to help customers meet their needs in the field, I have headed up five large-scale, real-world benchmarks with real customer data and applications at EMC against NetApp since I've been here. The results: EMC 5, NTAP 0. If EMC performance is so bad, how is this possible? Here is a press release from one customer who was so happy with EMC after benchmarking their application with both EMC and NetApp that they wanted to jointly announce it to the Street: http://www.emc.com/about/news/press/2008/20080529.htm
-Jonas Irwin
Posted by: Jonas Irwin | September 18, 2008 at 03:04 AM
Hi,
Thanks for your great blog. It is fun to read, although it takes some time :)
I have a specific interest in one part of your blog: the EMC datamover failover discussion. You say it is hard to estimate failover time as a function of filesystem count.
I have an NS40 with 8 filesystems and 3TB total capacity at 90% usage. DART is the latest 5.6 release, freshly updated.
150 VMware virtual machines are running over iSCSI, and performance is quite good.
However, datamover failover to the standby takes 600 seconds. When we watch with getreason, the standby waits for the active datamover to finish rebooting and then activates immediately. I think 600 seconds is not the failover time itself but the duration of the active datamover's reboot.
Any comment would be helpful. We are in fact considering a move to NetApp.
Thanks in advance.
Posted by: goktugy | November 10, 2008 at 05:08 PM
Thanks goktugy, and thank you for being an EMC customer. Obviously want to make sure you stay that way, and stay happy :-)
There was an issue with the earlier (5.6.3x) DART builds where failover was significantly longer than expected under some relatively rare conditions, and it looks like it may be affecting you.
Note that a lot of the best practices guides recommend extending timeouts to 600 seconds (10 minutes) for various OSes - but that is absolutely NOT the target failover time, rather a worst case. In general, with the configuration you describe, failover should be occurring in less than 2 minutes, and possibly in as little as 30 seconds (again, as I noted, with all filesystem-based devices, failover is very difficult to bound).
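As an aside, the guest-side settings those best practices guides are extending are the OS disk I/O timeouts. A quick sketch of where they live - the device name is illustrative, and 600 is the worst-case value from the guides, not a blanket recommendation for every environment:

```shell
# Linux guest: check and extend the SCSI disk command timeout (in seconds)
cat /sys/block/sda/device/timeout            # show the current timeout
echo 600 > /sys/block/sda/device/timeout     # extend it (not persistent across reboots)

# Windows guest: the equivalent knob is the REG_DWORD value (in seconds) at
#   HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue
```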
Can you tell me which DART version you are running? ("server_version server_2" at the CLI)
Posted by: Chad Sakac | November 11, 2008 at 02:11 PM
Thanks for the reply. I'd in fact like to be happy as an EMC customer, but I wonder how to achieve that within the specified failover times :)
I know that this is not a customer blog, nor a place for support or complaints. However, I believe a customer's perspective might be a useful addition, so let me share my thoughts :)
I have 150 VMware virtual machines, and the number is growing. If failover takes 2 minutes, most of my virtual machines are dead, which means a disaster for me :(
I checked the VMware timeouts, and the software iSCSI initiator (Cisco) timeout is effectively forever. Should I also check the virtualized Windows and Red Hat timeouts?
EMC and NetApp both have great features and performance, but if both carry the potential for a disaster like this, none of the customers I know would invest in them. I wouldn't have if I had known...
Back to the subject, here is my DART version ;)
[nasadmin@emcnaccs42 ~]$ server_version server_2
server_2 : Product: EMC Celerra File Server Version: T5.6.40.3
Also, I need to correct myself: it is not 600 seconds but 300 seconds (5 minutes), as my colleagues informed me.
Posted by: goktugy | November 12, 2008 at 03:10 AM
Hi again,
EMC solved the case by setting EnableAptpl=1. This setting reduced the failover/failback time from 300 seconds to 60 seconds. The 150 iSCSI virtual machines pause I/O and then continue normally.
Thanks for your support.
Posted by: goktugy | November 18, 2008 at 01:11 AM
goktugy - our pleasure. FYI for anyone else reading this thread - the datamover parameter Goktugy mentions here applies to DART revisions NAS 5.6.39, 5.6.40, and 5.6.41 (all of which are behind the current rev now), in which datamover failover regressed for datamovers with active iSCSI targets.
An alternative to applying the workaround is a DART upgrade to 5.6.42 or later, but the datamover parameter is a relatively easy fix.
Its only downside is that applying the parameter requires a datamover failover.
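For anyone needing to apply a change like this, datamover parameters are set with the server_param CLI on the Control Station. A rough sketch - note the facility name below is my assumption about where this parameter lives on a given build, so check the release notes or EMC support for the exact facility before applying anything:

```shell
# Assumed example: set the parameter on server_2 (the "iscsi" facility name is
# an assumption - verify it against your DART build's release notes first)
server_param server_2 -facility iscsi -modify EnableAptpl -value 1

# The new value only takes effect after a datamover failover/reboot
```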
Again - thank you for being an EMC and VMware customer!
Posted by: Chad Sakac | November 24, 2008 at 04:45 PM
A blog update from the future (2009)?! Please tell us if the economy is going to recover by September...
Posted by: Anonymous | January 10, 2009 at 11:17 PM
LOL - anon!
Yes, I've got good news for you. The economy stages a fantastic rebound in mid-year, well ahead of September. By the end of the year, economists are so happily confused (i.e., the recovery is that good) that they go back and do detailed analysis. The turnaround turns out to be driven by huge productivity improvements, power savings, and CapEx and OpEx reductions starting mid-year, thanks to huge innovative products from VMware, Cisco and EMC!
In all seriousness - putting aside the plug - I really, really hope it gets better. Personally, I think we're at the bottom, and it will improve - but I tend to be an optimist.
(On another note - I've got a weird date issue in my head - I struggle with my basic daily agenda - sometimes, if it weren't for my BlackBerry, I wouldn't know what DAY it is :-)
Posted by: Chad Sakac | January 11, 2009 at 01:55 PM