Ok, so what was announced?
The current generation EMC Unified platforms got some wicked cool next-generation efficiency technologies. These are:
- FAST Cache = we’ve added the ability to have up to a 2TB read/write cache using cheap hardware. Think 80-90% performance improvement in some cases, and saving a ton of dough at the same time. Crushingly good results in View use cases (boot storm, antivirus, recompose, refresh data all shown below).
- Sub-LUN FAST (automated tiering) = we’re adding the ability for the array to tier at sub-LUN levels of granularity.
- Compression = we’re adding the ability for the array to compress production storage
- vStorage API for Array Integration (VAAI) support – this hardware assists vSphere 4.ahem (that’s my way of saying “the next rev”) during storage tasks in a way that is analogous to what Intel VT and AMD-V did with vSphere 4. This last one I’ll dedicate a separate post to later….
The other really cool thing ties in with something else we’ve done – massive simplification. Each of these features, while ridiculously advanced under the covers, is ridiculously easy to use.
If you stop reading here, please take away one thing: if you’ve used EMC stuff in the past, and think (or someone is telling you), “archaic RAID group provisioning model”, or “it’s a legacy array, we’re a virtualized array”, please push back a bit – ask your partner/EMC sales team to give you stick time before you listen to one competitor bash another. I need to deal with that stuff all the time – and it’s a load of hoo-hah.
Want to understand what I mean? Ok, hope I’ve hooked you for a long ride – trust me, it will be worth it :-)
We found that as we introduced advanced feature after advanced feature, people didn’t use them unless they were “stupidly simple”. I’m not saying people using the arrays are stupid, rather that they are busy, and life is too hectic to use something unless it not only saves money, but is also easy.
Ok, the best way to “grok” what we’re talking about is to watch this, and see them in action.
(big thanks to Susan Sharpe for building the demo) Download the high-rez versions here: WMV and MOV
If you’re interested in learning more about how these work under the covers, and the real-world data, read on…
Remember these two fundamental axioms around storage efficiency (this is the way I see things at least) as we examine these new capabilities.
- “Efficiency” is multi-dimensional. It’s not about any ONE feature, but applying efficiency techniques across the board. Sometimes efficiency is measured in $/watts/space per GB. Sometimes it’s measured in $/watts/space per IOps or MBps. Sometimes it’s measured in $/watts/space per “unplanned change that has a high impact”
- Every workload means something different by “efficiency”. It’s some mix of capacity efficiency, performance efficiency, and unplanned change efficiency. More importantly, during the life of information or an app, that definition of “what is efficiency?” tends to change – along those main dimensions.
I’ve been on this soapbox for a while. Consider the diagram below as a reflection of those two axioms.
Each of the technologies down the red axis ($/GB), the green axis ($/IOps or MBps), or the blue one ($ incurred by lack of ability to change on the fly) is a general technology. I intentionally try to use non-EMC-centric products/features for each of these, just to help people realize efficiency is about ALL OF THESE THINGS.
Ok, let’s cover each of these new gems in turn. Remember, these apply on any current generation CX4 (SAN – iSCSI/FC/FCoE) or Celerra (SAN + NAS) platform.
FAST Cache:
FAST Cache is primarily a performance efficiency (green axis) technology. It leverages the fact that Flash is only about 1,000x slower than DRAM when it comes to latency. Woah. 1,000x slower? That doesn’t sound good, does it? :-) Remember what we’re comparing it to – the fastest spinning disks are 10^6 slower. That’s 1,000,000x slower. Since Flash is only about 10x more expensive than disk (and falling fast), that’s a 100x latency/$ improvement.
Pause and think about that: “how often in technology land does something come along that’s 100x better?”
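If you like seeing the arithmetic spelled out, here’s a back-of-the-envelope sketch – the latency and cost figures below are round-number assumptions for illustration, not measured specs.

```python
# Back-of-the-envelope math behind the "100x" claim.
# All numbers are round-figure assumptions for illustration, not measured specs.
dram_latency_ns  = 100                            # assume ~100 ns to hit DRAM cache
flash_latency_ns = dram_latency_ns * 1_000        # flash ~1,000x slower than DRAM
disk_latency_ns  = dram_latency_ns * 1_000_000    # fast spinning disk ~1,000,000x slower

flash_vs_disk_speedup = disk_latency_ns / flash_latency_ns   # ~1,000x lower latency
flash_vs_disk_cost_multiple = 10                              # assume flash ~10x the $/GB of disk

latency_per_dollar_gain = flash_vs_disk_speedup / flash_vs_disk_cost_multiple
print(f"Flash: ~{flash_vs_disk_speedup:.0f}x lower latency than disk "
      f"at ~{flash_vs_disk_cost_multiple}x the cost = ~{latency_per_dollar_gain:.0f}x latency/$ improvement")
```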
Now you can see why I go on and on about Flash being transformative, and how it will be applied EVERYWHERE. You’ll see it in servers, you’ll see it in array caches, you’ll see it in non-volatile storage tiers, you’ll see it in laptops, desktops, mobile devices – EVERYWHERE. Just like with the “protocol passionistas”, I tend not to listen to people who say “one way, always”. It’s just as likely that it’s simply what suits them.
Before going further, it’s important to note that NetApp was first to market with the use of DRAM-based, then later Flash-based, PCI cards that act as an extension of system memory and cache. I’ve also blogged on this topic (competition in general) before – each of us goes back and forth, each doing something first, and others (if it is good/possible) following. Note – this highlights that there are areas (and there are many) where we agree! There are areas where we disagree (amongst other things, and apropos today, I certainly don’t believe that the use of large flash-based caches obviates the need for automated tiering, which we’ll cover later). I don’t claim to be an expert on their implementation, so I won’t go on any further about it – though NetApp folks and customers are of course welcome to comment.
OK – so FAST Cache allows you to simply add solid-state storage to your array and assign it as an extension of the system cache. This approach means we can leverage the ever-dropping prices of flash and the mass-market commoditization of the form factor, and it’s cake for our customers to add to their existing arrays.
A few common questions….
Q: How much flash can be assigned?
A: Up to 2TB. I’ve confirmed that this is not an architectural limit per se, but there is system memory used for metadata and in the smaller EMC platforms, it’s system memory bound. This means that in the future, as we develop next-generation storage engines, it’s likely that this upper limit could increase.
Q: Read only, or read/write?
A: It can act as an extension to read and write cache.
Q: Does it need “re-warming” if you lose an SP?
A: No. Just like the CX4 and current generation Celerra - if the SPS (battery used during power loss so write cache can be destaged to disk) state is good, an SP failure doesn’t disable cache if you have enabled HA vault.
Ok – on to the fun part. What does the data show?
Well – one of the places where this matters most (though it applies nearly everywhere) is VMware View.
The gang in RTP have done a BOATLOAD of testing with View 4 in the heavy-IO periods: Bootstorms, AV, View Recompose, View Refresh, and guest patching.
This was done on a CX4 array with a small number of disks hosting 150 VMs. 133GB of FAST Cache was configured. Further scale-up testing is being done now, but there’s no reason not to expect linear scaling models in View use cases.
Here’s what we found….
First up to bat…
Linked Clone based boot and logon storms
On the left, the bar chart shows what the host was generating (the blue bar) vs. what the backend disk needed to service (the red bar). The line charts on the right show the LUN and guest disk response time (which governs user experience) with FAST Cache (red line) and without (blue line).
They dug in a bit more to understand the effect in more detail.
In this chart – the green represents the amount of FAST write cache hits/s, and the red shows the effect of the FAST read cache. The nearly invisible blue line shows how much was actually going to the backend (virtually nil). You can see that even in this use case, which you might assume was all about reads, it’s not. It makes sense when you think about it – the guests are going to update their swap even if you fix its size and do other optimizations. Before you ask, YES we did all the guest-level optimizations VMware and EMC recommend.
Conclusion re effect of FAST Cache on View Bootstorm use case:
- Time to boot all desktops to a usable state and log in all users decreased from 20 to 9 minutes from a completely cold cache.
- Time to boot all desktops reduced to 3 minutes from 20 minutes with warmed cache.
- FAST Cache absorbs 99% of I/O – Disks never more than 45% utilized
- During peak load FAST Cache is absorbing over 2,800 write IOPS and 10,000 read IOPS
- During peak load FAST Cache sinks enough write IOPS to saturate an entire shelf of 15K FC drives
- During peak load FAST Cache sinks enough read IOPS to saturate 60 15K FC Drives
- Estimated number of drives required to match FAST Cache performance: 66
- Peak response time decreased from 225ms to less than 50 ms
Good huh?
Ok, next up to bat – View Recompose
On the left, the bar chart shows what the host was generating (the blue bar) vs. what the backend disk needed to service (the red and green bars). The line charts on the right show the LUN and guest disk response time (which governs user experience) with FAST Cache (red line) and without (blue line).
They dug in a bit more to understand the effect in more detail.
In this chart – the green represents the amount of FAST write cache hits/s, and the red shows the effect of the FAST read cache. The nearly invisible blue line shows how much was actually going to the backend (virtually nil). There’s a healthy impact on both the read and write caching models.
Conclusion re effect of FAST Cache on View Composer Recompose use case:
- Replica copy happens almost instantaneously with FAST Cache enabled
- Disk average IOPS decreased by 70% and peak IOPS by 67%
- FAST Cache would enable pool recompose operations during production which could be critical in failover, disaster recovery scenarios and change control windows
- FAST Cache absorbs 98% of I/O – disks are never more than 28% utilized
- Host response time never exceeds 2 ms during recompose
- Estimated number of drives required to match FAST Cache performance: 13
Next… View Refresh.
On the left, the bar chart shows what the host was generating (the blue bar) vs. what the backend disk needed to service (the red and green bars). The line charts on the right show the LUN and guest disk response time (which governs user experience) with FAST Cache (red line) and without (blue line).
They dug in a bit more to understand the effect in more detail.
In this chart – the green represents the amount of FAST write cache hits/s, and the red shows the effect of the FAST read cache. The nearly invisible blue line shows how much was actually going to the backend (virtually nil). Just like in the other use cases, read cache and write cache had a considerable impact.
Conclusion re effect of FAST Cache on View Refresh use case:
- Refresh lasted the same duration with and without FAST Cache due to vCenter throttling
- Average disk IOPS decreased by 73%
- Peak disk IOPS decreased by 75%
- Substantially lower average and peak disk usage with FAST Cache enables remediation of virtual desktops while still delivering high performance
- On demand refresh of non-persistent desktop pool members can be scheduled in real time during production hours which is an enabling technology for truly non-persistent pools of stateless desktops
- Scale testing should show FAST Cache will enable significantly higher user-per-spindle density due to decreased disk workload
- FAST Cache absorbs 99% of I/O – disks are never more than 24% utilized
- Host response time never exceeds 5 ms during refresh
- Estimated number of drives required to match FAST Cache performance: 18
Ok – how about Anti-Virus?
On the left, the bar chart shows how long it took to do the AV scan without (blue) and with FAST Cache (red). The line charts on the right show the LUN and guest disk response time (which governs user experience) with FAST Cache (red line) and without (blue line).
They dug in a bit more to understand the effect in more detail.
In this chart – the green represents the amount of FAST write cache hits/s, and the red shows the effect of the FAST read cache. The nearly invisible blue line shows how much was actually going to the backend (virtually nil). As you can see – in the A/V scan use case, it’s almost completely dominated by the read cache.
Conclusion re effect of FAST Cache on View Antivirus scan use case:
- Time to scan a desktop decreased from 67 to 15 minutes with FAST Cache
- 87.2% decrease in peak LUN response time
- Peak Host Response time decreased from 382 ms to 11.86 ms
- Average time to scan a desktop decreased from 67 to 15 minutes with FAST Cache
- Disk utilization never above 10% - FAST Cache effectively handles 100% of I/O once warmed
- During peak I/O FAST Cache handles over 1,100 write IOPS and 14,300 read IOPS
- Estimated number of drives required to match FAST Cache performance: 83
Ok – how about Patching?
On the left, the bar chart shows how long it took to do the full patch cycle without (blue) and with FAST Cache (red). The line charts on the right show the LUN and guest disk response time (which governs user experience) with FAST Cache (red line) and without (blue line).
They dug in a bit more to understand the effect in more detail.
In this chart – the green represents the amount of FAST write cache hits/s, and the red shows the effect of the FAST read cache. The nearly invisible blue line shows how much was actually going to the backend (virtually nil). You can see that even in this use case, which you might assume was all about reads, it’s not. It makes sense when you think about it – the guests are going to update their swap even if you fix its size and do other optimizations. Before you ask, YES we did all the guest-level optimizations VMware and EMC recommend.
Conclusion re effect of FAST Cache on View guest patch use case:
- Heaviest write workload of all tests: 45:55 read:write ratio
- During peak workload FAST Cache is absorbing 2200 write IOPS and 9000 read IOPS
- Estimated number of drives to equal FAST Cache performance: 60
What about the effect during periods of plain-jane “steady state” View client use?
For this test, we ran the RAWC workload which simulates a lot of things the client would do (common tasks and applications).
It’s clear that in this case, the configuration wasn’t storage bound (like the other lifecycle states). But, even here, the lower latency of the FAST Cache had a positive effect on the end user experience in many of the use cases.
BTW – I’ve uploaded a PDF of the slides (has all the high-resolution graphs) here.
So – what do you think?…
BTW – a big thank you to the EMC solutions team in RTP (Dave Boone, Aaron Patten, Greg Smith, Venkat Rao, Venkateswarara and all the others – you guys rock!)
OK – on to the next two efficiency features :-)
Sub-LUN FAST
Note that in my “Chad’s 3 dimensions of efficiency decoder ring” above, the idea of auto-tiering is BOTH a capacity efficiency and a performance efficiency feature. How’s that?
The reason is very basic and comes down to 5 things:
- Economics of IOps. Today SSD is orders of magnitude cheaper than magnetic media when it comes to $/IOps (it’s already 100x or so and accelerating) and watts/IOps
- Economics of GB. For a while longer, spinning rust (magnetic media) will be 10-20x cheaper measured in $/GB. It’s interesting (and inevitable) that in the VERY long term, solid-state or other next-generation media will likely also cross over in $/GB (though in that timeframe the metric will be $/TB) – some models put that out in 2015/2016 – that’s not that far!!!
- Long tail effect. Looking at any customer across all the stuff on their arrays, you see a “long tail” phenomenon.
- There’s a small (consuming few GB) amount of data HAMMERING the array (consuming many IOps).
- There’s a long tail representing a huge volume of data (consuming many GB) that is not driving the array very hard (consuming few IOps).
- Change over time is guaranteed. “What data is doing what” sometimes is constant, but in the vast majority of cases, it tends to vary over time. You do batch jobs against databases. Old projects need accessing again. Tables/rows in databases become really important, then are never touched again. Also, sometimes the busy data isn’t in a LUN or a file, but a PART of a LUN or file (consider the case of a swapfile in a guest OS).
- You can do things to make the pool more efficient, but in the end, the pool is the pool. The disk resources in the array can be thought of as a “sink” (like a heatsink). It can “handle” a maximum amount of IOps and store a maximum amount of GB. The efficiency techniques all vendors apply sit on TOP of the fundamental characteristics of the pool. To understand this, consider deduplication and compression. They take a fixed asset (the pool of GB you have) and squeeze more efficiency out of it. Likewise, consider that while you can “accelerate” a pool of storage by using a larger cache in front of it (this improves the $/IOps metric), in the end, the “pool” behind the brains of the array has a total amount of GB it can store, and a total number of IOps it can deal with.
Do the math, and come to the obvious conclusion 1+2+3+4+5 =
“it’s great to improve the capacity/efficiency of the pool but it’s even better if you can do that and change the whole underlying economics of the pool in the first place.”
That’s what Virtual Provisioning and large pools with sub-LUN-level auto-tiering do. It isn’t about performance itself, like FAST Cache – but rather about changing the underlying economics of the storage underneath the other advanced features (which improves $/unit of performance and $/unit of capacity).
Let’s make this obvious so there is NO DEBATE.
- A 15K RPM disk can do ~200 IOPs, and current state of the art is 450GB raw per drive, and let’s call it $2K per enterprise drive.
- A large SATA disk can do about ~80 IOps, and current state of the art is 2TB raw per drive, and let’s call it $1K per enterprise drive.
- A Flash-based disk can do about 2500 IOps – but that’s us being crazy conservative – 5000 is normal, and current state of the art is 400GB raw per drive, though 200GB drives are more of the sweet spot; let’s call it $10K per drive.
Ok – so let’s say you create a pool of 100 disks from different mixes of drive types, and then consider the total “sink capacity” of that pool both in terms of IOps it can sink and GB it can store.
| Drive mix in the pool | Total IOps of the pool | Total GB of the pool (raw) | Total acquisition cost of the pool |
| --- | --- | --- | --- |
| All slow 5.4K RPM | 8,000 IOps | 200,000GB = 200TB | $100,000 |
| All fast 15K RPM | 20,000 IOps | 45,000GB = 45TB | $200,000 |
| All SSD | 500,000 IOps | 20,000GB = 20TB | $1,000,000 |
| Mix of 10% SSD, 20% 15K RPM, 70% 5.4K RPM | 59,600 IOps | 151,000GB = 151TB | $210,000 |
| Mix of 20% SSD and 80% 5.4K RPM | 106,400 IOps | 164,000GB = 164TB | $280,000 |
So – dumbing the table down (and if you want to play with the mix yourself, see the sketch after these bullets):
- the 80/20 mix delivers 80% of the capacity and 1325% of the performance of the all-SATA config at 280% of the price.
- the 70/20/10 mix delivers 335% of the capacity and 295% of the performance of the all-15K RPM config at about a 5% higher price. Put another way, you could shave a little off the top, use 10% less total footprint, and at the same price get about 300% more capacity and 250% more performance.
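Here’s that minimal sketch for playing with the drive mix yourself – it reproduces the table above. The per-drive numbers are the same rough assumptions from the bullets earlier (80 IOps / 2TB / $1K for 5.4K SATA, 200 IOps / 450GB / $2K for 15K FC, 5,000 IOps / 200GB / $10K for SSD); illustrative figures only, not a price list.

```python
# Rough "sink capacity" calculator for a mixed-drive pool.
# Per-drive figures are the same rough assumptions as the table above
# (illustrative only, not price-list or spec-sheet numbers).
DRIVES = {
    # type:      (IOps per drive, GB per drive, $ per drive)
    "sata_5.4k": (80,    2_000,  1_000),
    "fc_15k":    (200,     450,  2_000),
    "ssd":       (5_000,   200, 10_000),
}

def pool_capacity(mix):
    """mix maps drive type -> drive count; returns (total IOps, total GB, total $)."""
    iops = sum(DRIVES[t][0] * n for t, n in mix.items())
    gb   = sum(DRIVES[t][1] * n for t, n in mix.items())
    cost = sum(DRIVES[t][2] * n for t, n in mix.items())
    return iops, gb, cost

# The 100-drive pools from the table
pools = {
    "all 5.4K SATA": {"sata_5.4k": 100},
    "all 15K FC":    {"fc_15k": 100},
    "70/20/10 mix":  {"sata_5.4k": 70, "fc_15k": 20, "ssd": 10},
    "80/20 mix":     {"sata_5.4k": 80, "ssd": 20},
}
for name, mix in pools.items():
    iops, gb, cost = pool_capacity(mix)
    print(f"{name:14s}: {iops:>7,} IOps, {gb // 1_000:>4} TB raw, ${cost:,}")
```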
Of course, these days, customers use 5.4K, 7.2K, 10K and 15K rpm disks (note that this is independent of the connectivity type) in the same array. Moral of the story? Solid state is a huge savings technology applied to pools of non-volatile storage….
…but only if you have a very specific dataset which is localized to the SSD OR the array can auto-tier at the sub-LUN level.
Remember, things like enhanced caching, deduplication and compression, thin provisioning, these are all data services that layer ON TOP of the underlying characteristics of the pool. Consider them an “and” not an “or”.
Of course, it’s not easy from a technology standpoint, and it will be harder in some architectural models than others. There’s also the question of increased metadata management (which costs storage processor CPU and memory) and the performance impact of moving elements of storage – these drive a balance in the implementation (primarily around the degree of granularity and the speed/impact of movement), and that balance will vary based on many factors from platform to platform, and from vendor to vendor.
As an interesting note, over time we will expose APIs for application vendors to see this metadata and further integrate with the underlying policy.
There’s also one other big thing – dramatic simplification of storage management.
Not only is “RAID group” oriented layout (“tell me your exact requirements in advance and I’ll put that LUN/filesystem right HERE”) a thing of the past – the whole provisioning model of “pools” is even changed by this. Adding IOps or GB to the pool is as simple as adding more SSD or SATA drives to the pool.
Ok, so, when it’s all put this way – it seems to me you have to be an idiot (at least on a technical level) to dispute that automated sub-LUN tiering is going to be something that EVERY storage vendor is gonna have to do.
So – faithful VMware readers – how does this apply to you?
The short version? The answer is that EMC just announced that we are delivering on the idea of “Storage DRS” with sub-LUN tiering, and will save you a lot of money, and make provisioning a LOT easier.
It does mean that a datastore is “non-linear” in performance – literally, parts of a VM that need to be faster will be in one place on one type of media, and parts of a VM that are fine being slower will be elsewhere in the pool. How’s that for virtualized storage :-)
Now, unlike VMware DRS’s effect on CPU and memory, it’s impossible for the storage to react in real time to changing demands (it takes some analysis over time, and it takes some time for the data to move).
VMware has publicly discussed an upcoming feature in future vSphere releases called Storage IO Control (SIOC) that we think complements sub-LUN FAST. You can read about it here. The benefit of SIOC is that it kicks in immediately when the IO latency threshold is hit in the guest. The downside is it doesn’t resolve the underlying problem until the administrator does something.
Put them together, and in theory, it’s like peanut butter and jelly – SIOC deals with the crisis, and the sub-LUN FAST deals with the problem. I note “in theory” only because the testing is now just starting in the integrated use cases.
Cool huh?
Ok – last stop on the new efficiency features tour…
Compression
Now that the Celerra and CLARiiON engineering teams have merged, and Rich Napolitano – who heads up the EMC Unified Storage Division – has pushed hard over the last few years on what he calls the “provider/supplier” model (where the various sub-teams collaborate and share functional code), there is an awesome amount of integration benefit we are seeing.
Most apparent is of course the unification of management with Unisphere, but there are deep under-the-covers benefits – and one of those is around the virtual-provisioning layer of the codebase, which also happens to provide many efficiency technologies.
EMC has been delivering file-level deduplication (“these two files are the same – eliminate one, and use pointers”) and compression for Celerra NAS use cases for some time, and it has been available on the market for a while.
A nice thing about the implementation is literally “you just turn it on” and save a whackload of NAS storage (it operates as a background task). There is no effect on filesystem configuration or other array features. This is the operational model we’re aiming for with what we call “data services”.
What to expect varies based on the dataset, but in most general NAS (CIFS/NFS) use cases customers tend to find about a 50% savings on top of thin provisioning. In NFS with VMware use cases, the compression capability tends to give about a 30-50% savings on top of thin provisioning.
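To make the stacking concrete, here’s a quick worked example – the compression percentage is just the rough figure above, and the provisioned-vs-written split is an assumption for illustration, not a guarantee.

```python
# Illustrative stacking of thin provisioning + compression savings.
# The 60% "actually written" figure and the 50% compression ratio are
# assumptions for illustration, not guaranteed results.
provisioned_gb = 10_000            # capacity presented to hosts
written_fraction = 0.60            # assume hosts have actually written 60% of it
compression_savings = 0.50         # ~50% savings on general NAS data (per above)

physical_gb = provisioned_gb * written_fraction * (1 - compression_savings)
print(f"{provisioned_gb:,} GB provisioned -> ~{physical_gb:,.0f} GB actually consumed")
# -> 10,000 GB provisioned -> ~3,000 GB actually consumed
```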
VMware folks know that the EMC vCenter plugin for NFS use cases enables you to compress individual VMs.
So now this compression feature is available for block use cases on EMC’s Unified platforms too – and you just “turn it on”.
One thing to understand about compression….
EMC primary storage compression is NOTHING LIKE “winzip” :-)
Remember that when you are listening to music on your iPod, the iPod is doing a real-time decompress – and it’s totally transparent. Ditto, when you take a picture with your camera/phone – it is doing a real-time compress – taking the raw input from the CCD and storing it natively in a compressed format (JPEG). Those are analogous to what we are doing behind the scenes. A “zip” it ain’t :-)
You can have compressed devices in a storage pool as shown in the demo – so in effect it is a tier.
OK – wrap up….
- Is this stuff cool or what!
- EMC is second to NONE when it comes to efficiency. Like everything there will be some features where we are ahead, some where we are behind, but put them together and we are second to none. BTW – I didn’t even talk about the other efficiency stuff we do. Remember – every customer has many use cases on “Chad’s Storage Efficiency Decoder ring” – it’s not about any one thing, rather about how they all fit together and your use cases.
- Remember – it will be part of the next FLARE revision for customers using CX4 and current generation Celerras.
- We’re not stopping here. We’re actually now in “full on acceleration” of the upside of hard choices and hard investments made in the last two years in the Unified Storage Division. In my years at EMC, I have never seen a roadmap this rich. You can expect even more in the near term.
Thanks, and would love your feedback! What do you think? Are we on the right track/wrong track? What are we missing?
Wow! A lot of information to consume. Thanks Chad - as always, your posts are amazingly detailed and very clear.
Any technology that we can use that will make our storage, and thereby our infrastructure, perform better, faster, and for less $$ - is always welcome.
Keep them coming!!
Posted by: Maish | May 11, 2010 at 02:31 PM
Solved the December problem with VDI impact on storage. What's next? :-)
I would have an idea for "compression". Why don't you use DataDomain technology to deduplicate instead of "compress"? Of course you'd need to do it in a more random r/w style than DataDomain's sequential needs. Also, it should be enabled on a LUN basis, knowing the CPU/delay/etc impact. You could also use FAST Cache for the containers. Crazy idea, but... who knows.
All the best,
CR
Posted by: Cristi Romano | May 11, 2010 at 07:17 PM
I can think of at least one more use case: vCenter SRM Failover TEST.
If I run some Recovery Plans at the same time, the same boot storm that occurs in VMware View will happen here..
Itzik
Posted by: Itzik Reich | May 12, 2010 at 09:50 AM
Itzik:
Or MS patching with SCCM / WSUS. Patch windows are often limited, so you wind up boot storming those too :-)
(Unfortunately, servers usually aren't linked-cloned - they won't fit as well in the cache)
Posted by: Dave B | May 12, 2010 at 04:37 PM
Thanks for the great post Chad, Interesting use of SSDs as cache.
Since I'm with NetApp, naturally I have some questions regarding the new caching scheme.
I keep reading in the various EMC SSD cache posts "we cache writes!"
Caching the writes is necessary with EMC's architecture, NetApp uses a different way of writing to the disk, but anyway, that's a different discussion.
My questions:
1. At what part of the write path is the SSD cache? More like a second level cache?
2. What's the page size? Same as sub-LUN FAST (768KB?) or something smaller?
3. Is it tunable by LUN or some other way?
4. What's the latency? NetApp developed the custom cache boards because they fit right in the PCI-E slots of the controllers, for maximum throughput and lowest latency.
Thanks!
D
Posted by: Dikrek | May 13, 2010 at 12:49 PM
When you get sub-LUN FAST, doesn't that negate much of the need for FAST Cache in most environments? The flash cache is still disks in your disk enclosures, not on internal expansion boards like NetApp. So the "cache" I/Os still flow through the same back-end data channels. This seems to me more of an 'interim' feature until sub-LUN optimization is possible.
So I think the characterization that you are reducing disk I/O is a bit misleading. You are reducing rotating disk I/Os, but you could just have a whole LUN on flash and reduce spinning disk I/Os to zero. Or wait for sub-LUN FAST and then frequently used blocks are on SSD.
Posted by: Derek | May 15, 2010 at 06:48 PM
@Dikrek:
Every array is different architecturally in many, many ways. [warning - I do not claim to be a NetApp expert]. NetApp's approach of NVRAM and journaling means that write caching isn't needed in the same way that EMC's is - there's no 1:1 analogy.
That doesn't mean it's not material, as large spike write workloads that could be buffered might not be - forcing NVRAM flushes to be faster than the timed flush (and the backend to work harder and less optimally). Hence the ongoing growth in NVRAM over the years.
But - it is incorrect for anyone to make a 1:1 correlation and jump to an erroneous conclusion.
Of course - NetApp is not the only EMC storage competitor :-) I try to avoid writing a post with any competitor in mind, so any comment I make isn't directed at anyone in particular.
Answering the questions:
1) the cache pages are small - the default is 8KB (can be customized). FYI as a correction (if we're going to compete, it might be on a correct basis) the sub-LUN FAST granularity on midrange is 1GB (as of the July release), and will be 768KB initially on the Symmetrix. Smaller sub-LUN granularity needs more "oomph" to pull off at scale. Expect, however, this granularity to continue to increase as hardware continues to accelerate. We have designed towards the massive multicore generation we are now in.
2) although to the system it looks like an extension of system cache, it is hierarchically below main system cache (metadata operations strive to leverage system read/write cache before hitting FAST Cache).
3) Yes, it's totally tunable on a LUN/Filesystem basis. Of course, on top of cache handling, we can apply QoS policy to the other shared system attributes.
4) Latency is fantastic - micro-seconds - thousands of times better than rotating magnetic media. We looked at length at the various ways to implement this. Latency was a few nanoseconds better when the flash was directly on the PCIe card itself. But, there were a few really big downsides to that approach:
a) We couldn't make the cache shared across controllers - so a customer would need to get twice as much, and ALUA behavior couldn't expect cached content when devices moved. Also, on SP failure/reboot - the cache needed to be rewarmed - not what people expect. Did NetApp crack this, or are the PAM II/Flash Cache cards needed on each storage processor and act independently? That seems to be a big downside if it were the case... I think I know the answer, but don't want to presume - since I'm not an expert on NetApp. Would love to hear it from someone who is....
b) it was harder to make it very simple, very easy to add more cache via the PCIe card way vs. the "just add SSD" way. This is material as the cost per unit of flash is changing VERY fast. With the EMC Unified FAST Cache design, it's a customer-installable option. They can start with as little as two 200GB SSDs for a few thousand dollars, and then add more in the future as their needs grow, at which point SSDs will likely be 2-10x cheaper. This design made it easy for all current generation EMC midrange customers to get started efficiently and easily.
Again, I'm not an expert on NetApp, so perhaps you can tell me - what's the minimum incremental cost of a small PAM II/Flash Cache module? This component - how quickly will NetApp be able to keep up with the plummeting commodity cost of SSD with their specific part - which will sell a fraction of the total SSD market?
And of course how easy is it to add PAM II/Flash Cache to an existing environment - how disruptive is that operation? Surely removing the entire array storage processor, disassembling it and adding a PCIe card - doesn't sound simple or easy to me. Or, is the proposal the cost-efficient and simple "just buy a new box, and migrate everything over"?
Thanks for the questions, and would love to hear your thoughts on my questions!
Posted by: Chad Sakac | May 15, 2010 at 11:05 PM
@Derek:
If you read the answer to Dikrek's question - you understand that sub-LUN FAST and FAST Cache are complementary capabilities.
the characterization of reducing disk IO (and improving latency) is NOT misleading (and borne out/supported by the oodles of data I put in the post).
Of course, you're right in the sense that if you just construct the whole config out of SSDs, FAST Cache isn't needed - but of course, this is economically impossible today with the current (current!!) $/GB of SSD.
No - until SSDs have roughly the same $/GB as slow magnetic media (which WILL happen - but not for many years), we will have to have shared storage subsystems with both SSD and slow magnetic media.
Now, before going on - I want to restate what I mentioned above:
FAST Cache has latencies for cached IOs that are measured in microseconds; the fact that there are loops and SFPs in the middle only adds nanoseconds. With both NetApp's approach and the EMC one, the microsecond-class latencies come from the Flash controller and the Flash itself - the only way to get it down even further into nanoseconds is to use SRAM (aka regular cache), though that drives cost up a lot.
It is notable that EMC does offer TB-class global shared SRAM caches on the enterprise arrays. In those cases, SRAM connected via SFP/backend loops would be a bad idea (since SRAM latency is nanosecond fast, so the incremental nanoseconds are material) - and of course, that's not how EMC does it for the global shared SRAM cache on VMAX (which use a very low-latency serial interface)
So, if the comparison latencies for a PCIe flash-based cache and a flash-based cache connected via SFPs/loops are the same, let's talk about how FAST Cache and sub-LUN FAST are like... well.. an "AND not an OR" (sorry NetApp folks - I can't help it - that marketing slogan you're using makes me laugh :-)
sub-LUN FAST **doesn't** move a page from one tier to another immediately (that would be a lot of unnecessary and not "free" IOs). There's a set of metadata that tracks usage, and based on internal profiling, moves it over time. This means that when there is a momentary IO requirement, at any given moment, it is served by **where it is**. If it's on SSD, it will be served in microseconds (roughly same speed as if it was in FAST Cache), but if it's not, it will be served up in milliseconds (how many milliseconds will depend on 15K, 10K, 7.2K or 5.4K rotational speed more than anything else).
FAST Cache means that writes will immediately be completed (buffering up more IO, "deduplicating" reads, and enabling the array to spend fewer backend IOs by doing more coalescing and other things) and increases the likelihood that a read will be served from cache (in microseconds, thousands of times faster than the milliseconds of the fastest rotating magnetic media).
The main effect of sub-LUN FAST is: enabling you to build a configuration with a given amount of IOps and a given amount of GBs for cheaper, by mixing drive types into a big pool that sorts itself out over time - ergo cost/GB efficiency. Just remember, an SSD is 170x cheaper than magnetic media RIGHT NOW when it comes to $/IO. Huge slow SATA is 20x cheaper than SSD when it comes to $/GB RIGHT NOW. Of course, workload requirements change over their life, and optimizing the cost per IO and GB over that lifetime - that's what sub-LUN FAST makes more efficient (and since it's transparent - it makes it easier to configure a big pool).
The main effect of FAST Cache is: improving the response time and overall efficiency of the system for every IO, regardless of where it finds itself at any given moment - ergo IO latency/$ efficiency.
How they work together: heavy write and read IO workloads, if consistently heavy, will be on SSDs (sub-LUN FAST will move them). Periodic bursts of read/write IO workloads will get SSD response from FAST Cache, which will drain to slower magnetic media (since sub-LUN FAST will have moved the not-heavily-used workloads to slow media).
The sub-LUN FAST and FAST Cache combination is the best of both. Both are in the same EMC Unified Storage software release (in Beta now, GA in July).
As I showed in the demos above - configuring these is so simple, so automatic, that we expect it to become the default model (big pools that figure themselves out).
I suppose my question to NetApp folks out there (since these questions are both very NetApp centric) - do you feel **SO** confident in WAFL under all workload conditions (remembering of course that it was invented almost 20 years ago now) that you buy the "we don't need to auto-tier" argument?
I'd have to imagine that with the underlying NetApp plex/aggregate/flexvol structure, auto-tiering would be architecturally hard - perhaps that's the root of the - current - "we don't need it" worldview that seems to be going on there?
I think that the fact that many (Compellent of course, but 3PAR and others as well) are delivering sub-LUN automated tiering (to leverage the trend to SSD and large SATA economics), and some (we certainly are) are also doing large low cost cache (eg FAST Cache/PAM II) models would start to beg the question:
**what has NetApp indicated about auto-tiering plans, and what is the REAL plan - surely people aren't really telling Georgens that it's "just not needed"???**
Posted by: Chad Sakac | May 15, 2010 at 11:08 PM
Chad - ntap is the new baby shampoo. NO TEARS!!! I'm waiting for them to put their money where their mouth is and only ship a single tier: SATA. Until they do that, their customers still need to manually figure out what data goes on SATA and what data goes on FC.. This is complex, and adding native SSD to their systems (if they ever do as Jay Kidd said they would) will make things even more complex.
I agree with your comment about ntap not being EMC's only competitor. If I remember correctly, they are somewhere around #5 (or is it 6?) in the overall storage market. Maybe I don't get it but wouldn't it make more sense for them to try and knock off their nearest competitors first before throwing rocks and broken glass at EMC?
I'd like to point out a few things about ntap's architecture that may help you understand things a bit better. If someone from ntap or anywhere else can enlighten me with new data or features that I’m not privy to, I’d love to learn more.
WAFL does a decent job with purely small-block random workloads only when they have boatloads of very contiguous free space. Take an empty ntap and an empty emc box. Carefully fill both systems with say 10-20% or so data.. Then run a heavy random load with lots of writes, but be very careful to avoid aging the system beyond even 12 or so hours. Surprise! ntap will win the race every time! It's no accident that all the ntap-funded "independent 3rd party tests" comparing emc and ntap arrays are intentionally set up this way. Take an average of all their "independent 3rd party" papers' use of space, and you will find that they use no more than 10% usable-to-raw disk allocated to their test kits. Ntap's biggest architectural challenge, which will likely keep them at #5 in the market, is that customers simply don't run their arrays like this in the data center.
Once you start filling a ntap system, taking and deleting snapshots and most importantly aging their file system, even over a moderate period of time (say 100 hours), the ntap performance profile changes very dramatically for the worse.. Why? Metadata becomes a challenge (not only in the form of space overhead) as there are many levels of indirection that a filer must query to find the real physical block that is being requested for a simple read, for example. One simple read I/O can easily end up turning into five I/Os (or more) to traverse the metadata structures and locate the actual physical block.. This phenomenon can end up creating massive read latency many times over and the poor customer is left trying to figure out what happened to their initial zippy performance. This issue is also why ntap has almost no footprint in the data warehouse space...sequential reads almost always are turned into random reads. Their PAM cards can help mask this behavior (for random loads) by bringing their metadata read overhead onto SSD, which can help, but those I/Os still need to be performed somewhere. PAM cards are a very expensive "customer paid fix" for ntap's intrinsic scaling problem IMHO. I'm not sure, but in addition to needing PAM cards in each filer head (2X the cost to use in both sides of the cluster, and needing to be rewarmed over many hours when a routine failover occurs), and chewing up slots on the PCI bus, wouldn't that limit the overall scale of front-end or back-end ports you could add to their filers?
On one hand ntap tells customers to build the "big magical pool of storage" to share all available I/Os to all apps, but on the other hand, they limit the size of the aggr to 16TB, which seems really small with the growing size of drives these days. Maybe when they were shipping 144GB drives (who makes these? where did that extra 2GB go? ;-)) it might have worked okay.. I think they can do better with ONTAP 8 but I've never met anybody who actually runs ONTAP 8 yet..
I only mentioned the impact on reads above, but what happens to what they tout as their "advantage": writes? Writes are also a huge challenge when the system is aged out and moderately full, because free space is now unfortunately trapped in smaller holes and new writes must seek many times over to find the open free space.. Again, this results in unexpected latency and unpredictable behavior. The reallocate utility attempts to come to the rescue, but it needs a ton of CPU resources and lots of time, and will drag the system down.. I suppose this is why many ntap arrays rarely get more than 1/3 full with the disks their spec sheets say they can scale to in the real world.. I can't tell you how many shops end up with 30 or more of these things (only 1/4 to 1/3 full)..scattered all around their data centers.. It's as if ntap has created a new industry storage category called "Networked DAS".
Posted by: Jonas Irwin | May 16, 2010 at 06:11 PM
@ Jonas:
Throwing FUD is not conducive to respectful selling. Those same points have been the mantra of anti-NetApp competitive sales for the last 10 years, but the real-life success stories, the company’s earnings and amazing growth tell the real story.
I have large customers with 10,000+ replicated snaps on their arrays, seem to be running just fine... (full, lots of I/O, data warehouses, complex DBs, tons of VMware etc – all without PAM). Funny that, the snapshot comment coming from EMC, a company that only allows 8 snaps per LUN (and with a well-publicized huge 50% performance hit…)
Indeed, even though you work for EMC, you will probably use our storage at least a few times today, since we provide the back-end disk for most of the online providers.
Maybe you need to read http://bit.ly/aNMwon and http://bit.ly/cnO2
Back to actually discussing technology.
This is turning into a post about NetApp instead of answering Chad’s legitimate questions. Let's put it this way:
NetApp provided thought leadership with shipping the PAM cache years before EMC even announced something similar (let's not forget FLARE 30 or sub-LUN FAST with the gigantic 1GB chunk are not even here yet and won't get initial wide adoption until matured). It's silly to think we're not working on new stuff for others to have to catch up on (again) :)
Regarding thought leadership in auto-tiering: Compellent was first with their auto-tiering and has a 512K minimum chunk. How do they do it?
Regarding thought leadership in (true) Unified Storage: NetApp, obviously. The (true) unified EMC system is coming what, (maybe) 2011? Almost 10 years later than NetApp?
Regarding thought leadership in true block-level deduplication of all primary storage protocols: NetApp again. Nobody else is there yet.
What about deduplication-aware cache? Which, in turn, deduplicates the cache itself. Since nobody else deduplicates all primary storage protocols at the block level, nobody else has this cache deduplication technology.
Enough with the trash talk. BTW, I like the V-Max. I hope Enginuity is getting the SSD cache.
Auto-tiering is a great concept but everyone doing it seems to suffer from potential performance issues due to the fact the data movement algorithm won't react fast enough to rapidly changing workloads. It can work well if the workload is predictable and stable over time – enabling you to just dump your data into an array and have it figure out (eventually) where the different hot/cold areas should reside.
The addition of huge chunks of cache goes a great way towards alleviating this, but it's only part of the answer. Otherwise, it's a solution waiting for a problem. Good for some workloads, but not all. Great to have if it gets out of the way when needed.
To answer Chad's question: Each cache card is separate and only seen by each controller - this is, fundamentally, an architectural difference, and it seems to work well in the real world. Upon controller failure the other cache card has to get warmed up with the workload from the failed controller. The cards are fast enough that this happens very rapidly (each board is much faster than several STEC SSDs, the benefits of a custom design - and no, the warm-up doesn’t take "many hours").
But, of course, I will not just go ahead and divulge the NetApp roadmap just because Chad is asking :) (just as Chad wouldn't divulge EMC's roadmap if I were asking, no matter how nicely).
I’ll give you my thoughts on the no-tiering message (may or may not agree with the NetApp CEO, it’s my own opinion):
In many situations, a decently designed box (NetApp with PAM, XIV, possibly CX with FLARE 30 and SSD cache) can get a lot of performance out of just SATA (NetApp has public SPC-1 and SPEC benchmarks for both OLTP and file workloads where PAM+SATA performed just as well as FC drives without PAM).
However, I don’t believe a single SATA tier covers all possible performance scenarios just yet (which is why I don’t agree with the SATA-only XIV approach – once the cache runs out, it has severe scaling problems and you can’t put any other kind of drive in it).
When I build systems, there are typically either 1 or 2 tiers + PAM. Never more than 2 tiers of disks, and very frequently, 1 tier (either all the largest 15K SAS drives, or all SATA if the sizing allows it). I see it this way:
It’s fairly easy to put data that should be on SATA there in the first place – most people know what that is. If you make a mistake, the large cache helps with that. It’s also fairly easy to put the rest of the data in a better-performing layer. Is it ideal? Not really. Should tiering be automated? Sure. But, until someone figures out how to do it without causing problems, the technology is not ready.
I will leave you with a final question: For everyone doing sub-LUN auto-tiering at the moment, how do you deal with LUNs that have hot spots that are spatially spread out on the LUN? (this is not an edge case). For instance, let’s take a 2TB LUN (say, for VMware). Imagine this LUN is like a sheet of finely squared paper. Now, imagine the hot spots are spread out in the little squares.
Depending on the size of your chunk, each “hot” chunk will encompass many of the surrounding little squares (pity I can’t attach an image to this reply), whether they’re “hot” or not.
With sub-LUN auto-tiering, the larger the chunk, the more inefficient this becomes. Suddenly, due to the large chunk size, you may find half your LUN is now on SSD, where maybe only 1% of it needs to be there. Cache helps better in that case since it’s a small block size (4K on NetApp, 8K on EMC). It’s an efficiency thing.
It’s not that easy for a cool concept to become useful technology.
D
Posted by: Dikrek | May 17, 2010 at 04:54 PM
@Dikrek -
It wasn't my intention to turn Chad's post into a debate with ntap. I could easily respond with lots of questions for you, as well as easily refute all the stuff (fud?) you said about emc, but instead will save that for another time and place ;-). Perhaps over a beer or something.. To respond with a little honey instead of the predictable piss and vinegar - we agree that you guys were the first with PAM..nice work! If memory serves..we had EFD, and that still has merit for truly cache-hostile workloads.. that's probably why you guys still partner with texas mem with vseries..
I'll try to answer your statements and questions about 1GB slices being "huge". Really? 1GB is only 0.05% of a single 2TB SATA drive. Storage pools will easily have a hundred drives of mixed types but will very often predominantly consist of 2TB drives. For the CX, I'd argue 1GB is relatively small. Is it perfect? Nope. 512KB sounds great but the tradeoff is all the metadata baggage that needs to be stored and tracked for each slice. Each auto-tiering implementation has pros and cons I suppose. We've seen some phenomenal results with the 1GB slice but it's by no means an answer to all workloads and all types of data. You could probably play with IOMeter and create an artificial workload that makes it look really bad :-). For cold data, mega caches complement auto-tiered data quite nicely with the added benefit of accelerating not only reads but writes as well. The long term benefit of FAST is that all the stale stuff, which is most of what sits on the array, ends up trickling down to low-cost SATA.. Like it or not, it makes for a great TCO at every dimension.
To your question about a highly scattered random workload being bad for auto-tiering with a 1GB slice. We can probably agree that there's a temporal nature to most workloads..not all, but most - what begins to emerge for our customers is a sort of a datacenter-wide storage "working set". We've seen data from literally thousands of VMware environments and have been able to generate thermographic charts that show very favorable patterns for auto-tiering. Net/net, the "pros" of mobility of a 1GB slice outweigh the cons for the vast majority of use cases. Ultimately, as long as customers use a variety of drive speeds and EFD, providing the ability to throw all drive speeds into a few single pools and let the array figure it out is something most customers find very appealing :=)
Peace.
Jonas
Posted by: Jonas Irwin | May 18, 2010 at 12:40 AM
@dikrek @jonas:
Dikrek - I try to stay away from claims of "FIRST!" and "THOUGHT LEADERSHIP!" - in the end, the first is transient, and the second is in the eye of the beholder.
Jonas is a good guy - I think you made him snap a bit. As the "800lb gorilla" (I don't mean that in the arrogant way, rather as a simple "we are the biggest") you can't imagine how often we hear all the statements of "EMC sucks at X, we rock" - and eventually it gets frustrating, and one wants to punch back.
While perhaps inevitable (everyone compares themselves against the largest player in every category), it gets hard after a while.
For example - the "well documented" snapshot thing you describe - that was the test NetApp commissioned and ran.
Also, EMC doesn't do 8 snapshots. That's a CLARiiON snapshot. We can do 1000 file-level snapshots, each writeable. We can do 96 filesystem snapshots, 16 writeable at a time. We can do continuous data protection (effectively "infinite snapshots") with Recoverpoint. We can do 128 writeable snapshots on Symmetrix. I know that might make us hard to follow, but it also means almost anytime anyone says something about us, they are wrong, which makes competing easier :-)
Likewise - our approach on Unified Storage has been focused on this:
- yes, customers want single storage solutions to support multiple protocols.
- the implementation of HOW that gets done is less important, what is important is that it's easy, and that they get the functionality they want/need.
- that they get it at the right price, via the channel they want, and with the support they need - in presales and in post-sales.
In our mind, we're able to do that well (and doing well in the market - which is of course the ultimate judge, as you point out).
Our approach (encapsulation of key functions) has enabled us to innovate down several axes at once, without getting into the effect of merging complex kernel codestreams that become co-dependent.
We have merged big chunks (underlying storage allocation logic, iSCSI target code and more) across kernels and will continue where it makes sense. It has enabled us to do things that are obscenely hard the other way.
The unification of our management models (unifying block, NAS, and also CDP) helps our customers. That was the only real beef people had (as opposed to FUD about underlying implementation). Check. Done. Don't start with "multiple ways of doing one thing" with me - I'll bring a world of examples where you need a ton of interfaces, kernels, management models to do a set of things :-)
That isn't to imply that the NetApp choices are by definition wrong, just DIFFERENT.
Want an example? Your question on autotiering and granularity is architecturally just totally different for each of us. As an example, if you have a VMware guest, and it has a guest swap - it's likely to be contiguous, or mostly contiguous with a virtual pool model that uses a traditional underlying block layout scheme. Likewise, same holds true for database structures.
This means that the data bears out that auto-tiering at 1GB level granularity has a huge beneficial effect. I certainly invite people to say "uh - you don't want it, you won't save anything". We can demonstrate materially that they will - at which point the person saying "they sux, we rox" will look flat out silly.
Conversely, if you use a "reallocate on every write" journaled model - if you didn't auto-tier at or close to the allocation size (in WAFLs case, 4K), the benefit would be much much lower. Your comment of "Imagine this LUN is like a sheet of finely squared paper. Now, imagine the hot spots are spread out in the little squares." is very WAFL-oriented. The "little squares" have more locality in alternate approaches. It's a superpower and a kryptonite at the same time - like ANY of the core design decisions we all have to make when designing a platform.
Again - not right/wrong - just different.
That difference meant that it was easier for NetApp than others to implement a sub-file-level deduplication approach. A block that is referred to several times via the higher-level inode/pointer structures is cached only once – a second-order benefit that fell out of that design.
Again - not right/wrong - just different. In the same way, we are working at capacity efficiency from different angles, starting with what we can do to help customers most, and do most quickly. Hence the focus on file-level dedupe on primary storage coupled with compression now, and sub-file block-level dedupe on backup targets. I would argue (and do) that we can compete with ANYONE when it comes to $/GB and $/IO in every dimension. It's not about any given feature, but rather the solution efficiency.
Of course we are working on sub-file-level dedupe, in the same way I'm SURE there's someone working on auto-tiering at NetApp.
Re "Thought Leadership", I don't know about how Compellent does their auto-tiering, but regardless, they were certainly the first player of any size to lead the way there. Each of us has innovated over, and over again - no one has an exclusive license to innovation. EMC has created new categories of storage - not once, but several times, as well as point innovation. Just look at your Bycast acquisition - in essence validating that customers have need for Atmos-like storage models.
Personally, I'm glad that at my employer we try to embrace innovation from different sources (R&D, M&A, and watching what competitors do).
We've intro'ed dense storage configs (like Copan), auto-tiering (like Compellent), unified NAS/SAN management (like NetApp) - all the while introducing new things (like primary storage compression and global cache coherency) - the list goes on and on...
Posted by: Chad Sakac | May 18, 2010 at 06:32 AM
Chad et al,
when it comes to thought leadership on automatic sub-LUN tiering, I think we should all acknowledge that HP Labs can probably claim the moral high ground here, thanks to the excellent work done by John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan in creating a RAID array with automatic tiering back in 1996! cf. http://www.hpl.hp.com/research/ssp/papers/AutoRAID.TOCS.pdf
Unfortunately, like much of the other work done in HP's storage labs, this never got the support it needed from the rest of HP to turn into a successful product, and when they bought Compaq, the somewhat misnamed "Enterprise Virtual Array" (which IMHO is neither Enterprise class nor particularly "Virtual") killed off what could have been a much better product.
The good thing about EMC and NetApp (I work for NetApp, btw, and have never worked for HP) is that our focus means our good ideas become good products, and we invest heavily in making sure our customers can take advantage of the technology we imagine and create. Ultimately it doesn't matter who thought of it first; what matters is who is able to solve problems in the most efficient manner.
Vision and thought leadership are cool, but to paraphrase the CIO of Morgan Stanley, the true measure of differentiation is execution.
Personally, I'm more impressed by shipping products and happy customers than by lab results, white papers, and products that haven't been released yet. Having said that, the engineer in me is looking forward to seeing how well EMC will execute on their vision.
Regards
John Martin
Posted by: Storagewithoutborders | May 19, 2010 at 09:33 PM
Hi Chad,
I have a small tech question.
Can I enable FAST Cache on all five of the Vault Pack SSD drives?
Is it advisable?
Thanks,
CR
Posted by: Cristi Romano | May 21, 2010 at 05:33 PM
@Jonas
Sometimes ex-NetApp employees make for the most passionate evangelists for the competition. That's great passion! All the best to you in your career at EMC - just not too much when competing against us over here at NetApp :-) if you don't mind.
Unfortunately, your WAFL analysis is dated (i.e., measured in years) in some areas, and your proof points are simply wrong. I would caution against using some of those "aged NetApp system" points in the field, and your data warehouse example. Those are just softballs over the middle of the plate for most of the NetApp field nowadays. I'm not telling EMC how to train their sales folks - in fact, the NetApp in me says keep heading down this path - but we really don't need to get down into the weeds on how WAFL works. At the very least, competitors start with the name WAFL and let their imaginations run wild from there. In sales campaigns, once a competitor pulls out the FUD paper, it's almost like witnessing a fender-bender. You know it's going to end badly for them, but you just can't take your eyes off it. You can use these rants if you want, but I don't think it works out all that well for you. I do think, to one of Chad's points, that most customers don't care *how* the solution works. They want to know whether or not it solves their problem and *what* benefits they will see.
Much of what has been talked about here - unified storage, snapshots, primary storage dedupe, flash as cache - isn't important because NetApp pioneered in these areas. From a NetApp point of view, these were relatively easy to do because they were already part of the WAFL DNA. Whether by luck or design, WAFL lends itself very well to market shifts, particularly the shift towards efficiency and Cloud architectures. It's not about big beating small anymore. It's about fast beating slow. A nimble 800# gorilla is an oxymoron of sorts, isn't it? :-)
Anyway, Jonas is a good guy. I wish him success and I'm sure he will do right by his customers. Based on his post, though, I'm pretty sure WAFL isn't his strong suit but that's O.K. He works for EMC. Have him tell you why you should buy from EMC rather than why you shouldn't buy from NetApp.
@Chad,
O.K. - I had to chuckle a little at this statement: "I know that might make us hard to follow, but it also means almost anytime anyone says something about us, they are wrong, which makes competing easier :-)" I'm not sure the "Where's Waldo" strategy turns out all that well. I'm thinking that having a bunch of incongruous approaches to answer the same basic problem wouldn't be a strength, at least not in a customer's eyes. The implicit challenge is to find Waldo, and performance is Waldo for EMC. It's a challenge to be dealt with. That's simply not a variable for NetApp, nor does NetApp have to amortize development across a wide variety of platforms and features. It just means that, comparatively, NetApp can be more nimble with their solutions and adapt quickly to changes in customer demands. I'm not saying EMC can't - not wrong; just... different.
Posted by: Mike Riley | May 24, 2010 at 12:26 PM
@Mike - it's not a variety of approaches to solve one problem, it's a variety of approaches to solve different problems.
- Sometimes customers consolidate everything, including a broad set of open systems (not just Windows/Linux) and mainframes.
- Sometimes the midrange behavior of "lose a significant percentage of brains/ports/cache when you upgrade or have a storage brain fail" is a deal breaker for a customer.
- Sometimes customers need heterogeneous replication.
- Sometimes customers' requirements demand inline dedupe.
- Sometimes customers need an RPO of zero - not sometimes, but always.
- Sometimes customers need to support tens of thousands of devices.
There is a long, long list.
Sure, in many cases simple, easy & efficient is good enough, and we're happy to fight that out with any respected competitor.
But the "one way all the time" thing - I think perhaps even you folks don't really buy that.
Does it cause any cognitive dissonance that:
1) ONTAP 8 has two distinctly different modes, with different feature sets and capabilities (yes, yes, to be merged at some future date).
2) If one way to back up were always the right way (snap and replicate - which of course we can also do) and inline dedupe were bad, why the bid for Data Domain?
3) Bycast (good buy, IMO) - does that run ONTAP? Hmmm, no. And I guess it DOES highlight that at very high, internet-class scale, object models with rich metadata (a la EMC Atmos and Amazon S3) that have no intrinsic dependency on a given filesystem make sense after all, huh?
4) One way all the time, right? So why so many SnapManager products? Why not unify them (we have - one Replication Manager)?
Look - this isn't to claim we're perfect, and you'll note that in my post I didn't make ANY comparison to anyone (NetApp included). EMC has lots to improve, and I wake up every day to try to make it a little bit better (after kissing my wife and kids - I've got my priorities straight :-)). An example of where we can improve is more cohesiveness - a common look/feel/function across our capabilities - and if you look at my simplicity post from EMC World, you can see our massive progress in that area, something we're pretty excited about.
My comment about our different approaches to solutions was humor, nothing more, nothing less. Yes, we think that sometimes different technology answers to different technology questions are the way to go. Whether we're right, and whether those answers are sufficiently differentiated and sufficiently integrated - well, that's up to the customer.
I will restate my comment though - I've found that what NetApp thinks it knows about our products is usually at LEAST as wrong as what we think we know about theirs.
Posted by: Chad Sakac | June 03, 2010 at 12:13 AM
Hi, Chad.
Since this seems to have taken a decided NetApp-centric turn, I posted my response on this NetApp blog:
When is it FUD? When is it Ignorance?
http://blogs.netapp.com/efficiency/2010/06/when-is-it-fud-when-is-it-ignorance.html
Have a great week.
Mike
Posted by: Mike Riley | June 06, 2010 at 06:57 PM
@Mike - it's sad how this turned into a NetApp/EMC bash fest. That happens far too often. Note that in the post, I referred to NetApp only once, and that was to acknowledge innovation around Flash used as an extension of system cache.
The record is clear (you can see it all above in the post and the comment thread), the pile-on started with a series of questions posed by pro-NetApp folks, and then it all went downhill from there.
We each see the world through our own eyes, and through the lens of our own experiences.
I always try to make the posts not refer to anyone else (except where the only right thing to do is to make an acknowledgement). I don't filter comments (I haven't, and I will continue to try not to) - heck, they aren't even moderated. That means there's no stopping anyone from making any comment.
I'm going to try a new technique - and we'll see how it goes.
As soon as the dialog starts going sideways, rather than participating (fueling?) I'm simply going to say:
"Tomato".
As in "Tomato"/"TomAHto"
We're both doing fine in the marketplace, both gaining share, both have many innovations, both have fans and detractors.
The more I get sucked into the back-and-forth, the more I think it just hurts us both when the blogosphere/twitpiss contests happen.
Don't get me wrong, I still vehemently disagree with you on many things, but that's OK.
Thanks for the comment.
Posted by: Chad Sakac | June 06, 2010 at 09:26 PM
Chad - surely you mean TomAYto? Heehee, not even safe on that one!
In the end there are always two parts - the blog and your own comments, which are controllable, and the outside comments, which can be wild and free.
Sadly, Jonas Irwin fell into the trap of believing he is still an authority on a company he has left - that never works; it just invites personal questions, doesn't it? His job is "Director, Competitive Strategy, EMC" after all.
Posted by: Adriaan | June 08, 2010 at 08:08 AM
I have a Celerra NS-120 - will these features be supported on it? How would I go about adding one Flash drive? Would this be a new shelf?
Posted by: PxPx | August 25, 2010 at 11:18 AM
@PxPx - thank you for being an EMC customer!
Yup, those features are all GA now, and you can get them as you upgrade to FLARE 30 and DART 6.0. You get a LOT with that (Unisphere, VAAI support, and a lot more), it's free (though the FAST Suite is licensed), and the upgrade is non-disruptive. I would highly encourage it.
EMC's implementation of SSD support is very flexible. You can stick any SSD in any enclosure - basically anywhere you would put a disk.
When doing your update, please download and use the EMC Procedure Generator. It will generate a personalized process JUST FOR YOU.
Thanks again!
Posted by: Chad Sakac | August 28, 2010 at 12:57 AM
Do I need to download the CLARiiON Procedure Generator, or the Celerra Procedure Generator, or should I get both?
Don't I need to add a minimum of 2 Flash drives in a RAID 1 for FAST Cache?
Posted by: Shane | September 08, 2010 at 02:47 PM
@Shane - you need both the CLARiiON and Celerra Procedure Generators (yes, they are actively being merged, analogously to what we are doing with the products). Note that as I'm posting this comment, there is a momentary pause on FLARE 30 upgrades; it should be lifted shortly.
You can have an "unprotected" FAST Cache, but then it operates in Read-Only mode. If you want to use it in Read/Write (and you do), the minimum is an R1 config. We're seeing most deployments go out the door in parity configs, but R1 is fine.
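For anyone curious why the protection level changes the caching mode, here's a deliberately trivial sketch (my own simplification, not FLARE internals): without a mirror, acknowledging a write from cache alone would risk losing it if that single SSD failed, so the safe behavior is to cache reads only and send writes straight through.

# Illustrative sketch of the Read-Only vs. Read/Write decision; the function and
# its parameters are hypothetical, purely to show the reasoning.
def handle_write(data, cache_is_mirrored, cache, backend):
    if cache_is_mirrored:
        # Read/Write mode: a dirty block is safe on two SSDs, so it can be
        # acknowledged from cache and flushed to disk later.
        cache.append(data)
    else:
        # Read-Only mode: writes go straight to the backend; the cache is only
        # used to accelerate reads.
        backend.append(data)

cache, backend = [], []
handle_write("block A", cache_is_mirrored=False, cache=cache, backend=backend)
handle_write("block B", cache_is_mirrored=True, cache=cache, backend=backend)
print(cache, backend)  # ['block B'] ['block A']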
Thanks for being a customer!
Posted by: Chad Sakac | September 11, 2010 at 11:01 AM
I have Unisphere 6.0.36-4 installed with FLARE 30 on my Celerra NS120. However, my Unisphere looks nothing like the version you have in your video demonstration. How do I do Auto-Tiering and Compression on the Celerra version of Unisphere? Those same options don't show up.
Posted by: PxPx | October 18, 2010 at 12:31 PM
Here is what I got back from EMC support. Supposedly these features don't work yet on Celerra LUNs.
Certain new CLARiiON FLARE Release 30 features are not supported by Celerra and the initial 6.0 NAS code release. CLARiiON LUNs containing the following features will not be diskmarked by the Celerra, resulting in diskmark failures similar to those described in the previous Symptom statements:
• CLARiiON FAST Auto-Tiering is not supported on LUNs used by Celerra.
• CLARiiON Fully Provisioned Pool-based LUNs (DLUs) are not supported on LUNs used by Celerra.
• CLARiiON LUN Compression is not supported on LUNs used by Celerra.
Posted by: PxPx | October 20, 2010 at 03:13 PM
@PxPx:
- FAST Cache is fully supported for NAS.
- FAST is supported for the direct block storage provided by your Celerra today, and will be supported for NAS volumes VERY shortly. Today, fully automated storage tiering for NAS is done at the FILE level, not at the block level (but, as noted, block-level will be available shortly).
- For your NAS volumes, you're already ahead in that race :-) You get file-level dedupe and compression as a native part of what the Celerra offers (a rough sketch of the file-level idea follows after this list).
- Thin provisioning is also provided for NAS.
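Here's the rough sketch of the file-level idea I mentioned above - purely illustrative, not the Celerra engine: file-level single-instancing collapses whole files that are byte-for-byte identical, which is a different (and deliberately simpler) trade-off than sub-file block dedupe.

# Hypothetical illustration of file-level dedupe: only files that are
# byte-for-byte identical collapse to a single stored copy.
import hashlib

def file_level_dedupe(files):
    """files: dict of name -> bytes. Returns total bytes actually stored."""
    store = {}
    for name, data in files.items():
        store.setdefault(hashlib.sha256(data).hexdigest(), data)
    return sum(len(d) for d in store.values())

files = {
    "report_v1.doc": b"A" * 1000,
    "copy_of_report_v1.doc": b"A" * 1000,   # identical file -> deduped away
    "report_v2.doc": b"A" * 999 + b"B",     # one byte differs -> stored in full
}
print(file_level_dedupe(files))  # 2000 bytes stored instead of 3000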
@Derek, @Direk - this whole comment thread only reinforces my point now that NetApp has introduced native SSDs as a non-volatile tier in addition to their Flash Cache (though not yet with automated tiering).
Moral of the story - each vendor tends to "pooh-pooh" the other's approach, while furiously evaluating it themselves if it's a good idea (EMC certainly does).
Hence my efforts to not say bad stuff about the other guy.
I think it might warrant a blog post :-)
Posted by: Chad Sakac | November 08, 2010 at 11:29 PM
I have a question about EMC RecoverPoint licensing. I have one RP-HW-1U-GN4B at the main site and another one at the remote site. I need a 100TB CRR license for remote replication. Please tell me which product number and what quantity I must choose for RecoverPoint licensing.
Posted by: ALI | January 08, 2011 at 02:07 AM