So here, we showed how vSphere 4.1 SIOC and EMC FAST can work together to make DRS for storage possible today.
And here, we showed how EMC FAST Cache can dramatically improve cost efficiency (along with VAAI) for VMware View 4.5 use cases.
There are industry debates (good healthy ones) about the place for cache models vs. automated storage tiering behavior – usually driven by folks who focus exclusively on one or the other. This is like the “compression vs. dedupe” or “SSD – in the server or in the array” arguments… IMO, it’s a “both”, not an “either”, and over time both will become pervasive.
Loads of performance data with various workloads, and the reason why – all below – read on!!!
Where do these new “Mega Cache” approaches (think EMC FAST Cache, NetApp Flash Cache, Sun stuff) fit into the equation? To me, the answer is: where caches always have – they enable you to squeeze more out of existing resources if the data is in the cache for reads, and, if it’s also a write cache, to buffer host response to some degree from the backend storage itself (how much this matters varies based on many things). Whether data is in the read cache is a function of the cache size relative to the working set, and whether the read cache is warm. Ultra high-scale arrays have always had massive caches, but mega caches (think “hundreds of GB or even TB”) are a relatively new thing in the mid-range arrays (which have historically had caches in the “hundreds of MB or low GB” range).
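To put rough numbers on that “squeeze more out” point, here’s a minimal sketch in Python – the latencies and the linear hit-ratio model are illustrative assumptions on my part, not measurements from any array:

```python
# Minimal sketch: effective read latency as a function of cache hit ratio.
# The latencies and the hit-ratio model are illustrative assumptions,
# not measured numbers from any EMC platform.

def effective_read_latency_ms(hit_ratio, flash_ms=0.5, sata_ms=10.0):
    """Weighted average: hits served from flash, misses from SATA disk."""
    return hit_ratio * flash_ms + (1.0 - hit_ratio) * sata_ms

# Crude first-order model for a warm cache and a uniformly-accessed
# working set: hit ratio ~ cache size / working set size (capped at 1).
def naive_hit_ratio(cache_gb, working_set_gb):
    return min(1.0, cache_gb / working_set_gb)

for cache_gb in (60, 120, 240):  # 25%, 50%, 100% of a 240GB working set
    h = naive_hit_ratio(cache_gb, 240)
    print(f"{cache_gb}GB cache: hit ratio {h:.0%}, "
          f"avg read latency {effective_read_latency_ms(h):.2f} ms")
```

Real workloads are skewed (some blocks are much hotter than others), so a warm cache usually beats this naive linear model – which is exactly why caches punch above their weight.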
Here’s some data that came from the EMC Unified Storage Division performance engineering teams (Radha – thank you!) as we’ve been getting ready for the launch of FAST Cache. FAST Cache, recall, enables huge low-cost caches (up to 2TB, using Flash, which is a fraction of the cost of a similar amount of NVRAM or destage-able DRAM) to be dynamically and non-disruptively added to the EMC Unified platforms – simple to add, and it improves performance significantly.
The first set of data below shows TPC-C workloads (against a 240GB working set) with various sizes of FAST Cache against a SATA-based target. The core question is: “can these caching models make slow 7.2K SATA perform like 15K FC disks?” The answer is “kinda”.
You can see that when the amount of FAST Cache is proportionally smaller (25% of the working set), the effect is smaller – but hey, still good! As the amount of FAST Cache approaches the working set (50%), the effect becomes pronounced (90% more TPM).
This is the root of the testing that supports some of the FAST Cache marketing metrics (“Increase performance two-fold in Oracle and Microsoft SQL Server environments”).
Another example (this one in the VMware View 4.5 use case, from testing done to support the joint EMC/VMware View 4.5 reference architecture work – thank you Aaron Patten and team, you are rock stars!) is the effect of FAST Cache during boot storm, anti-virus scan, recompose, refresh, and patching. I did a longer post on that here. Client virtualization is a place where these caching approaches are material, as the working set is relatively small (dominated by the base replica). In the first graph in the upper left, the red is the “no FAST Cache” baseline and the blue is with FAST Cache – and the measure is host response time. Huge difference. The upper right graph shows the actual IOs hitting the backend disks (which is why the host latency is so different), and the bottom right graph shows the IOs per second being absorbed by write cache (red) and read cache (blue).
Awesomesauce :-) It’s obvious (IMO) why large caches are very important and will need to become universal (either in the “global cache” approach of high-end arrays like EMC VMAX, or the “Flash as an extension of system cache” approach in the mid-range, like EMC FAST Cache).
Q: So – are these approaches a panacea? Do they eliminate the need for automated tiering?
A: An uncharacteristically short answer – NO.
Why? The answer is basic. Look carefully at the data presented above, and see if you can spot the marketing trick :-).
- Not all workloads are cache-friendly. Watch out for marketing (anyone – EMC included) that carefully selects the workloads and then says “ergo, you don’t need solid state as a non-volatile storage tier”. They show a 240GB TPC-C workload, and a View 4.5 workload with a base replica that fits in the cache. Sure, there was some benefit from FAST Cache when it was a small portion of the working set, but nowhere near the results when the whole workload fit in cache (at which point it’s basically equivalent to all the data being on solid state to begin with). All caches leverage the fact that pointer-based approaches mean a single block can be referenced many times (due to snapshots or block-level dedupe, as examples), so the “effective size” of the cache is larger than its raw size – but the long and short of it is that if the data ain’t in the cache, there’s no benefit. The toy simulation after this list shows just how much the access pattern matters.
- Note the shape of the curve in the first example – that’s the effect of the “cache warming period”, as the data is first read and loaded into the read cache. In the View use case this is pretty fast (it ends as soon as the base replica has been read once), but it still exists. If you don’t have enough back-end IOs, the performance during this warm-up period is the performance of the backend (as if the cache didn’t exist), and how long the warm-up period lasts can be material.
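To hammer home the first point, here’s a toy LRU simulation – the traces and sizes are made up for illustration, and this is not a model of FAST Cache internals:

```python
# Toy LRU cache simulation: same cache size, two access patterns.
# Illustrative only; the distributions and sizes are assumptions,
# not measurements of any real array.
import random
from collections import OrderedDict

def hit_ratio(trace, cache_blocks):
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # LRU: refresh on hit
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)

random.seed(1)
blocks, cache = 100_000, 25_000            # cache = 25% of working set
uniform = [random.randrange(blocks) for _ in range(500_000)]
skewed = [int(random.paretovariate(1.2)) % blocks for _ in range(500_000)]

print(f"uniform trace hit ratio: {hit_ratio(uniform, cache):.0%}")  # ~25%
print(f"skewed trace hit ratio:  {hit_ratio(skewed, cache):.0%}")   # much higher
```

With the same cache, the uniform trace hits roughly in proportion to the cache-to-working-set ratio, while the skewed trace hits far more often. Cache-friendliness is a property of the workload, not the cache.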
Here’s a bit more “behind the curtain” analysis to hammer home that second point.
Here we’re measuring the TPM of a 240GB working set as the cache warms, varying the back-end disk config. Eventually they all reach the same point (~30,000 TPM), but it takes 1 hour for the 40 x 15K RPM disk config and 5 hours for the 40 x 7.2K RPM disk config to get there. More importantly, perhaps, it takes 2 hours for the 40 x 7.2K RPM disk config to reach the point where the 40 x 15K RPM config STARTS, during which its performance is about 2.5x lower (starting at ~5,000 TPM vs. ~12,500 TPM). On a positive note, the effect of FAST Cache over time is such that even the 40 x 15K RPM disk config ends up doing 2x more TPM than it would without FAST Cache.
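A first-order way to see why the disk config drives the warm-up time: the cache can only fill as fast as the backend can serve misses. A rough sketch (the per-disk IOPS are rule-of-thumb assumptions, not measured values):

```python
# First-order warm-up model: the cache fills no faster than the backend
# disks can serve misses. Per-disk IOPS are rule-of-thumb assumptions
# (~180 for 15K RPM, ~75 for 7.2K RPM), not measured values.

def warmup_hours(working_set_gb, disks, iops_per_disk, io_kb=8):
    misses = working_set_gb * 1024 * 1024 / io_kb  # IOs to touch it all once
    return misses / (disks * iops_per_disk) / 3600

for label, iops in (("40 x 15K RPM", 180), ("40 x 7.2K RPM", 75)):
    print(f"{label}: ~{warmup_hours(240, 40, iops):.1f}h to read the "
          f"working set once")
```

The model gets the direction and rough magnitude right; the measured 1 hour vs. 5 hour gap is wider because warm-up isn’t a single clean pass over the working set.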
Now, is warm-up behavior OK? Well, in some use cases, absolutely. In others, no.
BTW – this is why we think it’s material that the design of EMC FAST Cache doesn’t require re-warming if a storage processor dies (or, more commonly, when you do a software update), and that the FAST Cache is shared across both storage processors (along with all the goodness of being able to add it dynamically on the fly and leverage the plummeting cost of solid state storage – note the new 100GB and 200GB lower-cost solid state disks we’ve got to add to configs).
It’s also perhaps a good point on which to summarize why you can count me on the side favoring “AND”, not “OR”, on the topic of mega caches and auto-tiering:
- The amount of “hot” data on a storage subsystem is very small compared with the “cold” data.
- Which parts are “hot” vs. “cold” tends not to change much for a long time, with a decreasing chance over time that something cold will become hot. When that something cold does need to be accessed, though, the change can be dramatic.
- If you can move the “hot” data to a solid-state non-volatile tier – it’s always fast. If you can move the “cold” data to a huge slow disk, that’s efficient. Both are more efficient than building the whole config out of disks that are neither fast nor huge (15K RPM spindles) – ergo auto-tiering is very important.
- BUT, when there is a change, physically moving the data is NOT something that can be instantaneous, for two reasons: moving something from one non-volatile tier to another costs IOs, and if you move it the moment it’s accessed, you’ll CONSTANTLY be moving stuff. Conversely, loading it into cache is easier, and the cache can absorb some of the badness while the data is on the wrong tier – ergo large-scale caches are very important.
- Mega caches help even things out – but they are generally going to be smaller than non-volatile pools of solid state. Why? Well, the “mega cache” is an extension of system memory, so there’s some metadata kept in direct system memory to keep track of what data is cached. If this map gets super-big, you lose the effectiveness of the DRAM in the system. That’s why we support 2TB of EMC FAST Cache in the larger Unified systems (they have more DRAM and can therefore track more metadata for larger FAST Cache configs), but you can have a lot more than 2TB of solid state in a system for non-volatile use cases (see the back-of-the-envelope sketch after this list). You can expect the size of mega caches to grow as primary system memories in storage processors grow, but you can expect the amount of non-volatile solid state storage to grow even faster as prices continue to plummet – ergo it’s best if you can do both.
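Here’s that back-of-the-envelope DRAM math – the page size and per-entry overhead are illustrative assumptions, not the actual FAST Cache internals:

```python
# Back-of-the-envelope: DRAM needed to track a flash megacache.
# Page size and bytes-per-entry are assumptions for illustration;
# real implementations vary.

def tracking_dram_gb(cache_tb, page_kb=64, bytes_per_entry=40):
    entries = cache_tb * 1024 * 1024 * 1024 / page_kb  # 1TB = 2^30 KB
    return entries * bytes_per_entry / (1024 ** 3)

for cache_tb in (0.5, 1, 2, 10):
    print(f"{cache_tb}TB cache -> ~{tracking_dram_gb(cache_tb):.1f}GB "
          f"of DRAM for the tracking map")
```

A ~GB-scale map is tolerable; scale the cache by 10x and the tracking map starts eating a serious chunk of the storage processors’ DRAM.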
Hope this helps clear things up. Remember, when you hear someone vehemently argue that “one way is always the right way!” – look with a critical eye. They might have drunk some Kool-Aid (hey, it happens to all of us).
Thoughts? Comments?
David Flynn from Fusion-io spoke about a similar concept just now on theCube/live@VMware, which I commented on earlier.
Posted by: StefanJagger | September 01, 2010 at 04:18 PM
Chad, thanks for writing about this very interesting issue about how best to use flash within a storage system.
As I understand it, you are making these points:
1. Using flash as a tier with auto-tiering is helpful in many cases. However, it is relatively slow to respond to changing workloads. Therefore, it is helpful to have a complementary cache.
2. Using flash as mega cache is helpful in many cases. However, it has the following two limitations, so it is helpful to have complementary tiering.
(A) The cache takes time to warm up to hold the working set.
(B) The cache is smaller than a tier could be, limited by amount of memory needed to index it.
I agree with point 1. Besides being relatively slow to respond, tiering often works at a coarser granularity than caching. E.g., the unit of tiering might be 1MB, while the unit of caching might be smaller than 100KB. This might be because tiering is heavier weight than caching.
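To make the granularity point concrete (the sizes here are purely illustrative): if only one hot 8KB block lives inside each promoted unit, the fraction of flash holding genuinely hot data falls off quickly.

```python
# Why granularity matters: fraction of promoted flash that is actually
# hot, assuming one hot 8KB block per promoted unit. Sizes are
# illustrative assumptions only.
for unit_kb in (8, 100, 1024):  # cache block vs. tier extent sizes
    print(f"{unit_kb}KB promotion unit: {8 / unit_kb:.1%} of the "
          f"promoted flash holds hot data")
```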
I am not sure about the limitations in point 2.
Regarding limitation (A): If the data could be in a fast tier to begin with, because of either manual or automated assignment, the data could as well be in the cache, because it could be tagged as being highly cache-worthy either manually or automatically. In particular, it could always stay in cache---starting from the very first writes to each block, so that there are no cache misses.
Perhaps you are suggesting that, if the cache has to hold a large fraction of the dataset to provide a significant payback, it is better to just put the dataset in a faster tier and be done with it. Let's consider the pros and cons of having all of the data in a flash tier vs. having all of the data in a flash cache.
- The advantage of putting the data in a flash tier would be that it would not use any disk space or disk write IO. (Neither approach will use any disk read IO because each would serve data from flash.)
- The advantage of putting it in a cache would be that the cache can tolerate data loss and therefore does not require expensive flash with high endurance, parity or mirroring, or a whole bunch of other mechanisms needed to ensure durability.
In most cases, the second advantage outweighs the first. This is because the dollar savings in disk space are small (disk is much cheaper than flash). Furthermore, flash as a medium is actually worse on writes than disk. The reason some systems use flash for "write buffering" is that the flash SSD vendors have implemented write coalescing in the SSD firmware. One could implement this write coalescing on top of disk and get similar benefits without using flash. This is the approach we have taken at Nimble Storage; see my blog at http://bit.ly/dqHymr .
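Here is a toy sketch of the write-coalescing idea (a generic log-structured buffer, not our actual implementation): scattered logical writes accumulate in memory and are flushed as one large sequential segment, with a map recording where each logical block landed.

```python
# Generic sketch of write coalescing on top of disk (illustrative, not
# Nimble's actual implementation): random logical writes accumulate in a
# buffer and are flushed as one large sequential, log-style segment.

class CoalescingWriter:
    def __init__(self, segment_bytes=4 * 1024 * 1024):
        self.segment_bytes = segment_bytes
        self.buffer = {}        # logical block address -> data
        self.buffered = 0
        self.disk_offset = 0    # append point of the on-disk log
        self.index = {}         # logical address -> physical offset

    def write(self, lba, data):
        if lba in self.buffer:  # a rewrite coalesces in memory for free
            self.buffered -= len(self.buffer[lba])
        self.buffer[lba] = data
        self.buffered += len(data)
        if self.buffered >= self.segment_bytes:
            self.flush()

    def flush(self):
        # One big sequential append instead of many scattered random writes.
        for lba, data in sorted(self.buffer.items()):
            self.index[lba] = self.disk_offset  # remap logical -> physical
            self.disk_offset += len(data)
        self.buffer.clear()
        self.buffered = 0

w = CoalescingWriter()
for lba in (7, 9001, 42, 7):    # scattered writes, one rewrite of lba 7
    w.write(lba, b"x" * 4096)
w.flush()
print(w.index)                  # logical->physical map after coalescing
```

The disk then sees one big sequential write per segment instead of many random ones, which is what makes disk competitive with flash for write buffering.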
Re limitation (B): it is true that treating the flash cache as an extension of the DRAM cache would limit the size of the flash cache. However, it does not have to be so. If one implements the flash cache the way one implements on-disk data structures, using a hierarchical tree of blocks, one can build a scalable flash cache.
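Roughly like this (a sketch with made-up sizes): DRAM holds one entry per index page, and the full block map lives in pages that could themselves sit on flash.

```python
# Sketch of a hierarchical cache index: DRAM holds only a small top level
# (one entry per index page); the full block map lives in pages stored on
# the flash itself. Structure and sizes are illustrative assumptions.

ENTRIES_PER_PAGE = 512

class HierarchicalIndex:
    def __init__(self):
        self.dram_top = {}     # DRAM: bucket id -> location of an index page
        self.flash_pages = {}  # stand-in for index pages stored on flash

    def _bucket(self, lba):
        return lba // ENTRIES_PER_PAGE

    def insert(self, lba, flash_offset):
        b = self._bucket(lba)
        self.flash_pages.setdefault(b, {})[lba] = flash_offset
        self.dram_top[b] = b   # DRAM pays one entry per page, not per block

    def lookup(self, lba):
        b = self._bucket(lba)
        if b not in self.dram_top:
            return None        # definitely not cached, no flash read needed
        # In a real system this costs one extra read of an index page.
        return self.flash_pages[b].get(lba)

idx = HierarchicalIndex()
idx.insert(123456, 0)
print(idx.lookup(123456), idx.lookup(999))  # 0  None
```

The cost is one extra flash read per lookup to fetch the index page; hot index pages can themselves be cached in DRAM, so in practice the overhead stays small.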
Nonetheless, on the whole, I agree with your general message that there are many different ways of using flash to support applications. E.g., flash can be used on the server as well as within storage.
Umesh
CTO, Nimble Storage
Posted by: Umesh Maheshwari | September 02, 2010 at 07:41 PM