I was in Singapore this past week, talking with the VMware SEs and Cisco SEs there – sharing best practices, tools, and “dos and don’ts”. There was an interesting discussion/whiteboard session on storage network design for FC/FCoE (though this applies to iSCSI as well). The Cisco folks made some really interesting analogies with VoIP and TelePresence “micro-bursting” that I thought were awesome and wanted to share.
It’s also the reason why there is so much discussion around LUN queues and queue management in vSphere. It’s not just EMC with PowerPath/VE, but I’ve heard 3Par start to talk about “adaptive queuing”, and last week Dell/EqualLogic announced their own addition to the Pluggable Storage Architecture. It’s all about the vStorage APIs, baby :-)
It’s also apropos given the recent discussion about queuing and the comparisons of NFS and block-based storage options here.
Lastly, it’s also a topic of discussion based on a recent post here about whether MPPs like PowerPath/VE are there to help “legacy” arrays and whether “virtualized” arrays get any benefit from more host-side sophistication. I’ve also heard the implication that NMP + Round Robin + ALUA are good enough in all cases – if you choose the right array :-)
Vaughn and I tend to agree a LOT – and we do, much more than we disagree. We’re both super into VMware, which means there are a lot of things we share. But there are times when we disagree – and this is one of them.
But this isn’t about EMC or NetApp, and it isn’t personal – it’s about core technical fundamentals, which apply to all storage vendors. If you want to understand and learn more – read on!
Ok – like many Virtual Geek posts – this starts with fundamentals, and will go deep. I know this makes these posts long, hard slogs, but it’s how I learn… (feedback/critiques welcome!). Also, it’s really important to know – we’re talking about stuff that will NOT affect most customers. As a general principle, I’m a big believer in “start simple, keep it as simple as you can, but understand enough that you know what you need when it gets complicated”.
This is just good stuff to know so that you can diagnose issues and separate truth from fiction – based on your own understanding – in this era of info overload from a bazillion sources.
Let’s follow a block I/O from a VM through to the back-end disk in the shared storage array.
- The vSCSI queues are the SCSI queues internal to the guest OS. They exist – just like in a physical host. This is why, for VMs with very heavy I/O, a recommendation to use multiple vSCSI adapters shows up in the VMware/EMC Reference Architectures for larger Tier 1 VM workloads.
- The VMkernel then has admittance and management for multiple disk requests against a target (LUN) – this is configured in the advanced settings, “Disk.SchedNumReqOutstanding” – shown in the screenshot below.
- Next – you hit the LUN queues (a critical element for reasons which become clear later). This is an HBA-specific setting, but the default is generally 32. This means 32 outstanding queued I/Os (unlike networks, which measure buffer size in bytes, HBAs measure it in I/Os, which can vary in size)
- Then, there are HBA-wide queues for all the LUNs it supports
- Then, the I/O makes its merry way to the FC/FCoE (even iSCSI) switch, where there are ingress port buffers
- On its way out, it goes out an egress port buffer
- Then it hits the storage array. The storage array has in essence an HBA, but in target mode as its front-end port, so just like on the host, the array has port-wide target queues (which define the port maximum)
- So far, everything is the same across almost everything under the sun – next, there are “Storage Processor” maximums (every vendor calls these different things) – basically how fast the brains can chew up the I/Os coming in. While there are array maximums, this is usually far more a function of what comes next….
- All array software these days is “virtualized” in many ways – the storage object presented (in this case a LUN) is a meta object (every vendor calls these something different) – composed and presented from internal constructs. This meta-object model is needed for all sorts of things – like thin provisioning, expansion/shrink, dynamic reconfigurations/tiering, deduplication, heck – snapshots were one of the earliest examples of this model.
- These meta objects are always composed of some sort of element object (again, every vendor does this differently – sometimes it’s a “page”, sometimes a “component”, but there’s always some sub-unit of granularity). In some cases, this relationship can be a couple of levels deep (the element object itself is a meta object composed of sub-elements).
- Then, leaving the array software stack (which at the lowest level invariably addresses brown spinny things – aka disk devices – on a bus of some sort), the I/O exits on a back-end loop (sometimes this is switched), which has its own buffering/queuing mechanisms
- Finally something gets written out to a disk – and all disks have their own disk queues.
This is why I say that queues exist everywhere.
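To make that chain concrete, here’s a minimal sketch in Python – the stage names and depths are illustrative assumptions roughly matching the defaults discussed above, not any specific HBA’s, switch’s, or array’s actual values. It models the I/O path as a series of bounded queues and finds the first one that can’t absorb a given burst:

```python
# Illustrative sketch only: the I/O path modeled as a chain of bounded queues.
# Depths are assumptions for illustration (e.g. 32 for the default LUN queue),
# not authoritative values for any particular HBA, switch, or array.
from collections import namedtuple

Stage = namedtuple("Stage", ["name", "depth", "outstanding"])

# One entry per stage discussed above: host -> fabric -> array -> disk.
io_path = [
    Stage("guest vSCSI queue",                    32, 0),
    Stage("VMkernel Disk.SchedNumReqOutstanding", 32, 0),
    Stage("HBA LUN queue",                        32, 0),
    Stage("HBA adapter-wide queue",             1024, 0),
    Stage("switch port buffer",                  256, 0),   # really measured in bytes, not I/Os
    Stage("array target port queue",            1600, 0),
    Stage("array LUN / meta-object queue",        88, 0),
    Stage("back-end disk queue",                  16, 0),
]

def first_bottleneck(path, burst_size):
    """Return the first stage that cannot absorb a burst of `burst_size` I/Os."""
    for stage in path:
        if stage.outstanding + burst_size > stage.depth:
            return stage.name
    return None  # the whole chain absorbed the burst

print(first_bottleneck(io_path, 10))   # None - a small burst fits everywhere
print(first_bottleneck(io_path, 48))   # overflows at the host long before the array
```

The numbers aren’t the point (yours will differ) – the point is that some of the shallowest queues in the chain sit at the host, so that’s where a burst tends to overflow first.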
In all networking, while we plan and design for average periods, it’s an intrinsically very dynamic picture. That’s where all these buffers and queues come into the picture. Remember that a buffer and a queue are the same thing – just one has a depth in “bytes”, and the other in “I/Os” (which in turn are variable amounts of bytes).
Buffering/queuing allows the network to “absorb” small spikes where things are momentarily really, really busy. BUT if a queue overflows, the whole protocol stack backs off – all the way at the top, and usually fairly sharply. This is true of all networks, and is a basic networking concept. It’s true of TCP (where it’s done via TCP windowing) and of Ethernet (where it’s done via flow control), just as it is of FC. In the Ethernet case, flow control will try to back off, but if worst comes to worst, dropping an Ethernet frame is perfectly OK (due to TCP retransmit). BTW – that’s what the whole IEEE Datacenter Bridging standard (aka lossless Ethernet) would add – per-priority pause – in essence a flow control state of “wait”, and don’t drop the Ethernet frame on the floor.
Ok – why is this important?
I’ll use a DMX4 customer who was having a performance problem with VMware as an example. This customer was suffering poor VM performance – but when we looked at the array, the service time (how long it took to service I/O requests) was a healthy 6ms, and the spindles weren’t really huffing and puffing. The array front-end ports (FA) were doing just fine. What’s going on – isn’t this the fastest enterprise storage array on the market?
Q: So what was happening?
A: What was happening is that there were very short transient periods – measured in timeframes much shorter than the service time (6ms) – where the number of simultaneous IO requests was very high.
Q: But shouldn’t this have shown up as the spindles being hot and busy?
A: No. Remember that every metric has a timescale. IOps is measured per second. Disk service time is in ms (5-20ms for traditional disk, about 1ms for EFD). If an I/O is served from cache, it’s in microseconds. Switch latencies are in microseconds. Here, the bursts were so short that they filled up the ESX LUN queues instantly, causing a “back-off” effect for the guests. The I/Os themselves were happily serviced by the SAN and the storage array, which had no idea anything bad was going on.
Q: This seems like it must have been some crazy workload – how many bazillion IOPs was it?
A: That’s the key – it wasn’t that “big”. It was just “spiky”. Measured on a normal timescale, it was easily supported by five 15K drives. But – it was a bunch of small, very low-IOps VMs that just happened to issue their I/Os at moments that coincided.
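Here’s a tiny simulation of that effect (illustrative only – the VM counts, burst sizes and service time are assumptions, not the actual customer’s workload). On a one-second average the load looks modest; tracked millisecond by millisecond, the outstanding I/O count occasionally spikes past a 32-deep LUN queue:

```python
# Illustrative simulation (assumed numbers): many small VMs each issue occasional
# short write bursts.  Averaged over a second the load looks modest, but tracked
# millisecond by millisecond the outstanding I/O count spikes past a 32-deep queue.
import random

random.seed(7)
VMS = 60
BURSTS_PER_VM_PER_SEC = 2     # each VM is mostly idle
IOS_PER_BURST = 12            # e.g. a guest flushing a clump of writes at once
SERVICE_MS = 10               # assume each I/O stays outstanding ~10 ms end to end
LUN_QUEUE_DEPTH = 32

arrivals = [0] * 1000         # new I/Os issued in each 1 ms slice
for _ in range(VMS):
    for ms in range(1000):
        if random.random() < BURSTS_PER_VM_PER_SEC / 1000.0:
            arrivals[ms] += IOS_PER_BURST

# outstanding I/Os in a slice = everything issued in the last SERVICE_MS slices
outstanding = [sum(arrivals[max(0, ms - SERVICE_MS + 1): ms + 1]) for ms in range(1000)]

print("average IOps for the datastore (1 s timescale):", sum(arrivals))
print("peak outstanding I/Os (1 ms timescale):", max(outstanding))
print("ms slices that overflowed the 32-deep LUN queue:",
      sum(o > LUN_QUEUE_DEPTH for o in outstanding))
```

That’s the whole microburst story in a dozen lines – the same workload looks boring at one timescale and brutal at another.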
This is where the Cisco folks in the room perked up. They then said what I thought was a perfect analogy:
“This is an effect we’re very familiar with. When looking at a network, people generally think in bandwidth: ‘will a DS-3 carry it, or do I need an OC-3?’, or ‘will this need 100Mbps/1Gbps/10Gbps?’ But with VoIP and TelePresence (video) workloads, sometimes a workload that on a normal timescale looks like it would only need 200kbps, when examined at very short timescales, bursts to 400Mbps. We call this ‘microbursting’. It’s one of the factors that sometimes makes customers go from Catalyst 3750 to Catalyst 6500 series switches, because the latter have deeper port buffers that can absorb the microbursts.”
OK – so what makes this effect worse and more likely to affect a customer?
- Any one thing in the path having a really shallow queue/buffer.
- Unbalanced paths – where the service time of one path is different than another – you will tend to get the worst one affecting the others. This can happen anywhere along the path – on the host, the switch, or the array
- Traffic patterns that are bursty. A datastore with many small VMs is more likely to exhibit this statistical pattern than a datastore with a single massive-IOps generator. Remember, it’s not about the IOps or MBps per se, but the “bursty-ness” of the pattern.
- The blended workload of a lot of different hosts (the diagram is hyper-simplified, because in general a storage network – or any network for that matter – has a TON of hosts – each generating all sorts of different IO sizes and patterns).
This is why I think a response that the need for better queue management is somehow solely a function of the array, or that NMP+Round Robin+ALUA are somehow equivalent to adaptive queuing mechanisms, is WAY, WAY off.
Let’s take these one at a time.
1. A shallow queue somewhere.
A shallow queue somewhere in the I/O path (or an overflowing port buffer) will cause the I/O to back off. You need the queues to be deep enough to withstand the bursts – sometimes increasing the queue depth is important. Now, if the problem isn’t actually the bursts, but the I/O service time not being sufficient for the sustained workload (aka you have a slow or underconfigured array), increasing the queue depth will help for only a fraction of a second, after which the deeper queue will still fill up – and now you’ve just increased the latency even more.
While most customers will never run into this problem, some do. In VMware land – this usually comes down to the fact that the default LUN queue depth (and the corresponding Disk.SchedNumReqOutstanding value) is 32 – which for most use cases is just fine, but when you have a datastore with many small VMs sitting on a single LUN, the possibility of microbursting patterns becomes more likely.
This is covered in this whitepaper, and summarized in this table (which I’ve referred to, along with Vaughn). In both the table and the real world, the column on the left (outstanding I/O per LUN) is generally not the factor that determines the maximum number of VMs – it’s the “LUN queue depth on each ESX host” column.
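The logic of that table boils down to simple division. Here’s a back-of-the-envelope sketch – the numbers are assumptions, so plug in your own queue depth and per-VM outstanding I/O figures:

```python
# Back-of-the-envelope sketch of the table's logic (numbers are assumptions).
# The binding constraint is usually the per-host LUN queue depth, not what the
# array can take per LUN.
def max_vms_per_datastore(lun_queue_depth, avg_outstanding_io_per_vm):
    """Rough ceiling on VMs per single-LUN datastore before the host LUN queue
    becomes the limit."""
    return lun_queue_depth // avg_outstanding_io_per_vm

print(max_vms_per_datastore(32, 4))   # default queue, moderately busy VMs  ->  8
print(max_vms_per_datastore(32, 1))   # default queue, very quiet VMs       -> 32
print(max_vms_per_datastore(64, 4))   # deepened queue, moderately busy VMs -> 16
```

Which is exactly why a datastore full of quiet VMs can hold far more than the old 12-16 rule of thumb.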
If you think you might be running into this problem – it’s pretty easy to diagnose. Launch esxtop and press “u” to switch to the disk device screen. You’ll see a table like this – and QUED shows the number of commands currently queued.
If this shows as 32 all the time or during “bad performance periods” – check the array service time. If it’s low (6-10ms), you should probably increase the queue depth. If you have a high array service time, then you should consider changing the configuration (usually adding more spindles to the meta object).
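If it helps, here’s that rule of thumb as a few lines of Python – a sketch of the triage logic in the paragraph above, where the thresholds are rough rules of thumb rather than hard limits:

```python
# Sketch of the triage rule above (thresholds are rough rules of thumb).
def triage(qued, lun_queue_depth, array_service_time_ms):
    """qued: the esxtop QUED value; array_service_time_ms: measured at the array."""
    if qued < lun_queue_depth:
        return "Host LUN queue isn't saturated - look elsewhere."
    if array_service_time_ms <= 10:
        return ("Array is fast but the host queue is pegged - consider deepening the "
                "HBA LUN queue depth and Disk.SchedNumReqOutstanding.")
    return ("Array service time is the real problem - fix the back end first "
            "(e.g. more spindles behind the meta object) before touching queue depths.")

print(triage(qued=32, lun_queue_depth=32, array_service_time_ms=6))
print(triage(qued=32, lun_queue_depth=32, array_service_time_ms=25))
```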
What about the array itself? Vaughn went on about the CLARiiON having an internal queue for element objects (metaLUN components) and a total queue for a meta-object (the LUN), whereas NetApp FAS filers have a “global queue” (he also pointed out that their target ports have a deeper queue). BTW – the idea of a “global queue” is common on devices that don’t use internal block semantics beyond the meta-object. It’s the same on an EMC Celerra serving an iSCSI LUN, for example. That doesn’t make one “virtualized” and one “legacy” – it just makes them different.
More importantly – if you understand the topic above, you see why it’s not right to extrapolate this out. Even in the very small example he pointed out from the CLARiiON document (which is an applied guide, not a chest-beating document), the array LUN queue is 88 – larger than the maximum that could be configured on the ESX host. Now, if the cluster were wide enough (3 or more nodes), the array LUN queue could indeed be the bottleneck.
From Vaughn’s post:
“It sure seems that the need for PP VE is a direct result of a less than VMware friendly design within the storage controller.
Maybe a better decision would be to look a NetApp virtualized storage array for your vSphere deployments. By design our arrays don’t have LUN queues. Instead they have target port queues, which are global, very deep, and when combined with RR the queues are aggregated. As I stated earlier each port has a queue of 2,000, or a single dual port target adapter has a queue of 4,000.
This design allows NetApp arrays to avoid the issues that can arise with such shallow queues. The virtualized architecture of a NetApp array is the ideal design for use with VMware’s Round Robin PSP as we don’t have the challenges associated with traditional legacy arrays.”
I won’t go the negative way. Heck, it would be nice to even use reasonable language – like a few qualifiers in that last paragraph… What do they serve in the water over there? :-)
I will say this – with all arrays – EMC’s, HP’s, HDS’s, IBM’s, and NetApp’s – all of them, the back-end design DOES matter, and you should look at it closely – and they are all wildly different architecturally. Each has its advantages and disadvantages – and usually what makes something an advantage in one circumstance makes it a disadvantage in another. This is why I’m personally of the opinion that it’s more about knowing how to leverage what you happen to have.
In the example he used, PP/VE would not help materially if there were more than 3 ESX hosts, as it would be a likely case of “underconfigured array” – not host-side queuing.
Let’s break this down:
The target maximums are a red herring for the most part. This is the 2000 vs. 1600 he points out – though I would call out that a dual-port example like the one they use is still a toy compared with a 4-port CX4 UltraFlex I/O module, and the smallest CX4s can have MANY of those I/O modules (EMC is still the only vendor, to my knowledge, that enables you to dynamically and non-disruptively change and reconfigure ports). The target port maximums are a red herring because they are generally not a limit except in the largest enterprise cases – and that isn’t caused by one host, but by literally hundreds or thousands of hosts attached to the array. Generally in those cases, customers aren’t looking at NetApp or EMC CLARiiON for block storage, but rather at enterprise arrays like IBM, HDS and EMC Symmetrix.
If you find your array service time is long, or the array LUN queue (if your array has one) is a problem – you need to fix that before you look at queue depths and multipathing. On EMC arrays – this can be done easily and is included as a basic array function. In the negative example Vaughn used, the document clearly calls out how it can be remedied – you could use a Virtual LUN operation, or expand the RAID group. While I’m not an expert on NetApp arrays, I trust Vaughn when he says they don’t have a LUN queue (like the similar architectural model of block objects on other filesystem-oriented devices such as the Celerra or Openfiler). I’d imagine that in their case, storage service time would be a function of the underlying FlexVol/Aggregate/RAID Group configuration, the workload that FlexVol/Aggregate was supporting, as well as the FAS platform type. I’m sure one could relatively easily and non-disruptively change the underlying aggregate/RAID Group configuration of a FlexVol containing a LUN in some fashion. I haven’t read the NetApp guides as closely as NetApp seems to read EMC’s, but I’m sure someone out there could provide the procedure or a link to the document in the comments. If you know how to do this, or can link to a post or whitepaper, please do.
The important point is that the queuing generally happens (even in that rinky-dinky 4+1 example that was referenced) at the host, far, far earlier than at the array LUN.
2. Unbalanced paths – this is a case where the queue depth, the number of hops and their corresponding buffers, or the array internals are non-symmetrical.
Remember that ALUA is “Asymmetric Logical Unit Access”. Asymmetric. ALUA is a standard which enables a mid-range array to use an internal interconnect between Storage Processors to service I/Os that arrive on the non-owning storage processor. Enterprise arrays have very large bandwidth between storage engines/directors (or whatever the vendor calls them) – so performance across all ports can be linear and symmetrical. Every mid-range array does this internal interconnect differently. I don’t claim to be an expert on how anyone else does it, but on a CX4 this is an internal 4- or 8-lane PCIe bus, depending on the model. I believe that this is a fairly large interconnect for a mid-range array. It’s an important architectural element of an ALUA configuration. Does anyone else know what it would be on the modern NetApp FAS family? If you do, and would like to comment, feel welcome. Now, the amount of bandwidth is high, sure, but compared to the internal bandwidth and latency of the “better” path through the storage processor owning the LUN – it’s decidedly asymmetrical.
To do an ALUA version of the diagram above, I’ve added purple lines to the diagram, showing data flowing down the “less good” paths, then across the internal bus.
ALUA is a good thing when a host MUST, for one reason or another, have an “active-active” model (and “just because I like the sound of it” isn’t a rational reason – and now that path persistence is fixed in vSphere, the old “MRU path non-persistence” isn’t a good one either) – but without adaptive queuing it is BAD. Simple round robin I/O distribution will drag the performance down to the level of the “less good” paths. That’s why I disagree with the statement I’ve heard others make: NMP+RR+ALUA = NMP+adaptive queuing+ALUA.
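A toy model of why (the path latencies below are made-up illustrative numbers, and this is nobody’s actual algorithm – it just shows the shape of the problem): with 1:1:1:1 round robin, the overall completion time is gated by the slower non-owner paths; weighting I/O toward the faster, less-queued paths – which is what adaptive/predictive schemes do – avoids that.

```python
# Toy model, not any vendor's algorithm: four paths to an ALUA LUN – two via the
# owning SP (fast) and two via the non-owning SP across the internal bus (slower).
# Latencies are made-up illustrative numbers.
path_latency_ms = {"owner-A": 2.0, "owner-B": 2.0, "nonowner-A": 6.0, "nonowner-B": 6.0}

def time_to_complete(ios, weights):
    """Spread `ios` across paths per `weights`; the slowest-finishing path gates the total."""
    total = sum(weights.values())
    return max(path_latency_ms[p] * ios * (w / total) for p, w in weights.items())

ios = 1000
round_robin = {p: 1.0 for p in path_latency_ms}                        # 1:1:1:1
adaptive    = {p: 1.0 / path_latency_ms[p] for p in path_latency_ms}   # favor fast paths

print("simple round robin :", time_to_complete(ios, round_robin), "ms")
print("latency-weighted   :", time_to_complete(ios, adaptive), "ms")
```

Run it and the weighted version finishes in about half the time – and that’s before you add an array that can tell the host which target ports are getting busy (the “predictive” part).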
3. Traffic patterns that are bursty.
Queue full conditions can be the root of slow steady-state (at the “large timescale”) performance on single-LUN datastores with many individually small-I/O VMs when there is no adaptive multipathing – particularly when comparing a single-LUN VMFS datastore vs. an NFS datastore. It’s not the SCSI reservation mechanism – which is widely, and incorrectly, FUDed as the root cause of this. SCSI reservations are the root cause of slow snapshots and VM create/delete on datastores undergoing many metadata updates (again, not a problem for many, many customers – just pointing it out for completeness). Why doesn’t NFS suffer similar behavior? The answer is in the diagram below:
While so many of the elements are the same, and there are still all the same stages of buffering/queuing, some key items are different. There is no LUN queue at the ESX host. The only LUN queues that matter here are those behind the meta/element objects (which in the case of NAS are the filesystem/volume management/block layout elements). This means that it’s all about network link speed and network buffers. When Vaughn and I were originally authoring the NFS joint post, we debated how to say this – does it “scale better”? Making one of those multi-vendor posts requires healthy back ’n’ forth. In the end, we agreed – it “scales differently”. If “scale” is peak IOps, or peak MBps/$, or lower CPU, or lower latency, or failover period, then NFS scales worse. If “scale” is “number of instantaneous I/Os in a short period within network port buffer limits”, it scales better.
It’s notable that NFS datastores are more analogous to spanned VMFS use cases – where the filesystem is supported by many back-end LUN objects. If you have multiple LUNs supporting a VMFS volume (see the article here), you get parallelism of the LUN queues, and can support larger numbers of VMs in that same datastore (see the quick arithmetic below). And note – the availability is statistically no better, and no worse, than multiple standalone VMFS datastores. Making spanned VMFS easier is an area of work (along with longer-term NFS client improvements) between the storage vendors and VMware.
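The arithmetic behind “parallelism of the LUN queues” is simple (assuming the default 32-deep host LUN queue, and that the VMs’ files actually end up spread across the extents – which isn’t guaranteed):

```python
# Illustrative arithmetic: each backing LUN in a spanned VMFS datastore brings its
# own (default ~32 deep) host-side LUN queue, so the datastore's ability to absorb
# a microburst scales with the number of extents - assuming the load spreads.
DEFAULT_LUN_QUEUE_DEPTH = 32

for backing_luns in (1, 2, 4, 8):
    aggregate = backing_luns * DEFAULT_LUN_QUEUE_DEPTH
    print(f"{backing_luns} backing LUN(s) -> ~{aggregate} outstanding I/Os "
          "before host-side queuing kicks in")
```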
But – in the end, this is why I personally think that combinations of NAS and block storage models are the most flexible for VMware (and frankly, in general). Most customers have VMs with all sorts of different SLAs. If you have a bunch of smaller-I/O VMs for whom the longer timeouts of NFS are good enough, you will be able to put more of them on NFS than you will on a single-LUN VMFS datastore. Conversely, if you have a bunch of VMs with a large-bandwidth, consistent I/O pattern that needs lower latency, VMFS will generally win. And if you have VMs that need a very fast failover model with a ton of predictability, generally VMFS on block wins. But matching the good behavior (no host-side LUN queues) of an NFS datastore without the bad would require a spanned VMFS datastore with many LUN queues and adaptive multipathing to handle the “spikiness” of the combined I/O of many (hundreds of) VMs. That’s why I keep saying – “one way, all the time” is not the right answer.
4. The blended workload of a lot of different hosts
This is where things in the original post that prompted this get outright silly – and I think it reflects a misunderstanding of block storage. The diagrams I’ve shown to date are ridiculously oversimplified because they show one host only, one I/O only. In the real world, people have many VI3.x and vSphere 4 clusters, each with many hosts hitting the network and the array. Further, they have other hosts of all types hitting the array. Each of those is transmitting I/Os of all sizes all the time. It’s organized chaos.
You can see immediately that even if the array could somehow perfectly serve I/Os, perfectly balanced, at all times (which none of us – none – can do) – there will be varying amounts of queuing and buffering at each element of the stack – from each host, to each host LUN, to each HBA, to each port in the entire fabric, to each array target.
These are the things that drive the need for adaptive queuing models.
TAKEAWAY:
One thing I was worried about as I wrote this was that it makes it sound REALLY complicated. It’s not – it’s very simple – I’m just exposing the deep bowels of how this works.
Heck, we humans all look pretty simple on the surface – two eyes, mouth, ears, nose, skin, arms, legs – so on and so forth, but open us up, and the innards get complex. If you start to look at how our neural networks work – well, no one understands that entirely :-)
BUT – we all know how to basically operate, and for most – that’s enough. Like right now – I know I’m hungry :-)
Summarizing then….
- Keep it simple. In almost all cases – you can use a single LUN per VMFS, and be happy with many, many more VMs than most people think (the 12-16 figure most people quote is based on ancient best practices).
- In the VMware View 3 reference architecture – we had 64 VMs per datastore on a LUN with a basic 4+1 configuration and no customization of queues – and it did just fine. With more spindles, deeper queues, and spanned VMFS – you could do hundreds.
- When I did a VMTN podcast call with John Troyer, I said you could have more than what people think – 32 and more. You’d think it was blasphemy the way people started to say “no way – 10 only, ever!” and stuff like that :-)
- With block workloads – you need to keep an eye on QUED as one of the primary “performance metrics”.
- This effect is worse if you don’t use DRS – having a whole bunch of VMs on one host (and one LUN queue) is worse than spreading them around a cluster.
- All arrays deal with queue full conditions differently – some better, some worse, but in all cases, you should try to avoid it. If you see QUED full – check the array service time. If it’s bad, make it faster – any block array worth its salt should enable you to do that non-disruptively. If it’s good, consider making the host queues (and the advanced parameter Disk.SchedNumReqOutstanding) larger.
- Having more queues can help (spanned VMFS) – this puts many backing LUNs behind a datastore, and since the VMs are spread around those LUNs, so is the load on the LUN queues.
- Multipathing behavior is important when it comes to improving this queue condition.
- Round robin is better than manual static load balancing (ergo vSphere 4 is better than VI3.x). Adaptive (or dynamically weighted round robin) is better than simple round robin (ergo PP/VE vs. NMP). Predictive (where the array target provides input to the dynamic, adaptive algorithm along with the host queue depth state – as EMC block platforms do) is best.
- Automated path policy selection is better than manual. Today, using RR requires MANUALLY selecting it for every target and every LUN, unless a vendor writes their own SATP. I would encourage NetApp and every vendor to consider doing that – it seems that Dell/EqualLogic was the next to do it after EMC. The best is where all path configuration – both path selection and path state management – happens automatically, which is what PP/VE does.
EMC absolutely supports native multipathing wherever we can. We support, embrace NMP, MPIO, ALUA.
We also look at things and say “how could we make this better”. This gives customers CHOICE. (including choosing that we’re wrong on all counts of course!)
PowerPath/VE makes vSphere multipathing better – for EMC arrays and 3rd-party arrays including HP, HDS, and (with RPQs) IBM. It abstracts out ALL this stuff. You install it, then don’t need to configure it. It requires no LUN-by-LUN changes – which at any reasonable scale starts to become, well, unwieldy.
Lastly – my advice – don’t listen when anyone calls their array “virtualized” and implies that this solves everything under the sun (EMC – and if I look closely in the mirror, that includes me – is as guilty of this as anyone; heck, look at V-Max). While I guess there’s always some need for marketing, in every respect that matters all arrays are “virtualized”.
This stuff has lots of moving parts under the covers.
Thanks for investing the time here – and I hope this helps!
In my simple mind, it seems that this "debate", if you want to call it that, between yourself and Vaughn about the viability and need for PP/VE can be summed up by saying that PP/VE tries to account for misconfiguration, congestion and other factors in the fabric between the ESX host and the storage array. My understanding is that it does not configure queue depths (or really anything) on the array at all, but rather always expects the array to be correctly configured and capable of servicing all IO requests as quickly as possible (please correct me if I'm wrong).
Assuming the storage array services all requests adequately and equally, PP/VE attempts to measure the effect the fabric has on requests and then use the most appropriate path to reach the array.
Vaughn is assuming that the fabric is not the bottleneck (or at least he hasn't addressed it), but rather the issue lies at the array and the internal (SCSI) queue it has assigned to the LUN. In an ideal world it would be that simple...multiple hosts all connected to the same LUN, each with a client queue depth, and the aggregate of those causes an "overrun" of the LUN queue on the array.
Two different subjects, but both are important. Thanks for the post and keep up the good work!
Posted by: Andrew | June 21, 2009 at 10:24 AM
Thanks for the comment, Andrew. I think you got it, but I want to highlight a couple things.
You have to remember that both individual hosts and groups of hosts are all sending all sorts of different I/O sizes all the time.
This isn't "misconfiguration" - it's the normal course of business on a busy network and dynamic host/cluster (particularly true in VMware environments, which consolidate the I/O and due to DRS move that I/O around the cluster).
There will always be some degree of randomness on any given host, the initiators on those hosts, on groups of hosts, on the various ports of the fabric (which in most customers have various degrees of fan-in/fan-out and multiple aggregation and core layers), and the target ports on the arrays.
As you say - all the stuff above (host/fabric) is vendor-independent. You note that Vaughn assumes that the fabric isn't the bottleneck - but consider that this isn't just the fabric (which generally is NOT the bottleneck without design errors in fan-in) but is also the host itself - which can indeed be bottlenecked, particularly with relatively shallow queue depths.
Also remember that DRS will be moving the virtual machines around the cluster - periodically loading some host queues more deeply than others (and currently there is no I/O governance across a cluster).
My point is that everything outside the array is VERY dynamic, very chaotic.
That's before you even hit the "magic" of the internals of the array itself.
If you assume that the array is an "idealized" target, able to support all I/O devices equally, then it works one way. But the "idealized" target is just that. Idealized.
All I/O devices have different mechanics here - and layout internals do matter (in the CLARiiON case, it's the RAID Group/MetaLUN/LUN construct - which has LUN queue constructs throughout, in the NetApp FAS case, it's the RAID Group/Aggregate/FlexVol/LUN construct - which has no LUN queues except at the bottom).
Further, while I don't claim to be a Netapp expert - it seems to me that Ontap (whatever underlying OS it is) must have a driver stack for the HBAs they use in target mode. All HBA driver stacks have both the queues for HBA (the target queue) and a LUN queue. I've given them the benefit of the doubt here, as I don't claim to be an expert on FAS/Ontap internals.
It makes absolute sense that once the I/O is in the Ontap layer they don't have LUN queues, as there is no LUN construct analogous to block-centric array designs.
This is where there can be legitimate debate over the architectural benefits of "big virtual pool" models aka "go wide" models vs. "focused layout models" aka "this LUN lives here" layout models - both of which EMC supports on all the arrays.
But, beyond the academic sense, there is no "right/wrong" answer, IMHO.
The first group (pool models) are generally harder to screw up for average customers, and assume everything all averages out. The latter (focused models) involve more design thinking by the human, and can end up more "imbalanced", but on the other hand when there are some workloads very different than others, can sometimes be more efficient.
Now, then you get to the idea of LUN queues that are internal throughout almost all "block centric arrays".
I refuse to call Netapp "emulated block" as others have in a pejorative way - everything "emulates" stuff... "emulate" is the negative way of saying "abstract" or "virtualize". Likewise, I think it's silly to call the rest of the non-"file centric arrays", "legacy".
If you "go wide" with a pool-sytle layout design - you essentially are using huge numbers of spindles behind each LUN (and have correspondingly very deep queues). There are use cases where both types of layout (go pool vs. focused) models "win".
BTW - PowerPath/VE does indeed measure the array target queue depth on EMC arrays as a method of predictively adjusting best path selection. This won't fix a misconfigured internal LUN, but it will predictively distribute load among target ports - important, as I noted, in larger configurations (where port congestion is a real consideration). Remember that this target port imbalance is not a construct of "misconfiguration", but the natural variability in the mixed workloads of all the hosts attached to the array.
Lastly - this is ALL almost silly - and an academic argument (though that seems to sadly be one of the main functions of these technical blogs) for most customers, who can be perfectly happy in all design models - and have success/failure more often based on what they know rather than what they bought :-)
Thanks again!
Posted by: Chad Sakac | June 21, 2009 at 11:23 AM
Reading your entries shows me how much I have to learn about storage. :)
Two things I can add to the discussion:
1) For reasons that I cannot defend, you have to change *two* things with VI3 and vSphere to increase the queue depth. VC must be informed (as you've detailed) but the HBA driver must be configured, too. See page 107 of VMware's SAN configuration guide: http://www.vmware.com/pdf/vi3_301_201_san_cfg.pdf.
2) The simplest way to know if and by how much the queue is overflowing is to check kernel latencies in the storage screens of esxtop and VC. Under normal conditions, kernel latency should be 0ms, representing no waiting in the kernel (queuing). Even 1ms kernel latencies mean that commands are waiting in the kernel to get to the storage queue. In this case storage performance would benefit from a larger storage queue. A number above even 3ms requires immediate attention.
Scott
Posted by: Scott Drummonds | June 21, 2009 at 03:37 PM
Thanks Scott - you bet - you've got to change the QLogic and Emulex module params, and the SAN config doc does a good job spelling it out.
No need to defend - it's a function of the fact that the vmkernel does its own admittance for every disk object.
And, man, each of us has our area of focus - you forget more about VMware performance than I will learn in a lifetime :-)
Each of these domains is big enough for one human brain - the key is to share, and share transparently.
See you at VMworld!
Posted by: Chad Sakac | June 21, 2009 at 09:48 PM
Let me tell you a story that somewhat supports Vaughn's reasoning about badly designed block arrays (I don't know how EMC arrays would react to such a workload):
I (think I) have encountered a queue full effect on an HP EVA 4100 array that can easily happen. We had a bunch of Windows servers that had Adaptive Load Balancing activated (which sent requests on all 4 paths) and they hit hard (Windows queue depths over 30) a SATA disk group that responded slowly.
The problem is that the other servers (including ESX) that hit another disk group made of high-performance disks had very poor performance as well, and that really shouldn't happen on a properly designed system (the system should back off the hard-hitting servers and leave room for the other disk groups, in my opinion).
Once we disabled load balancing, the system began behaving normally.
Posted by: Mihai | June 24, 2009 at 03:35 AM
Mihai - thanks for the comment.
My experience? All arrays can be misconfigured - all of them. EMC, HP, IBM, NetApp. And in each of the cases, the more you move out of the "mainstream" of the vendor's focus, the difficulty increases.
Now - I don't claim to be an expert on HP (those that are can chime in!), but I believe that their Adaptive Load Balancing feature is a feature of the HP MPIO Device Specific Module (DSM). The DSM is kind of like the SATP/PSP in the VMware Pluggable Storage Architecture. ALB is important in the ALUA use case, and changes the behavior from standard load balancing. By default, the Windows MPIO load balancing behavior will direct I/O down all paths (which, when ALUA is used, include both the owning SP and the non-owning SP).
Now, I know that in your case, turning on ALB hurt, but the fact that ALB exists highlights the downside of ALUA (paths down non-owning SP ports are "less good") - and HP wouldn't have added the ALB function to their DSM if it wasn't important.
This idea applies to all midrange arrays. In NetApp's case, the "global queues" aren't global across the two filer heads in the FAS cluster (again, I don't claim to be a NetApp expert - those that are can chime in!). I'm basing that on my experience with NetApp - each of the FAS heads is, in effect, managed as a separate device, and they mirror NVRAM over the internal interconnect.
Remember, that's what ALUA does - it presents ALL the ports across a mid-range cluster as paths to the LUN.
Now - IMHO - it sounds to me like what you saw was an issue with the ALB DSM. This highlights the difficulty - host-side multipathing is not easy engineering work. People sometimes believe that this is "simple, and all done", but it's not.
PowerPath has more developers on it than any other multipathing product on the market (which makes sense, as it's used by the majority of the block storage market).
PowerPath is not perfect, and I've run into my own fair share of cases where PP itself was causing a problem. Multipathing, however, is one of those very "unsexy" things - when it works, it's invisible and saves your bacon. When it doesn't work, it's VERY visible.
Posted by: Chad Sakac | June 24, 2009 at 09:57 AM
What about the SP write cache? When writing to a Clariion wouldn't the cache absorb bursts prior to maxing out the LUN queue maximums? Am I correct in my understanding that the SP cache resides between the SP maximums and the meta object maximums?
Thanks
Posted by: Cass Snider | June 25, 2009 at 03:13 PM
For the NetApp interconnect, twould be Infiniband (I wasn't sure if this is confidential info but a quick Google on "netapp infiniband" turns it up nicely).
Posted by: Andrew | July 06, 2009 at 03:57 PM
Great article! I'll preface my comment with the fact that I work for Atlantis Computing. That said, I think you will find that there are alternative ways to address some of these challenges. Atlantis ILIO is one of them. I recently had the opportunity to work with customers who had IBM XIV and Pillar storage. Both storage solutions delivered extremely high IOPS due to their distributed nature. Of course, once we ran Atlantis in front of them, IOPS capacity was increased by 2X to 4X and the hypervisor queuing went away. Personally I believe that regardless of how companies overcome their IOPS challenges, using tools like Liquidware Labs will help clearly define the problem for a specific environment and how much improvement is achieved for each fix.
Posted by: Robert Kadish | July 01, 2010 at 08:37 PM
Thank you very much!! Awesome work guys
Posted by: Joshua | March 07, 2012 at 01:52 PM
Great work !!! Thanks for it !! May I ask you a question ... just to put this in real terms, with an update as of May 2012... I have UCS blade servers with "Palo" CNA adapters. That means we put the FC and the Ethernet on the same 10GB wire. How the !#$%!%&$@ do you configure queue depth on these adapters ?!?!? I have been asking this same question since September 2009 and nobody has been able to give me a decent response ... I have seen some pretty beautiful attempts ... but I'm still not quite comfortable with it ...
Now we have a big cloud in service with several hundred customers and almost a thousand VMs ... and some weird things happening sometimes ... Help would be welcome
Posted by: Gustavo Ossandon | May 03, 2012 at 07:33 PM
On an EMC array (VMAX, FAST VP) what is the best way to reduce the FA queuing? We have our ESX disk queues increased, but we tend to have larger clusters with up to 8 nodes. In I/O bursts we have seen the FA queues back up. It seems the best way to handle this is to add more TDEVs to the meta. Would you agree?
Posted by: Bryan | April 04, 2013 at 08:34 AM
This is a fantastic post (I know it is from quite some time ago). Great work Chad. I was a lead developer on ATF and PowerPath for years, and many of the challenges from 14 years ago are still present today. Multipathing and availability remain misunderstood.
Posted by: Markkulacz | December 07, 2013 at 12:26 PM