
June 20, 2009

Comments


Andrew

In my simple mind, it seems that this "debate", if you want to call it that, between yourself and Vaughn about the viability and need for PP/VE can be summed up by saying that PP/VE tries to account for misconfiguration, congestion and other factors in the fabric between the ESX host and the storage array. My understanding is that it does not configure queue depths (or really anything) on the array at all, but rather always expects the array to be correctly configured and capable of servicing all IO requests as quickly as possible (please correct me if I'm wrong).

Assuming the storage array services all requests adequately and equally, PP/VE attempts to measure the effect the fabric has on requests and then use the most appropriate path to reach the array.

Vaughn is assuming that the fabric is not the bottleneck (or at least he hasn't addressed it), but rather the issue lies at the array and the internal (SCSI) queue it has assigned to the LUN. In an ideal world it would be that simple...multiple hosts all connected to the same LUN, each with a client queue depth, and the aggregate of those causes an "overrun" of the LUN queue on the array.
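To put rough numbers on that "overrun" scenario, here's a back-of-the-envelope sketch in Python. The values are purely hypothetical examples, not any vendor's specifications:

```python
# Back-of-the-envelope: aggregate host-side queue depth vs. an array-side LUN queue.
# All values are hypothetical examples, not vendor specifications.

hosts = 16                   # ESX hosts in the cluster sharing one LUN
per_host_lun_queue = 32      # LUN queue depth configured on each host's HBA
array_lun_queue = 256        # queue the array assigns to that LUN (example value)

aggregate = hosts * per_host_lun_queue
print(f"Worst-case outstanding I/Os from the cluster: {aggregate}")

if aggregate > array_lun_queue:
    print("Array LUN queue can be overrun -> QUEUE FULL / TASK SET FULL responses")
else:
    print("Worst case fits within the array LUN queue")
```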

Two different subjects, but both are important. Thanks for the post and keep up the good work!

Chad Sakac

Thanks for the comment, Andrew. I think you got it, but I want to highlight a couple things.

You have to remember that both individual hosts and groups of hosts are all sending all sorts of different I/O sizes all the time.

This isn't "misconfiguration" - it's the normal course of business on a busy network and a dynamic host/cluster (particularly true in VMware environments, which consolidate I/O and, due to DRS, move that I/O around the cluster).

There will always be some degree of randomness on any given host, on the initiators on those hosts, on groups of hosts, on the various ports of the fabric (which at most customers has various degrees of fan-in/fan-out and multiple aggregation and core layers), and on the target ports of the arrays.

As you say - all the stuff above (host/fabric) is vendor-independent. You note that Vaughn assumes the fabric isn't the bottleneck - but consider that this isn't just about the fabric (which generally is NOT the bottleneck absent design errors in fan-in) but also the host itself - which can indeed be bottlenecked, particularly with relatively shallow queue depths.

Also remember that DRS will be moving the virtual machines around the cluster - periodically loading some host queues more deeply than others (and DRS currently has no I/O governance across a cluster).

My point is that everything outside the array is VERY dynamic, very chaotic.

That's before you even hit the "magic" of the internals of the array itself.

If you assume that the array is an "idealized" target, able to support all I/O devices equally, then it works one way. But the "idealized" target is just that. Idealized.

All I/O devices have different mechanics here - and layout internals do matter (in the CLARiiON case, it's the RAID Group/MetaLUN/LUN construct, which has LUN queue constructs throughout; in the NetApp FAS case, it's the RAID Group/Aggregate/FlexVol/LUN construct, which has no LUN queues except at the bottom).

Further, while I don't claim to be a NetApp expert - it seems to me that ONTAP (whatever the underlying OS is) must have a driver stack for the HBAs they use in target mode. All HBA driver stacks have both a queue for the HBA (the target queue) and a LUN queue. I've given them the benefit of the doubt here, as I don't claim to be an expert on FAS/ONTAP internals.

It makes absolute sense that once the I/O is in the ONTAP layer they don't have LUN queues, as there is no LUN construct analogous to those in block-centric array designs.

This is where there can be legitimate debate over the architectural benefits of "big virtual pool" (aka "go wide") models vs. "focused layout" (aka "this LUN lives here") models - both of which EMC supports on all its arrays.

But, beyond the academic sense, there is no "right/wrong" answer, IMHO.

The first group (pool models) is generally harder to screw up for the average customer, and assumes everything averages out. The latter (focused models) involves more design thinking by a human and can end up more "imbalanced", but on the other hand, when some workloads are very different from others, it can sometimes be more efficient.

Then you get to the idea of the LUN queues that are internal throughout almost all "block-centric" arrays.

I refuse to call NetApp "emulated block" as others have in a pejorative way - everything "emulates" stuff... "emulate" is the negative way of saying "abstract" or "virtualize". Likewise, I think it's silly to call the rest - the non-"file-centric" arrays - "legacy".

If you "go wide" with a pool-sytle layout design - you essentially are using huge numbers of spindles behind each LUN (and have correspondingly very deep queues). There are use cases where both types of layout (go pool vs. focused) models "win".

BTW - PowerPath/VE does indeed measure the array target queue depth on EMC arrays as a method of predictively adjusting path selection. This won't fix a misconfigured internal LUN queue, but it will predictively distribute load among the target ports - important, as I noted, in larger configurations (where port congestion is a real consideration). Remember that this target port imbalance is not a result of "misconfiguration", but of the natural variability in the mixed workloads of all the hosts attached to the array.
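To make the "predictively distribute load" idea concrete, here is a toy sketch of load-weighted path selection. This is purely illustrative - it is not the PowerPath/VE algorithm, and the weighting, path names, and numbers are made up:

```python
# Purely illustrative "least-loaded path" selector - NOT the PowerPath/VE
# algorithm. The idea: weight path choice by observed per-target-port load
# rather than simple round-robin.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    outstanding_ios: int     # I/Os currently in flight on this target port
    outstanding_kb: int      # data in flight on this target port, in KB

def pick_path(paths):
    # Score each path by its pending work; lower is better. The weighting
    # (1 I/O ~ 64 KB) is an arbitrary placeholder.
    return min(paths, key=lambda p: p.outstanding_ios + p.outstanding_kb / 64)

paths = [
    Path("vmhba1:C0:T0:L5", outstanding_ios=12, outstanding_kb=6144),
    Path("vmhba2:C0:T1:L5", outstanding_ios=3,  outstanding_kb=256),
]
print("Next I/O goes down:", pick_path(paths).name)
```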

Lastly - this is ALL almost silly - an academic argument (though that sadly seems to be one of the main functions of these technical blogs) for most customers, who can be perfectly happy with any of the design models - and whose success or failure is more often based on what they know than on what they bought :-)

Thanks again!

Scott Drummonds

Reading your entries shows me how much I have to learn about storage. :)

Two things I can add to the discussion:
1) For reasons that I cannot defend, you have to change *two* things with VI3 and vSphere to increase the queue depth. VC must be informed (as you've detailed), but the HBA driver must be configured, too (a rough sketch of how the two settings interact follows below). See page 107 of VMware's SAN configuration guide: http://www.vmware.com/pdf/vi3_301_201_san_cfg.pdf.
2) The simplest way to know if and by how much the queue is overflowing is to check kernel latencies in the storage screens of esxtop and VC. Under normal conditions, kernel latency should be 0ms, representing no waiting in the kernel (queuing). Even a 1ms kernel latency means that commands are waiting in the kernel to get to the storage queue; in that case storage performance would benefit from a larger storage queue. A number above even 3ms requires immediate attention.
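On the first point, here is a rough sketch of why both knobs matter. It assumes that, with more than one active VM on a LUN, the scheduler setting caps outstanding I/Os (see the SAN configuration guide for the authoritative description):

```python
# Rough sketch of how the two knobs interact. With more than one active VM on
# a LUN, the effective limit is roughly the smaller of the HBA LUN queue depth
# and Disk.SchedNumReqOutstanding (default 32 in this generation of ESX).

def effective_outstanding_ios(hba_lun_qdepth, sched_num_req_outstanding, active_vms):
    if active_vms <= 1:
        return hba_lun_qdepth
    return min(hba_lun_qdepth, sched_num_req_outstanding)

# Raising only the HBA queue depth buys nothing if the scheduler setting
# is left at its default:
print(effective_outstanding_ios(64, 32, active_vms=4))   # -> 32
print(effective_outstanding_ios(64, 64, active_vms=4))   # -> 64
```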

Scott

Chad Sakac

Thanks Scott - you bet - you've got to change the QLogic and Emulex module params, and the SAN config doc does a good job of spelling it out.

No need to defend it - it's a function of the fact that the vmkernel does its own admittance control for every disk object.

And, man, each of us has our own area of focus - you've forgotten more about VMware performance than I will learn in a lifetime :-)

Each of these domains is big enough for one human brain - the key is to share, and share transparently.

See you at VMworld!

Mihai

Let me tell you a story that somewhat supports Vaughn's reasoning about badly designed block arrays (I don't know how EMC arrays would react to such a workload):

I (think I) encountered a queue-full effect on an HP EVA 4100 array, and it's one that can easily happen. We had a bunch of Windows servers with Adaptive Load Balancing activated (which sent requests down all 4 paths), and they hit a slowly-responding SATA disk group hard (Windows queue depths over 30).

The problem is that the other servers (including ESX) that hit another disk group, made of high-performance disks, also had very poor performance - and that really shouldn't happen on a properly designed system (in my opinion, the system should back off the hard-hitting servers and leave room for the other disk groups).

Once we disabled load balancing, the system began behaving normally.

Chad Sakac

Mihai - thanks for the comment.

My experience? All arrays can be misconfigured - all of them. EMC, HP, IBM, NetApp. And in each case, the further you move out of the "mainstream" of the vendor's focus, the more the difficulty increases.

Now - I don't claim to be an HP expert (those that are can chime in!), but I believe their Adaptive Load Balancing (ALB) feature is part of the HP MPIO Device Specific Module (DSM). The DSM is kind of like the SATP/PSP in the VMware Pluggable Storage Architecture. ALB is important in the ALUA use case, and changes the behavior from standard load balancing. By default, the Windows MPIO load-balancing behavior will direct I/O down all paths (which, when ALUA is used, include both the owning SP and the non-owning SP).

Now, I know that in your case turning on ALB hurt, but the fact that ALB exists at all highlights the downside of ALUA (paths down non-owning SP ports are "less good") - HP wouldn't have added the function to their DSM if it wasn't important.

This idea applies to all midrange arrays. In NetApp's case, the "global queues" aren't global across the two filer heads in the FAS cluster (again, I don't claim to be a NetApp expert - those that are can chime in!). I'm basing that on my experience that each of the FAS heads is, in effect, managed as a separate device, and mirrors NVRAM over the internal interconnect.

Remember, that's what ALUA does - it presents ALL the ports across a midrange cluster as paths to the LUN.
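As a generic illustration of why the ALUA distinction matters, here is a simplified sketch of path preference. It is not any vendor's DSM or the VMware SATP/PSP logic, and the state labels are hypothetical:

```python
# Generic illustration of ALUA-aware path preference, not any vendor's DSM:
# prefer paths through the owning SP ("optimized"), and only fall back to the
# non-owning SP ("non-optimized") when no optimized path is available.

def usable_paths(paths):
    # paths: list of (name, state) where state is "optimized",
    # "non-optimized", or "dead" -- hypothetical labels for this sketch.
    optimized = [p for p, s in paths if s == "optimized"]
    if optimized:
        return optimized
    return [p for p, s in paths if s == "non-optimized"]

paths = [
    ("SPA-port0", "optimized"),
    ("SPA-port1", "optimized"),
    ("SPB-port0", "non-optimized"),
    ("SPB-port1", "dead"),
]
print(usable_paths(paths))   # -> ['SPA-port0', 'SPA-port1']
```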

Now - IMHO - it sounds to me like what you saw was an issue with the ALB behavior in the DSM. This highlights the difficulty - host-side multipathing is not easy engineering work. People sometimes believe that it is "simple, and all done", but it's not.

PowerPath has more developers on it than any other multipathing product on the market (which makes sense, as it's used by the majority of the block storage market).

PowerPath is not perfect, and I've run into my own fair share of cases where PP itself was causing a problem. Multipathing, however, is one of those very "unsexy" things - when it works, it's invisible and saves your bacon. When it doesn't work, it's VERY visible.

Cass Snider

What about the SP write cache? When writing to a CLARiiON, wouldn't the cache absorb bursts before the LUN queue maximums are hit? Am I correct in my understanding that the SP cache sits between the SP maximums and the meta object maximums?

Thanks

Andrew

For the NetApp interconnect, it would be InfiniBand (I wasn't sure if this is confidential info, but a quick Google on "netapp infiniband" turns it up nicely).



Robert Kadish

Great article! I'll preface my comment with the fact that I work for Atlantis Computing. That said, I think you will find that there are alternative ways to address some of these challenges; Atlantis ILIO is one of them. I recently had the opportunity to work with customers who had IBM XIV and Pillar storage. Both storage solutions delivered an extremely high IOPS load due to their distributed nature. Of course, once we ran Atlantis in front of them, IOPS capacity increased by 2X to 4X and the hypervisor queuing went away. Personally, I believe that regardless of how companies overcome their IOPS challenges, using tools like Liquidware Labs will help clearly define the problem for a specific environment and how much improvement is achieved by each fix.

Joshua

Thank you very much!! Awesome work, guys.

Gustavo Ossandon

Great work!!! Thanks for it!! May I ask you a question... just to put this in real terms, and bring it up to date to May 2012... I have UCS blade servers with "Palo" CNA adapters. That means we put the FC and the Ethernet on the same 10GB wire. How the !#$%!%&$@ do you configure queue depth on these adapters?!?!? I have been asking this same question since September 2009 and nobody has been able to give me a decent answer... I have seen some pretty beautiful attempts... but I'm still not quite comfortable with any of them...
Now we have a big cloud service with several hundred customers and almost a thousand VMs... and some weird things happening sometimes... Help would be welcome.

Bryan

On an EMC array (VMAX, FAST VP), what is the best way to reduce the FA queuing? We have our ESX disk queues increased, but we tend to have larger clusters with up to 8 nodes. In I/O bursts we have seen the FA queues back up. It seems the best way to handle this is to add more tdevs to the meta. Would you agree?

Markkulacz

This is a fantastic post (I know it is from quite some time ago). Great work, Chad. I was a lead developer on ATF and PowerPath for years, and many of the challenges from 14 years ago are still present today. Multipathing and availability remain misunderstood.

The comments to this entry are closed.


Disclaimer

  • The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by Dell Technologies and does not necessarily reflect the views and opinions of Dell Technologies or any part of Dell Technologies. This is my blog; it is not a Dell Technologies blog.