UPDATE (Feb 11, 2009) - after continued work and feedback from EMCers and others, this has been narrowed down a bit. This behavior doesn't affect the Fixed policy, only the MRU policy. I'll be updating this again with more data (and some videos in action) shortly, but wanted to post the correction ASAP.
It's almost like I'm prepping for a VMworld session :-)
It's funny - I signed up for a session on this topic at VMworld Europe, and ever since then, customer stuff has been popping out of the woodwork on the topic. This is the thing that prompted the "Multivendor iSCSI" post, and this one is for my upcoming VMworld Europe breakout session: TA11 - Best Practices to Increase Availability and Throughput for the Future of VMware
OK, so what are we talking about today? Well - it's a topic which I first started to see people discuss mid-year and which has been heating up recently, and while I'm not the first to discuss it, I wanted to chime in...
Most people know that in ESX 3.5, the NMP (Native Multipathing) has only a "Fixed" and "MRU" (Most Recently Used) policy, with "Round Robin" being experimental - more on that, and the Pluggable Storage Architecture in the VMworld breakout session. This post affects ALL block storage devices using the MRU policy, regardless of FCoE, FC, iSCSI (with a couple notable exceptions) interconnect.
What most customers do is carefully statically load-balance their configurations - picking given active paths for some LUNs and different active paths for others. Not load-balancing per se, but static path configuration.
Much to their consternation, after careful setup, they come back and find all the traffic going down a single path. What's the scoop? Read on for the answer and for a solution.
This wouldn't be a Chad post without a lot of pre-reading :-) I've realized that it's how I really get to understand issues and resolutions, and then I force it on you :-)
The key VMware docs to read on this are the generally excellent Fibre Channel SAN Configuration Guide and the iSCSI SAN Configuration Guide.
OK, first understanding the config options... From the iSCSI SAN configuration guide....
- An active/active disk array, which allows access to the LUNs simultaneously through all the storage processors that are available without significant performance degradation. All the paths are active at all times (unless a path fails).
- An active/passive disk array, in which one storage processor (SP) is actively servicing a given LUN. The other SP acts as backup for the LUN and can be actively servicing other LUN I/O. I/O can be sent only to an active processor. If the primary SP fails, one of the secondary storage processors becomes active, either automatically or through administrator intervention.
- A virtual port storage system, which allows access to all available volumes through a single virtual port. These are active-active storage devices, but hide their multiple connections through a single port. The ESX Server multipathing has no knowledge of the multiple connections to the storage. These storage systems handle port failover and connection balancing transparently. This is often referred to as "transparent failover."
Ok, VMware needs to describe these in general terms, but I will give examples.
An "Active/Active" array using VMware's definition would be something like an EMC Symmetrix, an HDS USP (rebranded as HP XP) or an IBM DS8000. Array vendors usually call these "Enterprise" arrays. Not only do they have an "active/active LUN ownership" model, but their performance typically stays constant even with significant element failure.
An "Active/Passive" array using VMware's definition would be an EMC CLARiiON, an HP EVA, a NetApp FAS or an LSI Engenio (rebranded as several IBM midrange platforms). These are usually called "Mid-range" arrays by the array vendors. It's notable that all the array vendors (including EMC) call these "Active/Active" - so we have a naming conflict (hey... "SRM" to storage people means "Storage Resource Management" - not "Site Recovery Manager" :-) ). They are "Active/Active" in the sense that, historically, both "brains" (storage processors) can carry an active workload at the same time, but not for a single LUN. I say historically, because they can also support something called "ALUA", or "Asymmetric Logical Unit Access" - where LUNs that are "owned" by one storage processor can be accessed via ports on the other using an internal interconnect - each vendor's implementation and internal interconnect varies. This is moot for the topic of load-balancing a given LUN with ESX 3.5, though, because until the next major release, ALUA is not supported. I prefer to call this an "Active/Passive LUN ownership" array. The other big standout is that these "midrange" Active/Passive arrays lose half their "brains" (each vendor calls these something different) if one fails - so either you accept that and oversubscribe, accepting some performance degradation if you lose a brain (acceptable in many use cases), or you use the array to only 50% of its performance envelope.
A "Virtual Port Storage System" using VMware's definition would be a Dell/EqualLogic or HP/Lefthand platform. These use iSCSI redirection to dynamically react to IP ports becoming unreachable - something not possible in FC-land. The failover behavior in these cases is truly transparent to the ESX stack, except insofar as it depends on timeouts in TCP/IP timescales (see the note in the VMware iSCSI doc about extending guest timeouts - again, acceptable in many use cases, but not all).
In either the Fixed or the MRU case, the behavior is that every LUN has an active path and one or more standby paths. The thing that changes is what happens AFTER a failed path is restored. Fixed returns the dataflow back to the original active path. MRU leaves the dataflow on its failed-over path. Why the distinction? Well - consider these cases:
- On an Active/Passive LUN ownership array, let's say a storage processor dies. In a few minutes, the LUN becomes visible on the new path. How this works varies. On a CX, the array does what's called a LUN Trespass - where it changes owners to the remaining Storage Processor. On a NetApp array, the failed head "boots" as a JVM on the remaining head. So what happens when you replace the failed head? Well, again, it varies a bit. In the CX case, it doesn't move the LUN back to the original SP. So, if the old path came back (which it would), and the ESX server switched back, the path would be live, but the LUN wouldn't be there - uh-oh.
- Another common case is the CLARiiON NDU (Non-disruptive Upgrade) - where the process above happens a couple times in tight sequence.
You want it to use MRU to avoid "path thrashing" and excessive LUN trespasses.
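To make the Fixed vs. MRU distinction concrete, here is a minimal Python sketch of the failback behavior described above. It is a toy model, not the NMP's actual implementation: the `Lun` class, its methods, and the `vmhba` path names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Lun:
    """Toy model of one LUN; paths[0] plays the role of the preferred path."""
    name: str
    paths: list            # hypothetical path names, in preference order
    policy: str            # "fixed" or "mru"
    active: str = ""
    failed: set = field(default_factory=set)

    def __post_init__(self):
        # Start on the preferred (first) path.
        self.active = self.paths[0]

    def fail(self, path):
        """Mark a path dead; if it was active, fail over to the first survivor."""
        self.failed.add(path)
        if self.active == path:
            survivors = [p for p in self.paths if p not in self.failed]
            if survivors:
                self.active = survivors[0]

    def restore(self, path):
        """Bring a path back. Only Fixed fails back to the preferred path."""
        self.failed.discard(path)
        if self.policy == "fixed" and path == self.paths[0]:
            self.active = path
        # MRU: do nothing, the dataflow stays on the most recently used path.


paths = ["vmhba1:0:0", "vmhba2:0:0"]   # made-up identifiers
fixed = Lun("lun0", paths, "fixed")
mru = Lun("lun1", paths, "mru")

for lun in (fixed, mru):
    lun.fail("vmhba1:0:0")
    lun.restore("vmhba1:0:0")

print(fixed.name, fixed.policy, fixed.active)
print(mru.name, mru.policy, mru.active)
```

In this model the Fixed LUN ends up back on its original path while the MRU LUN stays where it failed over to, which is the behavior you want when the array has not moved LUN ownership back.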
And that's why when you connect an ESX Server host to an EMC CLARiiON (at least since ESX 3.5), it sets MRU by default. This, by the way, is where you make that setting change manually, and manually set up your paths.
Ok.... Good so far. Most people do this right.
What most people I talk to DON'T know is that in ESX 3.5 (inclusive of update 3), the static path selection you make (IF you are using the MRU policy) isn't permanent. Every time the host boots, the LUNs default their active path to the first enumerated path. This affects all arrays - regardless of whether they are active/active or active/passive (it doesn't affect the iSCSI-only "Virtual Port Storage Systems"). This means that after a while, once you've run some VUM remediations, you'll find one of your array ports abnormally busy.
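The effect is easy to see with a few lines of Python. This is only a sketch of the behavior described above, under simplifying assumptions: the path names and LUN count are made up, every LUN is assumed reachable on every path, and the "reboot" is modeled simply as every LUN falling back to the first path in enumeration order.

```python
from collections import Counter

# Hypothetical paths, listed in the order the host enumerates them.
paths = ["vmhba1:0", "vmhba1:1", "vmhba2:0", "vmhba2:1"]
luns = [f"lun{i}" for i in range(8)]


def static_plan(luns, paths):
    """The admin's careful manual layout: spread LUNs across paths round-robin."""
    return {lun: paths[i % len(paths)] for i, lun in enumerate(luns)}


def after_reboot(plan, paths):
    """Model of the MRU behavior: each LUN's active path resets to the first enumerated path."""
    return {lun: paths[0] for lun in plan}


plan = static_plan(luns, paths)
before = Counter(plan.values())
after = Counter(after_reboot(plan, paths).values())

print("LUNs per path before reboot:", dict(before))
print("LUNs per path after reboot: ", dict(after))
```

Before the reboot each path carries two LUNs; afterwards all eight sit on the first path, which is the "one abnormally busy array port" symptom.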
Crediting some important folks....
I don't claim to be the first one to post on this topic, but anytime I'm composing a complete answer, it starts with a search, and I always want to give credit where it is due....
One way around this is to configure the multipathing based on the VML and not the vmhba identifier (which can be done at the CLI using esxcfg-mpath). The VML to LUN relationship is shown in the /vmfs/devices/disks directory - just do an ls -l. The always excellent Scott Lowe commented on this in October here, and there is a good customer thread there as well.
An early input on this was from the excellent Duncan Epping here.
There's also a good VMTN thread on this here: http://communities.vmware.com/message/598649
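If you want to capture the VML-to-vmhba relationship mentioned above for later reference, a small Python sketch like the following can pull it out of saved `ls -l` output. The only assumption about the format is that the `vml.*` entries appear as symlink lines of the form `name -> target`; the sample lines below are invented for illustration and are not real identifiers, and `vml_map` is a hypothetical helper name.

```python
def vml_map(ls_output):
    """Return {vml_name: link_target} from `ls -l` style text.

    Only lines containing ' -> ' (symlinks) whose link name starts
    with 'vml.' are kept; everything else is ignored.
    """
    mapping = {}
    for line in ls_output.splitlines():
        if " -> " not in line:
            continue
        left, target = line.rsplit(" -> ", 1)
        name = left.split()[-1]          # last field before the arrow is the link name
        if name.startswith("vml."):
            mapping[name] = target.strip()
    return mapping


# Invented sample text, shaped like `ls -l` output; not real device names.
sample = """\
lrwxrwxrwx 1 root root 58 Feb 10 09:00 vml.0200000000aaaa -> vmhba1:0:0:0
lrwxrwxrwx 1 root root 58 Feb 10 09:00 vml.0200010000bbbb -> vmhba1:0:1:0
-rw------- 1 root root  0 Feb 10 09:00 vmhba1:0:0:0
"""

for vml, target in vml_map(sample).items():
    print(vml, "->", target)
```

Check the output against what your own host actually shows before relying on it; the real listing may carry extra fields or a different layout.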
Here's a Navisphere Analyzer (performance analysis tool) screenshot from an EMC CLARiiON customer.
See how the load is all on Port 0 on SPA?
Ok. So what to do about it? Well - in the next VMware ESX release everything changes. You can manually specify path behavior in a more controlled way using the new NMP SATP (Storage Array Type Plugin) commands. ALUA arrays are supported. You can also just take care of it all by putting in the VIB for PowerPath for VMware (anyone in the Cisco Nexus 1Kv Beta/RC knows about the VIB process - it's the same idea). BTW - there's a reason why we're working so hard on this stuff for the "100% Virtualized Datacenter" in the vStorage and vNetwork frameworks. If you want to virtualize ANY workload - at ANY scale - all these little pieces matter....
BUT - what if you're on ESX 3.5?
Expect continued work here, but in some cases it's hard (the SW iSCSI initiator case, or the ESXi case). But - there are a LOT of customers in this situation, and we're not going to leave them high and dry. My friends and colleagues at NetApp have built this into their ESX Host Utilities, which Nick talks about here: http://blogs.netapp.com/storage_nuts_n_bolts/2007/12/esx-host-utilit.html. EMC has a similar tool used during case resolution to gather ESX host configuration, but that tool doesn't do the host reconfiguration for multipathing.
Here's a script EMC Professional Services (thank you Andre Rossouw!!!!) uses to re-balance configurations after EMC CLARiiON NDU operations. You'll see that it balances the backend (SP balancing of LUNs) and the ESX side (path selection).
I hate to say this, but the price of open sharing is use at your own risk. When EMC professional services use it, we assume the risk, but if you do it on your own, you have the risk. I'm not saying don't do it, but recognize that this isn't a supported product, and that you should be able to read it and figure it out. If you can't see the things in here you need to modify for your own environment (and there are several!), I strongly suggest you shouldn't use it. We are continually refining it, and expect it shortly with full spit and polish for broad use.
BTW - this script (and many others) is shared in the EXCELLENT EMC Global Education hands-on courses entitled "CLARiiON VMware Integration" and "Symmetrix VMware Integration". These have the VMware Install and Configure course as a pre-requisite. EMC offers all of these - and I think it's good advice to ask your rep to include training on a quote - know-how is important. And when you get down to brass tacks and are negotiating - don't let him take off the services, just stick to your guns... There's a strong correlation between customer satisfaction and EMC Global Education training.
The script logic at the end for path redistribution is useful across array types - including all EMC arrays and NON EMC arrays. With all that said - I'm a big believer in being open, being transparent - so here you go!
Hi,
After reading your blog, I checked your claim about non-permanent static path selections, but when you create a static path selection with the command esxcfg-mpath --preferred it will make a permanent configuration. I created a balancing scheme and configured the paths to the managing controller, specifying different paths for different LUNs. After the boot, the preferred and active path still remain the same configuration.
Posted by: Frank Denneman | February 10, 2009 at 04:07 AM
I never had any of my ESXi hosts change their fixed multipath config after a reboot. Looking at the multipath section in esx.conf I see all luns identified by VML. All hosts have been configured with the VI Client.
Posted by: Christian | February 11, 2009 at 12:40 PM
It seems that for those of us with active/passive "midrange" controllers, we can leverage VMware's propensity to use the first enumerated path. By carefully designing the SAN connections and related SAN zoning, we can coax the first enumerated path selection, so we can at least attempt to load balance not only across SAN switches but also across at least 4 ports vs 1. We obviously cannot use a FIXED pathing policy in these active/passive setups or risk LUN thrashing, and in my case, the excellent script won't help as we don't have access to the Clarion tools.
Consider the following setup:
Hosting Engines
Two HBAs in each hosting machine - one HBA in PCI Slot 4, one HBA in PCI Slot 7
IBM DS4800 / Midrange active/passive controller
Storage Ports - 4, 3, 2, 1 - A Controller
B Controller - Storage Ports - 1, 2, 3, 4
All 8 connections to SAN switches per below...
SAN Switch B Storage Connections
Port 1 - A3, Port 3 - B1
Port 2 - A4, Port 4 - B2
SAN Switch A Storage Connections
Port 1 - A1, Port 3 - B3
Port 2 - A2, Port 4 - B4
Zoning Matrix 1
Switch Zones
A A1, B3
B A3, B1
Zoning Matrix 2
Switch Zones
A A2, B3
B A3, B2
Zoning Matrix 3
Switch Zones
A A2, B4
B A4, B2
Zoning Matrix 4
Switch Zones
A A1, B4
B A4, B1
The following layouts should ~ distribute the initial boot / VMware default pathing across both SAN switches and target controller ports as noted. There are limits based on VMware's requirement that controller A ports always be in physically lower port #'s than controller B ports on the SAN switches - otherwise, we could split the loading across controller B ports as well (I think..)
The * path should be the first enumerated path...
Host  HBA #  Switch  zMatrix#  Zoning
0     4*     A*      1         A1*, B3
0     7      B       1         A3, B1
1     7      A       2         A2, B3
1     4*     B*      2         A3*, B2
2     4*     A*      3         A2*, B4
2     7      B       3         A4, B2
3     7      A       4         A1, B4
3     4      B*      4         A4*, B1
... repeat
I have also included an earlier version below where the 4 controller ports (1,2,3,4) are connected to each Hosting engine, but I'm not sure if that would freak out the storage controllers in the event of a failure, since most host ports operate in matched pairs (A1, B1; A2, B2;...). In the sample case below, if we lose a switch, there is no alternative path for that specific port target, but there are other targets..
I'm posting this one simply for discussion but do not plan to implement without further research into storage controller behavior for defined host ports and failover.
Sample showing distribution across all 4 controller ports - not for production
Host  HBA #  Switch  zMatrix  Zoning
0     4*     A*      1        A1*, B3
0     7      B       1        A4, B2
1     7      A       2        A2, B4
1     4*     B*      2        A3*, B1
2     4*     A*      3        A3*, B1
2     7      B       3        A2, B4
3     7      A       4        A1, B3
3     4      B*      4        A4*, B2
Thoughts?
Posted by: G. Mobley | February 15, 2009 at 02:54 PM