There’s some internal dialogue today on our “VMware Champions” and “CLARiiON Champions” EMC distribution lists – I want to share it with the outside world (customers/partners), as it’s pertinent. While the second point (SRM and FLARE 29) is CLARiiON-specific, the first point (ALUA and vSphere) is pertinent for multiple vendors (though I have written it up with a CLARiiON-specific bent and notes).
If you’re a CLARiiON customer using either vSphere or SRM – firstly, thank you for being VMware and EMC customers! Secondly, please read on…
First topic: Native Multipathing, Round Robin and ALUA configurations with CLARiiON and vSphere ESX/ESXi 4
This is discussed in the CLARiiON vSphere applied tech guide here (which, I can’t stress strongly enough, should be mandatory reading for people with CLARiiONs supporting VMware environments), but I’m going to provide a bit more of the WHY here. Note the following for the CLARiiON case:
CLARiiON arrays are, in VMware’s nomenclature, an “active/passive” array – they have a LUN ownership model where internally a LUN is “owned” by one storage processor or the other. This is common in mid-range (“two brain”, aka “storage processor”) arrays whose core architecture was born in the 1990s (and is similar to NetApp’s Fibre Channel – but not iSCSI – implementation, HP EVA and others). When a storage processor fails, the LUN becomes active on the secondary brain – on a CLARiiON, this is called a “trespass”. On these arrays, in the VI Client (or vSphere Client) the LUN shows as “active” on ports behind the owning storage processor, and “standby” on the non-owning storage processor.
For this reason, in VI3 and earlier the default failover policy on these arrays was Most Recently Used (MRU). When a storage processor fails (its paths go “dead”), the LUN trespasses to the other storage processor (with a CLARiiON, ESX 3 actually issues a “trespass” command), the paths change to an Active state (ESX issues what’s called a “test unit ready”, or TUR, command to check LUN state), and then I/O continues. MRU doesn’t fail back to the original paths when the failed storage processor is fixed and those paths transition from “dead” to “standby” (because they aren’t on the owning SP). The “Fixed” path policy, by contrast, would revert back to the original path – which would trigger another trespass.
This meant you could get into a “race condition” if you didn’t use MRU, with the ESX host chasing a constantly moving LUN (sometimes called “path thrashing”).
Does that make sense?
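(As a quick aside: if you want to see what your hosts are doing today, the vMA CLI will show you the current SATP/PSP per device and the state of each path. A minimal sketch, assuming a vSphere 4 host reachable as <SERVERNAME> – the device names in your output will obviously differ:

esxcli --server=<SERVERNAME> nmp device list
esxcli --server=<SERVERNAME> nmp path list

The first command lists each device with the SATP that claimed it and its current path selection policy (MRU shows up as VMW_PSP_MRU); the second shows the individual paths and their states.)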
Now, trespasses can occur VERY fast, and are transparent to the guest. This is a VERY VERY reliable mechanism. There are very, very (VERY) rare cases where a LUN trespass does not complete because the second storage processor is shut down before the trespass finishes (I’ve been involved in one case among the more than a hundred thousand CLARiiONs out there).
So – if this works well, and is mature, what’s new?
The answer is that vSphere supports Asymmetric Logical Unit Access (ALUA) – note that there is no support for this in VI3.x. ALUA is a SCSI standard, and is widely implemented across mid-range storage arrays – including the CLARiiON. With ALUA, the LUN is reachable across both storage processors at the same time. Paths on the “non-owning” storage processor take I/O and carry it across the internal interconnect architecture of the mid-range array (bandwidth and behavior vary). In the example in the diagram below, the paths on SPA advertise as “active (non-optimized)”, and aren’t used for I/O unless the active (optimized) paths are not working.
When you use ALUA, these mid-range arrays can work with the Fixed and Round Robin Path Selection Plugin (PSP) multipathing options in vSphere 4’s Native Multipathing Plugin (NMP). You can then use Fixed and Round Robin policies without worrying about path thrashing.
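If you’re curious how this shows up on the ESX side, the SATP table is visible from the vMA CLI as well – a sketch, with <SERVERNAME> as a placeholder; the CLARiiON-related entries you should see are VMW_SATP_CX (non-ALUA) and VMW_SATP_ALUA_CX (ALUA), each listed with its default PSP:

esxcli --server=<SERVERNAME> nmp satp list

That default PSP is exactly what the satp setdefaultpsp command further down in this post changes.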
The ALUA standard can be implemented with SCSI-2 commands OR SCSI-3 commands. In vSphere, by default, the SCSI-3 command set for the reservations required to handle ALUA behavior is implemented, NOT SCSI-2.
To make this even more clear (I hope), don’t be confused: a device can be a SCSI-3 device without using SCSI-3 for ALUA. vSphere requires that all SCSI devices are SCSI-3 devices (whether iSCSI, FC, or FCoE), otherwise they don’t show up to the ESX host.
CLARiiONs as of FLARE 26 implement the SCSI-2 reservation mechanism for ALUA. As of FLARE 28.5 they support both the SCSI-2 and SCSI-3 mechanisms.
FLARE 28.5 and FLARE 29 are supported on CLARiiON CX4, but are currently not supported on the older CLARiiON CX3. Ergo, a CX4 can support ALUA with vSphere, a CX3 cannot.
With CLARiiON, to configure a host for ALUA mode failover behavior, simply run the Failover wizard in Navisphere, and switch to Failover Mode = 4 (ALUA mode).
Making this change in my experience means using VMware maintenance mode to bounce the ESX host non-disruptively, or removing/re-adding the devices with a manual VMotion to make it non-disruptive. Personally, I would recommend the maintenance mode approach – with it, it’s hard to make an error that would cause an outage.
If you are running a version of FLARE earlier than FLARE 28.5 and you change the failover mode, the storage devices will not be visible to the ESX host – so remember: CLARiiON CX4, FLARE 28.5 and later only!
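Once you’ve made the failover mode change and bounced the host (or removed/re-added the devices), it’s worth verifying that the devices came back claimed by the ALUA SATP before you touch the path policy. A rough sketch using the vMA CLI – <SERVERNAME> and <vmhbaN> are placeholders for your host and HBA:

vicfg-rescan --server=<SERVERNAME> <vmhbaN>
esxcli --server=<SERVERNAME> nmp device list

If the devices now show VMW_SATP_ALUA_CX as their Storage Array Type, the host has picked up ALUA mode; if they still show VMW_SATP_CX, it hasn’t.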
Once you make this change, you can change NMP from MRU to Round Robin (NMP RR) – either one device at a time in the GUI, or via the CLI. Note: I’m going to start standardizing on the vMA CLI syntax, as I think (personally) that’s the way to go – and it applies equally to ESX and ESXi:
esxcli --server=<SERVERNAME> nmp device setpolicy --device <device UID> --psp <PSP type>
Alternatively, you can change the PSP assigned to the SATP (a single mass change for an ESX host) using this command:
esxcli --server=<SERVERNAME> nmp satp setdefaultpsp --psp <PSP type> --satp <SATP type>
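To make those concrete, here’s what the two forms look like filled in – a sketch only, with a made-up device UID; VMW_PSP_RR is the Round Robin PSP, and VMW_SATP_ALUA_CX is the SATP that claims ALUA-mode CLARiiONs:

esxcli --server=<SERVERNAME> nmp device setpolicy --device naa.6006016012345678901234567890abcd --psp VMW_PSP_RR
esxcli --server=<SERVERNAME> nmp satp setdefaultpsp --psp VMW_PSP_RR --satp VMW_SATP_ALUA_CX

The per-device form touches just that one LUN; the per-SATP form changes the default for every device that SATP claims on that host.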
The Round Robin policy doesn’t issue I/Os in a simple “round robin” between paths in the way many expect. By default, RR sends 1000 commands down each path before moving to the next path; this is called the IOOperationLimit. In configurations where a LUN queue is busy, this limit doesn’t demonstrate much path aggregation, because quite often some of the thousand commands will have completed before the last command is sent. That means the paths aren’t full (even though the queue at the storage array might be). With 1Gbit iSCSI, the physical path is often the limiting factor on throughput, and making use of multiple paths at the same time shows better throughput.
You can reduce the number of commands issued down a particular path before moving on to the next path all the way down to 1, ensuring that each subsequent command is sent down a different path:
You can make this change by using this command:
esxcli --server=<SERVERNAME> nmp roundrobin setconfig --device <lun ID> --iops 1 --type iops
Note that cutting down the number of IOs per path does present some potential problems with storage arrays where caching is done per path. By spreading the requests across multiple paths, you are defeating any caching optimization at the storage end and could end up hurting your performance. Luckily, most modern storage systems (this is true of CLARiiON) don’t cache per port. There’s still a minor path-switch penalty in ESX, so switching this often probably represents a little more CPU overhead on the host.
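If you do go the IOOperationLimit=1 route, setting it one device at a time gets old quickly. A rough sketch of a loop you could run from the vMA shell – the grep pattern (CLARiiON device UIDs typically start with naa.6006016) and <SERVERNAME> are assumptions you’d adjust for your environment; the getconfig call just echoes the setting back so you can confirm it took:

for dev in $(esxcli --server=<SERVERNAME> nmp device list | grep '^naa.6006016'); do
  esxcli --server=<SERVERNAME> nmp roundrobin setconfig --device $dev --iops 1 --type iops
  esxcli --server=<SERVERNAME> nmp roundrobin getconfig --device $dev
done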
There are some cases where RR isn’t recommended – more to come on that in a followup around iSCSI and vSphere 4 (though they apply to all protocols).
PowerPath/VE is an MPP (a full NMP substitute) – it improves path discovery, improves path selection from basic round robin to adaptive in general (and predictive with EMC arrays), and also adds ALUA reservation support using SCSI-2. This means you can use EMC PowerPath/VE with vSphere regardless of whether the array uses SCSI-2 or SCSI-3. In fact, PP/VE provides the benefit of full path utilization without any configuration needed other than simply loading the vmkernel module.
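If you want to confirm PP/VE has in fact claimed your devices after loading the module, the optional remote CLI will show you – a sketch, assuming the rpowermt utility is installed on a management VM (or in the vMA) and <ESX_HOST_IP> is a placeholder for the host you’re checking:

rpowermt display dev=all host=<ESX_HOST_IP>

Every device should be listed with all of its paths alive and in use – with no per-path configuration having been done on your part.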
Each array behaves differently – so check with your storage vendor, and don’t assume anything here (for better or worse for them or for EMC) applies to others.
Netting it out:
- vSphere 4 supports ALUA, but does so with the SCSI-3 reservation mechanism.
- CLARiiON supports ALUA as of FLARE 26 using the SCSI-2 reservation mechanism, and both the SCSI-2 and SCSI-3 reservation mechanisms as of FLARE 28.5.
- FLARE 28.5 and later (FLARE 29 is the most current firmware rev) are supported on CLARiiON CX4 only, not the older CX3 or AX.
- If you are a CLARiiON customer and want to drive all paths to devices using the free ALUA NMP Round Robin, you need to be on a CX4 running FLARE 28.5 or later.
- If you want to drive all paths to devices, want new path discovery to be automated, and want the I/O distribution to be predictive, you can use any EMC array (and some HDS, HP and IBM arrays) with EMC PowerPath/VE.
Second topic: VM-Aware Navisphere (FLARE 29) and Site Recovery Manager
So, VM-Aware Navisphere (FLARE 29) is out there. If you want to know more about it, check out this post here.
A case came up late last night from an EMC partner, who reached out and pinged me. The current MirrorView SRA (1.3.0.8) doesn’t work with FLARE 29 – minor API incompatibilities (you can see his post here: http://vmjunkie.wordpress.com/2009/09/15/srm-warning-flare-4-29-mirrorview-incompatible-with-current-sra)
The new SRA (1.4) is just finishing up the SRM qual, and will be immediately available when the next SRM release is out (which will be shortly). This one also has a treat in store for MirrorView customers (a free tool called MirrorView Insight) that adds a TON of SRM-related stuff, including failback. More on that shortly.
Netting it out:
- If you’re using a CLARiiON with MirrorView and Site Recovery Manager, hold off on the FLARE 29 upgrade for just a BIT longer – otherwise it will break Site Recovery Manager.
To clarify though (since i almost crapped myself at first read thru...)
From the Clariion and VMware integration paper:
"On a CX3 or earlier CLARiiON storage ststems, the Most Recently Used (MRU) or Round Robin policy must be used with failovermode=1"
So RR is ok in non ALUA modes on CX3 or under.
Too bad customers (like us) with somewhat newish CX3-XX's get screwed again (FLARE29 anyone....)
Posted by: PaulO | September 16, 2009 at 01:08 PM
I assume we will never see FLARE 28.5 or higher on any CX3 box, but is there any chance that future releases of FLARE 26 for CX3 will support (and when) SCSI3 reservation with ALUA?
Posted by: Krzysztof Cieplucha | September 16, 2009 at 05:22 PM
I second that Paul, wish those of us with CX3's were able to get newer Flare. Chad, any chance of that? Frustrating to have a couple recent CX3-80's and no longer get the cool stuff without going with a CX4.
Posted by: DougJ | September 16, 2009 at 07:52 PM
Keep it coming guys - I'm pointing the CLARiiON org to the feedback here. Some of this I understand why it would be very, very hard (ergo VM-Aware Navi and search), but I'm pushing hard for a FLARE 26 patch that adds the ALUA SCSI-3 reservation support.
Believe it or not - I understand the frustration personally - I purchased 167 NS-20's (which have CX3-10s included) in 2009 to distribute to the field and partners :-)
Thank you for being customers, and please, keep the feedback coming!
Posted by: Chad Sakac | September 17, 2009 at 12:38 PM
My question comes from a different direction, PowerPath/VE. Based on what I've read, PowerPath/VE was built on the basis of ESXi. This means that all management is remote from the physical host. So, we need to maintain a separate management server, either Windows 2003 or RHEL, in order to manage our PowerPath/VE installs. Further, based on the implementation of PowerPath/VE, we need to maintain the root passwords for our ESX hosts on this management workstation inside an encrypted "lockbox." Unfortunately, this means that: 1. Someone getting into my management workstation is able to affect PowerPath on all of the ESX/ESXi servers managed by it. 2. The loss of my management workstation means a disruption in management of PowerPath/VE until I rebuild. ..... All of that said, do we really need to go forward with this architecture? Can't we put the management utilities on the physical ESX servers? Why are we purely building around the ESXi model? My VMware admin is asking if the additional licensing cost for vSphere to support PowerPath/VE is worth it, and another management level is not something he wants to hear.
Posted by: Will | September 17, 2009 at 12:47 PM
Will - thanks for the question.
- PowerPath/VE works with ESX or ESXi
- no management is needed per se with PowerPath/VE. All path configuration is 100% automated, PERIOD. Beyond the initial licensing, I would expect that many (most) customers would never use the rpowermt tool.
- but if you want to get some reporting, you use the "rpowermt" utility. This utility can be on any VM (and you then get all the VM HA goodness); it doesn't have to be on a physical host at all. The reason for that management model is we wanted to stay away from COS tools (for all the good reasons VMware is also going that way). Think of rpowermt as being like VMware's vMA (in fact, you could even install it IN the vMA). It interacts with PowerPath/VE via deep vmkernel API calls (ESX CIM and others).
- the "lockbox" is only there if you want to use it (again, this is the same model as VMware's vMA) - you absolutely DO NOT need to cache the ESX passwords.
FYI, management of PP/VE in the near/mid term will be extended via a vCenter plugin in addition to the optional CLI tools provided via rpowermt.
Hope that helps!
Posted by: Chad Sakac | September 17, 2009 at 12:56 PM
Chad,
Thanks for the response. In regards to using powermt/rpowermt, I use powermt a lot on other OSs during LUN presentation, troubleshooting, and documentation. Also, I'm a CLI and scripting guy in many ways. The extra layer (indirect/remote CLI) still concerns me somewhat, but I've gotten used to Navisphere CLI.
Do you happen to know what encryption level PowerPath/VE wraps around the communication between the PowerPath management server and the ESX host?
Thanks again.
Posted by: Will | September 17, 2009 at 07:52 PM
Chad, I'm not first and foremost a storage guy so forgive me for what might be a stupid question... I read over the post a few times. So if I'm on esx 3.5 with a cx-3xx it sounds like we stay with MRU and may end up with all our LUNs going thru the same storage processor after it comes back online (standby).
We have two ways to keep that from happening:
a. Upgrade to a cx4-xx and FLARE 29 at which point we'll be able to utilize RR (but only if on vsphere due to ALUA) which will send xxxx (1000 by default) IOs down one path and then the other path giving us better performance than MRU but still not powerpath aka MPP performance or
b. Upgrade to Vsphere which will allow us (with ent plus licensing) to use powerpath/ve which can talk ALUA reservations in SCSI-2 giving us better performance (in this regard anyhow) than going from a cx-3 to cx-4+ RR anyhow
I ask because this statement really muddies that:
FLARE 28.5 and FLARE 29 are supported on CLARiiON CX4, but are currently not supported on the older CLARiiON CX3. Ergo, a CX4 can support ALUA with vSphere, a CX3 cannot.
A cx3 can support ALUA (SCSI-2) under vsphere but you've got to use powerpath to do it.
I get that right?
We've budgeted to go to a cx4 with vsphere and powerpath next year but I want to make sure I understand.
summary
cx3 / esx 3.5 NMP MRU
cx4 / esx 3.5 NMP MRU
cx4 / vsphere NMP RR
cx* / vsphere+powerpath true mpp
so in this context going to a cx4 w/o vsphere doesn't buy me anything. correct?
Posted by: Jay weinshenker | October 25, 2009 at 05:08 PM
Are there any plans to support ALUA on the AX4?
Posted by: Ben | December 30, 2009 at 04:03 PM
In short ... if we are using 3.5, then we need to use MRU, or path thrashing can be an issue. In vSphere though, round robin or fixed can achieve path balancing that can survive a host reboot (even without powerpath installed).
is this correct?
Posted by: Trevor | June 04, 2010 at 11:57 AM
Recently, I have installed powerpath in esxi. After that, I can see all the paths are active in esxi. But while running the rpowermt host=<host ip> check_registration command, it asks for the lockbox password and the esx username and password. After 20 minutes, it throws a "powerpath not found" error.
please help me.
Posted by: Gopi | July 17, 2010 at 11:17 AM
Thanks for the post. But I am confused about the relationship between ALUA and reservations. Can you please explain why the ALUA implementation depends on the reservation mechanism, or give me some info about that issue?
Posted by: Gevorg | February 15, 2011 at 06:19 AM