UPDATE: Sept 21st, 2009 – 4:24pm EST – Are you using vSphere, or thinking about it? We’ve done an updated “Multivendor iSCSI post” with vSphere details here. If you’re using ESX 3.x, read on…
Today’s post is one you don’t often find in the blogosphere: a collaborative effort initiated by me, Chad Sakac (EMC), with contributions from Andy Banta (VMware), Vaughn Stewart (NetApp), Eric Schott (Dell/EqualLogic), Adam Carter (HP/LeftHand), David Black (EMC), and various other folks at each of the companies.
Together, our companies make up the large majority of the iSCSI market, all make great iSCSI targets, and we (as individuals and companies) all want our customers to have iSCSI success.
I have to say, I see this one often - customers struggling to get high throughput out of iSCSI targets on ESX. Sometimes they are OK with that, but often I hear this comment: "…My internal SAS controller can drive 4-5x the throughput of an iSCSI LUN…"
Can you get high throughput with iSCSI with GbE on ESX? The answer is YES. But there are some complications, and some configuration steps that are not immediately apparent. You need to understand some iSCSI fundamentals, some Link Aggregation fundamentals, and know some ESX internals – none of which are immediately obvious…
If you’re interested (and who wouldn’t be interested with a great topic and a bizarro-world “multi-vendor collaboration”... I can feel the space-time continuum collapsing around me :-), read on...
We could start this conversation by playing a trump card – 10GbE – but we’ll save that topic for another discussion. Today 10GbE is relatively expensive per port and relatively rare, and the vast majority of iSCSI and NFS deployments are on GbE. 10GbE is supported by VMware today (see the VMware HCL here), and all of the vendors here either have, or have announced, 10GbE support.
10GbE can support the ideal number of cables from an ESX host – two. This reduction in port count can simplify configurations, reduce the need for link aggregation, provide ample bandwidth, and even unify FC using FCoE on the same fabric for customers with existing FC investments. We all expect to see rapid adoption of 10GbE as prices continue to drop. Chad has blogged on 10GbE and VMware here.
This post is about trying to help people maximize iSCSI on GbE, so we’ll leave 10GbE for a followup.
If you are serious about iSCSI in your production environment, it’s valuable to do a bit of learning, and it’s important to do a little engineering during design. iSCSI is easy to connect and begin using, but like many technologies that excel because of their simplicity, the default options and parameters may not be robust enough to provide an iSCSI infrastructure that can support your business.
With that in mind, this post is going to start with sections called “Understanding” which will walk through protocol details and ESX Software Initiator internals. You can skip them if you want to jump to configuration options, but a bit of learning goes a long way toward understanding the WHY of the HOWs (which I personally always think makes them easier to remember).
Understanding your Ethernet Infrastructure
Do you have a “bet the business” Ethernet infrastructure? Don’t think of iSCSI (or NFS datastore) use here as “it’s just on my LAN”, but as “this is the storage infrastructure that is supporting my entire critical VMware infrastructure”. IP storage needs the same sort of design thinking that gets applied to FC infrastructure. Here are some things to think about (a sample switch-port sketch follows this list):
- Are you separating your storage and network traffic on different ports? Could you use VLANs for this? Sure. But is that “bet the business” thinking? Do you want a temporarily busy LAN to swamp your storage (and vice-versa) for the sake of a few NICs and switch ports? If you’re using 10GbE, sure – but GbE?
- Think about Flow-Control (should be set to receive on switches and transmit on iSCSI targets)
- Enable spanning tree with RSTP, or enable portfast on the storage-facing ports, so a link state change doesn’t block iSCSI traffic while the port converges
- Filter / restrict bridge protocol data units on storage network ports
- If you want to squeeze out the last bit, configure jumbo frames (always end-to-end – otherwise you will get fragmented gobbledygook)
- Use Cat6 cables rather than Cat5/5e. Yes, Cat5e can work – but remember – this is “bet the business”, right? Are you sure you don’t want to buy that $10 cable?
- You’ll see later that things like cross-stack Etherchannel trunking can be handy in some configurations.
- Each Ethernet switch also varies in its internal architecture – for mission-critical, network-intensive Ethernet purposes (like VMware datastores on iSCSI or NFS), the amount of port buffering and other internals matter – it’s a good idea to know what you are using.
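To make the list above concrete, here is roughly what a storage-facing port looks like on a Cisco-style switch. This is only a sketch – the interface name, VLAN number and exact commands are examples and vary by platform and software version, so treat your switch documentation and your storage vendor’s guide as the authority:

    ! Example only - interface/VLAN values are placeholders; syntax varies by platform.
    interface GigabitEthernet0/10
     description iSCSI - ESX host vmnic2
     switchport mode access
     switchport access vlan 100
     spanning-tree portfast
     spanning-tree bpduguard enable
     flowcontrol receive on
    ! Jumbo frames (if you use them) must be enabled end-to-end; on some platforms this is a
    ! global setting, e.g. "system mtu jumbo 9000" on a Catalyst 3750 (followed by a reload).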
If performance is important, have you thought about how many workloads (guests) you are running? Both individually and in aggregate, are they typically random, or streaming? Random I/O workloads put very little throughput stress on the SAN network. Conversely, sequential, large-block I/O workloads place a heavier load.
In the same vein, be careful running single stream I/O tests if your environment is multi-stream / multi-server. These types of tests are so abstract they provide zero data relative to the shared infrastructure that you are building.
In general, don’t view “a single big LUN” as a good test – all arrays have internal threads handling I/Os, and so does the ESX host itself (for VMFS and for NFS datastores). In general, in aggregate, more threads are better than fewer. You increase threading on the host by issuing more operations against that single LUN (or file system). Every vendor’s internals are slightly different, but in general, more internal array objects are better than fewer – as there are more threads.
Not an “Ethernet” thing, but while we’re on the subject of performance and not skimping: there’s no magic in the brown spinny things – you need enough array spindles to support the I/O workload. Often there aren’t enough drives in total, or a specific sub-group of drives is under-configured. Every vendor does this differently (aggregates/RAID groups/pools), but all have some sort of “disk grouping” out of which LUNs (and file systems in some cases) get their collective IOPs.
Understanding: iSCSI Fundamentals
We need to begin with a prerequisite nomenclature to establish a start point. If you really want the “secret decoder ring” then start here: http://tools.ietf.org/html/rfc3720
This diagram is chicken scratch, but it gets the point across. The red numbers are explained below.
- iSCSI initiator = an iSCSI client, which serves the same purpose as an HBA, sending SCSI commands and encapsulating them in IP packets. This can operate in the hypervisor (in this case, the ESX software initiator or hardware initiator) and/or in the guests (for example, the Microsoft iSCSI initiator).
- iSCSI target = an iSCSI server, usually on an array of some type. Arrays vary in how they implement this. Some have one (the array itself), some have many, some map them to physical interfaces, some make each LUN an iSCSI target.
- iSCSI initiator port = the end-point of an iSCSI session, and is not a TCP port. After all the handshaking, the iSCSI initiator device creates and maintains a list of iSCSI initiator ports. Think of the iSCSI initiator port as the “on ramp” for data.
- iSCSI network portal = an IP address (or, for a target, an IP address and TCP port) used by an iSCSI initiator or target. Network portals can be grouped into portal groups (see Multiple Connections per Session)
- iSCSI Connection = a TCP Connection, and carries control info, SCSI commands and data being read or written.
- iSCSI Session = one or more TCP connections that form an initiator-target session
- Multiple Connections per Session (MC/S) = iSCSI can have multiple connections within a single session (see above).
- MPIO = Multipathing, used very generally as a term – but it exists ABOVE the whole iSCSI layer (which in turn is on top of the network layer) in the hypervisor and/or in the guests. As an example, when you configure ESX storage multipathing, that’s MPIO. MPIO is the de facto load-balancing and availability model for iSCSI
Understanding: Link Aggregation Fundamentals
The next core bit of technology to understand is Link Aggregation. The group spent a fair amount of time going around on this as we were writing this post. Many people jump to this as an “obvious” mechanism to provide greater aggregate bandwidth than a single GbE link can provide.
The core thing to understand (and the bulk of our conversation – thank you Eric and David) is that 802.3ad/LACP surely aggregates physical links, but the mechanisms used to determine whether a given flow of information follows one link or another are critical.
Personally, I found this doc very clarifying: http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf
You’ll note several key things in this doc:
- All frames associated with a given “conversation” are transmitted on the same link to prevent mis-ordering of frames. So what is a “conversation”? A “conversation” is the TCP connection.
- The link selection for a conversation is usually done by hashing the MAC addresses or IP addresses (a tiny illustration of the consequence follows this list).
- There is a mechanism to “move a conversation” from one link to another (for loadbalancing), but the conversation stops on the first link before moving to the second.
- Link Aggregation achieves high utilization across multiple links when carrying multiple conversations, and is less efficient with a small number of conversations (and provides no improved bandwidth with just one). While Link Aggregation is good, it’s not as efficient as a single faster link.
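To see why a single conversation can never spread across an aggregate, here’s a toy illustration. This is NOT the actual hash any particular switch or host uses (each implementation has its own) – it just shows the principle: a deterministic hash over addresses that never change during a TCP connection always picks the same link.

    # Toy illustration only - real devices use their own hash algorithms.
    SRC=11 ; DST=50 ; LINKS=2          # e.g. the last octets of 10.0.1.11 / 10.0.1.50, over a 2-link aggregate
    echo $(( (SRC ^ DST) % LINKS ))    # same address pair -> same result -> same physical link, every time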
It’s notable that Link Aggregation and MPIO are very different. Link Aggregation applies between two network devices only. It can load balance well across many TCP connections, but it is not particularly efficient or predictable when there are only a few.
Conversely, MPIO applies to an end-to-end iSCSI session – a full path from the initiator to the target. It can load balance efficiently even with a low number of TCP sessions. While Link Aggregation can be applied to iSCSI (as will be discussed below), MPIO is generally the design point for iSCSI multipathing.
Understanding: iSCSI implementation in ESX 3.x
The key to understanding the issue is that the ESX 3.x software initiator only supports a single iSCSI session with a single TCP connection for each iSCSI target.
Making this visual… in the diagram above, iSCSI generally lets you have multiple “purple pipes”, each with one or more “orange pipes”, to any iSCSI target, and use MPIO with multiple active paths to drive I/O down all of them.
You can also have multiple “orange pipes” (the iSCSI connections) in each “purple pipe” (single iSCSI session) - Multiple Connections per Session (which effectively multipaths below the MPIO stack), shown in the diagram below.
But in the ESX software iSCSI initiator case, you can only have one “orange pipe” for each purple pipe for every target (green boxes marked 2), and only one “purple pipe” for every iSCSI target. The end of the “purple pipe” is the iSCSI initiator port – and these are the “on ramps” for MPIO
So, no matter what MPIO setup you have in ESX, it doesn't matter how many paths show up in the storage multipathing GUI for multipathing to a single iSCSI target – there’s only one iSCSI initiator port and only one TCP connection per iSCSI target. The alternate path to the target gets established only after the primary active path is unreachable. This is shown in the diagram below.
VMware can’t be accused of being unclear about this. Directly in the iSCSI SAN Configuration Guide: “ESX Server‐based iSCSI initiators establish only one connection to each target. This means storage systems with a single target containing multiple LUNs have all LUN traffic on that one connection”, but in general, in my experience, this is relatively unknown.
This usually means that customers find that for a single iSCSI target (and however many LUNs that may be behind that target – 1 or more), they can’t drive more than 120-160MBps.
This shouldn’t make anyone conclude that iSCSI is not a good choice or that 160MBps is a show-stopper. For perspective I was with a VERY big customer recently (more than 4000 VMs on Thursday and Friday two weeks ago) and their comment was that for their case (admittedly light I/O use from each VM) this was working well. Requirements differ for every customer.
Now, this behavior will be changing in the next major VMware release. Among other improvements, the iSCSI initiator will be able to use multiple iSCSI sessions (hence multiple TCP connections). Looking at our diagram, this corresponds with “multiple purple pipes” for a single target. It won’t support MC/S or “multiple orange pipes per each purple pipe” – but in general this is not a big deal (large-scale use of MC/S has shown a marginally higher efficiency than MPIO, and only at very high-end 10GbE configurations).
Multiple iSCSI sessions will mean multiple “on-ramps” for MPIO (and multiple “conversations” for Link Aggregation). The next version also brings core multipathing improvements in the vStorage initiative (improving all block storage): NMP round robin, ALUA support, and EMC PowerPath for VMware, which integrates into the MPIO framework and further improves multipathing. In the spirit of this post, EMC is working to make PowerPath for VMware as heterogeneous as we can.
Together – multiple iSCSI sessions per iSCSI target and improved multipathing means aggregate throughput for a single iSCSI target above that 160MBps mark in the next VMware release, as people are playing with now. Obviously we’ll do a follow up post.
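(For what it’s worth – once you are on that next release, setting a device to round robin is a one-liner from the CLI. The syntax below is the vSphere 4 esxcli form and the device ID is just a placeholder, so check the docs for your build before using it:)

    # vSphere 4 (not ESX 3.x) - set the native multipathing policy for a LUN to round robin.
    esxcli nmp device setpolicy --device naa.60060160a0b0123456789012345678ab --psp VMW_PSP_RR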
(Strongly) Recommended Additional Reading
- I would STRONGLY recommend reading a series of posts that the inimitable Scott Lowe has done on ESX networking, and start at his recap here: http://blog.scottlowe.org/2008/12/19/vmware-esx-networking-articles/
- Also – I would strongly recommend reading the vendor documentation on this carefully.
- START HERE - VMware: iSCSI SAN Configuration Guide
- EMC Celerra: VMware ESX Server Using EMC Celerra Storage Systems – Solutions Guide
- EMC CLARiiON: VMware ESX Server Using EMC CLARiiON Storage Systems - Solutions Guide
- EMC DMX: VMware ESX Server Using EMC Symmetrix Storage Systems – Solutions Guide
- NetApp: NetApp & VMware Virtual Infrastructure 3 : Storage Best Practices (Vaughn is proud to say this is the most popular NetApp TR)
- HP/LeftHand: LeftHand Networks VI3 field guide for SAN/iQ 8 SANs
- Dell/EqualLogic:
ENOUGH WITH THE LEARNING!!! HOW do you get high iSCSI throughput in ESX 3.x?
As discussed earlier, the ESX 3.x software initiator really only works on a single TCP connection for each target – so all traffic to a single iSCSI target will use a single logical interface. Without extra design measures, this limits the amount of I/O available to each iSCSI target to roughly 120–160 MBps of read and write access.
This design does not limit the total amount of I/O bandwidth available to an ESX host configured with multiple GbE links for iSCSI traffic (or more generally VMKernel traffic) connecting to multiple datastores across multiple iSCSI targets, but does for a single iSCSI target without taking extra steps.
Here are the questions that customers usually ask themselves:
Question 1: How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure? How do I scale that up?
Question 2: If I have a single LUN that needs really high bandwidth – more than 160MBps and I can’t wait for the next major ESX version, how do I do that?
Question 3: Do I use the Software Initiator or the Hardware Initiator?
Question 4: Do I use Link Aggregation and if so, how?
Here are the answers you seek…
.
.
.
Question 1: How do I configure MPIO (in this case, VMware NMP) and my iSCSI targets and LUNs to get the most optimal use of my network infrastructure? How do I scale that up?
Answer 1: Keep it simple. Use the ESX iSCSI software initiator. Use multiple iSCSI targets. Use MPIO at the ESX layer. Add Ethernet links and iSCSI targets to increase overall throughput. Set your expectation for no more than ~160MBps for a single iSCSI target.
Remember an iSCSI session is from initiator to target. If you use multiple iSCSI targets with multiple IP addresses, you will use all the available links in aggregate, and the storage traffic in total will load balance relatively well. But any individual target will be limited to a maximum of a single GbE connection’s worth of bandwidth.
Remember that this also applies to all the LUNs behind that target. So, consider that as you distribute the LUNs appropriately among those targets.
The ESX initiator uses the same core method to get a list of targets from any iSCSI array (static configuration or dynamic discovery using the iSCSI SendTargets request) and then a list of LUNs behind that target (SCSI REPORT LUNS command).
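If you prefer scripting this from the service console instead of the VI client, the equivalent steps look roughly like the sketch below – the vmhba name and IP addresses are placeholders for your environment, and your array vendor’s documentation remains the authority:

    # ESX 3.x service console - enable the software initiator and add SendTargets discovery addresses.
    esxcfg-swiscsi -e                            # enable the software iSCSI initiator
    esxcfg-firewall -e swISCSIClient             # open the outbound iSCSI port on the service console firewall
    vmkiscsi-tool -D -a 192.168.100.10 vmhba40   # add a SendTargets (dynamic discovery) address
    vmkiscsi-tool -D -a 192.168.100.11 vmhba40   # one per target portal you want discovered
    esxcfg-rescan vmhba40                        # rescan to pick up the targets and the LUNs behind them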
So, to place your LUNs appropriately to balance the workload:
- On an EMC CLARiiON, each physical interface is seen by an ESX host as a separate target, so balance the LUNs behind your multiple iSCSI targets (physical ports).
- On a Dell/EqualLogic array, since every LUN is a target, balancing is automatic and you don’t have to do this.
- On an HP/LeftHand array, since every LUN is a target, balancing is automatic and you don’t have to do this.
- On a NetApp array, each interface is seen by an ESX host as a separate target, so balance your LUNs behind the targets.
- On an EMC Celerra array, you can configure as many iSCSI targets as you want, up to 1000 and assign them to any virtual or physical network interface - balance your LUNs behind the targets.
Select your active paths in the VMware ESX multipathing dialog to balance the I/O across the paths to your targets and LUNs, using the Virtual Center dialog shown below (from the VMware iSCSI SAN Configuration Guide). Also, it can take up to 60 seconds for the standby path to become active, as the session needs to be established and the MPIO failover needs to occur – as noted in the VMware iSCSI SAN Configuration Guide. There are some good tips there (and in the vendor best practice docs) about extending guest timeouts to withstand the delay without a fatal I/O error in the guest.
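For the guest timeout piece, the change is made inside the guest OS. A minimal sketch, assuming a Linux guest and a 180-second value – use whatever number your array vendor’s best-practices doc actually calls for:

    # Raise the SCSI disk timeout for /dev/sda (repeat per disk, or persist it via a udev rule or rc script).
    echo 180 > /sys/block/sda/device/timeout

On Windows guests, the equivalent is the TimeoutValue setting (REG_DWORD, in seconds) under HKLM\SYSTEM\CurrentControlSet\Services\Disk.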
Question 2: If I have a single LUN that needs really high bandwidth – more than 160MBps and I can’t wait for the next major ESX version, how do I do that?
Answer 2: Use an iSCSI software initiator in the guest along with either MPIO or MC/S
This model allows the Guest Operating Systems to be “directly” on the SAN and to manage their own LUNs. Assign multiple vNICs to the VM, and map those to different pNICs. Many of the software initiators in this space are very robust (like the Microsoft iSCSI initiator). They provide their guest-based multipathing and load-balancing via MPIO (or MC/S) based on the number of NICs allocated to the VM.
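As a concrete example of what this looks like inside a Windows guest using the Microsoft iSCSI initiator’s command line (most people just use the GUI – the portal IPs and IQN below are placeholders, and the multipathing itself is then handled by the guest’s MPIO or MC/S configuration):

    rem Add one target portal per vNIC/subnet, list what was discovered, then log in.
    iscsicli QAddTargetPortal 192.168.100.10
    iscsicli QAddTargetPortal 192.168.101.10
    iscsicli ListTargets
    iscsicli QLoginTarget iqn.1992-04.com.example:storage.lun1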
As we worked on this post, all the vendors involved agreed – we’re surprised that this mechanism isn't more popular. People have been doing it for a long time, and it works, even through VMotion operations where some packets are lost (TCP retransmits them – iSCSI is ok with occasional loss, but constant losses slow TCP down – something to look at if you’re seeing poor iSCSI throughput).
It has a big downside, though – you need to manually configure the storage inside each guest, which doesn’t scale particularly well from a configuration standpoint – so most customers stick with the “keep it simple” method in Answer 1, and selectively use this for LUNs needing high throughput.
There are other bonuses too:
- This also allows host SAN tools to operate seamlessly – on both physical or virtual environments – integration with databases, email systems, backup systems, etc.
- Also has the ability to use a different vSwitch and physical network ports than VMkernel allowing for more iSCSI load distribution and separation of VM data traffic from VM boot traffic.
- Dynamic and automated LUN surfacing to the VM itself (i.e. you don’t need to do anything in Virtual Center for the guest to use the storage) – useful in certain database test/dev use cases
- You can use it for VMs that require a SCSI-3 device (think Windows 2008 cluster quorum disks – though those are not officially supported by VMware even as of VI3.5 update 3)
There are, of course, things that are negative about this approach.
- I suppose "philosophically" there's something a little dirty about "penetrating the virtualization abstraction layer", and yeah - I get why that philosophy exists. But hey, we're not really philosophers, right? We're IT professionals, and this works well :-)
- It is notable that this option means that SRM is not supported (which depends on LUNs presented to ESX, not to guests)
Question 3: Do I use the Software Initiator or the Hardware Initiator?
Answer 3: In general, use the Software Initiator except where iSCSI boot is specifically required.
This method bypasses the ESX software initiator entirely. Like the ESX software initiator, hardware iSCSI initiators use the ESX MPIO storage stack for multipathing – but they don’t have the single-connection-per-target limit.
But, since you still have all the normal caveats of the ESX NMP software (an active/passive model with static, manual load balancing), this won’t increase the throughput for a single iSCSI target.
In general, across all the contributors from each company, our personal preference is to use the software initiator. Why? It’s simple, and since it’s used very widely, it’s very well tested and very robust. It also has a clear 10GbE support path.
Question 4: Do I use Link Aggregation and if so, how?
Answer 4: There are some reasons to use Link Aggregation, but increasing a throughput to a single iSCSI target isn’t one of them in ESX 3.x.
What about Link Aggregation – shouldn’t that resolve the issue of not being able to drive more than a single GbE for each iSCSI target? In a word – NO. A TCP connection will have the same IP addresses and MAC addresses for the duration of the connection, and therefore the same hash result. This means that regardless of your link aggregation setup, in ESX 3.x, the network traffic from an ESX host for a single iSCSI target will always follow a single link.
So, why discuss it here? While this post focuses on iSCSI, in some cases customers are using both NFS and iSCSI datastores. In the NFS datastore case, MPIO mechanisms are not an option – load-balancing and HA are all about Link Aggregation. So in that case, the iSCSI solution needs to work alongside the concurrently existing Link Aggregation.
Now, Link Aggregation can be used completely as an alternative to MPIO from the iSCSI initiator to the target. That said, it is notably more complex than the MPIO mechanism, requiring more configuration, and isn’t better in any material way.
If you’ve configured Link Aggregation to support NFS datastores, it’s easier to leave the existing Link Aggregation from the ESX host to the switch, and then simply layer on top many iSCSI targets and MPIO (i.e. “just do answer 1 on top of the Link Aggregation”).
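For completeness, the switch side of that “leave the existing Link Aggregation in place” approach typically looks something like the snippet below on a Cisco switch. Remember that ESX 3.x only does static 802.3ad (it doesn’t negotiate LACP), so the channel mode is “on”, and the ESX-side NIC teaming policy must be set to “Route based on IP hash”. The port and channel numbers are examples:

    ! Static EtherChannel for the ESX uplinks (ESX 3.x does not speak LACP).
    interface range GigabitEthernet0/11 - 12
     channel-group 5 mode on
    ! Hash on source+destination IP so different iSCSI targets / NFS servers can land on different links.
    port-channel load-balance src-dst-ip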
To keep this post concise and focused on iSCSI, the multi-vendor team here decided to cut out some of the NFS/iSCSI hybrid use case and configuration details, and leave that to a subsequent EMC Celerra/NetApp FAS post.
In closing.....
I would suggest that anyone considering iSCSI with VMware should feel confident that their deployments can provide high performance and high availability. You would be joining many, many customers enjoying the benefits of VMware and advanced storage that leverages Ethernet.
To make your deployment a success, understand the “one link max per iSCSI target” ESX 3.x iSCSI initiator behavior. Set your expectations accordingly, and if you have to, use the guest iSCSI initiator method for LUNs needing higher bandwidth than a single link can provide.
Most of all, ensure that you follow the best practices of your storage vendor and VMware.
This post is insanely great! Nice job, guys, and thanks for getting together on the topic.
Looking forward to ESX 4!
Posted by: Stephen Foskett | January 26, 2009 at 10:12 AM
Thanks for reaching out on this initiative. Looking forward to doing more of these types of posts in the future.
Posted by: Vaughn Stewart | January 26, 2009 at 11:12 AM
Excellent article!
Thanks for sharing your knowledge.
Posted by: Mike La Spina | January 26, 2009 at 11:45 AM
And with ESX 3.5, there's probably not much you can do with an AX150i array. I hope multipathing with 4.x will help immensely.
No idea what EMC was thinking with that one, should never have left the design floor.
Posted by: Ian Beyer | January 27, 2009 at 10:26 AM
A couple of very important aspects your article does not discuss:
1. You cannot use VMware VCB if using the iSCSI guest software initiator.
2. You cannot place pagefiles on iSCSI volumes
3. You cannot use SRM out of the box
4. Consider lack of snapshots on iSCSI volumes
5. Windows 2000 problems with network shares
Anyone else care to add to the list?
Posted by: andrew | January 27, 2009 at 06:08 PM
Nice work here guys! This is the kind of collaboration and transparency I hope to see more of in the storage blogosphere.
Posted by: Val Bercovici | January 27, 2009 at 10:33 PM
Andrew - thanks for the comments - none of us has all the answers, each of us only has parts, so contributions from all are good - thank you!
Answers for your 5 points:
1) We should have pointed out that native ESX snapshots, and all the things that depend on them (VCB, svmotion, lab manager), don't work with guest iSCSI initiators. I will point out that the things requiring the high throughput (again, throughput that can't be achieved via any other means in the VI 3.5 model without using another storage protocol type, which we put aside in this post) are usually things like large Exchange databases, or databases of another type.
VCB 1.5 doesn't support application-integrated backup for guest applications like these (including log handling and recovery modes) - so the loss of VCB isn't a big one.
These also happen to be the case where snapshot based backup tools using array replication can work very well using our tools (keeping in the spirit of the post - EMC's Replication Manager, NetApp SnapManager family, or Dell/EqualLogic's similar tools)
2) Can you point to something confirming that pagefiles can't be on iSCSI? I'm not familiar with that guideline, and would like to see it. It doesn't make sense, and is likely grounded in error.
3) We did point out that SRM support for the case of the iSCSI sw initiator in the guest is a "x" in the column. "It is notable that this option means that SRM is not supported (which depends on LUNs presented to ESX, not to guests) "
4) this is the same point as 2)
5) Like 2, can you please be more explicit, or link to supporting docs, my apologies, I'm unfamiliar with anything along those lines.
I will reinforce our overall comment (and this was the consensus view).
- Start with the answer to Question 1. Keep it simple, and just accept the ~160MBps per iSCSI target in ESX 3.5 for iSCSI. For many this is enough - period.
- Use the guest initiator model selectively when needed, for specific VM use cases where high bandwidth is needed within that larger set of VMs, recognizing the restrictions we pointed out and the one you added (no native ESX snapshots, or any VMware function that is built off ESX snaps, or SRM support for that particular VM in a recovery plan)
Posted by: Chad Sakac | January 28, 2009 at 01:39 AM
Val, thanks for your comment. Trying always to be above the fray - focus on customer, respect competitors.
"Be the change you want to see in the world" - Gandhi
Posted by: Chad Sakac | January 28, 2009 at 01:50 AM
Chad,
I have a Celerra (NS20). When you say to have multiple iSCSI targets, does each target on the Celerra require a separate IP address (currently I have 3 targets all using the one IP address)? I've always wondered if adding more IP addresses to the targets would help with throughput. BTW, the Celerra is connected to the switch via 2 x 1Gb Ethernet ports using EtherChannel.
Great article. I'm trying to fix/improve I/O performance at my work and blogs like these are great.
Posted by: David | January 28, 2009 at 08:15 AM
Terrific Post Guys! I've had to explain most of this over and over and over again to my iSCSI clients who didn't have the knowledge of iSCSI, LACP, etc. and complained about their slow LUNs (thus perpetuating the "iSCSI is slow myth"), and after about a day of reconfiguration, dramatically increased their performance with the same SAN equipment.
I do have one point of contention: the discussion of there being no quantifiable benefit of LACP over just using MPIO. Of course you stated in the article that MPIO can take up to 60 seconds to fail over. However, in a properly configured LACP environment, failover is much quicker (on the order of 30ms-5sec) and transparent to VMware (because it doesn't have to MPIO to another IP address), so the need to reconfigure guests with higher iSCSI timers is unnecessary. Of course, you can then have multiple LACPs with multiple IP addresses and load balancing and really ramp it up. So in that sense, LACP does have a quantifiable advantage over MPIO at this time and is a relatively KISS-principle-compliant solution assuming your switch and SAN support it in a painless way, although if the MPIO timers were tweakable I suppose you could possibly get the same result.
Another important point is that VMware actually doesn't support LACP, which is the negotiation protocol for creating aggregate links. Instead, it only supports 802.3ad static mode. Hopefully we'll get LACP support in ESX4, as that will help with both the setup learning curve (removing misconfigured ports from the trunk) and failover time.
My currently favorite config-du-jour is either the stackable Cisco 3750's or the Stackable Dell Powerconnect 6248's (which is a surprisingly good high performance, feature laden, and cheap L3 switch believe it or not) and 802.3AD cross-stack trunks from both the SAN targets (assuming it supports it) and the VMWare infrastructure.
Thanks for the excellent article guys! This definitely goes into my "read this first" pile for clients.
-Justin Grote
Senior Systems Engineer
En Pointe Global Services
Posted by: Justin Grote | January 28, 2009 at 09:27 AM
I would _love_ to see a guest vs host s/w initiator comparison...
Posted by: Stu | January 28, 2009 at 04:35 PM
thanks for the comments all!
David - Thank you for being an EMC/VMware customer! Hope you're enjoying your Celerra!
Each iSCSI target maps to one or more Network Portals (IP addresses). Now, unless you have more than one iSCSI target, all traffic will follow one network link from ESX - period (for the reasons discussed above). BTW - in the next VMware release, you can have multiple iSCSI sessions for a single target, and there is round-robin multipathing and the more advanced EMC PowerPath for VMware (which integrates into the vmkernel - very cool!)
But, for 3.5, you will see more throughput if you configure differently.
Your NS20 has 4 front end GbE ports, so you have a couple of simple easy choices that will dramatically improve your performance.
It depends on how you have configured your ESX server - are you using link aggregation from the ESX host to the switch, or multiple vSwitches? (this is something we need to add to the post) Let me know, and I'll respond...
UPDATE (1/31/09). David, I haven't heard from you, so will give the answer here for all, and also reach out to you directly.
Long and short - with 1 iSCSI target configured, you will never get more than 1 GbE connection's worth of throughput. You need to configure multiple iSCSI targets.
Now, the Celerra is really flexible about how to configure an iSCSI target. You can have many of them, and each of them can have many network portals (IPs). BUT, since the ESX iSCSI software initiator cannot do multiple sessions per target, or multiple connections per target - in this case, create multiple iSCSI targets - at least as many as you have GbE interfaces used for vmkernel traffic on your ESX cluster. Each needs a separate IP address by definition.
By balancing the LUNs behind the iSCSI targets you will distribute the load.
You have used 2 of the 4 GbE interfaces on your Celerra (there are 4 per datamover, and the NS20 can have two datamovers - the Celerra family as a whole can scale to many datamovers).
SO, your choice is either to plug in the other two, assign IP addresses, and assign iSCSI targets (just use the simple iSCSI target wizard)
OR
The Celerra can have many logical interfaces attached to each device (where a device is either a physical NIC or aggregated/failover logical device). You could alternatively just create another logical IP for the existing 2 linked interfaces, and assign the IP address to that.
Now, you also need to consider how you will loadbalance from the ESX servers to the switch.
You can either:
a) use link aggregation (which will do some loadbalancing since there will be more than one TCP session, since you have more than one iSCSI target) - make sure to set the policy to "IP hash"
b) use the ESX vmkernel TCP/IP routing to load balance - here you have two vSwitches, each with their own VMkernel ports on separate subnets, and then you need to have the iSCSI target IP addresses on separate subnets. This ensures even load balancing. (A minimal CLI sketch of this option follows.)
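A rough service-console sketch of option (b) – the vSwitch names, vmnics, port group names and IP addresses are examples only:

    # Two vSwitches, each with one uplink and a VMkernel port on its own subnet.
    esxcfg-vswitch -a vSwitch2
    esxcfg-vswitch -L vmnic2 vSwitch2
    esxcfg-vswitch -A iSCSI-A vSwitch2
    esxcfg-vmknic -a -i 192.168.101.21 -n 255.255.255.0 iSCSI-A
    esxcfg-vswitch -a vSwitch3
    esxcfg-vswitch -L vmnic3 vSwitch3
    esxcfg-vswitch -A iSCSI-B vSwitch3
    esxcfg-vmknic -a -i 192.168.102.21 -n 255.255.255.0 iSCSI-B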
Let me know if this helps!!!
Posted by: Chad Sakac | January 29, 2009 at 10:26 AM
In Answer #3 above you make mention of using hardware initiators only if iSCSI boot is required - what about the fact that hardware initiators support 9K jumbo frames whereas jumbo frames are not yet a supported configuration for the 3.5 vmkernel?
Wouldn't the performance benefit of jumbo frames alone merit going with a hardware initiator?
Posted by: Gregory Perry | January 29, 2009 at 06:22 PM
All the talk seems to be about throughput performance. I'm not seeing anything on latency. For example, in a laptop I have, changing to an SSD drive has improved my boot and app load times immensely. Throughput hasn't changed - I still get the same MB/sec as with the old disk - but the latency is way down.
For applications where the usage is like this: i/o, cpu, i/o, cpu, i/o etc, latency is a huge issue. For this kind of use, I'm not seeing iSCSI being a good choice. Fibre Channel still has the benefit of lower latency, and when that's not enough, InfiniBand.
what do people think?
Posted by: daniel baird | January 29, 2009 at 09:25 PM
Daniel Baird,
I m not negating what you are saying, just complementing it: what you did to your laptop is a classic
"I have re-designed and moved a bottleneck somewhere else" thing. Its a never ending story mate..
The above applies to any design exercise.
Cheers,
Eric Barlier
Posted by: eric | January 29, 2009 at 11:29 PM
Great post, Chad, can we syndicate (ie. promote) this on VIOPS?
Posted by: Steve Chambers | January 30, 2009 at 05:28 AM
Steve - you can ABSOLUTELY promote on VIOPS (which is awesome BTW).
Posted by: Chad Sakac | January 30, 2009 at 09:16 AM
Daniel/Eric - latency DOES matter.
EMC has a lot of experience with EFD - which we've been shipping for a year.
To understand Enterprise Flash Disk: think of these as the SSD you see in your laptop but on steroids. They are designed for many, many more read/write cycles (and have lots more extra cells). They have dual-ported interfaces, and a lot of extra firmware/SRAM between the interfaces and the actual flash.
EMC's view of the future here is that soon there will be only two types of disks - huge slow SATA, and hyper-fast EFDs. All sort of host interfaces (SAS/FC/iSCSI/NAS/FCoE....) but that will be the stuff that stores the data. All our arrays support EFDs now.
OK - back to the point - LATENCY does matter!
The reasons EFDs rock (even in high end cached arrays, and even in VERY high cache hit rates like 95%) is their ability to deliver 30x the IOPS (IOs per second) of a traditional disk. This actually means that while they cost about 9x more than a FC disk (commercial SSDs costing about 3x more today than a SATA disk), they are in the end cheaper. They also save a TON of power/cooling space.
Exciting stuff, and our customers LOVE IT.
It was, however out of the scope of the post. In talking with the guys, we decided to put aside FC (which of course deals with the same multipathing issues in 3.5, but has lower latency and much higher effective bandwidth at large block workloads) since that would become potentially political and exclude the iSCSI-only vendors, negating the point of the joint exercise.
Speaking as the EMC person, our view is the following:
When VMware is used for the "100% virtualized datacenter" there is no single pat answer about protocol choice, backend choice, connectivity choice - because you have some "craplication" VMs, some "important VMs", some "mission critical VMs".
Each one of those have differing IO requirements which are "orthogonal" - i.e. no correlation to the "importance".
In fact, my recommendation (personally) is that every ESX cluster should have block (and the choice of type varies - and in some cases is "several") **AND** NFS - as each have VMware "super powers" and also some features which work only on one or the other.
We pride ourselves as EMC as covering all those bases.
It makes for a more complicated discussion. It's simplistic to say "iSCSI is the ONLY way" or "NFS is the ONLY way" or "FC is the only way". Those are the answers of someone with an agenda, a bias, or a cult :-)
The "it depends" answer is correct and the answer of a pragmatist. It needs to be followed by a "let's talk about what you're trying to do, and how we can design a solution that meets those requirements and is the simplest we can make it at the same time".
Posted by: Chad Sakac | January 30, 2009 at 10:13 AM
Greg - re jumbo frame support. It works, and works fine, but as you point out it's not supported (and I try my darndest to never recommend something in a production environment that isn't supported).
I was surprised that the jumbo frames didn't help more when we did testing. We did a lot of testing around that, and I posted on it here: http://virtualgeek.typepad.com/virtual_geek/2008/06/answers-to-a-bu.html
I'm not saying it's not good, it does make a difference, but only with large block IOs (64K or larger - common during backup/restore or database DSS workloads).
In my opinion, this doesn't warrant moving to an iSCSI HBA. For the same dough, you can get an FC HBA, and all our arrays (with the exception of the AX series) support FC and iSCSI together at no extra cost. Yes, there is the price of the FC switch and ports, but they are a lot cheaper these days.
So, here's the logic:
- I **LOVE** iSCSI. If you are set on using iSCSI, use the guidance in the doc. the SW initiator is the focus of most qual, work, testing and the most widely deployed.
BUT
- If you have to drive high throughput to a SINGLE target, either wait until ESX 4, which will support jumbo, will have multipathing, AND multiple sessions.
CAN'T WAIT?
Rather than spend on the iSCSI HBAs, go FC. It will cost marginally more.
WANT TO BUILD FOR THE FUTURE?
Go with the 10GbE converged adapters, which look like two 10GbE NICs and two 8Gbps FC HBAs to ESX.
These are supported with ESX 3.5 already, and EMC e-Lab (a multibillion dollar interop effort) has qualified all the gear (FCoE switches, CNAs).
Like the earlier comment - no single answer is right for all customers, but there is an answer for EVERY customer.
Posted by: Chad Sakac | January 30, 2009 at 01:10 PM
Thanks for the replies Chad & Eric.
Seeing this article in the context of "when it makes sense to use iSCSI..." makes all the difference.
Eric, I agree I just changed where the bottleneck is. On my laptop, I get much higher average CPU utilisation now. That's now the bottleneck. I didn't explain myself very well; I think my point was more that reducing latency was what helped my performance issues. A disk with higher throughput but the same latency wouldn't have done much for me.
Chad, thanks for the lengthy reply! I'm new to virtualisation; I'm from a telco core engineering background. Enterprise is a new area for me, so I'm having to learn a new solution flowchart/methodology. There's lots to learn
e.g. with FC, you have the FC HBA and FC switch costs. With iSCSI, it's NIC and Ethernet switch costs. You also have the cost of the extra CPU you're using for the TCP/IP overhead (less if you use a TOE card, but they cost a similar amount to an FC HBA). Old style servers with the OS directly on hardware usually have more CPU headroom which can be given over to iSCSI processing, but with ESX/Xen/etc you're trying for high average CPU util so there's not the "spare" CPU available.
Lots of pros and cons. :) It would be cool to see a cost/performance comparison of FC, iSCSI and NFS with a set pair of storage and server boxes - for example several HP DL380s and an EMC CLARiiON running a number of VMs. The VMs would be running key apps like MS Exchange or Oracle etc. I find there's not enough data on how apps use storage to help you pick the type that best fits. In the past I've seen too many over-engineered solutions where expensive kit is thrown in because the app's behaviour is not well understood. I'm sure we'd all love to have the time and budget to lab test all the hardware combinations to see what's the most efficient. Apps that move big chunks of data around and are less affected by latency would seem to be prime candidates for iSCSI.
Also, there's FCoE. That really kicks the ant's nest, but perhaps it can happily exist alongside iSCSI. FCoE may kick out the existing full-stack FC. But I digress...
Posted by: daniel baird | February 01, 2009 at 07:19 PM
Guys, from a networker that just stumbled across this very good article: be aware that on the networking side, if you use ether-channels between switches (servers not all connected to 1 switch), you need to take the load-balancing algorithm of the switch into account (yes...)
That can work on source/dest MAC/IP. Example:
cisco.com: Use the option that provides the greatest variety in your configuration. For example, if the traffic on a channel is going only to a single MAC address, using the destination-MAC address always chooses the same link in the channel. Using source addresses or IP addresses might result in better load balancing.
> (IOS CLI) port-channel load-balance {dst-ip | dst-mac | src-dst-ip | src-dst-mac | src-ip | src-mac}
http://www.cisco.com/en/US/tech/tk389/tk213/technologies_configuration_example09186a008089a821.shtml
Posted by: Martijn Jansen BT | February 05, 2009 at 10:26 AM
Hi,
Guys, “A Multivendor Post to help our mutual iSCSI customers using VMware” is a good idea to share knowledge. We are working with Stonefly and DNF products. If somebody has any question or query on integration of servers (FC, iSCSI, IP SAN) with Stonefly or DNF products, please forward it to me.
Stonefly and DNF products can be integrated with almost any solution like Microsoft, VMware ESX, Solaris, Linux, etc.
Regds.
StorageSolutionGroup
Posted by: Storage Solution | February 19, 2009 at 05:26 PM
Virtual Iron which is an inexpensive server virtualization product also integrates with iSCSI in a similar fashion.
Posted by: VISE | February 26, 2009 at 01:51 PM
Should flow control be enabled for NFS as well?
Posted by: Ben | April 09, 2009 at 05:00 PM
I have to agree this is exactly the type of collaboration that produces high quality work.
Couple of questions regarding ESXi and Clariion:
We've done a few tests that have conflicting results. Now, I somewhat understand why. Could you confirm or expand on these 2 concepts:
1) Essentially, because of the current limitation to a target, Exchange and SQL should use the guest OS iSCSI initiator (and PowerPath in my case) to provide greater throughput. Is that accurate, and will that change when PowerPath for ESX is released?
2) We use a fully redundant fabric of two subnets, one for each processor of the AX4. In our case, it would be better to use two vSwitches, each with their own VMkernel ports on a separate subnet, rather than using a single vSwitch with two NICs committed.
Thanks again for this outstanding information.
Posted by: Dan Israel | April 21, 2009 at 03:53 PM
Hi,
what a wonderful article. After reading your article I realised that by using the following deployment http://img11.imageshack.us/my.php?image=deploymenti.jpg it limits the bandwidth to just one link (fail over) rather than 2 GBps which can boost performance.
I wonder if I can use trunking using directly attached 2x 1Gb Ethernet cables as iSCSI from the Dell MD3000i SAN into two ESXi Servers - would that be a faster solution rather than using a switch in between?
Posted by: Albert Widjaja | May 13, 2009 at 07:54 AM
Hi Chad,
First of all, i would like to thank you for your valuable information.
I have some questions about Celerra and VMware ESX.
We are going to start a new VMware project soon and we plan to use the methods you mentioned above. We are going to use VMware ESX 3.5.
As you say, I know that I need to use multiple iSCSI targets on the IP storage side in order to get max performance with the iSCSI software initiator.
We will use a virtual switch composed of 4 NICs for the iSCSI network and “ip hash” as the NIC teaming load balance policy.
we can use a few methods related to it.
1. We can generate two iSCSI LUNs on every target by creating 4 iSCSI targets.
1.a Is it possible to have a different iSCSI session on each NIC (virtual switch with 4 NICs)?
2. We can generate one iSCSI LUN on every target by creating 8 iSCSI targets.
2.a Is it possible to have two different iSCSI sessions towards 2 different iSCSI targets on each NIC (virtual switch with 4 NICs)?
Do you have any suggestion about another trick for the best iSCSI performance?
Does vSphere support multiple connections per session (MC/S)?
We are planning to upgrade VMware from 3.5 to vSphere at the end of the year. Do you think that we need a change in the structure above when upgrading to vSphere? (taking into consideration the iSCSI target and LUN numbers)
Finally, there are 60 disks on the NS40. Do you have any suggestion for max performance on the storage side? Is it best to use AVM?
We are looking forward to receive your answers.
Thank you for your attention.
Best regards,
Posted by: cemal dur | May 24, 2009 at 05:37 PM
Max throughput 160MB? I have always thought it was 1000/8=125MB. How are you getting 160MB across a 1Gbps link?
Posted by: Steve | June 07, 2009 at 07:02 PM
@Steve - remember that it's 1Gbps unidirectionally, and generally ethernet is configured in full-duplex.
125MBps is really an unachievable goal - even if there were no overhead, and there of course is overhead (ethernet frame, IP header, TCP header, iSCSI PDU header) and re-transmits, and control traffic.
So - 80MBps is a more achievable throughput with a 100% write or 100% read workload (which results in predominantly unidirectional iSCSI traffic), and about 160MBps with a mixed read/write workload.
Posted by: Chad Sakac | June 11, 2009 at 07:11 PM
I'm looking at the 160MB number and am curious how you are getting 160MB throughput across a 1GB link? 1000/8=125 please show how you are reaching 160MB.
Posted by: Steve | June 14, 2009 at 03:24 PM
Great article. I learned the hard way that it will always use a single NIC when you only have one iSCSI target.
Question... I can set up multiple IP's on my SAN and set up multiple iSCSI targets, but do you only use 1 vSwitch for the vmkernel and service console or do you create multiple vSwitches for this?
Posted by: Dan J | June 18, 2009 at 10:56 AM
Chad,
I also have a Celerra (NS20). 4 ports are active for the Server 2 Datamover, and the server 3 DataMover is in standby....right? That's what was explained to me.
So i carved out 5 LUNs @ 500GB each. I also have 12 NICS on each of my 3 ESX servers.
Initially I setup the Celerra for LACP on all 4 ports as a single target going to my 3750 switch. All my LUNs were behind the single target IP. After reading this, I broke them up over 4 target IPs which really made my VMotions slower.
What are my options for the best speed but fault tolerance for VSphere?
Posted by: Sean | June 18, 2009 at 03:59 PM
Thank You for the great post.
What indicators should we be looking for to identify that we have maxed out an iSCSI session? We are using LeftHand equipment and we have link aggregation in place.
Obviously we can look for bandwidth bottlenecks on switch interfaces. But from a Windows Virtual Server, would you start seeing disk queue length counters climbing? Are there other perfmon counters that we would notice?
We are looking to place Exchange 2007 in a VM for 2500 users. Currently our Exchange environment lives on an FC CLARiiON. I am a bit concerned after reading this that iSCSI may not have enough throughput for our Exchange environment.
Thank you
Posted by: Kevin | July 30, 2009 at 08:15 PM
Kevin, as an EMCer, if this response doesn't buy me "well, at least he's honest" I don't know what will :-)
I wouldn't worry about that user load and Exchange and iSCSI.
Exchange is actually (in steady state), IOps bound - not bandwidth (MBps) bound. For example - assuming 0.5 IOPs/user, and the 8KB IO size of Exchange 2007 - your 2500 users = 10MBps, which is well under these limits.
Now - during a backup (if you're doing a traditional backup) it will be bandwidth bound (it drives as much as you've got)
There's an easy way to check before you migrate. Just use perfmon, and measure the physicaldisk stats for a week (capture the backup periods). If they are in those bounds, you're good. If not, you need to look at the workarounds we listed.
More important with Exchange is generally the number of spindles. just make sure you have enough in your lefthand configuration.
The key things to watch with iSCSI from a VMware standpoint are the network bandwidth statistics, the vscsi stats on latency (bad latency = unhappy apps), and if the backend storage is happy but latency is not good, look at queue depth (the QUED column in esxtop, or the disk queue length counters in perfmon) - and check to make sure the queues aren't overflowing.
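A couple of concrete ways to watch that (a sketch - adjust sample intervals and counters to taste):

    # On the ESX host: run esxtop from the service console, press 'd' for the disk/adapter view,
    # and watch DAVG/cmd (device latency), KAVG/cmd (kernel latency) and QUED (queued commands).
    esxtop

Inside a Windows guest, the matching perfmon counters live under PhysicalDisk - Disk Transfers/sec, Avg. Disk Queue Length and Avg. Disk sec/Transfer.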
Good luck - and let me know if I can help further!
Posted by: Chad Sakac | July 30, 2009 at 10:56 PM
Wow, thanks for this really good explanation on the link aggregation/ESX relationships. This cleared up a whole load of questions for me.
Many thanks
Posted by: John Doyle | September 26, 2009 at 03:59 PM
Can anyone comment on the Openfiler iSCSI SAN? To the best of my knowledge, iSCSI targets in Openfiler do not support multiple connections per initiator. I am not even sure if you can assign a unique IP per LUN. So does this mean that I am limited to 1.0Gb for the entire Openfiler?
Posted by: www.facebook.com/profile.php?id=1734056550 | October 11, 2009 at 02:57 AM
I love that quote and would be honored for Chad to know me!
It's hard prepping for interviews when I know who I am most interested in and who I can earn a career with, not just a job. One night I was supposed to be studying switches and instead wrote my elementary understandings of VDI. Don’t get me wrong, I did study switches as well. Today I am supposed to be prepping for other interviews and I want to study this blog because it's good and my high level friends tell me to. Again I will probably do both, with pneumonia, because that's who I am! I can’t help but notice my interviews are teaching me the entire network, maybe to understand some of the intricacies of consolidation? As I said in my profile, my writings display my passion and self motivation to ramp, not what I wish to know.
Shannon, ISR/Lead Generation
Posted by: shannon | March 21, 2010 at 01:53 PM
Oops, I was commenting on, "If you are passionate about these technologies, good in front of people, like working hard when it’s something you believe in, and feel like we’re at the cusp of a wave of technological change – I want to know you.”
What I will say about your above writing is it's not our place to tout competitors on our marketing blogs, but I love how you do this when it's required to service the client. I love how you stay professional in this competitive marketplace, and use facts. Kudos to the customer centric way in which you do business.
Shannon
Posted by: shannon | March 21, 2010 at 02:09 PM
Dear Chad! Thank you a lot for sharing. This is very relevant information for me, especially about Celerra and VMware ESX. I will bookmark your blog and will use this information in my custom paper writing. Wish you good luck.
Posted by: Alex | May 08, 2010 at 01:23 AM
Chad, nicely done. I posted something similar on building a Linux machine with iSCSI. I would love to share that with the group; it takes less than 1 hour to build everything from ISO to working machine. Using your configuration you're done in 30 more minutes. Great Job!
Posted by: Arthur Gressick | September 14, 2010 at 07:55 PM
Chad, I'm still using ESX 3.5 and I'd like to understand how to configure the iSCSI storage infrastructure to get redundancy (and not to increase throughput). The disk array is an EMC Celerra.
Could you point me to a good reference?
Posted by: Domenico Viggiani | November 11, 2010 at 06:34 AM
@Domenico - yes, please see the Celerra Techbook, here: http://www.emc.com/collateral/software/technical-documentation/h5536-vmware-esx-srvr-using-emc-celerra-stor-sys-wp.pdf
basically - make sure you have the iSCSI target exposed via multiple virtual/physical interfaces. Those should go into a redundant switch fabric. You'll see the iSCSI LUN visible on multiple targets. In the LUN properties in vCenter, you'll see an active path, and a failover path.
Posted by: Chad Sakac | November 12, 2010 at 11:58 AM
Chad, thank you very much.
I heard you and Vaughn Stewart at VMworld speaking about storage best practices with VMware (the best session I attended!)
With EMC NS480, do I still need to setup different target/interfaces in different subnets?
Posted by: Domenico Viggiani | November 15, 2010 at 05:55 AM
Hi all, since there's several on here with first hand EMC experience, figured I'd pose a question. We've just deployed a Cisco UCS cluster with ten gig out and an EMC Clariion CX4-480 with ten gig. The storage is on its own vlan configured for 9000 byte MTU (jumbo frames) on dedicated static vnics. I've verified with ping and the do not fragment bit that jumbo is working end to end. The server blades in the UCS are running ESXi 4.1 and EMC powerpath with four targets for each LUN (emc spa/0 & 1, spb/0 & 1). Our raid groups in the EMC are a meta lun striping two 5+1 RAID 5 arrays each.
So, that brings me to the question: the best throughput we've been able to achieve from a guest to the EMC running disk benchmarks to an idle CX4 has been 249 MB/sec block writes and 165 MB/sec block reads. As far as I can tell, there is no limit being hit anywhere; the drives in the CX4 are barely lit, the switches show no errors and have not bursted higher than ~1700 Mbit/sec, no exhaustion of buffers, and CPU load on the server side looks fine. What I don't know is whether ESXi 4.1 on Cisco UCS is using hardware for the iSCSI. If not, is this perhaps a VMware CPU issue?
Posted by: David H | November 20, 2010 at 10:39 AM
@domenico - first, thanks for being a customer! On an NS480 - there are two ways to do iSCSI: 1) via the Datamover (which we are going to phase out over time - for example, there won't be VAAI support for it, or provisioning via the vCenter plugin); 2) via the Storage processor (this will be the "winner"). BTW - you can non-disruptively add the second type if you're currently not using it. This was a difficult decision, but it didn't make sense to have two iSCSI targets in the same array.
I ask because the best practices differ depending on the target. Let me know which type you are using and I will help.
@ David H - also - thanks for being a customer! This has hit pretty much every vendor supporting 10GbE, EMC included - it is very likely (but not absolutely certain) due to the Broadcom chipset. First, are you running the F29 patch for 10GbE customers? This is also fixed in FLARE 30.5 (the most current rev).
I talked about that here: http://virtualgeek.typepad.com/virtual_geek/2010/10/nice-updated-emc-unified-iscsifcoe-tidbits.html (download the VMworld session and have a look-see)
Setting flow control on the switches is also important.
With all those set, the only question I would have would be the spindle config. As described (why not use Storage Pools rather than Metas? Storage pools are the future!), you MIGHT not have enough backend to saturate the interfaces (if it's a small block size) - though I DOUBT this is the case (based on your description).
Let me know if the suggested fixes help you, otherwise, please open a case, and we'll work it.
Posted by: Chad Sakac | November 20, 2010 at 01:12 PM
Hey Chad, we're running what's showing as 4.30.0.5.508 - is that the "F29" patch? Or better? I'm very new to the EMC side of things so just jumping in feet first. I did read some of your threads before we even got started and made sure we'd be taken up to FLARE 30 before we deployed since I knew it had the multi-path iSCSI initiator fix. As far as I can tell, we're not experiencing any logout issues with our four paths (via PowerPath) to the targets, which I think was the issue in pre-FLARE 30.
I have absolutely no problem blowing away our meta luns and raid arrays and re-doing them as raid pools if it would be advantageous; we only have a few virtual machines booted so I could easily move them to one meta lun, rebuild the rest of the storage and then hot migrate them to complete the final rebuild. Would that be worth doing? I got the feeling we only went that route because the install-vendor was more used to doing it that way.
On the Cisco side, I'm showing:
MTU 9000 bytes, BW 10000000 Kbit, DLY 10 usec,
input flow-control is on, output flow-control is off
Posted by: David H | November 20, 2010 at 10:25 PM
Hi Chris, just an update, what I've done this morning is reconfigure my CX4 back down to 1500 byte MTUs and deleted/recreated the iscsi vnic's on the ESXi side to go back to 1500 byte MTUs and now my read speed on an 8-drive (2 x 3+1 raid 5) meta LUN jumped from a best ever 165 MB/sec to 297 MB/sec.
On the Cisco switch side I'm not showing any buffer overruns or drops since they have a pretty large 175 MB buffer on the 4900M switches. Plus the PowerPath I can tell is correctly splitting the load across two switches since SPB owns the MetaLUN in question and its port 0 and 1 are on different switches. Also on the Cisco side I can see a lot of pause frames coming back from the CX4 which makes me think maybe at 9000 byte MTU's the buffers on whatever NIC hardware the CX4 10gig cards use are not sized in a way that works well with 9000 byte frames.
Do you know if there's any internal EMC testing data that shows what the ideal MTU is for the CX4's 10gig cards? I can step through the 11 pre-defined options between 1500 and 9000 if not but that's slow going thanks to the vsphere side of deleting/recreating the vnic's. :-)
Posted by: David H | November 22, 2010 at 11:20 AM
Sorry for messing up your name in the last post; been playing with storage more than sleeping. lol I deleted a few metaluns/raid groups and created a pool of 10 drives but did not see much change in the performance unfortunately.
Posted by: David H | November 24, 2010 at 02:20 AM
@chad, I understand what you say, thanks.
I'm evaluating the pros and cons of all methods to "attach" storage to VMware (and not only to it... also Linux and Windows boxes will share the same "fabric"):
If possible, I prefer FC that works at its best without many efforts of configuration.
As an alternative to FC, I'm looking at iSCSI (with MPIO as failover/load-sharing option) and NFS (with network level solutions for failover/load-sharing), as you suggest in your posts.
I'm trying to avoid any prejudiced position.
I know that the NS480 has iSCSI on the front-end; perhaps it's the best solution (sincerely, I already have a few old Celerras and their datamover architecture is not my love!). If you have some spare time and point me in the right direction, I surely will avoid a lot of mistakes! Thanks in advance
regards from a long time EMC customer
Posted by: Domenico Viggiani | November 24, 2010 at 05:59 AM
I am trying to understand the difference between iSCSI multipathing vs. NIC teaming (Active/Active) -- what are the differences?
Posted by: J | December 26, 2010 at 12:58 AM