I’ve always found that resolving performance issues is more about the HOW than the WHAT. In other words, it’s not like there is a list of “secret switches” that simply make things scream (ergo: “just enable jumbo frames and it will solve all your problems” or “do/don’t use thin, it will be good/bad”).
I call these hyper-specific pieces of guidance the "WHAT" (the idea that there's some specific set of configurations that will always be good).
There’s simply no way for any human to be able to keep track of all the infinite variations, which all are constantly changing.
So, here are my guiding principles on HOW I try to approach performance – in other words, best practices that are "process things that get you to a good place", rather than "knobs to turn to get you to a good place".
- First Personal Rule: KISS is more than a cute acronym, it’s a way of life.
- Second Personal Rule: If you find yourself needing to tweak a lot of defaults, you’re probably on the wrong track.
- Third Personal Rule: You’ll inevitably run into problems – so sticking to a good “how to diagnose” process is very important.
I’m going to use ONE example to explore these ideas (and this is a real customer example, with real customer data).
If you’re interested, read on…
So – here’s the scoop – there’s a GREAT VMware and EMC customer that IMO is generally quite advanced. Back in 2010, they started deploying 10GbE on VMware (using Intel X520 interfaces), and used an EMC midrange storage platform (a CX4 – they’re happy with it, and are evaluating VNX), adding 10GbE iSCSI UltraFlex SLICs.
They found that they were not getting the kind of bandwidth from the interfaces that they expected. Here were the ESXTOP results from one of the hosts:
First of all – remember – it’s not “real world” to drive a workload to saturate links (which is possible), so don’t fixate on maximum throughput. The thing generating the load was sequential IO within a single VM. They were also finding latencies that were materially higher than they expected – here’s a vSCSIstats dump:
As you can see – the distribution had WAY too many IOs landing in the 100ms bucket. Long and short – they felt they could, and should, get more.
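If you haven’t used vscsiStats before, this is roughly how a histogram like that gets gathered from the ESX console (the world group ID below is just an example – “vscsiStats -l” will show the real ones for your VMs):

# list the running VMs and their world group IDs
vscsiStats -l
# start collecting for the VM in question (example ID)
vscsiStats -s -w 118739
# ...run the workload for a while, then print the latency histogram
vscsiStats -p latency -w 118739
# stop collection when done
vscsiStats -x -w 118739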
At VMworld 2010, I presented a series of topics on general storage best practices together with Vaughn. While we intentionally worked to make the content apply across the storage landscape, I put a couple things in there that were germane. Here was one:
There were two main points here: 1) to optimize your bandwidth, make sure you’re current on the software, and make sure you enable flow control end-to-end; 2) don’t obsess over line rate (and set expectations on what’s reasonable to expect on a given platform). BTW – I did it cognizant of the fact that competitors could use it for silly counters, but frankly, we are the leader, and they will always needle us. I believe we serve our customers best when we’re as transparent as possible – and I worry less about our competition and what they’ll do.
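As a quick illustration of the flow control point – on a classic ESX host you can check what the NIC has negotiated from the service console (vmnic2 is just an example, and “end-to-end” means the switch ports and the array ports need consistent settings too – check your NIC/driver and switch documentation before changing anything):

# show the current pause (flow control) settings on the NIC
ethtool -a vmnic2
# enable rx/tx pause frames on that NIC
ethtool -A vmnic2 rx on tx on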
The customer read VMware KB 1002598, and tried disabling DelayedAck, and saw dramatically improved performance, as shown in this ESXTOP result.
Likewise, latencies got a LOT better.
Then, they looked at the core array performance:
… and saw that it was much better.
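For reference, the change the KB describes is made per host, on the software iSCSI adapter – in the vSphere Client it’s under the iSCSI initiator’s Properties > Advanced settings (uncheck DelayedAck). On later ESXi releases something like the esxcli sketch below does the equivalent, but vmhba33 is a placeholder and the exact syntax has varied by version – treat KB 1002598 as the authority:

# show the current iSCSI adapter parameters (DelayedAck among them)
esxcli iscsi adapter param get -A vmhba33
# disable DelayedAck on the adapter, then follow the KB for rescan/reboot steps
esxcli iscsi adapter param set -A vmhba33 -k DelayedAck -v false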
So – why isn’t this a mainstream recommendation?
The answer is simple – this doesn’t ALWAYS have the desired effect, and we need to do more before changing it broadly.
Well – I DO think we could/should update our core documents – you can see the one that applies to this topic here:
But – this is an optimization that is, in effect, a reflection of a transient thing. I know that transient things can cause a LOT of pain (hence knowing about the workaround is important). An analogous (and related) issue was one a lot of VMware customers struggled with on Broadcom NICs – which was fixed in vSphere 4.1 Update 1.
Those types of solutions – transient, manual, complex – are, in general, bad answers to performance problems. I’m not a purist, and I know you need to make compromises – so for this particular customer (and maybe for others), the workaround is a good thing.
So – let’s take a look at this against the “guiding principles”:
- First Personal Rule: KISS is more than a cute acronym, it’s a way of life.
- “Defaults are usually best” – only change a default if you really have to. If the default should be different, then the default itself should change.
- Here, the KISS principle says that if the fix applies generally, it’s important to know about the workaround, but it’s most important for the fix to be applied in the platforms themselves, so you don’t need to change an advanced setting. We’re working on it. In the meantime, I want awareness to be high.
- Please follow the general principles discussed in the recording we did at the Atlanta VMUG (see it here), and on this webcast here.
- Second Personal Rule: If you find yourself needing to tweak a lot of defaults, you’re probably on the wrong track.
- Stick, as much as possible, to the core recommendations in the core docs.
- This is why disabling DelayedAck is a workaround, not a fix – since changing it requires detailed changes on a host-by-host basis, it’s “fragile”.
- Third Personal Rule: You’ll inevitably run into problems – so sticking to a good “how to diagnose” process is very important.
- This is what I think was perhaps the most instructive thing in this example - the customer started HIGHER in the stack (at the VMware layer, using ESXtop and vSCSIstats), and then worked LOWER (the array stats). This is a VERY important performance troubleshooting technique.
Personally, IMO #3 is one of the most important principles of performance troubleshooting. Here’s my “infrastructure troubleshooting” sequence (a few example commands follow the list):
- If it’s serious, take the little bit of time to open official cases with the vendors involved.
- The most powerful resolutions to problems are open and community-based. Use Google.
- When troubleshooting – start with the connectivity from client to the application, then…
- the application (ergo look at the SQL code and optimize), then…
- the guest OS (ergo look at Windows Perfmon or Linux iostat), then…
- the VM itself (ergo look at vSCSIstats), then…
- the ESX host (ergo look at ESXtop), then…
- the connectivity host-to-array (ergo use wireshark, look for dropped frames, retransmits), then…
- the array itself (ergo look at NAR/SAR/systat)
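To make that concrete, here’s the kind of thing I mean at a few of those layers (interface, device and file names are examples only):

# guest OS layer – Linux extended device stats every 5 seconds (on Windows, use Perfmon counters)
iostat -x 5
# ESX host layer – esxtop in batch mode (5-second samples, 60 iterations) so you can graph it later
esxtop -b -d 5 -n 60 > esxtop-capture.csv
# host-to-array connectivity – capture the iSCSI vmkernel port for analysis in Wireshark (tcpdump-uw on ESXi; plain tcpdump from a classic ESX service console)
tcpdump-uw -i vmk1 -s 0 -w /tmp/iscsi.pcap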
This sequence has served me well many times. Note – the order doesn’t imply that problems occur more or less often, or have more or less impact, at any particular layer.
The point is that it gets you to approach this stuff systematically.
I REALLY want to highlight a couple of PowerShell scripts the vSpecialists have been using for steps 4 and 5.
Would love to hear feedback from you… What do you do to troubleshoot? What could we do to help you?
Hi Chad, Excellent post.
Posted by: PReddy777 | March 09, 2011 at 03:32 PM
I concur, great article!
Posted by: Conrad | March 09, 2011 at 03:44 PM
Great post Chad. I have used your diagnosis process in your third rule for years. However I have always had to "Start" somewhere where I have seen the symptom. For example if you see a network issues you start at that step and go up or down the chain based on the information you have discovered. Also if you can't find a resolution to your problem you need to re-review all of your steps yourself or with a second pair of eyes. I have had cases where I go through the whole process don't find the problem and by re-reviewing everything with another pair of eyes I found I missed something...we are all human :)
Posted by: Csmykay | March 09, 2011 at 03:56 PM
Great, make me think more
Posted by: lei zhang | March 10, 2011 at 10:24 AM
Thanks Chad, useful post!
Posted by: Ema | March 10, 2011 at 10:36 AM
Can this problem arise with an NS480 unified array (FLARE 4.30.000.5.004)?
Posted by: Domenico Viggiani | March 14, 2011 at 11:02 AM
Hi Chad
James B here.
Delayed ACK (and its interaction with Nagle's algorithm) is something which affects a lot of environments, but sometimes doesn't cause obvious enough issues to warrant a deep investigation. People tend to live with it and accept "well, that's the best I can get".
Which is a pity.
While it is true in VMware and on the array target side, the same applies to Windows hosts (VMware with in-VM iSCSI initiators, Hyper-V servers with iSCSI).
Standard IP network traffic, for the most part, doesn't fall into the same realm in terms of payload as storage-based block IP (it's much bigger - typically over 1492 bytes).
Quite often, SCSI commands are very small in terms of payload size (could be just 10 bytes) for slow-path CDB, inquiry, metadata and control commands. These are exactly the commands you do not want to be slow!
Storage vendors such as EMC, and OS/hypervisor vendors, use small SCSI commands to control and inquire about storage. Typically the code written here is sequential, single-threaded code to enable correct timing and arbitration of devices - and this is when delayed ACK really stings. Like your customer above, with their sequential workload.
The default for iSCSI initiators in Windows 2008 onwards is to enable Nagle's algorithm, which I believe is the wrong thing to do. iSCSI, by its nature, assumes a low-hop, point-to-point network between host and storage, and as such link saturation due to aggregate traffic on trunks is less likely to occur.
After my issues documented here;
http://sustainablesharepoint.wordpress.com/2010/03/10/best-practice-for-hyper-v-with-iscsi/
I tried to convince Microsoft to either change the default for an iSCSI adapter or at least enable a radio button on the iSCSI Initiator UI in Windows 2008. Well, I got to write them a KB Article instead :-)
http://support.microsoft.com/kb/2020559
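As a rough illustration (the interface GUID is a placeholder - the KB above has the exact guidance), the per-NIC change on Windows is a registry value on the iSCSI-facing interface:

rem disable delayed ACK on the iSCSI NIC (reboot required); the GUID is a placeholder
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{iSCSI-NIC-GUID}" /v TcpAckFrequency /t REG_DWORD /d 1 /f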
So, more often than not, it can be better to have delayed ACK disabled by default for iSCSI adapters...
Cheers
James
Posted by: James Baldwin | March 25, 2011 at 04:34 PM
Much could be said about this, but I'll sum it up by saying that this is one of the top 5 best articles I've ever read regarding VMware/Storage on the internet ever.
Posted by: Justin Hensley | June 21, 2011 at 10:44 PM
What do you mean by enable flow control end-to-end?
Posted by: Phil | October 31, 2011 at 07:12 PM