
June 09, 2009

Comments


Vaughn Stewart

Chad - thanks again for all of your efforts in making this post happen.

Chad Sakac

Vaughn my pleasure - and THANK YOU! It was a good collaborative effort. I hope (as I know you do) that we are helping customers!

David Barker

Chad - you've hit the nail on the head as to why everyone treats NFS as second tier - timeouts :-)

Is there any chance VMware could change the default NFS timeouts to something more sensible? (And while we're at it, get vmtools to fiddle with the guest too?...)

Without the defaults changing, I'd argue NFS should be treated as a second tier. I think this thread sums it up quite nicely:

http://communities.vmware.com/thread/197850

Chad Sakac

David - note that both EMC and NetApp suggest INCREASING the default NFS timeouts over the VMware defaults. This isn't a VMware issue - it's a NAS device characteristic.
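
For reference, a rough sketch of what that tuning looks like from the ESX service console - the specific values below are illustrative (they line up with the ~125 second window discussed in the post), so check the current EMC and NetApp best-practice docs for your array before applying them:

# NFS heartbeat tuning on the ESX host (illustrative values - verify against
# your array vendor's current best-practice guide):
esxcfg-advcfg -s 12 /NFS/HeartbeatFrequency    # seconds between heartbeats
esxcfg-advcfg -s 5  /NFS/HeartbeatTimeout      # seconds to wait for a heartbeat reply
esxcfg-advcfg -s 10 /NFS/HeartbeatMaxFailures  # failures tolerated before the datastore is marked unavailable
esxcfg-advcfg -g /NFS/HeartbeatFrequency       # -g reads a value back to confirm

# In the guests, raise the SCSI disk timeout to cover the same window, e.g. on Linux:
echo 125 > /sys/block/sda/device/timeout
# (on Windows, the equivalent is the Disk TimeoutValue registry setting)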

Both NetApp and EMC Celerra engineering have been laser focused on reducing the duration of failover events, and have been making steady progress over the years. The trick is to solve this not just for the "best case" (in which case we can both be well under the 125 second recommendation), but "all cases" as the failover time (again, for both of us) tends to depend on many parameters.

I'm an engineer - so I tend to call this "unbounded behavior".

Certain things (fast failover, consistency groups) are "easier" from an engineering standpoint on "traditional block" architectures

Others (deduplication, thin provisioning) are "easier" on "traditional NAS" devices

Others still (object-level policy) on CAS/COS device architectures.

These "easier" things dont mean that they are impossible on the various architectures (many things on the lists exist now across multiple platforms architectural models), but does tend to be the reason they appear in one place first, then later in others.

The other element here is that, as noted (both in the post and in the VMTN thread you posted), the failover domain of NAS is link-based, and block is path-based. NAS also depends on the longer TCP/IP native timeouts, ARPs, and various other elements of the TCP/IP stack. Again - none of this is intrinsically BAD - but it is intrinsically DIFFERENT.

Vaughn - you might want to have someone at NetApp help out the customer on the thread David points out.

Thanks for commenting - and we are indeed furiously working on this on the EMC side David - more to come soon.

David Barker


Sorry, I didn't mean to say NFS was bad either - like you say, different! :-)

(yes, 'second tier' was unfair...)

Strictly speaking, timeouts aren't a VMware issue, but:
- The current defaults don't work for the world's most popular NAS boxes.
- New users tend to trip up on NFS (thinking it's a cheap way to do storage), which perpetuates the FUD. NFS servers tend to be treated like an office fileserver: 'if it goes down, it's OK; clients will reconnect'.

Maybe NFS should be treated as an 'advanced/expert' protocol in VMware? Or just add an extra warning about timeouts in the vSphere client when setting up NFS stores?

PS: Many thanks to you and Vaughn for your hard work on this (and other) blog posts :-)

Shai Harmelin

Chad and Vaughn,

Excellent write-up, and I commend you on your collaboration. Why doesn't VMware support multiple paths to the same datastore over NFS?
Clustered NAS storage systems like Isilon (which I work for) provide access to the same datastore from multiple storage devices and could easily allow multipathing if it were supported by the ESX host.

Perhaps that could be achieved by creating a 3rd-party plugin, but it seems that can only be done for SAN storage.

Also, NFS/IP failover on an Isilon cluster takes no more than 5 seconds (mainly to allow the gratuitous ARP to update the switch).
Regards,
Shai

smooter

Awesome article, certainly on par with your iSCSI Multi-Vendor example!

I have to agree with treating it as a second-tier/DR storage solution.

I (like many) have been bitten too many times by the timeouts and the lost-connection behavior of ESX 3.5 U3 and below. It seems there should be a better "healing" reaction to make NFS a more trustworthy storage solution.

I can honestly say I have reserved my NFS mounts for things like ISO image storage and other ancillary data. Even then, the only times in the last couple of years I have had to reboot my cluster were due to hung NFS mounts.

This isn't unlike any Unix/Linux flavor (or Windows as an NFS client, for that matter), so I wouldn't just assume that VMware's reaction to a lost NFS connection would be different.

Just my 2c

Thanks for all of your hard work, and solutions. You guys are quite the crutch for me! :-)

smooter

Michael Bergman

Read and considered everything in this article carefully. Great effort from you on this. Now, I have one specific question which I hope someone can answer.

I'm not 100% clear on the left path in the diagram for HA. We have multi-chassis LAG everywhere in a new modern environment we're building, with Nexus 5000. So the plan is to use multiple NFS datastores to increase bandwidth (TCP connections) over LACP LAG "trunks". I understand the topology pictures below the decision chart (the 1st one is our scenario).

But it says: "Configure NFS server to have multiple IP-addresses, can be on the same subnet."

We will have an LACP LAG (802.3ad) with two (2) 10 GbE ports on the storage (NFS server) side. The same ports will be VLAN tagged into up to 62 different subnets, to avoid routing NFS traffic in this environment. Does the above statement imply that, to get ESX 3.5 (and also 4.0) to utilize both 10 Gbit links as well as possible, there need to be at least 2 separate IP addresses (or even more?) on the server subnet where the ESX server(s) will sit?

It doesn't really say anywhere in the article why that is. I don't know ESX in practice, so for those people who do, this may be obvious. Sorry if that's the case.

Thanks for these two articles on iSCSI and NFS, really useful for me.
/M

Michael Bergman

I noticed one more thing while reading and comparing these two articles (iSCSI & NFS).

"[ESX 3.5 SW iSCSI initiator ...] this behavior will be changing in the next major VMware release . Among other improvements, the iSCSI initiator will be able to use multiple iSCSI sessions (hence multiple TCP connections)."

This will make LACP LAG much more efficient in the iSCSI scenario in vSphere 4 than it has been in 3.x. In this article it says, less encouragingly:

"[...] every NFS datastore mounted by ESX (including vSphere – though NetApp and EMC are both collaborating for longer term NFS client improvements in the vmkernel) uses two TCP sessions – one for NFS control information, and the other for NFS data flow itself. This means that the vast majority of the traffic to a single NFS datastore will use a single TCP session."

No similar improvement here then, unlike the iSCSI case :-(
Unless I misinterpreted something, the same scenario for utilising more bandwidth with LACP LAG still holds for vSphere. One still has to do something deliberate to work around this particular inefficiency.
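
If I've understood the post correctly, that "something deliberate" amounts to mounting separate datastores against separate target IP aliases on the filer, so the IP-hash teaming policy can place their TCP sessions on different physical links. A rough sketch from the service console (addresses and export paths are just examples):

# Two datastores, two target IPs on the same NFS server (example values):
esxcfg-nas -a -o 192.168.10.11 -s /vol/vmds1 nfs-ds1
esxcfg-nas -a -o 192.168.10.12 -s /vol/vmds2 nfs-ds2
esxcfg-nas -l    # list the NAS mounts to confirm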

/M

Maria

Great article, just one comment: according to VMware, Jumbo Frames are not supported until ESX 4.0 (vSphere). So your note regarding "Support for Jumbo Frames for NFS (and iSCSI) was added in VMware ESX 3.5U3 and later..." is not entirely accurate.

Source: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009473

Chad Sakac

Thanks Maria - I heard otherwise directly from the devs, but the KB is authoritative. Let me double check, and I will either correct my post or the KB.

Thanks again for commenting!

Maria

Thank you Chad, also I was wondering if you have implemented Flex-10 technology, from HP, with NFS.

We have seen some performance issues implementing Flex-10 with NetApp using NFS. I would like to know if you have any recommendations. Basically the problem is throughput: even on 10GbE, I/O performance is dropping from 10Gbps to about 4.5Gbps. If our goal is to have as many as 7,000 VMs, then this could be a big problem.

Do you have any thoughts on this?

Thanks in advance,

Chad Sakac

I haven't personally, Maria, but some on my team have. Driving 10GbE at line speed, while possible, is non-trivial. Achieving 4.5Gbps is not bad. Not great, but not bad.

I would start by looking at every link in the network for congestion characteristics (dropped frames). I would also look at the array (not because I want to point the finger at NetApp, but because it's a quick check). I'm not a NetApp expert, but I'm sure they have something analogous to our Analyzer tools - just quickly check that you're not bound by the backend aggregate or the filer itself (I hope this is a 6000 series FAS array - right? There's a reason why we started our 10GbE support on our analogous NS-960 - it's not that we couldn't put it in smaller ones, they would just struggle to support the throughput).

Have you checked to make sure you are using TSO? This is an important setting in these circumstances. Also, as an experiment (a quick way to determine how much of the bottleneck is on the ESX host versus other elements in the network or the array target), you could try VMDirectPath I/O - though in the vSphere 4 release the limitations are very steep.
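
A few quick checks from the ESX service console, as a sketch (the NIC name here is an example - adjust to your environment, and note that not every driver exposes the same counters):

esxcfg-advcfg -g /Net/UseHwTSO    # 1 = hardware TSO enabled
esxcfg-vswitch -l                 # confirm the vSwitch MTU (9000 if you're using jumbo frames)
esxcfg-vmknic -l                  # confirm the MTU on the NFS vmkernel port
ethtool -S vmnic2 | grep -i drop  # look for drop/error counters on the 10GbE uplink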

Personally - without digging into it much further (for example, perhaps they are very, very light VMs), I would be very hesitant to put 7,000 VMs on a single 10GbE network and on a single array (of any mid-range class - that's not a knock on NetApp, I would say the same about the EMC mid-range stuff).

Do you want me to grab my HP and NetApp colleagues to try to help you?

I can also get the 10GbE NFS experts at EMC on the Celerra team to reach out - just let me know.

Good luck!

Chad Sakac

@Maria - I just checked also with the development team - you are RIGHT - Jumbo Frames aren't supported until vSphere 4. Argh - I need to update the post, but it's important to be correct.

Maria

Thank you Chad, I really appreciated your thoughts on this.

We are using a NetApp FAS6080A, but only 2 heads (most likely this is the actual problem). We have 9 chassis going to the NetApp (each chassis would have ~800 VMs). As recommended, also in this article, we have separated VLANs and physical switches for network and storage. We are using the same network adapter with Flex-10, though.

Is it possible to get some links from your NetApp colleagues regarding best practices for implementing NFS/VMware/NetApp, and their thoughts or a POC on Flex-10?
I just want to make sure that we are on the right path to improve this.

Thank you again,

Vaughn

Maria,

From your comments I'm not quite clear on the challenges you are seeing. May I suggest that you contact the NetApp Global Support Center at http://now.netapp.com or 1-888-4-NetApp.

Thanks
Vaughn Stewart (note I'm with NetApp)

Michael Bergman

Chad and/or Vaughn,
would you care to comment on my question in the first of my two posts above? I'm interested in the same scenario as Maria, although not with 1000s of VMs. We want to use LAG (802.1AX aka 802.3ad) & 10 GbE & NFS with NetApp FAS3170A systems and vSphere 4. The TCP session limitation still present in vSphere 4 means one has to use multiple NFS datastores. Then, when using multi-chassis link aggregation, I would very much like to fully understand this detail about "multiple IP-addresses for the NFS server, can be on same subnet".

Would you please elaborate a little bit further on this?

Thanks,
/M

Justin Cockrell

Excellent article, I like the multivendor aspect and how you tie together the important information without it becoming a plug for a specific vendor. That's not easy to find these days, and it's exactly what most people need.

I'm planning a large ESXi 4.0u2 deployment based on NetApp NFS and iSCSI storage. This is my first interaction with NetApp gear and specifically using NFS to host the majority of the VMs, so this and the iSCSI post are a great help. I did want to ask whether there have been any changes specifically to the number of NFS TCP connections with any of the latest updates, or any new best practices per VMware or NetApp?

Thanks for the great info, keep it up! :)

Simon Reynolds

Hi Chad / Vaughn

Thanks for your multivendor posts -- really useful information.

Can you clarify for me what difference the "real" network load balancing policy available on distributed virtual switches in vSphere 4.1 makes to the "single TCP connection to one NFS volume" story?

Does this allow multipathing to a single NFS volume? That is, can I get, say, 2Gbps to a volume if I have a two-way physical vmnic team attached to the virtual switch used for the NFS vmkernel port?

Cheers

Simon

Mike

Hi, I have a question on your HA section. My switches don't support cross-stack EtherChannel, so the decision tree says that to use multiple links I need to create vmkernels on different subnets. But how do you tell VMware that an NFS export is available on two subnets (or two different IP addresses, for that matter)?

For example, with a Celerra, I can create two interfaces, 10.10.10.100 and 10.10.20.100. I can use either of those IP addresses to get to an NFS export, but when you define an NFS export in vSphere you only have the option of supplying one IP address per export. So if the switch supporting the Celerra interface 10.10.10.100 were to die, the switch supporting 10.10.20.100 would still be alive and the export would still be reachable on 10.10.20.100 - but if I configured the NFS export in vSphere using the IP address of the interface that is no longer available (10.10.10.100), then how does VMware get access to the NFS export using the other IP address?
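
To make the question concrete, as far as I can tell the syntax only allows something like this (example addresses):

# The datastore definition only takes a single target address:
esxcfg-nas -a -o 10.10.10.100 -s /export/vm celerra-ds1
# Mounting the same export via the second interface shows up as a separate
# datastore entry, not as a second path to the first one:
esxcfg-nas -a -o 10.10.20.100 -s /export/vm celerra-ds1-alt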

watch jersey shore

would you care to comment on my question in the first one of my two posts above? I'm interested in the same scenario as Maria, although not with 1000s of VMs. We want to use LAG (802.1AX aka 802.3ad) & 10 GbE & NFS with NetApp FAS3170A systems and vSphere 4. The TCP session limitation still present in vSphere 4 means one has to use multiple NFS Datastores. Then, when using multi-chassis Link Aggregation, I would very much like fully understand this detail with "multiple IP-addresses for the NFS Server, can be on same subnet".

Wout Mertens

I just asked a NetApp rep and it seems that the single-TCP-session limitation is no longer present in 4.1.

Mark Burgess

Hi Chad,

We have been doing a lot of investigations with regard to setting up the networking on NFS with the new VNXe.

We believe we need to use the following option listed above:

"To use multiple links, use vmkernel routing table (separte subnets) this requires multiple datastores"

This statement is mentioned in several EMC documents, but they never go on to document the actual configuration at the VMware level.

Whereas the multi-chassis link aggregation option is fully documented.

Do you have documentation on how to set up VNXe/VNX/Celerra NFS using the vmkernel routing table option?
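
For what it's worth, our current working assumption for that option looks something like the sketch below (example addresses; assumes the port groups already exist on the vSwitch) - it would be great to have this confirmed or corrected:

# One vmkernel port per subnet (example addresses):
esxcfg-vmknic -a -i 192.168.101.50 -n 255.255.255.0 NFS-Sub1
esxcfg-vmknic -a -i 192.168.102.50 -n 255.255.255.0 NFS-Sub2
# One datastore per subnet, mounted against the array interface in that subnet,
# so the vmkernel routing table sends each datastore's traffic out a different link:
esxcfg-nas -a -o 192.168.101.10 -s /fs01 nfs-ds1
esxcfg-nas -a -o 192.168.102.10 -s /fs02 nfs-ds2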

Also are you planning to update this post to cover VNXe?

Many thanks
Mark

The comments to this entry are closed.


Disclaimer

  • The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by Dell Technologies and does not necessarily reflect the views and opinions of Dell Technologies or any part of Dell Technologies. This is my blog; it is not a Dell Technologies blog.