UPDATED – June 11th, 8:21pm – new Celerra vSphere Best Practices posted
We were more than a little surprised to see how popular our “Multivendor iSCSI” post was. The feedback was overwhelming and very supportive of industry leaders partnering to ensure customers’ success with VMware. While writing that post, we (Vaughn Stewart from NetApp and Chad Sakac from EMC) discussed following up the iSCSI post with one focused on deploying VMware over NFS. The most difficult part of creating this post was that we couldn’t do it with our iSCSI-focused colleagues.
Since the original post, we’ve been busy assisting our customers and partners. We apologize for the delay, so without further ado we present to you the followup: a “Multivendor NFS” post for our joint customers. One of the goals of this post is to dispel the FUD customers often hear around NFS. Heck, if EMC and NetApp can agree – then you KNOW this post is FUD-Free!
We would like to thank Stu Baker and Satyam Vaghani from VMware, along with numerous folks at EMC and NetApp for their input on this post.
While any NFSv3 server will work with VMware, and there are many NFS servers on the ESX HCL, there is a significant difference between a general-purpose NFS server and what one can do with an enterprise-class NFS storage array from EMC or NetApp. The reality is that only NetApp and EMC are supporting NFS deployments with VMware in significant volume.
Both of us personally are big supporters of NFS for VMware – but if you look at our post histories - we’re both also rational and try our best (we’re human, so sometimes we fail) to be balanced and neutral (in my case, this is a balance to my VMFS post here). We try to be good pragmatic voices, so our goal here is pragmatism and facts to help our mutual customers.
For more – read on…
Ok – let’s get a couple things off the table right off the bat:
- “Is NFS a good, highly available, high-performance option for VMware, one that deserves equal consideration alongside the more traditional SAN choices?” – YES.
- “Is NFS the be-all end-all storage protocol for VMware?” – NO.
Let’s break down the myth-busting and best practices into the following: 1) Performance and Scaling; 2) High Availability.
1) Performance.
Often – people are dismissive of NFS performance. In our experience, this is rooted in the fact that NAS originated outside the datacenter (engineering/development), leveraging existing “cheap and dirty” (and effective!) LAN design, and with poorly performing NFS clients running on what were, at the time, very limited CPU cycles.
This is pretty much the opposite of the origins of SANs, which started life in the datacenter, ran on “relatively expensive and lossless” (and effective!) SAN designs, and used high-performance hardware and kernel-mode drivers.
This argument reminds both of us of those who said IP would never be able to provide the quality required for telephone conversations. Never bet against Ethernet! This should be evident today with Cisco’s unified networking architecture. Consider:
- NAS is widely deployed in the Datacenter today.
- It’s possible to build “bet the business” Ethernet infrastructure, even including lossless characteristics traditionally associated with Fibre Channel. This lossless behavior is exactly what is being delivered with Datacenter Ethernet. 10 Gbps throughput, very low latency, very low jitter and lossless characteristics that match the fastest FC SANs.
- NFS clients, like most iSCSI initiators, aren’t free. They cost CPU cycles; however, CPU cycles are cheap and readily available. In fact, the abundance of CPU cycles has enabled us to virtualize our servers. This trend is accelerating. That said, with workloads where you are measuring every ESX host CPU cycle, or where workload density is gated by ESX host CPU cycles, a cost/benefit tradeoff should be considered - don’t base your thinking on CPU cycles alone.
Performance consideration #1: What Kind of Data Are You Serving?
We suggest that an ESX server may require three types of storage, which we will label: Physical, General Purpose Shared, and High Performance. We would like to share our view on these three types and the characteristics of each. Remember that the goal for our mutual customers is to virtualize all workloads, all applications, all use cases - and to do so in the simplest and most efficient way. Flexibility is paramount.
- Physical Device Access
This is the easiest storage model to understand, as it is very traditional. It is the storage model required by a physical ESX server in order for it to boot and run. This storage could be direct-attached storage, or alternatively it could be an FC, FCoE, or iSCSI LUN.
- General Purpose Shared Storage Pools
As you know, Virtual Machines are comprised of files that for production functionality must reside on a shared storage architecture. General purpose VMs which are consolidated and stored in a shared storage pool may individually have moderate I/O requirements; however, their aggregated I/O load can be quite substantial.
VMware hit it out of the park; they developed VMFS, a clustered file system, which made it simple to have multiple ESX hosts simultaneously access a shared filesystem. Traditionally, clustered host filesystems are extremely complex.
In VMware ESX 3.x, VMware added support for NFS, which is natively a shared storage medium. With vSphere 4, NFS datastores support all the major VMware features at parity with VMFS. If one requires greater VM to datastore density, NFS can scale to general purpose VM densities equal to and often beyond what is possible with VMFS.
A key element of VMFS is its underlying SCSI architecture, which includes a command queue limit – a cap on the number of commands that can be outstanding against the LUN simultaneously. In general this LUN and HBA queue limit is the limit to VMFS scaling (as noted in this whitepaper: http://www.vmware.com/resources/techresources/1059 )
While VMFS VM-to-datastore density can in theory match NFS on a VM-per-datastore scale, it requires advanced configurations that allow increased LUN queues. Examples of this include spanning VMFS volumes across multiple LUNs. These types of designs achieve the parallelism that exists by definition internally on an NFS server, which hides all block-level queue management (LUN queues still support the underlying filesystems in NAS devices, but generally there are many LUNs – and therefore many queues – used for a single filesystem, and this is all invisible as far as VMware is concerned).
Spanned VMFS volumes in essence replicate what the NFS server makes simple (taking many block devices and creating a shared filesystem from them). With NFS datastores, the VMware NFS client simply logs into the NFS server, which handles all of the back-end I/O. The ability to have fewer, larger datastores benefits IT operations by reducing storage management operations (provisioning, replication, backup, etc.) for this general-purpose shared-pool use case.
All the details needed to really understand VMFS are in this post here.
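To put some rough numbers on the queue-depth point above, here is a quick back-of-the-envelope sketch in Python. The queue depth, per-VM outstanding I/O, and LUN counts are illustrative assumptions only, not defaults you should rely on:

```python
# Rough sketch: how the LUN queue bounds busy-VM density on one VMFS
# datastore, and how spanning across LUNs multiplies the available queues
# (the parallelism an NFS server provides internally). All figures are
# illustrative assumptions.

lun_queue_depth = 32           # assumed per-LUN queue depth
avg_outstanding_io_per_vm = 4  # assumed in-flight I/Os per busy VM

vms_per_lun = lun_queue_depth // avg_outstanding_io_per_vm
print(f"~{vms_per_lun} busy VMs before a single LUN queue saturates")

luns_in_spanned_volume = 4     # e.g. a VMFS volume spanned across 4 LUNs
print(f"~{luns_in_spanned_volume * vms_per_lun} busy VMs with a "
      f"{luns_in_spanned_volume}-LUN spanned VMFS volume")
```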
- High Performance Datasets
As customers virtualize more and more servers, they eventually have to design configurations that address the resource requirements of more demanding applications. Examples of demanding applications include Microsoft SQL Server and Oracle Database Server.
These types of applications scale in two different dimensions: 1) Unlike the general-purpose VMs, whose individual I/O workloads are light but whose aggregate I/O requirement is large, these are the reverse - a single VM has a large I/O requirement (it can be IOps or MBps); 2) often the application best practices require its I/O workload to be isolated from the I/O workload of other systems. These design considerations are the same whether the application is deployed as a physical or a virtual server.
VMware, NetApp, and EMC all recommend that an application with high I/O requirements, or one which is sensitive to latency variation, get a storage design that focuses on that particular VM and is isolated from other datasets. Ideally, the data will reside on a VMDK stored on a datastore that is connected to multiple ESX servers, yet is only accessed by a single VM. The name of the game with these workloads isn’t scale in terms of VMs per datastore, but scaling the performance of one VM.
Also - certain specific use cases require very specific guest-level SCSI task management (aborts, resets, etc.). This is usually true of clustered apps (hence some of the existing RDM requirements). When disks are virtualized as VMDKs, the ESX storage virtualization layer needs to map task management requests to primitives that are understood by the underlying layer. In the case of VMFS, this means mapping virtual task management requests to physical SCSI device task management requests, which is straightforward. In the case of NFS, there are no analogs to SCSI task management, so aborts and resets can only be processed on a best-effort basis (e.g., a command can only be aborted if it hasn't been issued on the wire - once it has, there is no way for the host to convey a cancellation to the NFS server). Beyond the clustered use cases, these situations are exceedingly rare.
Performance consideration #2: Design a “Bet the Business” Ethernet Network
Can one run NFS datastores on any off-the-shelf GbE switches? Yes – but it’s not a good idea. Remember that you are designing a storage network that needs to have a performance/availability profile to support your VMware cluster, and that the aggregate availability of your environment depends on it. Our recommendations:
- Separate your IP storage and LAN network traffic on separate physical switches or be willing and able to logically isolate them using VLANs.
- Enable Flow-Control
- Enable spanning tree protocol with either RSTP or portfast enabled
- Filter / restrict bridge protocol data units on storage network ports
- Configure jumbo frames (always end-to-end - meaning in every device in all the possible IP storage network paths). Support for Jumbo Frames for NFS (and iSCSI) was added in VMware ESX 3.5U3 and later.
- Strongly consider using Cat6 cables rather than Cat5/5e. Can 1GbE work on Cat 5 cable? Yes. Are you building a “bet the business” Ethernet infrastructure? Remember that retransmissions will absolutely recover from errors - but have a more significant impact for these IP storage use cases than in general networking use cases.
- Ensure your Ethernet switches have the proper amount of port buffers, and other internals to properly support NFS (and iSCSI) traffic optimally
- While vSphere adds support for IPv6 for VM networks and VMkernel networks - IPv6 for VMkernel storage traffic is experimental at the initial vSphere release
- With NFS datastores – strongly consider switches which support cross-stack Etherchannel or Virtual port Channeling technologies. (This will become apparent during the HA section)
- With NFS datastores – strongly consider 10GbE or a simple upgrade path to 10GbE as an important Ethernet switch feature.
Performance consideration #3: Think about Bandwidth (MBps)
There are 3 primary measures of storage performance – throughput (IOps), bandwidth (MBps) and latency (ms). Throughput and bandwidth are related in the sense that the bandwidth needed is the throughput x the I/O size. People sometimes confuse I/O size with filesystem allocation size (4K default in NTFS, 4K for WAFL, 8K for UxFS) – but they are unrelated. The I/O size is the size of the I/O operation from the host perspective.
IOps are usually gated by the backend configuration, where by “backend” we mean the array target. If the workload is cached, then it’s determined by the cache response (which is almost always astronomical), but most often it’s the spindle configuration that supports the storage object. In the case of NFS datastores, the storage object is the filesystem. So, on a NetApp FAS, the IOps achieved are primarily determined by the number of disk drives in an Aggregate, and likewise on a Celerra they are primarily determined by the Automated Volume Manager configuration. Yes, there are other considerations (at a certain point, the FAS/Datamovers themselves as well as the host’s ability to generate I/Os become limits), but for the limits most people actually run into – it’s the backend.
Ok – next thing to understand is that every NFS datastore mounted by ESX (including vSphere – though NetApp and EMC are both collaborating for longer term NFS client improvements in the vmkernel) uses two TCP sessions – one for NFS control information, and the other for NFS data flow itself.
This means that the vast majority of the traffic to a single NFS datastore will use a single TCP session. In turn, the upper limit of throughput achievable for a single datastore – regardless of link aggregation or other methods – is that of a single link, since all the traffic to that datastore rides on it.
The key to this is understanding how Link Aggregation works. We strongly recommend going back and reading the section on “Understanding Link Aggregation” in the ESX/ESXi 3.5 iSCSI post – as it’s equally pertinent here. Seriously – go there now…
You back? Ok, now you understand why the NFS datastore dataflow being on one TCP session will result in a single link being used, no matter how it’s configured.
As we covered, if you are using GbE this means that a reasonable expectation is a unidirectional read or write workload of ~80-100MBps (GbE is full duplex, so this can be ~160MBps bidirectionally with a mixed read/write workload).
Higher total throughput on an ESX server can be achieved by leveraging multiple datastores; you can scale up the total throughput across multiple datastores via link aggregation and routing mechanisms.
What types of virtual machine workloads are well suited to NFS? A shared datastore comprised of many VMs with an aggregate requirement within the guidelines above (this can be a large amount of IOps, but generally lots of small-block I/O – not large-block I/O that needs more bandwidth than one GbE link can provide), or a single busy VM, as long as its I/O load can be served by a single GbE link.
Now, these performance parameters can be enough for MANY use cases – so don’t write it off.
With small block I/O (like 8K) – this is 12,500 IOPs – or put differently, roughly the performance of 70 15K spindles. But, on the other end, if you have a Sharepoint VM (or are doing a guest-level backup) – they tend to do IO sizes of 256K or larger. With 256K IO sizes, that’s 390 IOPs – or the performance of roughly 2 15K spindles – and likely not enough.
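Here is that arithmetic spelled out – a trivial sketch using the ~100MBps GbE guideline from above, with decimal rounding to match the figures we quoted:

```python
# Bandwidth (MBps) = IOps x I/O size, so for a single GbE link carrying
# roughly 100MB/s unidirectionally:
link_bytes_per_sec = 100_000_000          # ~100MBps guideline from above

for io_size_kb in (8, 256):
    iops = link_bytes_per_sec // (io_size_kb * 1000)
    print(f"{io_size_kb}K I/Os -> ~{iops} IOps on one GbE link")

# 8K I/Os   -> ~12,500 IOps (roughly 70 x 15K spindles' worth)
# 256K I/Os -> ~390 IOps    (roughly 2 x 15K spindles' worth)
```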
Another option is 10GbE.
If you use 10GbE – though a single TCP session will still be used per datastore, there is much more throughput available for your most demanding workloads; however, I’d add that if you have 10GbE you probably have access to FCoE & iSCSI, and this flexibility may be required for supporting some of your most demanding workloads.
If 10GbE isn’t an option – you can always use NFS for some VMs and FC for others.
So – what do the economics look like?
While 10GbE prices per port are higher today than 4Gbps FC, 10GbE prices are starting to drop rapidly, and we expect them to continue to drop through 2009; this trend will accelerate as 10GbE LoM (LAN on Motherboard) becomes more prevalent. Also – from a TCO (acquisition, cabling, power, space, etc.) standpoint, 10GbE Datacenter Ethernet like the Cisco Nexus 5000 series is comparable to separate 1Gbps Ethernet and 8Gbps FC today. If you’re looking at FC and NFS together – take a good look at the 2nd-generation FCoE converged network adapters and the Cisco Nexus 5K. FCoE configurations are supported by VMware, Cisco, EMC E-Lab and NetApp (FCoE got standardized last week – post going up shortly) – so while these are early days, customers can begin to evaluate in earnest.
So - how many datastores? There is no hard and fast rule here – but for peak performance, increase the maximum number of NFS datastores using the ESX advanced settings shown here, from the default of 8 to a higher number (this is a vSphere screenshot, but the same advanced property is available in ESX 3.5 – the only difference is that in vSphere the maximum is 64, not the 32 maximum of VI3.5).
When you increase the NFS datastore count, also increase the heap memory assigned to and available to the networking stack (which includes the NFS client) – and do this across all ESX hosts:
- Increase Net.TcpIpHeapSize to 30. This immediately increases the heap memory to 30MB.
- Increase Net.TcpIpHeapMax to 120. This increases the maximum heap memory to 120MB.
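If you want to verify these values are consistent across a cluster, one option is to read each host’s advanced settings programmatically. Here is a minimal sketch using the pyVmomi Python bindings – the vCenter hostname and credentials are placeholders, and it only reads settings (it changes nothing):

```python
# Report the NFS/TCP-IP heap advanced settings for every host in vCenter,
# so you can confirm they were raised consistently across the cluster.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Compared case-insensitively, since key capitalization can vary by release.
WANTED = {"nfs.maxvolumes", "net.tcpipheapsize", "net.tcpipheapmax"}

ctx = ssl._create_unverified_context()             # lab convenience only
si = SmartConnect(host="vcenter.example.com",      # hypothetical vCenter
                  user="administrator", pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        current = {opt.key: opt.value
                   for opt in host.configManager.advancedOption.setting
                   if opt.key.lower() in WANTED}
        print(host.name, current)
    view.DestroyView()
finally:
    Disconnect(si)
```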
With EMC Celerra there are a couple of other important NFS-related settings:
On the Celerra filesystem supporting the NFS export:
- Enable the uncached write mechanism for all file systems (30% + improvement)
- Disable the prefetch read mechanism for file systems consisting of VMs with small random accesses patterns
Performance consideration #4: Plan your NFS server design accordingly
In general, consider both the performance and capacity axes – you need to design to meet capacity requirements (TB) and performance requirements (MBps, IOps, latency). You should employ every method you can to be as efficient as possible, but you need to make sure that you have enough spindles behind the filesystem supporting the NFS export to serve the aggregate IOps workload needed by all the VMs in the datastore. This isn’t hard to estimate – just measure a representative host(s) using perfmon, top, or the VMware Capacity Planner. It is also easy to fix if you have enough backend spindles – expand the filesystem (simple on both NetApp and EMC Celerra) – and in vSphere, Storage VMotion is supported with NFS datastores as sources or targets, so you can re-balance datastores as needed.
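As a rough illustration of that sizing exercise, here is a small sketch – the per-spindle IOps figure and RAID write penalty are generic rules of thumb (assumptions on our part), not NetApp or EMC specifications; plug in the numbers you actually measure:

```python
# Back-of-the-envelope spindle count for the filesystem behind an NFS export.
import math

iops_per_15k_spindle = 180   # assumed rule of thumb for one 15K RPM disk
vm_count = 40                # VMs planned for the datastore (example input)
avg_iops_per_vm = 50         # measured via perfmon/top/Capacity Planner
write_fraction = 0.3         # share of I/Os that are writes (example input)
raid_write_penalty = 2       # e.g. mirroring; parity schemes cost more

frontend_iops = vm_count * avg_iops_per_vm
backend_iops = (frontend_iops * (1 - write_fraction) +
                frontend_iops * write_fraction * raid_write_penalty)
spindles = math.ceil(backend_iops / iops_per_15k_spindle)
print(f"~{frontend_iops} host IOps -> ~{spindles} spindles behind the export")
```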
2) High Availability
NFS uses a different model for HA design than native block devices – but you can absolutely create high-availability configurations.
HA consideration #1: Network and NFS server design
The first core difference is that block protocols (iSCSI/FC/FCoE) use an initiator-to-target multipathing model based on MPIO; the domain of the path choice is from the initiator to the target. For NAS – the domain of link selection is from one Ethernet MAC to another Ethernet MAC – one link hop. This is configured host-to-switch, switch-to-host, NFS server-to-switch, and switch-to-NFS server, and the comparison is shown below (note that we called it “link aggregation”, but more accurately this is either static NIC teaming or dynamic LACP):
The mechanisms used to select one link or another are fundamentally:
- A Link Aggregation choice – which is setup per TCP connection – and is either static (setup once and permanent for the duration of the TCP session) or dynamic (can be renegotiated while maintaining the TCP connection – but still always on one link or another)
- A TCP/IP routing choice – where an IP address (and the associated link) is selected based on a layer-3 routing choice.
Note: Out of the box ESX/ESXi does not support dynamic LACP; however, Cisco’s 1000V vDS does provide this functionality along with numerous other enhancements which could take another blog post to discuss.
Here’s the basic decision tree:
The path on the left has a topology that looks like this (note that the little arrows mean that you must configure the link aggregation/static teaming from the ESX host to the switch and on the switch to the ESX host, and the same “setup on both sides” for the switch-NFS server relationship):
The path on the right has a topology that looks like this (you can use link aggregation/teaming on the links – remembering that it won’t help with a single datastore – but routing is the selection mechanism):
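To see why the left path calls for multiple IP addresses on the NFS server, here is a small illustration of link selection. The commonly documented simplification of ESX’s “route based on IP hash” policy is to XOR the low-order octets of the source and destination IPs and take the result modulo the number of uplinks – treat this as an approximation (an assumption on our part, not the literal vmkernel code), and the addresses below are hypothetical:

```python
# Approximate "route based on IP hash" uplink selection: each datastore is
# mounted against a different NFS server IP, so its (single) TCP session can
# land on a different physical link.

def uplink_for(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    src_last = int(src_ip.split(".")[-1])
    dst_last = int(dst_ip.split(".")[-1])
    return (src_last ^ dst_last) % uplink_count

vmkernel_ip = "10.10.10.21"                           # hypothetical ESX vmkernel IP
datastore_targets = ["10.10.10.100", "10.10.10.101"]  # two IPs on the NFS server

for target in datastore_targets:
    print(f"datastore mounted via {target} -> uplink {uplink_for(vmkernel_ip, target, 2)}")
# With these example addresses, the two datastores land on different uplinks.
```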
HA consideration #2: NFS Client Timeout considerations
NAS device failover generally takes longer than native block device failover: block devices generally fail over after a “front end” failure in seconds (or milliseconds), while NAS devices tend to fail over in tens of seconds (it can be longer, depending on the NAS device and the configuration specifics). This often gets thrown around by “block-heads” (the equivalent of a “NAS-bigot” – both types are equally dangerous :-) to instill FUD. The question is how much time elapses before ESX does something about it, and what the guest behavior is during that time period.
First – the same timeout concept exists with block storage, but the failover time is extremely rapid in almost all cases. Failed-path detection generally happens as soon as the first I/O fails for Fibre Channel and FCoE, and the actual path change occurs as soon as one of the SCSI responses that signal a dead path is returned (NOT_READY, ILLEGAL_REQUEST, NO_CONNECT and SP_HUNG for MRU arrays, or NO_CONNECT for Fixed arrays). These time periods are configurable (steps vary by HBA), but the defaults are good in almost all cases, fall within the common guest OS timeout values, and are measured in low seconds. In vSphere, the behavior is controlled by the Path Selection Plugin (and path state is handled by the Storage Array Type Plugin); third-party multipathing plugins can further optimize this behavior. Second – ESX and guest timeouts can be extended to survive reasonable FAS/Datamover failover intervals.
Third - use cases have varying tolerances for this behavior - some are perfectly fine with long timeouts, requiring no changes. Others are more sensitive.
Both NetApp FAS and EMC Celerra recommend the same ESX failover timeout settings. We recommend increasing the default values to avoid VMs being disconnected during a FAS/Datamover failover event.
The settings both EMC and NetApp recommend (apply these across all ESX hosts):
- NFS.HeartbeatFrequency(NFS.HeartbeatDelta in vSphere) = 12
- NFS.HeartbeatTimeout = 5
- NFS.HeartbeatMaxFailures = 10
The way these work:
- Every “NFS.HeartbeatFrequency” (or 12 seconds) the ESX server checks to see that the NFS datastore is reachable.
- Those heartbeats expire after “NFS.HeartbeatTimeout” (or 5 seconds), after which another heartbeat is sent.
- If “NFS.HeartbeatMaxFailures” (or 10) heartbeats fail in a row, the datastore is marked as unavailable and the VMs “crash”.
This means that the NFS datastore can be unreachable for a maximum of 125 seconds before being marked unavailable, which covers the large majority of both NetApp FAS and EMC Celerra failover events.
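The arithmetic behind that 125-second figure, based on the mechanism described above:

```python
# How the recommended heartbeat settings combine into the ~125 second window
# before an NFS datastore is marked unavailable.
heartbeat_frequency = 12   # NFS.HeartbeatFrequency / HeartbeatDelta (seconds)
heartbeat_timeout = 5      # NFS.HeartbeatTimeout (seconds)
max_failures = 10          # NFS.HeartbeatMaxFailures (consecutive failures)

worst_case_seconds = heartbeat_frequency * max_failures + heartbeat_timeout
print(f"datastore marked unavailable after ~{worst_case_seconds} seconds")  # 125
```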
Now – what does a guest see during this period? It sees a non-responsive SCSI disk on the vSCSI adapter. The disk timeout is how long the guest OS will chill as the disk is non-responsive. To set operating system timeout for Windows servers to match the 125 second maximum set for the datastore:
- Back up your Windows registry.
- Select Start>Run, type regedit.exe and click OK.
- In the left-panel hierarchy view, double-click HKEY_LOCAL_MACHINE, then System, then CurrentControlSet, then Services, and then Disk.
- Select the TimeOutValue and set the data value to 125 (decimal).
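If you would rather script this change across many Windows guests than edit the registry by hand, here is a minimal sketch using Python’s standard winreg module (Windows-only; run inside the guest with administrative rights, and back up the registry first as noted above):

```python
# Set the guest SCSI disk timeout to 125 seconds to match the NFS datastore
# timeout above (HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue).
import winreg

key = winreg.OpenKey(
    winreg.HKEY_LOCAL_MACHINE,
    r"SYSTEM\CurrentControlSet\Services\Disk",
    0,
    winreg.KEY_SET_VALUE,
)
winreg.SetValueEx(key, "TimeOutValue", 0, winreg.REG_DWORD, 125)
winreg.CloseKey(key)
```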
Additional Recommended Reading
1) We would STRONGLY recommend reading a series of posts that the inimitable Scott Lowe has done on ESX networking, and start at his recap here: http://blog.scottlowe.org/2008/12/19/vmware-esx-networking-articles/
2) Also – prior to getting started, we recommend that all deployments read our documentation:
- EMC customers using NFS with vSphere:
- EMC customers using NFS with VI 3.x
- VMware ESX Server Using EMC Celerra Storage Systems – Solutions Guide
- EMC Celerra VMware ESX Server Optimization with EMC® Celerra® Performance Study - Technical Note P/N 300-006-724
- NetApp: NetApp & VMware Virtual Infrastructure 3: Storage Best Practices
In conclusion - NFS is an absolutely legitimate storage model for VMware - with many advantages. It deserves consideration along with all the other storage options available. As with everything - success is determined not only by technological factors, but by design - and most importantly - the customer’s experience with various technologies and models. As unified networking and 10GbE become the norm, we expect to see customers deploy a mix of storage protocols, as each has its pros and cons.
I’d like to thank my friend, competitor and partner in the blogosphere for making this post happen. We hope you find this information helpful and more importantly useful in the design of your virtual data center.
Chad - thanks again for all of your efforts in making this post happen.
Posted by: Vaughn Stewart | June 09, 2009 at 10:03 AM
Vaughn my pleasure - and THANK YOU! It was a good collaborative effort. I hope (as I know you do) that we are helping customers!
Posted by: Chad Sakac | June 09, 2009 at 04:19 PM
Chad - you've hit the nail on the head why everyone treats NFS as a second tier - timeouts :-)
Is there any chance VMware could change the default NFS timeouts to something more sensible? (And while we're at it, get vmtools to fiddle with the guest too?...)
Without the defaults changing, I'd argue NFS should be treated as a second tier. I think this thread sums it up quite nicely:
http://communities.vmware.com/thread/197850
Posted by: David Barker | June 10, 2009 at 06:45 AM
David - note that both EMC and NetApp suggest to INCREASE the default NFS timeouts over the VMware defaults. This isn't a VMware issue - this is a NAS device characteristic.
Both NetApp and EMC Celerra engineering have been laser focused on reducing the duration of failover events, and have been making steady progress over the years. The trick is to solve this not just for the "best case" (in which case we can both be well under the 125 second recommendation), but "all cases" as the failover time (again, for both of us) tends to depend on many parameters.
I'm an engineer - so I tend to call this "unbounded behavior".
Certain things (fast failover, consistency groups) are "easier" from an engineering standpoint on "traditional block" architectures
Others (deduplication, thin provisioning) are "easier" on "traditional NAS" devices
Others still (object-level policy) on CAS/COS device architectures.
These "easier" things dont mean that they are impossible on the various architectures (many things on the lists exist now across multiple platforms architectural models), but does tend to be the reason they appear in one place first, then later in others.
The other element here is that as noted (both in the post, and in the VMTN thread you post), the failover domain of NAS is link-based, and block is path based. NAS also depends on the longer TCP/IP native timeouts, ARPs, and various other elements of the TCP/IP stack. Again - none of this is intrinsically BAD - but is intrinsically DIFFERENT.
Vaughn - you might want to have someone at NetApp help out the customer on the thread David points out.
Thanks for commenting - and we are indeed furiously working on this on the EMC side David - more to come soon.
Posted by: Chad Sakac | June 10, 2009 at 08:41 AM
Sorry, I didn't mean to say NFS was bad either - like you say, different! :-)
(yes, 'second tier' was unfair...)
Strictly speaking, timeouts aren't a vmware issue but:
- The current defaults don't work for the world's most popular NAS boxes.
- New users tend to trip up on NFS (thinking it's a cheap way to do storage), which is perpetuating the FUD. NFS servers tend to be treated like an office fileserver; 'if it goes down, it's OK; clients will reconnect'.
Maybe NFS should be treated as an 'advanced/expert' protocol in VMWare? Or just add an extra warning about timeouts in the vSphere client when setting up NFS stores?
PS: Many thanks to you and Vaughn for your hard work on this (and other) blog posts :-)
Posted by: David Barker | June 10, 2009 at 10:37 AM
Chad and Vaughn,
Excellent write up and I commend you on your collaboration. Why doesn't vmware support multiple paths to the same datastore over NFS?
Clustered NAS storage systems like Isilon (for which I work) provide access to the same datastore from multiple storage devices and could easily allow multi-pathing if it was supported by the ESX host.
Perhaps that could be achieved by creating a 3rd party plugin but it seems like that can only be set for SAN storage.
Also, NFS/IP failover on an Isilon cluster takes no more than 5 seconds (mainly to allow the gratuitous arp to update the switch).
Regards,
Shai
Posted by: Shai Harmelin | June 10, 2009 at 02:51 PM
Awesome article, certainly on par with your iSCSI Multi-Vendor example!
I have to agree with the second tier/DR storage solution.
I (like many) have been bitten too many times by the timeouts, and the lost connection reactions of ESX 3.5 U3 and below. It would seem that there would be a little better "healing" reaction to make NFS a more trustworthy storage solution.
I can honestly say I have reserved my NFS mounts for things like ISO image storage, and other ancillary data. Even then the only times in the last couple of years I have had to reboot my Cluster was due to hung NFS mounts.
This isn't unlike any Unix/Linux flavor (or Windows as an NFS client either for that matter), so I wouldn't just assume that VMWare's reaction to a lost NFS connection would be different.
Just my 2c
Thanks for all of your hard work, and solutions. You guys are quite the crutch for me! :-)
smooter
Posted by: smooter | June 10, 2009 at 03:39 PM
Read and considered everything in this article carefully. Great effort from you on this. Now, I have one specific question which I hope someone can answer.
I'm not 100% clear on the left path in the diagram for HA. We have multi-chassis LAG everywhere in a new modern environment we're building, with Nexus 5000. So increase and use multiple NFS datastores, to increase bandwidth (TCP connections) over LACP LAG "trunks". I understand the topology pictures below the decision chart (the 1st one is our scenario).
But it says: "Configure NFS server to have multiple IP-addresses, can be on the same subnet."
We will have LACP LAG 802.3ad with two (2) 10 GbE ports at the storage (NFS server) side. The same ports will be VLAN tagged to up to 62 different subnets, to avoid routing NFS traffic in this environment. Does the above statement imply that to get ESX 3.5 (and also 4.0) to utilize both 10 Gbit links as well as possible, there needs to be at least 2 separate IP addresses (or even more?) on the server subnet where the ESX server(s) will sit?
It doesn't really say anywhere in the article why that is. I don't know ESX in practice so for those ppl who do, this may be obvious. Sorry if that's the case.
Thanks for these two articles on iSCSI and NFS, really useful for me.
/M
Posted by: Michael Bergman | June 16, 2009 at 07:48 PM
I noticed one more thing while reading and comparing these two articles (iSCSI & NFS).
"[ESX 3.5 SW iSCSI initiator ...] this behavior will be changing in the next major VMware release . Among other improvements, the iSCSI initiator will be able to use multiple iSCSI sessions (hence multiple TCP connections)."
This will make LACP LAG much more efficient in the iSCSI scenario in vSphere 4 than it has been in 3.x. In this article it says, less encouragingly:
"[...] every NFS datastore mounted by ESX (including vSphere – though NetApp and EMC are both collaborating for longer term NFS client improvements in the vmkernel) uses two TCP sessions – one for NFS control information, and the other for NFS data flow itself. This means that the vast majority of the traffic to a single NFS datastore will use a single TCP session."
No similar improvement here then, like in the iSCSI case :-(
Unless I misinterpreted something the same scenario for utilising more bandwidth with LACP LAG still holds for vSphere. One still has to do something deliberate to work around this particular inefficiency.
/M
Posted by: Michael Bergman | June 16, 2009 at 08:11 PM
Great article, just one comment, according to VMware, Jumbo Frames are not supported until ESX 4.0.(vSphere). So your note regarding "Support for Jumbo Frames for NFS (and iSCSI) was added in VMware ESX 3.5U3 and later..." is not entirely accurate.
Source: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009473
Posted by: Maria | June 26, 2009 at 02:15 PM
Thanks Maria - I heard otherwise directly from the devs, but the KB is authoritative. Let me double check, and either will correct my post or the KB.
Thanks again for commenting!
Posted by: Chad Sakac | June 26, 2009 at 05:12 PM
Thank you Chad, also I was wondering if you have implemented Flex-10 technology, from HP, with NFS.
We have seen some performance issues implementing Flex-10 with NetApp using NFS. I would like to know if you have any recommendations. Basically the problem is the throughput - I/O performance is decreasing even using 10GbE, from 10 to 4.5. If our goal is to have as many as 7,000 VMs then this could be a big problem.
Do you have any thoughts on this?
Thanks in advance,
Posted by: Maria | July 03, 2009 at 04:08 PM
I haven't personally Maria, but some on my team have. Driving 10GbE at line speed, while possible, is non-trivial. Achieving 4.5Gbps is not bad. Not great, but not bad.
I would start by looking at every link in the network for congestion characteristics (dropped frames). I would also look at the array (not because I want to point the finger at NetApp, but because it's a quick check). I'm not a NetApp expert, but I'm sure they have something analogous to our Analyzer tools - just quickly check to see that you're not bound by the backend aggregate or the filer itself (I hope this is a 6000 series FAS array - right? There's a reason why we started our 10GbE support on our analogous NS-960 - it's not because we couldn't put it in smaller ones, they would just struggle to support the throughput).
Have you checked to make sure you are using TSO? This is an important setting in these circumstances. Also, an experiment (a quick way to determine how much is on the ESX host versus in other elements like the network or the array target) would be to try VMDirectPath I/O - though in the vSphere 4 release, the limitations are very steep.
Personally - without digging into it much further (for example, perhaps they are very, very light), I would be very hesitant to put 7000 VMs on a single 10GbE network and on a single array (of any mid-range class - that's not a knock on NetApp, I would say the same about the EMC mid-range stuff).
Do you want me to grab my HP and NetApp colleagues to try to help you?
I can also get the 10GbE NFS experts at EMC on the Celerra team to reach out - just let me know.
Good luck!
Posted by: Chad Sakac | July 06, 2009 at 10:13 PM
@Maria - I just checked also with the development team - you are RIGHT - Jumbo frames aren't supported until vSphere 4. Argh - need to update the post, but important to be correct.
Posted by: Chad Sakac | July 06, 2009 at 10:14 PM
Thank you Chad, I really appreciated your thoughts on this.
We are using a NetApp FAS6080A, but only 2 heads (most likely this is the actual problem). We have 9 chassis going to the NetApp (each chassis would have ~ 800 VMs). As recommended, also in this article, we have separated VLANs and physical Switches for Network and Storage. Using the same Network adapter though with Flex-10.
Is it possible to get some links from your NetApp colleagues? Regarding best practices to implement NFS/VMware/NetApp? Their thoughts or POC on Flex-10?
I just want to make sure that we are on the right path to improve this.
Thank you again,
Posted by: Maria | July 08, 2009 at 02:45 PM
Maria,
From your comments I'm not quite clear on the challenges you are seeing. May I suggest that you contact the NetApp Global Support Center at http://now.netapp.com or 1-888-4-NetApp.
Thanks
Vaughn Stewart (note I'm with NetApp)
Posted by: Vaughn | July 09, 2009 at 11:22 AM
Chad and/or Vaughn,
would you care to comment on my question in the first one of my two posts above? I'm interested in the same scenario as Maria, although not with 1000s of VMs. We want to use LAG (802.1AX aka 802.3ad) & 10 GbE & NFS with NetApp FAS3170A systems and vSphere 4. The TCP session limitation still present in vSphere 4 means one has to use multiple NFS Datastores. Then, when using multi-chassis Link Aggregation, I would very much like to fully understand this detail with "multiple IP-addresses for the NFS Server, can be on same subnet".
Would you please elaborate a little bit further on this?
Thanks,
/M
Posted by: Michael Bergman | July 22, 2009 at 05:49 PM
Excellent article, I like the multivendor aspect and how you tie together the important information without it becoming a plug for a specific vendor. That's not easy to find these days, and it's exactly what most people need.
I'm planning a large ESXi 4.0u2 deployment based on NetApp NFS and iSCSI storage. This is my first interaction with NetApp gear and specifically using NFS to host the majority of the VMs, so this and the iSCSI post are a great help. I did want to ask whether there have been any changes specifically to the number of NFS TCP connections with any of the latest updates, or any new best practices per VMware or NetApp?
Thanks for the great info, keep it up! :)
Posted by: Justin Cockrell | July 05, 2010 at 06:19 PM
Hi Chad / Vaughn
Thanks for your multivendor posts -- really useful information.
Can you clarify for me what difference the "real" network load balancing policy available on distributed virtual switches in vSphere 4.1 makes to the "single tcp connection to one NFS volume" story.
Does this allow multipathing to a single NFS volume? That is, can I get, say, 2gbps to a volume if I have a two-way physical vmnic team attached to the virtual switch used for the NFS vmkernel port?
Cheers
Simon
Posted by: Simon Reynolds | February 18, 2011 at 05:09 AM
Hi, i have a question on your HA section. My switches don't support cross-stack ether channel. so the decision tree states to use multiple links i need to create vmkernels on different subnets. but how do you tell vmware that a NFS export is available on two subnets (or two diff IP addresses for that matter)
for example, with a celerra, i can create two interfaces of 10.10.10.100 and 10.10.20.100. I can use both of those IP addresses to get to an NFS export, but when you define a NFS export in vSphere you only have an option of supplying one IP address per export. so if the switch supporting the celerra interface of 10.10.10.100 were to die, the switch supporting 10.10.20.100 would still be alive and that nfs export would be available on 10.10.20.100, but if i configured the nfs export in vsphere using the ip address of the interface that is not available (10.10.10.100) anymore, then how does vmware get access to the nfs export using the other IP address?
Posted by: Mike | February 23, 2011 at 08:51 PM
I just asked a netapp rep and it seems that the single-tcp-session limitation is no longer present in 4.1
Posted by: Wout Mertens | May 12, 2011 at 06:57 AM
Hi Chad,
We have been doing a lot of investigations with regard to setting up the networking on NFS with the new VNXe.
We believe that we need to follow the following option listed above:
"To use multiple links, use vmkernel routing table (separte subnets) this requires multiple datastores"
This statement is mentioned in several EMC documents, but they never go on to document the actual configuration at the VMware level.
Where as the Multi-chassis link aggregation option is fully documented.
Do you have documentation on how to setup VNXe/VNX/Celerra NFS using the vmkernel routing table option?
Also are you planning to update this post to cover VNXe?
Many thanks
Mark
Posted by: Mark Burgess | July 08, 2011 at 07:02 AM