LOL – funny post title, wish I didn’t have to put the disclaimer.
Warning – this DOES NOT WORK WITH vSphere right now – so if you’re busy, move on :-) On the other hand, this will be important over the next couple of years (with vSphere, and in storage land in general), so file this under “stay tuned” – and if you have some time, it’s worth reading as a “heads up”….
If you’re a glutton for punishment, or just plain curious, read on…
Those of you who know me know that I’m a big fan of the application of NFS (often in conjunction with VMFS, not in substitution) in VMware environments. Anyone who is a hands-on nerd with their own little VMware environment knows that it’s hard to match the crazy simplicity of NFS in VMware environments.
Unfortunately, anyone who is operating a large-scale VMware environment also gets exposed to some of the more icky parts of using NFS with VMware. Note that this isn’t to imply that there aren’t “icky bits” with everything (there certainly are with VMFS) – sometimes they become visible fast, sometimes they become visible later.
Most commonly in my experience (and I think these apply to both EMC’s NFS servers and the other NFS server folks pretty evenly), these are the “icky” bits:
- NFS Server failure behavior
- NFSv3 client limitations = unexpected bottlenecks
There are other items (for example, speaking for ourselves, the effect of guest OS alignment is more pronounced on NFS than it is on VMFS based datastores – but this depends a lot on the particulars of the vendor implementation) – but those two are the biggies.
Both of these are why the continued evolution of NFS is important. pNFS is one of the possible routes. It’s early days in pNFS land, but over the coming months and years, I expect it to become a standard part of our lexicon.
The latest EMC DART GA code-base (DART 6.0) supports pNFS (as well as NFS v4.1 of course). Note that it doesn’t support a filesystem existing across multiple datamovers (scale-out NAS) yet – the filesystem exists on a single datamover. Like all things with our NAS device – you can play with the REAL THING using the Celerra VSA – here.
EMC is sponsoring and supporting a multi-vendor interop plugfest along with some friends this week…
I love the attendee list for the meeting. Not only will it be a nerdy extravaganza (I wish I were there), it highlights how much work goes on behind the scenes BETWEEN the vendors.
| Attendee Name | Company |
| --- | --- |
| Cruz, Art | BlueArc |
| Adamson, William | NetApp |
| Baker, William | Oracle |
| Bao, Haiyun | EMC |
| Baranova, Tatjana | dCache |
| Benjamin, Matthew | Linuxbox |
| Black, David | EMC |
| Bodley, Casey | CITI Univ. of Michigan |
| Casey, Brian | Microsoft |
| Chintakindi, Shashi | EMC |
| Dickson, Steve | RedHat |
| Emerson, Adam | Linuxbox |
| Erasani, Pranoop | NetApp |
| Eshel, Marc | IBM |
| Faibish, Sorin | EMC |
| Fields, Bruce | RedHat |
| Gardere, Daniel | EMC |
| Gemignani, John | Isilon |
| Halevy, Ben | Panasas |
| Harrosh, Boaz | Panasas |
| Haynes, Thomas | NetApp |
| Heller, Jeffrey | NetApp |
| Honeyman, Peter | CITI Univ. of Michigan |
| Joshi, Sandeep | BlueArc |
| Kaura, Suchit | BlueArc |
| Khounsavath, Dara | EMC |
| Kirsch, Zacharie | Isilon |
| Kornievskaia, Olga | CITI Univ. of Michigan |
| Labiaga, Ricardo | NetApp |
| Lentini, James | NetApp |
| Lever, Charles | Oracle |
| Myklebust, Trond | NetApp |
| Navrotskaya, Yuliya | EMC |
| Noveck, Dave | EMC |
| Pedone, James | EMC |
| Philippe, Deniel | Ganesha |
| Quigley, David | N/A |
| Raj, Theresa | NetApp |
| Rees, James | CITI Univ. of Michigan |
| Schumaker, Bryan | NetApp |
| Shah, Peter | NetApp |
| Staubach, Peter | EMC |
| Thieme, Lynn | NetApp |
| Thurlow, Robert | Oracle |
| Uddin, Mark | EMC |
| Yap, Joon-Jack | EMC |
So…. What the heck is pNFS?
Here’s a little stick diagram (thanks Sorin):
There are two observations:
- notice that “where is my data?” is handled by the pNFS metadata server, while “give me my data” is handled by the pNFS data servers (the metadata server is a bottleneck for queries, not for IO, and is critical for availability)
- the protocol for data delivery can be file, object, or block (this is the bottleneck for the IO itself, and determines the robustness of data delivery)
They are drawn like separate boxes, but of course those are functional, not physical. When I say that DART supports pNFS 4.1, I’m saying that it can operate as a standards-based pNFS server, as well as delivering storage access.
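If it helps to see that split in code form, here’s a minimal conceptual sketch – the class and function names are illustrative only, not any vendor’s API:

```python
# Conceptual sketch of the pNFS split between metadata and data paths.
# Names are illustrative, not a real client API; data is stubbed with zeroes.
from concurrent.futures import ThreadPoolExecutor


class MetadataServer:
    """Answers 'where is my data?': hands out layouts, never the data itself."""

    def get_layout(self, path):
        # A layout maps byte ranges of a file to the data servers that hold them.
        return [
            {"ds": "ds1.example.com", "offset": 0,         "length": 8 * 2**20},
            {"ds": "ds2.example.com", "offset": 8 * 2**20, "length": 8 * 2**20},
        ]


def read_segment(seg):
    """'Give me my data': issued straight to a data server, bypassing the MDS.
    A real client speaks NFS (file), OSD (object), or SCSI (block) here,
    depending on the layout type the server hands back."""
    return b"\x00" * seg["length"]


def pnfs_read(mds, path):
    layout = mds.get_layout(path)          # one small metadata round trip
    with ThreadPoolExecutor() as pool:     # the data path runs in parallel
        return b"".join(pool.map(read_segment, layout))


data = pnfs_read(MetadataServer(), "/datastore1/vm01-flat.vmdk")
```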
How might this look in the future with VMware? (note!!! FUTURE!!!! VMware does NOT support anything more than NFSv3 right now!!!!)
One way, with pNFS (file), would be to access giant NFS datastores that span multiple boxes, perhaps even multiple vendors.
Another way might look like this:
This way, pNFS delivery and VMFS delivery share a common model.
Let’s hit those two “icky” bits looking at the world of today….
1 - The first topic (“NFS Server failure behavior”):
This is something that is in the NFS server vendor’s domain of control, and not really dependent on the client (though pNFS could change the mechanics).
The magic mark (IMO) is that failover ALWAYS occurs in 30 seconds or less. With certain application exceptions, Windows and Linux guests tend to “time out” after about 1 minute of a storage device being unresponsive. If that device is supporting swap, that often translates to a guest OS crash.
If fileserver failure scenarios always resolve in less than 30s, then you cover many (not all – some apps time out before the OS does) use cases without resorting to unnatural acts (modifying timeouts all over the place).
DART 6.0 has boatloads of improvements, and one of them is the continued march towards faster and more predictable failover. But it still involves (in essence) the datamover booting on failover (albeit from a “warm state”, so it’s much faster) and taking on the “persona” of the failed datamover. Until the IP address supporting the NFS export comes up – well, the datastore is unavailable.
The failover time is governed by things like: the time for the datamover to boot, for filesystems to mount, and for core services to become available. In turn, these tend to depend on a bunch of parameters. In “block land”, failover is generally not a function of those same variables.
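As a toy model of what has to fit inside that window (the step durations below are made-up placeholders, not DART measurements):

```python
# Toy model of what a single-head NAS failover has to fit inside the ~30 s window.
# Step names follow the text above; the durations are made-up placeholders.
failover_steps_s = {
    "standby datamover boots (warm state)":     10,
    "filesystems mount / log replay":            8,
    "core services (networking, NFS) start":     5,
    "failed datamover's IP/persona taken over":  3,
}

total = sum(failover_steps_s.values())
GUEST_DISK_TIMEOUT_S = 60  # typical Windows/Linux guest behavior noted above

print(f"modelled failover: {total} s "
      f"({'inside' if total <= 30 else 'outside'} the 30 s target; "
      f"guest timeout {GUEST_DISK_TIMEOUT_S} s)")
```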
There are a couple of ways of cracking that nut.
One way is to simply improve failover time of a single file server. That’s something we’ve been working on for years. It’s surprisingly non-trivial.
The other way is to distribute the data across multiple things, and make a very HA metadata server. This is the general approach of clustered filesystems (not the same thing at all, but some similar ideas), and of pNFS. You can imagine that if the metadata server were very robust, then regardless of whether you were using the pNFS file/block/object model for data delivery, the dependency on any single pNFS data server may not be as sensitive – so maybe it doesn’t need to be optimized to the same degree.
2 - The second topic (“NFSv3 client limitations = unexpected bottlenecks”):
People expect that their storage resources (CPUs, disks, ports) can be used “horizontally” – spreading load all over the place. The macro-level reason is that without that, a “localized spike” (a spike in anything) tends to have more effect. The more “spread out” something is, the less effect these “localized spikes” tend to have.
Let me give you an example… you have 1000 VMs. There are 80 of them in each NFS datastore, and each NFS datastore is 4TB in size (this also works with 100 VMs and only two datastores, each with 50 VMs, if you’re a smaller customer). You’re on an NFS server with 4 front-end interfaces on each file-server. At first glance, you would expect each NFS datastore to be using all 4 of those interfaces if you set it up right. In reality – due to the NFSv3 client in vSphere – each datastore will have all its traffic going to a single MAC address (which one will be determined by your exact config).
So, how do you design around that? Well, you create multiple datastores, configure them properly (via the vCenter plugin your vendor should have), and distribute load properly – the best practices everyone calls out.
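To make the “single MAC address” point concrete, here’s a minimal sketch of the kind of hash-based link selection involved. The exact hash depends on your host and switch teaming policy – the function and the addresses below are purely illustrative:

```python
# Why one NFSv3 datastore sticks to one uplink: there is a single TCP session per
# mount, and teaming hashes that session's addresses to exactly one link. The hash
# below (XOR of the last octets, modulo uplink count) is one commonly described
# variant -- your environment may differ, so treat the numbers as illustrative.

def pick_uplink(src_ip, dst_ip, n_uplinks):
    src_last = int(src_ip.split(".")[-1])
    dst_last = int(dst_ip.split(".")[-1])
    return (src_last ^ dst_last) % n_uplinks


esx_vmknic = "192.168.10.21"       # hypothetical VMkernel port used for NFS
datastores = {                     # hypothetical NFS server target IPs
    "ds_gold":   "192.168.10.101",
    "ds_silver": "192.168.10.102",
    "ds_bronze": "192.168.10.103",
}

for name, target in datastores.items():
    print(name, "-> uplink", pick_uplink(esx_vmknic, target, n_uplinks=4))

# Every session for a given datastore lands on the same uplink; spreading load
# means mounting different datastores against different target IPs.
```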
The problem is that this creates localization of load. One of the NFS server’s resources could be taxed while others sit idle. You can “brute force” it (a valid solution) using 10GbE, but that’s a “workaround”, not a “fix”.
NFS v4.1 adds the ability to have multiple connections per session – which means that you could drive load across multiple interfaces as needed.
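Conceptually, that session trunking looks something like this – a purely illustrative sketch, since no real NFS client exposes an interface like this:

```python
# Conceptual sketch of NFSv4.1 session trunking: one session, several connections,
# requests fanned out across them. Illustrative only -- not a real client interface.
from itertools import cycle


class Nfs41Session:
    def __init__(self, connections):
        self.connections = connections     # e.g. one TCP connection per interface
        self._next = cycle(connections)

    def send(self, request):
        conn = next(self._next)            # spread requests across the connections
        print(f"{request} -> {conn}")


session = Nfs41Session(["10.0.0.1:2049 via vmnic0", "10.0.0.2:2049 via vmnic1"])
for i in range(4):
    session.send(f"READ block {i}")
```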
pNFS adds the ability for the data servers to be VERY parallel. In one of the pNFS precursors (MPFS), EMC has used this for a while for very high-bandwidth use cases – often driving 2-5GBps out of our older tech, while keeping NFS behavior for access. We can do even more out of the current-gen stuff, and the next-gen stuff will be higher still. It’s also notable that this has a lot of the “good bits” of block (end-to-end path selection/failover via MPIO, latency, bandwidth) with the “good bits” of NAS (big namespace, client caching options, lock handling).
Put this together with converged networks – and pNFS could represent an end to the “protocol debate” (which, regardless of any attempt I make to point out its futility, continues to hound us).
There are open-source client and server implementations for Linux and OpenSolaris, and now a Windows client.
SO, where does this stand with VMware?
No comment :-) Seriously, I can’t comment on future VMware releases, and even need to tap-dance around EMC stuff. Getting pNFS into VMware is trickier than you would expect – not just due to questions of stuff like code quality, but also some business questions (3rd-party clients embedded into the vmkernel raise all sorts of complex questions). But rest assured, we (the NFS vendor folks) are working with VMware together to try to drive this as fast as we can.
It will take longer than anyone in the NFS server business (or any NFS customer) would hope, but likely just right for more maturity in the pNFS space (which is the right product management trade-off). Stay tuned.
The only way a pNFS files data store can span vendors would be through something like FedFS, and then it's simply the namespace doing the spanning. Data servers in a pNFS files setup use a vendor-specific "control protocol" to interact with metadata servers. There is no standard covering these control protocols, and they are tightly-coupled to the vendor's server implementation.
Posted by: Dan Muntz | October 04, 2010 at 06:22 PM
Hi Chad, thanks for another excellent post.
When you said that "due to the NFSv3 client in vSphere, each _datastore_ will have all the traffic going to a single MAC address (which one will be determined by your exact _config_)", do you mean that we can configure which pNIC to use for which datastore? How do I configure that (I guess it has to be CLI on each ESX).
If we use Load Based Teaming in vSphere 4.1, will that help when we have a pair of 1 GE pNIC dedicated for NFS?
Many thanks from Singapore
e1 at vmware dot com
Posted by: iwan 'e1' Rahabok | October 08, 2010 at 10:41 PM
@e1 - that's a common misunderstanding. NIC teaming/load balancing (802.3ad) works by picking a link when a TCP session is first established. Which link is picked is based on a hash of some sort (source/dest IP, source/dest MAC).
If it's static link aggregation, that selected link is used constantly for that TCP session.
If it's dynamic link aggregation, the link that was selected can be changed based on load.
BUT only a single link is used at any given time.
That's one of the key things - NFS v3 uses a single session per mount, and therefore a single link.
So - no matter how you configure it, if you're using 1GbE (as opposed to 10GbE), expect to see about 80 MBps as the max for the datastore if 100% read or write, and about 160 MBps as the max for the datastore if you have a 50/50 read/write workload (since 1GbE is full duplex, reads and writes flow in both directions at once).
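For the curious, the back-of-the-envelope math behind those numbers (the "usable" fraction is an assumption, not a measurement):

```python
# Rough math behind the ~80 / ~160 MBps figures above. The 1 GbE line rate is
# exact; the usable fraction after TCP/NFS overhead is an assumed round number.
LINE_RATE = 1000 / 8   # 1 GbE = 1000 Mb/s = 125 MB/s in each direction
USABLE = 0.65          # assumed fraction left after protocol overhead

one_way = LINE_RATE * USABLE
print(f"100% read or 100% write: ~{one_way:.0f} MB/s")

# A 50/50 read/write mix drives both directions of the full-duplex link at once,
# so the combined ceiling roughly doubles.
print(f"50/50 read/write mix:    ~{2 * one_way:.0f} MB/s combined")
```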
That's why if you want to increase throughput you need to have multiple datastores, and it's also ONE of the reasons why changes in the NFS client behavior in vSphere would be a big deal.
Posted by: Chad Sakac | December 20, 2010 at 11:36 PM