So, recently there was a set of EMC/VMware cases (the lesson applies across the storage community) that prompted my colleagues Scott Drummonds (here) and Scott Lowe (here) to comment.
The scoop is that SIOC treats the datastore as if it can "sink" a fixed amount of IOPS, and that once the SIOC latency threshold is crossed, the throttling mechanisms should bring down guest latency (which is what triggered them in the first place).
In a basic sense, this is one of the reasons the recommended threshold values are based on disk rotational speed/type.
If the threshold is triggered, and SIOC kicks in, but doesn’t see the expected change in guest latency, it figures “something else must be contending for IO”.
If that happens – you may see an event in the log that reads:
“External I/O workload detected on shared datastore running Storage I/O Control (SIOC) for congestion management.”
And there is a KB article with more detail:
http://kb.vmware.com/kb/1020651
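To make the reasoning above concrete, here is a minimal sketch of that kind of check. This is NOT VMware's actual algorithm: the function name, the parameters and the 10% "expected improvement" cutoff are all hypothetical, picked only to illustrate the idea that "we throttled, latency didn't move, so something else must be hitting the datastore".

```python
def external_workload_suspected(latency_before_ms, latency_after_ms,
                                threshold_ms, min_expected_drop=0.10):
    """Toy heuristic, not the real SIOC logic.

    latency_before_ms: datastore latency when throttling kicked in
    latency_after_ms:  latency observed after throttling was applied
    threshold_ms:      congestion threshold configured for the datastore
    min_expected_drop: hypothetical fraction by which throttling is
                       expected to reduce latency (an assumption)
    """
    # Throttling only engages once latency crosses the threshold.
    if latency_before_ms <= threshold_ms:
        return False
    # How much did latency improve, relative to where it started?
    relative_drop = (latency_before_ms - latency_after_ms) / latency_before_ms
    # Little or no improvement suggests load the hosts cannot see or control.
    return relative_drop < min_expected_drop


if __name__ == "__main__":
    # Throttled, but latency barely moved (e.g. array-side replication load).
    print(external_workload_suspected(40.0, 39.0, threshold_ms=30.0))
    # Throttled, and latency fell back under the threshold.
    print(external_workload_suspected(40.0, 26.0, threshold_ms=30.0))
```

With these made-up numbers, the first call flags a suspected external workload and the second does not.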
Now, Scott/Scott’s discussion on this was great, and I don’t want to repeat their content (check them out).
Let me tell you why (except perhaps for the slightly over-the-top, scary KB article text) I think this is a FEATURE, not a bug.
In essence, it’s an example of where VMware is trying to go long term re: resource management/coordination of storage, just like they do in the scheduler for CPU, and in how they manage host memory.
In the early days of storage resource management in vSphere (aka today), VMware looks at the datastore as being able to provide “a fixed” (but unknown) quantity of a set of services. Using queue management, they are trying to provide some scheduling of storage resources.
Storage/Networking resource management is a MUCH different (harder?) problem than CPU/memory scheduling as the resources are:
- not in the host itself
- subject to the insane amount of architectural variability in storage subsystems (whereas the CPU/memory subsystems of x86 servers are much more alike)
- subject to external contention by a set of widely variable “other factors” beyond vSphere’s control.
That last bullet was the trigger of the case here: the customer was replicating the datastore, so the VM I/O was contending with the array-internal replication load, which SIOC had no awareness of.
So – what to do?
A: look for the error. It MIGHT mean that you have something wrong that you should investigate. If you look into it and, sure enough, everything is fine (or if you’re undergoing a RAID rebuild, or replicating, or carrying some other load), you can always acknowledge the alarm. Sometimes, on the other hand, it will reflect an issue in the network, or in the storage configuration in some other way.
What I think is cool is that this little case gives a “sneak peek” at an area of “advanced integration work”: storage array-to-ESX-to-vCenter signalling of underlying storage configuration and capabilities (to make resource management a more real option across many arrays and heterogeneous configs). Perhaps in the future, the datastore won’t be a “black box” to vSphere.
It’s cool, and I’m VERY fortunate to be exposed to some of the R&D thinking/design decisions down this path. It’s a very interesting space. There are more and more “big automated pool” characteristics appearing in storage array land across host use cases (think auto-tiering, in-array and across-array and across-site federation as some examples of many). VMware is adding more and more functions internally (SIOC, svmotion). It’s a complex topic that we know needs to be simple in practice, and one we are working very hard on (at VMware, and in and across the storage community).
Stay tuned in this space….
Wouldn't it be great if storage vendors came out with a standard management protocol to exchange information about the status, health, perf, etc. of their devices with any compatible clients, such as vSphere or other storage devices?
The networking industry has standards like SNMP and IGMP. I expect the storage industry can do the same, no!?
Rgds,
Didier
Posted by: Dpironet | November 09, 2010 at 03:09 AM
Didier,
Do you mean SNIA's Storage Management Initiative (SMI)? EMC and other storage vendors support it. The main thing is how _fully_ each of them has implemented the protocol. It's something like SNMP, etc.
Posted by: Serge | November 10, 2010 at 03:27 PM