Update: With the release of the VNXe3200 all of the issues discussed in this post have been resolved. For a review of the VNXe3200 controller failover process check out my new article: http://www.jpaul.me/2014/08/vnxe-3200-controller-failover-behavior/
Let me start by saying that the purpose of this post is not to say that the EMC VNXe is a bad SAN, nor that the HP P2000 I used for comparison is a superior one. The purpose is simply to spell out how fail-over and fail-back work on the VNXe, and the pitfalls you may encounter if there is a failure. Moreover, this information should be considered when planning a VMware vSphere environment where a VNXe is under consideration, so that applications with real-time or near-real-time requirements are properly planned for.
Typically when we think of a SAN we think of redundancy: lots of disk drives and redundant controllers with near-instant fail-over. After all, the only reason we put all of our eggs in one basket is that we know that if components in a SAN fail, we can keep right on going.
Traditionally, when a controller or the paths to a controller fail, built-in redundancy lets the servers using that storage simply start asking the other controller for access… fail-over takes place. Normally we will also see redundant links from each controller through two separate switches, with those two switches up-linking to each server. This gives us no single point of failure in a typical SAN solution such as an EMC VNX, HP P2000, or NetApp filer.
The VNXe is a little different. Instead of traditional block-level controllers that serve up iSCSI, EMC uses iSCSI emulation (think iSCSI target software on Linux or Windows) that runs like a Windows service on top of the controller's operating system. So when a controller is put into maintenance mode for a firmware upgrade, or when it simply fails, there is no graceful transition to the sister controller; it is just like a service that has stopped responding. After a short period of time, the sister controller starts that service on itself and storage I/O resumes. In my tests this fail-over process took approximately 2 minutes, during which a file transfer I was running was frozen, as was all other I/O in the affected VM. I will note, however, that the transfer did complete without error. For more information, go into the properties of one of your VMware datastores and click the “Manage Paths” button. On most SANs, if you are using the Round Robin NMP policy, you will see four paths: two that are “Active (I/O)” and two that are just “Active”.
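The arithmetic of service-style failover can be sketched with toy numbers. Every constant below is an assumption for illustration (the real detection interval, retry count, and service start time are internal to EMC's software and not published), but it shows how detection plus a cold service start adds up to minutes rather than seconds:

```python
# Hypothetical numbers, chosen only to illustrate the failover math.
HEARTBEAT_INTERVAL = 5      # seconds between peer health checks
MISSED_BEFORE_FAILOVER = 6  # failed checks before the peer takes over
SERVICE_START_TIME = 90     # seconds to start the iSCSI server on the peer

def estimate_outage_window():
    """Rough lower bound on the I/O outage: time to *detect* the dead
    controller plus time to *restart* its iSCSI server on the survivor."""
    detection = HEARTBEAT_INTERVAL * MISSED_BEFORE_FAILOVER
    return detection + SERVICE_START_TIME

print(estimate_outage_window())  # 120 seconds, in line with the ~2 minutes observed
```

With a traditional block controller there is no service to restart, which is why the detection-plus-restart term disappears and fail-over feels near-instant.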
In this picture you can tell that storage is viewable from each controller because in the target string you can see both 192.168.71.31 and 192.168.72.31 (which are on controller A) as well as 192.168.71.32 and 192.168.72.32 (which are on controller B).
However, if you look at a VNXe datastore you will see only two “Active (I/O)” paths and no others. The reason is that a normal SAN uses ALUA, which makes the LUN available on all ports, while the controller that owns the LUN presents the optimized ports. The VNXe only presents the LUN out the ports of the controller running the iSCSI server that owns it; the other controller has no information about those LUNs until the iSCSI Server service owning them is failed over to it.
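A toy model makes the consequence concrete. The path names and states below are illustrative (not real esxcli output), and this assumes nothing about NMP internals beyond the path states described above: with ALUA, the peer controller's plain “Active” paths can absorb I/O the moment the optimized paths die, while the VNXe-style presentation leaves nothing to fail over to until the iSCSI server itself moves:

```python
# Toy model of NMP path selection: ALUA array vs. a VNXe-style array
# where only the owning controller's ports present the LUN.

def usable_paths(paths):
    """Round-robin I/O uses the active-optimized paths; if none remain,
    NMP can fail over to any other path that still sees the LUN."""
    optimized = [p for p, s in paths.items() if s == "active_io"]
    standby = [p for p, s in paths.items() if s == "active"]
    return optimized or standby

# Four paths on the ALUA array, only two on the VNXe-style array.
alua = {"A0": "active_io", "A1": "active_io", "B0": "active", "B1": "active"}
vnxe = {"A0": "active_io", "A1": "active_io"}  # peer presents nothing

def fail_controller_a(paths):
    """Drop every path through controller A, as in the tests below."""
    return {p: s for p, s in paths.items() if not p.startswith("A")}

print(usable_paths(fail_controller_a(alua)))  # ['B0', 'B1'] -> I/O continues
print(usable_paths(fail_controller_a(vnxe)))  # [] -> all I/O stalls
```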
In this screenshot you can see that only Controller A (the owner) is presenting the storage because we only see its IP addresses here (192.168.71.33 and 192.168.71.34).
So the moral of the story is: if you are going to use a VNXe on your project, make sure that whatever workload you run on it can sustain a period of time where all resources on it are unavailable. For some companies this is no problem at all; as long as there is no data loss and the system recovers, things are fine. However, I can think of situations where 2 minutes of downtime would be a problem, specifically PLCs or other sensors that send real-time manufacturing data back to a database. Downtime of that length, with SQL services not responding, could in some cases cause that machinery to slow down or stop. So the bottom line is to know your workload and what requirements it has.
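One way to make “know your workload” concrete is to compare each application's stall tolerance against the roughly two-minute outage measured in this post. The workload names and tolerance figures below are made-up examples; substitute numbers from your own applications and guest-OS disk timeout settings:

```python
OBSERVED_FAILOVER_OUTAGE = 120  # seconds, the ~2 minutes seen in these tests

# Hypothetical stall tolerances in seconds; verify against your own apps.
workloads = {
    "nightly backup target": 600,
    "file share": 300,
    "PLC data collector": 10,
}

for name, tolerance in workloads.items():
    ok = tolerance >= OBSERVED_FAILOVER_OUTAGE
    print(f"{name}: {'tolerates' if ok else 'AT RISK during'} a failover")
```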
To test my suspicions I used a VNXe 3100 running the latest firmware and an HP P2000 G3 array, also on its latest firmware. I configured iSCSI as I normally would, with two separate subnets and switching fabrics; created LUNs on each array and presented them as VMware datastores; and finally built a Windows 2008 R2 server with enough space to conduct a fairly large file transfer. To make it obvious when no active I/O was happening, I moved two large ISO files back and forth between two locations in the VM.
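If you want to reproduce this kind of test without eyeballing a file-copy dialog, a small script can generate the I/O and timestamp each chunk; a failover then shows up as a long gap between consecutive samples. This is a generic sketch (the target path, sizes, and how you chart the samples are up to you), not the exact method I used:

```python
import os
import tempfile
import time

def log_write_throughput(path, total_mb=64, chunk_mb=1):
    """Write `total_mb` of zeroes to `path` in chunks, returning one
    (elapsed_seconds, cumulative_mb) sample per chunk. During a
    controller failover the elapsed time between samples balloons
    while the cumulative MB stands still."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    samples, start, written = [], time.monotonic(), 0
    with open(path, "wb") as f:
        while written < total_mb:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the write down to the datastore
            written += chunk_mb
            samples.append((time.monotonic() - start, written))
    return samples

# Example run; point `target` at a file on the datastore under test.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    target = tmp.name
samples = log_write_throughput(target, total_mb=8)
os.unlink(target)
print(f"{len(samples)} samples, last at {samples[-1][0]:.2f}s")
```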
First, let's look at the performance graph of the HP P2000 SAN.
In this picture you can see where I took the owning controller offline, denoted by “Controller offline”. After it goes offline there is a brief period where I/O is reduced (less than 20 seconds), and then the transfer rate jumps right back up to where it was before the controller went offline. The only difference was slightly higher latency, but the VM was still usable. To the far right of the picture you can also see where I brought the controller back online; for a split second I/O is interrupted, but it was almost undetectable, and if the graph hadn't shown the increased latency and downward spike I would not have noticed it. Overall the P2000 performed exactly how I expected it to, which is also how I would expect a VNX or other traditional array to behave.
Now let's look at the VNXe graph.
In this picture we have a lot to talk about. On the left side are two fairly normal-looking spikes: the first is where I transferred the ISO files from the NAS to the virtual server, and the second is where I transferred them between locations inside the VM, just to get an idea of what to expect under normal operations. The third red hump is where I started the move again; after a short time I put the owning controller into maintenance mode. This caused a huge jump in latency, which I expected, but it then took more than 2 minutes for the iSCSIserver00 service to fail over to the other controller. After it finally did, the transfer rate fell from about 40MB/s to about 8MB/s and stayed there through the completion of the move. Latency also hovered around 600ms, whereas the P2000 was in the 150ms range. After letting the move finish, I started it again and rebooted the offline controller, eventually bringing it back online. When it came online there was a spike of throughput for about 10 seconds, and then things returned to normal in the 40ish MB/s range.
The main concern I had was the amount of time needed to fail the iSCSI server process over to the sister controller, but I also have to mention that the VM was almost unusable while finishing up the transfer; it took a very long time even to open Internet Explorer.
NFS on VNXe
I also ran the same test against an NFS datastore, and when the owning controller is rebooted or taken offline for any reason the same thing happens. However, according to the performance graphs, the time things are offline drops to about 1 minute.
To sum things up, I want to reiterate that the VNXe is not a bad solution, but you do need to be aware of how it works so you can determine whether its quirks fit your business needs, or whether you should select something more traditional like an EMC VNX or another brand.
Think of it like this: given the choice of Goodyear run-flat tires or a spare tire in the trunk, which would you rather have? Either one will get you down the road until you get the problem fixed, but one involves a lot more time on the side of the road than the other.
Here are some other screenshots I took during the testing.