Last week I worked with HP on a customer SAN where two nodes had rebooted on their own, without any warning. Because the customer had another node and a Failover Manager, all of their volumes stayed online, but after the second node rebooted we wanted to know what was going on. I called HP, and they had me check the nodes for a specific patch, 20031-02. These nodes did not have it. Without that patch, nodes have a known issue where they can crash and reboot once they have been online for 208.5 days. To fix this you can either upgrade to SAN/iQ 9.5 or apply that specific patch. Either way a reboot is required, but you will definitely want to do this before you get to the 208.5-day uptime mark 🙂
Here is a link to the official HP Advisory release:
At some point after 208.5 days of continuous runtime, a counter in the SAN/iQ Linux kernel may incur a divide-by-zero error that leads to a kernel panic, which causes HP P4000 storage systems running SAN/iQ software version 9.0 or 9.0.01 to go offline immediately. After ten minutes, the HP P4300 G2, HP P4500 G2, HP P4800 G2 and HP LeftHand DL320s storage systems will perform an Automatic Server Recovery (ASR) followed by resumed operations. Other storage systems running SAN/iQ version 9.0 or 9.0.01 will hang indefinitely until manually rebooted.
This issue may occur on any HP P4000 SAN Solution with storage systems running SAN/iQ software version 9.0 or 9.0.01.
This issue may be resolved by upgrading to SAN/iQ software version 9.5 and later. For customers continuing to run SAN/iQ version 9.0 or 9.0.01, Patch 20031-02 is available to address this error.
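If you want to keep an eye on how close your nodes are to the 208.5-day mark while you schedule the upgrade or patch, a small script along these lines might help. This is only a rough sketch: the node names and uptime values are made up, the 30-day warning window is my own arbitrary choice, and it assumes you have already collected each node's uptime in seconds by whatever means your environment allows. It does not talk to SAN/iQ itself.

```python
from datetime import timedelta

# Threshold taken from the HP advisory: 208.5 days of continuous runtime.
THRESHOLD = timedelta(days=208.5)


def time_until_threshold(uptime_seconds: float) -> timedelta:
    """Return the time left before a node reaches 208.5 days of uptime.

    A negative result means the node is already past the threshold.
    """
    return THRESHOLD - timedelta(seconds=uptime_seconds)


def needs_attention(uptime_seconds: float, warn_days: float = 30.0) -> bool:
    """True if the node is within warn_days of the threshold (or past it).

    warn_days is an arbitrary safety margin, not something HP specifies.
    """
    return time_until_threshold(uptime_seconds) <= timedelta(days=warn_days)


if __name__ == "__main__":
    # Hypothetical nodes and uptimes in seconds; replace with your own data.
    nodes = {"node-a": 200 * 86400, "node-b": 45 * 86400}
    for name, uptime in nodes.items():
        remaining = time_until_threshold(uptime)
        status = "patch soon" if needs_attention(uptime) else "ok"
        print(f"{name}: {remaining.days} full days remaining ({status})")
```

Anything flagged here is a candidate for a planned reboot with the patch or the 9.5 upgrade, rather than waiting for the node to pick its own moment.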