Don't keep the storagenode offline after a brief HDD reconnect, as it would result in DQ

I ran into the problem that my external HDD was reconnected by Ubuntu (reason still unknown; it might be the start of an HDD failure, an unreliable enclosure, or the power source). When that happens, the storagenode crashes and does not try to restart. Instead it waits for me to start it up again manually. This should not be necessary: with the --mount option, the storagenode will not start if the mountpoint is unavailable, so it cannot store any data inside the docker volume by mistake. After a brief reconnect of the HDD, the mountpoint is available again almost instantly and the node could simply run again.

So by the time I wake up 6 hours later or come back from work, my node could already be disqualified for downtime, even though the storagenode could simply have restarted once the mountpoint was available again. This adds more points of failure to an already very strict environment. Many people will also use RPis with external HDDs, so this could become a common problem in the long run (though I am of course just guessing).

I'm aware that an unreliable connector could cause DB corruption, in which case the node is gone anyway, but at least that would be due to a real error and not avoidable downtime.

 

However, since the crash happens at the system level, in the docker daemon, because the --mount source becomes unavailable, there is no option that can be set to make the node start again (according to a chat with Alexey). So I guess one would need another container, a "watchdog" container, that tries to start the crashed node again after a few seconds.
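To make the watchdog idea concrete, here is a rough sketch in Python. It is only an illustration, not something Storj provides. The container name "storagenode", the 10-second delay, and all function names are my own assumptions. It uses only the standard `docker inspect` and `docker start` commands, and it relies on the start attempt failing while the mountpoint is still missing, as described above.

```python
import subprocess
import time
from typing import Callable, List, Optional, Sequence, Tuple

CONTAINER = "storagenode"  # assumed container name, adjust to your setup
RETRY_SECONDS = 10         # assumed delay between checks

DockerFn = Callable[[Sequence[str]], Tuple[int, str]]


def run_docker(args: Sequence[str]) -> Tuple[int, str]:
    """Run a docker CLI command and return (exit code, stripped stdout)."""
    proc = subprocess.run(["docker", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def is_running(name: str, docker: DockerFn = run_docker) -> bool:
    """True if docker reports the container as running."""
    code, out = docker(["inspect", "-f", "{{.State.Running}}", name])
    return code == 0 and out == "true"


def check_once(name: str, docker: DockerFn = run_docker) -> str:
    """Do one watchdog pass: leave a running node alone, else try to start it."""
    if is_running(name, docker):
        return "running"
    # The start attempt is expected to fail while the mountpoint is missing,
    # so the watchdog simply tries again on the next pass.
    code, _ = docker(["start", name])
    return "restarted" if code == 0 else "failed"


def watch(name: str = CONTAINER,
          interval: float = RETRY_SECONDS,
          docker: DockerFn = run_docker,
          sleep: Callable[[float], None] = time.sleep,
          max_checks: Optional[int] = None) -> List[str]:
    """Check the node repeatedly; max_checks=None means run forever."""
    results: List[str] = []
    while max_checks is None or len(results) < max_checks:
        results.append(check_once(name, docker))
        sleep(interval)
    return results
```

Calling `watch()` on the host, or in a container that has access to the docker socket, would keep retrying every few seconds until the node is up again. This only covers the brief-reconnect case; if the HDD stays away, every start attempt keeps failing and the operator still has to step in.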

 

  • Guest
  • Aug 15 2019
  • Guest commented
    15 Aug 21:40

    This is of course not a solution for unreliable hardware and failing HDDs, but it gives the operator more time to react to an unreliable setup without the node being disqualified.