Hotspare versus coldspare: how to save time and nerves when working with RAID systems

Advantages and disadvantages of a “hot” replacement hard disk in RAID

Why do security-conscious administrators prefer a cold spare disk safely stowed in the cabinet to a hot spare disk in the RAID?

What is a hot spare anyway?

A hot spare is a hard disk installed in a RAID system that is not part of any RAID set or group. It is called "hot" because, just like the active disks, it is supplied with power as soon as the RAID system is switched on, so its spindle motor is running as well. The advantage: the replacement disk can step in instantly if a production HDD fails, and the rebuild starts immediately. Quite a few admins nevertheless object to this convenient feature. Why is that?

A case study from practice

Suppose we have a RAID system with 16 disks. The administrator has combined 15 of these disks into a RAID 5 and set up the 16th disk as a hot spare. If a disk in the RAID set now fails, the hot spare steps in immediately, as expected, and the rebuild begins. The stumbling block: to restore a single block on the spare disk, the RAID controller must read the corresponding blocks on all 14 remaining disks of the RAID set and reconstruct the contents of the defective drive from them. To restore every block, the system therefore has to read every block on every other disk.
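The reconstruction step can be illustrated with a minimal sketch. It assumes plain XOR parity over one stripe and ignores details of real controllers such as rotating parity placement; the names `xor_blocks` and `rebuild_block` are made up for this example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks, position by position."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild_block(stripe, failed_index):
    """Recompute the block of the failed disk from all surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# One stripe of the 15-disk RAID 5: 14 data blocks plus 1 parity block.
data_blocks = [bytes([n, n + 1, n + 2, n + 3]) for n in range(14)]
stripe = data_blocks + [xor_blocks(data_blocks)]

# Disk 5 fails; its block is rebuilt from the other 14 blocks of the stripe.
rebuilt = rebuild_block(stripe, failed_index=5)
print(rebuilt == stripe[5])
```

Note that `rebuild_block` needs every one of the 14 surviving blocks. If a single one of them cannot be read, this stripe cannot be reconstructed.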

And here's the rub. In normal operation, there are always areas on the disks that have not been read for a long time. These can be, for example, stored invoices, archive files or photos from previous years. If unreadable blocks have crept into precisely these areas unnoticed, the rebuild runs into timeouts and sometimes further drive failures. A system that is inconspicuous in normal operation can therefore fail for good during the rebuild. That would be the worst case: in a RAID 5, after all, a second drive must not fail, otherwise there is a risk of data loss.
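A rough back-of-the-envelope sketch shows why reading everything is so much riskier than everyday access. All figures below are hypothetical illustration values, not measurements or datasheet numbers, and the model simply assumes that each block is unreadable independently with the same small probability.

```python
import math


def p_hit_unreadable(blocks_read, p_block_unreadable):
    """Probability of meeting at least one unreadable block.

    Computes 1 - (1 - p)**n in a numerically stable way.
    """
    return -math.expm1(blocks_read * math.log1p(-p_block_unreadable))


# Hypothetical values, chosen only for illustration.
BLOCKS_PER_DISK = 1_000_000
P_BLOCK_UNREADABLE = 1e-8
SURVIVING_DISKS = 14

rebuild_reads = SURVIVING_DISKS * BLOCKS_PER_DISK  # rebuild reads everything
typical_day_reads = rebuild_reads // 100           # assume 1% is read per day

p_rebuild = p_hit_unreadable(rebuild_reads, P_BLOCK_UNREADABLE)
p_typical_day = p_hit_unreadable(typical_day_reads, P_BLOCK_UNREADABLE)

print(f"typical day: {p_typical_day:.4%}")
print(f"rebuild:     {p_rebuild:.4%}")
```

Under these assumptions the rebuild is far more likely to hit a bad block than a typical day of operation, simply because it reads every block, including the ones nobody has touched in years.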

So how do security-conscious administrators handle it?

They keep a spare disk in the cabinet as a cold spare. This way, admins have an HDD at hand immediately in the event of a hard drive failure. Before they replace the failed HDD with the spare disk, though, they check whether a backup exists of all the data on the RAID set. If not, they immediately back up the missing data. Since this usually involves only recently changed data, the backup reads just a small part of the disks, and the likelihood of another disk failure during the backup is no higher than in normal operation.
Only after the backup is complete does the admin use the spare disk to start the rebuild. Should the scenario described above occur, at least no data will be lost, as it can be restored from the backup just made.

Pro tip
Always enable regular check reads on RAID systems, if the feature is available. The controller then reads all blocks on all drives in the RAID set at a fixed interval, for example every two weeks, and verifies their parity information at the same time. This allows you to detect faulty blocks at an early stage and to replace the affected drives in a planned manner.

This check read, also known as a parity check or scrubbing, is offered by the following products from our range: RAID controllers from Areca, ATTO and Broadcom/LSI, RAIDdeluxe systems and Infortrend storage systems.
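What such a check read does can be sketched in a few lines. This is a simplified model under stated assumptions, not how any of the controllers named above implement it: stripes are plain lists of byte blocks with XOR parity, an unreadable block is represented as `None`, and the names `scrub` and `make_stripe` are made up for this example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks, position by position."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Append the XOR parity block to a list of data blocks."""
    return data_blocks + [xor_blocks(data_blocks)]


def scrub(stripes):
    """Read every block of every stripe and report problems.

    Returns two lists of stripe numbers: stripes containing an
    unreadable block (None) and stripes whose parity does not match.
    """
    unreadable, mismatched = [], []
    for stripe_no, blocks in enumerate(stripes):
        if any(block is None for block in blocks):
            unreadable.append(stripe_no)
        elif any(xor_blocks(blocks)):
            # Data XOR parity must be all zero bytes in a healthy stripe.
            mismatched.append(stripe_no)
    return unreadable, mismatched


good = make_stripe([b"\x01\x02", b"\x03\x04", b"\x05\x06"])
latent = list(good)
latent[1] = None          # block that can no longer be read
corrupt = list(good)
corrupt[0] = b"\xff\x02"  # block whose contents silently changed

print(scrub([good, latent, corrupt]))  # ([1], [2])
```

Run regularly, a pass like this finds the bad block while all other drives are still healthy, which is exactly the situation in which it can still be repaired without risk.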

Conclusion: Hotspare versus Coldspare

The convenience of a hot spare, i.e. a rebuild that starts automatically as soon as a disk fails, comes at the price of a higher risk of complete data loss. After all, the probability of a disk failure is significantly higher under stress, such as during a rebuild, than in normal operation. Keep this in mind in your daily work. We therefore generally recommend RAID 6 or RAID-Z2 for a stable drive array, since both tolerate the failure of a second drive.

Starline contact

Any questions? Please contact us.

Konrad Beyer has been with us since 2006 and has made a name for himself as an expert in network and IT security. His expertise covers almost every topic: from operating systems (Windows, Linux, macOS and VMware) to special fields (FC, iSCSI, tape and NAS) to the product lines of Infortrend and Tiger Technology. As a human firewall, he is committed to ensuring that our internal systems run securely and that no malware infects our intranet. The hobby sailor also has the Starline telephone system firmly under control.

Konrad Beyer
Technical Support