LSI 95xx firmware causes total failure - Starline Computer: Storage and server solutions from experienced experts

Exactly our profession

Starline experts swing into action at full speed after an outage

Controller firmware causes total failure

A short story about a support incident that happened exactly as described here. To protect the customer, however, we have withheld the name.

It was around 9 p.m. on a Thursday evening when sales representative Thomas Heigl from the Enterprise Storage Solutions team received a call from a valued customer. It was long after closing time, but given the trusting relationship with this partner, he naturally answered the phone.

The distraught customer reported a failure in a critically important system: after a shutdown, the production server could no longer find the SSDs in the RAID set. In effect, 100 percent data loss. A disaster for the customer, for the system house, and for Starline as well. After all, our ambition is to always provide the best possible support for the systems we deliver.

Know-how

As an experienced project manager, Thomas immediately realized that it was highly unlikely that the damaged RAID set had been caused by the simultaneous failure of multiple drives. The failure had to have occurred at a more central point.

Patrick Weber, the Starline support technician who had been hastily called in, soon gave the all-clear: judging by the error pattern, he was sure that the drives themselves could not be defective. He expected that the failed SSDs would switch back to "online" and that the RAID configuration could then be re-imported.

After consulting several support experts from the companies involved and discussing other possible solutions, the decision was made in favor of Patrick's plan.

Lo and behold, the SSDs that had been marked as Failed came back online immediately. The hard disks did not respond as quickly, so one of the hot spares had to step in. Half an hour later, however, all RAID sets were running again; the DataCore virtualization followed, and then the VMs booted.

By 11 p.m., it was clear that there had been no data loss and that the entire system was working normally again. Last but not least, the happy customer thanked the Starliners for their commitment at such an unusual hour.

Shutdown

Conclusion: Support good, all good.

As befits a responsible distributor, Starline has of course informed every customer who, according to their order history, uses these controllers or HBAs about the problem, and has supported them in resolving it.

Epilogue

Starline Forensics was eventually able to reproduce the error on a test system: the Broadcom/LSI RAID controller responsible (and this applies to all MegaRAID 95XX controllers as well as HBAs based on this chip) could lose its configuration during a prolonged shutdown. The apparent cause is a mismatch between the firmware of the controller and that of the PSOC (Programmable System on Chip).

As a workaround with the old firmware, systems with these controllers or HBAs should never be shut down completely. Since this is not practical in most environments, affected systems should be updated to new firmware as a precaution.

Broadcom recommends updating the firmware for MegaRAID controllers to at least 7.24, HBAs to at least P26, and the PSOC itself to version 1.25.

In addition, the profile must be changed on MegaRAID controllers if only SAS/SATA storage media are used (NVMe/SAS/SATA ProfileID, the default: 30; SAS/SATA ProfileID: 32).
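
For administrators who keep an inventory of installed firmware levels, the recommendations above can be turned into a quick check. The following is only a minimal sketch in Python. It assumes that version strings are written the way they appear in this article ("7.24", "P26", "1.25"), that the PSOC recommendation applies to both MegaRAID controllers and HBAs, and that 1.25 is a minimum. All function names are hypothetical, and the sketch does not query any controller itself.

```python
import re

# Minimum versions as quoted in the article above.
MIN_MEGARAID_FW = (7, 24)
MIN_HBA_PHASE = 26
MIN_PSOC = (1, 25)

# Profile IDs as quoted in the article above.
PROFILE_NVME_SAS_SATA = 30  # default profile
PROFILE_SAS_SATA_ONLY = 32


def parse_dotted(version):
    """Turn a string such as '7.24' into a tuple of integers, e.g. (7, 24).

    Assumption: the string starts with dot-separated numeric fields.
    """
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def parse_phase(version):
    """Turn an HBA phase string such as 'P26' into the integer 26."""
    match = re.match(r"[Pp](\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized phase string: {version!r}")
    return int(match.group(1))


def megaraid_needs_update(firmware, psoc):
    """True if a MegaRAID controller is below either quoted minimum."""
    return parse_dotted(firmware) < MIN_MEGARAID_FW or parse_dotted(psoc) < MIN_PSOC


def hba_needs_update(phase, psoc):
    """True if an HBA is below either quoted minimum."""
    return parse_phase(phase) < MIN_HBA_PHASE or parse_dotted(psoc) < MIN_PSOC


def recommended_profile(uses_nvme):
    """Pick the ProfileID quoted in the article for the given media mix."""
    return PROFILE_NVME_SAS_SATA if uses_nvme else PROFILE_SAS_SATA_ONLY
```

For example, `megaraid_needs_update("7.23", "1.25")` flags the controller for an update, while `recommended_profile(uses_nvme=False)` yields the SAS/SATA-only ProfileID 32. The version strings themselves would have to be read from the vendor's management tools and may be formatted differently in practice.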

Service advantages

Benefit from storage veterans

Loyal partner

Our long-standing business relationships prove our loyalty to customers and suppliers.

Highly Experienced

Starline has been active in the sector since 1982 and is a specialist in all aspects of data storage. 

Full Service

On-site service or on-site installation can be booked - on request also within four hours (24 x 7 x 4).

Any questions?

Contact us!

Thomas Heigl
Sales

Project Manager from the Enterprise Storage Solutions Team - our specialists for large projects.