A challenging week

04.08.2011 By: Ryan Cardy

It has been a challenging week, helping one of our customers through a disaster recovery (DR) situation.

Tectrade got a phone call first thing on Wednesday morning: a tier of storage was offline and the vendor’s support had been unable to bring it back overnight. Could we help?

Following a couple of calls and a remote investigation of their environment, we had a clearer picture of the situation. An entire MDISK was offline on their SVC, which meant several critical servers were down, and to make matters worse it was month end the next day.

Tectrade supported the customer through this experience by liaising with the vendor directly, gathering the logs and information required, making suggestions and supplementing the customer’s skills with our extremely knowledgeable staff. Tectrade continued to work on the issue through the night, providing the vendor with all the information required to try to recover the data from the failed SVC MDISK.

In conjunction with this, we also worked with the customer to form alternative plans, and were soon involved in recovering the data from TSM backups.

As time passed it became increasingly clear that the issue was not going to be fixed quickly, and the DR work became the focus so that the business could be running in time for month end. Tectrade provided help and support over the weekend, assisting the customer through a difficult and stressful time. Following the recovery of the data, attention is now turning to Root Cause Analysis (RCA). It appears at this stage as though an Enterprise Class disk subsystem had a LUN taken offline due to a failed disk! However, it should be made clear that the firmware of all components is almost four years out of date. There is a very real chance (and the RCA will tell us) that the issues that caused this were identified and fixed in a later firmware release.

A couple of interesting points on firmware.

  • 60% of all “hardware” calls are actually fixed by updating the firmware
  • It is not just about bug fixes; firmware updates often improve reporting and information logging too. Resolving an issue can take longer on old firmware, because later revisions may write more efficient or more pertinent entries to the error logs. One of the longest parts of the analysis was the review of a 32MB error log containing millions of lines, which could well have been smaller and more targeted with a later firmware revision.
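
As a rough illustration of how ageing firmware might be spotted before it becomes a problem, here is a minimal Python sketch. It assumes you already keep an inventory of components and the release date of the firmware each one runs; the component names, dates and the one-year threshold below are all hypothetical and are not taken from the customer's environment.

```python
from datetime import date

# Hypothetical inventory: component name -> release date of installed firmware.
INVENTORY = {
    "svc-node-1": date(2007, 9, 1),
    "disk-subsystem-a": date(2007, 10, 15),
    "san-switch-1": date(2011, 3, 1),
}


def stale_firmware(inventory, today, max_age_days=365):
    """Return {component: age_in_days} for firmware older than max_age_days."""
    return {
        name: (today - released).days
        for name, released in inventory.items()
        if (today - released).days > max_age_days
    }


if __name__ == "__main__":
    # Report every component whose firmware is more than a year old.
    report = stale_firmware(INVENTORY, date(2011, 8, 4))
    for name, age in sorted(report.items()):
        print(f"{name}: firmware is {age} days old")
```

A check like this only tells you that firmware is old, not whether a newer release exists or is safe to apply; that still needs to be confirmed against the vendor's release notes.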

To find out more about how Tectrade can provide your organisation with support that includes pro-actively upgrading firmware levels each year, before they become a critical issue, please email info@tectrade.co.uk