Incident Report Assignment SAW AUNG THU HEIN


BAN BANK IT Outages Incident Report

1. Reporting Staff’s Particulars

Name of institution: BAN BANK


Name of reporting staff: One Solutions Pte Ltd Monitoring Team staff
Designation: Network Operation Centre Monitoring
Date: 5 July 2019

2. Executive Summary

Date, start time, end time: 3 July 2019, 11.06am (first alert). The service
outage began at 2.58am on 5 July 2019, and services were restored
progressively from 10am to 12.30pm the same day.

Nature of incident (System Malfunction / IT Security Incident / Others*):
System Malfunction. Instability in the communications link of the storage
subsystem resulted in the service outage on 5 July 2019.

Cause of incident: Problem with the communications link cable and associated
electronic cards in the storage subsystem.

Incident description: One Solutions Pte Ltd determined that a repeated failure
to apply the correct procedure when addressing instability in the
communications link of the storage subsystem resulted in the service outage on
5 July 2019. One Solutions Pte Ltd’s immediate priority was to ensure that
customer data was not in any way compromised while services were being
restored as quickly as possible. BAN BANK's services were restored the same
morning with full and complete data integrity.

3. Impact of Incident

a. System / Operational Impact

Critical systems impacted: Communication link of the storage subsystem

Affected business / operations: BAN BANK banking services were disrupted due to
this incident. The storage system ceased communicating to protect the data.

b. Others

Describe other possible implications, e.g., financial and legal, that the incident will / may have on
the institution.

As data integrity is considered a higher priority than availability, the storage system is
designed to automatically cease communicating under these conditions. In doing so, the
system preserved full data integrity.
Despite the machine’s high availability and redundancy, the incorrect repair
procedures applied to the communications link caused the outage.

Chronology of Events

One Solutions Pte Ltd determined that a repeated failure to apply the correct procedure
when addressing instability in the communications link of the storage subsystem resulted
in the service outage on 5 July 2019.
One Solutions Pte Ltd’s immediate priority was to ensure that customer data was not in
any way compromised while services were being restored as quickly as possible. BAN
BANK's services were restored the same morning with full and complete data integrity.
Prior to the outage, the following events took place:

Time (hr/min) Action Details


3 July 2019, 11.06am
One Solutions Pte Ltd software monitoring tools sent an alert message to
One Solutions Pte Ltd’s Asia Pacific support centre located outside of
Singapore. It indicated there was instability in a communications link in the
storage system which was connected to a mainframe. At this point, the
storage system was functioning. A One Solutions Pte Ltd field engineer was
despatched to the BAN BANK data centre and was given approval by BAN
BANK to repair the machine.

3 July 2019, 7.50pm
The cable in question was replaced. The One Solutions Pte Ltd field
engineer did not use the machine’s maintenance interface but used the
instructions given by the support centre. Although this was done using an
incorrect step, the error message ceased. The storage system was still
functioning.

4 July 2019, 2.55pm
The error message reappeared. This time, it indicated instability in the cable
and associated electronic cards. The One Solutions Pte Ltd field engineer
was despatched for the second time to the data centre. He diagnosed and
escalated the issue to the regional One Solutions Pte Ltd support centre.

4 July 2019, 5.16pm
Based on instructions from the regional One Solutions Pte Ltd support
centre, the cable was removed for inspection and reseated, using the same
incorrect step. The error message ceased. The storage system continued
functioning.

4 July 2019, 6.14pm
The error message reappeared. Over the next five hours and 22 minutes, the
regional One Solutions Pte Ltd support centre analysed the log from the
machine and recommended to the field engineer that he unplug the cable
and check for a bent pin. The storage system continued functioning.

4 July 2019, 11.38pm
The One Solutions Pte Ltd field engineer did not find a bent pin and
reseated the cable. The error message persisted. The storage system was
still functioning and able to communicate with the mainframe. The
regional One Solutions Pte Ltd support centre and the One Solutions Pte
Ltd field engineer continued diagnosing the issue, including reseating the
cable for a second time.
Subsequently, BAN BANK was contacted and authorised a cable change
at 2.50am, a quiet period, which is standard operating procedure. While
waiting to replace the cable, the One Solutions Pte Ltd field engineer
decided to inspect the cable again to ensure that it was not defective and that
it was installed properly. He then unplugged the cable for inspection using
the previous incorrect procedure recommended by the regional One
Solutions Pte Ltd support centre.

5 July 2019, 2.58am
The cable was replaced using the same procedures. This caused errors that
threatened data integrity. As a result, the storage system ceased
communicating in order to protect the data.
At this point, BAN BANK banking services were disrupted.
If the correct procedures had been used, the storage system would have
automatically suspended the communications link and the machine would
have instructed the engineer to replace the cable and both cards together
and maintain redundancy of the system.
As data integrity is considered a higher priority than availability, the
storage system is designed to automatically cease communicating under
these conditions. In doing so, the system preserved full data integrity.

In spite of the machine’s high availability and redundancy, these incorrect
procedures caused the outage.

4. Detailed Root-Cause Analysis and Recovery Actions


Immediately after the outage occurred, One Solutions Pte Ltd informed BAN
BANK, and an onsite technical command function comprising BAN BANK and
One Solutions Pte Ltd staff was activated by 3.40am.
The immediate priority was to ensure that customer data was not in any way
compromised while restoring services as quickly as possible.
This process required time to ensure that data integrity was maintained. This
included careful efforts to reconcile data in the cache and disk within the storage
subsystem.
A restart of the systems was initiated at 5.20am. Banking services were restored
progressively from 10am to 12.30pm on 5 July 2019.
One Solutions Pte Ltd restored the system with full data integrity.
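
The recovery timings above can be checked with a short calculation. The
following is a minimal Python sketch, not part of the original report: the
timestamps are taken from this report, while the names milestones,
elapsed_since_outage and fmt are hypothetical and chosen only for
illustration. It measures each recovery milestone from the moment the storage
system ceased communicating (2.58am on 5 July 2019).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps as stated in this report; the labels are paraphrased.
milestones = [
    ("storage system ceased communicating", "2019-07-05 02:58"),
    ("onsite technical command function activated (by)", "2019-07-05 03:40"),
    ("restart of the systems initiated", "2019-07-05 05:20"),
    ("progressive service restoration began", "2019-07-05 10:00"),
    ("progressive service restoration ended", "2019-07-05 12:30"),
]

def elapsed_since_outage(events):
    """Return (label, whole minutes since the first event) for each event."""
    start = datetime.strptime(events[0][1], FMT)
    result = []
    for label, stamp in events:
        delta = datetime.strptime(stamp, FMT) - start
        result.append((label, int(delta.total_seconds() // 60)))
    return result

def fmt(minutes):
    """Format a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"

timeline = elapsed_since_outage(milestones)
for label, minutes in timeline:
    print(f"+{fmt(minutes)}  {label}")
```

On these figures, the restart began 2 hours 22 minutes after the outage and
the last services were restored 9 hours 32 minutes after it.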

5. Conclusion

Preventive actions for future incidents: Ms Smita Tan, Regional General
Manager, One Solutions Pte Ltd ASEAN, once again apologised to BAN BANK and
its customers for the inconvenience caused by this incident. She said that the
corrective and preventive actions taken were of the highest priority for One
Solutions Pte Ltd. One Solutions Pte Ltd has also taken steps to review
installations of the same storage system at other financial institutions in
Singapore for which it currently provides maintenance services.

Is the problem resolved? Yes

Declaration

1. I declare that all information given in this report and in the attached annexes (if any) is
true and accurate.

Signature

Name of Approver

Date
