SVC Bkmap Svctrblshoot
Troubleshooting Guide
IBM
GC27-2284-09
Note
Before using this information and the product it supports, read the information in Notices on page 359.
This edition applies to IBM SAN Volume Controller, Version 7.6, and to all subsequent releases and modifications
until otherwise indicated in new editions.
Copyright IBM Corporation 2003, 2015.
US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents
Figures . . . vii
Tables . . . ix
About this guide . . . xi
Who should use this guide . . . xi
Emphasis . . . xi
SAN Volume Controller library and related publications . . . xi
How to order IBM publications . . . xiv
Related websites . . . xiv
Sending your comments . . . xv
How to get information, help, and technical assistance . . . xv
Summary of changes for GC27-2284-07 SAN Volume Controller Troubleshooting Guide . . . xvii
Summary of changes for GC27-2284-06 SAN Volume Controller Troubleshooting Guide . . . xvii
Chapter 1. SAN Volume Controller overview . . . 1
Systems . . . 11
Configuration node . . . 11
Configuration node addressing . . . 12
Management IP failover . . . 12
SAN fabric overview . . . 14
Chapter 2. Introducing the SAN Volume Controller hardware components . . . 15
SAN Volume Controller nodes . . . 15
SAN Volume Controller controls and indicators . . . 15
SAN Volume Controller operator-information panel . . . 20
SAN Volume Controller rear-panel indicators and connectors . . . 24
Fibre Channel port numbers and worldwide port names . . . 35
Requirements for the SAN Volume Controller environment . . . 35
Redundant AC-power switch . . . 42
Redundant AC-power environment requirements . . . 43
Cabling of redundant AC-power switch (example) . . . 44
Uninterruptible power supply . . . 47
2145 UPS-1U . . . 48
Uninterruptible power-supply environment requirements . . . 52
Defining the SAN Volume Controller FRUs . . . 53
SAN Volume Controller FRUs . . . 53
Redundant AC-power switch FRUs . . . 58
Chapter 3. SAN Volume Controller user interfaces for servicing your system . . . 59
Management GUI interface . . . 59
When to use the management GUI . . . 60
Accessing the management GUI . . . 60
Deleting a node from a clustered system using the management GUI . . . 61
Adding nodes to a system . . . 63
Service assistant interface . . . 66
When to use the service assistant . . . 67
Accessing the service assistant . . . 67
Command-line interface . . . 68
When to use the CLI . . . 68
Accessing the system CLI . . . 68
Service command-line interface . . . 68
When to use the service CLI . . . 68
Accessing the service CLI . . . 69
USB flash drive interface . . . 69
Technician port for node access . . . 75
Front panel interface . . . 76
Chapter 4. Performing recovery actions using the SAN Volume Controller CLI . . . 77
Validating and repairing mirrored volume copies using the CLI . . . 77
Repairing a thin-provisioned volume using the CLI . . . 78
Recovering offline volumes using the CLI . . . 79
Chapter 5. Viewing the vital product data . . . 81
Downloading the vital product data using the management GUI . . . 81
Displaying the vital product data using the CLI . . . 81
Displaying node properties by using the CLI . . . 81
Displaying clustered system properties using the CLI . . . 82
Fields for the node VPD . . . 83
Fields for the system VPD . . . 88
Chapter 6. Using the front panel of the SAN Volume Controller . . . 91
Boot progress indicator . . . 91
Boot failed . . . 91
Charging . . . 92
Error codes . . . 92
Hardware boot . . . 92
Node rescue request . . . 93
Power failure . . . 93
Powering off . . . 93
Recovering . . . 94
Restarting . . . 94
Shutting down . . . 94
Validate WWNN? option . . . 95
SAN Volume Controller menu options . . . 96
Cluster (system) options . . . 98
Node options . . . 100
Version options . . . 100
Figures
1. SAN Volume Controller system in a fabric . . . 2
2. Data flow in a SAN Volume Controller system . . . 3
3. Example of a basic volume . . . 4
4. Example of mirrored volumes . . . 4
5. Example of stretched volumes . . . 5
6. Example of HyperSwap volumes . . . 6
7. Example of a standard system topology . . . 7
8. Example of a stretched system topology . . . 7
9. Example of a HyperSwap system topology . . . 8
10. SAN Volume Controller nodes with internal flash drives . . . 10
11. Configuration node . . . 12
12. SAN Volume Controller 2145-DH8 front panel . . . 16
13. SAN Volume Controller 2145-CG8 front panel . . . 18
14. SAN Volume Controller 2145-CF8 front panel . . . 18
15. SAN Volume Controller 2145-DH8 operator-information panel . . . 21
16. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel . . . 21
17. SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel . . . 22
18. SAN Volume Controller 2145-DH8 rear-panel indicators . . . 24
19. Connectors on the rear of the SAN Volume Controller 2145-DH8 . . . 25
20. Power connector . . . 25
21. SAN Volume Controller 2145-DH8 service ports . . . 26
22. SAN Volume Controller 2145-DH8 unused Ethernet port . . . 26
23. SAN Volume Controller 2145-CG8 rear-panel indicators . . . 27
24. SAN Volume Controller 2145-CG8 rear-panel indicators for the 10 Gbps Ethernet feature . . . 27
25. Connectors on the rear of the SAN Volume Controller 2145-CG8 . . . 27
26. 10 Gbps Ethernet ports on the rear of the SAN Volume Controller 2145-CG8 . . . 28
27. Power connector . . . 28
28. Service ports of the SAN Volume Controller 2145-CG8 . . . 29
29. SAN Volume Controller 2145-CG8 port not used . . . 29
30. SAN Volume Controller 2145-CF8 rear-panel indicators . . . 30
31. Connectors on the rear of the SAN Volume Controller 2145-CG8 or 2145-CF8 . . . 30
32. Power connector . . . 31
33. Service ports of the SAN Volume Controller 2145-CF8 . . . 31
34. SAN Volume Controller 2145-CF8 port not used . . . 31
35. SAN Volume Controller 2145-DH8 AC, DC, and power-error LEDs . . . 34
36. SAN Volume Controller 2145-CG8 or 2145-CF8 AC, DC, and power-error LEDs . . . 34
37. Photo of the redundant AC-power switch . . . 43
38. A four-node SAN Volume Controller system with the redundant AC-power switch feature . . . 45
39. Rack cabling example . . . 47
40. 2145 UPS-1U front-panel assembly . . . 49
41. 2145 UPS-1U connectors and switches . . . 51
42. 2145 UPS-1U dip switches . . . 52
43. Ports not used by the 2145 UPS-1U . . . 52
44. Power connector . . . 52
45. SAN Volume Controller front-panel assembly . . . 91
46. Example of a boot progress display . . . 91
47. Example of an error code for a clustered system . . . 92
48. Example of a node error code . . . 92
49. Node rescue display . . . 93
50. Validate WWNN? navigation . . . 95
51. SAN Volume Controller options on the front-panel display . . . 97
52. Viewing the IPv6 address on the front-panel display . . . 100
53. Upper options of the actions menu on the front panel . . . 104
54. Middle options of the actions menu on the front panel . . . 105
55. Lower options of the actions menu on the front panel . . . 106
56. Language? navigation . . . 116
57. Example of inventory information email . . . 137
58. Example of a boot error code . . . 159
59. Example of a boot progress display . . . 160
60. Example of a displayed node error code . . . 160
61. Example of a node-rescue error code . . . 161
62. Example of a create error code for a clustered system . . . 161
63. Example of a recovery error code . . . 162
64. Example of an error code for a clustered system . . . 162
65. Node rescue display . . . 270
66. Error LED on the SAN Volume Controller models . . . 277
67. SAN Volume Controller 2145-DH8 operator-information panel . . . 278
68. Hardware boot display . . . 278
69. SAN Volume Controller 2145-DH8 front panel . . . 279
70. Power LED on the SAN Volume Controller 2145-DH8 . . . 284
71. Power LED indicator on the rear panel of the SAN Volume Controller 2145-DH8 . . . 285
72. AC, DC, and power-supply error LED indicators on the rear panel of the SAN Volume Controller 2145-DH8 . . . 286
73. Power LED on the SAN Volume Controller models 2145-CG8 or 2145-CF8 operator-information panel . . . 289
The chapters that follow introduce you to the SAN Volume Controller, the expansion
enclosure, the redundant AC-power switch, and the uninterruptible power supply.
They describe how you can configure and check the status of one SAN Volume
Controller node or a clustered system of nodes through the front panel, the
service assistant GUI, or the management GUI.
The vital product data (VPD) chapter provides information about the VPD that
uniquely defines each hardware and microcode element that is in the SAN Volume
Controller. You can also learn how to diagnose problems using the SAN Volume
Controller.
The maintenance analysis procedures (MAPs) can help you analyze failures that
occur in a SAN Volume Controller. With the MAPs, you can isolate the
field-replaceable units (FRUs) of the SAN Volume Controller that fail. Begin all
problem determination and repair procedures from "MAP 5000: Start" on page 275.
Emphasis
Different typefaces are used in this guide to show emphasis.
The information collection in the IBM Knowledge Center contains all of the
information that is required to install, configure, and manage the system. The
information collection in the IBM Knowledge Center is updated between product
releases to provide the most current documentation. The information collection is
available at the following website:
publib.boulder.ibm.com/infocenter/svc/ic/index.jsp
Unless otherwise noted, the publications in the library are available in Adobe
portable document format (PDF) from the following website:
www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
Click Search for publications to find the online publications you are interested in,
and then view or download the publication by clicking the appropriate item.
Table 1 lists websites where you can find help, services, and more information.
Table 1. IBM websites for help, services, and information
Website | Address
Directory of worldwide contacts | http://www.ibm.com/planetwide
Support for SAN Volume Controller (2145) | www.ibm.com/storage/support/2145
Support for IBM System Storage and IBM TotalStorage products | www.ibm.com/storage/support/
Each of the PDF publications in the library that is listed in Table 2 is also
available in the IBM Knowledge Center by clicking the number in the Order number column:
Table 2. SAN Volume Controller library
Title | Description | Order number
IBM SAN Volume Controller Model 2145-DH8 Hardware Installation Guide | The guide provides the instructions that the IBM service representative uses to install the hardware for SAN Volume Controller model 2145-DH8. |
IBM System Storage SAN Volume Controller Model 2145-CG8 Hardware Installation Guide | The guide provides the instructions that the IBM service representative uses to install the hardware for SAN Volume Controller model 2145-CG8. |
Table 3 lists an IBM publication that contains information that is related to SAN
Volume Controller.
Table 3. Other IBM publications
Title | Description | Order number
IBM System Storage Multipath Subsystem Device Driver User's Guide | The guide describes the IBM System Storage Multipath Subsystem Device Driver for IBM System Storage products and how to use it with the SAN Volume Controller. | GC52-1309
Table 4 lists websites that provide publications and other information about the
SAN Volume Controller or related products or technologies. The IBM Redbooks
publications provide positioning and value guidance, installation and
implementation experiences, solution scenarios, and step-by-step procedures for
various products.
Table 4. IBM documentation and related websites
Website | Address
IBM Publications Center | www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
IBM Redbooks publications | www.redbooks.ibm.com/
To view a PDF file, you need Adobe Reader, which can be downloaded from the
Adobe website:
www.adobe.com/support/downloads/main.html
The IBM Publications Center offers customized search functions to help you find
the publications that you need. Some publications are available for you to view or
download at no charge. You can also order publications. The publications center
displays prices in your local currency. You can access the IBM Publications Center
through the following website:
www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
Related websites
The following websites provide information about SAN Volume Controller or
related products or technologies:
To submit any comments about this book or any other SAN Volume Controller
documentation:
- Send your comments by email to starpubs@us.ibm.com. Include the following
  information for this publication, or use suitable replacements for the title and
  form number of the publication on which you are commenting:
  - Publication title: IBM SAN Volume Controller Troubleshooting Guide
  - Publication form number: GC27-2284-07
  - Page, table, or illustration numbers that you are commenting on
  - A detailed description of any information that should be changed
Information
IBM maintains pages on the web where you can get information about IBM
products and fee services, product implementation and usage assistance, break and
fix service support, and the latest technical information. For more information,
refer to Table 5.
Table 5. IBM websites for help, services, and information
Website | Address
Directory of worldwide contacts | http://www.ibm.com/planetwide
Support for SAN Volume Controller (2145) | www.ibm.com/storage/support/2145
Support for IBM System Storage and IBM TotalStorage products | www.ibm.com/storage/support/
Note: Available services, telephone numbers, and web links are subject to change
without notice.
Before calling for support, be sure to have your IBM Customer Number available.
If you are in the US or Canada, you can call 1 (800) IBM SERV for help and
service. From other parts of the world, see http://www.ibm.com/planetwide for
the number that you can call.
If you call from somewhere other than the US or Canada, you must choose the
software or hardware option when calling for assistance. Choose the software
option if you are uncertain whether the problem involves the SAN Volume Controller
software or hardware. Choose the hardware option only if you are certain the
problem solely involves the SAN Volume Controller hardware. When calling IBM
for service regarding the product, follow these guidelines for the software and
hardware options:
Software option
Identify the SAN Volume Controller product as your product and supply
your customer number as proof of purchase. The customer number is a
7-digit number (0000000 to 9999999) assigned by IBM when the product is
purchased. Your customer number should be located on the customer
information worksheet or on the invoice from your storage purchase. If
asked for an operating system, use Storage.
Hardware option
Provide the serial number and appropriate 4-digit machine type. For SAN
Volume Controller, the machine type is 2145.
In the US and Canada, hardware service and support can be extended to 24x7 on
the same day. The base warranty is 9x5 on the next business day.
You can find information about products, solutions, partners, and support on the
IBM website.
To find up-to-date information about products, services, and partners, visit the IBM
website at www.ibm.com/storage/support/2145.
Make sure that you have taken steps to try to solve the problem yourself before
you call.
Some suggestions for resolving the problem before calling IBM Support include:
- Check all cables to make sure that they are connected.
- Check all power switches to make sure that the system and optional devices are
  turned on.
- Use the troubleshooting information in your system documentation. The
  troubleshooting section of the information center contains procedures to help
  you diagnose problems.
- Go to the IBM Support website at www.ibm.com/storage/support/2145 to check
  for technical information, hints, tips, and new device drivers or to submit a
  request for information.
Information about your IBM storage system is available in the documentation that
comes with the product.
If you have questions about how to use and configure the machine, sign up for the
IBM Support Line offering to get a professional answer.
The maintenance supplied with the system provides support when there is a
problem with a hardware component or a fault in the system machine code. At
times, you might need expert advice about using a function provided by the
system or about how to configure the system. Purchasing the IBM Support Line
offering gives you access to this professional advice while deploying your system,
and in the future.
Contact your local IBM sales representative or your support group for availability
and purchase information.
New information
The following information has been added to this guide since the previous edition,
GC27-2284-06.
- USB flash drive interface on page 69
- Resolving a problem with SSL/TLS clients on page 245
- Procedure: Making drives support protection information on page 245
Updated information
New information
The following information has been added to this guide since the previous edition,
GC27-2284-05.
A SAN is a high-speed Fibre Channel network that connects host systems and
storage devices. In a SAN, a host system can be connected to a storage device
across the network. The connections are made through units such as routers and
switches. The area of the network that contains these units is known as the fabric of
the network.
IBM SAN Volume Controller is built with IBM Spectrum Virtualize software,
which is part of the IBM Spectrum Storage family.
The software provides these functions for the host systems that attach to SAN
Volume Controller:
- Creates a single pool of storage
- Provides logical unit virtualization
- Manages logical volumes
- Mirrors logical volumes
Figure 1 on page 2 shows hosts, SAN Volume Controller nodes, and RAID storage
systems connected to a SAN fabric. The redundant SAN fabric comprises a
Figure 1. SAN Volume Controller system in a fabric
Volumes
A system of SAN Volume Controller nodes presents volumes to the hosts. Most of
the advanced functions that SAN Volume Controller provides are defined on
volumes. These volumes are created from managed disks (MDisks) that are
presented by the RAID storage systems. The volumes can also be created from arrays
that are provided by flash drives in an expansion enclosure, such as the SAN Volume
Controller 2145-24F. All data transfer occurs through the SAN Volume Controller
nodes, an arrangement that is described as symmetric virtualization.
Figure 2. Data flow in a SAN Volume Controller system
The nodes in a system are arranged into pairs that are known as I/O groups. A
single pair is responsible for serving I/O on a volume. Because a volume is served
by two nodes, no loss of availability occurs if one node fails or is taken offline.
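The availability argument above can be made concrete with a small model. This is a minimal sketch under stated assumptions: the `Node` and `IOGroup` classes, the name io_grp0, and the rule that the first online node of the pair serves the volume are hypothetical simplifications, not the SAN Volume Controller's actual I/O path or node-selection logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Hypothetical model of one node; only its online state matters here."""
    name: str
    online: bool = True


@dataclass
class IOGroup:
    """Hypothetical model of an I/O group: two nodes that serve the same volumes."""
    name: str
    nodes: Tuple[Node, Node]

    def serving_node(self) -> Optional[Node]:
        """Return the first online node of the pair (a simplifying assumption),
        or None if both nodes are offline and the volume is unavailable."""
        for node in self.nodes:
            if node.online:
                return node
        return None


# One node failing leaves the volume served by its partner.
group = IOGroup("io_grp0", (Node("node1"), Node("node2")))
group.nodes[0].online = False
print(group.serving_node().name)  # prints "node2"
```

Only the loss of both nodes in the pair makes the volume unavailable, which is why each volume is served by a pair rather than by a single node.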
Volume types
Figure 3. Example of a basic volume
- Mirrored volumes, where copies of the volume can either be in the same storage
  pool or in different storage pools. The volume is cached in a single I/O group.
  Typically, mirrored volumes are established in a standard system topology.
Figure 4. Example of mirrored volumes
- Stretched volumes, where copies of a single volume are in different storage pools
  at different sites. The volume is cached in one I/O group. Stretched volumes are
  available only in stretched topology systems.
Figure 5. Example of stretched volumes
Figure 6. Example of HyperSwap volumes
System topology
The topology property of a SAN Volume Controller system can be set to one of the
following states.
Note: You cannot mix I/O groups of different topologies in the same system.
- Standard topology, where all nodes in the system are at the same site.
Figure 7. Example of a standard system topology
- Stretched topology, where each node of an I/O group is at a different site. When
  one site is not available, access to a volume can continue but with reduced
  performance.
Figure 8. Example of a stretched system topology
- HyperSwap topology, where the system consists of at least two I/O groups.
  Each I/O group is at a different site. Both nodes of an I/O group are at the
  same site. A volume can be active on two I/O groups so that it can immediately
  be accessed by the other site when a site is not available.
Figure 9. Example of a HyperSwap system topology
Table 6 summarizes the types of volumes that can be associated with each system
topology.
Table 6. System topology and volume summary
Topology | Basic | Mirrored | Stretched | HyperSwap | Custom
Standard | X | X | | | X
Stretched | X | | X | | X
HyperSwap | X | | | X | X
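The topology rules can be checked mechanically before volumes are planned. This is a minimal sketch that assumes a reading of Table 6 consistent with the volume descriptions above: basic and custom volumes in every topology, mirrored volumes in standard systems, stretched volumes in stretched systems, and HyperSwap volumes in HyperSwap systems. The function names are hypothetical and are not part of any SAN Volume Controller interface.

```python
# Assumed reading of Table 6: volume types permitted in each system topology.
ALLOWED_VOLUME_TYPES = {
    "standard": frozenset({"basic", "mirrored", "custom"}),
    "stretched": frozenset({"basic", "stretched", "custom"}),
    "hyperswap": frozenset({"basic", "hyperswap", "custom"}),
}


def system_topology(io_group_topologies):
    """Return the single topology shared by all I/O groups in a system.

    Raises ValueError if the list is empty, mixes topologies (a system cannot
    mix I/O groups of different topologies), or names an unknown topology.
    """
    distinct = {t.lower() for t in io_group_topologies}
    if len(distinct) != 1:
        raise ValueError(f"expected exactly one topology, got {sorted(distinct)}")
    topology = distinct.pop()
    if topology not in ALLOWED_VOLUME_TYPES:
        raise ValueError(f"unknown topology: {topology!r}")
    return topology


def volume_type_allowed(topology, volume_type):
    """True if volume_type can be created in a system with this topology."""
    return volume_type.lower() in ALLOWED_VOLUME_TYPES[topology.lower()]
```

For example, `volume_type_allowed("stretched", "hyperswap")` returns `False`, and `system_topology(["standard", "hyperswap"])` raises `ValueError`.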
System management
The SAN Volume Controller nodes in a system operate as a single system and
present a single point of control for system management and service. System
management and error reporting are provided through an Ethernet interface to one
of the nodes in the system, which is called the configuration node. The configuration
node runs a web server and provides a command-line interface (CLI). Any node in
the system can be the configuration node. If the current configuration node fails, a
new configuration node is selected from the remaining nodes. Each node also
provides a command-line interface and web interface for initiating hardware
service actions.
I/O operations between hosts and SAN Volume Controller nodes and between
SAN Volume Controller nodes and arrays use the SCSI standard. The SAN Volume
Controller nodes communicate with each other through private SCSI commands.
Table 7 shows the fabric type that can be used for communicating between hosts,
nodes, and RAID storage systems. These fabric types can be used at the same time.
Table 7. SAN Volume Controller communications types
Communications type | Host to SAN Volume Controller | SAN Volume Controller to storage system | SAN Volume Controller to SAN Volume Controller
Fibre Channel SAN | Yes | Yes | Yes
iSCSI (1 Gbps Ethernet or 10 Gbps Ethernet) | Yes | No | No
Fibre Channel over Ethernet SAN (10 Gbps Ethernet) | Yes | Yes | Yes
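Table 7 can also be written as a lookup, so that a design tool can tell which fabric types can carry each communication path. This is a minimal sketch; the fabric keys (`fibre_channel`, `iscsi`, `fcoe`) and path keys are hypothetical identifiers chosen for illustration.

```python
# Table 7 as a lookup: which fabric types can carry each communication path.
# The fabric and path keys are hypothetical identifiers chosen for this sketch.
SUPPORTED_PATHS = {
    "fibre_channel": {"host_to_node": True, "node_to_storage": True, "node_to_node": True},
    "iscsi": {"host_to_node": True, "node_to_storage": False, "node_to_node": False},
    "fcoe": {"host_to_node": True, "node_to_storage": True, "node_to_node": True},
}


def fabrics_for(path):
    """Return, in sorted order, the fabric types that support the given path."""
    return sorted(fabric for fabric, paths in SUPPORTED_PATHS.items() if paths[path])


print(fabrics_for("node_to_storage"))  # ['fcoe', 'fibre_channel']
```

iSCSI appears only for host attachment in Table 7, so a design that uses iSCSI hosts still needs Fibre Channel or Fibre Channel over Ethernet for storage-system and node-to-node traffic.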
Flash drives
Some SAN Volume Controller nodes contain flash drives or are attached to
expansion enclosures that contain flash drives. These flash drives can be used to
create RAID-managed disks (MDisks) that in turn can be used to create volumes.
In SAN Volume Controller 2145-DH8 nodes, the flash drives are in an expansion
enclosure that is connected to both nodes of an I/O group.
Flash drives provide host servers with a pool of high-performance storage for
critical applications. Figure 10 on page 10 shows this configuration. MDisks on
flash drives can also be placed in a storage pool with MDisks from regular RAID
storage systems. IBM Easy Tier performs automatic data placement within that
storage pool by moving high-activity data onto better-performing storage.
Figure 10. SAN Volume Controller nodes with internal flash drives
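The idea behind automatic data placement can be illustrated with a greedy rule: rank extents by recent activity and place the hottest ones on flash until the flash capacity is used up. This is a deliberately simplified sketch, not IBM Easy Tier's actual algorithm, which this guide does not describe. The extent names, activity counts, and capacity measured in extents are all hypothetical.

```python
def plan_placement(extent_activity, flash_capacity_extents):
    """Split extents into (flash, other) sets, hottest first.

    extent_activity: dict mapping an extent ID to a recent I/O count.
    flash_capacity_extents: how many extents fit on the flash tier.
    Ties are broken by extent ID so the result is deterministic.
    """
    if flash_capacity_extents < 0:
        raise ValueError("capacity cannot be negative")
    ranked = sorted(extent_activity, key=lambda e: (-extent_activity[e], e))
    flash = set(ranked[:flash_capacity_extents])
    other = set(ranked[flash_capacity_extents:])
    return flash, other


flash, other = plan_placement({"e1": 5, "e2": 120, "e3": 40, "e4": 0}, 2)
print(sorted(flash), sorted(other))  # ['e2', 'e3'] ['e1', 'e4']
```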
The nodes are always installed in pairs; a minimum of one pair and a maximum of
four pairs of nodes constitute a system. Each pair of nodes is known as an I/O
group.
I/O groups take the storage that is presented to the SAN by the storage systems as
MDisks and translate the storage into logical disks (volumes) that are used by
applications on the hosts. A node is in only one I/O group and provides access to
the volumes in that I/O group.
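The sizing rules above (nodes installed in pairs, one to four pairs per system, each node in only one I/O group) can be checked before nodes are added. This is a minimal sketch: pairing consecutive nodes in the list and the names io_grp0, io_grp1, and so on are illustrative assumptions, not the procedure the system uses.

```python
def form_io_groups(node_names):
    """Pair nodes into I/O groups: two nodes per group, one to four groups.

    Assumption: consecutive nodes in the list are paired; the group names
    io_grp0, io_grp1, ... are illustrative.
    """
    count = len(node_names)
    if len(set(node_names)) != count:
        raise ValueError("a node can belong to only one I/O group")
    if count % 2 != 0:
        raise ValueError("nodes are installed in pairs")
    if not 2 <= count <= 8:
        raise ValueError("a system has one to four pairs of nodes")
    return {
        f"io_grp{i}": (node_names[2 * i], node_names[2 * i + 1])
        for i in range(count // 2)
    }
```

For example, `form_io_groups(["node1", "node2", "node3", "node4"])` returns two I/O groups, while a list of three or ten nodes raises `ValueError`.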
The SAN Volume Controller 2145-DH8 node has the following features:
- A 19-inch rack-mounted enclosure
- At least one Fibre Channel adapter or one 10 Gbps Ethernet adapter
- Optional second, third, and fourth Fibre Channel adapters
- 32 GB memory per processor
- One or two eight-core processors
- Dual redundant power supplies
- Dual redundant batteries for better reliability, availability, and serviceability than
  for a SAN Volume Controller 2145-CG8 with an uninterruptible power supply
- Up to two SAN Volume Controller 2145-24F expansion enclosures to house up to
  24 flash drives each
- iSCSI host attachment (1 Gbps Ethernet and optional 10 Gbps Ethernet)
- Support for optional IBM Real-time Compression
- A dedicated technician port for local access to the initialization tool or the
  service assistant interface
The SAN Volume Controller 2145-CG8 node has the following features:
- A 19-inch rack-mounted enclosure
- One 4-port 8 Gbps Fibre Channel adapter
- One optional 2-port 10 Gbps Fibre Channel over Ethernet converged network
  adapter
- Optional second 4-port 8 Gbps Fibre Channel adapter
- 24 GB memory
- Fibre Channel over Ethernet host attachment (need to add only one)
- One quad-core processor
- Dual redundant power supplies
- Up to four optional flash drives
- iSCSI host attachment (1 Gbps Ethernet and optional 10 Gbps Ethernet)
- Support for optional IBM Real-time Compression
Note: The optional flash drives and optional 10 Gbps Ethernet cannot be in the
same 2145-CG8 node.
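The 2145-CG8 restrictions stated above can be expressed as a small configuration check. This is a minimal sketch; the `CG8Config` record and its fields are hypothetical and cover only the flash-drive limit and the note about flash drives and 10 Gbps Ethernet.

```python
from dataclasses import dataclass


@dataclass
class CG8Config:
    """Hypothetical description of optional features in one 2145-CG8 node."""
    flash_drives: int = 0
    ten_gbe: bool = False


def cg8_config_problems(cfg):
    """Return a list of rule violations (empty when the configuration is valid)."""
    problems = []
    if not 0 <= cfg.flash_drives <= 4:
        problems.append("a 2145-CG8 node supports up to four optional flash drives")
    if cfg.flash_drives > 0 and cfg.ten_gbe:
        problems.append("flash drives and 10 Gbps Ethernet cannot be in the same 2145-CG8 node")
    return problems
```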
Systems
A system is a collection of SAN Volume Controller nodes.
A system can consist of two to eight SAN Volume Controller nodes.
All configuration settings are replicated across all nodes in the system.
Management IP addresses are assigned to the system. Each interface accesses the
system remotely through the Ethernet system-management addresses, also known
as the primary and secondary system IP addresses.
Configuration node
A configuration node is a single node that manages configuration activity of the
system.
If the configuration node fails, the system chooses a new configuration node. This
action is called configuration node failover. The new configuration node takes over
the management IP addresses. Thus, you can access the system through the same
IP addresses even though the original configuration node has failed. During the
failover, there is a short period when you cannot use the command-line tools or
the management GUI.
Figure 11. Configuration node
This node then acts as the focal point for all configuration and other requests that
are made from the management GUI application or the CLI. This node is known as
the configuration node.
If the configuration node is stopped or fails, the remaining nodes in the system
determine which node will take on the role of configuration node. The new
configuration node binds the management IP addresses to its Ethernet ports. It
broadcasts this new mapping so that connections to the system configuration
interface can be resumed.
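The failover sequence described above (the failed node releases its role, a remaining node is chosen, and that node binds the management IP addresses) can be modeled in a few lines. This is a minimal sketch under a loudly stated assumption: the online node with the lowest ID is chosen. This guide does not document the real selection rule, and the ARP broadcast that follows is not modeled. `ClusterNode` and the documentation-range addresses are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterNode:
    """Hypothetical node record: ID, online state, and bound management IPs."""
    node_id: int
    online: bool = True
    management_ips: List[str] = field(default_factory=list)


def fail_over_configuration_node(
    nodes: List[ClusterNode], failed_id: int, management_ips: List[str]
) -> Optional[ClusterNode]:
    """Move the management IP addresses off a failed configuration node.

    ASSUMPTION: the remaining online node with the lowest ID becomes the new
    configuration node. Returns None if no node remains online.
    """
    for node in nodes:
        if node.node_id == failed_id:
            node.online = False
            node.management_ips = []  # the failed node no longer holds the IPs
    candidates = [n for n in nodes if n.online]
    if not candidates:
        return None
    new_config = min(candidates, key=lambda n: n.node_id)
    new_config.management_ips = list(management_ips)  # bind IPs to the new node
    return new_config
```

Clients keep using the same addresses (for example, 192.0.2.10 and 192.0.2.11) because only the node that holds them changes.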
The new configuration node broadcasts the new IP address mapping using the
Address Resolution Protocol (ARP). You must configure some switches to forward
the ARP packet on to other devices on the subnetwork. Ensure that all Ethernet
devices are configured to pass on unsolicited ARP packets. Otherwise, if the ARP
packet is not forwarded, a device loses its connection to the SAN Volume
Controller system.
If a device loses its connection to the SAN Volume Controller system, it can
quickly re-resolve the address if the device is on the same subnetwork as the
system. However, if the device is not on the same subnetwork, it might take hours
for the address resolution cache of the gateway to refresh. In this case, you can
restore the connection by establishing a command-line connection to the system
from a terminal that is on the same subnetwork, and then starting a secure copy
to the device that has lost its connection.
Management IP failover
If the configuration node fails, the IP addresses for the clustered system are
transferred to a new node. The system services are used to manage the transfer of
the management IP addresses from the failed configuration node to the new
configuration node.
Note: Some Ethernet devices might not forward ARP packets. If the ARP
packets are not forwarded, connectivity to the new configuration node cannot be
established automatically. To avoid this problem, configure all Ethernet devices
to pass unsolicited ARP packets. You can restore lost connectivity by logging in
to the SAN Volume Controller and starting a secure copy to the affected system.
Starting a secure copy forces an update to the ARP cache for all systems
connected to the same switch as the affected system.
If the Ethernet link to the SAN Volume Controller system fails because of an event
unrelated to the SAN Volume Controller, the SAN Volume Controller does not
attempt to fail over the configuration node to restore management IP access. For
example, the Ethernet link might fail if a cable is disconnected or an Ethernet router
fails. SAN Volume Controller provides the option for two Ethernet ports, each with
its own management IP address, to protect against this type of failure. If you
cannot connect through one IP address, attempt to access the system through the
alternate IP address.
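A management client can try the alternate address automatically when the first one is unreachable. This is a minimal sketch that assumes the management interface is reached over TCP. The helper name and the documentation-range addresses 192.0.2.10 and 192.0.2.11 are hypothetical. Port 22 is used in the usage line because the CLI is typically reached over SSH. The check only confirms that a TCP connection opens, not that a login succeeds.

```python
import socket


def connect_first_reachable(addresses, timeout=3.0):
    """Try each (host, port) pair in order and return (address, socket) for the
    first one that accepts a TCP connection. Raises ConnectionError if none do."""
    last_error = None
    for address in addresses:
        try:
            return address, socket.create_connection(address, timeout=timeout)
        except OSError as exc:  # refused, timed out, unreachable, and so on
            last_error = exc
    raise ConnectionError(f"no management address reachable (last error: {last_error})")
```

For example, `connect_first_reachable([("192.0.2.10", 22), ("192.0.2.11", 22)])` tries the primary address first and falls back to the alternate only if the primary fails. The caller is responsible for closing the returned socket.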
Note: IP addresses that are used by hosts to access the system over an Ethernet
connection are different from management IP addresses.
SAN Volume Controller supports the following protocols that make outbound
connections from the system:
- Email
- Simple Network Management Protocol (SNMP)
- Syslog
- Network Time Protocol (NTP)
These protocols operate only on a port configured with a management IP address.
When making outbound connections, the SAN Volume Controller uses the
following routing decisions:
- If the destination IP address is in the same subnet as one of the management IP
  addresses, the SAN Volume Controller system sends the packet immediately.
When configuring any of these protocols for event notifications, use these routing
decisions to ensure that error notification works correctly in the event of a network
failure.
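The routing decision stated above (send immediately when the destination shares a subnet with a management IP address) is a subnet-membership test. This is a minimal sketch that uses Python's standard `ipaddress` module. The interface addresses are hypothetical. A `None` result only means this rule does not apply; the system's handling of that case is not modeled here.

```python
import ipaddress


def matching_management_interface(destination, management_interfaces):
    """Return the first management interface (in "address/prefix" form) whose
    subnet contains destination, or None if no subnet matches."""
    dest = ipaddress.ip_address(destination)
    for interface in management_interfaces:
        # ip_interface keeps the prefix, so .network is the attached subnet.
        if dest in ipaddress.ip_interface(interface).network:
            return interface
    return None
```

For example, `matching_management_interface("10.1.1.50", ["10.1.1.10/24", "172.16.0.10/16"])` returns `"10.1.1.10/24"`. Checking notification targets (email server, SNMP manager, syslog server, NTP server) this way shows which ones depend on a gateway.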
In the host zone, the host systems can identify and address the nodes. You can
have more than one host zone and more than one disk zone. Unless you are using
a dual-core fabric design, the system zone contains all ports from all nodes in the
system. Create one zone for each host Fibre Channel port. In a disk zone, the
nodes identify the storage systems. Generally, create one zone for each external
storage system. If you are using the Metro Mirror and Global Mirror feature, create
a zone with at least one port from each node in each system; up to four systems
are supported.
Note: Some operating systems cannot tolerate other operating systems in the same
host zone, although you might have more than one host type in the SAN fabric.
For example, you can have a SAN that contains one host that runs on an IBM AIX
operating system and another host that runs on a Microsoft Windows operating
system.
A label on the front of the node indicates the SAN Volume Controller node type,
hardware revision (if appropriate), and serial number.
Figure 12 on page 16 shows the controls and indicators on the front panel of the
SAN Volume Controller 2145-DH8.
The node status LED provides the following system activity indicators:
Off The node is not operating as a member of a system.
On The node is operating as a member of a system.
Slow blinking
The node is in candidate or service state.
Fast blinking
The node is dumping cache and state data to the local disk in anticipation
of a system reboot from a pending power-off action or other controlled
restart sequence.
The green battery status LED indicates one of the following battery conditions.
Off The system software is not running on the node or the state of the system
cannot be saved if power to the node is lost.
Fast blinking
Battery charge level is too low for the state of the system to be saved if
power to the node is lost. Batteries are charging.
Slow blinking
Battery charge level is sufficient for the state of the system to be saved
once if power to the node is lost.
On Battery charge level is sufficient for the state of the system to be saved
twice if power to the node is lost.
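The green battery status LED states above can be recorded as a lookup table, for example in a site checklist or monitoring note. This minimal sketch maps each state to the number of times the system state can be saved if node power is lost. The enum and function names are hypothetical.

```python
from enum import Enum


class BatteryStatusLed(Enum):
    """States of the green battery status LED."""
    OFF = "off"
    FAST_BLINKING = "fast blinking"
    SLOW_BLINKING = "slow blinking"
    ON = "on"


# How many times the system state can be saved if power to the node is lost.
STATE_SAVES_AVAILABLE = {
    BatteryStatusLed.OFF: 0,            # software not running, or state cannot be saved
    BatteryStatusLed.FAST_BLINKING: 0,  # charge too low; batteries are charging
    BatteryStatusLed.SLOW_BLINKING: 1,  # enough charge to save the state once
    BatteryStatusLed.ON: 2,             # enough charge to save the state twice
}


def state_saves_available(led: BatteryStatusLed) -> int:
    return STATE_SAVES_AVAILABLE[led]


print(state_saves_available(BatteryStatusLed.SLOW_BLINKING))  # 1
```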
The amber battery fault LED indicates one of the following battery conditions.
Off The system software is not running on the node or this battery does not
have a fault.
Blinking
This battery is being identified.
On This battery has a fault. It cannot be used to save the system state if power
to the node is lost.
The green drive activity LED indicates one of the following conditions.
Off The drive is not ready for use.
Flashing
The drive is in use.
On The drive is ready for use, but is not in use.
The amber drive status LED indicates one of the following conditions.
Off The drive is in a good state or has no power.
Blinking
The drive is being identified.
On The drive has failed.
Figure 13 on page 18 shows the controls and indicators on the front panel of the
SAN Volume Controller 2145-CG8.
1 Node-status LED
2 Front-panel display
3 Navigation buttons
4 Operator-information panel
5 Select button
6 Error LED
Figure 14 shows the controls and indicators on the front panel of the SAN Volume
Controller 2145-CF8.
1 Node-status LED
2 Front-panel display
3 Navigation buttons
4 Operator-information panel
5 Select button
6 Error LED
Front-panel display
The front-panel display shows service, configuration, and navigation information.
You can select the language that is displayed on the front panel. The display can
show both alphanumeric information and graphical information (progress bars).
The front-panel display shows configuration and service information about the
node and the system, including the following items:
v Boot progress indicator
v Boot failed
v Charging
v Hardware boot
v Node rescue request
v Power failure
v Powering off
v Recovering
v Restarting
v Shutting down
v Error codes
v Validate WWNN?
Navigation buttons
You can use the four navigation buttons (up, down, left, and right) to move
through menus. Each button moves in the direction that matches its position. For
example, to move right in a menu, press the navigation button on the right side;
to move down, press the navigation button at the bottom.
Note: The select button is used in tandem with the navigation buttons.
The product serial number is used for warranty and service entitlement checking
and is included in the data that is sent with error reports. This number must not
change during the life of the product. If the system board is replaced, follow the
system board replacement instructions carefully and rewrite the serial number on
the system board.
The select button and navigation buttons help you to navigate and select menu
and boot options, and start a service panel test. The select button is located on the
front panel of the SAN Volume Controller, near the navigation buttons.
The node identification label shows the six-digit number that is input to the
addnode command. This number is readable by the system software and is used by the
configuration and service software as a node identifier. You can also display the
node identification number on the front-panel display by selecting Node from the menu.
If the service controller assembly front panel is replaced, the configuration and
service software displays the number that is printed on the front of the
replacement panel. Future error reports contain the new number. No system
reconfiguration is necessary when the front panel is replaced.
Error LED
Critical faults on the service controller are indicated through the amber error LED.
Figure 15 on page 21 shows the operator-information panel for the SAN Volume
Controller 2145-DH8.
Note: If the node has more than four Ethernet ports, activity on ports five and
above is not reflected on the operator-information panel Ethernet activity LEDs.
Figure 16 shows the operator-information panel for the SAN Volume Controller
2145-CG8.
1 Power-button cover
2 Ethernet 1 activity LED
3 Ethernet 2 activity LED
4 System-information LED
5 System-error LED
6 Release latch
Note: If you install the 10 Gbps Ethernet feature, the port activity is not reflected
on the activity LEDs.
Figure 17 shows the operator-information panel for the SAN Volume Controller
2145-CF8.
1 Power-button cover
2 Ethernet 2 activity LED
3 Ethernet 1 activity LED
4 System-information LED
5 System-error LED
6 Release latch
7 Locator button and LED
8 Not used
9 Not used
10 Power button and LED
System-error LED
When it is lit, the system-error LED indicates that a system-board error has
occurred.
This amber LED lights up if the hardware detects an unrecoverable error that
requires a new field-replaceable unit (FRU). To isolate the faulty FRU, see
MAP 5800: Light path.
Attention: If you use the reset button, the node restarts immediately without the
SAN Volume Controller control data being written to disk. Service actions are then
required to make the node operational again.
Power button
The power button turns main power on or off for the SAN Volume Controller.
To turn on the power, press and release the power button. You must have a
pointed device, such as a pen, to press the button.
To turn off the power, press and release the power button. For more information
about how to turn off the SAN Volume Controller node, see MAP 5350: Powering
off a SAN Volume Controller node.
Attention: When the node is operational and you press and immediately release
the power button, the SAN Volume Controller writes its control data to its internal
disk and then turns off. This can take up to five minutes. If you press the power
button but do not release it, the node turns off immediately without the SAN
Volume Controller control data being written to disk. Service actions are then
required to make the SAN Volume Controller operational again. Therefore, during
a power-off operation, do not press and hold the power button for more than two
seconds.
Power LED
The green power LED indicates the power status of the system.
Note: A power LED is also at the rear of these SAN Volume Controller nodes:
v 2145-CG8
v 2145-CF8
System-information LED
When the system-information LED is lit, a noncritical event has occurred.
Check the light path diagnostics panel and the event log. Light path diagnostics
are described in more detail in the light path maintenance analysis procedure
(MAP).
Locator LED
The SAN Volume Controller does not use the locator LED.
The operator-information panel LEDs refer to the Ethernet ports that are mounted
on the system board. If you install the 10 Gbps Ethernet card on a SAN Volume
Controller 2145-CG8, the port activity is not reflected on the activity LEDs.
Figure 18 shows the rear-panel indicators on the SAN Volume Controller 2145-DH8
back-panel assembly.
1 Ethernet-link LED
2 Ethernet-activity LED
3 Power, location, and system-error LEDs
4 AC, DC, and power-supply error LEDs
Figure 19 on page 25 shows the external connectors on the SAN Volume Controller
2145-DH8 back panel assembly.
Figure 19. Connectors on the rear of the SAN Volume Controller 2145-DH8
1 Neutral
2 Ground
3 Live
Note: Optional host interface adapters provide additional connectors for 10 Gbps
Ethernet, Fibre Channel, or SAS.
The SAN Volume Controller 2145-DH8 contains a number of ports that are only
used during service procedures.
During normal operation, none of these ports are used. Connect a device to any of
these ports only when you are directed to do so by a service procedure or by an
IBM service representative.
The SAN Volume Controller 2145-DH8 includes one port that is not used.
Figure 22 shows the one port that is not used during service procedures or normal
operation. This port is disabled in software to make the port inactive.
Figure 24 shows the rear-panel indicators on the SAN Volume Controller 2145-CG8
back-panel assembly that has the 10 Gbps Ethernet feature.
Figure 24. SAN Volume Controller 2145-CG8 rear-panel indicators for the 10 Gbps Ethernet
feature
1 10 Gbps Ethernet-link LEDs. The amber link LED is on when this port is
connected to a 10 Gbps Ethernet switch and the link is online.
2 10 Gbps Ethernet-activity LEDs. The green activity LED is on while data is
being sent over the link.
These figures show the external connectors on the SAN Volume Controller
2145-CG8 back panel assembly.
Figure 25. Connectors on the rear of the SAN Volume Controller 2145-CG8
Figure 26. 10 Gbps Ethernet ports on the rear of the SAN Volume Controller 2145-CG8
Neutral
Ground
Live
The SAN Volume Controller 2145-CG8 contains a number of ports that are only
used during service procedures.
Figure 28 on page 29 shows ports that are used only during service procedures.
Figure 28. Service ports of the SAN Volume Controller 2145-CG8
During normal operation, none of these ports are used. Connect a device to any of
these ports only when you are directed to do so by a service procedure or by an
IBM service representative.
The SAN Volume Controller 2145-CG8 can contain one port that is not used.
Figure 29 shows the one port that is not used during service procedures or normal
use.
Figure 29. SAN Volume Controller 2145-CG8 port not used
When present, this port is disabled in software to make the port inactive.
The SAS port is present when the optional high-speed SAS adapter is installed
with one or more flash drives.
Figure 31 shows the external connectors on the SAN Volume Controller 2145-CF8
back panel assembly.
Figure 31. Connectors on the rear of the SAN Volume Controller 2145-CG8 or 2145-CF8
The SAN Volume Controller 2145-CF8 contains a number of ports that are only
used during service procedures.
Figure 33 shows ports that are used only during service procedures.
During normal operation, none of these ports are used. Connect a device to any of
these ports only when you are directed to do so by a service procedure or by an
IBM service representative.
The SAN Volume Controller 2145-CF8 can contain one port that is not used.
Figure 34 shows the one port that is not used during service procedures or normal
use.
When present, this port is disabled in software to make the port inactive.
The SAS port is present when the optional high-speed SAS adapter is installed
with one or more flash drives.
Two LEDs are used to indicate the state and speed of the operation of each Fibre
Channel port. The bottom LED indicates the link state and activity.
Table 8. Link state and activity for the bottom Fibre Channel LED
LED state | Link state and activity indicated
Off | Link inactive
On | Link active, no I/O
Flashing | Link active, I/O active
Each Fibre Channel port can operate at one of three speeds. The top LED indicates
the relative link speed. The link speed is defined only if the link state is active.
Table 9. Link speed for the top Fibre Channel LED
LED state | Link speed indicated
Off | Slow
On | Fast
Blinking | Medium
Table 10 shows the actual link speeds for the SAN Volume Controller 2145-CF8 and
for the SAN Volume Controller 2145-CG8.
Table 10. Actual link speeds
Link speed | Actual link speed
Slow | 2 Gbps
Fast | 8 Gbps
Medium | 4 Gbps
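Tables 8, 9, and 10 combine into a two-step lookup. The bottom LED gives the link state, and if the link is active, the top LED gives the relative speed, which Table 10 turns into an actual speed. This minimal sketch applies only to the 2145-CF8 and 2145-CG8, which is the scope of Table 10. The LED state strings and the function name are hypothetical.

```python
from typing import Optional, Tuple

# Bottom LED: link state and activity (Table 8).
LINK_STATE = {
    "off": "link inactive",
    "on": "link active, no I/O",
    "flashing": "link active, I/O active",
}

# Top LED: relative speed (Table 9) mapped to the actual speed for the
# 2145-CF8 and 2145-CG8 (Table 10), in Gbps.
LINK_SPEED_GBPS = {"off": 2, "blinking": 4, "on": 8}


def decode_fc_port_leds(bottom: str, top: str) -> Tuple[str, Optional[int]]:
    """Translate the two Fibre Channel port LEDs into (link state, speed in Gbps)."""
    state = LINK_STATE[bottom]
    if bottom == "off":
        # The link speed is defined only while the link is active.
        return state, None
    return state, LINK_SPEED_GBPS[top]


print(decode_fc_port_leds("flashing", "blinking"))  # ('link active, I/O active', 4)
```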
There is a set of LEDs for each Ethernet connector. The top LED is the Ethernet
link LED. When it is lit, it indicates that there is an active connection on the
Ethernet port. The bottom LED is the Ethernet activity LED. When it flashes, it
indicates that data is being transmitted or received between the server and a
network device.
The following terms describe the power, location, and system-error LEDs:
Power LED
This is the top of the three LEDs and indicates the following states:
Off One or more of the following are true:
v No power is present at the power supply input
v The power supply has failed
v The LED has failed
On The SAN Volume Controller is powered on.
Blinking
The SAN Volume Controller is turned off but is still connected to a
power source.
Location LED
This is the middle of the three LEDs and is not used by the SAN Volume
Controller.
System-error LED
This is the bottom of the three LEDs. When it is lit, it indicates that a
system-board error has occurred. The light path diagnostics provide more information.
AC and DC LEDs
The AC and DC LEDs indicate whether the node is receiving electrical current.
AC LED
The upper LED indicates that AC current is present on the node.
DC LED
The lower LED indicates that DC current is present on the node.
The AC, DC, and power-supply error LEDs indicate whether the node is receiving
electrical current.
Figure 35 on page 34 shows the location of the SAN Volume Controller 2145-DH8
AC, DC, and power-supply error LEDs.
Figure 35. SAN Volume Controller 2145-DH8 AC, DC, and power-error LEDs
Each of the two power supplies has its own set of LEDs.
1 Indicates that AC current is present on the node.
2 Indicates that DC current is present on the node.
3 Indicates a problem with the power supply.
AC, DC, and power-supply error LEDs on the SAN Volume Controller 2145-CF8
and SAN Volume Controller 2145-CG8:
The AC, DC, and power-supply error LEDs indicate whether the node is receiving
electrical current.
Figure 36 shows the location of the AC, DC, and power-supply error LEDs.
Figure 36. SAN Volume Controller 2145-CG8 or 2145-CF8 AC, DC, and power-error LEDs
Each of the two power supplies has its own set of LEDs.
AC LED
The upper LED (1) on the left side of the power supply indicates that
AC current is present on the node.
The physical port numbers identify Fibre Channel adapters and cable connections
when you run service tasks. Worldwide port names (WWPNs), which uniquely
identify the devices on the SAN, are used for tasks such as Fibre Channel switch
configuration. The WWPNs are derived from the worldwide node name (WWNN)
of the node in which the ports are installed.
Input-voltage requirements
Ensure that your environment meets the voltage requirements that are shown in
Table 11.
Table 11. Input-voltage requirements
Voltage | Frequency
100 V to 127 V or 200 V to 240 V ac | 50 Hz or 60 Hz
Ensure that your environment meets the power requirements as shown in Table 12.
The maximum power that is required depends on the node type and the optional
features that are installed.
Table 12. Maximum power consumption
Components | Power requirements
SAN Volume Controller 2145-DH8 | 750 W
Note: You cannot mix ac and dc power sources; the power sources must match.
If you are not using redundant ac power, ensure that your environment falls
within the ranges that are shown in Table 13.
Table 13. Physical specifications
Environment | Temperature | Altitude | Relative humidity | Maximum dew point
Operating in lower altitudes | 5°C to 40°C (41°F to 104°F) | 0 m to 950 m (0 ft to 3,117 ft) | |
Operating in higher altitudes | 5°C to 28°C (41°F to 82°F) | 951 m to 3,050 m (3,118 ft to 10,000 ft) | 8% to 85% | 24°C (75°F)
Turned off (with standby power) | 5°C to 45°C (41°F to 113°F) | 0 m to 3,050 m (0 ft to 10,000 ft) | 8% to 85% | 27°C (80.6°F)
Storing | 1°C to 60°C (33.8°F to 140.0°F) | 0 m to 3,050 m (0 ft to 10,000 ft) | 5% to 80% | 29°C (84.2°F)
Shipping | -40°C to 60°C (-40°F to 140.0°F) | 0 m to 10,700 m (0 ft to 34,991 ft) | 5% to 100% | 29°C (84.2°F)
Note: Decrease the maximum system temperature by 1°C for every 175 m increase
in altitude.
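As a worked example of the note, the sketch below assumes that derating starts at 950 m, the top of the lower-altitude operating range, from its 40°C maximum. The note does not state this starting point; it is inferred. It is consistent with Table 13, though: (3,050 m - 950 m) / 175 m = 12, and 40°C - 12°C = 28°C, the maximum for higher altitudes. Table 13 lists a single 28°C limit for the whole higher-altitude band, so treat this sketch as an illustration of the note, not a replacement for the table. The function and parameter names are hypothetical.

```python
def derated_max_temp_c(altitude_m: float,
                       base_max_c: float = 40.0,
                       derating_start_m: float = 950.0) -> float:
    """Apply the 1°C per 175 m derating rule from the note (sketch).

    Assumption: derating starts at 950 m, the top of the lower-altitude
    operating range, from the 40°C lower-altitude maximum.
    """
    if altitude_m <= derating_start_m:
        return base_max_c
    return base_max_c - (altitude_m - derating_start_m) / 175.0


print(derated_max_temp_c(3050))  # 28.0
```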
The following tables list the physical characteristics of the 2145-DH8 node.
Use the parameters that are shown in Table 14 to ensure that space is available in a
rack capable of supporting the node.
Table 14. Dimensions and weight
Height | Width | Depth | Maximum weight
86 mm (3.4 in.) | 445 mm (17.5 in.) | 746 mm (29.4 in.) | 25 kg (55 lb) to 30 kg (65 lb), depending on configuration
Ensure that space is available in the rack for the additional space requirements
around the node, as shown in Table 15.
Table 15. Additional space requirements
Location | Additional space requirements | Reason
Left side and right side | Minimum: 50 mm (2 in.) | Cooling air flow
Back | Minimum: 100 mm (4 in.) | Cable exit
The node dissipates the maximum heat output that is given in Table 16.
Table 16. Maximum heat output of each 2145-DH8 node
Model and configuration | Heat output per node
2145-DH8, minimum configuration | 419.68 Btu per hour (AC 123 watts)
2145-DH8, maximum configuration | 3480.24 Btu per hour (AC 1020 watts)
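The heat-output figures in Table 16 follow from the power draw with a factor of 3.412 Btu per hour per watt: 123 W x 3.412 = 419.68 Btu per hour, and 1020 W x 3.412 = 3480.24 Btu per hour. The same factor reproduces the rounded values given for other models (for example, 160 W gives about 546 Btu per hour). This minimal sketch converts watts to Btu per hour and sums a hypothetical set of nodes to estimate rack heat load. The function names are hypothetical.

```python
from typing import Sequence

# Conversion factor implied by Table 16: 123 W -> 419.68 Btu/h, 1020 W -> 3480.24 Btu/h.
BTU_PER_HOUR_PER_WATT = 3.412


def watts_to_btu_per_hour(watts: float) -> float:
    return watts * BTU_PER_HOUR_PER_WATT


def rack_heat_btu_per_hour(node_watts: Sequence[float]) -> float:
    """Total heat load for a set of nodes, given each node's AC power draw in watts."""
    return watts_to_btu_per_hour(sum(node_watts))


print(round(watts_to_btu_per_hour(123), 2))             # 419.68
print(round(rack_heat_btu_per_hour([1020, 1020]), 2))   # 6960.48
```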
Input-voltage requirements
Ensure that your environment meets the voltage requirements that are shown in
Table 17.
Table 17. Input-voltage requirements
Voltage | Frequency
200 V to 240 V single phase ac | 50 Hz or 60 Hz
Attention:
v If the uninterruptible power supply is cascaded from another uninterruptible
power supply, the source uninterruptible power supply must have at least three
times the capacity per phase and the total harmonic distortion must be less than
5%.
v The uninterruptible power supply also must have input voltage capture that has
a slew rate of no more than 3 Hz per second.
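The cascading rules in the Attention notice above can be written as a single check. This is a minimal sketch under stated assumptions. Capacity is expressed in VA per phase, and the slew-rate limit is applied to the source uninterruptible power supply. The dataclass, its fields, and the example values are hypothetical and are not ratings of the 2145 UPS-1U.

```python
from dataclasses import dataclass


@dataclass
class SourceUps:
    """Hypothetical description of the upstream (source) uninterruptible power supply."""
    capacity_per_phase_va: float
    total_harmonic_distortion_pct: float
    input_slew_rate_hz_per_s: float


def cascaded_ups_acceptable(source: SourceUps, downstream_capacity_per_phase_va: float) -> bool:
    """Check the cascading rules from the Attention notice (sketch)."""
    enough_capacity = source.capacity_per_phase_va >= 3 * downstream_capacity_per_phase_va
    low_distortion = source.total_harmonic_distortion_pct < 5.0
    slow_slew = source.input_slew_rate_hz_per_s <= 3.0
    return enough_capacity and low_distortion and slow_slew


print(cascaded_ups_acceptable(SourceUps(3000, 3.0, 2.0), 1000))  # True
print(cascaded_ups_acceptable(SourceUps(3000, 5.0, 2.0), 1000))  # False: THD must be under 5%
```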
Ensure that your environment meets the power requirements as shown in Table 18.
The maximum power that is required depends on the node type and the optional
features that are installed.
Table 18. Maximum power consumption
Components | Power requirements
SAN Volume Controller 2145-CG8 and 2145 UPS-1U | 200 W
For the high-speed SAS adapter with one to four solid-state drives, add 50 W
to the power requirements.
The 2145 UPS-1U has an integrated circuit breaker and does not require additional
protection.
If you are not using redundant ac power, ensure that your environment falls
within the ranges that are shown in Table 19.
Table 19. Environment requirements without redundant AC power
Environment | Temperature | Altitude | Relative humidity | Maximum wet bulb temperature
Operating in lower altitudes | 10°C to 35°C (50°F to 95°F) | 0 m to 914 m (0 ft to 3000 ft) | 8% to 80% noncondensing | 23°C (73°F)
Operating in higher altitudes | 10°C to 32°C (50°F to 90°F) | 914 m to 2133 m (3000 ft to 7000 ft) | 8% to 80% noncondensing | 23°C (73°F)
Turned off | 10°C to 43°C (50°F to 109°F) | 0 m to 2133 m (0 ft to 7000 ft) | 8% to 80% noncondensing | 27°C (81°F)
Storing | 1°C to 60°C (34°F to 140°F) | 0 m to 2133 m (0 ft to 7000 ft) | 5% to 80% noncondensing | 29°C (84°F)
Shipping | -20°C to 60°C (-4°F to 140°F) | 0 m to 10668 m (0 ft to 34991 ft) | 5% to 100% condensing, but no precipitation | 29°C (84°F)
If you are using redundant ac power, ensure that your environment falls within the
ranges that are shown in Table 20.
Table 20. Environment requirements with redundant AC power
Environment | Temperature | Altitude | Relative humidity | Maximum wet bulb temperature
Operating in lower altitudes | 15°C to 32°C (59°F to 90°F) | 0 m to 914 m (0 ft to 3000 ft) | 20% to 80% noncondensing | 23°C (73°F)
Operating in higher altitudes | 15°C to 32°C (59°F to 90°F) | 914 m to 2133 m (3000 ft to 7000 ft) | 20% to 80% noncondensing | 23°C (73°F)
Turned off | 10°C to 43°C (50°F to 109°F) | 0 m to 2133 m (0 ft to 7000 ft) | 20% to 80% noncondensing | 27°C (81°F)
Storing | 1°C to 60°C (34°F to 140°F) | 0 m to 2133 m (0 ft to 7000 ft) | 5% to 80% noncondensing | 29°C (84°F)
Shipping | -20°C to 60°C (-4°F to 140°F) | 0 m to 10668 m (0 ft to 34991 ft) | 5% to 100% condensing, but no precipitation | 29°C (84°F)
The following tables list the physical characteristics of the SAN Volume Controller
2145-CG8 node.
Use the parameters that are shown in Table 21 to ensure that space is available in a
rack capable of supporting the node.
Table 21. Dimensions and weight
Height | Width | Depth | Maximum weight
4.3 cm (1.7 in.) | 44 cm (17.3 in.) | 73.7 cm (29 in.) | 15 kg (33 lb)
Ensure that space is available in the rack for the additional space requirements
around the node, as shown in Table 22.
Table 22. Additional space requirements
Location | Additional space requirements | Reason
Left side and right side | Minimum: 50 mm (2 in.) | Cooling air flow
Back | Minimum: 100 mm (4 in.) | Cable exit
The node dissipates the maximum heat output that is given in Table 23.
Table 23. Maximum heat output of each SAN Volume Controller 2145-CG8 node
Model | Heat output per node
SAN Volume Controller 2145-CG8 | 160 W (546 Btu per hour)
SAN Volume Controller 2145-CG8 plus flash drive | 210 W (717 Btu per hour)
The 2145 UPS-1U dissipates the maximum heat output that is given in Table 24.
Table 24. Maximum heat output of each 2145 UPS-1U
Model and condition | Maximum heat output
2145 UPS-1U during normal operation | 10 W (34 Btu per hour)
2145 UPS-1U during battery operation | 100 W (341 Btu per hour)
Ensure that your environment meets the voltage requirements that are shown in
Table 25.
Table 25. Input-voltage requirements
Voltage | Frequency
200 V to 240 V single phase ac | 50 Hz or 60 Hz
Attention:
v If the uninterruptible power supply is cascaded from another uninterruptible
power supply, the source uninterruptible power supply must have at least three
times the capacity per phase and the total harmonic distortion must be less than
5%.
v The uninterruptible power supply also must have input voltage capture that has
a slew rate of no more than 3 Hz per second.
Ensure that your environment meets the power requirements as shown in Table 26.
Table 26. Power requirements for each node
Components | Power requirements
SAN Volume Controller 2145-CF8 node and 2145 UPS-1U power supply | 200 W
Notes:
v SAN Volume Controller 2145-CF8 nodes cannot connect to all revisions of the
2145 UPS-1U power supply unit. The SAN Volume Controller 2145-CF8 nodes
require the 2145 UPS-1U power supply unit part number 31P1318. This unit has
two power outlets that are accessible. Earlier revisions of the 2145 UPS-1U
power supply unit have only one power outlet that is accessible and are not
suitable.
v For each redundant AC-power switch, add 20 W to the power requirements.
v For each high-speed SAS adapter with one to four flash drives, add 50 W to the
power requirements.
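Table 26 and its notes add up to a simple power budget per node: a 200 W base, plus 20 W for each redundant AC-power switch, plus 50 W for each high-speed SAS adapter with one to four flash drives. This minimal sketch applies that arithmetic. The constant and function names are hypothetical.

```python
BASE_POWER_W = 200             # 2145-CF8 node and 2145 UPS-1U (Table 26)
REDUNDANT_AC_SWITCH_W = 20     # per redundant AC-power switch
SAS_ADAPTER_WITH_FLASH_W = 50  # per high-speed SAS adapter with one to four flash drives


def cf8_power_requirement_w(redundant_ac_switches: int = 0,
                            sas_adapters_with_flash: int = 0) -> int:
    """Power requirement for one 2145-CF8 node, per Table 26 and its notes."""
    return (BASE_POWER_W
            + REDUNDANT_AC_SWITCH_W * redundant_ac_switches
            + SAS_ADAPTER_WITH_FLASH_W * sas_adapters_with_flash)


# One node with a redundant AC-power switch and one SAS adapter with flash drives.
print(cf8_power_requirement_w(1, 1))  # 270
```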
The 2145 UPS-1U has an integrated circuit breaker and does not require additional
protection.
If you are not using redundant ac power, ensure that your environment falls
within the ranges that are shown in Table 27 on page 41.
If you are using redundant ac power, ensure that your environment falls within the
ranges that are shown in Table 28.
Table 28. Environment requirements with redundant AC power
Environment | Temperature | Altitude | Relative humidity | Maximum wet bulb temperature
Operating in lower altitudes | 15°C to 32°C (59°F to 90°F) | 0 m to 914 m (0 ft to 2998 ft) | 20% to 80% noncondensing | 23°C (73°F)
Operating in higher altitudes | 15°C to 32°C (59°F to 90°F) | 914 m to 2133 m (2998 ft to 6988 ft) | 20% to 80% noncondensing | 23°C (73°F)
Turned off | 10°C to 43°C (50°F to 110°F) | 0 m to 2133 m (0 ft to 6988 ft) | 20% to 80% noncondensing | 27°C (81°F)
Storing | 1°C to 60°C (34°F to 140°F) | 0 m to 2133 m (0 ft to 6988 ft) | 5% to 80% noncondensing | 29°C (84°F)
Shipping | -20°C to 60°C (-4°F to 140°F) | 0 m to 10668 m (0 ft to 34991 ft) | 5% to 100% condensing, but no precipitation | 29°C (84°F)
The following tables list the physical characteristics of the SAN Volume Controller
2145-CF8 node.
Use the parameters that are shown in Table 29 to ensure that space is available in a
rack capable of supporting the node.
Table 29. Dimensions and weight
Height | Width | Depth | Maximum weight
43 mm (1.69 in.) | 440 mm (17.32 in.) | 686 mm (27 in.) | 12.7 kg (28 lb)
Ensure that space is available in the rack for the additional space requirements
around the node, as shown in Table 30.
Table 30. Additional space requirements
Location | Additional space requirements | Reason
Left and right sides | 50 mm (2 in.) | Cooling air flow
Back | Minimum: 100 mm (4 in.) | Cable exit
The node dissipates the maximum heat output that is given in Table 31.
Table 31. Heat output of each SAN Volume Controller 2145-CF8 node
Model | Heat output per node
SAN Volume Controller 2145-CF8 | 160 W (546 Btu per hour)
SAN Volume Controller 2145-CF8 and up to four optional flash drives | 210 W (717 Btu per hour)
Maximum heat output of 2145 UPS-1U during typical operation | 10 W (34 Btu per hour)
Maximum heat output of 2145 UPS-1U during battery operation | 100 W (341 Btu per hour)
Restriction: The redundant AC-power switch does not apply to the SAN Volume
Controller 2145-DH8.
You must connect the redundant AC-power switch to two independent power
circuits. One power circuit connects to the main power input port and the other
power circuit connects to the backup power-input port. If the main power to the
SAN Volume Controller node fails for any reason, the redundant AC-power switch
automatically uses the backup power source. When power is restored, the
redundant AC-power switch automatically changes back to using the main power
source.
Place the redundant AC-power switch in the same rack as the SAN Volume
Controller node. The redundant AC-power switch logically sits between the rack
power distribution unit and the 2145 UPS-1U.
You can use a single redundant AC-power switch to power one or two SAN
Volume Controller nodes. If you use the redundant AC-power switch to power two
nodes, the nodes must be in different I/O groups. If the redundant AC-power
switch fails or requires maintenance, both nodes turn off. Because the nodes are in
two different I/O groups, the hosts do not lose access to the back-end disk data.
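The two rules above are that a redundant AC-power switch powers at most two nodes, and that nodes sharing a switch must be in different I/O groups. Both can be checked against a planned assignment before cabling. This minimal sketch uses hypothetical function, node, and switch names. The example uses the four-node, two-I/O-group layout of Figure 38, but the specific node-to-switch assignment is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def check_redundant_switch_plan(io_group_of: Dict[str, int],
                                switch_of: Dict[str, str]) -> List[str]:
    """Return a list of problems with a node-to-switch assignment (empty if none).

    io_group_of maps node name -> I/O group; switch_of maps node name -> switch name.
    """
    nodes_on = defaultdict(list)
    for node, switch in switch_of.items():
        nodes_on[switch].append(node)

    problems = []
    for switch, nodes in sorted(nodes_on.items()):
        if len(nodes) > 2:
            problems.append(f"{switch}: powers {len(nodes)} nodes; the maximum is two")
        groups = [io_group_of[node] for node in nodes]
        if len(set(groups)) < len(groups):
            problems.append(f"{switch}: powers more than one node in the same I/O group")
    return problems


io_groups = {"A": 0, "B": 0, "C": 1, "D": 1}
print(check_redundant_switch_plan(io_groups, {"A": "RS1", "C": "RS1", "B": "RS2", "D": "RS2"}))  # []
print(check_redundant_switch_plan(io_groups, {"A": "RS1", "B": "RS1"}))
# ['RS1: powers more than one node in the same I/O group']
```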
Figure 37. Photo of the redundant AC-power switch
You must properly cable the redundant AC-power switch units in your
environment. The following topics provide environment and cabling information.
The redundant AC-power switch requires two independent power sources that are
provided through two rack-mounted power distribution units (PDUs). The PDUs
must have IEC 320-C13 outlets.
The redundant AC-power switch comes with two IEC 320-C19 to C14 power cables
to connect to rack PDUs. There are no country-specific cables for the redundant
AC-power switch.
The power cable between the redundant AC-power switch and the 2145 UPS-1U is
rated at 10 A.
The following tables list the physical characteristics of the redundant AC-power
switch.
Ensure that space is available in a rack that is capable of supporting the redundant
AC-power switch.
Table 32. Rack space required for redundant AC-power switch
Height | Width | Depth | Maximum weight
43 mm (1.69 in.) | 192 mm (7.56 in.) | 240 mm (9.45 in.) | 2.6 kg (5.72 lb)
The maximum heat output that is dissipated inside the redundant AC-power
switch is approximately 20 watts (70 Btu per hour).
Figure 38 on page 45 shows an example of the main wiring connections for a SAN
Volume Controller clustered system with the redundant AC-power switch feature.
This example is designed to clearly show the cable connections; the components
are not positioned as they would be in a rack. Figure 39 on page 47 shows a
typical rack installation. The four-node clustered system consists of two I/O
groups:
v I/O group 0 contains nodes A and B
v I/O group 1 contains nodes C and D
Figure 38. A four-node SAN Volume Controller system with the redundant AC-power switch
feature
1 I/O group 0
2 SAN Volume Controller node A
3 2145 UPS-1U A
4 SAN Volume Controller node B
5 2145 UPS-1U B
6 I/O group 1
7 SAN Volume Controller node C
8 2145 UPS-1U C
9 SAN Volume Controller node D
10 2145 UPS-1U D
11 Redundant AC-power switch 1
12 Redundant AC-power switch 2
13 Site PDU X (C13 outlets)
14 Site PDU Y (C13 outlets)
In this example, only two redundant AC-power switch units are used, and each
power switch powers one node in each I/O group. However, for maximum
redundancy, use one redundant AC-power switch to power each node in the
system.
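As a planning aid, the two power-cabling rules in this section can be checked before any cable is connected: nodes that share a redundant AC-power switch must be in different I/O groups, and the nodes of one I/O group must use different 2145 UPS-1U units. The following Python sketch encodes the Figure 38 example as a hypothetical plan. The node-to-switch assignment, the dictionary layout, and the function name wiring_problems are illustrative assumptions, not part of any IBM tool.

```python
from collections import defaultdict

# Hypothetical plan modeled on Figure 38 (the switch assignment is an assumption):
# node -> I/O group, 2145 UPS-1U, and redundant AC-power switch.
PLAN = {
    "A": {"iogrp": 0, "ups": "UPS-A", "switch": 1},
    "B": {"iogrp": 0, "ups": "UPS-B", "switch": 2},
    "C": {"iogrp": 1, "ups": "UPS-C", "switch": 1},
    "D": {"iogrp": 1, "ups": "UPS-D", "switch": 2},
}


def wiring_problems(plan):
    """Return a list of cabling-rule violations found in a wiring plan."""
    problems = []
    ups_owner = defaultdict(dict)        # I/O group -> {UPS: first node seen on it}
    group_on_switch = defaultdict(dict)  # switch -> {I/O group: first node seen on it}
    for node, wiring in sorted(plan.items()):
        group, ups, switch = wiring["iogrp"], wiring["ups"], wiring.get("switch")
        # Rule: nodes in the same I/O group must use different UPS units.
        other = ups_owner[group].setdefault(ups, node)
        if other != node:
            problems.append(f"nodes {other} and {node} (I/O group {group}) share {ups}")
        # Rule: nodes that share a redundant AC-power switch must be in different I/O groups.
        if switch is not None:
            other = group_on_switch[switch].setdefault(group, node)
            if other != node:
                problems.append(f"nodes {other} and {node} (I/O group {group}) share switch {switch}")
    return problems
```

With the plan above, wiring_problems(PLAN) returns an empty list. Moving node B onto switch 1 produces one violation, because nodes A and B are both in I/O group 0.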
Some SAN Volume Controller node types have two power supply units. Both
power supplies must be connected to the same 2145 UPS-1U, as shown by node A
and node B. The SAN Volume Controller 2145-CG8 is an example of a node that
has two power supplies.
Figure 39 on page 47 shows an eight-node clustered system installed in a rack
according to best location practices, with one redundant AC-power switch per
node. The cables between the components are shown.
With a 2145 UPS-1U, data is saved to the internal disk of the SAN Volume
Controller node. The uninterruptible power supply units are required to power the
SAN Volume Controller nodes even when the input power source is considered
uninterruptible.
If the 2145 UPS-1U reports a loss of input power, the SAN Volume Controller node
stops all I/O operations and dumps the contents of its dynamic random access
memory (DRAM) to the internal disk drive. When input power to the 2145 UPS-1U
is restored, the SAN Volume Controller node restarts and restores the original
contents of the DRAM from the data saved on the disk drive.
A SAN Volume Controller node is not fully operational until the 2145 UPS-1U
battery state indicates that it has sufficient charge to power the SAN Volume
Controller node long enough to save all of its memory to the disk drive. In the
event of a power loss, the 2145 UPS-1U has sufficient capacity for the SAN Volume
Controller to save all its memory to disk at least twice. For a fully charged 2145
UPS-1U, even after battery charge has been used to power the SAN Volume
Controller node while it saves DRAM data,
sufficient battery charge remains so that the SAN Volume Controller node can
become fully operational as soon as input power is restored.
Important: Do not shut down a 2145 UPS-1U without first shutting down the SAN
Volume Controller node that it supports. Pushing the 2145 UPS-1U on/off button
while the node is still operating can compromise data integrity. However, in an
emergency, you can manually shut down the 2145 UPS-1U by pushing its on/off
button while the node is still operating. Service actions must then be performed
before the node can resume normal operations. If multiple uninterruptible power
supply units are shut down before the nodes they support, data can be corrupted.
Each SAN Volume Controller node of a pair must be connected to only one
2145 UPS-1U.
SAN Volume Controller provides a cable bundle for connecting the uninterruptible
power supply to a node. This cable is used to connect both power supplies of a
node to the same uninterruptible power supply.
The SAN Volume Controller software determines whether the input voltage to the
uninterruptible power supply is within range and sets an appropriate voltage
alarm range on the uninterruptible power supply. The software continues to
recheck the input voltage every few minutes. If it changes substantially but
remains within the permitted range, the alarm limits are readjusted.
Note: The 2145 UPS-1U is equipped with a cable retention bracket that keeps the
power cable from disengaging from the rear panel. See the related documentation
for more information.
The load segment 2 indicator on the 2145 UPS-1U is lit (green) when power is
available to load segment 2.
The load segment 1 indicator on the 2145 UPS-1U is not currently used by the
SAN Volume Controller.
Note: When the 2145 UPS-1U is configured by the SAN Volume Controller, this
load segment is disabled. During normal operation, the load segment 1 indicator is
off. A "Do not use" label covers the receptacles.
Alarm indicator:
If the alarm is on, go to the 2145 UPS-1U MAP to resolve the problem.
On-battery indicator:
The amber on-battery indicator is on when the 2145 UPS-1U is powered by the
battery. This indicates that the main power source has failed.
If the on-battery indicator is on, go to the 2145 UPS-1U MAP to resolve the
problem.
Overload indicator:
The overload indicator lights up when the capacity of the 2145 UPS-1U is
exceeded.
If the overload indicator is on, go to MAP 5250: 2145 UPS-1U repair verification to
resolve the problem.
Power-on indicator:
When the power-on indicator is a steady green, the 2145 UPS-1U is active.
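The indicator descriptions above amount to a small lookup from a lit indicator to the documented next step. The following Python sketch records that mapping. The indicator keys are informal names chosen for illustration, and indicators that need no action (power-on, load segment 2) are simply ignored.

```python
# Documented next step for each 2145 UPS-1U front-panel indicator that signals a problem.
INDICATOR_ACTIONS = {
    "alarm": "Go to the 2145 UPS-1U MAP.",
    "on-battery": "Main power has failed; go to the 2145 UPS-1U MAP.",
    "overload": "Go to MAP 5250: 2145 UPS-1U repair verification.",
}


def actions_for(lit_indicators):
    """Return the documented action for each lit indicator that signals a problem."""
    # Informational indicators such as "power-on" or "load segment 2" have no entry.
    return [INDICATOR_ACTIONS[name] for name in lit_indicators if name in INDICATOR_ACTIONS]
```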
On or off button:
The on or off button turns the power on or off for the 2145 UPS-1U.
After you connect the 2145 UPS-1U to the outlet, it remains in standby mode until
you turn it on. Press and hold the on or off button until the power-on indicator is
illuminated (approximately five seconds). On some versions of the 2145 UPS-1U,
you might need a pointed device, such as a screwdriver, to press the on or off
button. A self-test is initiated that takes approximately 10 seconds, during which
time the indicators are turned on and off several times. The 2145 UPS-1U then
enters normal mode.
To turn off the 2145 UPS-1U, press and hold the on or off button until the
power-on light is extinguished (approximately five seconds). On some versions of
the 2145 UPS-1U, you might need a pointed device, such as a screwdriver, to press
the on or off button. This places the 2145 UPS-1U in standby mode. You must then
unplug the 2145 UPS-1U to turn off the unit.
Attention: Do not turn off the uninterruptible power supply before you shut
down the SAN Volume Controller node that it is connected to. Always follow the
instructions that are provided in MAP 5350 to perform an orderly shutdown of a
SAN Volume Controller node.
Use the test and alarm reset button to start the self-test.
To start the self-test, press and hold the test and alarm reset button for three
seconds. This button also resets the alarm.
Figure 41 shows the location of the connectors and switches on the 2145 UPS-1U.
Figure 42 on page 52 shows the DIP switches, which can be used to configure the
input and output voltage ranges. Because this function is performed by the SAN
Volume Controller software, both switches must be left in the OFF position.
Figure 42. 2145 UPS-1U DIP switches
The 2145 UPS-1U is equipped with ports that are not used by the SAN Volume
Controller and have not been tested. Use of these ports, in conjunction with the
SAN Volume Controller or any other application that might be used with the SAN
Volume Controller, is not supported. Figure 43 shows the 2145 UPS-1U ports that
are not used.
The following tables describe the physical characteristics of the 2145 UPS-1U.
Ensure that space is available in a rack that is capable of supporting the 2145
UPS-1U.
Table 34. Rack space required for the 2145 UPS-1U
Height            Width              Depth              Maximum weight
44 mm (1.73 in.)  439 mm (17.3 in.)  579 mm (22.8 in.)  16 kg (35.3 lb)
Note: The 2145 UPS-1U package, which includes support rails, weighs 18.8 kg (41.4 lb).
Heat output
The 2145 UPS-1U unit produces the following approximate heat output.
Table 35. Heat output of the 2145 UPS-1U
Model        Heat output during normal operation  Heat output during battery operation
2145 UPS-1U  10 W (34 Btu per hour)               150 W (512 Btu per hour)
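The Btu-per-hour figures in Table 35 and in the redundant AC-power switch description follow from the standard conversion of 1 W to about 3.412 Btu per hour. The following Python sketch reproduces them, which can be useful when you add these units to a rack heat budget; the function name is illustrative.

```python
# 1 W of dissipated power is approximately 3.412142 Btu (International Table) per hour.
BTU_PER_HOUR_PER_WATT = 3.412142


def watts_to_btu_per_hour(watts):
    """Convert a heat output in watts to Btu per hour."""
    return watts * BTU_PER_HOUR_PER_WATT


# Heat outputs quoted in this guide, in watts.
for label, watts in [("2145 UPS-1U, normal operation", 10),
                     ("2145 UPS-1U, battery operation", 150),
                     ("redundant AC-power switch", 20)]:
    print(f"{label}: {watts} W is about {watts_to_btu_per_hour(watts):.0f} Btu per hour")
```

The first two values round to 34 and 512 Btu per hour, as in Table 35. The 20 W switch works out to about 68 Btu per hour, which the guide rounds to approximately 70.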
Refer to the SAN Volume Controller 2145-DH8 parts topic for the list of SAN
Volume Controller 2145-DH8 FRUs.
The following tables provide information about the SAN Volume Controller 2145-24F
expansion enclosure parts and SAS drives.
Table 36 on page 54 lists the FRUs for the SAN Volume Controller 2145-24F
expansion enclosure.
Table 37 lists the FRUs for the SAN Volume Controller 2145-24F SAS drive units.
Table 37. FRU part for the SAN Volume Controller 2145-24F SAS drive units
Description                               FRU part number
Mini SAS HD to mini SAS HD, 1.5 m, 12 Gb  00AR311
Mini SAS HD to mini SAS HD, 3.0 m, 12 Gb  00AR317
FRU                                 Description
Redundant AC-power switch assembly  The redundant AC-power switch and its input power cables.
Note: The front panel display is replaced by a technician port on some models.
You use the management GUI to manage and service your system. The Monitoring
> Events panel provides access to problems that must be fixed and maintenance
procedures that step you through the process of correcting the problem.
Some events require a certain number of occurrences in 25 hours before they are
displayed as unfixed. If they do not reach this threshold in 25 hours, they are
flagged as expired. Monitoring events are below the coalesce threshold and are
usually transient.
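The expiry rule can be pictured as a count of occurrences inside a 25-hour window. The following Python sketch is an illustrative model only: the guide does not publish the per-event occurrence threshold, so threshold is a hypothetical parameter, and anchoring the window at the first occurrence is an assumption.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=25)  # window length stated in the guide


def classify(occurrences, threshold, now):
    """Classify an event from its occurrence timestamps (assumed model).

    Returns "unfixed" if at least `threshold` occurrences fall within 25 hours of
    the first one, "expired" if that window has closed without reaching the
    threshold, and "monitoring" while the event is still below the threshold.
    """
    first = min(occurrences)
    in_window = [t for t in occurrences if t - first <= WINDOW]
    if len(in_window) >= threshold:
        return "unfixed"
    if now - first > WINDOW:
        return "expired"
    return "monitoring"


start = datetime(2015, 1, 1, 8, 0)
print(classify([start, start + timedelta(hours=3)], threshold=3, now=start + timedelta(hours=4)))
```

Here two occurrences against a hypothetical threshold of three leave the event in the monitoring state until the window closes.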
You can also sort events by time or error code. When you sort by error code, the
most serious events, those with the lowest numbers, are displayed first. You can
select any event that is listed and select Actions > Properties to view details about
the event.
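Sorting by error code puts the lowest, most serious codes first. The following Python sketch shows that ordering on a hypothetical list of events. The Event fields and sample codes are placeholders, and showing the newest entry first among equal codes is an assumed tie-break.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    error_code: int
    logged: datetime
    description: str


def by_error_code(events):
    """Order events by error code (lowest first), newest first within the same code."""
    newest_first = sorted(events, key=lambda e: e.logged, reverse=True)
    # sorted() is stable, so the newest-first order survives within each error code.
    return sorted(newest_first, key=lambda e: e.error_code)


events = [
    Event(3000, datetime(2015, 6, 1, 9, 0), "hypothetical event"),
    Event(1000, datetime(2015, 6, 1, 8, 0), "hypothetical event"),
    Event(1000, datetime(2015, 6, 1, 10, 0), "hypothetical event"),
]
print([(e.error_code, e.logged.hour) for e in by_error_code(events)])
# [(1000, 10), (1000, 8), (3000, 9)]
```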
v Recommended Actions. For each problem that is selected, you can:
Run a fix procedure.
View the properties.
v Event log. For each entry that is selected, you can:
Run a fix procedure.
Mark an event as fixed.
Filter the entries to show them by specific minutes, hours, or dates.
Reset the date filter.
View the properties.
Regularly monitor the status of the system using the management GUI. If you
suspect a problem, use the management GUI first to diagnose and resolve the
problem.
Use the views that are available in the management GUI to verify the status of the
system, the hardware devices, the physical storage, and the available volumes. The
Monitoring > Events panel provides access to all problems that exist on the
system. Use the Recommended Actions filter to display the most important events
that need to be resolved.
If there is a service error code for the alert, you can run a fix procedure that assists
you in resolving the problem. These fix procedures analyze the system and provide
more information about the problem. They suggest actions to take and step you
through the actions that automatically manage the system where necessary. Finally,
they check that the problem is resolved.
If an error is reported, always use the fix procedures within the management GUI
to resolve the problem. Always use the fix procedures for both
system configuration problems and hardware failures. The fix procedures analyze
the system to ensure that the required changes do not cause volumes to be
inaccessible to the hosts. The fix procedures automatically perform configuration
changes that are required to return the system to its optimum state.
You must use a supported web browser. For a list of supported browsers, refer to
the Web browser requirements to access the management GUI topic.
You can use the management GUI to manage your system as soon as you have
created a clustered system.
Procedure
1. Start a supported web browser and point the browser to the management IP
address of your system.
The management IP address is set when the clustered system is created. Up to
four addresses can be configured for your use. There are two addresses for
IPv4 access and two addresses for IPv6 access. When the connection is
successful, you will see a login panel.
2. Log on by using your user name and password.
3. When you have logged on, select Monitoring > Events.
4. Ensure that the event log is filtered by using Recommended actions.
5. Select the recommended action and run the fix procedure.
6. Continue to work through the alerts in the order suggested, if possible.
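Step 1 needs a browser URL for one of the management IP addresses. The following Python sketch builds that URL and checks the documented limit of two IPv4 and two IPv6 addresses. It assumes that the management GUI is reached over HTTPS on the default port; the function names are illustrative. An IPv6 address must be enclosed in square brackets inside a URL.

```python
import ipaddress


def management_gui_url(address):
    """Return the HTTPS URL for a management IP address (IPv4 or IPv6)."""
    ip = ipaddress.ip_address(address)  # raises ValueError for a malformed address
    # IPv6 literals must be enclosed in brackets inside a URL (RFC 3986).
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"https://{host}/"


def management_urls(addresses):
    """Check the two-IPv4 / two-IPv6 limit, then return one URL per address."""
    versions = [ipaddress.ip_address(a).version for a in addresses]
    if versions.count(4) > 2 or versions.count(6) > 2:
        raise ValueError("at most two IPv4 and two IPv6 management addresses can be configured")
    return [management_gui_url(a) for a in addresses]


print(management_urls(["192.0.2.10", "2001:db8::10"]))
# ['https://192.0.2.10/', 'https://[2001:db8::10]/']
```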
Results
After all the alerts are fixed, check the status of your system to ensure that it is
operating as intended.
The cache on the selected node is flushed before the node is taken offline. In some
circumstances, such as when the system is already degraded (for example, when
both nodes in the I/O group are online and the volumes within the I/O group are
degraded), the system ensures that data loss does not occur as a result of deleting
the only node with the cache data. If a failure occurs on the other node in the I/O
group, the cache is flushed before the node is removed to prevent data loss.
Before deleting a node from the system, record the node serial number, worldwide
node name (WWNN), all worldwide port names (WWPNs), and the I/O group
that the node is currently part of. Recording this information can help you avoid
data corruption if the node is re-added to the system later.
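Keeping this information in a machine-readable form makes it easy to compare when the node is re-added. The following Python sketch stores the four recorded items as JSON. The NodeRecord field names and the sample serial number, WWNN, and WWPN values are hypothetical placeholders, not values taken from a real system.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class NodeRecord:
    """Node details to record before deletion (field names are illustrative)."""
    serial_number: str
    wwnn: str
    io_group: str
    wwpns: list = field(default_factory=list)


def to_json(record):
    """Serialize a record so that it can be kept with your system documentation."""
    return json.dumps(asdict(record), indent=2)


def from_json(text):
    """Rebuild a record from its JSON text."""
    return NodeRecord(**json.loads(text))


record = NodeRecord(
    serial_number="HYPOTHETICAL01",
    wwnn="500507680100ABCD",
    io_group="io_grp0",
    wwpns=["500507680140ABCD", "500507680130ABCD"],
)
saved = to_json(record)  # write this text to a file that you keep
```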
Chapter 3. SAN Volume Controller user interfaces for servicing your system 61
Attention:
v If you are removing a single node and the remaining node in the I/O group is
online, the data on the remaining node goes into write-through mode. This data
can be exposed to a single point of failure if the remaining node fails.
v If the volumes are already degraded before you remove a node, redundancy to
the volumes is already reduced, and removing the node might result in a loss of
access to data and data loss.
v Removing the last node in the system destroys the system. Before you remove
the last node in the system, ensure that you want to destroy the system.
v When you remove a node, you remove all redundancy from the I/O group. As a
result, new or existing failures can cause I/O errors on the hosts. The following
failures can occur:
Host configuration errors
Zoning errors
Multipathing-software configuration errors
v If you are deleting the last node in an I/O group and there are volumes that are
assigned to the I/O group, you cannot remove the node from the system if the
node is online. You must back up or migrate all data that you want to save
before you remove the node. If the node is offline, you can remove the node.
v When you remove the configuration node, the configuration function moves to a
different node within the system. This process can take a short time, typically
less than a minute. The management GUI reattaches to the new configuration
node transparently.
v If you turn the power on to the node that has been removed and it is still
connected to the same fabric or zone, it attempts to rejoin the system. The
system tells the node to remove itself from the system and the node becomes a
candidate for addition to this system or another system.
v If you are adding this node back into the system, ensure that you add it to the
same I/O group that it was previously a member of. Failure to do so can result in
data corruption.
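The notes above can be turned into a short pre-removal review. The following Python sketch assumes that you have already gathered the listed facts about the node, for example from the management GUI. The RemovalContext structure and the message wording are illustrative, and the sketch does not replace the checks that the system itself performs.

```python
from dataclasses import dataclass


@dataclass
class RemovalContext:
    """Facts about the node to be removed (a hypothetical structure you fill in)."""
    node_online: bool
    partner_online: bool
    nodes_in_system: int
    nodes_in_io_group: int
    io_group_has_volumes: bool
    volumes_degraded: bool
    is_config_node: bool


def removal_review(ctx):
    """Return (allowed, notes) based on the Attention notes for removing a node."""
    notes = []
    allowed = True
    if ctx.nodes_in_system == 1:
        notes.append("Removing the last node in the system destroys the system.")
    if ctx.nodes_in_io_group == 1 and ctx.io_group_has_volumes and ctx.node_online:
        allowed = False
        notes.append("Last node in an I/O group with volumes: back up or migrate data first.")
    if ctx.nodes_in_io_group == 2 and ctx.partner_online:
        notes.append("The remaining node goes into write-through mode, a single point of failure.")
    if ctx.volumes_degraded:
        notes.append("Volumes are already degraded: removal might cause loss of access or data loss.")
    if ctx.is_config_node:
        notes.append("The configuration role moves to another node, typically in under a minute.")
    return allowed, notes
```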
This task assumes that you have already accessed the management GUI.
Procedure
1. Select Monitoring > System.
2. Right-click the node that you want to remove and select Remove.
If the node that you want to remove is shown as Offline, then the node is not
participating in the system.
If the node that you want to remove is shown as Online, deleting the node can
cause the dependent volumes to also go offline. Verify whether the node has
any dependent volumes.
3. To check for dependent volumes before attempting to remove the node,
right-click the node and select Show Dependent Volumes.
If any volumes are listed, determine why, and whether access to the volumes is
required while the node is removed from the system. If the volumes are
assigned from storage pools that contain flash drives that are located in the
node, check why the volume mirror, if it is configured, is not synchronized.
There can also be dependent volumes because the partner node in the I/O
You can use either the management GUI or the command-line interface to add a
node to the system. Some models might require using the front panel to verify that
the new node has been added correctly.
Before you add a node to a system, you must make sure that the switch zoning is
configured such that the node being added is in the same zone as all other nodes
in the system. If you are replacing a node and the switch is zoned by worldwide
port name (WWPN) rather than by switch port, make sure that the switch is
configured such that the node being added is in the same VSAN or zone.
If you are adding a node that has been used previously, either within a different
I/O group within this system or within a different system, take into account that if
you add a node without changing its worldwide node name (WWNN), hosts
might detect the node and use it as if it were in its old location. This action might
cause the hosts to access the wrong volumes.
v You must ensure that the model type of the new node is supported by the
software level that is currently installed on the system. If the model type is not
supported by the software level, update the system to a software level that
supports the model type of the new node.
v Each node in an I/O group must be connected to a different uninterruptible
power supply.
v If you are re-adding a node back to the same I/O group after a service action
required the node to be deleted from the system and the physical node has not
changed, no special procedures are required and the node can be added back to
the system.
v If you are replacing a node in a system either because of a node failure or an
update, you must change the WWNN of the new node to match that of the
original node before you connect the node to the Fibre Channel network and
add the node to the system.
v If you are adding a node to the SAN again, ensure that you are adding the node
to the same I/O group from which it was removed. Failure to do so can
result in data corruption. You must use the information that was recorded when
the node was originally added to the system. If you do not have access to this
information, contact the support center to add the node back into the system
without corrupting the data.
v For each external storage system, the LUNs that are presented to the ports on
the new node must be the same as the LUNs that are presented to the nodes
that currently exist in the system. You must ensure that the LUNs are the same
before you add the new node to the system.
v If you are creating an I/O group in the system and are adding a new node,
there are no special procedures because this node was never added to a system
and the WWNN for the node did not exist.
v If you are creating an I/O group in the system and are adding a new node that
has been added to a system before, the host system might still be configured to
the node WWPNs and the node might still be zoned in the fabric. Because you
cannot change the WWNN for the node, you must ensure that other components
in your fabric are configured correctly. Verify that any host that was previously
configured to use the node has been correctly updated.
v If the node that you are adding was previously replaced, either for a node repair
or update, you might have used the WWNN of that node for the replacement
node. Ensure that the WWNN of this node was updated so that you do not have
two nodes with the same WWNN attached to your fabric. Also ensure that the
WWNN of the node that you are adding is not 00000. If it is 00000, contact your
support representative.
v If the new node supports encryption, you must ensure that the following
requirements are true before the node can be added:
The node is licensed to use encryption. When the node is added in the
management GUI, you are asked to activate the license for each node that is
detected that supports encryption. An authorization code is sent with license
documentation and must be used to activate the license. Retain all license
documentation for your records.
The node is running a software level that supports encryption.
v If you are adding the new node to a system with either a HyperSwap or
stretched system topology, you must assign the node to a specific site.
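Several of these prerequisites are simple yes/no checks that you can run against your own inventory before you zone and cable the node. The following Python sketch covers the model, WWNN, encryption, and site checks. The Candidate fields, the topology strings ("standard", "stretched", "hyperswap"), and the choice to test the last five digits of the WWNN for 00000 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    """Facts about the node to be added (hypothetical structure you fill in)."""
    wwnn: str
    model_supported: bool      # supported by the software level installed on the system
    supports_encryption: bool
    encryption_licensed: bool
    site: Optional[str] = None


def add_node_blockers(candidate: Candidate, fabric_wwnns: Sequence[str], topology: str):
    """Return the prerequisites that are not yet met (an empty list means none found)."""
    problems = []
    if not candidate.model_supported:
        problems.append("Update the system to a software level that supports this model.")
    if candidate.wwnn[-5:] == "00000":
        problems.append("WWNN is 00000: contact your support representative.")
    if candidate.wwnn.upper() in {w.upper() for w in fabric_wwnns}:
        problems.append("Another node on the fabric already uses this WWNN.")
    if candidate.supports_encryption and not candidate.encryption_licensed:
        problems.append("Activate the encryption license for this node.")
    if topology in ("hyperswap", "stretched") and not candidate.site:
        problems.append("Assign the node to a site.")
    return problems
```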
After the new node is zoned and cabled correctly to the existing system, you can
use either the addnode command or the Add Node wizard in the management
GUI. To access the Add Node wizard, select Monitoring > System. On the image,
click the new node to launch the wizard. Complete the wizard and verify the new
node. If the new node is not displayed in the image, it indicates a potential cabling
issue. Check the installation information to ensure that your node was cabled
correctly.
The id parameter displays the WWNN for the node. If the node is not detected,
verify cabling to the node.
2. Enter this command to determine the I/O group where the node should be
added:
lsiogrp
3. Record the name or ID of the first I/O group that has a node count of zero (0).
You will need the name or ID for the next step. Note: You only need to do this
step for the first node that is added. The second node of the pair uses the same
I/O group number.
4. Enter this command to add the node to the system:
addnode -wwnodename WWNN -iogrp iogrp_name -name new_name_arg -site site_name
Where WWNN is the WWNN of the node, iogrp_name is the name of the I/O
group that you want to add the node to and new_name_arg is the name that you
want to assign to the node. If you do not specify a new node name, a default
name is assigned; however, it is recommended that you specify a meaningful
name. The site_name specifies the name of the site location of the new node.
This parameter is only required if the topology is a HyperSwap or stretched
system.
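Steps 2 to 4 can be scripted when you add several nodes. The following Python sketch picks the first I/O group with a node count of zero and builds the addnode command line, adding -site only for HyperSwap or stretched topologies. It assumes that you have already parsed the lsiogrp output into dictionaries with name and node_count keys; the sample WWNN, node name, and site name are hypothetical.

```python
import shlex


def first_empty_io_group(io_groups):
    """Return the name of the first I/O group whose node count is zero, or None."""
    for group in io_groups:
        if int(group["node_count"]) == 0:
            return group["name"]
    return None


def addnode_command(wwnn, iogrp, name=None, site=None, topology="standard"):
    """Build an addnode command line from the parameters described above."""
    args = ["addnode", "-wwnodename", wwnn, "-iogrp", iogrp]
    if name:  # optional; a default name is assigned when omitted
        args += ["-name", name]
    if topology in ("hyperswap", "stretched"):
        if not site:
            raise ValueError("a site is required for HyperSwap or stretched topologies")
        args += ["-site", site]
    return shlex.join(args)  # quotes any argument that contains shell metacharacters


groups = [{"name": "io_grp0", "node_count": "2"}, {"name": "io_grp1", "node_count": "0"}]
print(addnode_command("5005076801012345", first_empty_io_group(groups), name="node3"))
# addnode -wwnodename 5005076801012345 -iogrp io_grp1 -name node3
```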
svcinfo lsnodecandidate
The id parameter displays the WWNN for the node. Ensure that the last 5
digits that are displayed match the WWNN on the front panel. If the node is
not detected, verify cabling to the node.
3. Enter this command to determine the I/O group where the node should be
added:
lsiogrp
4. Record the name or ID of the first I/O group that has a node count of zero (0).
You will need the ID for the next step. Note: You only need to do this step for
the first node that is added. The second node of the pair uses the same I/O
group number.
5. Enter this command to add the node to the system:
addnode -wwnodename WWNN -iogrp iogrp_name -name newnodename -site newsitename
Where WWNN is the WWNN of the node, iogrp_name is the name or ID of the
I/O group that you want to add the node to and newnodename is the name that
you want to assign to the node. If you do not specify a new node name, a
default name is assigned; however, it is recommended that you specify a
meaningful name. The newsitename specifies the name of the site location of the
new node. This parameter is only required if the topology is a HyperSwap or
stretched system.
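Before step 5, you can confirm that the candidate you are about to add is the node in front of you. The following Python sketch compares the last five digits of each candidate id (the WWNN, as listed by svcinfo lsnodecandidate) with the digits read from the front panel; the sample ids are hypothetical. Exactly one match is the expected result; no match suggests a cabling problem.

```python
def matching_candidates(candidate_ids, panel_digits):
    """Return the candidate WWNNs whose last five digits match the front-panel digits."""
    wanted = panel_digits.strip().upper()
    return [wwnn for wwnn in candidate_ids if wwnn.strip().upper()[-5:] == wanted]


candidates = ["5005076801012345", "50050768010ABCDE"]
print(matching_candidates(candidates, "abcde"))
# ['50050768010ABCDE']
```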
If a node shows node error 578 or node error 690, the node is in service state.
Complete the following steps from the front panel to exit service state:
1. Press and release the Up or Down button until the Actions? option displays.
2. Press the Select button.
3. Press and release the Up or Down button until the Exit Service? option
displays.
4. Press the Select button.
5. Press and release the Left or Right button until the Confirm Exit? option
displays.
6. Press the Select button.
The node might be in a service state because it has a hardware issue, has corrupted
data, or has lost its configuration data.
The management GUI operates only when there is an online clustered system. Use
the service assistant if you are unable to create a clustered system.
The service assistant provides detailed status and error summaries, and the ability
to modify the worldwide node name (WWNN) for each node.
You must use a supported web browser. For a list of supported browsers, refer to
the topic Web browser requirements to access the management GUI.
Procedure
2. Log on to the service assistant using the superuser password.
If you do not know the current superuser password, try to obtain it. If you
cannot obtain it, reset the password.
Results
Command-line interface
Use the command-line interface (CLI) to manage a system with task commands
and information commands.
For a full description of the commands and how to start an SSH command-line
session, see the Command-line interface section of the SAN Volume Controller
Information Center.
Nearly all of the flexibility that is offered by the CLI is available through the
management GUI. However, the CLI does not provide the fix procedures that are
available in the management GUI. Therefore, use the fix procedures in the
management GUI to resolve the problems. Use the CLI when you require a
configuration setting that is unavailable in the management GUI.
You might also find it useful to create command scripts that use CLI commands to
monitor certain conditions or to automate configuration changes that you make
regularly.
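As an example of such a monitoring script, the following Python sketch runs lsnode over SSH and reports any node whose status is not online. It assumes key-based SSH access to the system, uses the -delim option so that the output is easy to split, and looks up columns by their header names; the host and user names are placeholders. A comma is used as the delimiter because some values, such as iSCSI names, contain colons.

```python
import subprocess


def parse_delimited(output, delim=","):
    """Split CLI output produced with -delim into one dict per row, keyed by header."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split(delim)
    return [dict(zip(header, line.split(delim))) for line in lines[1:]]


def nodes_not_online(lsnode_output):
    """Return the names of nodes whose status column is anything other than online."""
    return [row["name"] for row in parse_delimited(lsnode_output) if row.get("status") != "online"]


def run_cli(host, user, command):
    """Run one CLI command over SSH and return its standard output."""
    result = subprocess.run(["ssh", f"{user}@{host}", command],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Example with placeholder names:
# nodes_not_online(run_cli("cluster.example.com", "monitor", "lsnode -delim ,"))
```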
Note: The service command-line interface can also be accessed through the
technician port.
For a full description of the commands and how to start an SSH command-line
session, see the Command-line interface topic in the Reference section of this
product information.
To access a node directly, it is normally easier to use the service assistant with its
graphical interface and extensive help facilities.
When a USB flash drive is inserted into one of the USB ports on a SAN Volume
Controller node, the software searches for a control file on the USB flash drive and
runs the command that is specified in the file. When the command completes, the
command results and node status information are written to the USB flash drive.
When a USB flash drive is plugged into a node canister, the node canister code
searches for a text file named satask.txt in the root directory. If the code finds the
file, it attempts to run a command that is specified in the file. When the command
completes, a file called satask_result.html is written to the root directory of the
USB flash drive. If this file does not exist, it is created. If it exists, the data is
inserted at the start of the file. The file contains the details and results of the
command that was run and the status and the configuration information from the
node canister. The status and configuration information matches the detail that is
shown on the service assistant home page panels.
The fault light-emitting diode (LED) on the node canister flashes when the USB
service action is being performed. When the fault LED stops flashing, it is safe to
remove the USB flash drive.
Results
The USB flash drive can then be plugged into a workstation and the
satask_result.html file viewed in a web browser.
To protect from accidentally running the same command again, the satask.txt file
is deleted after it has been read.
If no satask.txt file is found on the USB flash drive, the result file is still created,
if necessary, and the status and configuration data is written to it.
satask.txt commands
If you are creating the satask.txt command file by using a text editor, the file
must contain a single command on a single line in the file.
The commands that you use are the same as the service CLI commands except
where noted. Not all service CLI commands can be run from the USB flash drive.
The satask.txt commands always run on the node that the USB flash drive is
plugged into.
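Because satask.txt must hold exactly one command on one line, it is worth validating the text before you copy it to the flash drive. The following Python sketch does that and, after the node has processed the drive, checks the two signs of completion described above: satask.txt has been deleted and satask_result.html exists. The function names and the usb_root argument (the mount point of the flash drive on your workstation) are illustrative assumptions.

```python
from pathlib import Path


def satask_text(command):
    """Validate a service command and return the satask.txt file contents."""
    command = command.strip()
    if not command or "\n" in command or "\r" in command:
        raise ValueError("satask.txt must contain a single command on a single line")
    return command + "\n"


def write_satask(usb_root, command):
    """Write satask.txt to the root directory of the USB flash drive."""
    path = Path(usb_root) / "satask.txt"
    # ASCII encoding makes unexpected non-ASCII characters raise an error.
    path.write_text(satask_text(command), encoding="ascii")
    return path


def command_processed(usb_root):
    """True once satask.txt has been deleted and satask_result.html has been written."""
    root = Path(usb_root)
    return not (root / "satask.txt").exists() and (root / "satask_result.html").exists()
```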
Use this command to obtain service assistant access to a node canister even if the
current state of the node canister is unknown. Physical access to the node
canister is required and is used to authenticate the action.
Syntax
-resetpassword
Parameters
-serviceip ipv4
The IPv4 address for the service assistant.
-gw ipv4
The IPv4 gateway for the service assistant.
-mask ipv4
The IPv4 subnet for the service assistant.
-serviceip_6 ipv6
The IPv6 address for the service assistant.
-gw_6 ipv6
The IPv6 gateway for the service assistant.
-prefix_6 int
The IPv6 prefix for the service assistant.
-resetpassword
Sets the service assistant password to the default value.
Description
This command resets the service assistant IP address to the default value.
Depending on the node canister on which the command is run, the default value is
192.168.70.121 with subnet mask 255.255.255.0, or 192.168.70.122.
If the node canister becomes active in a system, the superuser password is reset to
that of the system. You can configure the system to disable resetting the superuser
password. If you disable that function, this action fails.
This action calls the satask chserviceip command and the satask resetpassword
command.
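The IPv4 and IPv6 parameters above come in sets. The following Python sketch assembles one satask chserviceip line for satask.txt from either set and checks that the gateway lies in the service IP subnet. It assumes that -serviceip, -gw, and -mask (or -serviceip_6, -gw_6, and -prefix_6) are given together and that -resetpassword is appended when needed; confirm the exact syntax for your code level. The sample IPv6 addresses are documentation examples.

```python
import ipaddress
import shlex


def chserviceip_line(address, gateway, mask_or_prefix, reset_password=False):
    """Build a satask chserviceip line.

    mask_or_prefix is a dotted subnet mask for IPv4 or a prefix length for IPv6.
    """
    ip = ipaddress.ip_address(address)
    network = ipaddress.ip_network(f"{ip}/{mask_or_prefix}", strict=False)
    gw = ipaddress.ip_address(gateway)
    if gw not in network:  # also false when the IP versions differ
        raise ValueError(f"gateway {gw} is not in subnet {network}")
    if ip.version == 4:
        args = ["-serviceip", str(ip), "-gw", str(gw), "-mask", str(network.netmask)]
    else:
        args = ["-serviceip_6", str(ip), "-gw_6", str(gw), "-prefix_6", str(network.prefixlen)]
    if reset_password:
        args.append("-resetpassword")
    return shlex.join(["satask", "chserviceip", *args])
```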
Use this command when you are unable to log on to the system because you have
forgotten the superuser password and want to reset it.
Syntax
satask resetpassword
Parameters
None.
Description
This command resets the service assistant password to the default value passw0rd.
If the node canister is active in a system, the superuser password for the system is
reset; otherwise, the superuser password is reset on the node canister.
If the node canister becomes active in a system, the superuser password is reset to
that of the system. You can configure the system to disable resetting the superuser
password. If you disable that function, this action fails.
Snap command:
Use the snap command to collect diagnostic information from the node canister
and to write the output to a USB flash drive.
Syntax
satask snap [-dump] [-noimm] [panel_name]
Parameters
-dump
(Optional) Includes the most recent dump file in the output.
-noimm
(Optional) Excludes the /dumps/imm.ffdc file from the output.
panel_name
(Optional) Indicates the node on which to execute the snap command.
Description
An invocation example
satask snap -dump 111584
Use this command to install a specific update package on the node canister.
Syntax
Parameters
-file filename
(Required) The filename designates the name of the update package.
-ignore | -pacedccu
(Optional) Overrides prerequisite checking and forces installation of the update
package.
Description
This command copies the file from the USB flash drive to the update directory on
the node canister and then installs the update package.
Use this command to change the system IP address of the storage system.
It is best to use the initialization tool to create this command in the satask.txt
file, together with the associated clitask.txt file that changes the management IP
addresses of the file modules.
Syntax
satask setsystemip -systemip ipv4 -gw ipv4 -mask ipv4 -consoleip ipv4
Parameters
-systemip
The IPv4 address for Ethernet port 1 on the system.
-gw
The IPv4 gateway for Ethernet port 1 on the system.
-mask
The IPv4 subnet mask for Ethernet port 1 on the system.
-consoleip
The management IPv4 address of the SAN Volume Controller system.
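The following Python sketch shows one way to assemble the setsystemip line for satask.txt and to catch obvious typing errors before the USB flash drive is inserted. It is an illustration under stated assumptions, not an IBM tool: the helper names build_setsystemip and write_satask_file are hypothetical, the check that the gateway lies inside the subnet is a local sanity check rather than documented satask behavior, and writing satask.txt to the root of the drive mounted at usb_root is an assumption about your workstation.

```python
import ipaddress
from pathlib import Path


def build_setsystemip(systemip: str, gw: str, mask: str, consoleip: str) -> str:
    """Return one satask setsystemip command line after local sanity checks.

    Hypothetical helper: satask itself does not run these checks.
    """
    # Each constructor raises ValueError if the text is not a valid IPv4 address.
    system = ipaddress.IPv4Address(systemip)
    gateway = ipaddress.IPv4Address(gw)
    console = ipaddress.IPv4Address(consoleip)
    # strict=False accepts a host address with a dotted mask; a non-contiguous
    # mask such as 255.0.255.0 raises NetmaskValueError (a ValueError subclass).
    network = ipaddress.IPv4Network(f"{system}/{mask}", strict=False)
    # Assumption: the gateway must be on the same subnet as Ethernet port 1.
    if gateway not in network:
        raise ValueError(f"gateway {gateway} is outside subnet {network}")
    return (f"satask setsystemip -systemip {system} -gw {gateway} "
            f"-mask {network.netmask} -consoleip {console}")


def write_satask_file(usb_root: str, command: str) -> Path:
    """Write the command to satask.txt at the root of the mounted flash drive.

    usb_root is a hypothetical mount point, for example /media/usb.
    """
    path = Path(usb_root) / "satask.txt"
    path.write_text(command + "\n")
    return path
```

With the addresses from the example for this command, build_setsystemip returns the same command line; a gateway outside the subnet or a non-contiguous mask raises ValueError instead of producing a satask.txt file that could leave the system unreachable.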
Description
This command is only supported in the satask.txt file on a USB flash drive.
It calls the svctask chsystemip command if the USB flash drive is inserted in the
configuration node canister. Otherwise, it blinks the amber identify LED of the
node canister that is the configuration node.
If the amber identify LED of a different node canister starts to blink, move the
USB flash drive to that node canister, because it is the configuration node.
When the amber LED turns off, move the USB flash drive to one of the file
modules so that the clitask.txt file is used to change the file module
management IP addresses.
Leave the USB flash drive in the file module for at least two minutes before you
remove it. Use a workstation to check the clitask_results.txt and satask.txt results
files on the USB flash drive.
If the IP address change was successful, run the startmgtsrv -r command to
restart the management service so that it no longer sends SSH commands to the
old system IP address of the volume storage system.
For example, the satask.txt file might contain the following command:
satask setsystemip -systemip 123.123.123.20 -gw 123.123.123.1 -mask 255.255.255.0 -consoleip 123.123.123.10
You can now access the management GUI, which you can use to change any other
IP address that needs to be changed.
Use this command to determine the current service state of the node canister.
Syntax
sainfo getstatus
Parameters
None.
Description
This command writes the output from each node canister to the USB flash drive.
See SAN Volume Controller 2145-DH8 ports used during service procedures on
page 25.
The technician port replaces the front panel display and navigation buttons on
previous models of SAN Volume Controller nodes. The technician port provides
direct access to the service assistant GUI and command-line interface (CLI).
To use the technician port, connect it directly, by using a standard 1 Gbps
Ethernet cable, to a computer that has web browsing software and is configured
for Dynamic Host Configuration Protocol (DHCP).
If the node has candidate status when you open the web browser, the initialization
tool is displayed. Otherwise, the service assistant interface is displayed.
Important: Do not use the initialization tool on a node if any other node in the
system is already active. For example, a node status LED that is lit solid on any
node of the system indicates an active node.
To access the service assistant through the technician port when the node status is
candidate, change the web page address to:
https://service/service/
Alternatively, change the node status to service and then reload the web page. For
example, you can change the node status by using the following service CLI command:
satask startservice
See the following information for how to access the CLI through the technician
port. For other ways to access the CLI, see Chapter 3, SAN Volume Controller
user interfaces for servicing your system, on page 59.
To use the technician port, you must be next to the node. The technician port does
not work if it is connected to an Ethernet switch.
If you have Secure Shell (SSH) software on the computer that is directly connected
to the technician port, then you can use it to access the CLI as superuser at
192.168.0.1. The default superuser password is passw0rd.
Note: When your personal computer is configured with DHCP, the technician port
uses DHCP to reconfigure network services on your personal computer. Software
on your personal computer that was using these services might experience
network problems while it is connected to the technician port. For example,
selecting a link in a web page that was loaded before you connect to the technician
port might result in an error message.
Front panel interface
The front panel on each node has a small display and five control buttons. This
panel provides access to system and node status information, and a means to run
certain system configuration and recovery actions. For a detailed description of
using the front panel, see Chapter 6, Using the front panel of the SAN Volume
Controller, on page 91.
Note: The SAN Volume Controller 2145-DH8 does not have a front panel display or
navigation and select buttons. On this system, use the technician port to access the
service assistant interface.
Use the front panel when you are physically next to the system and are unable to
access one of the system interfaces.
Attention: Run the repairvdiskcopy command only if all volume copies are
synchronized.
When you issue the repairvdiskcopy command, you must use only one of the
-validate, -medium, or -resync parameters. You must also specify the name or ID
of the volume to be validated and repaired as the last entry on the command line.
After you issue the command, no output is displayed.
-validate
Use this parameter if you only want to verify that the mirrored volume copies
are identical. If any difference is found, the command stops and logs an error
that includes the logical block address (LBA) and the length of the first
difference. You can use this parameter repeatedly, starting at a different LBA each
time, to count the number of differences on a volume.
-medium
Use this parameter to convert sectors on all volume copies that contain
different contents into virtual medium errors. Upon completion, the command
logs an event, which indicates the number of differences that were found, the
number that were converted into medium errors, and the number that were
not converted. Use this option if you are unsure what the correct data is, and
you do not want an incorrect version of the data to be used.
-resync
Use this parameter to overwrite contents from the specified primary volume
copy to the other volume copy. The command corrects any differing sectors by
copying the sectors from the primary copy to the copies being compared. Upon
completion, the command logs an event that indicates the number
of differences that were corrected. Use this action if you are sure that either the
primary volume copy data is correct or that your host applications can handle
incorrect data.
-startlba lba
Optionally, use this parameter to specify the Logical Block Address (LBA) at
which to start the validation and repair. If you previously used the -validate
parameter, an error was logged with the LBA where the first difference, if any,
was found. Reissue repairvdiskcopy with that LBA to avoid reprocessing the
initial sectors that compared identically. Continue to reissue repairvdiskcopy
with this parameter to list all the differences.
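To see how repeated -validate runs with -startlba enumerate every difference, the following Python sketch models two volume copies as in-memory lists of sector contents. It does not call the system: first_difference is a hypothetical stand-in for one -validate run that reports the LBA and length of the first differing run, and the loop assumes that you resume at that LBA plus the logged length so that the same difference is not reported twice. Both copies are assumed to have the same length.

```python
from typing import List, Optional, Tuple


def first_difference(copy_a: List[int], copy_b: List[int],
                     start_lba: int) -> Optional[Tuple[int, int]]:
    """Model of one -validate run: return (lba, length) of the first run of
    differing sectors at or after start_lba, or None if the copies match."""
    lba = start_lba
    # Skip sectors that compare identically.
    while lba < len(copy_a) and copy_a[lba] == copy_b[lba]:
        lba += 1
    if lba == len(copy_a):
        return None
    # Measure how many consecutive sectors differ.
    end = lba
    while end < len(copy_a) and copy_a[end] != copy_b[end]:
        end += 1
    return lba, end - lba


def list_differences(copy_a: List[int],
                     copy_b: List[int]) -> List[Tuple[int, int]]:
    """Reissue the modelled validation, each time starting just past the
    last logged difference, until no further difference is found."""
    found = []
    start = 0
    while (diff := first_difference(copy_a, copy_b, start)) is not None:
        found.append(diff)
        lba, length = diff
        start = lba + length
    return found
```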
Notes:
1. Only one repairvdiskcopy command can run on a volume at a time.
2. Once you start the repairvdiskcopy command, you cannot use the command to
stop processing.
3. The primary copy of a mirrored volume cannot be changed while the
repairvdiskcopy -resync command is running.
4. If there is only one mirrored copy, the command returns immediately with an
error.
5. If a copy being compared goes offline, the command is halted with an error.
The command is not automatically resumed when the copy is brought back
online.
6. If one copy is readable but the other copy has a medium error, the command
automatically attempts to fix the medium error by writing the data that was read
from the readable copy.
7. If no differing sectors are found during repairvdiskcopy processing, an
informational error is logged at the end of the process.
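The difference between -resync and -medium can be illustrated with a similar in-memory model. In this hypothetical Python sketch, resync copies every differing sector from the primary copy to the other copy and counts the corrections, as the parameter description states, while mark_medium_errors only records which sectors differ and leaves both copies unchanged. The real -medium processing also reports sectors that it could not convert, which this simplified model does not represent.

```python
from typing import List, Set


def resync(primary: List[int], other: List[int]) -> int:
    """Model of -resync: overwrite each differing sector of the other copy
    with the primary copy's contents and return the number corrected."""
    corrected = 0
    for lba, value in enumerate(primary):
        if other[lba] != value:
            other[lba] = value
            corrected += 1
    return corrected


def mark_medium_errors(primary: List[int], other: List[int]) -> Set[int]:
    """Model of -medium: return the LBAs that would become virtual medium
    errors, without deciding which copy holds the correct data."""
    return {lba for lba, (a, b) in enumerate(zip(primary, other)) if a != b}
```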
To check the progress of validation and repair of mirrored volumes, issue the
following command:
lsrepairvdiskcopyprogress -delim :
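If you collect this output from a script, the -delim : option makes it straightforward to parse. The following Python sketch turns colon-delimited output into one dictionary per row by reading the header line. The sample output, including its column names, is hypothetical and abbreviated; check the header that your code level actually returns.

```python
import csv
import io
from typing import Dict, List


def parse_delimited(output: str, delim: str = ":") -> List[Dict[str, str]]:
    """Return one dict per data row of CLI output produced with -delim.

    The first line is treated as the header that supplies the keys.
    """
    reader = csv.DictReader(io.StringIO(output), delimiter=delim)
    return [dict(row) for row in reader]


# Hypothetical, abbreviated sample; column names are assumptions for illustration.
SAMPLE = """vdisk_id:vdisk_name:copy_id:task:progress:estimated_completion_time
0:vdisk0:0:validate:45:151007143000
0:vdisk0:1:validate:45:151007143000
"""

if __name__ == "__main__":
    for row in parse_delimited(SAMPLE):
        print(f"{row['vdisk_name']} copy {row['copy_id']}: "
              f"{row['task']} {row['progress']}% complete")
```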
If a repair operation completes successfully and the volume was previously offline
because of corrupted metadata, the command brings the volume back online. The
only limit on the number of concurrent repair operations is the number of volume
copies in the configuration.
Attention: Use this command only to repair a thin-provisioned volume that has
reported corrupt metadata.
Notes:
1. Because the volume is offline to the host, any I/O that is submitted to the
volume while it is being repaired fails.
2. When the repair operation completes successfully, the corrupted metadata error
is marked as fixed.
3. If the repair operation fails, the volume is held offline and an error is logged.
Note: Only run this command after you run the repairsevdiskcopy command,
which you must only run as required by the fix procedures recommended by your
support team.
If you have lost both nodes in an I/O group and have, therefore, lost access to all
the volumes that are associated with the I/O group, you must complete one of the
following procedures to regain access to your volumes. Depending on the failure
type, you might have lost data that was cached for these volumes and the volumes
are now offline.
One node in an I/O group has failed and failover has started on the second node.
During the failover process, the second node in the I/O group fails before the data
in the write cache is flushed to the back end. The first node is successfully repaired,
but its hardened data is not the most recent version that is committed to the data
store; therefore, it cannot be used. The second node is repaired or replaced and has
lost its hardened data; therefore, the node has no way of recognizing that it is part
of the system.
Chapter 4. Performing recovery actions using the SAN Volume Controller CLI 79
Complete the following steps to recover an offline volume when one node has
down-level hardened data and the other node has lost hardened data:
Procedure
1. Recover the node and add it back into the system.
2. Delete all IBM FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the offline volumes.
3. Run the recovervdisk, recovervdiskbyiogrp or recovervdiskbysystem
command.
4. Re-create all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the volumes.
Example
Both nodes in the I/O group have failed and have been repaired. The nodes have
lost their hardened data; therefore, the nodes have no way of recognizing that they
are part of the system.
Complete the following steps to recover an offline volume when both nodes have
lost their hardened data and cannot be recognized by the system:
1. Delete all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the offline volumes.
2. Run the recovervdisk, recovervdiskbyiogrp or recovervdiskbysystem
command.
3. Re-create all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the volumes.
Using different sets of commands, you can view the system VPD and the node
VPD. You can also view the VPD through the management GUI.
Procedure
1. In the management GUI, select Monitoring > System.
2. From the dynamic graphic of the system, select the node and click the icon to
the right of the Actions menu to download VPD information.
Procedure
1. Use the lsnode CLI command to display a concise list of nodes in the clustered
system.
Issue this CLI command to list the system nodes:
lsnode -delim :
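Output from lsnode -delim : can also be processed by a script, for example to find the configuration node or to spot an I/O group in which neither node is online, which is the situation that the volume recovery procedures address. This Python sketch is illustrative only: the sample output is hypothetical and omits many lsnode columns, and the helper functions are not part of the SAN Volume Controller CLI.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def parse_lsnode(output: str) -> List[Dict[str, str]]:
    """Split colon-delimited lsnode output into one dict per node."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split(":")
    return [dict(zip(header, line.split(":"))) for line in lines[1:]]


def config_node(nodes: List[Dict[str, str]]) -> Optional[str]:
    """Return the name of the node whose config_node field is 'yes', if any."""
    for node in nodes:
        if node.get("config_node") == "yes":
            return node["name"]
    return None


def io_groups_without_online_node(nodes: List[Dict[str, str]]) -> List[str]:
    """Return the I/O groups in which no node reports status 'online'."""
    statuses = defaultdict(list)
    for node in nodes:
        statuses[node["IO_group_name"]].append(node["status"])
    return sorted(group for group, states in statuses.items()
                  if "online" not in states)


# Hypothetical output with most lsnode columns omitted.
SAMPLE = """id:name:WWNN:status:IO_group_id:IO_group_name:config_node
1:node1:500507680100A001:online:0:io_grp0:yes
2:node2:500507680100A002:online:0:io_grp0:no
3:node3:500507680100A003:offline:1:io_grp1:no
4:node4:500507680100A004:offline:1:io_grp1:no
"""
```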
Procedure
Table 43 on page 84 shows the fields that you see for the system board.
Table 44 shows the fields that you see for the batteries.
Table 44. Fields for the batteries
Item Field name
Batteries Battery_FRU_part
Battery_part_identity
Battery_fault_led
Battery_charging_status
Battery_cycle_count
Battery_power_on_hours
Battery_last_recondition
Battery_midplane_FRU_part
Battery_midplane_part_identity
Battery_midplane_FW_version
Battery_power_cable_FRU_part
Battery_power_sense_cable_FRU_part
Battery_comms_cable_FRU_part
Battery_EPOW_cable_FRU_part
Table 46 shows the fields that you see for each fan that is installed.
Table 46. Fields for the fans
Item Field name
Fan Part number
Location
Table 47 shows the fields that are repeated for each installed memory module.
Table 47. Fields that are repeated for each installed memory module
Item Field name
Memory module Part number
Device location
Bank location
Size (MB)
Manufacturer (if available)
Serial number (if available)
Table 48 shows the fields that are repeated for each installed adapter.
Table 48. Fields that are repeated for each adapter that is installed
Item Field name
Adapter Adapter type
Part number
Port numbers
Location
Device serial number
Manufacturer
Device
Adapter revision
Chip revision
Table 50 shows the fields that are specific to the node software.
Table 50. Fields that are specific to the node software
Item Field name
Software Code level
Node name
Worldwide node name
ID
Unique string that is used in dump file names for this node
Table 51 shows the fields that are provided for the front panel assembly.
Table 51. Fields that are provided for the front panel assembly
Item Field name
Front panel Part number
Front panel ID
Front panel locale
Table 52 shows the fields that are provided for the Ethernet port.
Table 52. Fields that are provided for the Ethernet port
Item Field name
Ethernet port Port number
Ethernet port status
MAC address
Supported speeds
Table 53 on page 87 shows the fields that are provided for the power supplies in
the node.
Table 54 shows the fields that are provided for the uninterruptible power supply
assembly that is powering the node.
Table 54. Fields that are provided for the uninterruptible power supply assembly that is
powering the node
Item Field name
Uninterruptible power supply Electronics assembly part number
Battery part number
Frame assembly part number
Input power cable part number
UPS serial number
UPS type
UPS internal part number
UPS unique ID
UPS main firmware
UPS communications firmware
Table 55 shows the fields that are provided for the SAS host bus adapter (HBA).
Table 55. Fields that are provided for the SAS host bus adapter (HBA)
Item Field name
SAS HBA Part number
Port numbers
Device serial number
Manufacturer
Device
Adapter revision
Chip revision
Table 56 on page 88 shows the fields that are provided for the SAS flash drive.
Table 57 shows the fields that are provided for the small form factor pluggable
(SFP) transceiver.
Table 57. Fields that are provided for the small form factor pluggable (SFP) transceiver
Item Field name
Small form factor pluggable (SFP) transceiver Part number
Manufacturer
Device
Serial number
Supported speeds
Connector type
Transmitter type
Wavelength
Maximum distance by cable type
Hardware revision
Port number
Worldwide port name
Table 58 on page 89 shows the fields that are provided for the system properties as
shown by the management GUI.
Figure 45. SAN Volume Controller front-panel assembly
The SAN Volume Controller 2145-DH8 does not have a front panel display, but it
does include node status, node fault, and battery status LEDs. Instead of a front
panel display, the 2145-DH8 node has a technician port on the rear that provides a
direct connection for installation and service. The technician port replaces the
front panel display and provides direct access to the service assistant GUI and
command-line interface (CLI).
The Boot progress display on the front panel shows that the node is starting.
During the boot operation, boot progress codes are displayed and the progress bar
moves to the right while the boot operation proceeds.
Boot failed
If the boot operation fails, boot code 120 is displayed.
See the "Error code reference" topic for a description of the failure and the steps
that you must complete to correct it.
Charging
The front panel indicates that the uninterruptible power supply battery is charging.
A node does not start and join a system if the uninterruptible power supply battery
has insufficient charge to handle a power failure. Charging is displayed until it is
safe to start the node. This might take up to two hours.
Error codes
Error codes are displayed on the front panel display.
Figure 47 and Figure 48 show how error codes are displayed on the front panel.
For descriptions of the error codes that are displayed on the front panel display,
see the various error code topics for a full description of the failure and the actions
that you must perform to correct the failure.
Hardware boot
The hardware boot display shows system data when power is first applied to the
node as the node searches for a disk drive to boot.
Power failure
The SAN Volume Controller node uses battery power from the uninterruptible
power supply to shut itself down.
The Power failure display on the laptop application shows that the SAN Volume
Controller is running on battery power because main power has been lost. All I/O
operations have stopped. The node is saving system metadata and node cache data
to the internal disk drive. When the progress bar reaches zero, the node powers
off.
Note: When input power is restored to the uninterruptible power supply, the SAN
Volume Controller turns on.
Powering off
The progress bar on the display shows the progress of the power-off operation.
Powering Off is displayed after the power button has been pressed and while the
node is powering off. Powering off might take several minutes.
The progress bar moves to the left when the power is removed.
Recovering
When a node is active in a system but the uninterruptible power supply battery is
not fully charged, Recovering is displayed. If the power fails while this message is
displayed, the node does not restart until the uninterruptible power supply has
charged to a level where it can sustain a second power failure.
Restarting
The front panel indicates when the software on a node is restarting.
If you press the power button while powering off, the panel display changes to
indicate that the button press was detected; however, the power off continues until
the node finishes saving its data. After the data is saved, the node powers off and
then automatically restarts. The progress bar moves to the right while the node is
restarting.
Shutting down
The front-panel indicator tracks shutdown operations.
The Shutting Down display is shown when you issue a shutdown command to a
SAN Volume Controller clustered system or a SAN Volume Controller node. The
progress bar continues to move to the left until the node turns off.
When the shutdown operation is complete, the node turns off. When you power
off a node that is connected to a 2145 UPS-1U, only the node shuts down; the 2145
UPS-1U does not shut down.
Typically, this panel is displayed when the service controller has been replaced.
The SAN Volume Controller uses the WWNN that is stored on the service
controller. Usually, when the service controller is replaced, you modify the WWNN
that is stored on it to match the WWNN on the service controller that it replaced.
By doing this, the node maintains its WWNN address, and you do not need to
modify the SAN zoning or host configurations. The WWNN that is stored on disk
is the same as the WWNN that was stored on the old service controller.
After the node enters this mode, the front panel display does not revert to its
normal displays, such as node or cluster (system) options or operational status, until
the WWNN is validated. Use the Validate WWNN option (shown in Figure 50) to
choose which WWNN you want to use.
The node is now using the selected WWNN. The Node WWNN: panel is displayed
and shows the last five numbers of the WWNN that you selected.
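When you compare the WWNN on this panel with a recorded WWNN, only the last five digits are visible. A small Python helper such as the hypothetical panel_suffix below normalizes a WWNN, written with or without colons, and returns the five digits that the panel would show. It assumes a 16-digit hexadecimal WWNN; the sample WWNN values used with it are invented.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def panel_suffix(wwnn: str) -> str:
    """Return the last five digits of a WWNN, as shown on the Node WWNN: panel.

    Assumes a 16-digit hexadecimal WWNN, written with or without colons.
    """
    digits = wwnn.replace(":", "").strip().upper()
    if len(digits) != 16 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a 16-digit hexadecimal WWNN: {wwnn!r}")
    return digits[-5:]


def same_panel_value(recorded_wwnn: str, displayed_digits: str) -> bool:
    """Compare a recorded WWNN with the five digits read from the panel."""
    return panel_suffix(recorded_wwnn) == displayed_digits.strip().upper()
```

A match on five digits does not prove that the full WWNN matches, so treat it as a quick check rather than a validation.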
Menu options enable you to review the operational status of the clustered system,
node, and external interfaces. They also provide access to the tools and operations
that you use to service the node.
Figure 51 on page 97 shows the sequence of the menu options. Only one option at
a time is displayed on the front panel display. For some options, additional data is
displayed on line 2. The first option that is displayed is the Cluster: option.
[Figure 51: sequence of the front panel menu options, including the Node, Status, WWNN, and Service Address panels; IPv4 and IPv6 Address, Subnet or Prefix, and Gateway panels; Ethernet Port, Speed, and MAC Address panels for each port; FC Port Status and Speed panels; Actions; and Language? (press Select to activate a language).]
Use the left and right buttons to navigate through the secondary fields that are
associated with some of the main fields.
Note: Messages might not display fully on the screen. You might see a right angle
bracket (>) on the right side of the display screen. If you see a right angle bracket,
press the right button to scroll through the display. When there is no more text to
display, you can move to the next item in the menu by pressing the right button.
The main cluster (system) option displays the system name that the user has
assigned. If a clustered system is in the process of being created on the node, and
no system name has been assigned, a temporary name that is based on the IP
address of the system is displayed. If this node is not assigned to a system, the
field is blank.
Status option
Status is indicated on the front panel.
This field is blank if the node is not a member of a clustered system. If this node is
a member of a clustered system, the field indicates the operational status of the
system, as follows:
Active
Indicates that this node is an active member of the system.
Inactive
Indicates that the node is a member of a system but is not currently
operational, either because the other nodes in the system cannot be accessed
or because this node was excluded from the system.
Degraded
Indicates that the system is operational, but one or more of the member nodes
are missing or have failed.
These fields contain the IPv4 addresses of the system. If this node is not a member
of a system or if the IPv4 address has not been assigned, these fields are blank.
The IPv4 subnet mask addresses are set when the IPv4 addresses are assigned to
the system.
The IPv4 subnet options display the subnet mask addresses when the system has
IPv4 addresses. If the node is not a member of a system or if the IPv4 addresses
have not been assigned, this field is blank.
The IPv4 gateway addresses are set when the system is created.
The IPv4 gateway options display the gateway addresses for the system. If the
node is not a member of a system, or if the IPv4 addresses have not been assigned,
this field is blank.
These fields contain the IPv6 addresses of the system. If the node is not a member
of a system, or if the IPv6 address has not been assigned, these fields are blank.
The IPv6 prefix option displays the network prefix of the system and the service
IPv6 addresses. The prefix has a value of 0 - 127. If the node is not a member of a
system, or if the IPv6 addresses have not been assigned, a blank line displays.
The IPv6 gateway addresses are set when the system is created.
This option displays the IPv6 gateway addresses for the system. If the node is not
a member of a system, or if the IPv6 addresses have not been assigned, a blank
line displays.
The IPv6 addresses and the IPv6 gateway addresses consist of eight (4-digit)
hexadecimal values that are shown across four panels, as shown in Figure 52 on
page 100. Each panel displays two 4-digit values that are separated by a colon, the
address field position (such as 2/4) within the total address, and scroll indicators.
Move between the address panels by using the left button or right button.
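The panel layout can be reproduced with the Python ipaddress module: expand the address to its eight 4-digit groups and pair them. In this sketch, the panel text format (two groups followed by the position, such as 2/4) is an assumption based on the description above; the scroll indicators and the letter case that the display actually uses are not modeled.

```python
import ipaddress
from typing import List


def ipv6_panels(address: str) -> List[str]:
    """Split an IPv6 address into four panel strings of two 4-digit groups,
    each followed by the panel position, such as 2/4."""
    # exploded always yields eight colon-separated 4-digit hexadecimal groups.
    groups = ipaddress.IPv6Address(address).exploded.split(":")
    panels = []
    for i in range(4):
        pair = ":".join(groups[2 * i:2 * i + 2])
        panels.append(f"{pair} {i + 1}/4")
    return panels
```

For example, ipv6_panels("2001:db8::1") returns "2001:0db8 1/4", "0000:0000 2/4", "0000:0000 3/4", and "0000:0001 4/4", which shows why the compressed :: form is expanded before it is split across panels.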
Node options
The main node option displays the identification number or the name of the node
if the user has assigned a name.
Status option
The node status is indicated on the front panel. The status can be one of the
following states:
Active The node is operational, assigned to a system, and ready to complete I/O
operations.
Service
There is an error that is preventing the node from operating as part of a
system. It is safe to shut down the node in this state.
Candidate
The node is not assigned to a system and is not in service. It is safe to shut
down the node in this state.
Starting
The node is part of a system and is attempting to join the system. It cannot
complete I/O operations.
Version options
The version option displays the version of the SAN Volume Controller software
that is active on the node. The version consists of four fields that are separated by
full stops. The fields are the version, release, modification, and fix level; for
example, 6.1.0.0.
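Because each field is a number, code levels should be compared field by field rather than as text; for example, a hypothetical level 7.10.0.0 is later than 7.6.1.3 even though it sorts earlier as a string. The following Python sketch parses a four-field level into named fields; CodeLevel and parse_code_level are hypothetical names.

```python
from typing import NamedTuple


class CodeLevel(NamedTuple):
    """The four fields of a code level: version, release, modification, fix."""
    version: int
    release: int
    modification: int
    fix: int


def parse_code_level(text: str) -> CodeLevel:
    """Parse a level such as 6.1.0.0; tuples then compare numerically."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields: {text!r}")
    return CodeLevel(*(int(part) for part in parts))
```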
Build option
The Build: panel displays the level of the SAN Volume Controller software that is
currently active on this node.
The Cluster Build: panel displays the level of the software that is currently active
on the system that this node is operating in.
Press the right button to view the details of the individual Ethernet ports.
The Ethernet port options Port-1 through Port-4 display the state of the links and
indicate whether there is an active link with an Ethernet network.
Link Online
An Ethernet cable is attached to this port.
Link Offline
No Ethernet cable is attached to this port or the link has failed.
Speed options
The speed options Speed-1 through Speed-4 display the speed and duplex
information for the Ethernet port. The speed information can be one of the
following values:
10 The speed is 10 Mbps.
100 The speed is 100 Mbps.
1 The speed is 1 Gbps.
10 The speed is 10 Gbps.
The MAC address options MAC Address-1 through MAC Address-4 display the
media access control (MAC) address of the Ethernet port.
Chapter 6. Using the front panel of the SAN Volume Controller 101
Inactive
The port is operational but cannot access the Fibre Channel fabric. One of
the following conditions caused this result:
• The Fibre Channel cable has failed.
• The Fibre Channel cable is not installed.
• The device that is at the other end of the cable has failed.
Failed The port is not operational because of a hardware failure.
Not installed
This port is not installed.
Actions options
During normal operations, action menu options are available on the front panel
display of the node. Only use the front panel actions when directed to do so by a
service procedure. Inappropriate use can lead to loss of access to data or loss of
data.
Figure 53 on page 104, Figure 54 on page 105, and Figure 55 on page 106 show the
sequence of the actions options. In the figures, bold lines indicate that the select
button was pressed. The lighter lines indicate the navigational path (up or down
and left or right). The circled X indicates that if the select button is pressed, an
action occurs using the data entered.
Only one action menu option at a time is displayed on the front-panel display.
Note: Options only display in the menu if they are valid for the current state of
the node. See Table 59 for a list of when the options are valid.
Figure 53. Upper options of the actions menu on the front panel
[Options: Cluster IPv4?, Cluster IPv6?, Service IPv4?, Service IPv6?, Service DHCPv4?, Service DHCPv6?, and Change WWNN?. Each option leads to its parameter panels and ends with Confirm and Cancel? panels.]
Figure 54. Middle options of the actions menu on the front panel
[Options: Exit Service?, Recover Cluster?, Remove Cluster?, and Paced Upgrade?, each with Confirm and Cancel? panels.]
Figure 55. Lower options of the actions menu on the front panel
[Options: Set FC Speed?, Reset Password?, and Rescue Node?, each with Confirm and Cancel? panels, followed by Exit Actions?.]
To perform an action, navigate to the Actions option and press the select button.
The action is initiated. Available parameters for the action are displayed. Use the
left or right buttons to move between the parameters. The current setting is
displayed on the second display line.
To set or change a parameter value, press the select button when the parameter is
displayed. The value changes to edit mode. Use the left or right buttons to move
between subfields, and use the up or down buttons to change the value of a
subfield. When the value is correct, press select to leave edit mode.
Each action also has a Confirm? and a Cancel? panel. Pressing select on the
Confirm? panel initiates the action using the current parameter value setting.
Pressing select on the Cancel? panel returns to the Action option panel without
changing the node.
Note: Messages might not display fully on the screen. You might see a right angle
bracket (>) on the right side of the display screen. If you see a right angle bracket,
press the right button to scroll through the display. When there is no more text to
display, you can move to the next item in the menu by pressing the right button.
Similarly, you might see a left angle bracket (<) on the left side of the display
screen. If you see a left angle bracket, press the left button to scroll through the
display. When there is no more text to display, you can move to the previous item
in the menu by pressing the left button.
The Cluster IPv4 or Cluster IPv6 option allows you to create a clustered system.
From the front panel, when you create a clustered system, you can set either the
IPv4 or the IPv6 address for Ethernet port 1. If required, you can add more
management IP addresses by using the management GUI or the CLI.
Press the up and down buttons to navigate through the parameters that are
associated with the Cluster option. When you have navigated to the desired
parameter, press the select button.
If you are creating the clustered system with an IPv4 address, complete the
following steps:
1. Press and release the up or down button until Actions? is displayed. Press and
release the select button.
2. Press and release the up or down button until Cluster IPv4? is displayed.
Press and release the select button.
3. Edit the IPv4 address, the IPv4 subnet, and the IPv4 gateway.
4. Press and release the left or right button until IPv4 Confirm Create? is
displayed.
5. Press and release the select button to confirm.
If you are creating the clustered system with an IPv6 address, complete the
following steps:
1. Press and release the up or down button until Actions? is displayed. Press and
release the select button.
2. Press and release the left or right button until Cluster IPv6? is displayed. Press
and release the select button.
3. Edit the IPv6 address, the IPv6 prefix, and the IPv6 gateway.
4. Press and release the left or right button until IPv6 Confirm Create? is
displayed.
5. Press and release the select button to confirm.
Using the IPv4 address, you can set the IP address for Ethernet port 1 of the
clustered system that you are going to create. The system can have either an IPv4
or an IPv6 address, or both at the same time. You can set either the IPv4 or IPv6
management address for Ethernet port 1 from the front panel when you are
creating the system. If required, you can add more management IP addresses from
the CLI.
Attention: When you set the IPv4 address, ensure that you type the correct
address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The fast increase or decrease function remains disabled until the
creation is completed or until the function is enabled again. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To enable the fast increase or
decrease function again, press and hold the up button, press and release the
select button, and then release the up button.
4. Press the right button or left button to move to the number field that you want
to set.
5. Repeat steps 3 and 4 for each number field that you want to set.
6. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
Using this option, you can set the IPv4 subnet mask for Ethernet port 1.
Attention: When you set the IPv4 subnet mask, ensure that you type the
correct mask. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
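Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The fast increase or decrease function remains disabled until the
creation is completed or until the function is enabled again. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To enable the fast increase or
decrease function again, press and hold the up button, press and release the
select button, and then release the up button.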
Using this option, you can set the IPv4 gateway address for Ethernet port 1.
Attention: When you set the IPv4 gateway address, ensure that you type the
correct address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The fast increase or decrease function remains disabled until the
creation is completed or until the function is enabled again. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To enable the fast increase or
decrease function again, press and hold the up button, press and release the
select button, and then release the up button.
4. Press the right button or left button to move to the number field that you want
to set.
5. Repeat steps 3 and 4 for each number field that you want to set.
6. Press the select button to confirm the settings. Otherwise, press the right button
to display the next secondary option or press the left button to display the
previous options.
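Because a mistyped address, subnet mask, or gateway can leave the system unreachable from the command-line tools and the management GUI, it can help to check the three values before you key them in. The following Python sketch uses the standard ipaddress module; the function name check_ipv4_settings and the particular checks it makes are illustrative assumptions, not checks that the front panel performs.

```python
import ipaddress


def check_ipv4_settings(address: str, subnet_mask: str, gateway: str) -> list:
    """Return a list of problems; an empty list means no problem was found."""
    try:
        # IPv4Interface accepts "address/netmask" and rejects non-contiguous masks.
        iface = ipaddress.IPv4Interface(f"{address}/{subnet_mask}")
        gw = ipaddress.IPv4Address(gateway)
    except ValueError as exc:
        return [f"invalid value: {exc}"]

    problems = []
    net = iface.network
    # /31 and /32 networks have no separate network or broadcast address.
    if net.prefixlen < 31 and iface.ip in (net.network_address, net.broadcast_address):
        problems.append("address is the network or broadcast address of its subnet")
    if gw not in net:
        problems.append("gateway is not in the same subnet as the address")
    if gw == iface.ip:
        problems.append("gateway and address are identical")
    return problems


# Hypothetical example values.
print(check_ipv4_settings("192.168.10.20", "255.255.255.0", "192.168.10.1"))  # []
```

For example, check_ipv4_settings("192.168.10.20", "255.255.255.0", "192.168.11.1") reports that the gateway is outside the subnet. All addresses shown are hypothetical.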
Using this option, you can start an operation to create a system with an IPv4
address.
1. Press and release the left or right button until IPv4 Confirm Create? is
displayed.
2. Press the select button to start the operation.
If the create operation is successful, Password is displayed on line 1. The
password that you can use to access the system is displayed on line 2. Be sure
to immediately record the password; it is required on the first attempt to
manage the system from the management GUI.
Attention: The password displays for only 60 seconds, or until a front panel
button is pressed. The system is created only after the password display is
cleared.
If the create operation fails, Create Failed: is displayed on line 1 of the
front-panel display screen. Line 2 displays one of two possible error codes that
you can use to isolate the cause of the failure.
Using this option, you can set the IPv6 address for Ethernet port 1 of the system
that you are going to create. The system can have either an IPv4 or an IPv6
address, or both at the same time. You can set either the IPv4 or IPv6 management
address for Ethernet port 1 from the front panel when you are creating the system.
If required, you can add more management IP addresses from the CLI.
Attention: When you set the IPv6 address, ensure that you type the correct
address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Using this option, you can set the IPv6 prefix for Ethernet port 1.
Attention: When you set the IPv6 prefix, ensure that you type the correct
network prefix. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Using this option, you can set the IPv6 gateway for Ethernet port 1.
Attention: When you set the IPv6 gateway address, ensure that you type the
correct address. Otherwise, you might not be able to access the system using the
command-line tools or the management GUI.
Using this option, you can start an operation to create a system with an IPv6
address.
1. Press and release the left or right button until IPv6 Confirm Create? is
displayed.
2. Press the select button to start the operation.
If the create operation is successful, Password is displayed on line 1. The
password that you can use to access the system is displayed on line 2. Be sure
to immediately record the password; it is required on the first attempt to
manage the system from the management GUI.
Attention: The password displays for only 60 seconds, or until a front panel
button is pressed. The system is created only after the password display is
cleared.
If the create operation fails, Create Failed: is displayed on line 1 of the
front-panel display screen. Line 2 displays one of two possible error codes that
you can use to isolate the cause of the failure.
Service IPv4 or Service IPv6 options
You can use the front panel to change a service IPv4 address or a service IPv6
address.
The IPv4 Address panels show one of the following items for the selected Ethernet
port:
v The active service address if the system has an IPv4 address. This address can be
either a configured or fixed address, or it can be an address obtained through
DHCP.
v DHCP Failed if the IPv4 service address is configured for DHCP but the node
was unable to obtain an IP address.
v DHCP Configuring if the IPv4 service address is configured for DHCP while the
node attempts to obtain an IP address. This message changes automatically to the
IPv4 address if a DHCP address is allocated and activated.
v A blank line if the system does not have an IPv4 address.
If the service IPv4 address was not set correctly or a DHCP address was not
allocated, you have the option of correcting the IPv4 address from this panel. The
service IP address must be in the same subnet as the management IP address.
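The same-subnet requirement can be checked before you edit the address. This minimal Python sketch assumes that the subnet is defined by the management address's prefix length or netmask; same_subnet is a hypothetical helper name, and the addresses in the usage lines are examples only.

```python
import ipaddress


def same_subnet(service_ip: str, management_ip: str, prefix: str) -> bool:
    """Return True if both addresses fall in the subnet that prefix defines.

    prefix is a prefix length such as "24" or "64", or an IPv4 netmask such
    as "255.255.255.0" (assumed to be the management subnet's mask).
    """
    service_net = ipaddress.ip_interface(f"{service_ip}/{prefix}").network
    management_net = ipaddress.ip_interface(f"{management_ip}/{prefix}").network
    # Networks of different IP versions never compare equal.
    return service_net == management_net


# Hypothetical addresses.
print(same_subnet("192.168.10.30", "192.168.10.20", "255.255.255.0"))  # True
print(same_subnet("192.168.11.30", "192.168.10.20", "24"))  # False
```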
To set a fixed service IPv4 address from the IPv4 Address: panel, perform the
following steps:
1. Press and release the select button to put the panel in edit mode.
2. Press the right button or left button to move to the number field that you want
to set.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value. If you want to quickly
increase the highlighted value, hold the up button. If you want to quickly
decrease the highlighted value, hold the down button.
Note: If you want to disable the fast increase or decrease function, press and
hold the down button, press and release the select button, and then release the
down button. The fast increase or decrease function remains disabled until the
creation is completed or until the function is enabled again. If the up button
or down button is pressed and held while the function is disabled, the value
increases or decreases once every two seconds. To enable the fast increase or
decrease function again, press and hold the up button, press and release the
select button, and then release the up button.
4. When all the fields are set as required, press and release the select button to
activate the new IPv4 address.
The IPv4 Address: panel is displayed. The new service IPv4 address is not
displayed until it has become active. If the new address has not been displayed
after 2 minutes, check that the selected address is valid on the subnetwork and
that the Ethernet switch is working correctly.
The IPv6 Address panels show one of the following conditions for the selected
Ethernet port:
If the service IPv6 address was not set correctly or a DHCP address was not
allocated, you have the option of correcting the IPv6 address from this panel. The
service IP address must be in the same subnet as the management IP address.
To set a fixed service IPv6 address from the IPv6 Address: panel, perform the
following steps:
1. Press and release the select button to put the panel in edit mode. When the
panel is in edit mode, the full address is still shown across four panels as eight
(four-digit) hexadecimal values. You edit each digit of the hexadecimal values
independently. The current digit is highlighted.
2. Press the right button or left button to move to the number field that you want
to set.
3. Press the up button if you want to increase the value that is highlighted; press
the down button if you want to decrease that value.
4. When all the fields are set as required, press and release the select button to
activate the new IPv6 address.
The IPv6 Address: panel is displayed. The new service IPv6 address is not
displayed until it has become active. If the new address has not been displayed
after 2 minutes, check that the selected address is valid on the subnetwork and
that the Ethernet switch is working correctly.
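Eight four-digit values spread over four panels works out to two values per panel. The following Python sketch shows how an address expands into that layout and how the edited values compress back into normal notation; the assumption that the panels show the groups in order, two per panel, is inferred from the description above rather than confirmed by it, and the helper names are hypothetical.

```python
import ipaddress


def ipv6_panels(address: str) -> list:
    """Split an IPv6 address into four panels of two four-digit hex groups."""
    groups = ipaddress.IPv6Address(address).exploded.split(":")  # always 8 groups
    return [groups[i:i + 2] for i in range(0, 8, 2)]


def address_from_panels(panels: list) -> str:
    """Rebuild the compressed address text from the edited panel values."""
    groups = [group for panel in panels for group in panel]
    return str(ipaddress.IPv6Address(":".join(groups)))


# Hypothetical address from the documentation range 2001:db8::/32.
panels = ipv6_panels("2001:db8::1")
print(panels)  # [['2001', '0db8'], ['0000', '0000'], ['0000', '0000'], ['0000', '0001']]
print(address_from_panels(panels))  # 2001:db8::1
```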
If a service IP address does not exist, you must assign a service IP address or use
DHCP with this action.
To set the service IPv4 address to use DHCP, perform the following steps:
1. Press and release the up or down button until Service DHCPv4? is displayed.
2. Press and release the down button. Confirm DHCPv4? is displayed.
3. Press and release the select button to activate DHCP, or you can press and
release the up button to keep the existing address.
4. If you activate DHCP, DHCP Configuring is displayed while the node attempts
to obtain a DHCP address. It changes automatically to show the allocated
address if a DHCP address is allocated and activated, or it changes to DHCP
Failed if a DHCP address is not allocated.
To set the service IPv6 address to use DHCP, perform the following steps:
1. Press and release the up or down button until Service DHCPv6? is displayed.
2. Press and release the down button. Confirm DHCPv6? is displayed.
3. Press and release the select button to activate DHCP, or you can press and
release the up button to keep the existing address.
4. If you activate DHCP, DHCP Configuring is displayed while the node attempts
to obtain a DHCP address. It changes automatically to show the allocated
address if a DHCP address is allocated and activated, or it changes to DHCP
Failed if a DHCP address is not allocated.
Note: If an IPv6 router is present on the local network, SAN Volume Controller
does not differentiate between an autoconfigured address and a DHCP address.
Therefore, SAN Volume Controller uses the first address that is detected.
Important: Only change the WWNN when you are instructed to do so by a service
procedure. Nodes must always have a unique WWNN. If you change the WWNN,
you might have to reconfigure hosts and the SAN zoning.
1. Press and release the up or down button until Actions? is displayed.
2. Press and release the select button.
3. Press and release the up or down button until Change WWNN? is displayed on
line 1. Line 2 of the display shows the last five numbers of the WWNN that is
currently set. The first number is highlighted.
4. Edit the highlighted number to match the number that is required. Use the up
and down buttons to increase or decrease the numbers. The numbers wrap F to
0 or 0 to F. Use the left and right buttons to move between the numbers.
5. When the highlighted value matches the required number, press and release the
select button to activate the change. The Node WWNN: panel displays and the
second line shows the last five characters of the changed WWNN.
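The digit editing above is modular: each digit is one of 16 hexadecimal values, and stepping past F returns to 0. This Python sketch models that behavior under the assumption of a 16-digit hexadecimal WWNN; the helper names and the example WWNN are hypothetical and are not taken from any real node.

```python
HEX_DIGITS = "0123456789ABCDEF"


def step_digit(digit: str, up: bool) -> str:
    """Return the next (up) or previous (down) hex digit, wrapping F->0 and 0->F."""
    index = HEX_DIGITS.index(digit.upper())
    return HEX_DIGITS[(index + (1 if up else -1)) % 16]


def presses_needed(current: str, target: str) -> int:
    """Fewest up or down presses to change one digit, using the wraparound."""
    diff = (HEX_DIGITS.index(target.upper()) - HEX_DIGITS.index(current.upper())) % 16
    return min(diff, 16 - diff)


def set_wwnn_suffix(wwnn: str, suffix: str) -> str:
    """Replace the last five hex digits of a 16-digit WWNN (assumed length)."""
    for value, length in ((wwnn, 16), (suffix, 5)):
        if len(value) != length or not set(value.upper()) <= set(HEX_DIGITS):
            raise ValueError(f"{value!r} is not a {length}-digit hex string")
    return (wwnn[:-5] + suffix).upper()


# Hypothetical values for illustration only.
print(step_digit("F", up=True))  # 0
print(presses_needed("1", "F"))  # 2 (down twice: 1 -> 0 -> F)
print(set_wwnn_suffix("500507680100ABCD", "12345"))  # 5005076801012345
```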
If the node is active, entering service state can cause disruption to hosts if other
faults exist in the system. While in service state, the node cannot join or run as part
of a clustered system.
To exit service state, ensure that all errors are resolved. You can exit service state
by using the Exit Service? option or by restarting the node.
If there are no noncritical errors, the node enters candidate state. If possible, the
node then becomes active in a clustered system.
To exit service state, ensure that all errors are resolved. You can exit service state
by using this option or by restarting the node.
Perform service actions on nodes only when directed by the service procedures. If
used inappropriately, service actions can cause loss of access to data or data loss.
For information about the recover system procedure, see Recover system
procedure on page 253.
Use this option as the final step in decommissioning a system after the other nodes
have been removed from the system using the command-line interface (CLI) or the
management GUI.
Attention: Use the front panel to remove state data from a single node system. To
remove a node from a multi-node system, always use the CLI or the remove node
options from the management GUI.
To delete the state data from the node using the Remove Cluster? panel:
1. Press and hold the up button.
2. Press and release the select button.
3. Release the up button.
After the option is run, the node shows Cluster: with no system name. If this
option is run on a node that is still a member of a system, the system shows error
1195, Node missing, and the node is displayed in the list of nodes in the system.
Remove the node by using the management GUI or CLI.
Note: This action can be used only when the following conditions exist for the
node:
v The node is in service state.
v The node has no errors.
v The node has been removed from the clustered system.
For additional information, see the Upgrading the software manually topic in the
information center.
Use the Reset password? option if the user has lost the system superuser password
or if the user is unable to access the system. If it is permitted by the user's
password security policy, use this selection to reset the system superuser password.
If your password security policy permits password recovery, and if the node is
currently a member of a clustered system, the system superuser password is reset
and a new password is displayed for 60 seconds. If your password security policy
does not permit password recovery or the node is not a member of a system,
completing these steps has no effect.
If the node is in active state when the password is reset, the reset applies to all
nodes in the system. If the node is in candidate or service state when the password
is reset, the reset applies only to the single node.
Note: Another way to rescue a node is to force a node rescue when the node
boots. It is the preferred method. Forcing a node rescue when a node boots works
by booting the operating system from the service controller and running a program
that copies all the SAN Volume Controller software from any other node that can
be found on the Fibre Channel fabric. See Completing the node rescue when the
node boots on page 269.
Language? option
You can change the language that displays on the front panel.
The Language? option allows you to change the language that is displayed on the
menu. Figure 56 shows the Language? option sequence.
Figure 56. Language? option sequence: Language?, then select, then English? or Japanese?
To select the language that you want to be used on the front panel, perform the
following steps:
Procedure
1. Press and release the up or down button until Language? is displayed.
2. Press and release the select button.
Results
If the selected language uses the Latin alphabet, the front panel display shows two
lines. The panel text is displayed on the first line and additional data is displayed
on the second line.
If the selected language does not use the Latin alphabet, the display shows only
one line at a time to clearly display the character font. For those languages, you
can switch between the panel text and the additional data by pressing and
releasing the select button.
Additional data is unavailable when the front panel displays a menu option, which
ends with a question mark (?). In this case, press and release the select button to
choose the menu option.
Note: You cannot select another language when the node is displaying a boot
error.
Using the power control for the SAN Volume Controller node
Some SAN Volume Controller nodes are powered by an uninterruptible power
supply that is in the same rack as the nodes. Other nodes have internal batteries
instead, such as the SAN Volume Controller 2145-DH8.
The power state of the SAN Volume Controller is displayed by a power indicator
on the front panel. If the uninterruptible power supply battery is not sufficiently
charged to enable the SAN Volume Controller to become fully operational, its
charge state is displayed on the front panel display of the node.
Note: Never turn off the node by removing the power cable. You might lose data.
For more information about how to power off the node, see MAP 5350: Powering
off a node on page 302.
If the SAN Volume Controller software is running and you request it to power off
from the management GUI, CLI, or power button, the node starts its power off
processing. During this time, the node indicates the progress of the power-off
operation on the front panel display. After the power-off processing is complete,
the front panel becomes blank and the front panel power light flashes. It is safe for
you to remove the power cable from the rear of the node. If the power button on
the front panel is pressed during power-off processing, the front panel display
changes to indicate that the node is being restarted, but the power-off process
completes before the restart.
If the SAN Volume Controller software is not running when the front panel power
button is pressed, the node immediately powers off.
Note: The 2145 UPS-1U does not power off when the node is shut down from the
power button.
If you turn off a node using the power button or by a command, the node is put
into a power-off state. The SAN Volume Controller remains in this state until the
power cable is connected to the rear of the node and the power button is pressed.
During the startup sequence, the SAN Volume Controller tries to detect the status
of the uninterruptible power supply through the uninterruptible power supply
signal cable. If an uninterruptible power supply is not detected, the node pauses
and an error is shown on the front panel display. If the uninterruptible power
supply is detected, the software monitors the operational state of the
uninterruptible power supply. If no uninterruptible power supply errors are
reported and the uninterruptible power supply battery is sufficiently charged, the
SAN Volume Controller becomes operational. If the uninterruptible power supply
battery is not sufficiently charged, the charge state is indicated by a progress bar
on the front panel display. When an uninterruptible power supply is first turned
on, it might take up to 2 hours before the battery is sufficiently charged for the
SAN Volume Controller node to become operational.
If input power to the uninterruptible power supply is lost, the node immediately
stops all I/O operations and saves the contents of its dynamic random access
memory (DRAM) to the internal disk drive. While data is being saved to the disk
drive, a Power Failure message is shown on the front panel and is accompanied
by a descending progress bar that indicates the quantity of data that remains to be
saved. After all the data is saved, the node is turned off and the power light on the
front panel turns off.
Note: The node is now in standby state. If the input power to the uninterruptible
power supply unit is restored, the node restarts. If the uninterruptible power
supply battery was fully discharged, Charging is displayed and the boot process
waits for the battery to charge. When the battery is sufficiently charged, Booting is
displayed, the node is tested, and the software is loaded. When the boot process is
complete, Recovering is displayed while the uninterruptible power supply finalizes
its charge. While Recovering is displayed, the system can function normally.
However, when the power is restored after a second power failure, there is a delay
(with Charging displayed) before the node can complete its boot process.
Event logs
Error codes
The following topics provide information to help you understand and process the
error codes:
v Event reporting
v Understanding the events
v Understanding the error codes
v Determining a hardware boot failure
If the node is showing a boot message, failure message, or node error message,
and you determined that the problem was caused by a software or firmware
failure, you can restart the node to see whether that might resolve the problem.
Perform the following steps to properly shut down and restart the node:
1. Follow the instructions in MAP 5350: Powering off a node on page 302.
2. Restart only one node at a time.
3. Do not shut down the second node in an I/O group for at least 30 minutes
after you shut down and restart the first node.
Introduction
For each collection interval, the management GUI creates four statistics files: one
for managed disks (MDisks), named Nm_stat; one for volumes and volume copies,
named Nv_stat; one for nodes, named Nn_stat; and one for drives, named Nd_stat.
The files are written to the /dumps/iostats directory on the node. To retrieve the
statistics files from the non-configuration nodes onto the configuration node, you
must use the svctask cpdumps command.
A maximum of 16 files of each type can be created for the node. When the 17th file
is created, the oldest file for the node is overwritten.
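The per-type limit behaves like a fixed-size queue for each file type. This Python sketch models the retention rule with collections.deque; the file names in the usage lines are hypothetical, and the class is a model of the documented behavior, not code that runs on the node.

```python
from collections import defaultdict, deque

PREFIXES = ("Nm_stat", "Nv_stat", "Nn_stat", "Nd_stat")  # MDisk, volume, node, drive
MAX_FILES_PER_TYPE = 16


class StatsFileStore:
    """Model of per-type retention: a 17th file of one type replaces the oldest."""

    def __init__(self) -> None:
        # One bounded queue per file type; deque(maxlen=...) drops the oldest entry.
        self._files = defaultdict(lambda: deque(maxlen=MAX_FILES_PER_TYPE))

    def add(self, filename: str) -> None:
        prefix = next((p for p in PREFIXES if filename.startswith(p)), None)
        if prefix is None:
            raise ValueError(f"not a statistics file name: {filename}")
        self._files[prefix].append(filename)

    def files(self, prefix: str) -> list:
        return list(self._files.get(prefix, ()))


store = StatsFileStore()
for i in range(17):  # hypothetical file names
    store.add(f"Nv_stat_{i:02d}")
print(len(store.files("Nv_stat")), store.files("Nv_stat")[0])  # 16 Nv_stat_01
```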
Tables
The following tables describe the information that is reported for individual nodes
and volumes.
Table 60 describes the statistics collection for MDisks, for individual nodes.
Table 60. Statistics collection for individual nodes
Statistic name Description
id Indicates the name of the MDisk for which the statistics apply.
idx Indicates the identifier of the MDisk for which the statistics apply.
rb Indicates the cumulative number of blocks of data that are read (since the
node has been running).
re Indicates the cumulative read external response time in milliseconds for each
MDisk. The cumulative response time for disk reads is calculated by starting
a timer when a SCSI read command is issued and stopping it when the
command completes successfully. The elapsed time is added to the
cumulative counter.
ro Indicates the cumulative number of MDisk read operations that are processed
(since the node has been running).
rq Indicates the cumulative read queued response time in milliseconds for each
MDisk. This response time is measured above the queue of commands that wait to
be sent to an MDisk because its queue depth is already full. This calculation
includes the elapsed time that read commands take to complete from the time
they join the queue.
wb Indicates the cumulative number of blocks of data written (since the node
has been running).
we Indicates the cumulative write external response time in milliseconds for each
MDisk. The cumulative response time for disk writes is calculated by starting
a timer when a SCSI write command is issued and stopping it when the
command completes successfully. The elapsed time is added to the
cumulative counter.
wo Indicates the cumulative number of MDisk write operations processed (since
the node has been running).
wq Indicates the cumulative write queued response time in milliseconds for each
MDisk. This response time is measured above the queue of commands that wait to
be sent to an MDisk because its queue depth is already full. This calculation
includes the elapsed time that write commands take to complete from the time
they join the queue.
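Because these MDisk counters are cumulative since the node started, per-interval figures come from the difference between two samples. For example, dividing the change in re by the change in ro gives the average read response time per operation in milliseconds. This Python sketch assumes two samples of the same MDisk taken interval_s seconds apart and a block size of 512 bytes; the block size is an assumption, because the table does not state it, and the helper name is hypothetical.

```python
COUNTERS = ("rb", "re", "ro", "wb", "we", "wo")


def mdisk_interval_metrics(prev: dict, curr: dict, interval_s: float,
                           block_size: int = 512) -> dict:
    """Turn two cumulative MDisk samples into per-interval figures.

    block_size (bytes per block for rb and wb) is an assumption.
    """
    delta = {name: curr[name] - prev[name] for name in COUNTERS}
    if any(value < 0 for value in delta.values()):
        # Counters accumulate since the node started, so a drop suggests a restart.
        raise ValueError("counter decreased; the node was probably restarted")
    return {
        "read_iops": delta["ro"] / interval_s,
        "write_iops": delta["wo"] / interval_s,
        "read_mb_per_s": delta["rb"] * block_size / interval_s / 1e6,
        "write_mb_per_s": delta["wb"] * block_size / interval_s / 1e6,
        # Cumulative response time (ms) divided by operations gives ms per operation.
        "avg_read_ms": delta["re"] / delta["ro"] if delta["ro"] else 0.0,
        "avg_write_ms": delta["we"] / delta["wo"] if delta["wo"] else 0.0,
    }


# Hypothetical samples taken 10 seconds apart.
before = {"rb": 0, "re": 0, "ro": 0, "wb": 0, "we": 0, "wo": 0}
after = {"rb": 2048, "re": 500, "ro": 100, "wb": 0, "we": 0, "wo": 0}
print(mdisk_interval_metrics(before, after, 10)["avg_read_ms"])  # 5.0
```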
Table 61 on page 121 describes the VDisk (volume) information that is reported for
individual nodes.
Table 62 describes the VDisk information related to Metro Mirror or Global Mirror
relationships that is reported for individual nodes.
Table 62. Statistic collection for volumes that are used in Metro Mirror and Global Mirror
relationships for individual nodes
Statistic name Description
gwl Indicates the cumulative secondary write latency in milliseconds for each
volume. You can calculate the time that is needed to recover from a failure
based on this statistic and the gws statistic.
Table 63 describes the port information that is reported for individual nodes.
Table 63. Statistic collection for node ports
Statistic name Description
bbcz Indicates the total time in microseconds for which the port had data to send
but was prevented from doing so by a lack of buffer credit from the switch.
cbr Indicates the bytes received from disk controllers.
cbt Indicates the bytes transmitted to disk controllers.
cer Indicates the commands received from disk controllers.
cet Indicates the commands initiated to disk controllers.
hbr Indicates the bytes received from hosts.
hbt Indicates the bytes transmitted to hosts.
her Indicates the commands received from hosts.
het Indicates the commands initiated to hosts.
icrc Indicates the number of cyclic redundancy checks (CRCs) that are not valid.
id Indicates the port identifier for the node.
itw Indicates the number of transmission words that are not valid.
lf Indicates a link failure count.
lnbr Indicates the bytes received from other nodes in the same cluster.
lnbt Indicates the bytes transmitted to other nodes in the same cluster.
lner Indicates the commands received from other nodes in the same cluster.
lnet Indicates the commands initiated to other nodes in the same cluster.
lsi Indicates the loss-of-signal count.
lsy Indicates the loss-of-synchronization count.
pspe Indicates the primitive sequence-protocol error count.
rmbr Indicates the bytes received from nodes in other clusters.
rmbt Indicates the bytes transmitted to nodes in other clusters.
rmer Indicates the commands received from nodes in other clusters.
rmet Indicates the commands initiated to nodes in other clusters.
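Counters such as lf, lsy, lsi, pspe, itw, and icrc are most useful when compared across samples: a counter that keeps rising points to a possible link problem on that port. This Python sketch assumes that these counters are cumulative totals, as the note after Table 64 describes for count statistics; the helper name is hypothetical.

```python
ERROR_COUNTERS = ("lf", "lsy", "lsi", "pspe", "itw", "icrc")


def rising_error_counters(prev: dict, curr: dict) -> dict:
    """Return {counter: increase} for link error counters that grew between samples."""
    increases = {}
    for name in ERROR_COUNTERS:
        change = curr.get(name, 0) - prev.get(name, 0)
        if change > 0:
            increases[name] = change
    return increases


# Hypothetical samples for one port.
print(rising_error_counters({"lf": 1, "icrc": 5}, {"lf": 1, "icrc": 9}))  # {'icrc': 4}
```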
Table 64 describes the node information that is reported for each node.
Table 64. Statistic collection for nodes
Statistic name Description
cluster_id Indicates the identifier of the cluster.
cluster Indicates the name of the cluster.
cpu busy - Indicates the total CPU average core busy milliseconds since the node
was reset. This statistic reports the amount of the time the processor has
spent polling while waiting for work versus actually doing work. This
statistic accumulates from zero.
comp - Indicates the total CPU average core busy milliseconds for
compression process cores since the node was reset.
system - Indicates the total CPU average core busy milliseconds since the
node was reset. This statistic reports the amount of the time the processor
has spent polling while waiting for work versus actually doing work. This
statistic accumulates from zero. This is the same information as the
information provided with the cpu busy statistic and will eventually replace
the cpu busy statistic.
cpu_core id - Indicates the CPU core identifier.
comp - Indicates the per-core CPU average core busy milliseconds for
compression process cores since the node was reset.
system - Indicates the per-core CPU average core busy milliseconds for
system process cores since the node was reset.
id Indicates the name of the node.
node_id Indicates the unique identifier for the node.
rb Indicates the number of bytes received.
re Indicates the accumulated receive latency, excluding inbound queue time.
This statistic is the latency that is experienced by the node communication
layer from the time that an I/O is queued to cache until the time that the
cache gives completion for it.
ro Indicates the number of messages or bulk data received.
rq Indicates the accumulated receive latency, including inbound queue time.
This statistic is the latency from the time that a command arrives at the node
communication layer to the time that the cache completes the command.
wb Indicates the bytes sent.
we Indicates the accumulated send latency, excluding outbound queue time. This
statistic is the time from when the node communication layer issues a
message out onto the Fibre Channel until the node communication layer
receives notification that the message has arrived.
wo Indicates the number of messages or bulk data sent.
wq Indicates the accumulated send latency, including outbound queue time. This
statistic includes the entire time that data is sent. This time includes the time
from when the node communication layer receives a message and waits for
resources, the time to send the message to the remote node, and the time
taken for the remote node to respond.
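Because the cpu busy and system values accumulate in milliseconds from zero, average core utilization over an interval can be estimated from the change in the counter divided by the elapsed time. This Python sketch assumes that the counter represents average busy time per core, as the description states, and that the two samples are interval_s seconds apart; the helper name is hypothetical.

```python
def cpu_utilization_percent(prev_busy_ms: int, curr_busy_ms: int, interval_s: float) -> float:
    """Estimate average core utilization from two cumulative busy-millisecond samples."""
    if curr_busy_ms < prev_busy_ms:
        # The counter only restarts from zero when the node is reset.
        raise ValueError("busy counter decreased; the node was probably reset")
    elapsed_ms = interval_s * 1000.0
    return (curr_busy_ms - prev_busy_ms) / elapsed_ms * 100.0


# Hypothetical samples taken 60 seconds apart.
print(cpu_utilization_percent(1000, 31000, 60))  # 50.0
```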
Note: Any statistic whose name ends in av, mx, mn, or cn is not cumulative; these
statistics are reset at every statistics interval. A statistic whose name does not
end in av, mx, mn, or cn and that is an I/O count or other count is a field that
contains a running total.
v The term pages means units of 4096 bytes per page.
v The term sectors means units of 512 bytes per sector.
v The term µs means microseconds.
v Non-cumulative means totals since the previous statistics collection interval.
v Snapshot means the value at the end of the statistics interval (rather than an
average across the interval or a peak within the interval).
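The naming rule and the unit definitions above can be applied mechanically when you post-process statistics. This Python sketch assumes that the av, mx, mn, and cn markers appear as name suffixes (as in the dav, dcn, dmx, and dmn attributes of the XML examples in this section); it does not detect values that describe current state, such as cm, so treat its result as a first pass.

```python
NON_CUMULATIVE_SUFFIXES = ("av", "mx", "mn", "cn")
PAGE_BYTES = 4096    # "pages" are 4096-byte units
SECTOR_BYTES = 512   # "sectors" are 512-byte units


def is_cumulative(stat_name: str) -> bool:
    """Assumed rule: names ending in av, mx, mn, or cn reset every interval."""
    return not stat_name.endswith(NON_CUMULATIVE_SUFFIXES)


def pages_to_bytes(pages: int) -> int:
    return pages * PAGE_BYTES


def sectors_to_bytes(sectors: int) -> int:
    return sectors * SECTOR_BYTES


print([name for name in ("dav", "dcn", "ctrh", "smx") if is_cumulative(name)])  # ['ctrh']
```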
Table 66 describes the statistic collection for volume cache per individual nodes.
Table 66. Statistic collection for volume cache per individual nodes. This table describes the
volume cache information that is reported for individual nodes.
Statistic name Description
cm Indicates the number of sectors of modified or dirty data that are held in the
cache.
ctd Indicates the total number of cache-initiated track writes that are
submitted to other components as a result of a volume cache flush or destage
operation.
ctds Indicates the total number of sectors that are written for cache-initiated track
writes.
ctp Indicates the number of track stages that are initiated by the cache that are
prestage reads.
ctps Indicates the total number of staged sectors that are initiated by the cache.
ctrh Indicates the total number of track read-cache hits on prestage or
non-prestage data. For example, a single read that spans two tracks, where
only one of the tracks obtained a total cache hit, is counted as one track
read-cache hit.
Table 67 on page 129 describes the XML statistics specific to an IP Partnership port.
Actions
The XML is more complex, as this raw XML from the volume (Nv_statistics)
statistics shows. Although the attribute names are similar, they refer to
different parts of the VDisk because they appear in different sections of the XML.
<vdsk idx="0"
ctrs="213694394" ctps="0" ctrhs="2416029" ctrhps="0"
ctds="152474234" ctwfts="9635" ctwwts="0" ctwfws="152468611"
ctwhs="9117" ctws="152478246" ctr="1628296" ctw="3241448"
ctp="0" ctrh="123056" ctrhp="0" ctd="1172772"
ctwft="200" ctwwt="0" ctwfw="3241248" ctwfwsh="0"
ctwfwshs="0" ctwh="538" cm="13768758912876544" cv="13874234719731712"
gwot="0" gwo="0" gws="0" gwl="0"
id="Master_iogrp0_1"
ro="0" wo="0" rb="0" wb="0"
rl="0" wl="0" rlw="0" wlw="0" xl="0">
<!-- Vdisk/Volume statistics -->
<ca r="0" rh="0" d="0" ft="0"
wt="0" fw="0" wh="0" ri="0"
wi="0" dav="0" dcn="0" pav="0" pcn="0" teav="0" tsav="0" tav="0"
pp="0"/>
<cpy idx="0">
<!-- volume copy statistics -->
<ca r="0" p="0" rh="0" ph="0"
d="0" ft="0" wt="0" fw="0"
wh="0" pm="0" ri="0" wi="0"
dav="0" dcn="0" sav="0" scn="0"
pav="0" pcn="0" teav="0" tsav="0"
tav="0" pp="0"/>
</cpy>
</vdsk>
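Because the same attribute name can appear at more than one level, a parser has to track which element each attribute belongs to. The following minimal Python sketch uses the standard library xml.etree.ElementTree module on an abbreviated copy of the XML above (most attributes removed). The function name volume_sections is hypothetical and is not part of any IBM tool. It simply keeps the <vdsk> attributes, the volume-level <ca> attributes, and the per-copy <ca> attributes apart.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the raw volume XML above; most attributes are omitted.
SAMPLE = """
<vdsk idx="0" id="Master_iogrp0_1" ctr="1628296" ctw="3241448" ctrh="123056">
  <ca r="0" rh="0" d="0"/>
  <cpy idx="0">
    <ca r="0" p="0" rh="0" ph="0"/>
  </cpy>
</vdsk>
"""


def volume_sections(xml_text):
    """Split one <vdsk> element into its sections (hypothetical helper).

    An attribute keeps its meaning only within its own element: "rh" on the
    <ca> directly under <vdsk> is a volume statistic, while "rh" on the <ca>
    under <cpy> is a statistic for that volume copy.
    """
    vdsk = ET.fromstring(xml_text.strip())
    volume_ca = vdsk.find("ca")  # direct child only, not the <ca> inside <cpy>
    copies = {}
    for cpy in vdsk.findall("cpy"):
        copy_ca = cpy.find("ca")
        copies[cpy.get("idx")] = dict(copy_ca.attrib) if copy_ca is not None else {}
    return {
        "volume": dict(vdsk.attrib),
        "volume_cache": dict(volume_ca.attrib) if volume_ca is not None else {},
        "copies": copies,
    }


sections = volume_sections(SAMPLE)
print(sections["volume"]["ctrh"])        # 123056
print(sorted(sections["copies"]["0"]))   # attribute names reported for copy 0
```

Attribute values stay as strings. Convert them with int() only for the counters that you actually need.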
The volume cache statistics for the node and its partitions follow the same pattern:
<uca><ca dav="18726" dcn="1502531" dmx="749846" dmn="89"
sav="20868" scn="2833391" smx="980941" smn="3"
pav="0" pcn="0" pmx="0" pmn="0"
wfav="0" wfmx="2" wfmn="0"
rfav="0" rfmx="1" rfmn="0"
pp="0"
hpt="0" ppt="0" opt="0" npt="0"
apt="0" cpt="0" bpt="0" hrpt="0"
/><partition id="0"><ca dav="18726" dcn="1502531" dmx="749846" dmn="89"
fav="0" fmx="2" fmn="0"
dfav="0" dfmx="0" dfmn="0"
dtav="0" dtmx="0" dtmn="0"
pp="0"/></partition></uca>
This output describes the volume cache statistics for the node. The statistics inside <partition id="0"> are for partition 0.
If <lca> replaces <uca>, the statistics are for volume copy cache partition 0.
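The name of the root element is what distinguishes the two scopes. The following minimal sketch assumes that the node-level excerpt is wrapped in a closed <uca> or <lca> element. The Python function cache_scope (a hypothetical name) reports which cache the statistics describe and collects the per-partition <ca> attributes.

```python
import xml.etree.ElementTree as ET

# Abbreviated form of the excerpt above, closed so that it parses on its own.
EXCERPT = (
    '<uca><ca dav="18726" dcn="1502531" pp="0"/>'
    '<partition id="0"><ca dav="18726" dcn="1502531" fav="0" pp="0"/></partition>'
    '</uca>'
)

# Root element name -> which cache the statistics describe.
SCOPES = {"uca": "volume cache", "lca": "volume copy cache"}


def cache_scope(xml_text):
    """Return (scope, node-level <ca> attributes, {partition id: <ca> attributes})."""
    root = ET.fromstring(xml_text)
    scope = SCOPES.get(root.tag, "unknown")
    node_ca = root.find("ca")
    partitions = {}
    for part in root.findall("partition"):
        part_ca = part.find("ca")
        partitions[part.get("id")] = dict(part_ca.attrib) if part_ca is not None else {}
    return scope, (dict(node_ca.attrib) if node_ca is not None else {}), partitions


print(cache_scope(EXCERPT)[0])                          # volume cache
print(cache_scope(EXCERPT.replace("uca>", "lca>"))[0])  # volume copy cache
```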
Event reporting
Events that are detected are saved in an event log. As soon as an entry is made in
this event log, the condition is analyzed. If any service activity is required and
you have set up notifications, a notification is sent.
The following methods are used to notify you and the IBM Support Center of a
new event:
v The most serious system error code is displayed on the front panel of each node
in the system.
v If you enabled Simple Network Management Protocol (SNMP), an SNMP trap is
sent to an SNMP manager that is configured by the customer.
The SNMP manager might be IBM Systems Director, if it is installed, or another
SNMP manager.
v If enabled, log messages can be forwarded on an IP network by using the syslog
protocol.
v If enabled, event notifications can be forwarded by email by using Simple Mail
Transfer Protocol (SMTP).
v Call Home can be enabled so that critical faults generate a problem management
record (PMR) that is then sent directly to the appropriate IBM Support Center by
using email.
Power-on self-test
When you turn on the SAN Volume Controller, the system board performs
self-tests. During the initial tests, the hardware boot symbol is displayed.
When the units are first turned on, all models perform a series of tests to check
the operation of components and of some installed options. This series of tests is
called the power-on self-test (POST).
If a critical failure is detected during the POST, the software is not loaded and the
system error LED on the operator information panel is illuminated. If this failure
occurs, use MAP 5000: Start on page 275 to help isolate the cause of the failure.
When the software is loaded, additional testing takes place, which ensures that all
of the required hardware and software components are installed and functioning
correctly. During the additional testing, the word Booting is displayed on the front
panel along with a boot progress code and a progress bar. If a test failure occurs,
the word Failed is displayed on the front panel.
The service controller performs internal checks and is vital to the operation of the
SAN Volume Controller. If the error (check) LED is illuminated on the service
controller front panel, the front-panel display might not be functioning correctly;
ignore any message that it displays.
Understanding events
When a significant change in status is detected, an event is logged in the event log.
Error data
To prevent a repeated event from filling the event log, some records in the event
log refer to multiple occurrences of the same event. When event log entries are
coalesced in this way, the time stamps of the first and last occurrences of the
problem are saved in the log entry. A count of the number of times that the
error condition has occurred is also saved in the log entry. Other data refers to the
last occurrence of the event.
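The coalescing rule can be modeled in a few lines. This Python sketch is illustrative only. It assumes that two events count as "the same event" when they share an event ID and an object identifier, which is an assumption and not the documented matching rule. The class names, field names, and the event ID used in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class CoalescedEntry:
    event_id: int
    first_timestamp: float
    last_timestamp: float
    count: int = 1
    data: Dict[str, Any] = field(default_factory=dict)


class EventLog:
    """Toy event log that coalesces repeated events (hypothetical model)."""

    def __init__(self):
        # Assumption: (event ID, object ID) identifies a repeated event.
        self._entries: Dict[Tuple[int, str], CoalescedEntry] = {}

    def record(self, event_id, object_id, timestamp, data):
        key = (event_id, object_id)
        entry = self._entries.get(key)
        if entry is None:
            entry = CoalescedEntry(event_id, timestamp, timestamp, 1, dict(data))
            self._entries[key] = entry
        else:
            entry.last_timestamp = timestamp  # first time stamp is kept unchanged
            entry.count += 1                  # number of occurrences
            entry.data = dict(data)           # other data: last occurrence only
        return entry

    def entries(self):
        return list(self._entries.values())


log = EventLog()
log.record(1001, "node1", 100.0, {"detail": "first"})
entry = log.record(1001, "node1", 160.0, {"detail": "second"})
print(entry.count, entry.first_timestamp, entry.last_timestamp, entry.data)
```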
You can view the event log by using the Monitoring > Events options in the
management GUI. The event log contains many entries. You can, however, select
only the type of information that you need.
You can also view the event log by using the lseventlog command in the command-line
interface. See the Command-line interface topic for the command details.
Table 68 describes some of the fields that are available to assist you in diagnosing
problems.
Table 68. Description of data fields for the event log
Data field Description
Event ID This number precisely identifies why the event was logged.
Description A short description of the event.
Status Indicates whether the event requires some attention.
Alert: if a red icon with a cross is shown, follow the fix procedure or
service action to resolve the event and turn the status green.
Event notifications
The system can use Simple Network Management Protocol (SNMP) traps, syslog
messages, and Call Home emails to notify you and the support center when
significant events are detected. Any combination of these notification methods can
be used simultaneously. Notifications are normally sent immediately after an event
is raised. However, some events might occur because of active service actions. If a
recommended service action is active, notifications for these events are sent only
if the events are still unfixed when the service action completes.
Each event that the system detects is assigned a notification type of Error, Warning,
or Information. When you configure notifications, you specify where the
notifications should be sent and which notification types are sent to that recipient.
Events with notification type Error or Warning are shown as alerts in the event log.
Events with notification type Information are shown as messages.
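The rules above combine a per-event notification type with a per-recipient selection of types. The following Python sketch models those rules under stated assumptions: the enum, the function names, and the recipient identifiers are hypothetical and do not reflect how the system stores its notification settings.

```python
from enum import Enum


class NotificationType(Enum):
    ERROR = "Error"
    WARNING = "Warning"
    INFORMATION = "Information"


def event_log_category(ntype):
    """Error and Warning events appear as alerts; Information events appear as messages."""
    if ntype in (NotificationType.ERROR, NotificationType.WARNING):
        return "alert"
    return "message"


def recipients_for(ntype, subscriptions):
    """Return the recipients whose configured notification types include ntype.

    subscriptions maps a recipient identifier (hypothetical) to the set of
    NotificationType values that were selected for that recipient.
    """
    return sorted(r for r, types in subscriptions.items() if ntype in types)


subs = {
    "storage-admin@example.com": {NotificationType.ERROR, NotificationType.WARNING},
    "syslog-server-1": set(NotificationType),
}
print(event_log_category(NotificationType.WARNING))         # alert
print(recipients_for(NotificationType.INFORMATION, subs))   # ['syslog-server-1']
```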
SNMP traps
You can use the Management Information Base (MIB) file for SNMP to configure a
network management program to receive SNMP messages that are sent by the
system. This file can be used with SNMP messages from all versions of the
software. More information about the MIB file for SNMP is available at this
website:
www.ibm.com/storage/support/2145
Search for , then search for MIB. Go to the downloads results to find Management
Information Base (MIB) file for SNMP. Click this link to find download options.
Syslog messages
The syslog protocol is a standard protocol for forwarding log messages from a
sender to a receiver on an IP network. The IP network can be either IPv4 or IPv6.
The system can send syslog messages that notify personnel about an event. The
system can transmit syslog messages in either expanded or concise format. You can
use a syslog manager to view the syslog messages that the system sends. The
system uses the User Datagram Protocol (UDP) to transmit syslog messages.
You can specify up to six syslog servers. You can use the management GUI or the
command-line interface to configure and modify your syslog settings.
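On the receiving side, a syslog manager runs a UDP listener. The following minimal Python sketch uses only the standard socket module to show what such a listener does with each datagram. It is not a replacement for a real syslog manager. It assumes IPv4 (use socket.AF_INET6 for an IPv6 network) and the conventional syslog port 514, which usually requires administrator privileges to bind. The function names are hypothetical.

```python
import socket


def open_syslog_listener(host="0.0.0.0", port=514):
    """Bind a UDP socket that receives syslog datagrams (IPv4 only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one(sock, bufsize=4096):
    """Block until one datagram arrives; return (sender address, message text).

    UDP has no delivery guarantee: a datagram that is lost in transit never
    reaches this socket and is not retransmitted.
    """
    data, (sender, _sender_port) = sock.recvfrom(bufsize)
    return sender, data.decode("utf-8", errors="replace")


# Loopback demonstration on an ephemeral port, so no privileges are needed.
listener = open_syslog_listener("127.0.0.1", 0)
listener.settimeout(5)
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"<131>example message", listener.getsockname())
print(receive_one(listener))   # ('127.0.0.1', '<131>example message')
client.close()
listener.close()
```

Because UDP has no delivery guarantee, configuring more than one syslog server is a simple way to reduce the chance of missing a message.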
Table 70 on page 135 shows how SAN Volume Controller notification codes map to
syslog security-level codes.
Table 71 shows how SAN Volume Controller values of user-defined message origin
identifiers map to syslog facility codes.
Table 71. SAN Volume Controller values of user-defined message origin identifiers and syslog facility codes
SAN Volume Controller value  Syslog value  Syslog facility code  Message format
0                            16            LOG_LOCAL0            Full
1                            17            LOG_LOCAL1            Full
2                            18            LOG_LOCAL2            Full
3                            19            LOG_LOCAL3            Full
4                            20            LOG_LOCAL4            Concise
5                            21            LOG_LOCAL5            Concise
6                            22            LOG_LOCAL6            Concise
7                            23            LOG_LOCAL7            Concise
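Table 71 is a fixed offset: the syslog facility value is the SAN Volume Controller value plus 16. Standard syslog (RFC 5424) encodes the facility together with a severity in the PRI field at the start of each message, as facility × 8 + severity. The following Python sketch reproduces Table 71 and that arithmetic. The severity for a given notification type comes from Table 70, which is not reproduced here, so the severity in the example is only an illustration.

```python
# Table 71: SAN Volume Controller value -> (syslog value, facility code, message format)
TABLE_71 = {
    value: (16 + value, f"LOG_LOCAL{value}", "Full" if value <= 3 else "Concise")
    for value in range(8)
}


def syslog_pri(svc_value, severity):
    """Compute the syslog PRI value (facility * 8 + severity, per RFC 5424)."""
    if svc_value not in TABLE_71:
        raise ValueError("SAN Volume Controller value must be 0-7")
    if not 0 <= severity <= 7:
        raise ValueError("syslog severity must be 0-7")
    facility = TABLE_71[svc_value][0]
    return facility * 8 + severity


def split_pri(pri):
    """Recover (facility, severity) from a received PRI value."""
    return divmod(pri, 8)


print(TABLE_71[5])        # (21, 'LOG_LOCAL5', 'Concise')
print(syslog_pri(0, 3))   # 131
print(split_pri(131))     # (16, 3)
```

A receiver that sees a PRI of 131, for example, can tell from the facility (16, LOG_LOCAL0) that the message uses the full format.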
The Call Home feature transmits operational and event-related data to you and
service personnel through a Simple Mail Transfer Protocol (SMTP) server
connection in the form of an event notification email. When configured, this
function alerts service personnel about hardware failures and potentially serious
configuration or environmental issues.
To send email, you must configure at least one SMTP server. You can specify as
many as five additional SMTP servers for backup purposes. The SMTP server must
accept the relaying of email from the management IP address. You can then use the
management GUI or the command-line interface to configure the email settings,
including contact information and email recipients. Set the reply address to a valid
email address. Send a test email to check that all connections and infrastructure are
set up correctly. You can disable the Call Home function at any time using the
management GUI or the command-line interface.
Notifications can be sent using email, SNMP, or syslog. The data sent for each type
of notification is the same. It includes:
v Record type
v Machine type
v Machine serial number
v Error ID
v Error code
v Software version
v FRU part number
v Cluster (system) name
v Node ID
v Error sequence number
v Time stamp
v Object type
v Object ID
v Problem data
Emails contain the following additional information that allows the Support Center
to contact you:
v Contact names for first and second contacts
v Contact phone numbers for first and second contacts
v Alternate contact numbers for first and second contacts
v Offshift phone number
v Contact email address
v Machine location
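The two lists above describe a common record plus extra contact details that only email notifications carry. The following minimal Python sketch shows that structure. The class and attribute names are hypothetical and do not reflect the actual format of SNMP, syslog, or email notifications.

```python
from dataclasses import dataclass


@dataclass
class NotificationRecord:
    """Data common to email, SNMP, and syslog notifications."""
    record_type: str
    machine_type: str
    machine_serial_number: str
    error_id: str
    error_code: str
    software_version: str
    fru_part_number: str
    system_name: str          # cluster (system) name
    node_id: str
    error_sequence_number: int
    timestamp: str
    object_type: str
    object_id: str
    problem_data: str


@dataclass
class EmailNotification(NotificationRecord):
    """Email adds the contact details that let the Support Center reach you."""
    first_contact_name: str = ""
    second_contact_name: str = ""
    first_contact_phone: str = ""
    second_contact_phone: str = ""
    first_alternate_phone: str = ""
    second_alternate_phone: str = ""
    offshift_phone: str = ""
    contact_email: str = ""
    machine_location: str = ""
```

Using a subclass keeps the common fields identical for all three notification methods, which matches the statement that the data sent for each type of notification is the same.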
To send data and notifications to service personnel, use one of the following email
addresses:
v For systems that are located in North America, Latin America, South America or
the Caribbean Islands, use callhome1@de.ibm.com
v For systems that are located anywhere else in the world, use
callhome0@de.ibm.com
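The choice between the two addresses depends only on where the system is located. The following Python sketch captures that rule. The region strings are a hypothetical encoding of the locations that are named above.

```python
# Locations that use callhome1@de.ibm.com; every other location uses callhome0.
AMERICAS_AND_CARIBBEAN = {
    "North America",
    "Latin America",
    "South America",
    "Caribbean Islands",
}


def call_home_address(region):
    """Return the Call Home destination for a system located in region."""
    if region in AMERICAS_AND_CARIBBEAN:
        return "callhome1@de.ibm.com"
    return "callhome0@de.ibm.com"
```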
Because inventory information is sent using the Call Home email function, you
must meet the Call Home function requirements and enable the Call Home email
function before you can attempt to send inventory information email. You can
adjust the contact information, adjust the frequency of inventory email, or
manually send an inventory email using the management GUI or the
command-line interface.
Example email
For detailed information about what is included in the Call Home inventory
information, configure the system to send an inventory email to yourself. Figure 57
shows an example of an email.
Error codes help you to identify the cause of a problem, a failing component, and
the service actions that might be needed to solve the problem.
Note: If more than one error occurs during an operation, the highest priority error
code is displayed on the front panel. The lower the number for the error code, the
higher the priority. For example, error code 1020 has a higher priority than error
code 1370.
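Because a lower number means a higher priority, selecting the code to display is a minimum over the active codes. The following Python sketch shows the rule. The function name is hypothetical.

```python
def displayed_error_code(active_codes):
    """Return the code that the front panel shows: the lowest number wins."""
    codes = list(active_codes)
    if not codes:
        return None
    return min(codes)


print(displayed_error_code([1370, 1020]))   # 1020
```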
Procedure
1. Locate the error code in one of the tables. If you cannot find a particular code
in any table, call the IBM Support Center for assistance.
2. Read about the action you must complete to correct the problem. Do not
exchange field replaceable units (FRUs) unless you are instructed to do so.
3. Normally, exchange only one FRU at a time, starting from the top of the FRU
list for that error code.
Event IDs
The SAN Volume Controller software generates events, such as informational
events and error events. An event ID or number is associated with the event and
indicates the reason for the event.
Error events are generated when a service action is required. An error event maps
to an alert with an associated error code. Depending on the configuration, error
event notifications can be sent through email, SNMP, or syslog.
Informational events
Informational events provide information about the status of an operation.
Informational events are recorded in the event log and, depending on the
notification type, notifications about them can be sent through email, SNMP, or syslog.
SCSI status
Some events are part of the SCSI architecture and are handled by the host
application or device drivers without reporting an event. Some events, such as
read and write I/O events and events that are associated with the loss of nodes or
loss of access to backend devices, cause application I/O to fail. To help
troubleshoot these events, SCSI commands are returned with the Check Condition
status and a 32-bit event identifier is included with the sense information. The
identifier relates to a specific event in the event log.
If the host application or device driver captures and stores this information, you
can relate the application failure to the event log.
Table 73 on page 143 describes the SCSI status and codes that are returned by the
nodes.
SCSI Sense
Nodes notify the hosts of events on SCSI commands. Table 74 defines the SCSI
sense keys, codes and qualifiers that are returned by the nodes.
Table 74. SCSI sense keys, codes, and qualifiers
Key Code Qualifier
2h 04h 01h
Definition: Not Ready. The logical unit is in the process of becoming ready.
Description: The node lost sight of the system and cannot perform I/O operations. The additional sense does not have additional information.
2h 04h 0Ch
Definition: Not Ready. The target port is in the state of unavailable.
Description: The following conditions are possible:
v The node lost sight of the system and cannot perform I/O operations. The additional sense does not have additional information.
v The node is in contact with the system but cannot perform I/O operations to the specified logical unit because of either a loss of connectivity to the backend controller or some algorithmic problem. This sense is returned for offline volumes.
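A host-side tool that captures the sense data can match it against Table 74. The following Python sketch assumes fixed-format sense data as defined in the SCSI Primary Commands (SPC) standard: the sense key is the low four bits of byte 2, the additional sense code (ASC) is byte 12, and the additional sense code qualifier (ASCQ) is byte 13. Whether a particular host driver exposes the raw bytes, and in which format, must be checked for that driver. The function name is hypothetical.

```python
# Table 74 entries keyed by (sense key, ASC, ASCQ).
TABLE_74 = {
    (0x2, 0x04, 0x01): "Not Ready. The logical unit is in the process of becoming ready.",
    (0x2, 0x04, 0x0C): "Not Ready. The target port is in the state of unavailable.",
}


def decode_fixed_sense(sense):
    """Return (sense key, ASC, ASCQ, Table 74 definition) from fixed-format sense data."""
    if len(sense) < 14:
        raise ValueError("fixed-format sense data must be at least 14 bytes")
    response_code = sense[0] & 0x7F          # 0x70 = current error, 0x71 = deferred error
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    asc, ascq = sense[12], sense[13]
    definition = TABLE_74.get((key, asc, ascq), "not listed in Table 74")
    return key, asc, ascq, definition


example = bytearray(24)
example[0] = 0x70    # fixed format, current error
example[2] = 0x02    # sense key 2h (Not Ready)
example[7] = 16      # additional sense length: bytes 8-23 follow
example[12] = 0x04   # ASC 04h
example[13] = 0x0C   # ASCQ 0Ch
print(decode_fixed_sense(bytes(example)))
```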
Reason codes
The reason code appears in bytes 20-23 of the sense data. The reason code refers to
a specific event log entry on the node. The field is a 32-bit unsigned number that is
presented with the most significant byte first. Table 75 lists the reason codes and
their definitions.
If the reason code is not listed in Table 75, the code refers to a specific event in the
event log that corresponds to the sequence number of the relevant event log entry.
Table 75. Reason codes
Reason code (decimal) Description
40 The resource is part of a stopped FlashCopy mapping.
50 The resource is part of a Metro Mirror or Global Mirror relationship and the secondary LUN is offline.
51 The resource is part of a Metro Mirror or Global Mirror relationship and the secondary LUN is read only.
60 The node is offline.
71 The resource is not bound to any domain.
72 The resource is bound to a domain that has been recreated.
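The reason code field is a big-endian (most significant byte first) 32-bit unsigned integer at bytes 20-23. The following Python sketch reads it with the standard struct module. If the code is listed in Table 75, the sketch returns the table description. Otherwise it treats the code as an event log sequence number, as described above. The function names are hypothetical.

```python
import struct

# Table 75 (reason codes are decimal).
REASON_CODES = {
    40: "The resource is part of a stopped FlashCopy mapping.",
    50: "The resource is part of a Metro Mirror or Global Mirror relationship "
        "and the secondary LUN is offline.",
    51: "The resource is part of a Metro Mirror or Global Mirror relationship "
        "and the secondary LUN is read only.",
    60: "The node is offline.",
    71: "The resource is not bound to any domain.",
    72: "The resource is bound to a domain that has been recreated.",
}


def reason_code(sense):
    """Read the unsigned 32-bit, most-significant-byte-first value in bytes 20-23."""
    if len(sense) < 24:
        raise ValueError("sense data is too short to contain a reason code")
    (code,) = struct.unpack_from(">I", sense, 20)
    return code


def explain_reason(code):
    """Use Table 75 if the code is listed; otherwise it is an event log sequence number."""
    if code in REASON_CODES:
        return REASON_CODES[code]
    return f"See the event log entry with sequence number {code}."


sense = bytearray(24)
sense[20:24] = (60).to_bytes(4, "big")
print(reason_code(bytes(sense)), explain_reason(reason_code(bytes(sense))))
```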
Object types
You can use the object code to determine the type of the object the event is logged
against.
The node serial number (also known as the product or machine serial number) is
on the MT-M SN label (Machine Type - Model and Serial Number label) on the
front (left side) of the 2145-DH8 node. The node serial number is written to the
system board and to each of the two boot drives during the manufacturing
process.
When the SAN Volume Controller software starts, it reads the node serial number
from the system board (using the node serial number for the panel name) and
compares it with the node serial numbers stored on the two boot drives.
If the problem produces node error 743, 744, or 745, an event is shown in the
Monitoring > Events panel of the management GUI. Run the fix procedure for that event.
Otherwise, connect to the technician port to view the boot drive slot information,
and use the MT-M SN label on the node to determine the problem.
Attention: If a drive slot has Yes in the Active column, the operating system
depends on that drive. Do not remove that drive without first shutting down the
node.
v Do not swap boot drives between slots.
v Each boot drive has a copy of the VPD on the system board.
v Software is upgraded on one boot drive at a time to prevent failures during a
concurrent code upgrade (CCU).
Procedure
To resolve a problem with a boot drive, complete the following steps in order:
Note: If the node serial number does not match the node serial number on the
system board, a drive slot has a status of wrong_node. You can ignore this status
if the serial number on the MT-M SN label matches the node serial number on
the drive.
4. Move any drive that is in the wrong slot back to the correct slot.
5. Reseat the drive in any slot that has a status of failed. If the status remains
failed, replace the drive with one from FRU stock.
6. If the drive slot has a status of out of sync:
v If Yes is displayed in the can_sync column, synchronize the boot drives by using
the service assistant GUI or the command-line interface (CLI) command satask chbootdrive -sync.
v If No is displayed in the can_sync column, you must resolve another boot
drive problem first.
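Steps 4 to 6 amount to a per-slot decision. The following Python sketch restates them as a function so that the order of the checks is explicit. The status strings are copied from the text above and might be spelled differently in the actual service assistant or CLI output. The function name is hypothetical. The sketch does not cover system board replacement or the cases where no usable software or node serial number remains.

```python
def boot_drive_action(status, can_sync=None, in_wrong_slot=False):
    """Return the action from steps 4-6 for one boot drive slot."""
    if in_wrong_slot:                           # step 4
        return "Move the drive back to the correct slot."
    if status == "failed":                      # step 5
        return "Reseat the drive; if the status remains failed, replace it from FRU stock."
    if status == "out of sync":                 # step 6
        if can_sync == "Yes":
            return ("Synchronize the boot drives (service assistant GUI or "
                    "satask chbootdrive -sync).")
        return "Resolve the other boot drive problem first."
    # Anything else, including wrong_node when the MT-M SN label matches the
    # node serial number on the drive, needs no action from these steps.
    return "No action from steps 4-6."
```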
Replacing the 2145-DH8 system board:
7. Replace the SAN Volume Controller 2145-DH8 system board.
When neither of the boot drives has usable SAN Volume Controller software:
For example, if you replace both of the boot drives from FRU stock at the same
time, neither boot drive has usable SAN Volume Controller software. If the SAN
Volume Controller software is not running, the node status, node fault, battery
status, and battery fault LEDs remain off.
8. If you cannot replace at least one of the original boot drives with a drive that
contains usable SAN Volume Controller software and has a node serial number
that matches the MT-M SN label on the front of the node, contact IBM Remote
Technical Support. IBM Remote Technical Support can help you install the SAN
Volume Controller software with a bootable USB flash drive.
v Field-based USB installation also repairs the node serial number and WWNN
stored on each boot drive by finding values that are stored on the system
board during manufacturing.
v If the WWNN of this node changed in the past, you must change the
WWNN again after completing the SAN Volume Controller software
installation. For example, if the node replaced a legacy SAN Volume
Controller node, you would have changed the WWNN to that of the legacy
node. You can repeat the change to the WWNN after the SAN Volume
Controller software installation by using the service assistant GUI or the
command-line interface.
When every copy of the node serial number is lost:
For example, if you replace the system board and both of the boot drives with FRU
stock at the same time, every copy of the node serial number is lost.
9. If you cannot replace one of the original boot drives or the original system
board so that at least one copy of the original node serial number is present,
you cannot repair the node in the field. You must return the node to IBM for
repair.
A drive slot has a status of uninitialized only if the SAN Volume Controller
software cannot automatically initialize the FRU drive. This status can happen if
the node serial number on the other boot drive does not match the node serial
number on the system board. If the node serial number on the other boot drive
matches the MT-M SN label on the front (left side) of the node, you can safely rescue the
uninitialized boot drive from the other boot drive. Use the service assistant
GUI or the satask rescuenode command to rescue the drive.
Line 1 of the front panel displays the message Booting that is followed by the boot
code. Line 2 of the display shows a boot progress indicator. If the boot code detects
an error that makes it impossible to continue, Failed is displayed. You can use the
code to isolate the fault.
Failed 120
If the boot detects a situation where it cannot continue, it fails. The cause might be
that the software on the hard disk drive is missing or damaged. If possible, the
boot sequence loads and starts the SAN Volume Controller software. Any faults
that are detected are reported as a node error.
Procedure
1. Attempt to restore the software by using the node rescue procedure.
2. If node rescue fails, perform the actions that are described for any failing
node-rescue code or procedure.
The codes indicate the progress of the boot operation. Line 1 of the front panel
displays the message Booting that is followed by the boot code. Line 2 of the
display shows a boot progress indicator. Figure 59 on page 160 provides a view of
the boot progress display.
Use the service assistant GUI through the technician port to view node errors on a node
that does not have a front panel display, such as a 2145-DH8 node.
Because node errors are specific to a node, for example, memory failures, the errors
might be reported only on that node. However, if the node can communicate with
the configuration node, the error is also reported in the system event log.
When the node error code indicates that a critical error was detected that prevents
the node from becoming a member of a clustered system, the Node fault LED is on
for the 2145-DH8 node, or Line 1 of the front panel display contains the message
Node Error.
Line 2 contains either the error code or the error code and additional data. In
errors that involve a node with more than one power supply, the error code is
followed by two numbers. The first number indicates the power supply that has a
problem (either a 1 or a 2). The second number indicates the problem that is
detected.
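For node error 530, the two numbers that follow the code can be decoded mechanically. The following Python sketch uses the power supply and reason values that are listed for node error 530. The function name and the returned wording are hypothetical, and the sketch takes the three numbers as integers instead of parsing the front panel text.

```python
# Reason values that are listed for node error 530.
PSU_REASONS = {
    1: "The power supply is not detected.",
    2: "The power supply has failed.",
    3: "There is no input power to the power supply.",
}


def decode_power_supply_error(code, supply, reason):
    """Describe a power-supply node error, such as 530, that is followed by two numbers."""
    if supply not in (1, 2):
        raise ValueError("the power supply number is 1 or 2")
    if reason not in PSU_REASONS:
        raise ValueError("the reason number is 1, 2, or 3")
    return f"Node error {code}: power supply {supply}. {PSU_REASONS[reason]}"


print(decode_power_supply_error(530, 2, 3))
```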
Figure 60 provides an example of a node error code. This data might exceed the
maximum width of the menu screen. You can press the Right navigation button to scroll
the display.
The additional data is unique for each error code. It provides the necessary
information to isolate the problem in an offline environment. Examples of
additional data are disk serial numbers and field replaceable unit (FRU) location
codes. When these codes are displayed, you can do additional fault isolation by
browsing the default menu to determine the node and Fibre Channel port status.
There are two types of node errors: critical node errors and noncritical node errors.
Critical errors
A critical error means that the node is not able to participate in a clustered system
until the issue that is preventing it from joining a clustered system is resolved. This
error occurs because part of the hardware failed or the system detects that the
Noncritical errors
A noncritical error code is logged when there is a hardware or code failure that is
related to one specific node. These errors do not stop the node from entering active
state and joining a clustered system. If the node is part of a clustered system, an
alert describes the error condition. The range of errors that is reserved for
noncritical errors is 800 - 899.
To start node rescue, press and hold the left and right buttons on the front panel
during a power-on cycle. The menu screen displays the Node rescue request. See
the node rescue request topic. The hard disk is formatted and, if the format
completes without error, the software image is downloaded from any available
node. During node recovery, Line 1 of the menu screen displays the message
Booting followed by one of the node rescue codes. Line 2 of the menu screen
displays a boot progress indicator. Figure 61 shows an example of a displayed
node rescue code.
Booting 300
The three-digit code that is shown in Figure 61 represents a node rescue code.
Note: The 2145 UPS-1U does not power off following a node rescue failure.
Line 1 of the menu screen contains the message Create Failed. Line 2 shows the
error code and, where necessary, additional data.
To avoid the possibility of corrupting your configuration, you must perform software
problem analysis before you perform further operations.
Error codes for clustered systems describe errors other than recovery errors.
130 Checking the internal disk file system
Explanation: The file system on the internal disk drive of the node is being checked for inconsistencies.
User response: If the progress bar has been stopped for at least five minutes, power off the node and then power on the node. If the boot process stops again at this point, run the node rescue procedure.
Possible Cause-FRUs or other:
v None.
132 Updating BIOS settings of the node
Explanation: The system has found that changes are required to the BIOS settings of the node. These changes are being made. The node will restart once the changes are complete.
User response: If the progress bar has stopped for more than 10 minutes, or if the display has shown codes 100 and 132 three times or more, go to MAP 5900: Hardware boot to resolve the problem.
135 Verifying the software
Explanation: The software packages of the node are being checked for integrity.
User response: Allow the verification process to complete.
Explanation: The saved cluster state and cache data is being loaded.
User response: If the progress bar has been stopped for at least 5 minutes, power off the node and then power on the node. If the boot process stops again at this point, run the node rescue procedure.
Possible Cause-FRUs or other:
v None.
160 Updating the service controller
Explanation: The firmware on the service controller is being updated. This can take 30 minutes.
User response: When a node rescue is occurring, if the progress bar has been stopped for at least 30 minutes, exchange the FRU for a new FRU. When a node rescue is not occurring, if the progress bar has been stopped for at least 15 minutes, exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Service controller (95%)
v Service controller cable (5%)
All previous 2145 models
v Service Controller (100%)
168 The command cannot be initiated because authentication credentials for the current SSH session have expired.
Explanation: Authentication credentials for the current SSH session have expired, and all authorization for the current session has been revoked. A system administrator may have cleared the authentication cache.
User response: Begin a new SSH session and re-issue the command.
170 A flash module hardware error has occurred.
User response: Exchange the FRU for a new FRU.
User response: If the progress bar has been stopped for more than two minutes, exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
v Fibre Channel adapter (100%)
345 The 2145 is searching for a donor node from which to copy the software.
Explanation: The node is searching at 1 Gb/s for a donor node.
User response: If the progress bar has stopped for more than two minutes, exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
v Fibre Channel adapter (100%)
310 The 2145 is running a format operation.
Explanation: The 2145 is running a format operation.
User response: If the progress bar has been stopped for two minutes, exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
v Disk drive (50%)
v Disk controller (30%)
v Disk backplane (10%)
v Disk signal cable (8%)
v Disk power cable (1%)
v Disk cable assembly (10%)
Explanation: None.
User response: None.
370 Installing software
Explanation: The 2145 is installing software.
User response:
1. If this code is displayed and the progress bar has been stopped for at least ten minutes, the software install process has failed with an unexpected software error.
2. Power off the 2145 and wait for 60 seconds.
3. Power on the 2145. The software update operation continues.
4. Report this problem immediately to your Software Support Center.
1. Check the support site for a code update.
2. Use the remove and replace procedures to replace the enclosure midplane.
Possible Cause-FRUs or other:
v Enclosure midplane (100%)
User response: Follow troubleshooting procedures to configure the WWNN of the node.
1. Continue to follow the hardware remove and replace procedure for the service controller or disk.
2. If you have not followed the hardware remove and replace procedures, determine the correct WWNN. If you do not have this information recorded, examine your Fibre Channel switch configuration to see whether it is listed there. Follow the procedures to change the WWNN of a node.
Possible Cause-FRUs or other:
v None
4. Plug in power, and then wait for the node to boot.
5. If that fails, replace the system board.
Exchange the FRU for a new FRU.
Possible Cause-FRUs or other:
Explanation: The node startup procedures have found problems with the file system on the internal disk of the node.
User response: Follow troubleshooting procedures to reload the software.
1. Follow Procedure: Rescuing node canister machine code from another node (node rescue).
2. If the node rescue does not succeed, use the hardware remove and replace procedures.
Possible Cause-FRUs or other:
v Node canister (80%)
v Other (20%)
530 A problem with one of the node's power supplies has been detected.
Explanation: The 530 error code is followed by two numbers. The first number is either 1 or 2 to indicate which power supply has the problem. The second number is either 1, 2 or 3 to indicate the reason. 1 indicates that the power supply is not detected. 2 indicates that the power supply has failed. 3 indicates that there is no input power to the power supply.
If the node is a member of a cluster, the cluster will report error code 1096 or 1097, depending on the error reason.
The error will automatically clear when the problem is fixed.
User response:
1. Ensure that the power supply is seated correctly and that the power cable is attached correctly to both the node and to the 2145 UPS-1U.
2. If the error has not been automatically marked fixed after two minutes, note the status of the three LEDs on the back of the power supply. For the 2145-CG8 or 2145-CF8, the AC LED is the top green LED, the DC LED is the middle green LED and the error LED is the bottom amber LED.
3. If the power supply error LED is off and the AC and DC power LEDs are both on, this is the normal condition. If the error has not been automatically fixed after two minutes, replace the system board.
4. Follow the action specified for the LED states noted in the table below.
5. If the error has not been automatically fixed after two minutes, contact support.
Error,AC,DC:Action
Reason 1: A power supply is not detected.
v Power supply (19%)
v System board (1%)
v Other: Power supply is not installed correctly (80%)
Reason 2: The power supply has failed.
v Power supply (90%)
v Power cable assembly (5%)
v System board (5%)
Reason 3: There is no input power to the power supply.
v Power cable assembly (25%)
v UPS-1U assembly (4%)
v System board (1%)
v Other: Power supply is not installed correctly (70%)
534 System board fault
Explanation: There is an unrecoverable error condition in a device on the system board.
User response: For a or storage enclosure, replace the canister and reuse the interface adapters and fans.
For a control enclosure, refer to the additional details supplied with the error to determine the proper parts replacement sequence.
v Pwr rail A: Replace CPU 1.
Replace the power supply if the OVER SPEC LED on the light path diagnostics panel is still lit.
v Pwr rail B: Replace CPU 2.
Replace the power supply if the OVER SPEC LED on the light path diagnostics panel is still lit.
v Pwr rail C: Replace the following components until "Pwr rail C" is no longer reported:
DIMMs 1 - 6
PCI riser-card assembly 1
Fan 1
Optional adapters that are installed in PCI riser-card assembly 1
Replace the power supply if the OVER SPEC LED on the light path diagnostics panel is still lit.
v Pwr rail D: Replace the following components until "Pwr rail D" is no longer reported:
DIMMs 7 - 12
Fan 2
Optional PCI adapter power cable
Replace the power supply if the OVER SPEC LED on the light path diagnostics panel is still lit.
v Pwr rail E: Replace the following components until "Pwr rail E" is no longer reported:
DIMMs 13 - 18
Hard disk drives
Replace the power supply if the OVER SPEC LED on the light path diagnostics panel is still lit.
v Pwr rail F: Replace the following components until "Pwr rail F" is no longer reported:
DIMMs 19 - 24
Fan 4
Optional adapters that are installed in PCI riser-card assembly 2
PCI riser-card assembly 2
Replace the power supply if the OVER SPEC LED on the light path diagnostics panel is still lit.
v Pwr rail G: Replace the following components until "Pwr rail G" is no longer reported:
Hard disk drive backplane assembly
Hard disk drives
Fan 3
Optional PCI adapter power cable
v Pwr rail H: Replace the following components until "Pwr rail H" is no longer reported:
Optional adapters that are installed in PCI riser-card assembly 2
Optional PCI adapter power cable
Possible Cause-FRUs or other:
v Hardware (100%)
535 Canister internal PCIe switch failed
Explanation: The PCI Express switch has failed or cannot be detected. In this situation, the only connectivity to the node canister is through the Ethernet ports.
User response: Follow troubleshooting procedures to fix the hardware:
Possible Cause-FRUs or other:
536 The temperature of a device on the system board is greater than or equal to the critical threshold.
Explanation: The temperature of a device on the system board is greater than or equal to the critical threshold.
User response: Check for external and internal air flow blockages or damage.
1. Remove the top of the machine case and check for missing baffles, damaged heat sinks, or internal blockages.
2. If the error persists, replace system board.
Possible Cause-FRUs or other:
v None
538 The temperature of a PCI riser card is greater than or equal to the critical threshold.
Explanation: The temperature of a PCI riser card is greater than or equal to the critical threshold.
User response: Improve cooling.
1. If the problem persists, replace the PCI riser
Possible Cause-FRUs or other:
v None
541 Multiple, undetermined, hardware errors
Explanation: Multiple hardware failures have been reported on the data paths within the node canister, and the threshold of the number of acceptable errors within a given time frame has been reached. It has not been possible to isolate the errors to a single component.
After this node error has been raised, all ports on the node will be deactivated. The reason for this is that the node canister is considered unstable, and has the potential to corrupt data.
User response:
1. Follow the procedure for collecting information for support, and contact your support organization.
2. A software [code] update may resolve the issue.
3. Replace the node canister.
542 An installed CPU has failed or been removed.
Explanation: An installed CPU has failed or been removed.
User response: Replace the CPU.
4. Ensure that Fibre Channel network zoning changes have not restricted communication between nodes or between the nodes and the quorum disk.
5. Perform the problem determination procedures for the network.
6. The quorum disk failed or cannot be accessed. Perform the problem determination procedures for the disk controller.

551 A cluster cannot be formed because of a lack of cluster resources.

Explanation: The node does not have sufficient connectivity to other nodes or the quorum device to form a cluster.

Attempt to repair the fabric or quorum device to establish connectivity. If a disaster occurred and the nodes at the other site cannot be recovered, it is possible to allow the nodes at the surviving site to form a system by using local storage.

User response: Follow troubleshooting procedures to correct connectivity issues between the cluster nodes and the quorum devices.
1. Check for any node errors that indicate issues with Fibre Channel connectivity. Resolve any issues.
2. Ensure that the other nodes in the cluster are powered on and operational.
3. Using the service assistant (SAT) GUI or CLI (sainfo lsservicestatus), display the Fibre Channel port status. If any port is not active, perform the Fibre Channel port problem determination procedures.
4. Ensure that Fibre Channel network zoning changes have not restricted communication between nodes or between the nodes and the quorum disk.
5. Perform the problem determination procedures for the network.
6. The quorum disk failed or cannot be accessed. Perform the problem determination procedures for the disk controller.
7. As a last resort, when the nodes at the other site cannot be recovered, it is possible to allow the nodes at the surviving site to form a system by using local site storage:
To avoid data corruption, ensure that all host servers that were previously accessing the system have had all volumes unmounted or have been rebooted. Ensure that the nodes at the other site are not operational and are unable to form a system in the future.
After the satask overridequorum command is run, a full resynchronization of all mirrored volumes is performed when the other site is recovered. This is likely to take many hours or days to complete.
Contact IBM support personnel if you are unsure.

Note: Before continuing, confirm that you have taken the following actions. Failure to perform these actions can lead to data corruption that is undetected by the system but affects host applications.
a. All host servers that were previously accessing the system have had all volumes unmounted or have been rebooted.
b. The nodes at the other site are not operating as a system, and actions have been taken to prevent them from forming a system in the future.

After these actions have been taken, the satask overridequorum command can be used to allow the nodes at the surviving site to form a system that uses local storage.

555 Power Domain error

Explanation: Both 2145s in an I/O group are being powered by the same uninterruptible power supply. The ID of the other 2145 is displayed with the node error code on the front panel.

User response: Ensure that the configuration is correct and that each 2145 in an I/O group is connected to a separate uninterruptible power supply.

556 A duplicate WWNN has been detected.

Explanation: The node has detected another device that has the same World Wide Node Name (WWNN) on the Fibre Channel network. A WWNN is 16 hexadecimal digits long. For a cluster, the first 11 digits are always 50050768010. The last 5 digits of the WWNN are given in the additional data of the error and appear on the front panel displays. The Fibre Channel ports of the node are disabled to prevent disruption of the Fibre Channel network. One or both nodes with the same WWNN can show the error. Because of the way WWNNs are allocated, a device with a duplicate WWNN is normally another cluster node.

User response: Follow troubleshooting procedures to configure the WWNN of the node:
1. Find the cluster node with the same WWNN as the node reporting the error. The WWNN for a cluster node can be found from the node Vital Product Data (VPD) or from the Node menu on the front panel. The node with the duplicate WWNN need not be part of the same cluster as the node reporting the error; it could be remote from the node reporting the error on a part of the fabric connected through an inter-switch link. The WWNN of the node is stored within the service controller, so the duplication is most likely caused by the replacement of a service controller.
2. If a cluster node with a duplicate WWNN is found, determine whether it, or the node reporting the error, has the incorrect WWNN. Generally, it is the node that has had its service controller that was
The node error does not persist across restarts of the machine code on the node.

User response: Follow troubleshooting procedures to reload the machine code:
1. Get a support package (snap), including dumps, from the node, using the management GUI or the service assistant.
2. If more than one node is reporting this error, contact IBM technical support for assistance. The support package from each node will be required.
3. Check the support site to see whether the issue is known and whether a machine code update exists to resolve the issue. Update the cluster machine code if a resolution is available. Use the manual update process on the node that reported the error first.
4. If the problem remains unresolved, contact IBM technical support and send them the support package.

Possible Cause-FRUs or other:
• None

User response: Check for external and internal air flow blockages or damage.
1. Remove the top of the machine case and check for missing baffles, damaged heat sinks, or internal blockages.
2. If the problem persists, replace the CPU/heat sink.

Possible Cause-FRUs or other:
• CPU
• Heat sink

570 Battery protection unavailable

Explanation: The node cannot start because battery protection is not available. Both batteries require user intervention before they can become available.

User response: Follow troubleshooting procedures to fix hardware.

The appropriate service action will be indicated by an accompanying non-fatal node error. Examine the event log to determine the accompanying node error.
1. Follow the procedure to run a node rescue.
2. If the error occurs again, contact IBM technical support.

Possible Cause-FRUs or other:
• None

574 The node machine code is damaged.

Explanation: A checksum failure has indicated that the node machine code is damaged and needs to be reinstalled.

User response:
1. If the other nodes are operational, run node rescue; otherwise, install new machine code using the service assistant. Node rescue failures, as well as the repeated return of this node error after reinstallation, are symptomatic of a hardware fault with the node.

Possible Cause-FRUs or other:
• None

576 The cluster state and configuration data cannot be read.

Explanation: The node has been unable to read the saved cluster state and configuration data from its internal drive because of a read or medium error.

User response: In the sequence shown, exchange the FRUs for new FRUs.

Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
• Disk drive (50%)
• Disk controller (30%)
• Disk backplane (10%)
• Disk signal cable (8%)
• Disk power cable (1%)
• System board (1%)

578 The state data was not saved following a power loss.

Explanation: On startup, the node was unable to read its state data. When this happens, it expects to be automatically added back into a clustered system. However, if it is not joined to a clustered system within 60 seconds, it raises this node error. This error is a critical node error, and user action is required before the node can become a candidate to join a clustered system.

User response: Follow troubleshooting procedures to correct connectivity issues between the clustered system nodes and the quorum devices.
1. Manual intervention is required once the node reports this error.
2. Attempt to reestablish the clustered system by using other nodes. This step might involve fixing hardware issues on other nodes or fixing connectivity issues between nodes.
3. If you are able to reestablish the clustered system, remove the system data from the node that shows error 578 so it goes to a candidate state. It is then automatically added back to the clustered system.
a. To remove the system data from the node, go to the service assistant, select the radio button for the node with a 578, click Manage System, and then choose Remove System Data.
b. Alternatively, use the CLI command satask leavecluster -force.
If the node is not automatically added back to the clustered system, note the name and I/O group of the node, and then delete the node from the clustered system configuration (if this has not already happened). Add the node back to the clustered system using the same name and I/O group.
4. If all nodes have either node error 578 or 550, follow the recommended user response for node error 550.
5. Attempt to determine what caused the nodes to shut down.

Possible Cause-FRUs or other:
• None

579 Battery subsystem has insufficient charge to save system data

Explanation: Not enough capacity is available from the battery subsystem to save system data in response to a series of battery and boot-drive faults.

User response: Follow troubleshooting procedures to fix hardware.

The appropriate service actions are indicated by the series of battery and boot-drive faults. Examine the event log to determine the accompanying faults. Service the other faults.

580 The service controller ID could not be read.

Explanation: The 2145 cannot read the unique ID from the service controller, so the Fibre Channel adapters cannot be started.

User response: In the sequence shown, exchange the following FRUs for new FRUs.

Possible Cause-FRUs or other:
2145-CG8 or 2145-CF8
• Service controller (70%)
• Service controller cable (30%)
Service controller (100%)

Other:
• None

581 A serial link error in the 2145 UPS-1U has occurred.

Explanation: There is a fault in the communications cable, the serial interface in the uninterruptible power supply 2145 UPS-1U, or the 2145.

User response: Check that the communications cable is correctly plugged into the 2145 and the 2145 UPS-1U. If the cable is plugged in correctly, replace the FRUs in the order shown.

Possible Cause-FRUs or other:
2145-CF8 or 2145-CG8
• 2145 power cable assembly (40%)
• 2145 UPS-1U assembly (30%)
• 2145 system board (30%)

582 A battery error in the 2145 UPS-1U has occurred.

Explanation: A problem has occurred with the uninterruptible power supply 2145 UPS-1U battery.

User response: Exchange the FRU for a new FRU.

After replacing the battery assembly, if the 2145 UPS-1U service indicator is on, press and hold the 2145 UPS-1U Test button for three seconds to start the self-test and verify the repair. During the self-test, the rightmost four LEDs on the 2145 UPS-1U front-panel assembly flash in sequence.

Possible Cause-FRUs or other:
• UPS-1U battery assembly (50%)
• UPS-1U assembly (50%)

583 An electronics error in the 2145 UPS-1U has occurred.

Explanation: A problem has occurred with the 2145 UPS-1U electronics.

1. Ensure that only one 2145 is receiving power from the 2145 UPS-1U. Also ensure that no other devices are connected to the 2145 UPS-1U.
2. Disconnect the 2145 from the 2145 UPS-1U. If the Overload Indicator is still illuminated with the 2145 disconnected, replace the 2145 UPS-1U.
3. If the Overload Indicator is now off and the node is a 2145-CG8 or 2145-CF8, use the disconnected 2145, with all outputs disconnected, to determine whether one of the two power supplies or the power cable assembly must be replaced. Plug just one power cable into the left-hand power supply, start the node, and see whether the error is reported. Then shut down the node, connect the other power cable to the left-hand power supply, start the node, and see whether the error is repeated. Then repeat the two tests for the right-hand power supply. If the error is repeated for both cables on one power supply but not the other, replace the power supply that showed the error; otherwise, replace the power cable assembly.

Possible Cause-FRUs or other:
• Power cable assembly (45%)
• Power supply assembly (45%)
• UPS-1U assembly (10%)

586 The power supply to the 2145 UPS-1U does not meet requirements.

Explanation: None.

User response: Follow troubleshooting procedures to fix the hardware.

587 An incorrect type of uninterruptible power supply has been detected.

Explanation: An incorrect type of 2145 UPS-1U was installed.

User response: Exchange the 2145 UPS-1U for one of the correct type.

Possible Cause-FRUs or other:
• 2145 UPS-1U (100%)

Other:
• Cabling error (100%)

589 The 2145 UPS-1U ambient temperature limit has been exceeded.

Explanation: The ambient temperature threshold for the 2145 UPS-1U has been exceeded.

User response: Reduce the temperature around the system:
1. Turn off the 2145 UPS-1U and unplug it from the power source.
2. Clear the vents and remove any heat sources.
3. Ensure that the air flow around the 2145 UPS-1U is not restricted.
4. Wait at least five minutes, and then restart the 2145 UPS-1U. If the problem remains, exchange the 2145 UPS-1U assembly.

590 Repetitive node transitions into standby mode from normal mode because of power subsystem-related node errors.

Explanation: Multiple node restarts occurred because of 2145 UPS-1U errors, which can be reported on any node type.

This error means that the node made the transition into standby from normal mode because of power subsystem-related node errors too many times within a short period. "Too many times" is defined as three, and "a short period" is defined as 1 hour. This error alerts the user that something might be wrong with the power subsystem, because it is not normal for the node to repeatedly go in and out of standby.

If the actions of the tester or engineer are expected to cause many frequent transitions from normal to standby and back, this error does not imply that there is any actual fault with the system.

User response: Follow troubleshooting procedures to fix the hardware:
1. Verify that the room temperature is within specified limits and that the input power is stable.

650 The canister battery is not supported

Explanation: The canister battery shows product data that indicates it cannot be used with the code version of the canister.

User response: Resolve this error either by obtaining a battery that is supported by the system's code level or by updating the canister's code level to a level that supports the battery.
1. Remove the canister and its lid and check that the FRU part number of the new battery matches that of the replaced battery. Obtain the correct FRU part if it does not.
2. If the canister has just been replaced, check the code level of the partner node canister and use the service assistant to update this canister's code level to the same level.

Possible Cause-FRUs or other cause:
• Canister battery

651 The canister battery is missing

Explanation: The canister battery cannot be detected.

User response:
1. Use the remove and replace procedures to remove the node canister and its lid.
2. Use the remove and replace procedures to install a battery.
3. If a battery is present, ensure that it is fully inserted. Replace the canister.
4. If this error persists, use the remove and replace procedures to replace the battery.

Possible Cause-FRUs or other cause:
• Canister battery

652 The canister battery has failed

Explanation: The canister battery has failed. The battery may be showing an error state, it may have reached the end of life, or it may have failed to charge.
Possible Cause-FRUs or other cause:
• Canister battery

655 Canister battery communications fault.

Explanation: The canister cannot communicate with the battery.

User response:
• Use the remove and replace procedures to replace the battery.
• If the node error persists, use the remove and replace procedures to replace the node canister.

Possible Cause-FRUs or other cause:
• Canister battery
• Node canister

656 The canister battery has insufficient charge

Explanation: The canister battery has insufficient charge to save the canister's state and cache data to the internal drive if power were to fail.

User response:

Explanation: On the current systems, users cannot be set to remote.

User response: Any user defined on the system must be a local user. To create a remote user, the user must not be defined on the local system.

670 The UPS battery charge is not enough to allow the node to start.

Explanation: The uninterruptible power supply connected to the node does not have sufficient battery charge for the node to safely become active in a cluster. The node will not start until a sufficient charge exists to store the state and configuration data held in the node memory if power were to fail. The front panel of the node will show "charging".

User response: Wait for sufficient battery charge for the node to start:
1. Wait for the node to automatically fix the error when there is sufficient charge.
2. Ensure that no error conditions are indicated on the uninterruptible power supply.
This node error does not, in itself, stop the node becoming active in the system. However, the Fibre Channel network might be being used to communicate between the nodes in a clustered system. Therefore, it is possible that this node error indicates the reason why the critical node error 550 A cluster cannot be formed because of a lack of cluster resources is reported on the node.

Explanation: A Fibre Channel adapter is degraded.

This node error does not, in itself, stop the node becoming active in the system. However, the Fibre Channel network might be being used to communicate between the nodes in a clustered system. Therefore, it is possible that this node error indicates the reason why the critical node error 550 A cluster cannot be formed because of a lack of cluster resources is reported on the node.

Data:
• A number indicating the adapter location. The location indicates an adapter slot. See the node description for the definition of the adapter slot locations.

User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. Use the remove and replace procedures to replace the adapter. If this does not fix the problem, replace the system board.

Possible Cause-FRUs or other cause:
• Fibre Channel adapter
• System board

704 Fewer Fibre Channel ports operational.

Explanation: A Fibre Channel port that was previously operational is no longer operational. The physical link is down.

This node error does not, in itself, stop the node becoming active in the system. However, the Fibre Channel network might be being used to communicate between the nodes in a clustered system. Therefore, it is possible that this node error indicates the reason why the critical node error 550 A cluster cannot be formed because of a lack of cluster resources is reported on the node.

Data:
Three numeric values are listed:
• The ID of the first unexpected inactive port. This ID is a decimal number.
• The ports that are expected to be active, which is a hexadecimal number. Each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is expected to be active.
• The ports that are actually active, which is a hexadecimal number. Each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is active.

User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. Possibilities:
• If the port has been intentionally disconnected, use the management GUI recommended action for the service error code and acknowledge the intended change.
• Check that the Fibre Channel cable is connected at both ends and is not damaged. If necessary, replace the cable.
• Check that the switch port or other device that the cable is connected to is powered and enabled in a compatible mode. Rectify any issue. The device service interface might indicate the issue.
• Use the remove and replace procedures to replace the SFP transceiver in the 2145 node and the SFP transceiver in the connected switch or device.
• Use the remove and replace procedures to replace the adapter.

Possible Cause-FRUs or other cause:
• Fibre Channel cable
• SFP transceiver
• Fibre Channel adapter

705 Fewer Fibre Channel I/O ports operational.

Explanation: One or more Fibre Channel I/O ports that have previously been active are now inactive. This situation has continued for one minute.

A Fibre Channel I/O port might be established on either a Fibre Channel platform port or an Ethernet platform port using FCoE. This error is expected if the associated Fibre Channel or Ethernet port is not operational.

Data:
Three numeric values are listed:
• The ID of the first unexpected inactive port. This ID is a decimal number.
• The ports that are expected to be active, which is a hexadecimal number. Each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is expected to be active.
• The ports that are actually active, which is a hexadecimal number. Each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is active.

User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. Follow the procedure for mapping I/O ports to platform ports to determine which platform port is providing this I/O port.
3. Check for any 704 (Fibre Channel platform port not operational) or 724 (Ethernet platform port not operational) node errors reported for the platform port.
4. Possibilities:
• If the port has been intentionally disconnected, use the management GUI recommended action for the service error code and acknowledge the intended change.
• Resolve the 704 or 724 error.
• If this is an FCoE connection, use the information the view gives about the Fibre Channel forwarder (FCF) to troubleshoot the connection between the port and the FCF.

Possible Cause-FRUs or other cause:
• None

706 Fibre Channel clustered system path failure.

Explanation: One or more Fibre Channel (FC) input/output (I/O) ports that have previously been able to see all required online nodes can no longer see them. This situation has continued for 5 minutes. This error is not reported unless a node is active in a clustered system.

A Fibre Channel I/O port might be established on either an FC platform port or an Ethernet platform port using Fibre Channel over Ethernet (FCoE).

Data:
Three numeric values are listed:
• The ID of the first FC I/O port that does not have connectivity. This is a decimal number.
• The ports that are expected to have connections. This is a hexadecimal number, and each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is expected to have a connection to all online nodes.
• The ports that actually have connections. This is a hexadecimal number, and each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port has a connection to all online nodes.

User response:
1. If possible, this noncritical node error should be serviced using the management GUI and running the recommended actions for the service error code.
2. Follow the procedure: Mapping I/O ports to platform ports to determine which platform port does not have connectivity.
3. There are a number of possibilities.
• If the port's connectivity has been intentionally reconfigured, use the management GUI recommended action for the service error code and acknowledge the intended change. You must have at least two I/O ports with connections to all other nodes.
• Resolve other node errors relating to this platform port or I/O port.
• Check that the SAN zoning is correct.

Possible Cause-FRUs or other cause:
• None

710 The SAS adapter that was previously present has not been detected.

Explanation: A SAS adapter that was previously present has not been detected. The adapter might not be correctly installed or it might have failed.

Data:
• A number indicating the adapter location. The location indicates an adapter slot. See the node description for the definition of the adapter slot locations.

User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. Possibilities:
• If the adapter has been intentionally removed, use the management GUI recommended actions for the service error code to acknowledge the change.
• Use the remove and replace procedures to remove and open the node and check that the adapter is fully installed.
• If the previous steps have not isolated the problem, use the remove and replace procedures to replace the adapter. If this does not fix the problem, replace the system board.

Possible Cause-FRUs or other cause:
• High-speed SAS adapter
• System board

711 A SAS adapter has failed.

Explanation: A SAS adapter has failed.

Data:
• A number indicating the adapter location. The location indicates an adapter slot. See the node description for the definition of the adapter slot locations.

User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. Use the remove and replace procedures to replace the adapter. If this does not fix the problem, replace the system board.

Possible Cause-FRUs or other cause:
• High-speed SAS adapter
• System board

712 A SAS adapter has a PCI error.

Explanation: A SAS adapter has a PCI error.

Data:

port, with the least significant bit representing port 1. The bit is 1 if the port is expected to be active.
• The ports that are actually active, which is a hexadecimal number. Each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is active.
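Several node errors in this section (for example 704, 705, and 706) report port sets as hexadecimal bitmasks in which the least significant bit represents port 1. The following Python sketch shows one way to decode such values by hand when reading the additional data of an error. The helper names `ports_from_mask` and `unexpectedly_inactive` are hypothetical and are not part of any SAN Volume Controller tool, and the sketch assumes that each mask is supplied as a plain hexadecimal string (with or without a `0x` prefix).

```python
def ports_from_mask(mask_hex: str) -> list[int]:
    """Return the port numbers whose bits are set in a hexadecimal port mask.

    Bit 0 (the least significant bit) represents port 1, bit 1 represents
    port 2, and so on.
    """
    mask = int(mask_hex, 16)  # accepts "b" as well as "0xb"
    ports = []
    port = 1
    while mask:
        if mask & 1:
            ports.append(port)
        mask >>= 1
        port += 1
    return ports


def unexpectedly_inactive(expected_hex: str, actual_hex: str) -> list[int]:
    """Return the ports that are expected to be active but are not active."""
    expected = int(expected_hex, 16)
    actual = int(actual_hex, 16)
    # Keep only the bits that are set in the expected mask but clear in the actual mask.
    missing = expected & ~actual
    return ports_from_mask(format(missing, "x"))


if __name__ == "__main__":
    # Hypothetical data values: ports 1-4 expected (f), ports 1, 2, and 4 active (b).
    print(ports_from_mask("f"))             # [1, 2, 3, 4]
    print(ports_from_mask("b"))             # [1, 2, 4]
    print(unexpectedly_inactive("f", "b"))  # [3]
```

In this hypothetical example, the lowest port returned by `unexpectedly_inactive` (port 3) would be expected to match the first data value, the decimal ID of the first unexpected inactive port; the sketch itself does not check that assumption.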
a compatible mode. Rectify any issue. The device service interface might indicate the issue.
d. If this is a 1 Gbps port, use the remove and replace procedures to replace the SFP transceiver in the SAN Volume Controller and the SFP transceiver in the connected switch or device.
e. Replace the adapter or the system board (depending on the port location) by using the remove and replace procedures.

Possible Cause-FRUs or other cause:
• Ethernet cable
• Ethernet SFP transceiver
• Ethernet adapter
• System board

because of a lack of cluster resources is reported on the node canister.

Data:
• A number indicating the adapter location. Location 0 indicates that the adapter integrated into the system board is being reported.

User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. As the adapter is located on the system board, replace the node canister using the remove and replace procedures.

Possible Cause-FRUs or other cause:
• Node canister
730 The bus adapter has not been detected.

Explanation: The bus adapter that connects the canister to the enclosure midplane has not been detected.

This node error does not, in itself, stop the node canister becoming active in the system. However, the bus might be being used to communicate between the node canisters in a clustered system. Therefore, it is possible that this node error indicates the reason why the critical node error 550 A cluster cannot be formed because of a lack of cluster resources is reported on the node canister.

Data:
• A number indicating the adapter location. Location 0 indicates that the adapter integrated into the system board is being reported.

User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. As the adapter is located on the system board, replace the node canister using the remove and replace procedures.

Possible Cause-FRUs or other cause:
• Node canister

732 The bus adapter has a PCI error.

Explanation: The bus adapter that connects the canister to the enclosure midplane has a PCI error.

This node error does not, in itself, stop the node canister becoming active in the system. However, the bus might be being used to communicate between the node canisters in a clustered system; therefore it is possible that this node error indicates the reason why the critical node error 550 A cluster cannot be formed because of a lack of cluster resources is reported on the node canister.

Data:
• A number indicating the adapter location. Location 0 indicates that the adapter integrated into the system board is being reported.

User response:
1. If possible, this noncritical node error should be serviced using the management GUI and running the recommended actions for the service error code.
2. As the adapter is located on the system board, replace the node canister using the remove and replace procedures.

Possible Cause-FRUs or other cause:
• Node canister
743 A boot drive is offline, missing, out of sync, or the persistent data is not usable.

Explanation: A boot drive is offline, missing, out of sync, or the persistent data is not usable.

User response: Look at a boot drive view to determine the problem.
1. If slot status is out of sync, then re-sync the boot drives by running the command satask chbootdrive.
2. If slot status is missing, then put the original drive back in this slot or install a FRU drive.
3. If slot status is failed, then replace the drive.

Possible Cause-FRUs or other:
• Boot drive

744 A boot drive is in the wrong location.

Explanation: A boot drive is in the wrong slot or comes from another SAN Volume Controller 2145-DH8 node.

User response: Look at a boot drive view to determine the problem.
1. Replace the boot drive with the correct drive and put this drive back in the node from which it came.
2. Sync the boot drive if you choose to use it in this node.

747 The Technician port is being used.

Explanation: The Technician port is active and being used.

User response: No service action is required. Use the workstation to configure the node.

766 CMOS battery failure.

Explanation: CMOS battery failure.

User response: Replace the CMOS battery.

Possible Cause-FRUs or other:
• CMOS battery

768 Ambient temperature warning.

Explanation: The ambient temperature of the node is close to the point where it stops performing I/O and enters a service state. The node is currently continuing to operate.

Data:
• A text string identifying the thermal sensor reporting the warning level and the current temperature in degrees (Celsius).

User response:
1. If possible, use the management GUI to run the Possible Cause-FRUs or other cause:
recommended actions for the associated service v CPU
error code.
2. Check the temperature of the room and correct any
775 Power supply problem.
air conditioning or ventilation problems.
3. Check the airflow around the system to make sure Explanation: A power supply has a fault condition.
no vents are blocked. User response: Replace the power supply.
769 CPU temperature warning. 776 Power supply mains cable unplugged.
Explanation: The temperature of the CPU within the Explanation: A power supply mains cable is not
node is close to the point where the node stops plugged in.
performing I/O and enters service state. The node is User response: Plug in power supply mains cable.
currently continuing to operate. This is most likely an
ambient temperature problem, but it might be a Possible Cause-FRUs or other:
hardware problem. v None
Data:
v A text string identifying the thermal sensor reporting 777 Power supply missing.
the warning level and the current temperature in
Explanation: A power supply is missing.
degrees (Celsius).
User response: Install power supply.
User response:
1. If possible, use the management GUI to run the Possible Cause-FRUs or other:
recommended actions for the associated service v Power supply
error code.
2. Check the temperature of the room and correct any 779 Battery is missing
air conditioning or ventilation problems.
Explanation: The battery is not installed in the system.
3. Check the airflow around the system. Ensure no
vents are blocked. User response: Install the battery.
4. Make sure the node fans are operational. You can power up the system without the battery
5. If the error is still reported, replace the nodes CPU. installed.
Possible Cause-FRUs or other:
Possible CauseFRUs or other cause:
v Battery (100%)
v CPU

This error is reported only if the battery subsystem cannot provide full protection.
An inability to charge is not reported if the combined charge available from all installed batteries can provide full protection at the current charge levels.
User response: No service action is required; use the console to manage the node.
Wait for the battery to warm up.

782 Battery is above the maximum operating temperature
Explanation: The battery cannot perform the required function because it is above the maximum operating temperature.
This error is reported only if the battery subsystem cannot provide full protection.
An inability to charge is not reported if the combined charge available from all installed batteries can provide full protection at the current charge levels.
User response: No service action is required; use the console to manage the node.
Wait for the battery to cool down.

783 Battery communications error
Explanation: A battery is installed, but communications via I2C are not functioning.
This might be either a fault in the battery unit or a fault in the battery backplane.
User response: No service action is required; use the console to manage the node.
Replace the battery. If the problem persists, conduct the corrective service procedure described in 1109 on page 205.

784 Battery is nearing end of life
Explanation: The battery is near the end of its useful life. You should replace it at the earliest convenient opportunity.
This might be either a fault in the battery unit or a fault in the battery backplane.
User response: No service action is required; use the console to manage the node.
Replace the battery.

785 Battery capacity is reduced because of cell imbalance
Explanation: The charge levels of the cells within the battery pack are out of balance.
Some cells become fully charged before others, which causes charging to terminate early, before the entire battery pack is fully charged. Ending recharging prematurely effectively reduces the available capacity of the pack.
Circuitry within the battery pack normally corrects such errors, but the correction can take tens of hours to complete.
If this error is not fixed after 24 hours, or if it recurs after fixing itself, it likely indicates a problem in the battery cells. In such a case, replace the battery pack.
User response: No service action is required; use the console to manage the node.
Wait for the cells to balance.

786 Battery VPD checksum error
Explanation: The checksum on the vital product data (VPD) stored in the battery EEPROM is incorrect.
User response: No service action is required; use the console to manage the node.
Replace the battery.

787 Battery is at a hardware revision level not supported by the current code level
Explanation: The battery currently installed is at a hardware revision level that is not supported by the current code level.
User response: No service action is required; use the console to manage the node.
Either update the code level to one that supports the currently installed battery or replace the battery with one that is supported by the current code level.

803 Fibre Channel adapter not working
Explanation: A problem has been detected on the node's Fibre Channel (FC) adapter. This node error is reported only on SAN Volume Controller 2145-CG8 or older nodes.
User response: Follow troubleshooting procedures to fix the hardware.

818 Unable to recover the service controller flash disk.
Explanation: A nonrecoverable error occurred when accessing the service controller persistent memory.
User response:
1. Restart the node and see if it recovers.
2. Replace the field replaceable units (FRUs) in the order listed.
Possible Cause-FRUs or other cause:
• Service controller
• Service controller cable

User response: Remove the USB flash drive from the port.

Note: This count includes both FC and Fibre Channel over Ethernet (FCoE) logins. The login count does not include masked ports.
When this event is logged, the cluster ID and node ID of the first node whose logins exceed this limit on the local node are reported, together with the WWNN of that node. If the logins change, the error is automatically fixed and, if appropriate, another error is logged. If the same node is still over the maximum allowed, the new error might or might not report that node in the sense data.
Data:
Text string showing:
• WWNN of the other node
• Cluster ID of other node
Enclosure (5%)

920 Unable to perform cluster recovery because of a lack of cluster resources.
Explanation: The node is looking for a quorum of resources which also require cluster recovery.
User response: Contact IBM technical support.

921 Unable to perform cluster recovery because of a lack of cluster resources.
Explanation: The node does not have sufficient connectivity to other nodes or quorum device to form a cluster. If a disaster has occurred and the nodes at the other site cannot be recovered, then it is possible to allow the nodes at the surviving site to form a system using local storage.
User response: Repair the fabric or quorum device to

1001 Automatic cluster recovery has run.
Explanation: All cluster configuration commands are blocked.
User response: Call your software support center.
Caution: You can unblock the configuration commands through the cluster GUI, but you must first consult with your software support to avoid corrupting your cluster configuration.
Possible Cause-FRUs or other:
• None

1009 DIMMs are incorrectly installed.
Explanation: DIMMs are incorrectly installed.
User response: Ensure that memory DIMMs are spread evenly across all memory channels.
1. Shut down the node.
2. Ensure that memory DIMMs are spread evenly across all memory channels.
3. Restart the node.
4. If the error persists, replace the system board.
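
Step 2 of the 1009 procedure can be expressed as a quick check over an inventory of DIMMs per memory channel. The following Python sketch is illustrative only. The channel names and counts are hypothetical, and the sketch assumes that "spread evenly" means every channel holds the same number of DIMMs. Confirm that reading against the memory-population rules for your node type.

```python
def dimms_evenly_spread(dimms_per_channel: dict[str, int]) -> bool:
    """Return True if every memory channel holds the same number of DIMMs.

    Assumption: "spread evenly" is interpreted as an equal DIMM count on
    every channel; check the population rules for your node type.
    """
    counts = set(dimms_per_channel.values())
    # An empty inventory, or channels with differing counts, is not "even".
    return len(counts) == 1


# Hypothetical inventories, keyed by channel name.
print(dimms_evenly_spread({"A": 2, "B": 2, "C": 2, "D": 2}))  # True
print(dimms_evenly_spread({"A": 2, "B": 2, "C": 1, "D": 1}))  # False
```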

1018 Fibre Channel adapter in slot 2 PCI fault.
Explanation: The Fibre Channel adapter in slot 2 is failing with a PCI fault.
User response:
1. In the sequence that is shown in the log, replace any failing FRUs with new FRUs.
2. Check node status:
  • If all nodes show a status of online, mark the error as fixed.
  • If any nodes do not show a status of online, go to the start MAP.
  • If you return to this step, contact your support center to resolve the problem with the node.
3. Go to the repair verification MAP.
Possible Cause, FRUs, or other:
• Dual port Fibre Channel host bus adapter - full height (80%)
• PCI riser card (10%)
• Other (10%)

Explanation: The cluster is reporting that a node is not operational because of critical node error 500. See the details of node error 500 for more information.
User response: See node error 500.

1022 The detected memory size does not match the expected memory size.
Explanation: The cluster is reporting that a node is not operational because of critical node error 510. See the details of node error 510 for more information.
User response: See node error 510.

1024 CPU is broken or missing.
Explanation: CPU is broken or missing.
User response: Review the node hardware using the svcinfo lsnodehw command on the node indicated by this event.
1. Shut down the node. Replace the CPU that is broken, as indicated by the light path and event data.
2. If the error persists, replace the system board.

User response: The action depends on the extra data that is provided with the node error and the light path diagnostics.
Possible Cause-FRUs or other:
• Variable

• Disk backplane (10%)
• Disk signal cable (8%)
• Disk power cable (1%)
• System board (1%)

1027 Unable to update BIOS settings.
Explanation: The cluster is reporting that a node is not operational because of critical node error 524. See the details of node error 524 for more information.
User response: See node error 524.

1028 System board service processor failed.
Explanation: System board service processor failed.
User response: Complete the following steps:
1. Shut down the node.
2. Remove the main power cable.
3. Wait for the lights to stop flashing.
4. Plug in the power cable.

1031 Node canister location unknown.
Explanation: Node canister location unknown.
User response: Complete the following steps to resolve this problem.
1. List all enclosure canisters for all control enclosures. Look for an online canister that does not have a node ID associated with it. This canister is the one with the problem.
2. Unplug the SAS cable from port 2 of the canister that is identified in step 1.
3. Run the command lsenclosurecanister, and see whether there is a node ID present. If step 2 fixes the error (a node ID is present), then something failed in one of the attached devices.
4. Reconnect the expansion enclosures and see whether the system is able to isolate the fault.
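
Step 1 of the 1031 procedure, finding an online canister with no node ID, can be scripted against saved CLI output. The Python sketch below is a hypothetical helper, not part of the product. It assumes the lsenclosurecanister output was captured with a colon delimiter (-delim :). It also assumes the header row names the columns enclosure_id, canister_id, status, and node_id. The sample rows are invented for illustration.

```python
def canisters_without_node_id(cli_output: str) -> list[tuple[str, str]]:
    """Return (enclosure_id, canister_id) for each online canister with no node ID.

    Assumes colon-delimited output whose first line is a header naming the
    enclosure_id, canister_id, status, and node_id columns.
    """
    lines = [line for line in cli_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split(":")
    col = {name: i for i, name in enumerate(header)}
    suspects = []
    for line in lines[1:]:
        fields = line.split(":")
        # An empty node_id field on an online canister marks the suspect.
        if fields[col["status"]] == "online" and not fields[col["node_id"]].strip():
            suspects.append((fields[col["enclosure_id"]], fields[col["canister_id"]]))
    return suspects


# Hypothetical captured output: canister 2 of enclosure 1 is online but has no node ID.
SAMPLE = """enclosure_id:canister_id:status:type:node_id:node_name
1:1:online:node:1:node1
1:2:online:node::
"""

print(canisters_without_node_id(SAMPLE))  # [('1', '2')]
```

Looking columns up by header name, rather than by fixed position, keeps the sketch working if the column order differs from the assumed sample.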

Explanation: A problem has been detected on the node's Fibre Channel (FC) adapter. This node error is reported only on SAN Volume Controller 2145-CG8 or older nodes.
User response: Follow troubleshooting procedures to fix the hardware.
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
Possible Cause-FRUs or other cause:
• None

Explanation: Canister failure, canister replacement required.
User response: Replace the canister.
A canister can be safely replaced while the system is in production. Make sure that the other canister is the active node before removing this canister. It is preferable that this canister shuts down completely before you remove it, but it is not required.
Possible Cause-FRUs or other:
Interface adapter (50%)
SFP (20%)
Canister (20%)
Internal interface adapter cable (10%)

1034 Canister fault type 2
Explanation: There is a canister internal error.
User response: Reseat the canister, and then replace the canister if the error continues.
Possible Cause-FRUs or other:
Canister (80%)

1040 A flash module error has occurred after a successful start of a 2145.
Explanation: Note: The node containing the flash module has not been rejected by the cluster.
Possible Cause, FRUs, or other:
• N/A

Explanation: The Fibre Channel adapter in PCI slot 1 is present but is failing.
User response:
1. In the sequence that is shown in the log, replace any failing FRUs with new FRUs.
2. Check node status:
  • If all nodes show a status of online, mark the error as fixed.
  • If any nodes do not show a status of online, go to the start MAP.
  • If you return to this step, contact your support center to resolve the problem with the node.
3. Go to the repair verification MAP.
Possible Cause, FRUs, or other:
• Fibre Channel host bus adapter (100%)

1057 Fibre Channel adapter (four-port) in slot 2 adapter is present but failing.
Explanation: The four-port Fibre Channel adapter in slot 2 is present but failing.
User response:
1. In the sequence that is shown in the log, replace any failing FRUs with new FRUs.
2. Check node status:
  • If all nodes show a status of online, mark the error as fixed.
  • If any nodes do not show a status of online, go to the start MAP.
  • If you return to this step, contact your support center to resolve the problem with the node.
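
The "Check node status" step that recurs throughout these procedures is a three-way decision. The following Python sketch encodes it for illustration only. The status strings and the returned_before flag are hypothetical inputs, not values read from the system. The flag records whether you have already come back to this step after working through the start MAP.

```python
from enum import Enum


class NextStep(Enum):
    """Possible outcomes of the "Check node status" step."""

    MARK_FIXED = "Mark the error as fixed."
    START_MAP = "Go to the start MAP."
    CONTACT_SUPPORT = "Contact your support center to resolve the problem with the node."


def check_node_status(node_statuses: dict[str, str], returned_before: bool) -> NextStep:
    """Decide the next action from a mapping of node name to reported status.

    returned_before is a hypothetical flag: True if you have already come back
    to this step after working through the start MAP.
    """
    if all(status == "online" for status in node_statuses.values()):
        return NextStep.MARK_FIXED
    if returned_before:
        return NextStep.CONTACT_SUPPORT
    return NextStep.START_MAP


# Hypothetical node names and statuses.
print(check_node_status({"node1": "online", "node2": "online"}, returned_before=False).value)
print(check_node_status({"node1": "online", "node2": "offline"}, returned_before=False).value)
```

The all-online check comes first, so a node that recovers on a return visit is still marked as fixed rather than escalated. In the documented procedures, marking the error as fixed is followed by the repair verification MAP.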

User response: Inspect the enclosure and the enclosure environment.
1. Check environmental temperature.
2. Ensure that all of the components are installed or that there are fillers in each bay.
3. Check that all of the fans are installed and operating properly.
4. Check for any obstructions to airflow, proper clearance for fresh inlet air, and exhaust air.
5. Handle any specific obstructed airflow errors that are related to the drive, the battery, and the power supply unit.
6. Bring the system back online. If the system performed a hard shutdown, the power must be removed and reapplied.
Possible Cause-FRUs or other:
Canister (2%)

Possible Cause-FRUs or other:
2145-CF8, 2145-DH8 or 2145-CG8
• Fan module (100%)

1090 One or more fans (40x40x28) are failing.
Explanation: One or more fans (40x40x28) are failing.
User response:
1. Determine the failing fans from the fan indicator on the system board or from the text of the error data in the log.
2. Verify that the cable between the fan backplane and the system board is connected:
  • If all fans on the fan backplane are failing
  • If no fan fault lights are illuminated
3. Exchange the FRU for a new FRU.
4. Go to repair verification MAP.

3. Exchange the FRU for a new FRU.
4. Go to repair verification MAP.
Possible Cause, FRUs, or other:
• N/A

online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the 2145.
3. Go to repair verification MAP.
For the 2145-DH8 and 2145-CF8 only:
1. Check for external air flow blockages.
2. Remove the top of the machine case and check for missing baffles, damaged heatsinks, or internal blockages.
3. If the problem persists after taking these measures, replace the CPU assembly FRU if 2145-DH8 or 2145-CF8.
Possible Cause-FRUs or other:
2145-CF8 or 2145-CG8
• Fan module (20%)
• System board (5%)
• Canister (5%)
Possible Cause-FRUs or other:
2145-DH8 or 2145-CF8
• CPU assembly (30%)
Other:

1092 The temperature soft or hard shutdown threshold of the 2072 has been exceeded. The 2072 has automatically powered off.
Explanation: The temperature soft or hard shutdown threshold of the 2072 has been exceeded. The 2072 has automatically powered off.
User response:
1. Ensure that the operating environment meets specifications.
2. Ensure that the airflow is not obstructed.
3. Ensure that the fans are operational.
4. Go to the light path diagnostic MAP and perform the light path diagnostic procedures.
5. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the 2145.
6. Go to the repair verification MAP.
2072 - Node Canister (100%)
• The FRU that is indicated by the Light path diagnostics (25%)
• System board (5%)
System environment (100%)

1094 The ambient temperature threshold has been exceeded.
Explanation: The ambient temperature threshold has been exceeded.
User response:
1. Ensure that the internal airflow of the node has not been obstructed.
2. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of

1095 Enclosure temperature has passed critical threshold.
Explanation: Enclosure temperature has passed critical threshold.
User response: Check for external and internal air flow blockages or damage.
1. Check environmental temperature.
2. Check for any impedance to airflow.
3. If the enclosure has shut down, then turn off both power switches on the enclosure and power both back on.

OFF,OFF,ON: The power supply has a fault. Replace the power supply.
OFF,ON,OFF: Ensure that the power supply is installed correctly. If the DC LED does not light, replace the power supply.
Possible Cause-FRUs or other:

User response: Check for external and internal air flow blockages or damage.
1. Check environmental temperature.
2. Check for any impedance to airflow.
Possible Cause-FRUs or other:
• None

1099 Temperature exceeded warning threshold
Explanation: Temperature exceeded warning threshold.
User response: Inspect the enclosure and the enclosure environment.
1. Check environmental temperature.
2. Ensure that all of the components are installed or that there are fillers in each bay.
3. Check that all of the fans are installed and operating properly.
4. Check for any obstructions to airflow, proper clearance for fresh inlet air, and exhaust air.
5. Wait for the component to cool.
Possible Cause-FRUs or other:

Explanation: One of the voltages that is monitored on the system board is over the set threshold.
User response:
1. See the light path diagnostic MAP.
2. If the light path diagnostic MAP does not resolve the issue, exchange the system board assembly.
3. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to start MAP. If you return to this step, contact your support center to resolve the problem with the 2145.
4. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145-CF8 or 2145-CG8
• Light path diagnostic MAP FRUs (98%)
• System board (2%)

1105 One of the voltages that is monitored on the system board is under the set threshold.
Explanation: One of the voltages that is monitored on the system board is under the set threshold.
4. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to start MAP. If you return to this step, contact your support center to resolve the problem with the 2145.
5. Go to repair verification MAP.

1106 One of the voltages that is monitored on the system board is under the set threshold.
Explanation: One of the voltages that is monitored on the system board is under the set threshold.
User response:
1. Check the cable connections.
2. See the light path diagnostic MAP.
3. If the light path diagnostic MAP does not resolve the issue, exchange the system board assembly.
4. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to start MAP. If you return to this step, contact your support center to resolve the problem with the 2145.
5. Go to repair verification MAP.
Possible Cause-FRUs or other:
2145-CF8 or 2145-CG8
• Light path diagnostic MAP FRUs (98%)
• System board (2%)

1107 The battery subsystem has insufficient capacity to save system data due to multiple faults.
Explanation: This message is an indication of other problems to solve before the system can successfully recharge the batteries.
User response: No service action is required for this error, but other errors must be fixed. Look at other indications to see if the batteries can recharge without being put into use.

1108 Battery backplane cabling faulty or possible battery backplane requires replacing.
Explanation: Faulty cabling or a faulty backplane is preventing the system from full communication with and control of the batteries.
User response: Check the cabling to the battery backplane, making sure that all the connectors are properly mated.
Four signal cables (EPOW, LPC, PWR_SENSE, and LED) and one power cable (which uses 12 red and 12 black heavy gauge wires) are involved:
• The EPOW cable runs to a 20-pin connector at the front of the system planar, which is the edge nearest the drive bays, near the left side.
  To check that this connector is mated properly, it is necessary to remove the plastic airflow baffle, which lifts up.
  A number of wires run from the same connector to the disk backplane located to the left of the battery backplane.
• The LPC cable runs to a small adapter that is plugged into the back of the system planar between two PCI Express adapter cages. It is helpful to remove the left adapter cage when checking that these connectors are mated properly.
• The PWR_SENSE cable runs to a 24-pin connector at the back of the system planar between the PSUs and the left adapter cage. Check the connections of both a female connector (to the system planar) and a male connector (to the connector from the top PSU). Again, it can be helpful to remove the left adapter cage to check the proper mating of the connectors.
• The power cable runs to the system planar between the PSUs and the left adapter cage. It is located just in front of the PWR_SENSE connector. This cable has both a female connector that connects to the system planar, and a male connector that mates with the connector from the top PSU. Due to the bulk of this cable, take care not to disturb the PWR_SENSE connections when dressing it away in the space between the PSUs and the left adapter cage.
• The LED cable runs to a small PCB on the front bezel. The only consequence of this cable not being mated correctly is that the LEDs do not work.
If no problems exist, replace the battery backplane as described in the service action for 1109.
You do not replace either battery at this time.
To verify that the battery backplane works after replacing it, check that the node error is fixed.
Possible Cause-FRUs or other:
• Battery backplane (50%)

1109 Battery or possibly battery backplane requires replacing.
Explanation: Battery or possibly battery backplane requires replacing.
User response: Complete the following steps:
1. Replace the drive bay battery.
2. Check to see whether the node error is fixed. If not, replace the battery backplane.
3. To verify that the new battery backplane is working correctly, check that the node error is fixed.
Possible Cause-FRUs or other:
• Drive bay battery (95%)
• Battery backplane (5%)

1110 The power management board detected a voltage that is outside of the set thresholds.
Explanation: The power management board detected a voltage that is outside of the set thresholds.
User response:
1. In the sequence that is shown in the log, replace any failing FRUs with new FRUs.
2. Check node status:
Possible Cause, FRUs, or other:
2145-CG8 or 2145-CF8
• Power supply unit (50%)
• System board (50%)
Other:
Bad connection (5%)

If both batteries have errors, battery charging might be underway. (No FRU)
If both batteries have errors that do not resolve after a sufficient time to charge, battery charging might be impaired, such as by a faulty battery backplane FRU.
Communication errors are often correctable by reseating the battery or by allowing the temperature of the battery to cool without the need to replace the battery. (No FRU)
If a battery is missing or failed, the solution is to replace the battery FRU.
Battery (50%)
Other:
Attention: Do not reseat a battery unless the other battery has enough charge, or data loss might occur.
Possible Cause-FRUs or other:
Battery (95%)

1120 A high speed SAS adapter is missing.
Explanation: This node has detected that a high speed SAS adapter that was previously installed is no longer present.
Otherwise, the high speed SAS adapter has failed and must be replaced. In the sequence shown, exchange the FRUs for new FRUs.
Go to the repair verification MAP.
Possible Cause-FRUs or other:
1. High speed SAS adapter (90%)
2. System board (10%)

1121 A high speed SAS adapter has failed.
Explanation: A fault has been detected on a high speed SAS adapter.
User response: In the sequence shown, exchange the FRUs for new FRUs.
Go to the repair verification MAP.
Possible Cause-FRUs or other:
1. High speed SAS adapter (90%)
2. System board (10%)

1122 A high speed SAS adapter error has occurred.
Explanation: The high speed SAS adapter has detected a PCI bus error and requires service before it can be restarted. The high speed SAS adapter failure has caused all of the flash drives that were being accessed through this adapter to go Offline.
User response: If this is the first time that this error has occurred on this node, complete the following steps:
1. Power off the node.
2. Reseat the high speed SAS adapter.
3. Power on the node.
4. Submit the lsmdisk task and ensure that all of the flash drive managed disks that are located in this node have a status of Online.
If the sequence of actions above has not resolved the problem or the error occurs again on the same node, complete the following steps:
2. Submit the lsmdisk task and ensure that all of the flash drive managed disks that are located in this node have a status of Online.
3. Go to the repair verification MAP.
Possible Cause-FRUs or other:
1. High speed SAS adapter (90%)
2. System board (10%)

Explanation: A fault has been detected on a power supply unit (PSU).
User response: Replace the PSU.
Attention: To avoid losing state and data from the node, use the satask startservice command to put the node into service state so that it no longer processes I/O. Then you can remove and replace the top power supply unit (PSU 2). This precaution is due to a limitation in the power-supply configuration. Once the service action is complete, run the satask stopservice command to let the node rejoin the system.
Possible Cause-FRUs or other:
PSU (100%)

1125 Power Supply Unit fault type 1
Explanation: The power supply unit (PSU) is not supported.
User response: Replace the PSU with a supported version.
Attention: To avoid losing state and data from the node, use the satask startservice command to put the node into service state so that it no longer processes I/O. Then you can remove and replace the top power supply unit (PSU 2). This precaution is due to a limitation in the power-supply configuration. Once the service action is complete, run the satask stopservice command to let the node rejoin the system.
Possible Cause-FRUs or other:
PSU (100%)

1126 Power Supply Unit fault type 2
Explanation: A fault exists on the power supply unit (PSU).
User response:
1. Reseat the PSU in the enclosure.
Attention: To avoid losing state and data from the node, use the satask startservice command to put the node into service state so that it no longer processes I/O. Then you can remove and replace the top power supply unit (PSU 2). This precaution is due to a limitation in the power-supply configuration. Once the service action is complete, run the satask stopservice command to let the node rejoin the system.
2. If the fault is not resolved, replace the PSU.
Possible Cause-FRUs or other:
1. No Part (30%)
2. PSU (70%)

1128 Power Supply Unit missing
Explanation: The power supply unit (PSU) is not seated in the enclosure, or no PSU is installed.
User response:
1. If no PSU is installed, install a PSU.
2. If a PSU is installed, reseat the PSU in the enclosure.
Attention: To avoid losing state and data from the node, use the satask startservice command to put the node into service state so that it no longer processes I/O. Then you can remove and replace the top power supply unit (PSU 2). This precaution is due to a limitation in the power-supply configuration. Once the service action is complete, run the satask stopservice command to let the node rejoin the system.
Possible Cause-FRUs or other:
1. No Part (5%)
2. PSU (95%)

Reseat the power supply unit in the enclosure.
Possible Cause-FRUs or other:
Power supply (100%)

1129 The node battery is missing.
Explanation: Install new batteries to enable the node to join a clustered system.
User response: Install a battery in battery slot 1 (on the left from the front) and in battery slot 2 (on the right). Leave the node running as you add the batteries. To verify that the new battery works correctly, check that the node error is fixed. After the node joins a clustered system, use the lsnodebattery command to view information about the battery.
Possible Cause-FRUs or other:
• Battery (100%)

1130 The node battery requires replacing.
Explanation: This message indicates that a battery must be replaced. The proper response is to install new batteries.
User response: Battery 1 is on the left (from the front), and battery 2 is on the right. Remove the old battery by disengaging and pulling down the cam handle to lever out the battery enough to pull the battery from the enclosure.
This service procedure is intended for a failed or offline battery. To prevent losing data from a battery that is online, run the svctask chnodebattery -remove -battery battery_ID node_ID command. Running the command verifies when it is safe to remove the battery.
Install new batteries in battery slot 1 and in battery slot 2. Leave the node running as you add the batteries. Align each battery so that the guide rails in the enclosure engage the guide rail slots on the battery. Push the battery firmly into the battery bay until it stops. The cam on the front of the battery remains closed during this installation.
To verify that the new battery works correctly, check that the node error is fixed. After the node joins a clustered system, use the lsnodebattery command to view information about the battery.

1131 Battery conditioning is required but not possible.
Explanation: Battery conditioning is required but not possible.
One possible cause of this error is that battery reconditioning cannot occur in a clustered node if the partner node is not online.
User response: This error can correct itself. For example, if the partner node comes online, the reconditioning begins.
Possible Cause-FRUs or other:
Other:
Wait, or address other errors.
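
Several battery entries depend on whether the installed batteries can still provide full protection. Error 782 notes that an inability to charge is not reported when the combined charge of all installed batteries can provide full protection. The reseat warning adds that a battery should not be reseated unless the other battery has enough charge. The Python sketch below models those two checks. It assumes charge is given in arbitrary units and that the full-protection requirement is a single known value; this guide specifies neither. It also reads "enough charge" as meaning that the remaining batteries alone meet that requirement, which is likewise an assumption.

```python
def full_protection_available(charges: list[float], required: float) -> bool:
    """True if the combined charge of all installed batteries meets the requirement."""
    return sum(charges) >= required


def should_report_charging_problem(charges: list[float], required: float) -> bool:
    """An inability to charge is reported only when full protection is not available."""
    return not full_protection_available(charges, required)


def safe_to_reseat(battery: int, charges: list[float], required: float) -> bool:
    """Check whether the batteries other than `battery` (zero-based index) still suffice.

    Assumption: "enough charge" means the remaining batteries alone can
    provide full protection.
    """
    others = charges[:battery] + charges[battery + 1:]
    return full_protection_available(others, required)


# Hypothetical two-battery node with a full-protection requirement of 100 units.
print(should_report_charging_problem([60.0, 50.0], 100.0))  # False: 110 units combined
print(safe_to_reseat(0, [60.0, 50.0], 100.0))  # False: the remaining battery holds only 50
```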
1141 The 2145 UPS-1U has reported that it has a problem with the input AC power.
Explanation: The 2145 UPS-1U has reported that it has a problem with the input AC power.
User response:
1. Check whether the input AC power is missing or out of specification, and correct it if necessary. Otherwise, exchange the FRU for a new FRU.
2. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the uninterruptible power supply.
3. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- 2145 UPS-1U input power cable (10%)
- 2145 UPS-1U assembly (10%)
Other:
- The input AC power is missing (40%)
- The input AC power is not in specification (40%)

1151 Data that the 2145 has received from the 2145 UPS-1U suggests the 2145 UPS-1U power cable, the signal cable, or both, are not connected correctly.
Explanation: Data that the 2145 has received from the 2145 UPS-1U suggests the 2145 UPS-1U power cable, the signal cable, or both, are not connected correctly.
User response:
1. Connect the cables correctly. See your product's installation guide.
2. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the uninterruptible power supply.
3. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
- Configuration error

1152 Incorrect type of uninterruptible power supply detected.
Explanation: The cluster is reporting that a node is not operational because of critical node error 587. See the details of node error 587 for more information.
User response: See node error 587.

1146 The signal connection between a 2145 and its 2145 UPS-1U is failing.
Explanation: The signal connection between a node and its UPS is failing.
User response:
1. In the sequence that is shown in the log, replace any failing FRUs with new FRUs.
2. Check node status:
   - If all nodes show a status of online, mark the error as fixed.
   - If any nodes do not show a status of online, go to the start MAP.
   - If you return to this step, contact your support center to resolve the problem with the node.
3. Go to the repair verification MAP.
Possible Cause-FRUs or other:
2145-CF8 or 2145-CG8
- Power cable assembly (40%)
- 2145 UPS-1U assembly (30%)
- System board (30%)
2145-CF8 or 2145-CG8
N/A

1. Determine the 2145 UPS that is reporting the error from the error event data. Perform the following steps on just this uninterruptible power supply.
2. Check that the 2145 UPS is still reporting the error. If the power overload warning LED is no longer on, go to step 6.
3. Ensure that only 2145s are receiving power from the uninterruptible power supply. Ensure that there are no switches or disk controllers that are connected to the 2145 UPS.
4. Disconnect the input power of each connected 2145 in turn until the output overload is removed.
5. Exchange the FRUs for new FRUs in the sequence shown, on the overcurrent 2145.
6. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the 2145 UPS.
7. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- Power cable assembly (50%)
- Power supply assembly (40%)
- 2145 UPS electronics assembly (10%)
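Many user responses in this section repeat the same node-status check, for example step 2 of 1141 and step 6 of the overload procedure above. The following Python sketch encodes one reading of that check as a decision function. The `Action` names, the "online" status string, and the `returned_to_step` flag are illustrative assumptions. They are not part of any IBM tool or API, and the function does not query a real system.

```python
from enum import Enum
from typing import Mapping


class Action(Enum):
    MARK_FIXED = "Mark the error that you have just repaired as fixed"
    START_MAP = "Go to the start MAP"
    CONTACT_SUPPORT = "Contact your support center"


def next_action(node_statuses: Mapping[str, str], returned_to_step: bool = False) -> Action:
    """Decide the next step of a 'Check node status' instruction.

    node_statuses: hypothetical mapping of node name -> status, e.g. "online".
    returned_to_step: True if you reached this step again after the start MAP.
    """
    if not node_statuses:
        raise ValueError("at least one node status is required")
    # Every node is online, so the repair worked.
    if all(status == "online" for status in node_statuses.values()):
        return Action.MARK_FIXED
    # A node is still not online after a pass through the start MAP.
    if returned_to_step:
        return Action.CONTACT_SUPPORT
    # First visit with at least one node that is not online.
    return Action.START_MAP
```

Here, "If you return to this step" is read as a second visit on which nodes are still not online. If every node is online on a second visit, the sketch still returns `MARK_FIXED`.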
1180 2145 UPS battery fault (reported by 2145 UPS alarm bits).

1181 2145 UPS-1U battery fault (reported by 2145 UPS-1U alarm bits).
Explanation: 2145 UPS-1U battery fault (reported by 2145 UPS-1U alarm bits).
User response:
1. Replace the 2145 UPS-1U battery assembly.
2. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the uninterruptible power supply.
3. Go to the repair verification MAP.

Explanation: 2145 UPS fault, with no specific FRU identified (reported by 2145 UPS alarm bits).

1186 A problem has occurred in the 2145 UPS-1U, with no specific FRU identified (reported by 2145 UPS-1U alarm bits).
Explanation: A problem has occurred in the 2145 UPS-1U, with no specific FRU identified (reported by 2145 UPS-1U alarm bits).
User response:
1. In the sequence shown, exchange the FRU for a new FRU.
2. Check node status. If all nodes show a status of online, mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to the start MAP. If you return to this step,
for joining the cluster. The cluster attempts to add the node into the cluster but does not succeed. After 15 minutes, the cluster makes a second attempt to add the node into the cluster and again does not succeed. After another 15 minutes, the cluster makes a third attempt to add the node into the cluster and again does not succeed. After another 15 minutes, the cluster logs error code 1194. The node never came online during the attempts to add it to the cluster.
2. The node has failed without saving all of its state data. The node has restarted, possibly after a repair, and shows node error 578 and is a candidate node for joining the cluster. The cluster attempts to add the node into the cluster and succeeds, and the node becomes online. Within 24 hours the node fails again without saving its state data. The node restarts and shows node error 578 and is a candidate node for joining the cluster. The cluster again attempts to add the node into the cluster, succeeds, and the node becomes online; however, the node again fails within 24 hours. The cluster attempts a third time to add the node into the cluster, succeeds, and the node becomes online; however, the node again fails within 24 hours. After another 15 minutes, the cluster logs error code 1194.

A combination of these scenarios is also possible.

Note: If the node is manually removed from the cluster, the count of automatic recovery attempts is reset to zero.

User response:
1. If the node has been continuously online in the cluster for more than 24 hours, mark the error as fixed and go to the Repair Verification MAP.
2. Determine the history of events for this node by locating events for this node name in the event log. Note that the node ID will change, so match on the WWNN and node name. Also, check the service records. Specifically, note entries indicating one of three events: 1) the node is missing from the cluster (cluster error 1195 event 009052), 2) an attempt to automatically recover the offline node is starting (event 980352), 3) the node has been added to the cluster (event 980349).
3. If the node has not been added to the cluster since the recovery process started, there is probably a hardware problem. The node's internal disk might be failing in such a way that it cannot modify its software level to match the software level of the cluster. If you have not yet determined the root cause of the problem, you can attempt to manually remove the node from the cluster and add the node back into the cluster. Continuously monitor the status of the nodes in the cluster while the cluster is attempting to add the node. Note: If the node type is not supported by the software version of the cluster, the node will not appear as a candidate node. Therefore, incompatible hardware is not a potential root cause of this error.
4. If the node was added to the cluster but failed again before it had been online for 24 hours, investigate the root cause of the failure. If no events in the event log indicate the reason for the node failure, collect dumps and contact IBM technical support for assistance.
5. When you have fixed the problem with the node, you must use either the cluster console or the command-line interface to manually remove the node from the cluster and add the node into the cluster.
6. Mark the error as fixed and go to the repair verification MAP.
Possible Cause-FRUs or other:
None, although investigation might indicate a hardware failure.

1195 Node missing.
Explanation: You can resolve this problem by repairing the failure on the missing 3700.
User response:
1. If it is not obvious which node in the cluster has failed, check the status of the nodes and find the 3700 with a status of offline.
2. Go to the Start MAP and perform the repair on the failing node.
3. When the repair has been completed, this error is automatically marked as fixed.
4. Check node status. If all nodes show a status of online, but the error in the log has not been marked as fixed, manually mark the error that you have just repaired as fixed. If any nodes do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the 3700.
5. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None

1198 Detected hardware is not a valid configuration.
Explanation: A hardware change was made to this node that is not supported by its software. Either a hardware component failed, or the node was incorrectly upgraded.
User response: Complete the following steps:
1. If required, power the node off for servicing.
2. If new hardware is correctly installed, but it is listed as an invalid configuration, then update the
Explanation: Boot drive is in the wrong slot.
User response: Complete the following steps:
1. Look at a boot drive view to determine which drive is in the wrong slot, which node and slot it belongs in, and which drive must be in this slot.
2. Swap the drive for the correct one, but shut down the node first if booted yes is shown for that drive in the boot drive view.
3. If you want to use the drive in this node, synchronize the boot drives by running the commands svctask chnodebootdrive -sync and/or satask chbootdrive -sync.
4. The node error clears, or a new node error is displayed for you to work on.
Possible Cause-FRUs or other:
- None

1215 A flash drive is failing.
Explanation: The flash drive has detected faults that indicate that the drive is likely to fail soon. The drive should be replaced. The cluster event log will identify a drive ID for the flash drive that caused the error.

3. Ensure that there are no obstructions to air flow for the node.
4. Mark the error as fixed. If the error recurs, contact hardware support for further investigation.
Possible Cause-FRUs or other:
- Flash drive (10%)
Other:
- System environment or airflow blockage (90%)

1220 A remote Fibre Channel port has been excluded.
Explanation: A remote Fibre Channel port has been excluded.
User response:
1. View the event log. Note the MDisk ID associated with the error code.
2. From the MDisk, determine the failing disk controller ID.
3. Refer to the service documentation for the disk controller and the Fibre Channel network to resolve the reported problem.
4. After the disk drive is repaired, start a cluster discovery operation to recover the excluded Fibre Channel port by rescanning the Fibre Channel network.
5. To restore MDisk online status, include the managed disk that you noted in step 1.
6. Check the status of the disk controller. If all disk controllers show a good status, mark the error that you have just repaired as fixed.
7. If any disk controller does not show a good status, contact your support center to resolve the problem with the disk controller.
8. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
- Enclosure/controller fault (50%)
- Fibre Channel network fabric (50%)

1230 A login has been excluded.
Explanation: A port-to-port fabric connection, or login, between the cluster node and either a controller or another cluster has had excessive errors. The login has therefore been excluded, and will not be used for I/O operations.
User response: Determine the remote system, which might be either a controller or a cluster. Check the event log for other 1230 errors. Ensure that all higher priority errors are fixed.
This error event is usually caused by a fabric problem. If possible, use the fabric switch or other fabric diagnostic tools to determine which link or port is reporting the errors. If there are error events for links from this node to a number of different controllers or clusters, then it is probably the node-to-switch link that is causing the errors. Unless there are other contrary indications, first replace the cable between the switch and the remote system.
1. From the fabric analysis, determine the FRU that is most likely causing the error. If this FRU has recently been replaced while resolving a 1230 error, choose the next most likely FRU that has not been replaced recently. Exchange the FRU for a new FRU.
2. Mark the error as fixed. If the FRU replacement has not fixed the problem, the error will be logged again; however, depending on the severity of the problem, the error might not be logged again immediately.
3. Start a cluster discovery operation to recover the login by rescanning the Fibre Channel network.
4. Check the status of the disk controller or remote cluster. If the status is not good, go to the Start MAP.
5. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- Fibre Channel cable, switch to remote port (30%)
- Switch or remote device SFP connector or adapter (30%)
- Fibre Channel cable, local port to switch (30%)
- Cluster SFP connector (9%)
- Cluster Fibre Channel adapter (1%)
Note: The first two FRUs are not cluster FRUs.

1260 SAS cable fault type 2.
Explanation: SAS cable fault type 2.
User response: Complete the following steps:
Note: After each action, check whether the canister ports at both ends of the cable are excluded. If the ports are excluded, enable them by issuing the following command:
chenclosurecanister -excludesasport no -port X
1. Reset this canister and the upstream canister. The upstream canister is identified in sense data as enclosureid2, faultobjectlocation2...
2. Reseat the cable between the two ports that are identified in the sense data.
3. Replace the cable between the two ports that are identified in the sense data.
4. Replace this canister.
5. Replace the other canister (enclosureid2).
Possible Cause-FRUs or other:
- SAS cable
- Canister

1298 A node has encountered an error updating.
Explanation: One or more nodes have failed the update.
User response: Check lsupdate for the node that failed and continue troubleshooting with the error code it provides.

1310 A managed disk is reporting excessive errors.
Explanation: A managed disk is reporting excessive errors.
User response:
1. Repair the enclosure/controller fault.
2. Check the managed disk status. If all managed disks show a status of online, mark the error that you have just repaired as fixed. If any managed disks show a status of excluded, include the excluded managed disks and then mark the error as fixed.
3. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:

new volume can be kept and mirrored, or the original volume can be repaired and the data copied back again.
4. Check managed disk status. If all managed disks show a status of online, mark the error that you have just repaired as fixed. If any managed disks do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the disk controller.
5. Go to the repair verification MAP.
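Step 1 of error 1230 describes a simple selection rule: exchange the most likely FRU, but skip any FRU that was replaced recently while resolving an earlier 1230 error. The Python sketch below shows that rule. It assumes the ranking comes either from your fabric analysis or from the "Possible Cause-FRUs" list for 1230. Because the first three entries in that list tie at 30%, their list order serves as the tie-breaker, which is an assumption. `FRUS_1230` and `choose_fru` are hypothetical names.

```python
from typing import Iterable, Optional, Sequence

# Candidate FRUs for error 1230, in the order listed under
# "Possible Cause-FRUs or other" (most likely first; ties keep list order).
FRUS_1230 = [
    "Fibre Channel cable, switch to remote port",
    "Switch or remote device SFP connector or adapter",
    "Fibre Channel cable, local port to switch",
    "Cluster SFP connector",
    "Cluster Fibre Channel adapter",
]


def choose_fru(ranked_frus: Sequence[str], recently_replaced: Iterable[str]) -> Optional[str]:
    """Return the most likely FRU that has not been replaced recently.

    ranked_frus: FRUs ordered from most to least likely cause.
    recently_replaced: FRUs already exchanged while resolving earlier 1230 errors.
    Returns None when every candidate has already been replaced.
    """
    replaced = set(recently_replaced)
    for fru in ranked_frus:
        if fru not in replaced:
            return fru
    return None
```

For example, if the switch-to-remote-port cable has already been exchanged, `choose_fru(FRUS_1230, [FRUS_1230[0]])` returns "Switch or remote device SFP connector or adapter".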
when there are managed disks or image mode disks but no quorum disks.
To become a quorum disk:
- The MDisk must be accessible by all nodes in the cluster.
- The MDisk must be managed; that is, it must be a member of a storage pool.
- The MDisk must have free extents.
- The MDisk must be associated with a controller that is enabled for quorum support. If the controller has multiple WWNNs, all of the controller components must be enabled for quorum support.

4. Check the managed disk status. If the managed disk identified in step 1 shows a status of online, mark the error that you have just repaired as fixed. If the managed disk does not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the disk controller.
5. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
6. Reset the IB interface adapter; reset the canister; reboot the system.
Possible Cause-FRUs or other:
External (cable, HCA, switch, and so on) (85%)
Interface (10%)
Canister (5%)

1360 A SAN transport error occurred.
Explanation: This error has been reported because the 2145 performed error recovery procedures in response to SAN component associated transport errors. The problem is probably caused by a failure of a component of the SAN.
User response:
1. View the event log entry to determine the node that logged the problem. Determine the 2145 node or controller that the problem was logged against.
2. Perform Fibre Channel switch problem determination and repair procedures for the switches connected to the 2145 node or controller.
3. Perform Fibre Channel cabling problem determination and repair procedures for the cables connected to the 2145 node or controller.
4. If any problems are found and resolved in steps 2 and 3, mark this error as fixed.
5. If no switch or cable failures were found in steps 2 and 3, take an event log dump. Call your hardware support center.
6. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
- Fibre Channel switch
- Fibre Channel cabling

1370 A managed disk error recovery procedure (ERP) has occurred.
Explanation: This error was reported because a large number of disk error recovery procedures have been performed by the disk controller. The problem is probably caused by a failure of some other component on the SAN.
User response:
1. View the event log entry and determine the managed disk that was being accessed when the problem was detected.
2. Perform the disk controller problem determination and repair procedures for the MDisk determined in step 1.
3. Perform problem determination and repair procedures for the Fibre Channel switches connected to the 2145 and any other Fibre Channel network components.
4. If any problems are found and resolved in steps 2 and 3, mark this error as fixed.
5. If no switch or disk controller failures were found in steps 2 and 3, take an event log dump. Call your hardware support center.
6. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
- Enclosure/controller fault
- Fibre Channel switch

1400 The 2145 cannot detect an Ethernet connection.
Explanation: The 2145 cannot detect an Ethernet connection.
User response:
1. Go to the Ethernet MAP.
2. Go to the repair verification MAP.
Possible Cause-FRUs or other:
2145-CF8 or 2145-CG8
- Ethernet cable (25%)
- System board (25%)
Other:
- Ethernet cable is disconnected or damaged (25%)
- Ethernet hub fault (25%)

1403 External port not operational.
Explanation: If this error occurs when a port was initially online and subsequently went offline, it indicates that:
- the server, HBA, CNA, or switch has been turned off, or
- there is a physical issue.
If this error occurs during an initial setup or during a setup change, it is most likely a configuration issue rather than a physical issue.
User response:
1. Reset the port via the CLI command Maintenance. If the port is now online, the DMP is complete.
2. If the port is connected to a switch, check the switch to make sure the port is not disabled. Check the switch vendor troubleshooting documentation for other possibilities. If the port is now online, the DMP is complete.
3. Reseat the cable. This includes plugging in the cable and SFP if not already done. If the port is now online, the DMP is complete.
4. Reseat the hot-swap SFPs (optics modules). If the port is now online, the DMP is complete.
5. Try using a new cable.
6. Try using a new SFP.
7. Try using a new port on the switch.
Note: Continuing from here will affect other ports connected on the adapter.
8. Reset the adapter.
9. Reset the canister.

1450 Fewer Fibre Channel I/O ports operational.
Explanation: One or more Fibre Channel I/O ports that have previously been active are now inactive. This situation has continued for one minute.
A Fibre Channel I/O port might be established on either a Fibre Channel platform port or an Ethernet platform port using FCoE. This error is expected if the associated Fibre Channel or Ethernet port is not operational.
Data:
Three numeric values are listed:
- The ID of the first unexpected inactive port. This ID is a decimal number.
- The ports that are expected to be active, as a hexadecimal number. Each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is expected to be active.
- The ports that are actually active, as a hexadecimal number. Each bit position represents a port, with the least significant bit representing port 1. The bit is 1 if the port is active.
User response:
1. If possible, use the management GUI to run the recommended actions for the associated service error code.
2. Follow the procedure for mapping I/O ports to platform ports to determine which platform port is providing this I/O port.
3. Check for any 704 (Fibre Channel platform port not operational) or 724 (Ethernet platform port not operational) node errors reported for the platform port.
4. Possibilities:
   - If the port has been intentionally disconnected, use the management GUI recommended action for the service error code and acknowledge the intended change.
   - Resolve the 704 or 724 error.
   - If this is an FCoE connection, use the information the view gives about the Fibre Channel forwarder (FCF) to troubleshoot the connection between the port and the FCF.
Possible Cause-FRUs or other:
- None

1471 Interface card is unsupported.
Explanation: Interface adapter is unsupported.
User response: Replace the wrong interface adapter with the correct type.
Possible Cause-FRUs or other:
Interface adapter (100%)

1472 Boot drive is in an unsupported slot.
Explanation: Boot drive is in an unsupported slot.
User response: Complete the following steps:
1. Look at a boot drive view to determine which drive is in an unsupported slot.
2. Move the drive back to its correct node and slot, but shut down the node first if booted yes is shown for that drive in the boot drive view.
3. The node error clears, or a new node error is displayed for you to work on.
Possible Cause-FRUs or other:
- None

1473 The installed battery is at a hardware revision level that is not supported by the current code level.
Explanation: The installed battery is at a hardware revision level that is not supported by the current code level.
User response: To replace the battery with one that is supported by the current code level, follow the service action for 1130 on page 208. To update the code level to one that supports the currently installed battery, perform a service mode code update. Always install the latest level of the system software to avoid problems with upgrades and component compatibility.
Possible Cause-FRUs or other:
- Battery (50%)
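The Data field of error 1450 packs port states into two hexadecimal bitmasks, where the least significant bit is port 1. The Python sketch below decodes the three values and lists the ports that are expected to be active but are not. Those ports are the ones to trace through steps 2 to 4 of the 1450 user response. The exact text formatting of the values in the event, for example whether a "0x" prefix appears, is an assumption. The function names are hypothetical.

```python
from typing import Dict, List


def ports_from_mask(mask_hex: str) -> List[int]:
    """Return port numbers whose bits are set; the least significant bit is port 1."""
    mask = int(mask_hex, 16)  # accepts "b" as well as "0xb"
    ports = []
    port = 1
    while mask:
        if mask & 1:
            ports.append(port)
        mask >>= 1
        port += 1
    return ports


def decode_1450(first_inactive: str, expected_hex: str, active_hex: str) -> Dict[str, object]:
    """Decode the three Data values of error 1450."""
    expected = set(ports_from_mask(expected_hex))
    active = set(ports_from_mask(active_hex))
    return {
        "first_unexpected_inactive_port": int(first_inactive),  # decimal ID
        "expected_active": sorted(expected),
        "actually_active": sorted(active),
        # Ports that should be active but are not.
        "expected_but_inactive": sorted(expected - active),
    }
```

For example, Data values of 3, f, and b decode to expected ports 1 to 4 and active ports 1, 2, and 4, so port 3 is the inactive port to investigate.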
1474 Battery is nearing end of life.
Explanation: When a battery nears the end of its life, you must replace it if you intend to preserve the capacity to fail over power to batteries.
User response: Replace the battery by following this procedure as soon as you can.
If the node is in a clustered system, ensure that the battery is not being relied upon to provide data protection before you remove it. Issue the chnodebattery -remove -battery battery_ID node_ID command to confirm that the battery is no longer relied on.
If the command fails with the error "The command has failed because the specified battery is offline" (BATTERY_OFFLINE), replace the battery immediately.
If the command fails with the error "The command has failed because the specified battery is not redundant" (BATTERY_NOT_REDUNDANT), do not remove the relied-on battery. Removing the battery compromises data protection.

1550 A cluster path has failed.
Explanation: One of the V3700 Fibre Channel ports is unable to communicate with all the other V3700s in the cluster.
User response:
1. Check for incorrect switch zoning.
2. Repair the fault in the Fibre Channel network fabric.
3. Check the status of the node ports that are not excluded via the system's local port mask. If the status of the node ports shows as active, mark the error that you have repaired as fixed. If any node ports do not show a status of active, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the V3700.
4. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
Fibre Channel network fabric fault (100%)

1600 Mirrored disk repair halted because of difference.
Explanation: During the repair of a mirrored volume, two copy disks were found to contain different data for the same logical block address (LBA). The validate option was used, so the repair process has halted.
Read operations to the LBAs that differ might return the data of either volume copy. Therefore, it is important not to use the volume unless you are sure that the host applications will not read the LBAs that differ or can manage the different data that potentially can be returned.
User response: Perform one of the following actions:
- Continue the repair starting with the next LBA after the difference to see how many differences there are for the whole mirrored volume. This can help you decide which of the following actions to take.
- Choose a primary disk and run repair resynchronizing differences.
- Run a repair and create medium errors for differences.
- Restore all or part of the volume from a backup.
- Decide which disk has correct data, then delete the copy that is different and re-create it, allowing it to be synchronized.
Then mark the error as fixed.
Possible Cause-FRUs or other:
- None

1610 There are too many copied media errors on a managed disk.
Explanation: The cluster maintains a virtual medium error table for each MDisk. This table is a list of logical block addresses on the managed disk that contain data that is not valid and cannot be read. The virtual medium error table has a fixed length. This error event indicates that the system has attempted to add an entry to the table, but the attempt has failed because the table is already full.
There are two circumstances that will cause an entry to be added to the virtual medium error table:
1. FlashCopy, data migration, and mirrored volume synchronization operations copy data from one managed disk extent to another. If the source extent contains a virtual medium error, or if the RAID controller reports a real medium error, the system creates a matching virtual medium error on the target extent.
2. The mirrored volume validate and repair process has the option to create virtual medium errors on sectors that do not match on all volume copies. Normally zero, or very few, differences are expected; however, if the copies have been marked as synchronized inappropriately, then a large number of virtual medium errors could be created.
User response: Ensure that all higher priority errors are fixed before you attempt to resolve this error.
Determine whether the excessive number of virtual medium errors occurred because of a mirrored disk validate and repair operation that created errors for differences, or whether the errors were created because of a copy operation. Follow the corresponding option shown below.
1. If the virtual medium errors occurred because of a mirrored disk validate and repair operation that created medium errors for differences, then also ensure that the volume copies had been fully synchronized prior to starting the operation. If the copies had been synchronized, there should be only a few virtual medium errors created by the validate and repair operation. In this case, it might be possible to rewrite only the data that was not consistent on the copies using the local data recovery process. If the copies had not been synchronized, it is likely that there are now a large number of medium errors on all of the volume copies. Even if the virtual medium errors are expected to be only for blocks that have never been written, it is important to clear the virtual medium errors to avoid inhibition of other operations. To recover the data for all of these virtual medium errors, it is likely that the volume will have to be recovered from a backup using a process that rewrites all sectors of the volume.
2. If the virtual medium errors have been created by a copy operation, it is best practice to correct any medium errors on the source volume and to not propagate the medium errors to copies of the volume. Fixing higher priority errors in the event log would have corrected the medium error on the source volume. Once the medium errors have been fixed, you must run the copy operation again to clear the virtual medium errors from the target volume. It might be necessary to repeat a sequence of copy operations if copies have been made of already copied medium errors.
An alternative that does not address the root cause is to delete volumes on the target managed disk that have the virtual medium errors. This volume deletion reduces the number of virtual medium error entries in the MDisk table. Migrating the volume to a different managed disk will also delete entries in the MDisk table, but will create more entries on the MDisk table of the MDisk to which the volume is migrated.
Possible Cause-FRUs or other:
- None

1620 A storage pool is offline.
Explanation: A storage pool is offline.
User response:
1. Repair the faults in the order shown.
2. Start a cluster discovery operation by rescanning the Fibre Channel network.
3. Check managed disk (MDisk) status. If all MDisks show a status of online, mark the error that you have just repaired as fixed. If any MDisks do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the disk controller.
4. Go to the repair verification MAP.

If there are unfixed higher-priority errors that relate to the SAN or to disk controllers, those errors should be fixed before resolving this error because they might indicate the reason for the lack of redundancy. Error codes that must be fixed first are:
- 1210 Local FC port excluded
- 1230 Login has been excluded
Note: This error can be reported if the required action, to rescan the Fibre Channel network for new MDisks, has not been performed after a deliberate reconfiguration of a disk controller or after SAN rezoning.

3. Ensure that the logical unit is mapped to all of the nodes.
4. Ensure that the logical unit is mapped to all of the nodes using the same LUN.
5. Run the console or CLI command to discover MDisks and ensure that the command completes.
6. Mark the error that you have just repaired as fixed. When you mark the error as fixed, the controller's MDisk availability is tested and the error will be logged again immediately if the error persists for any MDisks. It is possible that the new error will report a different MDisk.
7. Go to the repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
- Fibre Channel network fabric fault (50%)
- Enclosure/controller fault (50%)

010043 A disk controller is accessible through only half, or less, of the previously configured controller ports.
- Although there might still be multiple ports that are accessible on the disk controller, a hardware component of the controller might have failed or one of the SAN fabrics has failed such that the operational system configuration has been reduced to a single point of failure.
- The error data indicates a port on the disk controller that is still connected, and also lists controller ports that are expected but that are not connected.
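The battery-removal check in the 1474 user response has three documented outcomes. The Python sketch below maps each outcome to its next step. It only builds the argument list for the chnodebattery command named in the guide and does not run it; how you run the CLI and capture its error text is not shown and is left to you. Matching on the symbolic names in parentheses (BATTERY_OFFLINE, BATTERY_NOT_REDUNDANT) assumes that the returned message contains them as quoted above. The fallback for any other error is a conservative assumption that the guide does not state, and the function names are hypothetical.

```python
from typing import List


def build_remove_command(battery_id: str, node_id: str) -> List[str]:
    """Argument list for: chnodebattery -remove -battery battery_ID node_ID."""
    return ["chnodebattery", "-remove", "-battery", battery_id, node_id]


def action_after_remove(error_text: str) -> str:
    """Map the command's error text (empty string on success) to the next step."""
    if "BATTERY_OFFLINE" in error_text:
        return "Replace the battery immediately."
    if "BATTERY_NOT_REDUNDANT" in error_text:
        return "Do not remove the battery; it still provides data protection."
    if error_text:
        # Not covered by the guide: a conservative default (assumption).
        return "Stop and investigate the error before removing the battery."
    return "The battery is no longer relied on; replace it."
```

Keeping the command construction separate from the interpretation lets the same decision logic work whichever way the command is run.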
you have just repaired as fixed. If any managed disks do not show a status of online, go to the start MAP. If you return to this step, contact your support center to resolve the problem with the disk controller.
5. Go to repair verification MAP.

Possible Cause-FRUs or other:
- None

Other:
Enclosure/controller fault (100%)

1670 The CMOS battery on the system board failed.
Explanation: The CMOS battery on the system board failed.
User response: Replace the node until the FRU is available.
Possible Cause-FRUs or other:
CMOS battery (100%)

1680 Drive fault type 1
Explanation: Drive fault type 1
User response: Replace the drive.
Possible Cause-FRUs or other:
Drive (95%)
Canister (3%)
Midplane (2%)

1684 Drive is missing.
Explanation: Drive is missing.
User response: Install the missing drive. The drive is typically a data drive that was previously part of the array.
Possible Cause-FRUs or other:
Drive (100%)

1686 There are too many drives attached to the system.
Explanation: The cluster supports only a fixed number of drives.
User response:
1. Disconnect any excess unmanaged enclosures from the system.
2. Unmanage any offline drives that are not present in the system.
3. Identify unused drives and remove them from the enclosures.
4. Identify arrays of drives that are no longer required. Remove the arrays, and remove the drives from the enclosures if they are present.
5. Once there are fewer than 1056 drives in the system, consider re-engineering system capacity by migrating data from small arrays onto large arrays, then removing the small arrays and the drives that formed them.
6. Consider the need for an additional Storwize system in your SAN solution.

1689 Array MDisk has lost redundancy.
Explanation: Array MDisk has lost redundancy. The RAID 5 system is missing a data drive.
User response: Replace the missing or failed drive.
Possible Cause-FRUs or other:
Drives removed or failed (100%)

1690 No spare protection exists for one or more array MDisks.
Explanation: The system spare pool cannot immediately provide a spare of any suitability to one or more arrays.
User response:
1. Configure an array but no spares.
2. Configure many arrays and a single spare. Cause that spare to be consumed or change its use.
For a distributed array, unused or candidate drives are converted into array members:
1. Determine the number of rebuild areas available and the threshold that is set.
2. Check for unfixed higher-priority errors.
3. Check for unused and candidate drives that are suitable for the distributed array. Run the lsarraymembergoals command to determine drive suitability by using tech_type, capacity, and rpm information.
- Add the drives to the array, up to the number of missing array members.
- Recheck after the array members are added.
4. If no drives are available, drives must be added to restore the wanted number of rebuild areas.
- If the threshold is greater than the number of rebuild areas available, and the threshold is greater than 1, consider reducing the threshold to the number of drives that are available.
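The distributed-array checks for error 1690 can be summarized as a small piece of decision logic. The following Python sketch is illustrative only. The DistributedArrayState class and its field names are hypothetical stand-ins for values that you would read from the system (for example, from lsarraymembergoals output). The function returns suggested actions and does not run any commands.

```python
from dataclasses import dataclass


@dataclass
class DistributedArrayState:
    """Hypothetical snapshot of one distributed array (illustrative field names)."""
    rebuild_areas_available: int   # rebuild areas currently available
    rebuild_areas_threshold: int   # threshold that is set for the array
    missing_members: int           # array members that are currently missing
    suitable_drives: int           # unused/candidate drives judged suitable


def plan_spare_protection(state: DistributedArrayState) -> list[str]:
    """Return the suggested next actions for error 1690 on a distributed array."""
    # Step 3: add suitable drives, but never more than the number of missing members.
    to_add = min(state.suitable_drives, state.missing_members)
    if to_add > 0:
        return [f"add {to_add} drive(s) to the array, then recheck"]

    # Step 4: no suitable drive can be added, so new drives are needed.
    actions = ["add drives to restore the wanted number of rebuild areas"]
    if (state.rebuild_areas_threshold > state.rebuild_areas_available
            and state.rebuild_areas_threshold > 1):
        actions.append("consider reducing the rebuild-area threshold")
    return actions


if __name__ == "__main__":
    print(plan_spare_protection(DistributedArrayState(0, 2, 1, 0)))
```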
1691 A background scrub process has found an inconsistency between data and parity on the array.
Explanation: The array has at least one stride where the data and parity do not match. RAID has found an inconsistency between the data stored on the drives and the parity information. This could mean either that the data has been corrupted or that the parity information has been corrupted.
User response: Follow the directed maintenance procedure for inconsistent arrays.

1692 Array MDisk has taken a spare member that does not match array goals.
Explanation:
1. A member of the array MDisk has technology or capability that does not exactly match the established goals of the array.
2. The array is configured to want location matches, and the drive location does not match all the location goals.
User response: The error fixes itself automatically as soon as the rebuild or exchange is queued. It does not wait until the array shows balanced = exact (which indicates that all populated members have an exact capability match and an exact location match).

1695 Persistent unsupported disk controller configuration.
Explanation: A disk controller configuration that might prevent failover for the cluster has persisted for more than four hours. The problem was originally logged through a 010032 event, service error code 1625.
User response:
1. Fix any higher-priority error. In particular, follow the service actions to fix the 1625 error indicated by this error's root event. This error will be marked as fixed when the root event is marked as fixed.
2. If the root event cannot be found, or is marked as fixed, perform an MDisk discovery and mark this error as fixed.
3. Go to repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
- Enclosure/controller fault

1700 Unrecovered remote copy relationship
Explanation: This error might be reported after the recovery action for a clustered system failure or a complete I/O group failure. The error is reported because some remote copy relationships, whose control data is stored by the I/O group, could not be recovered.
User response: To fix this error, delete all of the relationships that might not be recovered, and then re-create the relationships.
1. Note the I/O group index against which the error is logged.
2. List all of the relationships that have either a master or an auxiliary volume in this I/O group. Use the volume view to determine which volumes in the I/O group you noted have a relationship defined.
3. Note the details of the relationships that are listed so that they can be re-created.
If the affected I/O group has active-active relationships that are in a consistency group, run the command chrcrelationship -noconsistgrp rc_rel_name for each active-active relationship that was not recovered. Then use the lsrcrelationship command to see the value of the primary attribute, in case volume labels have changed.
4. Delete all of the relationships that are listed in step 2, except any active-active relationship that has host applications that use the auxiliary volume through the master volume unique ID (that is, the primary attribute value is auxiliary in the output from lsrcrelationship).
For the active-active relationships that have the primary attribute value of auxiliary, use the rmvdisk -keepaux CLI command, which also deletes the relationship. For example: rmvdisk -keepaux master_volume_id/name.
Note: The error is automatically marked as fixed once the last relationship on the I/O group is deleted. New relationships must not be created until the error is fixed.
5. Re-create all the relationships that you deleted by using the details noted in step 3.
Note: For Metro Mirror and Global Mirror relationships, you can delete a relationship from either the master or the auxiliary system; however, you must re-create the relationship on the master system. Therefore, it might be necessary to go to another system to complete this service action.
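Steps 2 through 4 for error 1700 sort the affected relationships into groups. The following Python sketch builds a list of CLI commands for you to review. It does not run anything. The sketch makes several assumptions:
- You have already collected each relationship's name, master and auxiliary volume names, copy type, consistency group, and primary attribute (for example, from lsrcrelationship output).
- Every relationship in the I/O group is treated as not recovered.
- An active-active relationship reports its copy type as activeactive.
- Ordinary deletions use rmrcrelationship.

The dictionary keys are also hypothetical. Verify all of these assumptions against the CLI reference for your code level before you use any generated command.

```python
def plan_1700_cleanup(relationships: list[dict], iogrp_volumes: set[str]):
    """Return (commands, saved_details) for relationships touching the failed I/O group.

    Each relationship is a dict with the assumed keys: name, master_vdisk_name,
    aux_vdisk_name, copy_type, consistency_group_name, primary.
    """
    commands: list[str] = []
    saved_details: list[dict] = []
    for rel in relationships:
        # Step 2: keep only relationships with a master or auxiliary volume in the I/O group.
        if not ({rel["master_vdisk_name"], rel["aux_vdisk_name"]} & iogrp_volumes):
            continue
        # Step 3: record the details so that the relationship can be re-created later.
        saved_details.append(dict(rel))
        active_active = rel.get("copy_type") == "activeactive"
        if active_active and rel.get("consistency_group_name"):
            commands.append(f"chrcrelationship -noconsistgrp {rel['name']}")
        if active_active and rel.get("primary") in ("aux", "auxiliary"):
            # Hosts use the auxiliary volume: remove the master volume and keep the aux.
            commands.append(f"rmvdisk -keepaux {rel['master_vdisk_name']}")
        else:
            # Step 4: delete the relationship itself (assumed command).
            commands.append(f"rmrcrelationship {rel['name']}")
    return commands, saved_details
```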
Enclosure 20%
SAS port 20%

1740 Recovery encryption key not available.
Explanation: The recovery encryption key is not available.
User response: Make the recovery encryption key available.
1. If the key is not available:
- Install a USB drive with the encryption key.
- Ensure that the correct file is on the USB drive.
2. If the key is not valid (the key does not have a valid CRC):
- Get a USB drive with a valid key for this MTMS.
Possible Cause-FRUs or other:
No FRU

1780 Encryption key changes are not committed.
Explanation: Encryption key changes are not committed.
User response:
1. Install the USB sticks into each canister.
Note: Use chencryption -usb -validate to determine whether there is a valid key.
2. Press Next to cancel the pending key change.
gives information on the reason. The cluster maintains backup copies of the metadata, and it might be possible to repair the thin-provisioned volume using this data.
User response: The cluster is able to repair the inconsistency in some circumstances. Run the repair volume option to start the repair process. This repair process, however, can take some time. In some situations it might be more appropriate to delete the thin-provisioned volume and reconstruct a new one from a backup or mirror copy.
If you run the repair procedure and it completes, this error is automatically marked as fixed; otherwise, another error event (error code 1860) is logged to indicate that the repair action has failed.
Possible Cause-FRUs or other:
- None

- Consider reducing the value of the storage pool warning threshold to give more time to allocate extra space.
If the volume copy is not auto-expand enabled, perform one or more of the following actions. In this case, the error will automatically be marked as fixed, and the volume copy will return online when space is available.
- Determine why the thin-provisioned volume copy used space has grown at the rate that it has. There might be an application error.
- Increase the real capacity of the volume copy.
- Enable auto-expand for the thin-provisioned volume copy.
- Consider reducing the value of the thin-provisioned volume copy warning threshold to give more time to allocate more real space.
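You can estimate how much extra time a lower warning threshold provides. The estimate rests on two assumptions: space is consumed at a roughly constant rate, and the warning is raised when used capacity reaches the threshold percentage of capacity. Under those assumptions, the time between the warning and running out of space equals the remaining headroom divided by the growth rate. The following Python sketch is only a rough estimate. The function name and units are illustrative.

```python
def hours_of_warning(capacity_gib: float, warning_pct: float,
                     growth_gib_per_hour: float) -> float:
    """Hours between the warning event and exhausting capacity (constant growth assumed)."""
    if not 0 <= warning_pct <= 100:
        raise ValueError("warning_pct must be between 0 and 100")
    if growth_gib_per_hour <= 0:
        raise ValueError("growth rate must be positive")
    # Headroom is the capacity that remains unused when the warning fires.
    headroom_gib = capacity_gib * (100 - warning_pct) / 100
    return headroom_gib / growth_gib_per_hour


# With 1000 GiB of capacity and 10 GiB/hour growth, an 80% threshold leaves
# 20 hours between the warning and full; lowering it to 60% leaves 40 hours.
print(hours_of_warning(1000, 80, 10), hours_of_warning(1000, 60, 10))
```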
the Metro Mirror and Global Mirror requirements described in the planning documentation are met. Ensure that any changes to the applications using Metro Mirror or Global Mirror have been taken into account. Resolve any issues.
9. On the secondary cluster, examine the 2145 statistics using a SAN productivity monitoring tool and confirm that all the Metro Mirror and Global Mirror requirements described in the software installation and configuration documentation are met. Resolve any issues.
10. On the intercluster link, examine the performance of each component using an appropriate SAN productivity monitoring tool to ensure that they are operating as expected. Resolve any issues.
11. Mark the error as fixed and restart the Metro Mirror or Global Mirror relationship.
- Secondary 2145 cluster or SAN fabric configuration (25%)
- Intercluster link problem (15%)
- Intercluster link configuration (25%)

1925 Cached data cannot be destaged.
Explanation: Problem diagnosis is required.
User response:
1. Run the directed maintenance procedure to fix all errors of a higher priority. This will allow the cached data to be destaged and the originating event to be marked fixed.
Possible Cause-FRUs or other:
- None
1950 Unable to mirror medium error.
Explanation: During the synchronization of a mirrored volume copy, it was necessary to duplicate the record of a medium error onto the volume copy, creating a virtual medium error. Each managed disk has a table of virtual medium errors. The virtual medium error could not be created because the table is full. The volume copy is in an inconsistent state and has been taken offline.
User response: Three different approaches can be taken to resolve this problem: 1) the source volume copy can be fixed so that it does not contain medium errors, 2) the number of virtual medium errors on the target managed disk can be reduced, or 3) the target volume copy can be moved to a managed disk with more free virtual medium error entries.
The managed disk with a full medium error table can be determined from the data of the root event.
Approach 1) - This is the preferred procedure because it restores the source volume copy to a state where all of the data can be read. Use the normal service procedures for fixing a medium error (rewrite the block or volume from backup, or regenerate the data using local procedures).
Approach 2) - This method can be used if the majority of the virtual medium errors on the target managed disk do not relate to the volume copy. Use the event log events to determine where the virtual medium errors are, and rewrite the block or volume from backup.
Approach 3) - Delete the offline volume copy and create a new one, either forcing the use of different MDisks in the storage pool or using a completely different storage pool.
Follow your selected approach or approaches, and then mark the error as fixed.
Possible Cause-FRUs or other:
- None

2008 A software downgrade has failed.
Explanation: Cluster configuration changes are restricted until the downgrade is completed. The cluster downgrade process waits for user intervention when this error is logged.
User response: The action required to recover from a stalled downgrade depends on the current state of the cluster being downgraded. Call IBM Support for an action plan to resolve this problem.
Possible Cause-FRUs or other:
- None
Other:
2145 software (100%)

2010 A software update has failed.
Explanation: Cluster configuration changes are restricted until the update is completed or rolled back. The cluster update process waits for user intervention when this error is logged.
User response: The action required to recover from a stalled update depends on the current state of the cluster being updated. Call IBM technical support for an action plan to resolve this problem.
Possible Cause-FRUs or other:
- None
Other:
2145 software (100%)

2020 IP Remote Copy link unavailable.
Explanation: IP Remote Copy link is unavailable.
User response: Fix the remote IP link so that traffic can flow correctly. Once the connection is made, the error will auto-correct.

2021 Partner cluster IP address unreachable.
Explanation: Partner cluster IP address unreachable.
User response:
1. Verify the system IP address of the remote system forming the partnership.
2. Check whether the remote cluster IP address is reachable from the local cluster. To verify accessibility:
a. Use svctask to ping the remote cluster IP address. If the ping works, there might be a block on the specific port traffic that needs to be opened in the network. If the ping does not work, there might be no route between the systems. Check the IP gateway configuration on the SAN Volume Controller nodes and the IP network configuration.
b. Check the configuration of the routers and firewall to ensure that TCP/IP port 3620, used for IP partnership, is not blocked.
c. Use the ssh command from another system to attempt to establish a session with the problematic remote cluster IP address to confirm that the remote cluster is operational.

2022 Cannot authenticate with partner cluster.
Explanation: Cannot authenticate with partner cluster.
User response: Verify that the CHAP secret set for the partnership by using the mkippartnership or chpartnership CLI commands matches the remote system CHAP secret set by using the chsystem CLI command. If they do not match, use the appropriate commands to set the correct CHAP secrets.

2023 Unexpected cluster ID for partner cluster.
Explanation: Unexpected cluster ID for partner cluster.
User response: After deleting all relationships and consistency groups, remove the partnership.
This is an unrecoverable error when one of the sites has undergone a T3 recovery and lost all partnership information. Contact IBM support.

2030 Software error.
Explanation: The software has restarted because of a problem in the cluster, on a disk system, or on the Fibre Channel fabric.
User response:
1. Collect the software dump files generated at the time the error was logged on the cluster.
2. Contact your product support center to investigate and resolve the problem.
3. Ensure that the software is at the latest level on the cluster and on the disk systems.
4. Use the available SAN monitoring tools to check for any problems on the fabric.
5. Mark the error that you have just repaired as fixed.
6. Go to repair verification MAP.
Possible Cause-FRUs or other:
- Your support center might indicate a FRU based on their problem analysis (2%)
Other:
- Software (48%)
- Enclosure/controller software (25%)
- Fibre Channel switch or switch configuration (25%)

2035 Drive has disabled protection information support.
Explanation: An array has been interrupted in the process of establishing data integrity protection information on one or more of its members by initial writes or rebuild writes.
To ensure that the array is usable, the system has turned off hardware data protection for the member drive.
User response: If many or all the member drives in an array have logged this error, and sufficient storage exists in the pool to migrate the allocated extents, then the simplest strategy is to delete the array and re-create it once the drive service action has been accomplished.
If a small number of drives are affected, then it is simplest to remove these drives from the array and service them individually. This option is not possible if the array is currently syncing post recovery.

2040 A software update is required.
Explanation: The software cannot determine the VPD for a FRU. Probably, a new FRU has been installed and the software does not recognize that FRU.
User response:
1. If a FRU has been replaced, ensure that the correct replacement part was used. The node VPD indicates which part is not recognized.
2. Ensure that the cluster software is at the latest level.
3. Save dump data with configuration dump and logged data dump.
4. Contact your product support center to resolve the problem.
5. Mark the error that you have just repaired as fixed.
6. Go to repair verification MAP.
Possible Cause-FRUs or other:
- None
Other:
2145 software (100%)

2055 System reboot required.
Explanation: System reboot required.
User response: The software update is not complete. Reboot the system.
The system will not be available for I/O or system management during the system reset.

2060 Manual discharge of batteries required.
Explanation: Manual discharge of batteries required.
User response: Use chenclosureslot -battery -slot 1 -recondition on to start battery calibration.

2070 A drive has been detected in an enclosure that does not support that drive.
Explanation: A drive has been detected in an enclosure that does not support that drive.
User response: Remove the drive. If the result is an invalid number of drives, replace the drive with a valid drive.
Possible Cause-FRUs or other:
Drive (100%)
User response: Run the fix procedure for this event, which helps you with the following tasks:
1. Run the Detect MDisks task so that the system determines the current performance category of each MDisk. When the detection task is complete, if performance has reverted, the event is automatically marked as fixed.
2. If the event is not automatically fixed, you can change the tier of the MDisk to the recommended tier shown in the event properties. The recommended tier is logged in the event (bytes 9-13 of the sense data): a value of 10 hex indicates the flash tier, and a value of 20 hex indicates the enterprise tier.
3. If you choose not to change the tier configuration, mark the event as fixed.

2258 System SSL certificate has expired.
Explanation: System SSL certificate has expired. Connections to the GUI, service assistant, and CIMOM are likely to generate security exceptions.

- Because this error indicates a problem with the number of sessions that are attempting external access to the cluster, determine why so many SSH sessions have been opened.
- To view and manage the open SSH sessions, run the fix procedure for this error from Troubleshooting > Recommended Actions in the management GUI.

2550 Encryption key on USB flash drive removed
Explanation: The USB flash drive in a particular node or port has been removed. This USB flash drive contained a valid encryption key for the system. Unauthorized removal can compromise data security.
User response: If your data has been compromised, perform a rekey operation immediately.
2600 The cluster was unable to send an email.
Explanation: The cluster attempted to send an email in response to an event, but the SMTP mail server did not acknowledge receiving it. Possible reasons include:
- The cluster was unable to connect to the configured SMTP server.
- The server rejected the email.
- A timeout occurred.
The SMTP server might not be running or might not be correctly configured, or the cluster might not be correctly configured. The test email function does not log this error because it responds immediately with a result code.
User response:
- Ensure that the SMTP email server is active.
- Ensure that the SMTP server TCP/IP address and port are correctly configured in the cluster email configuration.
- Send a test email and validate that the change has corrected the issue.
- Mark the error that you have just repaired as fixed.
- Go to MAP 5700: Repair verification.
Possible Cause-FRUs or other:
- None

2601 Error detected while sending an email.
Explanation: An error occurred while the cluster was attempting to send an email in response to an event. The cluster cannot determine whether the email has been sent and will attempt to resend it. The problem might be with the SMTP server or with the cluster email configuration. The problem might also be caused by a failover of the configuration node. The test email function does not log this error because it responds immediately with a result code.
User response:
- If there are higher-priority unfixed errors in the log, fix those errors first.
- Ensure that the SMTP email server is active.
- Ensure that the SMTP server TCP/IP address and port are correctly configured in the cluster email configuration.

Explanation: Cluster time cannot be synchronized with the NTP network time server that is configured.
User response: There are three main causes to examine:
- The cluster NTP network time server configuration is incorrect. Ensure that the configured IP address matches that of the NTP network time server.
- The NTP network time server is not operational. Check the status of the NTP network time server.
- The TCP/IP network is not configured correctly. Check the configuration of the routers, gateways, and firewalls. Ensure that the cluster can access the NTP network time server and that the NTP protocol is permitted.
The error will automatically fix when the cluster is able to synchronize its time with the NTP network time server.
Possible Cause-FRUs or other:
- None

2702 Check configuration settings of the NTP server on the CMM
Explanation: The node is configured to automatically set the time using an NTP server within the CMM. It is not possible to connect to the NTP server during authentication. The NTP server configuration cannot be changed within S-ITE. Within the CMM, there are changeable NTP settings. However, these settings configure only how the CMM gets the time and date. The internal CMM NTP server that is used by the S-ITE cannot be changed or configured. This event is raised only when an attempt is made to use the server, which happens once every half hour.
Note: The NTP configuration settings are re-read from the CMM before each connection.
A connection error can be due to the following:
- all suitable Ethernet ports are offline
- the CMM hardware is not operational
- the CMM is active but the CMM NTP server is offline.
An authentication issue can be due to the following:
- the authentication values provided were invalid
- the NTP server rejected the authentication key provided to the node by the CMM.
If the NTP port is an unsupported value, a port error can be displayed. Currently, only port 123 is supported. Only the current configuration node attempts to resync with the server.
User response:
1. Make sure that the CMM is operational by logging in and confirming its time.
2. Check that the IP address in the event log can be pinged from the node.
3. If there is an error, try rebooting the CMM.

3000 The 2145 UPS temperature is close to its upper limit. If the temperature continues to rise, the 2145 UPS will power off.
Explanation: The temperature sensor in the 2145 UPS is reporting a temperature that is close to the operational limit of the unit. If the temperature continues to rise, the 2145 UPS will power off for safety reasons. The sensor is probably reporting an excessively high temperature because the environment in which the 2145 UPS is operating is too hot.
User response:
1. Ensure that the room ambient temperature is within the permitted limits.
2. Ensure that the air vents at the front and back of the 2145 UPS are not obstructed.
3. Ensure that other devices in the same rack are not overheating.
4. When you are satisfied that the cause of the overheating has been resolved, mark the error fixed.

3001 The 2145 UPS-1U temperature is close to its upper limit. If the temperature continues to rise, the 2145 UPS-1U will power off.
Explanation: The temperature sensor in the 2145 UPS-1U is reporting a temperature that is close to the operational limit of the unit. If the temperature continues to rise, the 2145 UPS-1U will power off for safety reasons. The sensor is probably reporting an excessively high temperature because the environment in which the 2145 UPS-1U is operating is too hot.
User response:
1. Ensure that the room ambient temperature is within the permitted limits.
2. Ensure that the air vents at the front and back of the 2145 UPS-1U are not obstructed.
3. Ensure that other devices in the same rack are not overheating.
4. When you are satisfied that the cause of the overheating has been resolved, mark the error fixed.

3010 Internal uninterruptible power supply software error detected.
Explanation: Some of the tests that are performed during node startup did not complete. A software error in the uninterruptible power supply caused some of the data that it reported during node startup to be inconsistent. The node has determined that the uninterruptible power supply is functioning well enough for the node to continue operations. The operation of the cluster is not affected by this error. This error is usually resolved by power cycling the uninterruptible power supply.
User response:
1. Power cycle the uninterruptible power supply at a convenient time, as follows:
- Power off the one or two nodes attached to the uninterruptible power supply before you power off the uninterruptible power supply.
- Once the nodes have powered off, wait 5 minutes for the uninterruptible power supply to go into standby mode (flashing green AC LED). If this does not happen automatically, check the cabling to confirm that all nodes powered by this uninterruptible power supply have been powered off.
- Remove the power input cable from the uninterruptible power supply and wait at least 2 minutes for the uninterruptible power supply to clear its internal state.
- Reconnect the uninterruptible power supply power input cable and press the uninterruptible power supply ON button.
- Power on the nodes connected to this uninterruptible power supply.
2. If the error is reported again after the nodes are restarted, replace the 2145 UPS electronics assembly.
Possible Cause-FRUs or other:
- 2145 UPS electronics assembly (5%)
Other:
- Transient 2145 UPS error (95%)

3024 Technician port connection invalid
Explanation: The code has detected more than one MAC address through the connection, or the DHCP server has given out more than one address. The code therefore assumes that a switch is attached.
User response:
1. Remove the cable from the technician port.
2. (Optional) Disable additional network adapters on the laptop to which the cable is to be connected.
3. Ensure that DHCP is enabled on the network adapter.
4. If this is not possible, manually set the IP address to 192.168.0.2.
5. Connect a standard Ethernet cable between the network adapter and the technician port.
6. If this still does not work, reboot the node and repeat the above steps.
7. This event will auto-fix once either no connection or a valid connection has been detected.

that the sum of the capacities for all of the clusters does not exceed the licensed capacity.
- You can view the event data or the feature log to ensure that the licensed capacity is sufficient for the space that is actually being used. Contact your IBM sales representative if you want to change the capacity of the license.
- This error will automatically be fixed when a valid configuration is entered.
Possible Cause-FRUs or other:
- None
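The license checks above compare allocated capacity with used capacity. The following Python sketch is an illustration only. It assumes that you have gathered two values for each cluster, the capacity allocated to it from the license and the capacity it actually uses, and that all values are in the same unit. The function and its inputs are hypothetical.

```python
def license_problems(licensed: float, allocated: dict[str, float],
                     used: dict[str, float]) -> list[str]:
    """List the ways in which a shared license is over-committed."""
    problems = []
    total_allocated = sum(allocated.values())
    # The sum of the capacities allocated to all clusters must not exceed the license.
    if total_allocated > licensed:
        problems.append(f"allocated total {total_allocated} exceeds licensed {licensed}")
    # Each cluster's allocation must cover the space that it actually uses.
    for cluster, in_use in used.items():
        share = allocated.get(cluster, 0)
        if in_use > share:
            problems.append(f"{cluster}: used {in_use} exceeds allocated {share}")
    return problems
```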
portion of the license allocated to this cluster. Set the licensed FlashCopy capacity to zero if it is no longer being used.
- View the event data or the feature log to ensure that the licensed FlashCopy capacity is sufficient for the space actually being used. Contact your IBM sales representative if you want to change the licensed FlashCopy capacity.
- The error will automatically be fixed when a valid configuration is entered.
Possible Cause-FRUs or other:
- None

- To reduce the FlashCopy capacity, delete some FlashCopy mappings. The used FlashCopy size is the sum of the capacities of all of the volumes that are the source volume of a FlashCopy mapping.
- To reduce Global and Metro Mirror capacity, delete some Global Mirror or Metro Mirror relationships. The used Global and Metro Mirror size is the sum of the capacities of all of the volumes that are in a Metro Mirror or Global Mirror relationship; both master and auxiliary volumes are counted.
- The error will automatically be fixed when the licensed capacity is greater than the capacity that is being used.
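The used-capacity rules above translate directly into code. In the following Python sketch, the FlashCopyMapping and MirrorRelationship records and the GB units are hypothetical. The sketch counts each source volume once, even if that volume is the source of several mappings. That is one reading of "the sum of the capacities of all of the volumes that are the source volume of a FlashCopy mapping".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlashCopyMapping:
    source: str  # source volume name
    target: str  # target volume name


@dataclass(frozen=True)
class MirrorRelationship:
    master_capacity_gb: int
    aux_capacity_gb: int


def used_flashcopy_gb(mappings: list[FlashCopyMapping],
                      capacity_gb: dict[str, int]) -> int:
    """Sum the capacities of the distinct volumes that are FlashCopy sources."""
    sources = {m.source for m in mappings}  # a multi-target source is counted once
    return sum(capacity_gb[v] for v in sources)


def used_mirror_gb(relationships: list[MirrorRelationship]) -> int:
    """Sum master and auxiliary capacities; both copies are counted."""
    return sum(r.master_capacity_gb + r.aux_capacity_gb for r in relationships)
```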
you have not registered on the cluster. Update the cluster license configuration if you have a license.
- Decide whether you want to continue to use the Global Mirror or Metro Mirror features.
- If you want to use either the Global Mirror or the Metro Mirror feature, contact your IBM sales representative, arrange a license, and change the license settings for the cluster to register the license.
- If you want to use neither the Global Mirror nor the Metro Mirror feature, you must delete all of the Global Mirror and Metro Mirror relationships.
- The error will automatically fix when the situation is resolved.
Possible Cause-FRUs or other:
- None

3080 Global or Metro Mirror relationship or consistency group with deleted partnership
Explanation: A Global Mirror or Metro Mirror relationship or consistency group exists with a cluster whose partnership is deleted.
Beginning with SAN Volume Controller software version 4.3.1, this configuration is not supported and should be resolved. This condition can occur as a result of an update to SAN Volume Controller software version 4.3.1 or later.
User response: The issue can be resolved either by deleting all of the Global Mirror or Metro Mirror relationships or consistency groups that exist with a cluster whose partnership is deleted, or by re-creating all of the partnerships that they were using.
The error will automatically fix when the situation is resolved.
- List all of the Global Mirror and Metro Mirror relationships and note those where the master cluster name or the auxiliary cluster name is blank. For each of these relationships, also note the cluster ID of the remote cluster.
- List all of the Global Mirror and Metro Mirror consistency groups and note those where the master cluster name or the auxiliary cluster name is blank. For each of these consistency groups, also note the cluster ID of the remote cluster.
- Determine how many unique remote cluster IDs there are among all of the Global Mirror and Metro Mirror relationships and consistency groups that you identified in the first two steps. For each of these remote clusters, decide whether you want to re-establish the partnership with that cluster. Ensure that the total number of partnerships that you want to have with remote clusters does not exceed the cluster limit. In version 4.3.1 this limit is 1. If you re-establish a partnership, you will not have to delete the Global Mirror and Metro Mirror relationships and consistency groups that use the partnership.
- Re-establish any selected partnerships.
- Delete all of the Global Mirror and Metro Mirror relationships and consistency groups that you listed in either of the first two steps whose remote cluster partnership has not been re-established.
- Check that the system has marked the error as fixed. If it has not, return to the first step and determine which Global Mirror or Metro Mirror relationships or consistency groups are still causing the issue.
Possible Cause-FRUs or other:
- None

3081 Unable to send email to any of the configured email servers.
Explanation: Either the system was not able to connect to any of the SMTP email servers, or the email transmission has failed. A maximum of six email servers can be configured. Error event 2600 or 2601 is raised when an individual email server is found to be not working. This error indicates that all of the email servers were found to be not working.
User response:
- Check the event log for all unresolved 2600 and 2601 errors and fix those problems.
- If this error has not already been automatically marked fixed, mark this error as fixed.
- Perform the check email function to test that an email server is operating properly.
Possible Cause-FRUs or other:
- None

3090 Drive firmware download is cancelled by user or system, problem diagnosis required.
Explanation: The drive firmware download has been cancelled by the user or the system, and problem diagnosis is required.
User response: If you cancelled the download by using applydrivesoftware -cancel, this error is to be expected.
If you changed the state of any drive while the download was ongoing, this error is also to be expected. However, you must rerun the applydrivesoftware command to ensure that all of your drive firmware has been updated.
Otherwise:
1. Check the drive states using lsdrive. In particular, look at drives that show status=degraded, status=offline, or use=failed.
2. Check node states using lsnode or lsnodecanister, Explanation: System SSL certificate expires within the
and confirm all nodes are online. next 30 days.
3. Use lsdependentvdisks -drive <drive_id> to check The system SSL certificate that is used to authenticate
for vdisks that are dependent on specific drives. connections to the GUI, service assistant, and the
4. If the drive is a member of a RAID0 array, consider CIMOM is about to expire.
whether to introduce additional redundancy to
User response: Complete the following steps to
protect the data on that drive.
resolve this problem.
5. If the drive is not a member of a RAID0 array, fix
1. If you are using a self-signed certificate, then
any errors in the event log that relate to the array.
generate a new self-signed certificate.
6. Consider using the -force option. With any drive
2. If you are using a certificate that is signed by a
software upgrade there is a risk that the drive
certificate authority, generate a new certificate
might become unusable. Only use the -force option
request and get this certificate signed by your
if you accept this risk.
certificate authority. The existing certificate can
7. Reissue the applydrivesoftware again. continue to be used until the expiry date to provide
time to get the new certificate request signed and
Note: The lsdriveupgradeprogress command can be installed.
used to check the progress of the applydrivesoftware
command as it updates each drive. Possible Cause-FRUs or other:
v N/A
3130 System SSL certificate expires within
the next 30 days.
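For event 3130, it can help to know ahead of time how many days the certificate has left. The following Python sketch is illustrative only. It assumes you already have the certificate's expiry date as an OpenSSL-style notAfter string, for example from running openssl x509 -noout -enddate against an exported copy of the certificate. The function names are hypothetical. The 30-day window matches the threshold that this event reports.

```python
import ssl
import time
from typing import Optional

# Event 3130 is reported when the certificate expires within 30 days.
WARNING_WINDOW_DAYS = 30
SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the certificate's notAfter time.

    not_after is in OpenSSL text form, for example "Jun  1 12:00:00 2016 GMT".
    ssl.cert_time_to_seconds() converts it to seconds since the epoch (UTC).
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def expires_within_window(not_after: str, now: Optional[float] = None) -> bool:
    """True if the certificate falls inside the 30-day warning window."""
    return days_until_expiry(not_after, now) < WARNING_WINDOW_DAYS
```

If expires_within_window returns True, follow the user response above: regenerate a self-signed certificate, or get a new certificate request signed by your certificate authority.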
The following list identifies some of the hardware that might cause failures:
v Power, fan, or cooling
v Application-specific integrated circuits
v Installed small form-factor pluggable (SFP) transceiver
v Fiber-optic cables
If either the maintenance analysis procedures or the error codes sent you here,
complete the following steps:
Procedure
1. If the customer changed the SAN configuration by changing the Fibre Channel
cable connections or switch zoning, verify that the changes were correct and, if
necessary, reverse those changes.
2. Verify that the power is turned on to all switches and storage controllers that
the SAN Volume Controller system uses, and that they are not reporting any
hardware failures. If problems are found, resolve those problems before you
proceed further.
3. Verify that the Fibre Channel cables that connect the systems to the switches
are securely connected.
4. If the customer is running a SAN management tool, you can use that tool to
view the SAN topology and isolate the failing component.
Procedure
1. Wait 5 minutes and try again. The clients might still need to wait for the
services to restart.
2. Confirm that the SSL/TLS implementation of the client (for example, the web
browser or CIM management tool) is up to date and supports the level of
security that is being enforced.
3. If necessary, revert to a weaker SSL/TLS security level in SAN Volume
Controller and see whether this action resolves the issue.
4. If the problem is a browser problem, check the exact error message reported by
the browser.
If the error message is cipher error, SSL error, TLS error, or handshake error,
there is a problem with the secure connection. In this case, confirm that the
browser is up to date. All of the supported browsers (Internet Explorer,
Firefox, Firefox ESR, and Chrome) support TLS 1.2 in their latest versions.
If there is only a blank screen, it is likely that either the web service needs to
restart, or there is a problem unrelated to the security level.
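To check step 2 from a client workstation, you can confirm which TLS version a client library negotiates with the system. This Python sketch is an illustration, not an IBM tool. It uses only the standard ssl and socket modules and assumes that the management GUI listens on port 443. It disables certificate verification because the system might present a self-signed certificate, so the probe tests protocol negotiation only. The helper names are hypothetical.

```python
import socket
import ssl
from typing import Optional, Tuple


def make_probe_context(
    minimum: ssl.TLSVersion = ssl.TLSVersion.TLSv1_2,
) -> ssl.SSLContext:
    """Client context that refuses protocol versions below `minimum`."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = minimum
    # Protocol probe only: accept a self-signed system certificate.
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def negotiated_tls(
    host: str, port: int = 443, timeout: float = 10.0
) -> Tuple[Optional[str], str]:
    """Return (TLS version, cipher name) negotiated with host:port.

    Raises ssl.SSLError if the client and server cannot agree on a protocol
    or cipher, which corresponds to a cipher or handshake error.
    """
    ctx = make_probe_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0]
```

For example, call negotiated_tls("cluster_ip"), where cluster_ip is a placeholder for the system's management address. A successful call shows which protocol level the client and system agreed on.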
Drives cannot start using protection information for I/O requests on demand. They
must first be validated as having a correct format and general support for the
function within the code. SAN Volume Controller validates the format and general
support when the drive object is first discovered by the system. Because of this
validation requirement, no existing drive can use protection information after an
update from version 730, regardless of how the drive is used in the
configuration. The system can reject a request to make a drive a candidate if the
media is not formatted correctly for use with protection information. To begin
using protection information on an existing drive, use the system interface (GUI
or CLI) to unmanage and then rediscover the drive, so that the software can
reacquire the drive characteristics.
The lsdrive view contains the protection_enabled field, which shows whether a
drive is using protection information. Drives and arrays that exist at the time of
an update to version 740 do not automatically pick up support for protection
information. All newly discovered drives at this code level support protection
information. If the system has spare capacity, migration can proceed one MDisk at
a time. Otherwise, migration to using protection information on drives must
proceed drive by drive.
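To plan a drive-by-drive migration, you can list the drives that are not yet using protection information. The following Python sketch makes several assumptions. It assumes that the detailed view from lsdrive <drive_id> prints one field and value per line, separated by a space. It also assumes that the protection_enabled field reads yes or no. The sample text in the test is illustrative, not captured output.

```python
from typing import Dict, List


def parse_detailed_view(text: str) -> Dict[str, str]:
    """Parse a detailed CLI view printed as one 'field value' pair per line."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition(" ")
        fields[name] = value.strip()
    return fields


def drives_without_protection(views: Dict[str, str]) -> List[str]:
    """Return IDs of drives whose protection_enabled field is not 'yes'.

    `views` maps a drive ID to the text of its detailed lsdrive view.
    """
    return [
        drive_id
        for drive_id, text in views.items()
        if parse_detailed_view(text).get("protection_enabled") != "yes"
    ]
```

Each drive that the function returns must be unmanaged and rediscovered before it can use protection information.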
When you install a new expansion enclosure, follow the management GUI Add
Enclosure wizard. Select Monitoring > System. From the Actions menu, select
Add Enclosures.
The following items can indicate that a single Fibre Channel or 10 Gb Ethernet
link failed:
v The Fibre Channel port status on the front panel of the node
v The Fibre Channel status light-emitting diodes (LEDs) at the rear of the node
v An error that indicates that a single port failed (703, 723).
Use only IBM supported 10 Gb SFP transceivers with the SAN Volume Controller
2145-DH8. Using any other SFP transceivers can lead to unexpected system
behavior. Copper DAC is not supported by these 10 Gb ports. The SFP transceiver
replacement in a 10 Gbps Ethernet adapter port is governed by the following rules:
v An existing 10 Gb SFP transceiver is replaced with a new 10 Gb SFP transceiver:
The 10 Gbps Ethernet adapter port detects the new SFP transceiver and becomes
operational immediately.
v The 10 Gbps Ethernet adapter port has had an incorrect SFP transceiver since the
last reboot, and it is replaced with the correct 10 Gb SFP transceiver:
This situation can occur when an incompatible SFP transceiver (8 Gb SFP or 4 Gb
SFP) is inserted in the 10 Gbps Ethernet adapter port.
The node requires a reboot to detect the new SFP transceiver. The new SFP
transceiver becomes operational only after the reboot (no DMP is produced).
v The 10 Gbps Ethernet adapter port has had no SFP transceiver since the last
reboot, and the correct 10 Gb SFP transceiver is installed:
A system reboot is required to detect the new SFP transceiver.
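The three replacement rules above reduce to one question: what did the port hold since the last reboot? The following Python sketch encodes that decision for the case where a correct 10 Gb SFP transceiver has just been inserted. The enum and function names are hypothetical, and the sketch reflects only the rules stated in this section.

```python
from enum import Enum


class PortHistory(Enum):
    """What the 10 Gbps Ethernet port held since the last node reboot."""
    CORRECT_10GB_SFP = "a 10 Gb SFP transceiver"
    INCOMPATIBLE_SFP = "an incompatible SFP transceiver (for example 8 Gb or 4 Gb)"
    NO_SFP = "no SFP transceiver"


def reboot_needed(history: PortHistory) -> bool:
    """Apply the replacement rules when a correct 10 Gb SFP is inserted.

    Only a like-for-like 10 Gb swap is detected without a reboot; in the
    other two cases the port does not see the new transceiver until reboot.
    """
    return history is not PortHistory.CORRECT_10GB_SFP


def describe(history: PortHistory) -> str:
    """Human-readable outcome for the given port history."""
    if reboot_needed(history):
        return f"Port previously had {history.value}: reboot to detect the new SFP."
    return "10 Gb SFP replaced with 10 Gb SFP: port becomes operational immediately."
```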
Procedure
Attempt each of these actions, in the following order, until the failure is fixed.
1. Replace the SFP transceiver for the failing port on the node.
Note: SAN Volume Controller nodes support both longwave SFP transceivers and
shortwave SFP transceivers. You must replace an SFP transceiver with the same
type of SFP transceiver. For example, if the SFP transceiver to replace is a
longwave SFP transceiver, you must provide a suitable longwave replacement.
Removing the wrong SFP transceiver might result in loss of data access.
2. Replace the Fibre Channel adapter or Fibre Channel over Ethernet adapter on
the node.
For network problems, you can attempt any of the following actions:
v Test your connectivity between the host and SAN Volume Controller ports.
v Try to ping the SAN Volume Controller system from the host.
v Ask the Ethernet network administrator to check the firewall and router settings.
v Check that the subnet mask and gateway are correct for the SAN Volume
Controller host configuration.
Using the management GUI for SAN Volume Controller problems, you can attempt
any of the following actions:
v View the configured node port IP addresses.
v View the list of volumes that are mapped to a host to ensure that the volume
host mappings are correct.
v Verify that the volume is online.
For host problems, you can attempt any of the following actions:
v Verify that the host iSCSI qualified name (IQN) is correctly configured.
v Use operating system utilities (such as Windows device manager) to verify that
the device driver is installed, loaded, and operating correctly.
v If you configured a VLAN, check that its settings are correct. Ensure that the
host Ethernet port, the SAN Volume Controller Ethernet port IP addresses, and the
switch port are on the same VLAN ID. Ensure that each VLAN uses a different
subnet; configuring the same subnet on different VLAN IDs can cause network
connectivity problems.
If node error code 705 is displayed, the FC I/O port is inactive.
Fibre Channel over Ethernet uses Fibre Channel as the protocol and Ethernet as the
interconnect.
Note: On a Fibre Channel over Ethernet enabled port, either the Fibre Channel
forwarder (FCF) is not seen, or the Fibre Channel over Ethernet feature is not
configured on the switch.
v Verify that the Fibre Channel over Ethernet feature is enabled on the FCF.
v Verify the remote port (switch port) properties on the FCF.
Run lsfabric, and verify that the host is seen as a remote port in the output. If
the host is not seen, complete the following checks in order:
v Verify that SAN Volume Controller and the host get a Fibre Channel ID (FCID) on
the FCF. If you cannot verify this, check the VLAN configuration.
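When many hosts are zoned, you can script the check for whether the host appears as a remote port. This Python sketch assumes that you captured the output of lsfabric -delim : as text, with a header row, and that the output includes a remote_wwpn column. The column names and WWPNs in the test are illustrative, and check_host_ports is a hypothetical helper.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def parse_delimited(text: str, delim: str = ":") -> List[Dict[str, str]]:
    """Parse concise CLI output produced with -delim, header row first."""
    reader = csv.DictReader(io.StringIO(text.strip()), delimiter=delim)
    return list(reader)


def check_host_ports(
    lsfabric_text: str, host_wwpns: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split the host WWPNs into (seen as remote ports, not seen)."""
    rows = parse_delimited(lsfabric_text)
    # Short rows yield None for missing columns, so fall back to "".
    seen = {(row.get("remote_wwpn") or "").upper() for row in rows}
    wanted = [wwpn.upper() for wwpn in host_wwpns]
    found = [wwpn for wwpn in wanted if wwpn in seen]
    missing = [wwpn for wwpn in wanted if wwpn not in seen]
    return found, missing
```

Any WWPN in the missing list is a host port that the system does not see. Continue with the FCID and VLAN checks above for those ports.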
What to do next
If the problem is not resolved, verify the state of the host adapter.
v Unload and load the device driver
v Use the operating system utilities (for example, Windows Device Manager) to
verify the device driver is installed, loaded, and operating correctly.
The following categories represent the types of service actions for storage systems:
v Controller code update
v Field replaceable unit (FRU) replacement
Ensure that you are familiar with the following guidelines for updating controller
code:
v Check to see if the SAN Volume Controller supports concurrent maintenance for
your storage system.
v Allow the storage system to coordinate the entire update process.
v If it is not possible to allow the storage system to coordinate the entire update
process, perform the following steps:
1. Reduce the storage system workload by 50%.
2. Use the configuration tools for the storage system to manually fail over all
logical units (LUs) from the controller that you want to update.
3. Update the controller code.
4. Restart the controller.
5. Manually fail back the LUs to their original controller.
6. Repeat for all controllers.
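The manual sequence above is a loop that updates one controller at a time while its partner carries the LUs. The following Python sketch shows only the ordering of those steps. Every callable passed in is a hypothetical wrapper around the storage vendor's own configuration tools. None of them is a SAN Volume Controller command, and the sketch assumes that step 1, reducing the workload, is already done.

```python
from typing import Callable, Iterable


def update_controllers_manually(
    controllers: Iterable[str],
    partner_of: Callable[[str], str],
    fail_over: Callable[[str, str], None],
    update_code: Callable[[str], None],
    restart: Callable[[str], None],
    fail_back: Callable[[str, str], None],
) -> None:
    """Update one controller at a time while its partner carries the LUs.

    Assumes the storage system workload was already reduced (step 1).
    """
    for controller in controllers:
        partner = partner_of(controller)
        fail_over(controller, partner)   # step 2: move all LUs to the partner
        update_code(controller)          # step 3
        restart(controller)              # step 4
        fail_back(partner, controller)   # step 5: return LUs to the original
    # Step 6 is the loop itself: repeat for all controllers.
```

The key design point is that the function never starts updating the next controller until the LUs have been failed back to the one just updated.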
FRU replacement
Ensure that you are familiar with the following guidelines for replacing FRUs:
v If the component that you want to replace is directly in the host-side data path
(for example, cable, Fibre Channel port, or controller), disable the external data
paths to prepare for update. To disable external data paths, disconnect or disable
the appropriate ports on the fabric switch. The SAN Volume Controller ERPs
reroute access over the alternate path.
v If the component that you want to replace is in the internal data path (for
example, cache or drive) and did not completely fail, ensure that the data is
backed up before you attempt to replace the component.
HyperSwap
When you resume HyperSwap replication, consider whether you want to continue
using the out-of-date consistent copy or revert to the up-to-date copy. To identify
whether the master or auxiliary volume has access, look at the primary field that is
shown by the lsrcrelationship or lsrcconsistgrp command. To continue using
the out-of-date copy, provide that value as the argument to the -primary parameter
of the startrcrelationship or startrcconsistgrp command. To revert to the
up-to-date copy, specify the opposite value as the argument to the -primary
parameter. For example, if master is shown in the primary field of lsrcconsistgrp
for an active-active consistency group in the Idling state, to revert to the up-to-date
copy, use startrcconsistgrp -primary aux.
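The rule for choosing the -primary argument can be written as a small function. This Python sketch only builds the command string and does not run anything. The consistency group name hsgrp0 in the test is hypothetical. The sketch adds no parameters beyond -primary, so check whether your situation needs any others before you issue the command.

```python
VALID_PRIMARY = ("master", "aux")


def primary_argument(primary_field: str, revert_to_up_to_date: bool) -> str:
    """Choose the -primary value from the primary field of lsrcconsistgrp.

    Keeping the current value continues with the out-of-date consistent copy;
    the opposite value reverts to the up-to-date copy.
    """
    if primary_field not in VALID_PRIMARY:
        raise ValueError(f"unexpected primary field: {primary_field!r}")
    if not revert_to_up_to_date:
        return primary_field
    return "aux" if primary_field == "master" else "master"


def start_command(group: str, primary_field: str, revert_to_up_to_date: bool) -> str:
    """Build (but do not run) the startrcconsistgrp command line."""
    primary = primary_argument(primary_field, revert_to_up_to_date)
    return f"startrcconsistgrp -primary {primary} {group}"
```

For the example in the text, start_command("hsgrp0", "master", True) produces startrcconsistgrp -primary aux hsgrp0.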
Note: Inappropriate use of these procedures can allow host systems to make
independent modifications to both the primary and secondary copies of data. You
are responsible for ensuring that no host systems are still using the primary
copy of the data before you enable access to the secondary copy.
Stretched System
Use the satask overridequorum command to enable access to the storage at the
secondary site. This feature is only available if the system was configured by
assigning sites to nodes and storage controllers, and changing the system topology
to stretched.
Important: If disaster recovery was run on one site and the remaining, failed site
(which contained the configuration node at the time of the disaster) is then
powered up, the cluster on that site asserts itself as designed. This starts a
second, identical cluster in parallel, which can cause data corruption. To avoid
this, complete the following steps:
Example
1. Remove the connectivity of the nodes from the site that is experiencing the
outage.
2. Power up or recover those nodes.
3. Run the satask leavecluster -force or svctask rmnode command for all the
nodes in the cluster.
4. Bring the nodes into candidate state.
5. Connect them to the site on which the site disaster recovery feature was run.
Other configurations
CAUTION:
If the system encounters a state where:
v No nodes are active, and
v One or more nodes have node errors that require a node rescue
STOP and contact IBM Remote Technical Support. Initiating this T3 recover
system procedure while in this specific state can result in loss of the XML
configuration backup files.
Attention:
v Run service actions only when directed by the fix procedures. If used
inappropriately, service actions can cause loss of access to data or even data loss.
Before you attempt to recover a system, investigate the cause of the failure and
attempt to resolve those issues by using other fix procedures. Read and
understand all of the instructions before you complete any action.
v The recovery procedure can take several hours if the system uses large-capacity
devices as quorum devices.
Do not attempt the recover system procedure unless the following conditions are
met:
v All of the conditions have been met in When to run the recover system
procedure on page 254.
v All hardware errors are fixed. See Fix hardware errors on page 254
v All nodes have candidate status. Otherwise, see step 1.
v All nodes must be at the same level of code that the system had before the
failure. If any nodes were modified or replaced, use the service assistant to
verify the levels of code, and where necessary, to reinstall the level of code so
that it matches the level that is running on the other nodes in the system. For
more information, see Removing system information for nodes with error code
550 or error code 578 using the service assistant on page 256.
The system recovery procedure is one of several tasks that must be completed. The
following list is an overview of the tasks and the order in which they must be
completed:
1. Preparing for system recovery
a. Review the information regarding when to run the recover system
procedure.
Note: Run the procedure on one system in a fabric at a time. Do not run the
procedure on different nodes in the same system. This restriction also applies to
remote systems.
3. Completing actions to get your environment operational.
v Recovering from offline volumes by using the CLI.
v Checking your system, for example, to ensure that all mapped volumes can
access the host.
Attention: If you experience failures at any time while running the recover
system procedure, call IBM Remote Technical Support. Do not attempt further
recovery actions, because these actions might prevent support from restoring the
system to an operational status.
Certain conditions must be met before you run the recovery procedure. Use the
following items to help you determine when to run the recovery procedure:
1. All enclosures and external storage systems are powered up and can
communicate with each other.
2. Check that all nodes in the system are shown in the service assistant tool or
using the service command: sainfo lsservicenodes. Investigate any missing
nodes.
3. Check that no node in the system is active and that the management IP is not
accessible. If any node has active status, it is not necessary to recover the
system.
4. Resolve all hardware errors in nodes so that only node errors 578 or 550 are
present. If this is not the case, go to Fix hardware errors.
5. Ensure all backend storage that is administered by the system is present before
you run the recover system procedure.
6. If any nodes have been replaced, ensure that the WWNN of the replacement
node matches that of the replaced node, and that no prior system data remains
on this node.
Note: If any of the buttons on the front panel have been pressed after these
two error codes are reported, the report for the node returns to the 578 node
error. The change in the report happens after approximately 60 seconds. Also,
if the node was rebooted or if hardware service actions were taken, the node
might show no cluster name on the Cluster: display.
If any nodes show Node Error: 550, record the data from the second line of
the display. If the last character on the second line of the display is >, use the
right button to scroll the display to the right.
- In addition to the Node Error: 550, the second line of the display can show
a list of node front panel IDs (seven digits) that are separated by spaces.
The list can also show the WWPN/LUN ID (16 hexadecimal digits
followed by a forward slash and a decimal number).
- If the error data contains any front panel IDs, ensure that the node referred
to by that front panel ID is showing Node Error: 578. If it is not reporting
node error 578, ensure that the two nodes can communicate with each
other. Verify the SAN connectivity and restart one of the two nodes by
pressing the front panel power button twice.
- If the error data contains a WWPN/LUN ID, verify the SAN connectivity
between this node and that WWPN. Check the storage system to ensure
that the LUN referred to is online. After verifying, restart the node by
pressing the front panel power button twice.
Note: If, after you resolve all of these scenarios, half or more of the nodes are
reporting Node Error: 578, it is appropriate to run the recovery procedure.
For any nodes that are reporting a node error 550, ensure that all the missing
hardware that is identified by these errors is powered on and connected
without faults.
If you have not been able to restart the system, and if any node other than
the current node is reporting node error 550 or 578, you must remove system
data from those nodes. This action acknowledges the data loss and puts the
nodes into the required candidate state.
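Several of the conditions above can be combined into a quick pre-check before you commit to the recovery procedure. This Python sketch covers only three of them: no node is active, only node errors 550 or 578 remain, and half or more of the nodes report 578. It does not replace the other checks on power, backend storage, and replacement-node WWNNs. The NodeReport structure is hypothetical. It is not the output format of the service assistant or of sainfo lsservicenodes; you would fill it in from what those tools show.

```python
from dataclasses import dataclass
from typing import List, Optional

RECOVERY_NODE_ERRORS = (550, 578)


@dataclass
class NodeReport:
    panel_name: str
    active: bool                # True if the node reports active status
    node_error: Optional[int]   # node error code, or None if no node error


def recovery_appropriate(nodes: List[NodeReport]) -> bool:
    """Rough pre-check mirroring some conditions for the recovery procedure."""
    if not nodes:
        return False
    # If any node is active, it is not necessary to recover the system.
    if any(node.active for node in nodes):
        return False
    # Only node errors 550 or 578 may remain; other errors must be fixed first.
    if any(
        node.node_error is not None and node.node_error not in RECOVERY_NODE_ERRORS
        for node in nodes
    ):
        return False
    # Half or more of the nodes must be reporting node error 578.
    reporting_578 = sum(1 for node in nodes if node.node_error == 578)
    return 2 * reporting_578 >= len(nodes)
```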
To remove clustered system information from a node with an error 550 or 578,
follow this front panel procedure:
Procedure
1. Press and release the up or down button until the Actions menu option is
displayed.
2. Press and release the select button.
3. Press and release the up or down button until the Remove Cluster? option is
displayed.
4. Press and release the select button.
5. The node displays Confirm Remove?.
6. Press and release the select button.
7. The node displays Cluster:.
Results
When all nodes show Cluster: on the top line and blank on the second line, the
nodes are in candidate status. The 550 or 578 error is removed. You can now run
the recovery procedure.
Before performing this task, ensure that you have read the introductory
information in the overall recover system procedure.
To remove system information from a node with an error 550 or 578, follow this
procedure using the service assistant:
Procedure
1. Point your browser to the service IP address of one of the nodes, for example,
https://node_service_ip_address/service/.
If you do not know the IP address or if it has not been configured, configure
the service address in one of the following ways:
v On SAN Volume Controller models 2145-CG8 and 2145-CF8 nodes, use the
front panel menu to configure a service address on the node.
v On SAN Volume Controller 2145-DH8 nodes, use the technician port to
connect to the service assistant and configure a service address on the node.
Results
When all nodes display a status of candidate and all error conditions are None,
you can run the recovery procedure.
Attention: This service action has serious implications if not completed properly.
If at any time an error is encountered not covered by this procedure, stop and call
IBM Support.
The volumes are online. Use the final checks to make the environment
operational; see What to check after running the system recovery on page 261.
v T3 incomplete
One or more of the volumes is offline because there was fast write data in the
cache. Further actions are required to bring the volumes online; see Recovering
from offline volumes using the CLI on page 260 for details (specifically, see the
task concerning recovery from offline VDisks using the command-line interface
(CLI)).
v T3 failed
Start the recovery procedure from any node in the system; the node must not have
participated in any other system. To receive optimal results in maintaining the I/O
group ordering, run the recovery from a node that was in I/O group 0.
Note: Each individual stage of the recovery procedure might take significant time
to complete, depending on the specific configuration.
Note: Changes made after the time of this configuration backup might not be
restored.
7. After verifying the time stamp is correct, press and hold the UP ARROW and
click Select.
The node displays Restoring. After a short delay, the second line displays a
sequence of progress messages indicating the actions taking place; then the
software on the node restarts.
The node displays Cluster on the top line and a management IP address on the
second line. After a few moments, the node displays T3 Completing.
Note: Any system errors logged at this time might temporarily overwrite the
display; ignore the message: Cluster Error: 3025. After a short delay, the
second line displays a sequence of progress messages indicating the actions
taking place.
When each node is added to the system, the display shows Cluster: on the top
line, and the cluster (system) name on the second line.
Attention: After the last node is added to the system, there is a short delay to
allow the system to stabilize. Do not attempt to use the system. The recovery is
still in progress. Once recovery is complete, the node displays T3 Succeeded on
the top line.
8. Click Select to return the node to normal display.
Note: Ensure that the web browser is not blocking pop-up windows. If it does,
progress windows cannot open.
Before you begin this procedure, read the recover system procedure introductory
information; see Recover system procedure on page 253.
Attention: This service action has serious implications if not completed properly.
If at any time an error is encountered not covered by this procedure, stop and call
the support center.
Run the recovery from any node in the system; the node must not have participated
in any other system.
Note: Each individual stage of the recovery procedure can take significant time to
complete, depending on the specific configuration.
Procedure
1. Point your browser to the service IP address of one of the nodes.
If you do not know the IP address or if it has not been configured, configure
the service address in one of the following ways:
v On SAN Volume Controller models 2145-CG8 and 2145-CF8 nodes, use the
front panel menu to configure a service address on the node.
v On SAN Volume Controller 2145-DH8 nodes, use the technician port to
connect to the service assistant and configure a service address on the node.
2. Log on to the service assistant.
3. Select Recover System from the navigation.
4. Follow the online instructions to complete the recovery procedure.
a. Verify the date and time of the last quorum time. The time stamp must be
less than 30 minutes before the failure. The time stamp format is
YYYYMMDD hh:mm, where YYYY is the year, MM is the month, DD is the
day, hh is the hour, and mm is the minute.
Attention: If the time stamp is not less than 30 minutes before the failure,
call the support center.
b. Verify the date and time of the last backup date. The time stamp must be
less than 24 hours before the failure. The time stamp format is YYYYMMDD
hh:mm, where YYYY is the year, MM is the month, DD is the day, hh is the
hour, and mm is the minute.
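Steps a and b are time-window comparisons on YYYYMMDD hh:mm stamps, which are easy to get wrong near midnight. This Python sketch makes three assumptions. First, you record the failure time in the same format. Second, all stamps use the same time zone. Third, hh is on a 24-hour clock. The helper names are hypothetical. If either check fails, call the support center as the procedure directs.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y%m%d %H:%M"  # YYYYMMDD hh:mm, assuming a 24-hour clock
QUORUM_LIMIT = timedelta(minutes=30)
BACKUP_LIMIT = timedelta(hours=24)


def _age_at_failure(stamp: str, failure: str) -> timedelta:
    """How long before the failure the time stamp was recorded."""
    return datetime.strptime(failure, STAMP_FORMAT) - datetime.strptime(
        stamp, STAMP_FORMAT
    )


def quorum_time_ok(stamp: str, failure: str) -> bool:
    """Step a: last quorum time less than 30 minutes before the failure."""
    return _age_at_failure(stamp, failure) < QUORUM_LIMIT


def backup_time_ok(stamp: str, failure: str) -> bool:
    """Step b: last backup less than 24 hours before the failure."""
    return _age_at_failure(stamp, failure) < BACKUP_LIMIT
```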
Results
The volumes are back online. Use the final checks to get your environment
operational again.
v T3 recovery completed with errors
T3 recovery completed with errors: One or more of the volumes are offline
because there was fast write data in the cache. To bring the volumes online, see
Recovering from offline volumes using the CLI for details.
v T3 failed
Verify that the environment is operational by completing the checks that are
provided in What to check after running the system recovery on page 261.
If any errors are logged in the error log after the system recovery procedure
completes, use the fix procedures to resolve these errors, especially the errors that
are related to offline arrays.
If you have run the recovery procedure but there are offline volumes, you can
complete the following steps to bring the volumes back online. Any volumes that
are offline and are not thin-provisioned (or compressed) volumes are offline
because of the loss of write-cache data during the event that led all node canisters
to lose their cluster state. Any data lost from the write-cache cannot be recovered.
These volumes might need additional recovery steps after the volume is brought
back online.
Note: If you encounter errors in the error log after running the recovery procedure
that are related to offline arrays, use the fix procedures to resolve the offline array
errors before fixing the offline volume errors.
Complete the following steps to recover an offline volume after the recovery
procedure has completed:
1. Delete all IBM FlashCopy function mappings and Metro Mirror or Global
Mirror relationships that use the offline volumes.
2. Run the recovervdisk, recovervdiskbyiogrp or recovervdiskbysystem
command. (This will only bring the volume back online so that you can
attempt to deal with the data loss.)
3. Refer to What to check after running the system recovery for what to do
with volumes that have been corrupted by the loss of data from the
write-cache.
4. Recreate all FlashCopy mappings and Metro Mirror or Global Mirror
relationships that use the volumes.
The recovery procedure recreates the old system from the quorum data. However,
some things cannot be restored, such as cached data or system data managing
in-flight I/O. This latter loss of state affects RAID arrays managing internal
storage. The detailed map about where data is out of synchronization has been
lost, meaning that all parity information must be restored, and mirrored pairs must
be brought back into synchronization. Normally this results in either old or stale
data being used, so only writes in flight are affected. However, if the array had lost
redundancy (such as syncing, or degraded or critical RAID status) prior to the
error requiring system recovery, the situation is more severe. In this situation,
you need to check the internal storage:
v Parity arrays will likely be syncing to restore parity; they do not have
redundancy while this operation proceeds.
v Because there is no redundancy in this process, bad blocks might have been
created where data is not accessible.
v Parity arrays could be marked as corrupt. This indicates that the extent of lost
data is wider than in-flight I/O, and in order to bring the array online, the data
loss must be acknowledged.
v RAID-6 arrays that were actually degraded prior to the system recovery might
require a full restore from backup. For this reason, it is important to have at
least a capacity-matched spare available.
Note: Any data that was in the SAN Volume Controller write cache at the time
of the failure is lost.
v Run the application consistency checks.
FlashCopy mappings are not restored for VVols. The implications are as follows.
v The mappings that describe the VM's snapshot relationships are lost. However,
the Virtual Volumes that are associated with these snapshots still exist, and the
snapshots might still appear on the vSphere Web Client. This outcome might
have implications for your VMware backup solution.
Configuration data for the system provides information about your system and the
objects that are defined in it. The backup and restore functions of the svcconfig
command can back up and restore only your configuration data for the SAN
Volume Controller system. You must regularly back up your application data by
using the appropriate backup methods.
You can maintain your configuration data for the system by completing the
following tasks:
v Backing up the configuration data
v Restoring the configuration data
v Deleting unwanted backup configuration data files
Before you back up your configuration data, the following prerequisites must be
met:
v No independent operations that change the configuration for the system can be
running while the backup command is running.
v No object name can begin with an underscore character (_).
Note:
v The default object names for controllers, I/O groups, and managed disks
(MDisks) do not restore correctly if the ID of the object is different from what is
recorded in the current configuration data file.
v All other objects with default names are renamed during the restore process. The
new names appear in the format name_r where name is the name of the object in
your system.
Before you restore your configuration data, the following prerequisites must be
met:
v You have the Security Administrator role associated with your user name and
password.
v You have a copy of your backup configuration files on a server that is accessible
to the system.
v You have a backup copy of your application data that is ready to load on your
system after the restore configuration operation is complete.
v You know the current license settings for your system.
Note: You can add new hardware, but you must not remove any hardware
because the removal can cause the restore process to fail.
v No zoning changes were made on the Fibre Channel fabric that would prevent
communication between the SAN Volume Controller and any storage controllers
that are present in the configuration.
v You have at least three USB flash drives if encryption was enabled on the system
when its configuration was backed up. The USB flash drives are used to generate
new keys as part of the restore process, or to restore encryption manually if the
system has fewer than three USB ports.
The SAN Volume Controller analyzes the backup configuration data file and the
system to verify that the required disk controller system nodes are available.
Before you begin, hardware recovery must be complete. The following hardware
must be operational: hosts, SAN Volume Controller nodes, internal flash drives and
expansion enclosures (if applicable), the Ethernet network, the SAN fabric, and any
external storage systems (if applicable).
Before you back up your configuration data, the following prerequisites must be
met:
v No independent operations that change the configuration can be running while
the backup command is running.
v No object name can begin with an underscore character (_).
Note: The system automatically creates a backup of the configuration data each
day at 1 AM. This backup is known as a cron backup and is written to
/dumps/svc.config.cron.xml_serial# on the configuration node.
Use these instructions to generate a manual backup at any time. If a severe
failure occurs, both the configuration of the system and application data might be
lost. The backup of the configuration data can be used to restore the system
configuration to the exact state it was in before the failure. In some cases, it might
be possible to automatically recover the application data. This recovery can be
attempted with the Recover System Procedure, also known as a Tier 3 (T3)
procedure. To restore the system configuration without attempting to recover the
application data, use the Restoring the System Configuration procedure, also
known as a Tier 4 (T4) recovery. Both of these procedures require a recent backup
of the configuration data.
Procedure
1. Use your preferred backup method to back up all of the application data that
you stored on your volumes.
2. Issue the following CLI command to back up your configuration:
svcconfig backup
The svcconfig backup CLI command creates three files that provide
information about the backup process and the configuration. These files are
created in the /dumps directory of the configuration node.
Table 79 describes the three files that are created by the backup process:
Table 79. Files created by the backup process
File name                          Description
svc.config.backup.xml_<serial#>    Contains your configuration data.
svc.config.backup.sh_<serial#>     Contains the names of the commands that were issued to create the backup of the system.
svc.config.backup.log_<serial#>    Contains details about the backup, including any reported errors or warnings.
If the process fails, resolve the errors, and run the command again.
4. Keep backup copies of the files outside the system to protect them against a
system hardware failure. Copy the backup files off the system to a secure
location; use either the management GUI or SmartCloud Provisioning
command line. For example:
pscp -unsafe superuser@cluster_ip:/dumps/svc.config.backup.* /offclusterstorage/
Tip: To maintain controlled access to your configuration data, copy the backup
files to a location that is password-protected.
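The copy in step 4, and a check that all three files listed in Table 79 arrived, can be scripted from a Linux or UNIX workstation. The following bash sketch is illustrative only. The function names copy_backups and check_backup_set are hypothetical. It assumes an OpenSSH scp client with key-based access for superuser, and that the system accepts a remote wildcard in the source path, as the pscp -unsafe example implies; on Windows, use the pscp command shown above instead. The serial number is whatever suffix appears on the copied file names.

```shell
#!/usr/bin/env bash
# Sketch only: copy the svcconfig backup files off the system and confirm
# that the three files listed in Table 79 arrived. Function names are
# hypothetical; set SCP to another client command if plain scp does not suit.

# copy_backups CLUSTER_IP DEST_ROOT
# Copies /dumps/svc.config.backup.* into a new, timestamped, owner-only
# directory under DEST_ROOT and prints the path of that directory.
copy_backups() {
  local cluster_ip=$1 dest_root=$2 dest
  dest="$dest_root/svc-config-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" && chmod 700 "$dest" || return 1
  # Quote the remote pattern so that the local shell does not expand it.
  "${SCP:-scp}" "superuser@${cluster_ip}:/dumps/svc.config.backup.*" "$dest/" || return 1
  printf '%s\n' "$dest"
}

# check_backup_set DIR SERIAL
# Returns 0 only if the .xml, .sh, and .log files for SERIAL are all in DIR
# and the .xml file (the configuration data itself) is not empty.
check_backup_set() {
  local dir=$1 serial=$2 status=0 kind f
  for kind in xml sh log; do
    f="$dir/svc.config.backup.${kind}_${serial}"
    if [[ ! -f $f ]]; then
      echo "missing: $f" >&2
      status=1
    elif [[ $kind == xml && ! -s $f ]]; then
      echo "empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, with cluster_ip and serial replaced by your own values: dir=$(copy_backups cluster_ip /offclusterstorage) && check_backup_set "$dir" serial. The timestamped directory keeps earlier backups instead of overwriting them, and the owner-only permissions follow the tip about controlled access.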
If encryption was enabled on the system when its configuration was backed up,
then at least 3 USB flash drives need to be present in the node USB ports for the
configuration restore to work. The USB flash drives do not need to contain any
keys. They are for generation of new keys as part of the restore process.
You must regularly back up your configuration data and your application data to
avoid data loss. If a system is lost after a severe failure occurs, both the system
configuration and the application data are lost. You must restore the system to the
exact state it was in before the failure, and then recover the application data.
During the restore process, the nodes and the storage enclosure will be restored to
the system, and then the MDisks and the array will be re-created and configured.
If there are multiple storage enclosures involved, the arrays and MDisks will be
restored on the proper enclosures based on the enclosure IDs.
Important:
v There are two phases during the restore process: prepare and execute. You must
not change the fabric or system between these two phases.
If you do not understand the instructions to run the CLI commands, see the
command-line interface reference information.
Procedure
1. Verify that all nodes are available as candidate nodes before you run this
recovery procedure. You must remove errors 550 or 578 to put the node in
candidate state.
2. Create a system. If possible, use the node that was originally in I/O group 0.
v For SAN Volume Controller 2145-DH8 systems, use the technician port.
v For all other earlier models, use the front panel.
3. In a supported browser, enter the IP address that you used to initialize the
system and the default superuser password (passw0rd).
4. Issue the following CLI command to ensure that only the configuration node
is online:
lsnode
The following output is an example of what is displayed:
id name  status IO_group_id IO_group_name config_node
1  node1 online 0           io_grp0       yes
5. Using the command-line interface, issue the following command to log on to
the system:
plink -i ssh_private_key_file superuser@cluster_ip
where ssh_private_key_file is the name of the SSH private key file for the
superuser and cluster_ip is the IP address or DNS name of the system for
which you want to restore the configuration.
Note: Because the RSA host key changed, a warning message might display
when you connect to the system using SSH.
6. Identify the configuration backup file from which you want to restore.
The file can be either a local copy of the configuration backup XML file that
you saved when you backed up the configuration or an up-to-date file on one
of the nodes.
Configuration data is automatically backed up daily at 01:00 system time on
the configuration node.
Download and check the configuration backup files on all nodes that were
previously in the system to identify the one containing the most recent
complete backup:
a. From the management GUI, click Settings > Support.
b. Click Show full log listing.
c. For each node (canister) in the system, complete the following steps:
1) Select the node to operate on from the selection box at the top of the
table.
This CLI command creates a log file in the /tmp directory of the configuration
node. The name of the log file is svc.config.restore.prepare.log.
Note: Any nodes that you did not add manually to the system will be added
automatically as part of the restore process.
This CLI command creates a log file in the /tmp directory of the configuration
node. The name of the log file is svc.config.restore.execute.log.
15. Issue the following command to copy the log file to another server that is
accessible to the system:
pscp superuser@cluster_ip:/tmp/svc.config.restore.execute.log full_path_for_where_to_copy_log_files
16. Open the log file from the server where the copy is now stored.
17. Check the log file to ensure that no errors or warnings occurred.
Note: You might receive a warning that states that a licensed feature is not
enabled. This message means that after the recovery process, the current
license settings do not match the previous license settings. The recovery
process continues normally and you can enter the correct license settings in
the management GUI later.
When you log in to the CLI again over SSH, you see this output:
IBM_2145:your_cluster_name:superuser>
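Three of the checks in this procedure (step 4, step 6, and step 17) lend themselves to small helpers when you drive the CLI from a Linux or UNIX workstation. The following bash sketch is illustrative only, and all function names are hypothetical. It assumes that lsnode accepts -delim : and prints a header row that includes status and config_node columns; that CLI messages carry CMMVC message IDs whose last letter gives the severity (E for error, W for warning); and that the backup files from each node were downloaded with their modification times preserved, which a browser download from the management GUI might not do.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for three checks in the restore procedure.
# All function names are hypothetical.

# only_config_node_online  (step 4)
# Reads "lsnode -delim :" output, header line included, on stdin.
# Succeeds only if exactly one node is online and it is the configuration node.
only_config_node_online() {
  awk -F: '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("status" in col) || !("config_node" in col)) { bad = 1; exit }
      next
    }
    $(col["status"]) == "online" {
      online++
      if ($(col["config_node"]) == "yes") cfg++
    }
    END {
      if (bad) exit 2
      if (online == 1 && cfg == 1) exit 0
      exit 1
    }'
}

# newest_backup FILE...  (step 6)
# Prints the most recently modified non-empty file. Modification times are
# only meaningful if the download preserved them (for example, scp -p).
newest_backup() {
  local best="" f
  for f in "$@"; do
    [[ -s $f ]] || continue
    if [[ -z $best || $f -nt $best ]]; then best=$f; fi
  done
  [[ -n $best ]] && printf '%s\n' "$best"
}

# check_restore_log FILE  (step 17)
# Prints lines with a message ID ending in E (error) or W (warning) and
# returns 1 if there are any; returns 0 if none are found.
check_restore_log() {
  [[ -r $1 ]] || { echo "cannot read $1" >&2; return 2; }
  ! grep -E 'CMMVC[0-9]{4}[EW]' -- "$1"
}
```

For example, assuming key-based SSH access: ssh superuser@cluster_ip lsnode -delim : | only_config_node_online && echo "only the configuration node is online". Expect check_restore_log to report the licensed-feature warning described in step 17, so review each reported line rather than treating any output as a failure.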
What to do next
You can remove any unwanted configuration backup and restore files from the
/tmp directory on your configuration node by issuing the following CLI command:
svcconfig clear -all
Procedure
1. Issue the following command to log on to the system:
plink -i ssh_private_key_file superuser@cluster_ip
where ssh_private_key_file is the name of the SSH private key file for the
superuser and cluster_ip is the IP address or DNS name of the clustered system
from which you want to delete the configuration.
2. Issue the following CLI command to erase all of the files that are stored in the
/tmp directory:
svcconfig clear -all
Similarly, if you have replaced the service controller, use the node rescue procedure
to ensure that the service controller has the correct software.
Attention: If you recently replaced both the service controller and the disk drive
as part of the same repair operation, node rescue fails.
Node rescue works by booting the operating system from the service controller
and running a program that copies all the SAN Volume Controller software from
any other node that can be found on the Fibre Channel fabric.
Attention: When running node rescue operations, run only one node rescue
operation on the same SAN at any one time. Wait for one node rescue operation
to complete before starting another.
Procedure
1. Ensure that the Fibre Channel cables are connected.
2. Ensure that at least one other node is connected to the Fibre Channel fabric.
3. Ensure that the SAN zoning allows a connection between at least one port of
this node and one port of another node. It is better if multiple ports can
connect. This is particularly important if the zoning is by worldwide port name
(WWPN) and you are using a new service controller. In this case, you might
need to use SAN monitoring tools to determine the WWPNs of the node. If you
need to change the zoning, remember to set it back when the service procedure
is complete.
4. Turn off the node.
5. Press and hold the left and right buttons on the front panel.
6. Press the power button.
7. Continue to hold the left and right buttons until the node-rescue-request
symbol is displayed on the front panel (Figure 65).
Results
The node rescue request symbol displays on the front panel display until the node
starts to boot from the service controller. If the node rescue request symbol
displays for more than two minutes, go to the hardware boot MAP to resolve the
problem. When the node rescue starts, the service display shows the progress or
failure of the node rescue operation.
Note: If the recovered node was part of a clustered system, the node is now
offline. Delete the offline node from the system and then add the node back into
the system. If node recovery was used to recover a node that failed during a
The volume virtualization that is provided extends the time when a medium error
is returned to a host. Because of this difference from non-virtualized systems, the
SAN Volume Controller uses the term bad blocks rather than medium errors.
The SAN Volume Controller allocates volumes from the extents that are on the
managed disks (MDisks). The MDisk can be a volume on an external storage
controller or a RAID array that is created from internal drives. In either case,
depending on the RAID level used, there is normally protection against a read
error on a single drive. However, it is still possible to get a medium error on a
read request if multiple drives have errors or if the drives are rebuilding or are
offline due to other issues.
The SAN Volume Controller provides migration facilities to move a volume from
one underlying set of physical storage to another, or to replicate a volume by using
FlashCopy, Metro Mirror, or Global Mirror. In all these cases, if a logical block
address on the original volume cannot be read, the migrated volume or the
replicated volume returns a medium error to the host when that logical block
address is read. The system maintains tables of bad blocks to record the locations
of the logical block addresses that cannot be read. These tables are associated
with the MDisks that are providing storage for the volumes.
Important: The dumpmdiskbadblocks command outputs only the virtual medium errors
that have been created, not a list of the actual medium errors on MDisks or drives.
It is possible that the tables that are used to record bad block locations can fill up.
The table can fill either on an MDisk or on the system as a whole. If a table does
fill up, the migration or replication that was creating the bad block fails because it
was not possible to create an exact image of the source volume.
The system creates alerts in the event log for the following situations:
v When it detects medium errors and creates a bad block
v When the bad block tables fill up
The recommended actions for these alerts guide you in correcting the situation.
Clear bad blocks by deallocating the volume disk extent, by deleting the volume or
by issuing write I/O to the block. It is good practice to correct bad blocks as soon
as they are detected. This action prevents the bad block from being propagated
when the volume is replicated or migrated. It is possible, however, for the bad
block to be on part of the volume that is not used by the application. For example,
it can be in part of a database that has not been initialized. These bad blocks are
corrected when the application writes data to these areas. Before the correction
happens, the bad block records continue to use up the available bad block space.
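To make the bookkeeping described above concrete, here is a toy model in bash (version 4 or later, for associative arrays). It is hypothetical and purely illustrative: it does not reflect how the system actually stores its tables, the TABLE_LIMIT value is invented, the system-wide table is not modeled, and the function names are made up. It shows three behaviors from the text: a migration copies bad-block records to the new location, a migration fails when the destination table is full, and a write to a block clears its record.

```shell
#!/usr/bin/env bash
# Toy model of bad-block bookkeeping (bash 4+). Hypothetical: the real
# system's table layout, limits, and system-wide table are not modeled.

declare -A BAD=()        # "mdisk:lba" -> 1 for each recorded bad block
declare -A COUNT=()      # mdisk -> number of bad-block records it holds
TABLE_LIMIT=${TABLE_LIMIT:-4}   # invented per-MDisk table capacity

# record_bad_block MDISK LBA: returns 1 if the MDisk's table is full.
record_bad_block() {
  local key="$1:$2"
  [[ -n ${BAD[$key]} ]] && return 0            # already recorded
  (( ${COUNT[$1]:-0} >= TABLE_LIMIT )) && return 1
  BAD[$key]=1
  COUNT[$1]=$(( ${COUNT[$1]:-0} + 1 ))
}

# write_block MDISK LBA: a host write to the block clears its record.
write_block() {
  local key="$1:$2"
  if [[ -n ${BAD[$key]} ]]; then
    unset 'BAD[$key]'
    COUNT[$1]=$(( ${COUNT[$1]} - 1 ))
  fi
}

# migrate SRC DST: copy every bad-block record of SRC to DST so that the
# copy returns the same medium errors. Returns 1 (migration fails) if the
# destination table fills; records copied before that point are left as-is.
migrate() {
  local key
  for key in "${!BAD[@]}"; do
    [[ $key == "$1:"* ]] || continue
    record_bad_block "$2" "${key#*:}" || return 1
  done
}
```

The model also shows why correcting bad blocks early helps: every record that is still present when a volume is migrated or replicated takes up space in a second table.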
SAN Volume Controller nodes must be configured in pairs so you can perform
concurrent maintenance.
When you service one node, the other node keeps the storage area network (SAN)
operational. With concurrent maintenance, you can remove, replace, and test all
field replaceable units (FRUs) on one node while the SAN and host systems are
powered on and doing productive work.
Note: Do not remove the power from both nodes unless you are instructed to do so
or you have a particular reason. When you need to remove power, see MAP
5350: Powering off a node on page 302.
Procedure
v To isolate the FRUs in the failing node, complete the actions and answer the
questions given in these maintenance analysis procedures (MAPs).
v When instructed to exchange two or more FRUs in sequence:
1. Exchange the first FRU in the list for a new one.
2. Verify that the problem is solved.
3. If the problem remains:
a. Reinstall the original FRU.
b. Exchange the next FRU in the list for a new one.
4. Repeat steps 2 and 3 until either the problem is solved, or all the related
FRUs have been exchanged.
5. Complete the next action indicated by the MAP.
6. If you are using one or more MAPs because of a system error code, mark the
error as fixed in the event log after the repair, but before you verify the
repair.
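The exchange sequence above is a simple loop. The following bash sketch expresses it as an operator checklist and is illustrative only: the hook functions replace_fru, reinstall_fru, and problem_solved are hypothetical prompts that you answer yourself or replace with your own checks, and nothing in the sketch communicates with the hardware.

```shell
#!/usr/bin/env bash
# Sketch only: the FRU exchange sequence as a loop. The three hooks are
# hypothetical operator prompts; replace them with your own checks.

replace_fru()   { read -rp "Exchange $1 for a new part, then press Enter. " _; }
reinstall_fru() { read -rp "Reinstall the original $1, then press Enter. " _; }
problem_solved() {
  local answer
  read -rp "Is the problem solved? [y/N] " answer
  [[ $answer == [yY]* ]]
}

# exchange_in_sequence FRU...
# Exchanges each FRU in turn and stops at the first one that fixes the
# problem; otherwise reinstalls the original part and moves to the next.
exchange_in_sequence() {
  local fru
  for fru in "$@"; do
    replace_fru "$fru"                  # step 1: exchange this FRU
    if problem_solved; then             # step 2: verify
      echo "Problem solved after exchanging: $fru"
      return 0
    fi
    reinstall_fru "$fru"                # step 3a: put the original back
  done
  echo "All listed FRUs exchanged; problem remains." >&2
  return 1
}
```

In either case, continue with the next action indicated by the MAP, as step 5 requires.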
Note: Start all problem determination procedures and repair procedures with
MAP 5000: Start.
Note: The service assistant interface should be used if there is no front panel
display, for example on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures.
SAN Volume Controller nodes are configured in pairs. While you service one node,
you can access all the storage managed by the pair from the other node. With
concurrent maintenance, you can remove, replace, and test all FRUs on one SAN
Volume Controller while the SAN and host systems are powered on and doing
productive work.
Notes:
v Do not remove the power from both nodes unless you are instructed to do so or
you have a particular reason.
v If an action in these procedures involves removing or replacing a part, use the
applicable procedure.
v If the problem persists after you complete the actions in this procedure, return to
step 1 of the MAP to try again to fix the problem.
Procedure
1. Were you sent here from a fix procedure?
NO Go to step 2
YES Go to step 6 on page 277
2. (from step 1)
Access the management GUI. See Accessing the management GUI on page
60
3. (from step 2)
Does the management GUI start?
NO Go to step 6 on page 277.
YES Go to step 4.
4. (from step 3)
Is the Welcome window displayed?
NO Go to step 6 on page 277.
YES Go to step 5.
5. (from step 4)
Log in to the management GUI. Use the user ID and password that is
provided by the user.
Go to the Events page.
Start the fix procedure for the recommended action.
Did the fix procedures find an error that is to be fixed?
NO Go to step 6 on page 277.
(Figure: operator-information panel of the SAN Volume Controller 2145-CF8 and 2145-CG8; image not reproduced.)
Figure 67 on page 278 shows the operator-information panel for the SAN
Volume Controller 2145-DH8.
(Figure 67: operator-information panel of the SAN Volume Controller 2145-DH8; image not reproduced.)
Note: If the node has more than 4 Ethernet ports, activity for ports 5 and above is not
indicated by the Ethernet activity LEDs on the operator-information panel.
NO Go to step 9.
YES Go to MAP 5800: Light path on page 322.
9. (from step 8 on page 277)
Is the hardware boot display that you see in Figure 68 displayed on the
node? (SAN Volume Controller 2145-DH8 does not have a front panel
display. For this 2145-DH8 model, are the node status LED, node fault LED,
and battery status LED that you see in Figure 69 on page 279 all off?)
NO Go to step 11.
YES Go to step 10.
10. (from step 9)
Has the hardware boot display that you see in Figure 68 displayed for more
than 3 minutes? For 2145-DH8, have the node status LED, node fault LED,
and battery status LED that you see in Figure 69 on page 279 all been off
for more than 3 minutes?
NO Go to step 11.
YES For 2145-DH8, go to step 23 on page 282. Otherwise:
a. Go to MAP 5900: Hardware boot on page 341.
b. Go to MAP 5700: Repair verification on page 321.
11. (from step 9)
(SAN Volume Controller illustration svc00800; image not reproduced.)
NO Go to step 12.
YES Complete these steps:
a. If the node has a front panel display, note the failure code and go
to Boot code reference on page 159 to follow the repair actions.
b. Access the service assistant interface via the Technician port for
node access on page 75 and follow the service recommendation
presented.
c. Go to MAP 5700: Repair verification on page 321.
12. (from step 11 on page 278)
Is Booting displayed on the top line of the front-panel display of the node?
NO Go to step 14.
YES Go to step 13.
13. (from step 12)
A progress bar and a boot code are displayed. If the progress bar does not
advance for more than 3 minutes, the progress is stalled.
Has the progress bar stalled?
NO Go to step 14.
YES
a. Note the failure code and go to Boot code reference on page 159
to complete the repair actions.
b. Go to MAP 5700: Repair verification on page 321.
14. (from step 12 and step 13)
Note: Do not validate the WWNN until you read the following
information to ensure that you choose the correct value. If you choose
an incorrect value, you might find that the SAN zoning for the node is
also not correct and more than one node is using the same WWNN.
Therefore, it is important to establish the correct WWNN before you
continue.
a. Determine which WWNN you want to use.
v If the service controller was replaced, the correct value is
probably the WWNN that is stored on disk (the disk WWNN).
v If the disk was replaced, perhaps as part of a frame replacement
procedure, but was not reinitialized, the correct value is
probably the WWNN that is stored on the service controller (the
panel WWNN).
b. Select the stored WWNN that you want this node to use:
v To use the WWNN that is stored on the disk:
1) From the Validate WWNN? panel, press and release the
select button. The Disk WWNN: panel is displayed and
shows the last five digits of the WWNN that is stored on the
disk.
2) From the Disk WWNN: panel, press and release the down
button. The Use Disk WWNN? panel is displayed.
3) Press and release the select button.
v To use the WWNN that is stored on the service controller:
Results
If you suspect that the problem is a software problem, see Updating the system
documentation for details about how to update your entire SAN Volume Controller
environment.
If the problem is still not fixed, collect diagnostic information and contact IBM
Remote Technical Support.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
Procedure
1. Are you here because the node is not powered on?
NO Go to step 10 on page 287.
YES Go to step 2.
2. (from step 1)
Is the power LED on the operator-information panel continuously
illuminated? Figure 70 on page 284 shows the location of the power LED 1
on the operator-information panel.
(Figure 70, legend: 1 Power button and power LED (green).)
NO Go to step 3.
YES The node is powered on correctly. Reassess the symptoms and return
to MAP 5000: Start or go to MAP 5700: Repair verification to verify
the correct operation.
3. (from step 2 on page 283)
Is the power LED on the operator-information panel flashing approximately
four times per second?
NO Go to step 4.
YES The node is turned off and is not ready to be turned on. Wait until the
power LED flashes at a rate of once per second, then go to step 5.
If this behavior persists for more than 3 minutes, complete the
following procedure:
a. Remove all input power from the SAN Volume Controller node by
removing the power supply from the back of the node. For the
procedure, see Removing a SAN Volume Controller 2145-DH8 power
supply.
b. Wait 1 minute and then verify that all power LEDs on the node are
extinguished.
c. Reinsert the power supply.
d. Wait for the flashing rate of the power LED to slow down to one
flash per second. Go to step 5.
e. If the power LED keeps flashing at a rate of four flashes per
second for a second time, replace the parts in the following
sequence:
v System board
Verify the repair by continuing with MAP 5700: Repair verification.
4. (from step 3)
Is the Power LED on the operator-information panel flashing once per
second?
YES The node is in standby mode. Input power is present. Go to step 5.
NO Go to step 6 on page 285.
5. (from step 3 and step 4)
Press Power on the operator-information panel of the node.
Figure 71. Power LED indicator on the rear panel of the SAN Volume Controller 2145-DH8
NO Go to step 7.
YES The operator-information panel is failing.
Verify that the operator-information panel cable is seated on the
system board.
If the node still fails to power on, replace parts in the following
sequence:
a. Operator-information panel assembly
b. System board
7. (from step 6)
Are the ac LED indicators on the rear of the power supply assemblies
illuminated? Figure 72 on page 286 shows the location of the ac LED 1, the
dc LED 2, and the power-supply error LED 3 on the rear of the power
supply assembly that is on the rear panel of the SAN Volume Controller
2145-DH8.
Figure 72. AC, dc, and power-supply error LED indicators on the rear panel of the SAN Volume Controller 2145-DH8
2 DC LED (green)
3 Power-supply error LED (yellow)
NO Verify that the input power cable or cables are securely connected at
both ends and show no sign of damage; replace damaged cables. If
the node still fails to power on, replace the parts that are specified
for the SAN Volume Controller model type.
Replace the SAN Volume Controller 2145-DH8 parts in the following
sequence:
a. Power supply 750 W
YES Go to step 8.
8. (from step 7 on page 285)
Is the power-supply error LED on the rear of the SAN Volume Controller
2145-DH8 power supply illuminated? Figure 72 shows the location of the
power-supply error LED 3.
YES Replace the power supply unit.
NO Go to step 9
9. (from step 8)
Are the dc LED indicators on the rear of the power supply assemblies
illuminated?
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
You might have been sent here for one of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller node
v The power switch failed to turn the node on
v The power switch failed to turn the node off
v Another MAP sent you here
Procedure
1. Are you here because the node is not powered on?
NO Go to step 11 on page 292.
YES Go to step 2.
2. (from step 1)
Is the power LED on the operator-information panel continuously
illuminated? Figure 73 on page 289 shows the location of the power LED 1
on the operator-information panel.
Figure 73. Power LED on the SAN Volume Controller 2145-CG8 or 2145-CF8 operator-information panel
NO Go to step 3.
YES The node is powered on correctly. Reassess the symptoms and return
to MAP 5000: Start on page 275 or go to MAP 5700: Repair
verification on page 321 to verify the correct operation.
3. (from step 2 on page 288)
Is the power LED on the operator-information panel flashing approximately
four times per second?
NO Go to step 4.
YES The node is turned off and is not ready to be turned on. Wait until the
power LED flashes at a rate of approximately once per second, then
go to step 5.
If this behavior persists for more than three minutes, perform the
following procedure:
a. Remove all input power from the SAN Volume Controller node by
removing the power retention brackets and the power cords from
the back of the node. See Removing the cable-retention brackets
for how to remove the cable-retention brackets when you remove
the power cords from the node.
b. Wait one minute and then verify that all power LEDs on the node
are extinguished.
c. Reinsert the power cords and power retention brackets.
d. Wait for the flashing rate of the power LED to slow down to one
flash per second. Go to step 5.
e. If the power LED keeps flashing at a rate of four flashes per
second for a second time, replace the parts in the following
sequence:
v System board
Verify the repair by continuing with MAP 5700: Repair verification
on page 321.
4. (from step 3)
Is the Power LED on the operator-information panel flashing approximately
once per second?
YES The node is in standby mode. Input power is present. Go to step 5.
NO Go to step 6 on page 290.
5. (from step 3 and step 4)
Press the power-on button on the operator-information panel of the node.
Figure 74. Power LED indicator on the rear panel of the SAN Volume Controller 2145-CG8 or 2145-CF8
NO Go to step 7.
YES The operator-information panel is failing.
Verify that the operator-information panel cable is seated on the
system board.
If you are working on a SAN Volume Controller 2145-CG8 or
2145-CF8, and the node still fails to power on, replace parts in the
following sequence:
a. Operator-information panel assembly
b. System board
7. (from step 6)
Locate the 2145 UPS-1U that is connected to this node.
Does the 2145 UPS-1U that is powering this node have its power on and is
its load segment 2 indicator a solid green?
Figure 75. Power LED indicator and ac and dc indicators on the rear panel of the SAN Volume Controller 2145-CG8 or 2145-CF8
NO Verify that the input power cable or cables are securely connected at
both ends and show no sign of damage; replace any cable that is
faulty or damaged. If the node still fails to power on, replace the
parts that are specified for the SAN Volume Controller model type.
Replace the SAN Volume Controller 2145-CG8 or 2145-CF8 parts in
the following sequence:
a. Power supply 675 W
YES Go to step 9 for SAN Volume Controller 2145-CG8 or 2145-CF8
models.
Go to step 10 for all other models.
9. (from step 8)
Is the power supply error LED on the rear of the SAN Volume Controller
2145-CG8 or 2145-CF8 power supply assemblies illuminated? Figure 74 on
page 290 shows the location of the power LED 1 on the 2145-CF8 or the
2145-CG8.
YES Replace the power supply unit.
NO Go to step 10
10. (from step 8 or step 9)
Are the dc LED indicators on the rear of the power supply assemblies
illuminated?
NO Replace the SAN Volume Controller 2145-CG8 or 2145-CF8 parts in
the following sequence:
a. Power supply 675 W
b. System board
YES Verify that the operator-information panel cable is correctly seated at
both ends. If the node still fails to power on, replace parts in the
following sequence:
a. Operator-information panel
b. Cable, signal, front panel
c. System board (if the node is a SAN Volume Controller 2145-CG8 or
SAN Volume Controller 2145-CF8)
Attention: Be sure that you are turning off the correct 2145 UPS-1U.
If necessary, trace the cables back to the 2145 UPS-1U assembly.
Turning off the wrong 2145 UPS-1U might cause customer data loss.
Go to step 13.
YES Go to step 13.
13. (from step 12)
If necessary, turn on the 2145 UPS-1U that is connected to this node and then
press the power button to turn the node on.
Did the node turn on and boot correctly?
NO Go to MAP 5000: Start on page 275 to resolve the problem.
YES Go to step 14.
14. (from step 13)
The node has probably suffered a software failure. Dump data might have
been captured that will help resolve the problem. Call your support center for
assistance.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
Tip: If the 2145 UPS-1U does not seem to work, ensure that the power cable is
connected properly or reseat the power cable.
Figure 76 shows the front panel of the 2145 UPS-1U.
(Figure 76 callouts, numbered as in Table 81: 1 Load segment 2, 2 Load segment 1, 3 Alarm, 4 Battery, 5 Overload, 6 Power-on; image not reproduced.)
Table 81 identifies which status and error LEDs on the 2145 UPS-1U front-panel
assembly relate to the specified error conditions. It also lists the
uninterruptible power supply alert-buzzer behavior.
Table 81. 2145 UPS-1U error indicators
[1] Load2          | [2] Load1          | [3] Alarm | [4] Battery | [5] Overload | [6] Power-on | Buzzer       | Error condition
Green (see Note 1) |                    |           |             |              | Green        | (see Note 3) | No errors; the 2145 UPS-1U was configured by the SAN Volume Controller
Green              | Amber (see Note 2) |           |             |              | Green        |              | No errors; the 2145 UPS-1U is not yet configured by the SAN Volume Controller
Procedure
1. Is the power-on indicator for the 2145 UPS-1U that is connected to the failing
SAN Volume Controller off?
NO Go to step 3.
YES Go to step 2.
2. (from step 1)
Are other 2145 UPS-1U units showing the power-on indicator as off?
NO The 2145 UPS-1U might be in standby mode. This can occur because the
on or off button on this 2145 UPS-1U was pressed, because input power has
been missing for more than five minutes, or because the SAN Volume
Controller shut it down following a reported loss of input power. Press
and hold the on or off button until the 2145 UPS-1U power-on indicator
is illuminated (approximately five seconds). On some versions of the
2145 UPS-1U, you need a pointed device, such as a screwdriver, to
press the on or off button.
Go to step 3.
YES Either main power is missing from the installation or a redundant
AC-power switch has failed. If the 2145 UPS-1U units are connected to
a redundant AC-power switch, go to MAP 5320: Redundant AC
power on page 299. Otherwise, complete these steps:
a. Restore main power to the installation.
b. Verify the repair by continuing with MAP 5250: 2145 UPS-1U
repair verification on page 298.
3. (from step 1 and step 2)
Are the power-on and load segment 2 indicators for the 2145 UPS-1U
illuminated solid green, with service, on-battery, and overload indicators off?
NO Go to step 4.
YES The 2145 UPS-1U is no longer showing a fault. Verify the repair by
continuing with MAP 5250: 2145 UPS-1U repair verification on page
298.
4. (from step 3)
Is the 2145 UPS-1U on-battery indicator illuminated yellow (solid or
flashing), with service and overload indicators off?
NO Go to step 5 on page 296.
YES The input power supply to this 2145 UPS-1U is not working or is not
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
You may have been sent here because you have performed a repair and want to
confirm that no other problems exist on the machine.
Procedure
1. Are the power-on and load segment 2 indicators for the repaired 2145
UPS-1U illuminated solid green, with service, on-battery, and overload
indicators off?
NO Continue with MAP 5000: Start on page 275.
YES Go to step 2.
2. (from step 1)
Is the SAN Volume Controller node powered by this 2145 UPS-1U powered
on?
NO Press the power button on the SAN Volume Controller node that is
connected to this 2145 UPS-1U and is powered off. Go to step 3.
YES Go to step 3.
3. (from step 2)
Is the node that is connected to this 2145 UPS-1U still not powered on, or is it
showing error codes on the front panel display?
NO Go to step 4.
YES Continue with MAP 5000: Start on page 275.
4. (from step 3)
Does the SAN Volume Controller node that is connected to this 2145 UPS-1U
show Charging on the front panel display?
NO Go to step 5.
YES Wait for the Charging display to finish (this might take up to two
hours). Go to step 5.
5. (from step 4)
Press and hold the test/alarm reset button on the repaired 2145 UPS-1U for
three seconds to initiate a self-test. During the test, individual indicators
illuminate as various parts of the 2145 UPS-1U are checked.
Does the 2145 UPS-1U service, on-battery, or overload indicator stay on?
NO 2145 UPS-1U repair verification has completed successfully. Continue
with MAP 5700: Repair verification on page 321.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
You might have been sent here for one of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller.
v MAP 5150: 2145 UPS-1U on page 292 sent you here.
Perform the following steps to solve problems that have occurred in a
redundant AC-power switch:
Procedure
1. One or two 2145 UPS-1Us might be connected to the redundant AC-power
switch. Is the power-on indicator on any connected 2145 UPS-1U on?
NO Go to step 3.
YES The redundant AC-power switch is powered. Go to step 2.
2. (from step 1)
Measure the voltage at the redundant AC-power switch output socket
connected to the 2145 UPS-1U that is not showing power-on.
CAUTION:
Ensure that you do not remove the power cable of any powered
uninterruptible power supply units.
Is there power at the output socket?
NO One redundant AC-power switch output is working while the other is
not. Replace the redundant AC-power switch.
CAUTION:
You might need to power-off an operational node to replace the
redundant AC-power switch assembly. If this is the case, consult with
the customer to determine a suitable time to perform the
replacement. See MAP 5350: Powering off a node on page 302.
After you replace the redundant AC-power switch, continue with
MAP 5340: Redundant ac power verification on page 300.
YES The redundant AC-power switch is working. There is a problem with
the 2145 UPS-1U power cord or the 2145 UPS-1U. Return to the
procedure that called this MAP and continue from where you were
within that procedure. It will help you analyze the problem with the
2145 UPS-1U power cord or the 2145 UPS-1U.
3. (from step 1)
None of the used redundant AC-power switch outputs appears to have power.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
You might have been sent here because you have replaced a redundant AC-power
switch or corrected the cabling of a redundant AC-power switch. You can also use
this MAP if you think a redundant AC-power switch might not be working
correctly because the nodes connected to it lost power when only one AC
power circuit lost power.
In this MAP, you will be asked to confirm that power is available at the redundant
AC-power switch output sockets 1 and 2. If the redundant AC-power switch is
connected to nodes that are not powered on, use a voltage meter to confirm that
power is available.
If the redundant AC-power switch is powering nodes that are powered on (so the
nodes are operational), take some precautions before continuing with these tests.
Although you do not have to power off the nodes to conduct the test, the nodes
will power off if the redundant AC-power switch is not functioning correctly.
For each of the powered-on nodes connected to this redundant AC-power switch,
perform the following steps:
1. Use the management GUI or the command-line interface (CLI) to confirm that
the other node in the same I/O group as this node is online.
2. Use the management GUI or the CLI to confirm that all virtual disks connected
to this I/O group are online.
If any of these tests fail, correct any failures before continuing with this MAP. If
you are performing the verification using powered-on nodes, understand that
power is no longer available if either of the following is true:
v The on-battery indicator on the 2145 UPS-1U that connects the redundant
AC-power switch to the node lights for more than five seconds.
v The SAN Volume Controller node display shows Power Failure.
When the instructions say remove power, you can switch the power off if the
site power distribution unit has outputs that are individually switched; otherwise,
remove the specified redundant AC-power switch power cable from the site power
distribution unit's outlet.
Procedure
1. Are the two site power distribution units providing power to this redundant
AC-power switch connected to different power circuits?
NO Correct the problem and then return to this MAP.
YES Go to step 2.
2. (from step 1)
Are both of the site power distribution units providing power to this redundant
AC-power switch powered?
NO Correct the problem and then return to the start of this MAP.
YES Go to step 3.
3. (from step 2)
Are the two cables that are connecting the site power distribution units to the
redundant AC-power switch connected?
NO Correct the problem and then return to the start of this MAP.
YES Go to step 4.
4. (from step 3)
Is there power at the redundant AC-power switch output socket 2?
NO Go to step 8 on page 302.
YES Go to step 5.
5. (from step 4)
Is there power at the redundant AC-power switch output socket 1?
NO Go to step 8 on page 302.
YES Go to step 6.
6. (from step 5)
Remove power from the Main power cable to the redundant AC-power switch.
Is there power at the redundant AC-power switch output socket 1?
NO Go to step 8 on page 302.
YES Go to step 7 on page 302.
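Steps 1 through 6 form a simple decision chain. The following Python sketch shows one possible way to record the answers and return the next action. The AcSwitchCheck class and the map5340_next_action function are hypothetical illustrations, not part of any IBM tool, and steps 7 and 8 are not modeled because they are not included in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcSwitchCheck:
    """Hypothetical record of the answers to steps 1-6 of MAP 5340."""
    pdus_on_different_circuits: bool        # step 1
    both_pdus_powered: bool                 # step 2
    both_input_cables_connected: bool       # step 3
    power_at_output_socket_2: bool          # step 4
    power_at_output_socket_1: bool          # step 5
    socket_1_powered_without_main: Optional[bool] = None  # step 6; None = not tested yet


def map5340_next_action(check: AcSwitchCheck) -> str:
    """Return the next action that steps 1-6 of MAP 5340 prescribe."""
    if not check.pdus_on_different_circuits:
        return "Correct the problem, then return to this MAP"
    if not (check.both_pdus_powered and check.both_input_cables_connected):
        return "Correct the problem, then return to the start of this MAP"
    # Steps 4 and 5: a missing voltage at either output socket leads to step 8.
    if not (check.power_at_output_socket_2 and check.power_at_output_socket_1):
        return "Go to step 8"
    if check.socket_1_powered_without_main is None:
        return "Remove power from the main power cable, then check output socket 1 (step 6)"
    # Step 6: is output socket 1 still powered with the main power cable removed?
    return "Go to step 7" if check.socket_1_powered_without_main else "Go to step 8"
```

Ordering the checks exactly as the steps are numbered keeps the sketch traceable back to the MAP.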
Results
If the solution is set up correctly, powering off a single node does not normally
disrupt the operation of a SAN Volume Controller system. Normal operation
within a system has nodes in pairs called I/O groups. An I/O group continues to
handle I/O to the disks it manages with only a single node powered on. However,
performance degrades and resilience to error is reduced.
When you power off a SAN Volume Controller node, take care to minimize the
impact on the system. If you do not follow the procedures outlined here, your
application hosts might lose access to their data or, in the worst case, lose
data.
You can use the following preferred methods to power off a node that is a member
of a system and not offline:
1. Use the Power off option in the management GUI or in the service assistant
interface.
2. Use the CLI command stopsystem -node node_name.
Use the power button to power off a node only if the node is offline or is not a
member of a system.
To provide the least disruption when powering off a node, all of the following
conditions should apply:
v The other node in the I/O group is powered on and active in the system.
v The other node in the I/O group has SAN Fibre Channel connections to all hosts
and disk controllers managed by the I/O group.
v All volumes handled by this I/O group are online.
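The three conditions above can be treated as a checklist. The following Python sketch is illustrative only: IoGroupSnapshot and unmet_power_off_conditions are hypothetical names, and the sketch assumes that you have already gathered the host, controller, and volume information yourself from the management GUI or CLI. It only lists the conditions that are not met; deciding whether to proceed, and checking with the system administrator, remains your responsibility.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class IoGroupSnapshot:
    """Hypothetical snapshot of one I/O group, collected by the service technician."""
    partner_online: bool
    hosts: set[str]                   # hosts managed by the I/O group
    controllers: set[str]             # disk controllers managed by the I/O group
    partner_fc_hosts: set[str]        # hosts the partner node reaches over SAN Fibre Channel
    partner_fc_controllers: set[str]  # controllers the partner node reaches over SAN Fibre Channel
    volume_status: dict[str, str] = field(default_factory=dict)  # volume name -> status


def unmet_power_off_conditions(snap: IoGroupSnapshot) -> list[str]:
    """List the least-disruption conditions that do not hold (empty list = all met)."""
    unmet = []
    if not snap.partner_online:
        unmet.append("partner node is not powered on and active")
    unreachable = (snap.hosts - snap.partner_fc_hosts) | (
        snap.controllers - snap.partner_fc_controllers)
    if unreachable:
        unmet.append("partner node has no FC connection to: " + ", ".join(sorted(unreachable)))
    not_online = sorted(v for v, s in snap.volume_status.items() if s.lower() != "online")
    if not_online:
        unmet.append("volumes not online: " + ", ".join(not_online))
    return unmet
```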
In some circumstances, the reason you power off the node might make these
conditions impossible. For instance, if you replace a failed Fibre Channel adapter,
volumes do not show an online status. When a condition is not met, use your
judgment to decide whether it is safe to proceed. Always check with the system
administrator before proceeding with a power off that you know disrupts I/O
access, as the system administrator might prefer to wait for a more suitable time or
suspend host applications.
To ensure a smooth restart, a node must save data structures that it cannot recreate
to its local, internal disk drive. The amount of data the node saves to local disk can
be high, so this operation might take several minutes. Do not attempt to interrupt
the controlled power off.
Attention: The following actions do not allow the node to save data to its local
disk. Therefore, do not power off a node using the following methods:
v Removing the power cable between the node and the uninterruptible power
supply.
Normally the uninterruptible power supply provides sufficient power to allow
the write to local disk in the event of a power failure, but it cannot provide
power in this case.
v Holding down the power button on the node.
When you press and release the power button, the node indicates this to the
software so the node can write its data to local disk before the node powers off.
When you hold down the power button, the hardware interprets this as an
emergency power off indication and shuts down immediately. The hardware
does not save the data to a local disk before powering down. The emergency
power off occurs approximately four seconds after you press and hold down the
power button.
v Pressing the reset button on the light path diagnostics panel.
Important: If you power off the node and might not power it back on the same
day, follow these steps to prevent the batteries from being discharged too much
while the node is connected to power but not powered on:
1. Pull both batteries out of the node. Keep them out until you are ready to power
on the node.
2. Push the batteries in just before you press the power button to power on the
node.
If you disconnect the power from the node and might not reconnect power to it
again within the next 24 hours, follow these steps to prevent the batteries from
being discharged too much while the node is not connected to power:
1. After both power cords are disconnected from the node, pull both batteries out
of the node. This step completely turns off the battery backplane.
2. Push the batteries back in again.
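The two battery procedures differ in one respect: when the node stays connected to power, the batteries stay out; when power is disconnected, they are pulled out and pushed straight back in. The following Python sketch encodes that distinction. The battery_precautions function and its parameters are hypothetical, and an unknown reconnection time is treated as "might not reconnect within 24 hours".

```python
from typing import List, Optional


def battery_precautions(power_cords_connected: bool,
                        powered_on_again_same_day: bool,
                        hours_until_power_reconnected: Optional[float] = None) -> List[str]:
    """Return the battery steps for a powered-off node.

    powered_on_again_same_day applies only when the power cords stay connected;
    hours_until_power_reconnected applies only when they are disconnected
    (None means the time is unknown).
    """
    if power_cords_connected:
        if powered_on_again_same_day:
            return []
        # Connected to power but not powered on: keep the batteries out.
        return [
            "Pull both batteries out of the node",
            "Keep the batteries out until you are ready to power on the node",
            "Push the batteries in just before you press the power button",
        ]
    if hours_until_power_reconnected is not None and hours_until_power_reconnected <= 24:
        return []
    # Power disconnected and might not be reconnected within 24 hours: reseat the batteries.
    return [
        "After both power cords are disconnected, pull both batteries out "
        "(this completely turns off the battery backplane)",
        "Push the batteries back in again",
    ]
```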
To use the management GUI to power off a system, complete the following steps:
1. Launch the management GUI for the system that you are servicing.
Optionally, you can sign on to the IBM System Storage Productivity Center as
an administrator to launch the management GUI for the system that you are
servicing.
2. Select Monitoring > System.
If the nodes to power off are shown as Offline, the nodes are not participating
in the system. In such circumstances, use the power button on the offline nodes
to power off the nodes.
If the nodes to power off are shown as Online, powering off the nodes can
result in their dependent volumes also going offline:
a. Select the node and click Show Dependent Volumes.
b. Make sure the status of each volume in the I/O group is Online. You might
need to view more than one page.
If any volumes are Degraded, only one node in the I/O group is processing I/O
requests for that volume. If that node is powered off, it impacts all the hosts
that are submitting I/O requests to the degraded volume.
If any volumes are degraded and you believe that this might be because the
partner node in the I/O group has been powered off recently, wait until a
refresh of the screen shows all volumes online. All the volumes should be
online within 30 minutes of the partner node being powered off.
Note: After waiting 30 minutes, if you have a degraded volume and all of
the associated nodes and MDisks are online, contact support for assistance.
Ensure that all volumes that are used by hosts are online before continuing.
c. If possible, check that all hosts that access volumes managed by this I/O
group are able to fail over to use paths that are provided by the other node
in the group.
Perform this check using the multipathing device driver software of the host
system. Commands to use differ, depending on the multipathing device
driver being used.
If you use the System Storage Multipath Subsystem Device Driver (SDD),
the command to query paths is datapath query device.
It can take some time for the multipathing device drivers to rediscover
paths after a node is powered on. If you are unable to check on the host
that all paths to both nodes in the I/O group are available, do not power off
a node within 30 minutes of the partner node being powered on or you
might lose access to the volume.
d. If you decide that it is okay to continue with powering off the nodes, select
the node to power off and click Shut Down System.
e. Click OK. If the node that you select is the last remaining node that
provides access to a volume, for example a node that contains flash drives
with unmirrored volumes, the Shutting Down a Node-Force panel is
displayed with a list of volumes that will go offline if the node is shut
down.
f. Check that no host applications access the volumes that are going offline.
Continue with the shutdown only if the loss of access to these volumes is
acceptable. To continue shutting down the node, click Force Shutdown.
During the shutdown, the node saves its data structures to its local disk and
destages all write data held in cache to the SAN disks. Such processing can take
several minutes.
Procedure
1. Issue the lsnode CLI command to display a list of nodes in the system and
their properties. Find the node to shut down and write down the name of its
I/O group. Confirm that the other node in the I/O group is online.
lsnode -delim :
id:name:UPS_serial_number:WWNN:status:IO_group_id:IO_group_name:config_node:UPS_unique_id
1:group1node1:10L3ASH:500507680100002C:online:0:io_grp0:yes:202378101C0D18D8
2:group1node2:10L3ANF:5005076801000009:online:0:io_grp0:no:202378101C0D1796
3:group2node1:10L3ASH:5005076801000001:online:1:io_grp1:no:202378101C0D18D8
4:group2node2:10L3ANF:50050768010000F4:online:1:io_grp1:no:202378101C0D1796
If the node to power off is shown as Offline, the node is not participating in
the system and is not processing I/O requests. In such circumstances, use the
power button on the node to power off the node.
If the node to power off is shown as Online, but the other node in the I/O
group is not online, powering off the node impacts all hosts that are submitting
I/O requests to the volumes that are managed by the I/O group. Ensure that
the other node in the I/O group is online before you continue.
2. Issue the lsdependentvdisks CLI command to list the volumes that are
dependent on the status of a specified node.
lsdependentvdisks -node group1node1
vdisk_id vdisk_name
0 vdisk0
1 vdisk1
If the node goes offline or is removed from the system, the dependent volumes
also go offline. Before taking a node offline or removing it from the system, you
can use the command to ensure that you do not lose access to any volumes.
3. If you decide that it is okay to continue powering off the node, issue the
stopsystem -node <name> CLI command to power off the node. Use the -node
parameter to avoid powering off the whole system:
stopsystem -node group1node1
Are you sure that you want to continue with the shut down? yes
Note: To shut down the node even though there are dependent volumes, add
the -force parameter to the stopsystem command. The -force parameter forces
the command to continue even though any node-dependent volumes will
be taken offline. Use the -force parameter with caution; access to data on
node-dependent volumes will be lost.
During the shut down, the node saves its data structures to its local disk and
destages all write data held in the cache to the SAN disks, which can take
several minutes.
At the end of this process, the node powers off.
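If you capture the CLI output shown in steps 1 and 2, a short script can make the two pre-checks repeatable. The following Python sketch assumes the colon-delimited lsnode layout and the space-separated lsdependentvdisks layout shown in those examples; the function names are hypothetical, and the sketch only parses text that you paste in. It does not run any CLI commands.

```python
from typing import Dict, List


def parse_delimited(text: str, delim: str = ":") -> List[Dict[str, str]]:
    """Parse CLI output produced with -delim into one dict per row, keyed by header name."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return []
    # Strip stray spaces around header names (for example " IO_group_name").
    header = [name.strip() for name in lines[0].split(delim)]
    return [dict(zip(header, row.split(delim))) for row in lines[1:]]


def partner_is_online(lsnode_output: str, node_name: str) -> bool:
    """True if every other node in node_name's I/O group has status 'online'."""
    nodes = parse_delimited(lsnode_output)
    target = next((n for n in nodes if n.get("name") == node_name), None)
    if target is None:
        raise ValueError(f"node {node_name!r} not found in lsnode output")
    partners = [n for n in nodes
                if n["IO_group_name"] == target["IO_group_name"] and n["name"] != node_name]
    return bool(partners) and all(n["status"] == "online" for n in partners)


def dependent_volumes(lsdependentvdisks_output: str) -> List[str]:
    """Return the vdisk_name column from space-separated lsdependentvdisks output."""
    rows = lsdependentvdisks_output.strip().splitlines()[1:]  # skip the header row
    return [row.split()[1] for row in rows if row.strip()]
```

Applied to the example output in steps 1 and 2, partner_is_online returns True for group1node1 (its partner group1node2 is online), and dependent_volumes returns vdisk0 and vdisk1, the volumes that would go offline with the node.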
With this method, you cannot check the system status from the front panel, so you
cannot tell if the power off is liable to cause excessive disruption to the system.
Instead, use the management GUI or the CLI commands, described in the previous
topics, to power off an active node.
If you must use this method, notice in Figure 77 that each model type has a power
control button 1 on the front.
Figure 77. Power control button on the SAN Volume Controller models
When you determine it is safe to do so, press and immediately release the power
button. On models other than the 2145-DH8, the front panel display changes to
display Powering Off and displays a progress bar.
Note: The 2145-DH8 is a new design that does not use a front panel display.
On the 2145-CG8 and the 2145-CF8, you must remove the power button cover
before you can press the power button.
If you press the power button for too long, the node immediately powers down
and cannot write all data to its local disk. An extended service procedure is
required to restart the node, which involves deleting the node from the system
before adding it back.
The following graphic shows how Powering Off is displayed on the front panel of
all nodes but the 2145-DH8:
The node saves its data structures to disk while powering off. The power off
process can take up to five minutes.
When a node is powered off by using the power button (or because of a power
failure), the partner node in its I/O group immediately stops using its cache for
new write data and destages any write data already in its cache to the SAN
attached disks.
The time taken by this destage depends on the speed and utilization of the disk
controllers. The time to complete is usually less than 15 minutes, but it might be
longer. The destaging cannot complete if there is data waiting to be written to a
disk that is offline.
A node that powers off and restarts while its partner node continues to process
I/O might not be able to become an active member of the I/O group immediately.
The node must wait until the partner node completes its destage of the cache.
If the partner node powers off during this period, access to the SAN storage that is
managed by this I/O group is lost. If one of the nodes in the I/O group is unable
to service any I/O, for example because the partner node in the I/O group is still
flushing its write cache, volumes that are managed by that I/O group have a
status of Degraded.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
This MAP applies to all SAN Volume Controller models. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working with, look for the label that identifies the model type on
the front of the node.
Procedure
1. Is the power LED on the operator-information panel illuminated and
showing a solid green?
NO Continue with the power MAP. See MAP 5050: Power 2145-CG8 and
2145-CF8 on page 288.
Figure 78. SAN Volume Controller service controller error light
NO Start the front panel tests by pressing and holding the select button for
five seconds. Go to step 3.
Attention: Do not start this test until the node is powered on for at
least two minutes. You might receive unexpected results.
YES The SAN Volume Controller service controller has failed.
v Replace the service controller.
v Verify the repair by continuing with MAP 5700: Repair verification
on page 321.
3. (from step 2)
The front-panel check light illuminates and the display test of all display bits
turns on for 3 seconds and then turns off for 3 seconds, then a vertical line
travels from left to right, followed by a horizontal line travelling from top to
bottom. The test completes with the switch test display of a single rectangle in
the center of the display.
Did the front-panel lights and display operate as described?
NO The SAN Volume Controller front panel has failed its display test.
v Replace the service controller.
v Verify the repair by continuing with MAP 5700: Repair verification
on page 321.
YES Go to step 4.
4. (from step 3)
Figure 79 on page 309 provides four examples of what the front-panel display
shows before you press any button and then when you press the up button, the
left and right buttons, and the select button. To perform the front panel switch
test, press any button in any sequence or any combination. The display
indicates which buttons you pressed.
Check each switch in turn. Did the service panel switches and display operate
as described in Figure 79?
NO The SAN Volume Controller front panel has failed its switch test.
v Replace the service controller.
v Verify the repair by continuing with MAP 5700: Repair verification
on page 321.
YES Press and hold the select button for five seconds to exit the test. Go to
step 5.
5. Is the front-panel display now showing Cluster:?
NO Continue with MAP 5000: Start on page 275.
YES Keep pressing and releasing the down button until Node is displayed in
line 1 of the menu screen. Go to step 6.
6. (from step 5)
Is this MAP being used as part of the installation of a new node?
NO Front-panel tests have completed with no fault found. Verify the repair
by continuing with MAP 5700: Repair verification on page 321.
YES Go to step 7.
7. (from step 6)
Is the node number that is displayed in line 2 of the menu screen the same
as the node number that is printed on the front panel of the node?
NO The node number that is stored in the front-panel electronics is not the
same as the node number that is printed on the front panel.
v Replace the service controller.
v Verify the repair by continuing with MAP 5700: Repair verification
on page 321.
YES Front-panel tests have completed with no fault found. Verify the repair
by continuing with MAP 5700: Repair verification on page 321.
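The outcomes of steps 3 through 7 can be summarized as a single decision function. The following Python sketch is a hypothetical aid for recording results (map5400_outcome and its parameters are illustrative names); it assumes that steps 1 and 2 have already been passed and that the tests were run as described above.

```python
from typing import Optional

REPLACE = "Replace the service controller, then verify with MAP 5700: Repair verification"
NO_FAULT = "No fault found; verify with MAP 5700: Repair verification"


def map5400_outcome(display_test_ok: bool, switch_test_ok: bool, shows_cluster: bool,
                    new_node_install: bool, node_number_matches: Optional[bool] = None) -> str:
    """Map the answers to steps 3-7 of MAP 5400 to the prescribed action."""
    if not display_test_ok:      # step 3: display test failed
        return REPLACE
    if not switch_test_ok:       # step 4: switch test failed
        return REPLACE
    if not shows_cluster:        # step 5: front panel does not show Cluster:
        return "Continue with MAP 5000: Start"
    if not new_node_install:     # step 6: not part of a new node installation
        return NO_FAULT
    if node_number_matches is None:
        raise ValueError("step 7 requires comparing the displayed and printed node numbers")
    # Step 7: the displayed node number must match the number printed on the front panel.
    return NO_FAULT if node_number_matches else REPLACE
```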
Note: The service assistant GUI should be used if there is no front panel display,
for example on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
If you encounter problems with the 10 Gbps Ethernet feature on the SAN Volume
Controller 2145-CG8 or SAN Volume Controller 2145-DH8, see MAP 5550: 10G
Ethernet and Fibre Channel over Ethernet personality enabled adapter port on
page 313.
You might have been sent here for one of the following reasons:
v A problem occurred during the installation of a SAN Volume Controller system
and the Ethernet checks failed
v Another MAP sent you here
v The customer needs immediate access to the system by using an alternate
configuration node. See Defining an alternate configuration node on page 313
Procedure
1. Is the front panel of any node in the system displaying Node Error with
error code 805?
YES Go to step 6 on page 311.
NO Go to step 2.
2. Is the system reporting error 1400 either on the front panel or in the event
log?
YES Go to step 4.
NO Go to step 3.
3. Are you experiencing Ethernet performance issues?
YES Go to step 9 on page 312.
NO Go to step 10 on page 312.
4. (from step 2) On all nodes, perform the following actions:
a. Press the down button until the top line of the display shows Ethernet.
b. Press the right button until the top line displays Ethernet port 1.
c. If the second line of the display shows link offline, record this port as
one that requires fixing.
d. If the system is configured with two Ethernet cables per node, press the
right button until the top line of the display shows Ethernet port 2 and
repeat the previous step.
e. Go to step 5.
5. (from step 4) Are any Ethernet ports that have cables attached to them
reporting link offline?
YES Go to step 6 on page 311.
NO Go to step 10 on page 312.
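Steps 1 through 5 route you to a later step based on three questions and, for error 1400, on the port states that you recorded in step 4. The following Python sketch is a hypothetical way to apply that routing; the function names are illustrative, and port states are assumed to be recorded per node and port as (cable attached, second display line) pairs.

```python
from typing import Dict, List, Optional, Tuple

# Key: a label such as "node1 port 1"; value: (cable attached?, second display line).
PortStates = Dict[str, Tuple[bool, str]]


def ports_needing_fix(port_states: PortStates) -> List[str]:
    """Ports that have a cable attached and report 'link offline' (steps 4 and 5)."""
    return sorted(port for port, (cabled, state) in port_states.items()
                  if cabled and state == "link offline")


def map5500_next_step(node_error_805: bool, error_1400: bool, performance_issue: bool,
                      port_states: Optional[PortStates] = None) -> str:
    """Return the step that steps 1-5 of MAP 5500 send you to."""
    if node_error_805:                  # step 1
        return "step 6"
    if error_1400:                      # step 2
        if port_states is None:
            return "step 4"             # record the port states on all nodes first
        return "step 6" if ports_needing_fix(port_states) else "step 10"  # step 5
    return "step 9" if performance_issue else "step 10"  # step 3
```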
Figure 80. Port 2 Ethernet link LED on the SAN Volume Controller rear panel
Figure 81. Ethernet ports on the rear of the SAN Volume Controller 2145-DH8
If all Ethernet connections to the configuration node have failed, the system is
unable to report failure conditions, and the management GUI is unable to access
the system to perform administrative or service tasks. If this is the case and the
customer needs immediate access to the system, you can make the system use an
alternate configuration node by using the service assistant GUI. The service
assistant is accessed via the technician port.
Note: If the system has no front panel display such as on SAN Volume Controller
2145-DH8, use the service assistant GUI. The service assistant is accessed via the
technician port.
If only one node is displaying Node Error 805 on the front panel, perform the
following steps:
Procedure
1. Press and release the power button on the node that is displaying Node Error
805.
2. When Powering off is displayed on the front panel display, press the power
button again.
3. Restarting is displayed.
Results
The system selects a new configuration node, and the management GUI can
access the system again.
MAP 5550: 10G Ethernet and Fibre Channel over Ethernet personality
enabled adapter port
MAP 5550: 10G Ethernet helps you solve problems that occur on a SAN Volume
Controller 2145-CG8 or SAN Volume Controller 2145-DH8 with 10G Ethernet
capability and the Fibre Channel over Ethernet personality enabled.
Note: The service assistant GUI might be used if there is no front panel display,
for example on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
This MAP applies to the SAN Volume Controller 2145-CG8 and SAN Volume
Controller 2145-DH8 models with the 10G Ethernet feature installed. Be sure that
you know which model you are using before you start this procedure. To
determine which model you are working with, look for the label that identifies the
model type on the front of the node. Check that the 10G Ethernet adapter is
installed and that an optical cable is attached to each port. Figure 26 on page 28
shows the rear panel of the 2145-CG8 with the 10G Ethernet ports.
If you experience a problem with error code 703 or 723, go to Fibre Channel and
10G Ethernet link failures on page 247.
Procedure
1. Is node error 720 or 721 displayed on the front panel of the affected node or
is service error code 1072 shown in the event log?
YES Go to step 11 on page 315.
NO Go to step 2.
2. (from step 1) Perform the following actions from the front panel of the
affected node:
a. Press and release the up or down button until Ethernet is shown.
b. Press and release the left or right button until Ethernet port 3 is shown.
Was Ethernet port 3 found?
NO Go to step 11 on page 315.
YES Go to step 3.
3. (from step 2) Perform the following actions from the front panel of the
affected node:
a. Press and release the up or down button until Ethernet is shown.
b. Press and release the left or right button until Ethernet port 3 is shown.
c. Record whether the second line of the display shows Link offline, Link online,
or Not configured.
d. Press and release the left or right button until Ethernet port 4 is shown.
e. Record whether the second line of the display shows Link offline, Link online,
or Not configured.
f. Go to step 4.
4. (from step 3) What was the state of the 10G Ethernet ports that were seen in
step 3?
Both ports show Link online
The 10G link is working now. Verify the repair by continuing with
MAP 5700: Repair verification on page 321.
One or more ports show Link offline
Go to step 5 on page 315.
One or more ports show Not configured
For information about the port configuration, see the description of the
cfgportip CLI command for iSCSI in the SAN Volume Controller
Information Center.
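The routing in steps 1 through 4 can be sketched as follows. This Python function is a hypothetical aid, not an IBM tool; it assumes that the port states are the display strings recorded in step 3, and when one port shows Link offline and the other shows Not configured, it chooses the Link offline path (step 5) first, which is a judgment call that the MAP does not state explicitly.

```python
from typing import Dict, Iterable, Optional


def map5550_next_step(node_errors: Iterable[int], event_log_codes: Iterable[int],
                      port3_found: bool, port_states: Optional[Dict[int, str]] = None) -> str:
    """Return where steps 1-4 of MAP 5550 send you."""
    if {720, 721} & set(node_errors) or 1072 in set(event_log_codes):   # step 1
        return "step 11"
    if not port3_found:                                                  # step 2
        return "step 11"
    if port_states is None:
        return "step 3"  # read the states of Ethernet ports 3 and 4 first
    states = [port_states.get(3), port_states.get(4)]                    # step 4
    if all(s == "Link online" for s in states):
        return "MAP 5700: Repair verification"
    if "Link offline" in states:
        return "step 5"
    if "Not configured" in states:
        return "check the port configuration (cfgportip)"
    raise ValueError(f"unexpected port states: {states}")
```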
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
This MAP applies to all SAN Volume Controller models. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working with, look for the label that identifies the model type on
the front of the node.
Note: Use the service assistant GUI if there is no front panel display, for example
on the SAN Volume Controller 2145-DH8, where the service assistant GUI can be
accessed through the technician port.
Complete the following steps to solve problems that are caused by the Fibre
Channel ports:
Procedure
1. Are you trying to resolve a Fibre Channel port speed problem?
NO Go to step 2.
YES Go to step 11 on page 320.
2. (from step 1) Display the Fibre Channel port 1 status on the front panel
display or the service assistant GUI. For more information, see Chapter 6,
Using the front panel of the SAN Volume Controller, on page 91.
Is the front panel display or the service assistant GUI on the SAN Volume
Controller showing Fibre Channel port-1 active?
NO A Fibre Channel port is not working correctly. Check the port status
on the second line of the front panel display or the service assistant
GUI.
Note: SAN Volume Controller nodes support both longwave SFP
transceivers and shortwave SFP transceivers. You must replace an SFP
transceiver with the same type of SFP transceiver. For example, if the SFP
transceiver to replace is a longwave SFP transceiver, you must provide a
longwave replacement. Removing the wrong SFP transceiver might result in
loss of data access. See the Removing and replacing the Fibre Channel
SFP transceiver on a SAN Volume Controller node documentation to find
out how to replace an SFP transceiver.
b. Replace the Fibre Channel adapter assembly as shown in Table 82 on page
318.
c. Verify the repair by continuing with MAP 5700: Repair verification on
page 321.
10. (from steps 2 on page 316, 3 on page 317, 4 on page 317, and 5 on page 317)
The noted port on the SAN Volume Controller displays a status of not
installed. If you replaced the Fibre Channel adapter, make sure that it is
installed correctly. If you replaced any other system board components, make
sure that the Fibre Channel adapter was not disturbed.
Is the Fibre Channel adapter failure explained by the previous checks?
NO
a. Replace the Fibre Channel adapter assembly as shown in Table 82
on page 318.
b. If the problem is not fixed, replace the Fibre Channel connection
hardware in the order that is shown in Table 83.
Table 83. SAN Volume Controller Fibre Channel adapter connection hardware
(Node and port: adapter connection hardware, in replacement order)
SAN Volume Controller 2145-CG8, port 1, 2, 3, or 4 (slot 1): 1. PCI Express FC with Riser card assembly 1; 2. System board
SAN Volume Controller 2145-CG8, port 5, 6, 7, or 8 (slot 2): 1. PCI Express FC with Riser card assembly 2; 2. System board
SAN Volume Controller 2145-DH8, port 1-8: 1. PCI Express Riser card assembly 1; 2. System board
SAN Volume Controller 2145-DH8, port 9-12: 1. PCI Express Riser card assembly 2; 2. System board
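Table 83 can be read as a lookup from node model and port number to the connection hardware to replace, in order. The Python sketch below encodes the table that way; the names `TABLE_83` and `connection_hardware` are hypothetical and are not part of any SAN Volume Controller tool. The port ranges follow the table rows exactly.

```python
# Hypothetical encoding of Table 83: (model, ports covered, FRUs in replacement order).
TABLE_83 = [
    ("2145-CG8", range(1, 5), ["PCI Express FC with Riser card assembly 1", "System board"]),
    ("2145-CG8", range(5, 9), ["PCI Express FC with Riser card assembly 2", "System board"]),
    ("2145-DH8", range(1, 9), ["PCI Express Riser card assembly 1", "System board"]),
    ("2145-DH8", range(9, 13), ["PCI Express Riser card assembly 2", "System board"]),
]


def connection_hardware(model: str, port: int) -> list[str]:
    """Return the connection hardware to replace, in order, for a failing port."""
    for table_model, ports, frus in TABLE_83:
        if table_model == model and port in ports:
            return list(frus)  # copy so callers cannot modify the table
    raise KeyError(f"Table 83 has no entry for {model} port {port}")


if __name__ == "__main__":
    for step, fru in enumerate(connection_hardware("2145-CG8", 6), start=1):
        print(f"{step}. {fru}")
```

For port 6 on a 2145-CG8, the script lists riser card assembly 2 first and the system board second, matching the table order.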
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
You might have been sent here because you performed a repair and want to
confirm that no other problems exist on the machine.
Procedure
1. Are the Power LEDs on all the nodes on? For more information about this
LED, see Power LED on page 23.
NO Go to MAP 5000: Start on page 275.
YES Go to step 2.
2. (from step 1)
Are all the nodes displaying Cluster: or is the node status LED on?
NO Go to MAP 5000: Start on page 275.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
When an error occurs, LEDs are lit on the operator-information panel, on the light
path diagnostics panel, and then on the failed component. By viewing the LEDs in
this order, you can often identify the source of the error.
LEDs that are lit to indicate an error remain lit when the server is turned off,
provided that the node is still connected to an operating power supply.
Ensure that the node is turned on, and then resolve any hardware errors that are
indicated by the Error LED and light path LEDs:
Procedure
1. Is the System error LED 7, shown in Figure 82, on the SAN Volume
Controller 2145-DH8 operator-information panel on or flashing?
(Figure: SAN Volume Controller 2145-DH8 operator-information panel, callouts 1 to 7, with the light path diagnostics LEDs and the release latch labeled.)
Are one or more LEDs on the light path diagnostics panel on or flashing?
Figure 84. SAN Volume Controller 2145-DH8 light path diagnostics panel (labels: Checkpoint code, Remind, Reset)
(Figure: SAN Volume Controller 2145-DH8 LED locations. Labels: System error LED, Locator LED, Power LED, Enclosure management heartbeat LED, IMM2 heartbeat LED, Standby power LED, Battery error LED, DIMM 1-6 error LED and DIMM 19-24 error LED (under the latches), Microprocessor 1 error LED, Microprocessor 2 error LED.)
3. Continue with MAP 5700: Repair verification on page 321 to verify the
correct operation.
Ensure that the node is turned on, and then complete the following steps to
resolve any hardware errors that are indicated by the Error LED and light path
LEDs:
Procedure
1. Is the Error LED, shown in Figure 86 on page 330, on the SAN Volume
Controller 2145-CG8 operator-information panel on or flashing?
Figure 87. SAN Volume Controller 2145-CG8 or 2145-CF8 light path diagnostics panel (LEDs: OVERSPEC, LOG, LINK, PS, PCI, SP; buttons: REMIND, RESET)
Figure 88. SAN Volume Controller 2145-CG8 system board LEDs diagnostics panel
1 Battery LED.
2 IMM heartbeat LED.
3 Enclosure management heartbeat LED.
4 DIMM 10-18 error LEDs.
5 Microprocessor 1 error LED.
6 DIMM 1-9 error LEDs.
7 Fan one error LED.
8 Fan two error LED.
9 Fan three error LED.
10 Fan four error LED.
If a flash drive is deliberately removed from a slot, the system error LED and the DASD
diagnostics panel LED light. The error is maintained even if the flash drive is replaced in a
different slot. If a flash drive is removed or moved, clear the error by completing this
procedure:
1. Power off the node by using MAP 5350.
2. Remove both power cables.
3. Reconnect both power cables.
4. Restart the node.
Resolve any node or system errors that relate to flash drives or the system disk drive.
If an error is still shown, power off the node and reseat all the drives.
If the error remains, replace the following components in the order listed:
1. The system disk drive
2. The disk backplane
RAID This LED is not used on the SAN Volume Controller 2145-CG8.
BRD An error occurred on the system board. Complete the following actions to resolve the problem:
1. Check the LEDs on the system board to identify the component that caused the error. The
BRD LED can be lit for any of the following reasons:
v Battery.
v Missing PCI riser-card assembly. There must be a riser card in PCI slot 2 even if no
adapter is present.
v Failed voltage regulator.
2. Replace any failed or missing components, such as the battery or PCI
riser-card assembly.
3. If a voltage regulator has failed, replace the system board.
3. Continue with MAP 5700: Repair verification on page 321 to verify the
correct operation.
Ensure that the node is turned on, and then complete the following steps to
resolve any hardware errors that are indicated by the Error LED and light path
LEDs:
Procedure
1. Is the Error LED, shown in Figure 89, on the SAN Volume Controller
2145-CF8 operator-information panel on or flashing?
Figure 90. SAN Volume Controller 2145-CG8 or 2145-CF8 light path diagnostics panel (LEDs: OVERSPEC, LOG, LINK, PS, PCI, SP; buttons: REMIND, RESET)
Figure 91. SAN Volume Controller 2145-CF8 system board LEDs diagnostics panel
If a flash drive is deliberately removed from a slot, the system error LED and the DASD
diagnostics panel LED light. The error is maintained even if the flash drive is replaced in a
different slot. If a flash drive is removed or moved, clear the error by completing this
procedure:
1. Power off the node by using MAP 5350.
2. Remove both power cables.
3. Reconnect both power cables.
4. Restart the node.
Resolve any node or system errors that relate to flash drives or the system disk drive.
If an error is still shown, power off the node and reseat all the drives.
If the error remains, replace the following components in the order listed:
1. The system disk drive
2. The disk backplane
RAID This LED is not used on the SAN Volume Controller 2145-CF8.
BRD An error occurred on the system board. Complete the following actions to resolve the problem:
1. Check the LEDs on the system board to identify the component that caused the error. The
BRD LED can be lit for any of the following reasons:
v Battery.
v Missing PCI riser-card assembly. There must be a riser card in PCI slot 2 even if no
adapter is present.
v Failed voltage regulator.
2. Replace any failed or missing components, such as the battery or PCI
riser-card assembly.
3. If a voltage regulator has failed, replace the system board.
3. Continue with MAP 5700: Repair verification on page 321 to verify the
correct operation.
Note: Use the service assistant GUI if there is no front panel display, for example
on the SAN Volume Controller 2145-DH8.
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
This MAP applies to all SAN Volume Controller models. However, some models do
not have a front panel display; if the node does not have one, use the service
assistant GUI instead. Be sure that you know which model you are using before you
start this procedure.
You might have been sent here for one of the following reasons:
v The hardware boot display, shown in Figure 92, is displayed continuously.
v The boot progress is hung and an error is displayed on the front panel.
v Another MAP sent you here.
v The node status LED, node fault LED, and battery status LED have remained off.
Perform the following steps to allow the node to start its boot sequence:
Procedure
1. Is the Error LED on the operator-information panel illuminated or flashing?
NO Go to step 2.
YES Go to MAP 5800: Light path on page 322 to resolve the problem.
2. (From step 1)
If you have just installed the SAN Volume Controller node or have just
replaced a field replaceable unit (FRU) inside the node, perform the
following steps:
a. Identify and label all the cables that are attached to the node so that they
can be replaced in the same port. Remove the node from the rack and place
it on a flat, static-protective surface. See the Removing the node from a rack
information to find out how to perform the procedure.
b. Remove the top cover. See the Removing the top cover information to
find out how to perform the procedure.
c. If you have just replaced a FRU, ensure that the FRU is correctly placed and
that all connections to the FRU are secure.
d. Ensure that all memory modules are correctly installed and that the latches
are fully closed. See the Replacing the memory modules (DIMM) information
to find out how to perform the procedure.
Figure 94. Keyboard and monitor ports on the SAN Volume Controller 2145-CF8
Figure 95. Keyboard and monitor ports on the SAN Volume Controller 2145-CG8
Figure 96. Keyboard and monitor ports on the SAN Volume Controller 2145-DH8, front
7 USB port 6
8 USB port 5
9 USB port 4
10 USB port 3
11 Serial port
12 Video port
Figure 97. Keyboard and monitor ports on the SAN Volume Controller 2145-DH8, rear
Note: With the FRUs removed, the boot will hang with a different boot failure
code.
NO Go to step 6 to add the FRUs back one at a time until the failing FRU is
isolated.
YES Go to step 7 on page 346.
6. (From step 5)
Remove all hardware except the hardware that is necessary to power up.
Then add the FRUs back one at a time, powering on after each addition, until
the original failure recurs.
Does the boot operation still hang?
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
This MAP applies to models with internal flash drives. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working on, look for the label that identifies the model type on the
front of the node.
Use this MAP to determine which detailed MAP to use for replacing an offline
SSD.
Attention: If the drive use property is member and the drive must be replaced,
contact IBM support before taking any actions.
Procedure
Are you using an SSD in a RAID 0 array and using volume mirroring to provide
redundancy?
Yes Go to MAP 6001: Replace offline SSD in a RAID 0 array.
No Go to MAP 6002: Replace offline SSD in RAID 1 array or RAID 10 array
on page 349.
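The decision in this MAP, together with its Attention note, can be summarized as a small function. This Python sketch is illustrative only: the function name `select_ssd_map` and its parameters are hypothetical, and the answer to the RAID 0 and volume mirroring question still has to come from your own configuration.

```python
def select_ssd_map(drive_use: str, raid0_with_volume_mirroring: bool) -> str:
    """Choose the next step for an offline SSD, following MAP 6000.

    drive_use is the drive's use property as reported by lsdrive.
    """
    # Attention note: if the drive use is "member" and the drive must be
    # replaced, contact IBM support before taking any action.
    if drive_use == "member":
        return "Contact IBM support before taking any action"
    if raid0_with_volume_mirroring:
        return "MAP 6001: Replace offline SSD in a RAID 0 array"
    return "MAP 6002: Replace offline SSD in RAID 1 array or RAID 10 array"
```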
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
This MAP applies to models with internal flash drives. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working on, look for the label that identifies the model type on the
front of the node.
Attention:
1. Back up your SAN Volume Controller configuration before you begin these
steps.
2. If the drive use property is member and the drive must be replaced, contact IBM
support before taking any actions.
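Before you delete the storage pool in step 2 of the following procedure, it can help to work out which volumes will lose data. The Python sketch below applies the rules stated in the note under step 1 and in step 2: a volume with a copy in the failed pool survives only if it has another copy that is online and in sync; otherwise the whole volume is lost, including any out-of-sync copies in other pools. The `VolumeCopy` fields are assumptions that loosely mirror the copy status and synchronization state reported by the CLI (for example, by lsvdiskcopy); all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeCopy:
    volume: str    # volume name
    pool: str      # storage pool that holds this copy
    online: bool   # copy status is online
    in_sync: bool  # copy is synchronized with the volume


def volumes_lost(copies: list[VolumeCopy], failed_pool: str) -> list[str]:
    """Return the names of volumes that cannot be recovered after failed_pool is deleted."""
    by_volume: dict[str, list[VolumeCopy]] = {}
    for copy in copies:
        by_volume.setdefault(copy.volume, []).append(copy)

    lost = []
    for volume, vcopies in by_volume.items():
        if all(c.pool != failed_pool for c in vcopies):
            continue  # no copy in the failed pool, so the volume is unaffected
        # The volume survives only if another pool holds an online, in-sync copy.
        survivors = [c for c in vcopies
                     if c.pool != failed_pool and c.online and c.in_sync]
        if not survivors:
            lost.append(volume)
    return sorted(lost)
```

For example, a volume whose only copy is in the failed pool is lost, and so is a mirrored volume whose copy in another pool is still synchronizing.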
Perform the following steps only if a drive in a RAID 0 (striped) array has failed:
Procedure
1. Record the properties of all volume copies, MDisks, and storage pools that are
dependent on the failed drive.
a. Use the lsdrive CLI command to identify the drive ID and the error sequence
number of each drive whose status is offline and whose use is failed.
b. Review the offline reason using the lsevent <seq_no> CLI command.
c. Obtain detailed information about the offline drive or drives using the
lsdrive <drive_id> CLI command.
d. Record the mdisk_id, mdisk_name, node_id, node_name, and slot_id for each
offline drive.
e. Obtain the storage pools of the failed drives by using the lsmdisk <mdisk_id>
CLI command for each MDisk that was identified in substep 1c.
Note: If a listed volume has a mirrored copy that is online and in sync, you can
recover the volume data from that copy. All the data on unmirrored
volumes will be lost and must be restored from backup.
2. Delete the storage pool using the rmmdiskgrp -force <mdiskgrp id> CLI
command.
All MDisks and volume copies in the storage pool are also deleted. If any of
the volume copies were the last in-sync copy of a volume, all the copies that
are not in sync are also deleted, even if they are not in the storage pool.
3. Using the drive ID that you recorded in substep 1e, set the use property of the
drive to unused using the chdrive command.
chdrive -use unused <id of offline drive>
The drive is removed from the drive listing.
4. Follow the physical instructions to replace or remove a drive. See the
Replacing a SAN Volume Controller 2145-CG8 flash drive documentation or
the Removing a SAN Volume Controller 2145-CG8 flash drive
documentation to find out how to perform the procedures.
5. A new drive object is created with the use attribute set to unused. This action
might take several minutes.
Obtain the ID of the new drive using the lsdrive CLI command.
6. Change the use property for the new drive to candidate.
chdrive -use candidate <drive id of new drive>
7. Create a new storage pool with the same properties as the deleted storage
pool. Use the properties that you recorded in substep 1l.
mkmdiskgrp -name <mdiskgrp name as before> -ext <extent size as before>
8. Re-create all MDisks that were previously in the storage pool, using the
information from steps 1j and 1k.
v For internal RAID 0 MDisks, use this command:
If you are not familiar with these maintenance analysis procedures (MAPs), first
read Chapter 11, Using the maintenance analysis procedures, on page 275.
This MAP applies to models with internal flash drives. Be sure that you know
which model you are using before you start this procedure. To determine which
model you are working on, look for the label that identifies the model type on the
front of the node.
Attention:
1. Back up your SAN Volume Controller configuration before you begin these
steps.
2. If the drive use property is member and the drive must be replaced, contact IBM
support before taking any actions.
Procedure
1. Make sure the drive property use is not member.
Use the lsdrive CLI command to determine the use.
2. Record the drive property values of the node ID and the slot ID for use in step
4. These values identify which physical drive to remove.
3. Record the error sequence number for use in step 11 on page 350.
4. Use the drive ID that you recorded in step 2 to set the use property of the
drive first to failed and then to unused with the chdrive command:
chdrive -use failed <id of offline drive>
chdrive -use unused <id of offline drive>
The drive is removed from the drive listing.
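Steps 1 to 3 ask you to read the use, node ID, slot ID, and error sequence number of the offline drive from lsdrive output. The Python sketch below parses colon-delimited output, such as the output of lsdrive with the -delim : parameter, and collects those values. The sample output and its column names are assumptions for illustration only; check them against the header line that your system actually prints.

```python
import csv
import io

# Hypothetical sample of "lsdrive -delim :" output; the values are illustrative only.
SAMPLE = """id:status:error_sequence_number:use:tech_type:capacity:mdisk_id:mdisk_name:member_id:enclosure_id:slot_id:node_id:node_name
0:online::member:sas_ssd:278.9GB:0:mdisk0:0::1:1:node1
1:offline:120:failed:sas_ssd:278.9GB:::::2:1:node1
"""


def parse_cli_table(text: str, delim: str = ":") -> list[dict[str, str]]:
    """Turn delimited CLI output (a header row plus data rows) into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delim))


def offline_drive_records(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Collect the values that steps 1 to 3 ask you to record for each offline drive."""
    records = []
    for row in rows:
        if row["status"] != "offline":
            continue
        if row["use"] == "member":
            # Step 1 and the Attention note: stop and contact IBM support.
            raise RuntimeError(f"drive {row['id']} is a member; contact IBM support")
        records.append({
            "drive_id": row["id"],
            "node_id": row["node_id"],
            "slot_id": row["slot_id"],
            "error_sequence_number": row["error_sequence_number"],
        })
    return records
```

With the sample output, only drive 1 is reported, together with node 1, slot 2, and error sequence number 120.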
The following attributes and host parameters might affect iSCSI performance:
v Transmission Control Protocol (TCP) Delayed ACK
v Ethernet jumbo frame
v Network bottleneck or oversubscription
v iSCSI session login balance
v Priority flow control (PFC) setting and bandwidth allocation for iSCSI on the
network
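To illustrate the first item in the list above, the following Python sketch models when a receiver that uses TCP delayed acknowledgment sends its ACKs, following RFC 1122: an ACK for at least every second full-sized segment, and otherwise an ACK when the delay timer expires, with the timer required to be less than 0.5 seconds. The sketch assumes that every segment is full-sized and uses a hypothetical 200 ms timer; real TCP stacks differ in their timer values and heuristics.

```python
def delayed_ack_times(arrivals: list[float], ack_delay: float = 0.2) -> list[float]:
    """Return the times at which a delayed-ACK receiver sends ACKs.

    arrivals: sorted arrival times, in seconds, of full-sized segments.
    ack_delay: delayed-ACK timer; RFC 1122 requires it to be under 0.5 s.
    """
    if not 0 < ack_delay < 0.5:
        raise ValueError("RFC 1122 requires the ACK delay to be less than 0.5 s")
    acks = []
    pending_since = None  # arrival time of the oldest unacknowledged segment
    for t in arrivals:
        # The timer might have expired before this segment arrived.
        if pending_since is not None and t >= pending_since + ack_delay:
            acks.append(pending_since + ack_delay)
            pending_since = None
        if pending_since is None:
            pending_since = t  # first unacknowledged segment: start the timer
        else:
            acks.append(t)  # second full-sized segment: acknowledge both now
            pending_since = None
    if pending_since is not None:
        acks.append(pending_since + ack_delay)  # an odd trailing segment waits for the timer
    return acks
```

With three back-to-back segments, the first two are acknowledged together almost immediately, but the third is acknowledged only when the timer expires, about 200 ms later. If the sender is waiting for that ACK before it sends more data, as can happen when only one I/O operation is outstanding, the wait adds directly to the I/O latency.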
Procedure
1. Disable the TCP delayed acknowledgment feature.
To disable this feature, see your operating system or platform documentation:
v VMWare: http://kb.vmware.com/selfservice/microsites/microsite.do
v Windows: http://support.microsoft.com/kb/823764
The primary symptom of this issue is that read performance is significantly lower
than write performance. Transmission Control Protocol (TCP) delayed
acknowledgment is a technique that is used by some implementations of
TCP to improve network performance. However, when the number of
outstanding I/O operations is 1, the technique can significantly
reduce I/O performance.
With delayed acknowledgment, several ACK responses can be combined into a single
response, reducing protocol overhead. As described in RFC 1122, a host can
delay sending an ACK response, but the delay must be less than 500 ms. Additionally,
in a stream of full-sized incoming segments, an ACK response should be sent for at
least every second segment.
Important: The host must be rebooted for these settings to take effect. A few
platforms (for example, standard Linux distributions) do not provide a way to
disable this feature. However, the issue was resolved with the version 7.1
release, and no host configuration changes are required to manage
TcpDelayedAck behavior.
2. Enable jumbo frames for iSCSI.
Jumbo frames are Ethernet frames that carry more than 1500 bytes of payload.
The maximum transmission unit (MTU) parameter sets the size of jumbo
frames.
The SAN Volume Controller supports a 9000-byte MTU. Use the cfgportip CLI
command to enable jumbo frames. This command is disruptive: the link goes down
and comes back up, and I/O operations through that port pause.
The network must support jumbo frames end-to-end for this to be effective;
verify this by sending a ping packet that must be delivered without fragmentation.
For example:
v Windows:
Node 1:
Port 1: 192.168.1.11
Port 2: 192.168.2.21
Port 3: 192.168.3.31
Node 2:
Port 1: 192.168.1.12
Port 2: 192.168.2.22
Port 3: 192.168.3.33
v Avoid situations where 50 hosts are logged in to port 1 and only five hosts
are logged in to port 2.
v Use proper subnetting to achieve a balance between the number of sessions
and redundancy.
5. Troubleshoot problems with PFC settings.
You do not need to enable PFC on the SAN Volume Controller system. SAN
Volume Controller reads the data center bridging exchange (DCBx) packet and
enables PFC for iSCSI automatically if it is enabled on the switch. In the
lsportip command output, the fields lossless_iscsi and lossless_iscsi6
show on or off, depending on whether PFC is enabled for iSCSI on the
system.
If the fields lossless_iscsi and lossless_iscsi6 show off, the cause might be
one of the following:
a. A VLAN is not set for that IP address. Perform the following checks:
v For IP address type IPv4, check the vlan field in the lsportip output. It
should not be blank.
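Step 2 of this procedure recommends verifying end-to-end jumbo frame support with a ping that must not be fragmented. The payload size for such a ping is the MTU minus the IP and ICMP headers: for the 9000-byte MTU that the SAN Volume Controller supports, 9000 - 20 - 8 = 8972 bytes over IPv4. The Python sketch below computes that size and builds IPv4 commands for the Windows ping (-f sets Don't Fragment, -l sets the size) and the Linux iputils ping (-M do prohibits fragmentation, -s sets the size). The header sizes assume no IP options or IPv6 extension headers, and the target address is taken from the example port list purely for illustration.

```python
# Header sizes in bytes, assuming no IP options or IPv6 extension headers.
IP_HEADER_BYTES = {4: 20, 6: 40}
ICMP_ECHO_HEADER_BYTES = 8  # the same for ICMP and ICMPv6 echo requests


def max_ping_payload(mtu: int, ip_version: int = 4) -> int:
    """Largest ping payload that fits in one unfragmented packet of the given MTU."""
    return mtu - IP_HEADER_BYTES[ip_version] - ICMP_ECHO_HEADER_BYTES


def no_fragment_ping(host: str, mtu: int = 9000, platform: str = "windows") -> list[str]:
    """Build an IPv4 ping command that forbids fragmentation."""
    size = str(max_ping_payload(mtu, 4))
    if platform == "windows":
        return ["ping", "-f", "-l", size, host]        # -f sets Don't Fragment
    if platform == "linux":
        return ["ping", "-M", "do", "-s", size, host]  # -M do prohibits fragmentation
    raise ValueError(f"unsupported platform: {platform}")


if __name__ == "__main__":
    print(" ".join(no_fragment_ping("192.168.1.11")))
```

If the unfragmented ping fails while a standard-size ping succeeds, some device in the path probably does not accept jumbo frames.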
Accessibility features
These are the major accessibility features for the SAN Volume Controller:
v You can use screen-reader software and a digital speech synthesizer to hear what
is displayed on the screen. HTML documents have been tested using JAWS
version 15.0.
v This product uses standard Windows navigation keys.
v Interfaces are commonly used by screen readers.
v Keys are discernible by touch, but do not activate just by touching them.
v Industry-standard devices, ports, and connectors.
v You can attach alternative input and output devices.
The SAN Volume Controller online documentation and its related publications are
accessibility-enabled. The accessibility features of the online documentation are
described in Viewing information in the information center.
Keyboard navigation
You can use keys or key combinations to perform operations and initiate menu
actions that can also be done through mouse actions. You can navigate the SAN
Volume Controller online documentation from the keyboard by using the shortcut
keys for your browser or screen-reader software. See your browser or screen-reader
software Help for a list of shortcut keys that it supports.
See the IBM Human Ability and Accessibility Center for more information about
the commitment that IBM has to accessibility.
The Statement of Limited Warranty is shipped (in hardcopy form) with your
product. It can also be ordered from IBM (see Table 2 on page xii for the part
number).
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:
IBM may use or distribute any of the information you provide in any way it
believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:
The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.
All IBM prices shown are IBM's suggested retail prices, are current and are subject
to change without notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to
change before the products described become available.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
If you are viewing this information softcopy, the photographs and color
illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the web at Copyright and
trademark information at www.ibm.com/legal/copytrade.shtml.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered
trademarks or trademarks of Adobe Systems Incorporated in the United States,
and/or other countries.
Linux and the Linux logo is a registered trademark of Linus Torvalds in the United
States, other countries, or both.
Homologation statement
This product may not be certified in your country for connection by any means
whatsoever to interfaces of public telecommunications networks. Further
certification may be required by law prior to making any such connection. Contact
an IBM representative or reseller for any questions.
This equipment has been tested and found to comply with the limits for a Class A
digital device, pursuant to Part 15 of the FCC Rules. These limits are designed to
provide reasonable protection against harmful interference when the equipment is
operated in a commercial environment. This equipment generates, uses, and can
radiate radio frequency energy and, if not installed and used in accordance with
the instruction manual, might cause harmful interference to radio communications.
Operation of this equipment in a residential area is likely to cause harmful
interference, in which case the user will be required to correct the interference at
his own expense.
Notices 361
Properly shielded and grounded cables and connectors must be used in order to
meet FCC emission limits. IBM is not responsible for any radio or television
interference caused by using other than recommended cables and connectors, or by
unauthorized changes or modifications to this equipment. Unauthorized changes
or modifications could void the user's authority to operate the equipment.
This device complies with Part 15 of the FCC Rules. Operation is subject to the
following two conditions: (1) this device may not cause harmful interference, and
(2) this device must accept any interference received, including interference that
might cause undesired operation.
Responsible Manufacturer:
Warning: This is a Class A device. In residential areas, this device can cause radio
interference; in that case, the operator can be required to take appropriate
measures and to bear the costs of doing so.
This device is authorized to bear the EC conformity mark (CE) in accordance with
the German EMVG (Electromagnetic Compatibility Act).
General information:
The device meets the protection requirements of EN 55024 and EN 55022 Class
A.
Taiwan Class A compliance statement
366 SAN Volume Controller: Troubleshooting Guide
Index
Numerics accessibility (continued)
repeat rate
C
10 Gbps Ethernet up and down buttons 355 Call Home 133, 136
link failures 313 repeat rate of up and down Canadian electronic emission notice 362
MAP 5550 313 buttons 115 charging 92
10 Gbps Ethernet card accessing circuit breakers
activity LED 24 cluster (system) CLI 68 2145 UPS-1U 51
10G Ethernet 247, 313 management GUI 61 requirements
2145 UPS-1U publications 355 SAN Volume Controller
alarm 50 service assistant 67 2145-CF8 40
circuit breakers 51 service CLI 69 SAN Volume Controller
connecting 48 action menu options 2145-CG8 38
connectors 51 front panel display 102 CLI
controls and indicators on the front sequence 102 service commands 68
panel 49 action options system commands 68
description of parts 51 node when to use 68
dip switches 51 create cluster 107 CLI commands
environment 53 actions lssystem
heat output of node 39 reset service IP address 70 displaying clustered system
Load segment 1 indicator 50 reset superuser password 70 properties 82
Load segment 2 indicator 50 active status 98 cluster (system) CLI
MAP adding accessing 68
5150: 2145 UPS-1U 292 nodes 63 clustered system
5250: repair verification 298 address restore 261
nodes MAC 101 T3 recovery 261
heat output 39 Address Resolution Protocol (ARP) 12 clustered systems
on or off button 50 addressing Call Home email 133, 136
on-battery indicator 50 configuration node 12 deleting nodes 61
operation 48 error codes 161
overload indicator 50 IP address
ports not used 51 configuration node 12
power-on indicator 50 B IP failover 12
service indicator 50 back-panel assembly IPv4 address 98
test and alarm-reset button 51 SAN Volume Controller 2145-CF8 IPv6 address 99
unused ports 51 connectors 30 metadata, saving 93
2145-DH8 indicators 29 options 98
additional space requirements 36 SAN Volume Controller 2145-CG8 overview 11
air temperature without redundant ac connectors 27 properties 82
power 36 indicators 26 recovery codes 161
dimensions and weight 36 SAN Volume Controller 2145-DH8 removing nodes 61
heat output of node 37 connectors 24 restore 254
humidity without redundant ac indicators 24 T3 recovery 254
power 36 backing up codes
input-voltage requirements 35 system configuration files 264 node error
nodes backup configuration files critical 160
heat output 37 deleting noncritical 160
power requirements for each using the CLI 269 node rescue 160
node 35 restoring 266 commands
product characteristics 35 bad blocks 273 create cluster 72
requirements 35 battery install software 72
specifications 35 Charging, front panel display 92 query status 74
weight and dimensions 36 power 93 reset service assistant password 71
battery fault LED 17 satask.txt 70
battery status LED 17 snap 71
A boot
codes, understanding 159
svcconfig backup 264
svcconfig restore 266
about this document failed 91 comments, sending xv
sending comments xv progress indicator 91 configuration
ac and dc LEDs 33, 34 boot drive node failover 12
AC and DC LEDs 33 SAN Volume Controller configuration node 12
ac power switch, cabling 44 2145-DH8 157 connecting
accessibility 355 buttons, navigation 19 2145 UPS-1U 48
Index 369
LEDs (continued)
  power 23, 33
  power-supply error 33, 34
  rear-panel indicators 24, 26, 29
    SAN Volume Controller 2145-CF8 29
    SAN Volume Controller 2145-CG8 26
    SAN Volume Controller 2145-DH8 24
  system information 23
  system-error 22, 33
light path MAP 322
link failures
  Fibre Channel 247
link problems
  iSCSI 248
Load segment 1 indicator 50
Load segment 2 indicator 50
locator LED 24
log files
  viewing 132

M
MAC address 101
maintenance analysis procedures (MAPs)
  10 Gbps Ethernet 313
  2145 UPS-1U 292
  Ethernet 310
  Fibre Channel 316
  front panel 307
  hardware boot 341
  light path 322
  overview 275
  power
    SAN Volume Controller 2145-CG8 288
    SAN Volume Controller 2145-DH8 283
  repair verification 321
  SSD failure 346, 347, 349
  start 275
management GUI
  accessing 61
  shut down node 302
management GUI interface
  when to use 60
managing
  event log 131
MAP
  5000: Start 275
  5040: Power SAN Volume Controller 2145-DH8 283
  5050: Power SAN Volume Controller 2145-CG8 and 2145-CF8 288
  5150: 2145 UPS-1U 292
  5250: 2145 UPS-1U repair verification 298
  5320: Redundant AC power 299
  5340: Redundant ac power verification 300
  5400: Front panel 307
  5500: Ethernet 310
  5550: 10 Gbps Ethernet 313
  5600: Fibre Channel 316
  5700: Repair verification 321
  5800: Light path 322
  5900: Hardware boot 341
MAP (continued)
  6000: Replace offline SSD 346
  6001: Replace offline SSD in a RAID 0 array 347
  6002: Replace offline SSD in a RAID 1 array or RAID 10 array 349
  power off node 302
MAPs (maintenance analysis procedures)
  10 Gbps Ethernet 313
  2145 UPS-1U 292
  2145 UPS-1U repair verification 298
  Ethernet 310
  Fibre Channel 316
  front panel 307
  hardware boot 341
  light path 322
  power
    SAN Volume Controller 2145-CF8 288
    SAN Volume Controller 2145-CG8 288
    SAN Volume Controller 2145-DH8 283
  power off 302
  redundant ac power 300
  redundant AC power 299
  repair verification 321
  SSD failure 346, 347, 349
  start 275
  using 275
media access control (MAC) address 101
medium errors 273
menu options
  clustered system
    IPv4 address 98
    IPv4 gateway 99
    IPv4 subnet 99
  clustered systems
    IPv6 address 99
  clusters
    IPv6 address 99
    options 98
    reset password 115
    status 98
  Ethernet
    MAC address 101
    port 101
    speed 101
  Fibre Channel port-1 through port-4 101
  front panel display 96
  IPv4 gateway 99
  IPv6 gateway 99
  IPv6 prefix 99
  Language? 116
  node
    options 100
    status 100
  SAN Volume Controller
    active 98
    degraded 98
    inactive 98
  sequence 96
  system
    gateway 99
    IPv6 prefix 99
    status 100
message classification 162
migrate 245
migrate drives 245

N
navigation
  accessibility 355
  buttons 19
  create cluster 107
  Language? 116
  recover cluster 115
New Zealand electronic emission statement 362
node
  create cluster 107
  options
    create cluster? 107
    gateway 111
    IPv4 address 107
    IPv4 confirm create? 109
    IPv4 gateway 109
    IPv4 subnet mask 108
    IPv6 address 110
    IPv6 Confirm Create? 111
    IPv6 prefix 110
    Remove Cluster? 115
    status 100
    subnet mask 108
  rescue request 93
  software failure 283, 288
node canisters
  configuration 11
node fault LED 16
node rescue
  codes 160
node status LED 16, 18
nodes
  adding 63
  cache data, saving 93
  configuration 11
    addressing 12
    failover 12
  deleting 61
  downloading
    vital product data 81
  failover 12
  hard disk drive failure 93
  identification label 20
  options
    main 100
  removing 61
  rescue
    completing 270
  viewing
    general details 81
noncritical
  node errors 160
not used
  2145 UPS-1U ports 51
  location LED 33
notifications
  Call Home information 136
  inventory information 136
  sending 133
number range 162
SAN Volume Controller (continued)
  field replaceable units (continued)
    disk drive cables 53
    disk power cable 53
    disk signal cable 53
    Ethernet cable 53
    fan assembly 53
    fan power cable 53
    Fibre Channel adapter assembly 53
    Fibre Channel cable 53
    Fibre Channel HBA 53
    frame assembly 53
    front panel 53
    memory module 53
    microprocessor 53
    operator-information panel 53
    power backplane 53
    power cable assembly 53
    power supply assembly 53
    riser card, PCI 53
    riser card, PCI Express 53
    service controller 53
    service controller cable 53
    system board 53
    thermal grease 53
    voltage regulator module 53
  front-panel display 91
  hardware 1
  hardware components 15
  menu options
    Language? 116
    node 100
  node 15
  overview 1
  power control 117
  power-on self-test 130
  preparing environment 35
  properties 81
  software
    overview 1
SAN Volume Controller 2145-CF8
  additional space requirements 42
  air temperature without redundant ac power 40
  circuit breaker requirements 40
  connectors 30
  controls and indicators on the front panel 18
  dimensions and weight 41
  heat output of node 42
  humidity with redundant ac power 41
  humidity without redundant ac power 40
  indicators and controls on the front panel 18
  input-voltage requirements 40
  light path MAP 336
  MAP 5800: Light path 336
  nodes
    heat output 42
  operator-information panel 22
  ports 30
  power requirements for each node 40
  product characteristics 40
SAN Volume Controller 2145-CF8 (continued)
  rear-panel indicators 29
  requirements 40
  service ports 31
  specifications 40
  temperature with redundant ac power 41
  unused ports 31
  weight and dimensions 41
SAN Volume Controller 2145-CG8
  additional space requirements 39
  air temperature without redundant ac power 38
  circuit breaker requirements 38
  connectors 27
  controls and indicators on the front panel 17
  dimensions and weight 39
  heat output of node 39
  humidity with redundant ac power 38
  humidity without redundant ac power 38
  indicators and controls on the front panel 17
  input-voltage requirements 37
  light path MAP 329
  MAP 5800: Light path 329
  nodes
    heat output 39
  operator-information panel 21
  ports 27
  power requirements for each node 37
  product characteristics 37
  rear-panel indicators 26
  requirements 37
  service ports 28
  specifications 37
  temperature with redundant ac power 38
  unused ports 29
  weight and dimensions 39
SAN Volume Controller 2145-CG8 node
  features 11
SAN Volume Controller 2145-DH8
  boot drive 157
  connectors 24
  controls and indicators on the front panel 15
  indicators and controls on the front panel 15
  light path MAP 323
  MAP 5800: Light path 323
  operator-information panel 20
  ports 24
  rear-panel indicators 24
  service ports 26
  unused ports 26
SAN Volume Controller library
  related publications xii
satask.txt
  commands 70
Security level 245
self-test, power-on 130
sending
  comments xv
serial number 19
service
  actions, uninterruptible power supply 48
service address
  navigation 112
  options 112
Service Address
  option 100
service assistant
  accessing 67
  interface 67
  when to use 67
service CLI
  accessing 69
  when to use 68
service commands
  CLI 68
  create cluster 72
  install software 72
  reset service assistant password 71
  reset service IP address 70
  reset superuser password 70
  snap 71
service controller
  replacing
    validate WWNN 95
Service DHCPv4
  option 113
Service DHCPv6
  option 113
service ports
  SAN Volume Controller 2145-CF8 31
  SAN Volume Controller 2145-CG8 28
  SAN Volume Controller 2145-DH8 26
Set FC Speed
  option 115
shortcut keys
  keyboard 355
shutting down
  front panel display 94
snap command 71
SNMP traps 133
software
  failure, MAP 5050 283, 288
  overview 1
  version
    display 100
space requirements
  2145-DH8 36
  SAN Volume Controller 2145-CF8 42
  SAN Volume Controller 2145-CG8 39
specifications
  redundant AC-power switch 43
speed
  Fibre Channel port 101
Start MAP 275
starting
  clustered system recovery 257
  system recovery 259
  T3 recovery 257
Statement of Limited Warranty, Where to find the 357

T
T3 recovery
  removing
    550 errors 256
    578 errors 256
  restore
    clustered system 253
  starting 257
  what to check 261
  when to run 254
Taiwan
  contact information 364
  electronic emission notice 364
TCP 351
technical assistance xv
technician port
  using 75
test and alarm-reset button 51
trademarks 361
troubleshooting
  event notification email 133, 136
  SAN failures 244
  using error logs 92
  using the front panel 91

U
understanding
  clustered-system recovery codes 161
  error codes 137, 161
  event log 131

V
validating
  volume copies 77
viewing
  event log 132
vital product data (VPD)
  displaying 81
  overview 81
  understanding the fields for the node 83
  understanding the fields for the system 88
  viewing
    nodes 81
volume copies
  validating 77
volumes
  recovering from offline
    using CLI 79, 260
VPD (vital product data)
  displaying 81
  overview 81
  understanding the fields for the node 83
  understanding the fields for the system 88

W
websites xiv
IBM
Printed in USA
GC27-2284-09