Servicing IBM Systems X Servers II - Study Guide
Study Guide
XW0001 Release 3.13 May, 2009
smpdr3.13-xw0001.pdf
May 2009
International Business Machines Corporation, 2009. All rights reserved.

IBM System x Service and Support Education, IBM Systems, Department EYGA, Building 203, Post Office Box 12195, Research Triangle Park, North Carolina 27709-2195

IBM reserves the right to change specifications or other product information without notice. This publication could include technical inaccuracies or typographical errors. References herein to IBM products and services do not imply that IBM intends to make them available in other countries. IBM provides this publication as is, without warranty of any kind, either expressed or implied, including the implied warranties of merchantability or fitness for a particular purpose. Some jurisdictions do not allow disclaimer of expressed or implied warranties. Therefore, this disclaimer may not apply to you. Data on competitive products is obtained from publicly available information and is subject to change without notice. Please contact the manufacturer for the most recent information.

The following terms are trademarks or registered trademarks of IBM Corporation in the United States, other countries or both: Active Memory, Active PCI, AT, BladeCenter, the e-business logo, EasyServ, Enterprise X-Architecture, EtherJet, HelpCenter, HelpWare, IBM RXE-100 Remote Expansion Enclosure, IBM XA-32, IBM XA-64, IntelliStation, LANClient Control Manager, Memory ProteXion, NetBAY3, Netfinity, Netfinity Manager, Predictive Failure Analysis, RXE Expansion Port, SecureWay, ServeRAID, ServerProven, ServicePac, SMART Reaction, SMP Expansion Module, SMP Expansion Port, UM Services, Universal Manageability, Update Connector, Wake on LAN, XceL4 Server Accelerator Cache, XpandOnDemand scalability.

IBM Corporation Subsidiaries: Lotus, Lotus Notes, Domino, and SmartSuite are trademarks of Lotus Development Corporation. Tivoli and Planet Tivoli are trademarks of Tivoli Systems, Inc. LLC, Adobe, and PostScript are trademarks of Adobe Systems, Inc.

Intel Celeron, LANDesk, MMX, Pentium II, Pentium III, Pentium 4, SpeedStep, and Xeon are trademarks or registered trademarks of Intel Corporation. Linux is a trademark of Linus Torvalds. Microsoft Windows and Windows NT are trademarks or registered trademarks of Microsoft Corporation. Other company, product, and service names may be trademarks or service marks of others. For more information, visit: www.ibm.com/legal/copytrade/phtml
Preface
This publication is primarily intended for use by students enrolled in the course Servicing System x Servers Part II (XW0001). This document represents a training technique developed for and used by IBM and is not for sale. Portions of this document, such as foils, charts, and quizzes, may be copied and distributed if required to conduct a class properly. The instructor should exercise good judgment on handouts of this type. The complete document may not be copied for or sold to non-IBM personnel.

Please write your name and address below to personalize your copy.

Issued to: ____________________________________________________
Address: ____________________________________________________
____________________________________________________
____________________________________________________

Current release date: May 2009
Current release level: 3.13
Test numbers for this guide are: xw0001r313

The information contained within this publication is current as of the date of the latest revision and is subject to change at any time without notice. Please forward all comments and suggestions regarding the course material, format, and content to your local IBM System x Service and Support Education country coordinator or contact.
Table of Contents
Preface
Table of Contents
Introduction to the Study Guide
Topic 1 Objectives and Agenda
Topic 2 High-performance System x Server Family Overview
Topic 3 RAID Adapters and Enclosures
Topic 4 High-performance Technologies Review
Topic 5 Working With Scalable Systems
Topic 6 Dynamic System Analysis
Topic 7 Problem Solving
Topic 8 Support References
Welcome!
Before we begin, we need to establish some basics for the course. You need to understand what the course objectives are so you can be sure you are taking the right class. You need to understand what we expect of you by way of previous knowledge. We also need to explain the course agenda so you will know what is about to happen.
This course concentrates on problem determination and the tools that can be used to troubleshoot IBM System x servers. Before you start the practical exercises, however, we will discuss some of the key technologies used in IBM System x servers. The lab exercises revolve around best practice in dealing with IBM System x server problems. This combination of remote lab and paper exercises will enable you to become familiar with the high end of the System x server range and how to perform service on these servers. You will also see some of the fault-tolerant and redundant features of the servers and practice working with servers that have suffered a component failure but are still running the NOS.
Your instructor is
-Your instructor will now introduce herself/himself
You are
-Your instructor will ask you to introduce yourself
-Tell the class what you do (not what your job title is)
-Tell the class how you got into this role
-Tell the class anything else you wish to share
We will be together for some time. It will be useful for us all to get to know each other.
To make the most of this class, you should have completed the following education prior to attending this course:
-Strongly recommended (mandatory in some locations)
A+ Certification
Server+ Certification
-Required
Servicing IBM xSeries Servers Part I (XW2001 R300)
In some locations, if you work with IBM System x server products, you are required to be A+ and Server+ certified. Even if this is not mandatory where you are, we strongly recommend that you are A+ Certified and Server+ certified. Servicing IBM System x Servers Part I is REQUIRED prior to attending this class. XW2001R300 is a self-paced, CD-ROM course. If you have not completed this training prior to attending, you will not get the most from this class. As there is a test at the end of this class, this may impact your ability to pass the end-of-class mastery test.
This chart identifies the position of this course in the IBM System x server service curriculum. This course is a mandatory module towards warranty approval for high-performance System x servers. Service update CD-ROMs are issued periodically to inform service technicians about new products that are announced.
This course provides practical experience through hands-on exercises. This is a list of the key exit skills that you should be able to perform after completing this course.
XW0001 - Servicing IBM System x Servers Part II Lesson Topics in This Course
Lesson topics
Topic 1: Objectives and agenda
Topic 2: High-performance System x Server Family Overview
Topic 3: RAID Adapters and Enclosures
Topic 4: High-performance Technologies Review
Topic 5: Working With Scalable Systems
Topic 6: Dynamic System Analysis
Topic 7: Problem Solving
Topic 8: Support references
Lab exercises
Details are on the next page
Test
Here are the lesson topics in this guide.
Lab exercises
Lab 1 - Locations, Removals, Flash Update and Diagnostics
Lab 2 - Remote Desktop Connection & BIOS Setup
Lab 3 - Utilizing RCM & Virtual Console software
Lab 4 - Preboot DSA / Diagnostics
Lab 5 - Updating with IBM UpdateXpress Service Packs
Lab 6 - ServeRAID Mgr & Spanned Arrays
Lab 7 - MegaRAID Storage Manager
Lab 8 - Utilizing BMC and DSA to gather the Facts
Lab 9a - Scale System x460
Lab 9b - Scale a multi-node x3950 M2
We have now outlined the contents and scope of this course. The next topic is a review of the product and how the components fit together.
We will provide an overview of the IBM System x and xSeries high performance products.
This topic describes high-performance System x server family offerings and some common options.
Non-Scalable IBM System x3755 Overview
-2 to 4 processors, using AMD quad-core Opteron processors with HyperTransport link
-Eight DIMM slots per processor card
-Dual Broadcom 5708c Gigabit Ethernet
-Optional redundant power and cooling
-Standard DVD drive
-Four 3.5 in. hot-swap (HS) SAS HDD bays
-4 PCI Express, 2 PCI-X and 1 HTX I/O slots
-IPMI 2.0 BMC with optional RSA II Slimline refresh
-SAS chipset supporting RAID 0, 1 or 10
-Optional RAID 5 upgrade
-RoHS compliant
-Server Enablement Suite
-Support for Windows, Linux, VMware and NetWare
The x3755 is a low-cost, high-end server based on AMD dual-core Opteron processors. The system supports up to four Opteron revision F processors. Each processor card has eight DIMM slots, so with 4 GB DIMMs the system supports up to 128 GB of memory (4 processor cards x 8 DIMMs x 4 GB). The I/O slot mixture is four PCI-E slots, two PCI-X slots and one HTX slot. The x3755 is a RoHS-compliant system.

Tip: Processors must be installed in order 1 through 4.

Tip: The processor complex uses a passthru card. Older systems may have shipped with only one processor installed; current models ship with two processors as standard. If there is no processor/memory card in processor slot 2, there is no path to the ServerWorks HT2100 B PCI-E bridge unless the passthru card is fitted, so the passthru card must be present in processor slot 2 whenever no processor is installed there. If processor 2 is present, no passthru card is used and none is shipped. However, if processors 1, 2 and 3 are installed, a passthru card must be installed in processor socket 4.

The Baseboard Management Controller (BMC-H8) is a system environmental monitor and controller. It performs low-level system monitoring and LED control functions, using multiple I2C bus connections to communicate out-of-band with other onboard devices. The optional RSA II Slimline Refresh systems management adapter adds advanced service processor alert notification and remote connectivity.
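The passthru card rule in the tip above can be captured as a small lookup. This is only an illustrative sketch: the function and table names are hypothetical, and the entry for four installed processors is an assumption (the text does not state it, but with every slot populated there is no empty slot to bridge).

```python
from typing import Optional

# Processor count -> slot that must hold the passthru card (None = no card).
# Counts 1 to 3 follow the tip in the text; the entry for 4 is an assumption.
PASSTHRU_SLOT_BY_CPU_COUNT = {1: 2, 2: None, 3: 4, 4: None}


def passthru_slot(installed_processors: int) -> Optional[int]:
    """Return the processor slot that needs the passthru card, if any.

    Assumes processors are installed in order 1 through 4, as the text requires.
    """
    if installed_processors not in PASSTHRU_SLOT_BY_CPU_COUNT:
        raise ValueError("the x3755 holds one to four processors")
    return PASSTHRU_SLOT_BY_CPU_COUNT[installed_processors]
```

For example, `passthru_slot(3)` returns 4, matching the case where processors 1, 2 and 3 are installed.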
System x3800, x3850 and x3950 Product Overview
-3U or 7U tower or rack
-XA-64e Enterprise X-Architecture chipset
-1-way to 4-way Intel Xeon MP EM64T processors on the x3800, x3850 and x3950 (up to 32-way on x3950)
-PC2-3200 DDR2 SDRAM, 2-way interleaving
-Disk support: DVD-ROM standard; up to twelve 3.5-inch Serial Attached SCSI (SAS) hot-swap disks (x3800); up to six 2.5-inch SAS hot-swap disks (x3850 and x3950)
-ServeRAID 8i (optional), RAID 0/1/5
-Up to two 1300 W (x3850 and x3950) or three 770 W (x3800) redundant hot-swap power supplies
-Active Memory (support for ChipKill, Memory ProteXion and Memory Mirroring)
-Remote Supervisor Adapter II slim line (optional on x3800 and x3850, standard on x3950)
-Broadcom 5704 dual-port Ethernet
-Active PCI-X 2.0 slots, 64-bit/266 MHz
-3-year, next business day warranty
The IBM Enterprise X-Architecture range of servers offers 3U and 7U rack models for high-volume network transaction processing. These high-performance, symmetric multiprocessing (SMP) servers are ideally suited for networking environments that require superior microprocessor performance, input/output (I/O) flexibility, and high manageability.

EM64T is a 64-bit extension to the Intel IA-32 architecture. It is compatible with legacy IA-32 software while enabling new software to access a larger memory address space. EM64T introduces a new operating mode that includes two sub-modes: (1) compatibility mode, which enables a 64-bit operating system to run most existing legacy 32-bit software unmodified, and (2) 64-bit mode, which enables a 64-bit operating system to run applications written specifically to access the 64-bit address space.

The System x3800 and x3850 offer the Remote Supervisor Adapter II slim line (RSA II) as an option. This adapter significantly enhances the tools available to the service technician for detecting and correcting problems with the server. It supports a web interface to the error logs, dramatically simplifying the troubleshooting task without disturbing the workings of the host server. If the RSA II is connected to an Ethernet network, the system error logs can be viewed and manipulated from a ThinkPad connected to the service processor through the LAN (using a web browser).
-XA-64e 4th generation chipset
-Four processor sockets
-Intel Xeon dual-core, quad-core and six-core processors
-Up to four memory cards
-Up to 8 DIMMs per memory card
-PC2-5300 DDR II
-Disk support: DVD-ROM standard
-Integrated LSI 1078 SAS RAID controller (supports RAID 0 & 1)
-Up to four 2.5-inch SAS hot-swap disks
-MR10
-Memory ProteXion & Memory Mirroring
-Dual embedded Broadcom 5709 Ethernet
-Remote Supervisor Adapter II standard
-Chassis scalability supported
-One or three year, next business day warranty
The IBM System x3850 M2 and x3950 M2 are high-performance, four-socket servers featuring fourth-generation Enterprise X-Architecture. As shipped, the x3850 M2 is non-scalable. The x3950 M2 server contains advanced technology that combines scalable SMP power, PCI-E expansion, fourth-generation Enterprise X-Architecture (EXA), high availability, scalability, and substantial internal data storage capacity. This slide summarizes some of the features of the IBM System x3850 M2 and x3950 M2.

The x3850 M2 supports scaling with the installation of the IBM ScaleXpander Option kit. It becomes an x3950 M2 when scaled. A multi-node configuration interconnects multiple servers. Each multi-node configuration can have one or more scalable partitions. Each scalable partition supports an independent operating system installation. The scalable partition uses a single, contiguous memory space and provides access to all associated adapters and hard disk drives. PCI slot numbering starts with the primary node and continues with the secondary nodes, in numeric order of the logical node ID. The scalability discussion is continued later in this course.
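The PCI slot numbering rule for a scalable partition can be illustrated with a short sketch. The function name is hypothetical, and two details are assumptions rather than statements from this guide: that each node contributes seven slots and that numbering is contiguous starting at 1.

```python
def number_pci_slots(node_ids, primary_id, slots_per_node=7):
    """Map (logical node ID, local slot) to a partition-wide slot number.

    The primary node is numbered first; secondary nodes follow in numeric
    order of their logical node ID. Contiguous numbering from 1 and seven
    slots per node are assumptions made for illustration.
    """
    if primary_id not in node_ids:
        raise ValueError("primary node must be part of the partition")
    secondaries = sorted(n for n in node_ids if n != primary_id)
    numbering = {}
    next_slot = 1
    for node in [primary_id] + secondaries:
        for local_slot in range(1, slots_per_node + 1):
            numbering[(node, local_slot)] = next_slot
            next_slot += 1
    return numbering
```

With two nodes and node 0 as primary, local slot 1 of node 1 maps to slot 8 under these assumptions.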
Enterprise X-Architecture Overall Design
Scalable systems use the IBM Enterprise X-Architecture (EXA) and IBM XA-64e fourth-generation chipset
Note: Third-generation EXA chipsets use system memory for L4 cache
Scalable systems need a sophisticated chipset to enable processors and memory to be shared across multiple chassis under a single OS. This diagram shows the overall schematic of the EXA chipset. The processor and memory bus can be extended with the use of scalability cables, effectively joining the processors, memory and I/O into a single hardware set.
Architecture x3800, x3850, x3950, x3950E
Not present on x3800 or x3850
The x3850, x3950 and x3950E use the third generation of the IBM XA-64e chipset. The architecture consists of the following components:
-One to four Xeon MP processors
-One Memory and I/O Controller (MIOC)
-Two PCI bridges

Each memory port out of the memory controller has a peak throughput of 5.33 GBps. DIMMs are installed in matched pairs (two-way interleaving) to ensure that the memory port is fully utilized. Peak throughput for each PC2-3200 DDR2 DIMM is 2.67 GBps. (The DIMMs are run at 333 MHz to remain in sync with the throughput of the front-side bus.) In addition, there are four memory ports; spreading installed DIMMs across all four memory ports can improve performance, because the four independent memory ports (memory cards) provide concurrent access to memory. With four memory cards installed (and DIMMs in each card), peak memory bandwidth is 21.33 GBps.

The memory controller routes all traffic from the four memory ports, the two CPU ports and the two PCI bridge ports. The memory controller also has embedded DRAM, which in the x366, x3800 and x3850 holds a snoop filter lookup table. This filter ensures that snoop requests for cache lines go to the appropriate CPU bus and not to both of them, thereby improving performance.

One PCI bridge supplies four of the six 64-bit 266 MHz PCI-X slots on four independent PCI-X buses. The other PCI bridge supplies the other two PCI-X slots (also 64-bit, 266 MHz), plus all the onboard PCI devices.

This illustration details the interconnect and board components.
CPLD = Complex Programmable Logic Device
BMC = Baseboard Management Controller
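The memory bandwidth figures quoted above for the third-generation chipset follow from simple arithmetic, sketched here. The 8-byte DIMM data path is an assumption (it is the value that reproduces the 2.67 GBps per-DIMM figure), and the function names are illustrative only.

```python
DIMM_RATE_MHZ = 1000 / 3   # DIMMs run at 333 MHz, per the text
DIMM_WIDTH_BYTES = 8       # assumption: 64-bit DIMM data path
DIMMS_PER_PORT = 2         # matched pairs (two-way interleaving)
MEMORY_PORTS = 4           # one port per memory card


def dimm_gbps() -> float:
    """Peak throughput of one DIMM in GBps."""
    return DIMM_RATE_MHZ * DIMM_WIDTH_BYTES / 1000


def port_gbps() -> float:
    """Peak throughput of one memory port with an interleaved DIMM pair."""
    return dimm_gbps() * DIMMS_PER_PORT


def system_gbps(ports: int = MEMORY_PORTS) -> float:
    """Peak memory bandwidth with DIMMs spread across the given ports."""
    return port_gbps() * ports
```

Rounded to two decimals these give 2.67, 5.33 and 21.33 GBps. Calling `system_gbps(2)` shows why populating only two memory cards halves the peak figure.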
The x3850 M2 and x3950 M2 use the fourth generation of the IBM XA-64e chipset. The architecture consists of the following components:
-One to four Xeon dual-core or quad-core processors
-4 Memory and I/O Controller (MIOC)
-Eight high speed memory buffers
-Two PCI Express bridges
-One South bridge

PCI bridge 1 supplies four of the seven PCI Express x8 slots on four independent PCI Express buses. PCI bridge 2 supplies the other three PCI Express x8 slots plus the onboard SAS devices, including the optional ServeRAID-MR10k. A separate South bridge supplies all the other onboard PCI devices, such as the USB ports, onboard Ethernet and the standard RSA II.

As this is a multi-board system (processor board, I/O board and RSA II adapter), hardware replacements require careful thought to ensure a working system when a board is replaced. Code is located on all major boards in the system, and this code must be matched for release levels to ensure proper operation.

The components represented by the black boxes require a BIOS/firmware update after parts replacement:
-CPU card: BIOS, BMC, and FPGA (Field Programmable Gate Array) code. The FPGA is very similar to the CPLD in previous systems.
-TPM (Trusted Platform Module)
-I/O card: SAS, Ethernet, FPGA, and DSA (Diagnostics) code
-RSA II adapter
-Broadcom 5709 Ethernet controller
-ServeRAID-MR10k SAS/SATA controller (if present)
Processor Architecture: Single to Dual Core
Up to this point, we have sometimes referred to these x86 platforms as 2-socket, 4-socket, 8-socket, or 16-socket configurations. Historically, we have referred to these systems as n-way systems. Because of current trends in microprocessors, the term n-way or n-CPU could become misleading if not used in the proper context.

The dual-core processors in the x3950 are the first Intel processors to offer multiple cores. A dual-core processor is a concept similar to a two-way system, except that the two cores are integrated into one silicon die. This brings the benefits of two-way SMP with less power consumption and faster data throughput between the two cores. To keep power consumption down, the resulting core frequency is lower, but the additional processing capacity means an overall gain in performance.

In addition to the two cores, the dual-core processor has separate L1 instruction and data caches for each core, as well as separate execution units (integer, floating point, and so on), registers, issue ports, and pipelines for each core. A dual-core processor achieves more parallelism than Hyper-Threading Technology, because these resources are not shared between the two cores. Estimates are that there is a 1.2 to 1.5 times improvement when comparing the dual-core Xeon MP with the current single-core Xeon MP.

With double the number of cores for the same number of sockets, it is even more important that the memory subsystem is able to meet the demand for data throughput. The 21 GBps peak throughput of the X3 Architecture of the x3950 with four memory cards is well suited to dual-core processors.

For additional information, refer to the IBM Redbook Virtualization on the IBM System x3950 Server, Publication # SG 24-790-00.
Processor Architecture: Dual Core to Quad Core
A dual-core processor is a concept similar to a two-way SMP system, except that the two processors, or cores, are integrated into one silicon die. This brings the benefits of two-way SMP with less power consumption and faster data throughput between the two cores. To keep power consumption down, the resulting core frequency is lower, but the additional processing capacity means an overall gain in performance. The quad-core processors add two more cores onto the same die. Hyper-Threading Technology is not supported.

Each core has separate L1 instruction and data caches, as well as separate execution units (integer, floating point, and so on), registers, issue ports, and pipelines. A multi-core processor achieves more parallelism than Hyper-Threading Technology, because these resources are not shared between the cores.

With double and quadruple the number of cores for the same number of sockets, it is even more important that the memory subsystem is able to meet the demand for data throughput. The 34.1 GBps peak throughput of the x3850 M2 and x3950 M2 eX4 Architecture with four memory cards is well suited to dual-core and quad-core processors.

1066 MHz front-side bus
The Xeon MP uses two 266 MHz clocks, out of phase with each other by 90 degrees, and uses both edges of each clock to transmit data. A quad-pumped 266 MHz bus therefore results in a 1066 MHz front-side bus. The bus is eight bytes wide, which means it has an effective burst throughput of 8.53 GBps. This can have a substantial impact, especially on TCP/IP-based LAN traffic.
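The front-side bus figures can be checked with a few lines of arithmetic. This is a sketch only: the names are illustrative, and the base clock is taken as 266.67 MHz (800/3), which is an assumption about the exact value behind the rounded "266 MHz" in the text.

```python
BASE_CLOCK_MHZ = 800 / 3      # assumption: "266 MHz" is 266.67 MHz
TRANSFERS_PER_CLOCK = 4       # two clocks 90 degrees apart, both edges used
BUS_WIDTH_BYTES = 8           # the bus is eight bytes wide


def effective_bus_mhz(base_mhz: float = BASE_CLOCK_MHZ) -> float:
    """Effective transfer rate of the quad-pumped front-side bus."""
    return base_mhz * TRANSFERS_PER_CLOCK


def burst_throughput_gbps(bus_mhz: float) -> float:
    """Peak burst throughput in GBps for an eight-byte-wide bus."""
    return bus_mhz * BUS_WIDTH_BYTES / 1000
```

`burst_throughput_gbps(effective_bus_mhz())` evaluates to about 8.53, in line with the figure quoted above.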
This group of servers offers a high degree of redundancy, fault tolerance and manageability hardware. The BMC and RSA II are discussed in detail later.
Server Security Software (SSS) Trusted Platform Module (TPM)
Trusted Platform Module (TPM) Management is a new feature offered by Microsoft Windows. This feature will be available after the release of Windows Server 2008 Network Operating System. The feature set includes the TPM Management console, and an API called TPM Base Services (TBS). This architecture provides an infrastructure that allows Windows-based applications to use and share the TPM. TPM has the ability to create cryptographic keys and encrypt them so that they can be decrypted only by the TPM. This process, often called "wrapping" or "binding" a key, can help protect the key from disclosure. The TPM can also seal and unseal data generated outside of the TPM. With this sealed key and software like Microsoft Windows BitLocker Drive Encryption, you can lock data until specific hardware or software conditions are met. With a TPM, private portions of key pairs are kept separated from the memory controlled by the operating system.
Here, we will discuss IBM RAID adapters and enclosures commonly associated with System x servers.
This topic describes the IBM RAID levels, the ServeRAID adapter family, and storage enclosures.
Array
A group of physical disks
Logical Drive
Who has control?
An array is a grouping of physical disks. A logical drive is a term given to part or all of an array. An array can contain multiple logical drives. Logical drives are recognized by the OS as physical disks.
RAID-0 stripes (or spreads) data across multiple disk drives without parity protection in order to maximize DASD performance. Performance is improved with larger files because reads and writes are overlapped across all disks. An additional benefit of RAID-0 is "drive spanning": with data spread across multiple drives in the array, the logical drive size is the sum of the individual drive capacities. RAID-0 is the only level of RAID that does not provide any type of fault tolerance. In other words, the failure of one drive will cause the entire disk subsystem to fail.
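The striping idea can be sketched as a simple mapping from logical block numbers to disks. This is an illustration under stated assumptions, not any controller's actual algorithm: it assumes a stripe unit of one block distributed round-robin, and the function names are hypothetical.

```python
def locate_block(logical_block: int, num_disks: int):
    """Return (disk index, block offset on that disk) for a RAID-0 array.

    Assumes one block per stripe unit and round-robin placement, both
    zero-based. Real controllers use a configurable stripe-unit size.
    """
    if num_disks < 1 or logical_block < 0:
        raise ValueError("need at least one disk and a non-negative block")
    return logical_block % num_disks, logical_block // num_disks


def raid0_capacity(drive_sizes_gb):
    """Logical drive size: the sum of the individual drive capacities."""
    return sum(drive_sizes_gb)
```

Consecutive logical blocks land on different disks, which is why large sequential transfers overlap across all drives. It also shows why a single drive failure affects every large file.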
[Figure: RAID-1 disk mirroring and disk duplexing. Disk 1 holds the data; Disk 2 holds the mirrored data.]
RAID-1 is either disk mirroring or disk duplexing. Disk mirroring involves duplicating the data from one disk onto a second disk using a single controller. Disk duplexing is the same as mirroring in all respects, except that the disks are attached to separate controllers. With duplexing, the server can tolerate the loss of one disk controller or one disk without losing the disk subsystem's availability or the customer's data. Since each disk is attached to a separate controller, performance and throughput may be further improved. NetWare splits seeks, reading half from the data drive and half from the mirrored drive.
                  Disk 1      Disk 2      Disk 3
Data Stripe       Data 1      Data 2      Data 3
Mirrored Stripe   Mirror 3    Mirror 1    Mirror 2
Data Stripe       Data 4      Data 5      Data 6
Mirrored Stripe   Mirror 6    Mirror 4    Mirror 5
RAID 1e offers an enhanced version of RAID-1 that combines mirroring with data striping. The first stripe is for data and the second is for mirrored data offset by one drive. This allows for improved performance and increased flexibility in configuring mirroring for greater than two drives.
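The "mirror offset by one drive" layout described above can be generated programmatically. This is a sketch for illustration; the function name is hypothetical and block labels are numbered from 1 to match the slide.

```python
def raid1e_layout(num_disks: int, data_stripes: int):
    """Build RAID 1e rows: each data stripe is followed by a mirrored
    stripe in which every copy sits one drive to the right (wrapping)."""
    if num_disks < 3:
        raise ValueError("this sketch targets arrays of three or more drives")
    rows = []
    block = 1
    for _ in range(data_stripes):
        blocks = list(range(block, block + num_disks))
        rows.append([f"Data {b}" for b in blocks])
        mirror = [""] * num_disks
        for disk, b in enumerate(blocks):
            mirror[(disk + 1) % num_disks] = f"Mirror {b}"
        rows.append(mirror)
        block += num_disks
    return rows
```

For three disks and two data stripes, the second row is Mirror 3, Mirror 1, Mirror 2, so no drive ever holds both a block and its own copy.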
RAID 5 (Data Striping with Parity)
-Stripes data and parity information, sectors at a time, across all disks
-Parity information is also striped across all disks
-Requires a minimum of three disks
-If any one disk fails, the data can still be accessed

            Disk 1     Disk 2                   Disk 3                   Disk 4
Stripe 1    Block 1    Block 2                  Block 3                  Checksum of blocks 1-3
Stripe 2    Block 4    Block 5                  Checksum of blocks 4-6   Block 6
Stripe 3    Block 7    Checksum of blocks 7-9   Block 8                  Block 9
...
Stripe x    Block n-2, Block n-1, Block n
RAID-5 spreads both the data and the checksum (parity) information evenly across the disks, one block at a time, to ensure maximum read performance when accessing large files and to improve array performance in a transaction processing environment. This removes the bottleneck of storing all of the parity data on one drive.
-High transaction rate (good for random transactions)
-Drives operate independently (they do not need to be in sync)
-Better server performance than RAID 2, 3 and 4
-Low reliability cost: capacity of one drive per array
The equivalent of one drive per array is used for the parity data, regardless of the size of the array. The capacity left for data storage is therefore always N - 1 drives, where N is the number of drives in the array.
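To make the parity idea concrete, the sketch below uses bytewise XOR, the standard parity operation for RAID 5 (this guide calls the result a "checksum"). The function names are illustrative. The parity rotation matches the four-disk figure for the first three stripes; continuing the same pattern beyond them is an assumption.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe_index: int, num_disks: int) -> int:
    """Zero-based disk holding parity for a zero-based stripe.

    Parity moves one disk to the left per stripe, starting on the last
    disk, as in the four-disk figure (assumed to continue cyclically).
    """
    return (num_disks - 1 - stripe_index) % num_disks


# One stripe on a four-disk array: three data blocks plus parity.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)

# If the disk holding data[1] fails, XOR of the survivors rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, any single missing block in a stripe can be recomputed from the remaining blocks, which is why the array stays accessible after one disk failure.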
-Stripes data, sectors at a time, across all disks with an additional stripe for parity information and hot-spare space
-Requires a minimum of four disks
-If any one disk fails, the data and parity information will be redistributed on the remaining drives (Logical Drive Migration)
-Capacity of n - 2 (n = number of disks)
[Figure: RAID 5E layout. Stripes 1 to x hold data blocks (for example Data 4, Data 7, Data 10) and parity, plus hot-spare (HSP) space.]
RAID 5E is firmware-specific. You can think of RAID 5E as RAID 5 with a built-in spare drive. Reading from, and writing to, four disk drives is more efficient than three disk drives and therefore improves performance. Additionally, the spare drive is actually part of the RAID 5E array. With such a configuration, you cannot share the spare drive with other arrays; if you want a spare drive for any other array, you must provide another spare drive for that array. Like RAID 5, RAID 5E stripes data and parity across all of the drives in the array. When an array is assigned RAID 5E, the capacity of the logical drive is reduced by the capacity of two physical drives in the array (that is, one for parity and one for the spare). RAID 5E is a good choice because it offers both data protection and increased throughput, in addition to the built-in spare drive. RAID 5E gives you better utilization of the array's physical capacity than RAID 1, but RAID 1 offers better performance. RAID 5E was superseded by RAID 5EE, in which hot-spare (HSP) space is reserved in every stripe; most prefer the RAID 5EE implementation.
RAID 6 Block striping with double distributed parity
-RAID 6 reserves the equivalent of two disks in the array for parity information and stores two separately calculated checksums on different disks
-Can survive the loss of two disks before data loss occurs
[Figure: RAID 6 layout across stripes 1 to x, with data blocks (A0, A1, ...) and two sets of parity blocks (P0 to P3 and PA to PD) distributed across the disks.]
RAID 6 is a newly emerging RAID level designed to address modern data storage needs. As RAID arrays increase in size and complexity, the ability to survive more than one disk failure becomes more important to avoid catastrophic data loss. RAID 6 uses block striping with double distributed parity: it reserves the equivalent of two disks in the array for parity information and stores two separately calculated checksums on different disks, so the array can survive the loss of two disks before data loss occurs.
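The capacity rules given for the striped RAID levels in this topic (all drives for RAID 0, N - 1 for RAID 5, N - 2 for RAID 5E and RAID 6) can be collected into one helper. This is a sketch with hypothetical names; the minimum drive counts for RAID 0 and RAID 6 are assumptions, since the text states minimums only for RAID 5 (three) and RAID 5E (four). It also assumes all drives are the same size.

```python
# Drives' worth of capacity given up to parity or hot-spare space.
OVERHEAD_DRIVES = {"0": 0, "5": 1, "5E": 2, "6": 2}

# Minimum drives: RAID 5 and 5E per the text; RAID 0 and 6 are assumptions.
MIN_DRIVES = {"0": 2, "5": 3, "5E": 4, "6": 4}


def usable_capacity_gb(level: str, drives: int, drive_gb: int) -> int:
    """Usable capacity of an array of identical drives for a RAID level."""
    if level not in OVERHEAD_DRIVES:
        raise ValueError(f"unsupported RAID level in this sketch: {level}")
    if drives < MIN_DRIVES[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DRIVES[level]} drives")
    return (drives - OVERHEAD_DRIVES[level]) * drive_gb
```

For example, four 300 GB drives yield 900 GB under RAID 5 but 600 GB under RAID 5E or RAID 6, the price of the built-in spare or the second parity set.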
ServeRAID 4 family
Ultra160 SCSI with one, two or four channels
-RAID levels 0, 1, 1e, 5, 5e, 00, 10, 1e0, 50
-Support for up to 56 disks
ServeRAID 5i, 6i
Zero channel RAID adapter (works with onboard SCSI controller)
-Uses full ServeRAID software stack
-Has BIOS, firmware, device drivers, and utilities
-RAID levels 0, 1, 1e, 5, 00, 10, 1e0, and 50
ServeRAID 6m
Ultra320 SCSI with two channels
-RAID levels 0, 1, 1e, 5ee, 00, 10, 1e0 and 50
The NOS device drivers are model specific.
The ServeRAID 4 adapter family shares the characteristics listed here. It comes in several different flavors (4L/4Lx, 4m, and 4H).
The ServeRAID 5i and 6i adapters have no internal or external SCSI connectors. They use the server's onboard SCSI controller but enhance the basic features to provide support for additional RAID levels.
The ServeRAID 6m is a dual-channel Ultra320 SCSI controller.
SATA and SAS ServeRAID Adapters
ServeRAID 7t
1.5 Gbps per port serial ATA (SATA) controller
-RAID levels 0, 1, 5, 10
-Up to four SATA disks on four separate ports
ServeRAID 7k
-The option is shipped as a special memory DIMM with a battery attached (for battery-backup purposes)
-Memory is 256 MB, 133 MHz (PC2100) DDR1 memory
-RAID levels 0, 1, 5, 10
The ServeRAID 7t is designed for smaller servers that require RAID support with SATA disks. A maximum of four disks can be connected to the ServeRAID 7t. It is unlikely that you will see a ServeRAID 7t in a high-end server, as the controller does not support the SCSI or SAS backplanes that are common in high-end models. However, a customer may choose to add such an adapter to a system that can support non-hot-swap disks.
The battery of the 7k adapter provides up to 33 hours of backup.
XW0001 - Servicing IBM System x Servers Part II SATA and SAS ServeRAID Adapters
ServeRAID 8i
3.0 Gbps per port serial attached SCSI (SAS) controller
-RAID levels 0, 1, 5, 5ee, 6, 10, 1e0, 50, 60
-Up to eight SAS ports
ServeRAID 8k
- This option is shipped as a special memory DIMM with a battery attached via wires (for battery-backup purposes)
- The DIMM is installed in a special DIMM socket in supported servers
- Battery is connected to the DIMM by wires and is typically mounted on the server chassis
The ServeRAID 8i and 8k were introduced to support the third-generation Enterprise X-Architecture servers, which are built around SAS disk subsystems. The ServeRAID 8k option:
- Is shipped as a special memory DIMM with a battery attached via wires (for battery-backup purposes)
- Is installed in a special DIMM socket in supported servers
- Has five DRAM chips on the DIMM, with "Adaptec ATB-200" on the battery side
- Provides 256 MB of 533 MHz DDR2 unbuffered write-back cache memory
- Has its battery connected to the DIMM by wires and typically mounted on the server chassis
ServeRAID 10 (MR10i, MR10k, MR10M)
LSI 1078 RAID Adapter (MR10i/is, MR10k, MR10E)
-Eight-port SAS RAID adapter with two SAS connectors
-3 Gb/s throughput per port (full duplex)
-RAID levels 0, 1, 5, 6, 10, 50 and 60, with support for arrays greater than 2 TB
-x8 PCI Express host interface
-Battery-backed 256 MB DDRII 667 MHz SDRAM DIMM module
-The 10is offers encryption/security
-Protects data in cache for up to 72 hours during power loss or MegaRAID controller failure
-Allows system administrators to replace a failed adapter while the data on the DIMM module remains protected for up to 72 hours
-iTBBU support
-122 device support
-RoHS and WEEE compliant
-3 Gbps Serial Attached SCSI (SAS) host interface technology
-Easy to deploy and manage with the DS3000 Storage Manager
-Combination of 12 SAS or SATA 3.5" drives per enclosure
-Scalable to 3.6 TB of storage capacity with 300 GB hot-swappable SAS disks or 12.0 TB with 1.0 TB hot-swappable SATA disks in the first enclosure
-Expandable by attaching up to three EXP3000s, for a total of 14.4 TB of storage capacity with 300 GB SAS disks or up to 48.0 TB with 1.0 TB SATA disks
-Telco model supports -48 V DC power supplies
-NEBS and ETSI compliance for AC and DC models
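The capacity figures above follow directly from the drive count per enclosure. The short Python sketch below reproduces them; the function name `raw_capacity_tb` is purely illustrative, and it assumes raw (unformatted, pre-RAID) capacity in decimal units where 1 TB = 1000 GB.

```python
DRIVES_PER_ENCLOSURE = 12      # from the text: 12 drives per enclosure
MAX_EXPANSION_ENCLOSURES = 3   # from the text: up to three EXP3000s


def raw_capacity_tb(drive_gb, expansion_enclosures=0):
    """Raw capacity in TB for the first enclosure plus any EXP3000 units.

    Assumes every bay holds a drive of the same size and ignores RAID
    overhead and hot spares.
    """
    if not 0 <= expansion_enclosures <= MAX_EXPANSION_ENCLOSURES:
        raise ValueError("between 0 and 3 expansion enclosures are supported")
    enclosures = 1 + expansion_enclosures
    return enclosures * DRIVES_PER_ENCLOSURE * drive_gb / 1000


print(raw_capacity_tb(300))       # first enclosure, 300 GB SAS
print(raw_capacity_tb(1000))      # first enclosure, 1.0 TB SATA
print(raw_capacity_tb(300, 3))    # with three EXP3000s, 300 GB SAS
print(raw_capacity_tb(1000, 3))   # with three EXP3000s, 1.0 TB SATA
```

Usable capacity will be lower once a RAID level and any hot-spare drives are chosen.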
-DS3300
iSCSI host-side connection
-DS3400
Fibre Channel host-side connection
The DS3000 family of storage servers provides flexible connections for external, managed storage. SAS, iSCSI and FC models are available. All disks can be SAS or SATA. The host requires the appropriate host bus adapter for the chosen model: a SAS adapter for the DS3200, an iSCSI (Ethernet) adapter for the DS3300 and an FC HBA for the DS3400.
This picture shows the rear view of the DS3200 chassis with dual power supplies and ESMs.
iSCSI Ports
Components covered
This picture ends the series showing the rear view of the DS3400.
This topic gave an overview of IBM RAID levels and the currently offered ServeRAID adapters. It also discussed the storage solutions that IBM System x offers.
The prerequisite to this course introduced the design principles of IBM System x and xSeries servers and how to service them. This topic looks more closely at what these design principles mean in practice when servicing a System x or xSeries server.
All System x servers support some of the more advanced technologies that IBM has designed and developed. This topic discusses these technologies and describes the implications of working with them in the field.
Processor Technologies
The industry standard server (Intel processor-based) takes many forms. There are a number of processor types in common use today. This section discusses some of the features of the Intel processor family and reviews some of the service implications when working with processor problems.
Intel processors
Dual processor capable
-Xeon DP
AMD Processors
-AMD Opteron family of processors
The Intel processor family has several offerings in common use today. High-performance System x servers use all of the processors in the chart above. Note that, although the servers discussed in this course are multi-processor capable, not all servers you see in the field will actually have multiple processors installed. In many cases, the base server ships with one processor, with spare slots or sockets for additional processors as the customer's needs grow.
The x3950 M2 provides an uncomplicated, cost-effective and highly flexible solution. With the ability to scale up to a maximum of 96 cores using Intel six-core processors, while maintaining balanced performance between processors, memory and I/O, the x3950 M2 can easily accommodate business expansion and the resulting need for additional application space. The configurations are flexible: a system can be populated with a minimum of two CPUs per chassis for additional access to memory and I/O, to address an organization's specific application requirements. This flexibility allows for the creation of a 12-core, 32-DIMM server utilizing only two processor sockets for processor licensing-constrained applications, and can be scaled to a 48-core, 128-DIMM server utilizing only eight processors.
For servers equipped with AMD processors, IBM uses the Opteron multi-core parts.
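The core and DIMM counts quoted for the x3950 M2 can be checked with simple arithmetic. The Python sketch below is illustrative only: the six cores per processor and 32 DIMMs per chassis come from the text, while the limits of four chassis and four sockets per chassis are assumptions inferred from the 96-core maximum, and the function name `configuration` is hypothetical.

```python
CORES_PER_PROCESSOR = 6    # from the text: Intel six-core processors
DIMMS_PER_CHASSIS = 32     # from the text: 32 DIMMs with a single chassis
MAX_CHASSIS = 4            # assumption: inferred from 96 cores = 4 x 4 x 6
SOCKETS_PER_CHASSIS = 4    # assumption: inferred from the same figure


def configuration(chassis, processors_per_chassis):
    """Return (total cores, total DIMM slots) for a scaled configuration."""
    if not 1 <= chassis <= MAX_CHASSIS:
        raise ValueError("chassis count out of range")
    if not 1 <= processors_per_chassis <= SOCKETS_PER_CHASSIS:
        raise ValueError("processor count out of range")
    cores = chassis * processors_per_chassis * CORES_PER_PROCESSOR
    dimms = chassis * DIMMS_PER_CHASSIS
    return cores, dimms


print(configuration(1, 2))   # two sockets populated in one chassis
print(configuration(4, 2))   # eight processors across four chassis
print(configuration(4, 4))   # fully populated
```

The middle case shows why a licensing-constrained customer might populate only two sockets per chassis: the DIMM count scales with the number of chassis, not with the number of processors.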
Processor and VRM Failures
Help is available if a processor or VRM fails as the Service Processor will log the event. When the SP detects a failed processor or VRM, it handles the error and attempts to make the server functional. The SP will deal with this situation by attempting to re-boot the server to any surviving processors.
Replacing a Failed Processor or VRM
Service implications
-System restarts if there are good processors/VRMs remaining
-Processor slot may need to be manually re-enabled upon repair
-Following replacement of the failed component, run Setup (F1) to check processor slot status
-The PDSG will advise the correct part for VRMs (slot or system board)
A VRM failure may give the appearance that a processor has failed. The system event log should capture the specific details of the failure and enable you to identify whether it was the VRM or the processor that had the error. If it is the VRM, it could be in one of several places in the server. Some System x and xSeries models have VRMs built into the system board, some have VRM slots and some have both. Your knowledge of the System x or xSeries model will help you to identify the exact location. Light Path Diagnostics will usually indicate the failing part. If no light is visible, or if you need to verify the failure, check the HMM/PDG for additional information.
Once the failed part has been identified and replaced, it is important that you test the system to make sure that the associated processor is functioning normally. After the component is replaced, BIOS may or may not detect and initialize the processor slot automatically. It may be necessary to manually enable the processor slot before the system is restored to full functionality. Check the HMM/PDG for the correct procedure.
Memory Technologies
This section looks at IBM's memory protection technologies and describes how they change the behavior of a server and how you service it when memory faults occur.
Error Checking and Correcting (ECC) Memory
Additional bits on a memory DIMM store checksum data to verify memory contents (72 bits vs. 64)
-During each write, a new checksum is calculated and stored in the additional bits on the DIMM
-During a read, the checksum is compared with the data bits and verifies data as valid and/or corrects single bit errors
Non-server systems traditionally use 64-bit (non-parity) memory, but for servers the absolute minimum memory quality requirement is ECC (72 bits). This memory type is standard across the System x and xSeries server range. Due to the nature of memory configurations in modern servers, a single bit error is still the most common type of error. ECC can detect and correct any single bit error and works well for most general purpose server requirements.
From a service perspective, if ECC is correcting a persistent error, the DIMM ultimately needs to be replaced. Unless the server has encountered a second, uncorrectable error, it is likely that the server will still be running. You may need to schedule a suitable time to replace the failing DIMM.
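To make the "72 bits vs. 64" idea concrete, the sketch below implements a textbook extended Hamming code (single-error correction, double-error detection) over a 64-bit value in Python. It is an illustration of the principle only, not the code used by any particular memory controller; the bit layout and the function names `encode` and `decode` are assumptions chosen for clarity.

```python
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]   # 7 Hamming check bits
# Position 0 holds an overall parity bit, giving 8 check bits in total.
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]


def encode(data):
    """Spread a 64-bit integer over a 72-bit word and fill in check bits."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit makes the parity of the positions it covers even.
        word[p] = sum(word[pos] for pos in range(1, 72) if pos & p) % 2
    word[0] = sum(word) % 2                   # overall parity of the word
    return word


def decode(word):
    """Return (data, status); data is None if the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        word[syndrome] ^= 1                   # single-bit error: flip it back
        status = "corrected"
    else:
        return None, "uncorrectable"          # e.g. a double-bit error
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= word[pos] << i
    return data, status


stored = encode(0x0123456789ABCDEF)
stored[37] ^= 1                               # simulate one bad bit
print(decode(stored))
stored[5] ^= 1                                # simulate a second bad bit
print(decode(stored))
```

The first call recovers the original value and reports a correction; the second reports an uncorrectable error, which corresponds to the case where the server can no longer keep running on that data.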
ChipKill memory provides a higher level of error checking and correcting capabilities
- Uses standard ECC DIMMs
- Corrects up to 4-bit memory errors
IBM patented technology performs on-the-fly correction
-Improves reliability 600 times over standard ECC memory
-Especially important for business-critical applications where large amounts of memory are installed
-1 in 5 servers with more than 1 GB of memory may have multi-bit errors each year
-Large database servers can take many extra hours to recover from a system failure (for example, time to re-initialize the database)
Where standard ECC protection is not enough, many high-end IBM System x and xSeries servers now offer ChipKill support. This technology extends the basic ECC capabilities to tolerate the loss of an entire DRAM device on a DIMM, the equivalent of 4 bits of bad data. Very large databases can take several hours to resynchronize, rebuild or restart following a shutdown, so a customer should factor this into service availability planning when deciding on a suitable memory technology for their server. As with ECC, a memory system that has invoked a ChipKill event is likely to be running. You will not be able to simply take out the bad DIMM and replace it without scheduling a suitable time with the customer.
Hot-spare memory reserves a bank of memory to cover the user memory in the server. The extra/hot-spare memory is idle until it is needed. The Service Processor monitors memory performance and tracks errors. Before the ECC threshold is reached, the failing memory is copied to the hot-spare DIMMs during the refresh cycle, and the questionable memory is switched off. In order for this to work, the failure must be correctable by ECC or ChipKill correction algorithms and the memory swapped by the controller before a fatal error halts/crashes the NOS. Traditionally, for hot-spare memory to work, all memory in all banks must be identical.
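The failover decision described above can be pictured as a simple error counter per bank. The Python sketch below is a hypothetical model, not IBM firmware: the class name `HotSpareMonitor`, the `failover_at` count and the returned strings are all assumptions, and it models a single spare bank.

```python
class HotSpareMonitor:
    """Toy model of a service processor tracking correctable memory errors."""

    def __init__(self, failover_at):
        # failover_at is set below the point where ECC would give up, so the
        # copy to the spare happens while errors are still correctable.
        self.failover_at = failover_at
        self.error_counts = {}
        self.spare_covers = None      # bank currently replaced by the spare
        self.disabled = set()

    def correctable_error(self, bank):
        if bank in self.disabled:
            return "ignored"          # bank is already switched off
        self.error_counts[bank] = self.error_counts.get(bank, 0) + 1
        if self.error_counts[bank] < self.failover_at:
            return "logged"
        if self.spare_covers is not None:
            return "no-spare"         # spare already used; DIMM needs service
        self.spare_covers = bank      # copy contents to the spare bank
        self.disabled.add(bank)       # switch off the questionable memory
        return "failed-over"


monitor = HotSpareMonitor(failover_at=3)
print([monitor.correctable_error("bank 1") for _ in range(3)])
print([monitor.correctable_error("bank 2") for _ in range(3)])
```

Once the spare has been consumed, a second failing bank can only be logged, which is why the failed DIMM should still be replaced at the next opportunity.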
Memory ProteXion is the term given to the memory system of a number of System x and xSeries servers that are based around the IBM Enterprise X-Architecture chipsets. The memory configuration provides for spare bits on each DIMM. If a bit of memory goes bad, the memory controller moves it to a new location on the DIMM. (ECC has traditionally used 8 extra bits out of 72 to provide protection, but recent innovations at IBM have found a way to do that with only 6, leaving two spare bits out of every 72.)
Memory can also be mirrored. In a mirrored configuration, half of the memory is reserved for the copy, so the total maximum possible memory is reduced by half. As with all memory failures, the bad DIMM must ultimately be replaced to avoid further failures stopping the NOS. However, only in a mirrored memory configuration will you be allowed to hot replace a failed DIMM. If you are unable to hot replace a failed DIMM, you are likely to need to schedule downtime, because high-performance System x and xSeries servers are built to survive even serious memory faults and will probably still be running.
When a DIMM Fails in a Server
Service implications
- If memory is mirrored, the server will be running and it may be possible to remove the failed DIMM without stopping the NOS
- If memory is not mirrored, the system may have restarted itself if there was good memory remaining
- If so, it will be necessary to shut down the server to make repairs
- Memory slot or bank may need to be manually re-enabled
- Following replacement of the failed component, run Setup (F1) to check memory slot status
When the SP detects a failed DIMM, it handles the error and attempts to make the server functional. If memory is mirrored, the hardware will have switched off the port containing the bad DIMM. In this case, you may be able to remove the failed DIMM without stopping the NOS. The procedures for removing a failed DIMM in a mirrored configuration are contained in the HMM or PDG. In a system without mirrored memory, you will need to shut down the server to replace a failed DIMM. Upon replacement of the component, the initialization of the memory slot may or may not be automatically detected by BIOS. It may be necessary to manually enable the DIMM slot or a bank of DIMMs before the system is restored to full functionality.
Active PCI, PCI-X and PCI-Express were developed to add the ability to hot add, remove and replace adapters and controllers to a system without the need to shut down the NOS. This section describes the technology and how to work with it.
-In a redundant adapter configuration, failed adapters can be removed and replaced without shutting down the OS
PCI-
While not exclusive to high-end servers, Active PCI technology is common to all high-end System x and xSeries servers. Active PCI (and Active PCI-X) makes it possible to add, remove and replace adapters while the NOS is running. Device drivers are needed to support both the technology and any adapters that will make use of it. Where two adapters are coupled together in a redundant configuration, for example two network adapters, a failure can be fixed without stopping the NOS.
Active PCI requirements
Hardware
-Interlock switch
-2 LEDs per Active PCI slot (Power, Attention)
Software
-Device driver (adapter manufacturer)
-System driver (machine manufacturer)
-System service
Servicing a Server with Active Slots
If you are called to a server that has Active slots enabled and working, you will need to consult with the customer before attempting to replace a failed adapter. Any customer who adopts this technology will be reluctant to let you stop the NOS to replace the failed adapter, and you may be required to replace the adapter hot. Procedures vary from NOS to NOS and from adapter to adapter. In general, however, the NOS is informed that an adapter is about to be removed, and the Active PCI/PCI-X switch card is used to remove power from the slot prior to removal. When you have completed the repair and fitted the replacement adapter, the NOS may need to be told that the repair is complete.
Service Processors
Here, we look at the system management hardware (Service Processors) you will find in high-performance System x and xSeries servers.
Service Processors (SP) are often divided into two groups.
Basic Service Processor (BMC)
- Runs on the continuously-on 5 V power and is used to power the server on and off
- Monitors the I2C bus for sensor activity and stores logs and information about events
- Responds to issues and errors (light path diagnostics, fans, reboots)
- Provides limited access to information about the machine while it is powered off (if the machine is plugged in)
Advanced Service Processor (RSA2)
- Runs on the continuously-on 5 V power and monitors/collects information from the BMC
- Can be programmed to page out support personnel when a problem occurs
- Powerful web interface for easy remote management
- Remote video, remote control, and push-down code features
Baseboard Management Controller (BMC)
Independent microcontroller used to perform low level system monitoring and control functions. BMC Functions:
-Initial system check out at AC on
-BMC event log maintenance
-System power state tracking
-System initialization
-System software state tracking
-System event state monitoring
-System fan speed control
Remote Supervisor Adapter (RSA)
Remote Supervisor Adapters (RSA) are full-featured management adapters that provide both in-band and out-of-band management capabilities, including full remote control. Through the RSA and RSA II, you can interrogate and manage logs, control and monitor the power state of the host server, apply flash updates to the host and any attached I/O expansion enclosures, and take full remote control of the host console while the NOS is running. RSAs support the following:
- Web-based management: A small web server embedded in the adapter lets you connect through the dedicated LAN port and use a user-friendly, HTML-based interface to configure and monitor the server.
- Remote graphic console redirection: When connecting through the dedicated LAN port, the card can capture video data and perform a complete console redirection with text, graphics, keyboard and mouse support.
- DNS/DHCP support: In addition to static IP configuration, the RSA supports DHCP and DNS. Putting the card in a network where a DHCP server is installed configures it automatically, avoiding the need to run configuration routines through the management software.
- NT blue screen capture: The most recent OS failure screen can be captured, avoiding the annoying step of restarting the server to reproduce the error.
- Attach event log to e-mail alerts: The event log can be sent as an e-mail attachment to administrators to notify them of any problem that affected the server.
- DB-9 connector (RSA only): The card has a standard DB-9 connector, making cabling easier.
- Externally visible LEDs: Power and error LEDs are on the rear bezel, removing the need to lift the covers in order to check the status of the card.
RSA II Adapter Features / Layout
1. Status LEDs (Heartbeat & Power - heartbeat blinking, power solid during normal operation)
2. Pinhole Reset (Service Processor Software Reset)
3. Mini-USB Connector (Host OS Comm. / Remote Disk, Mouse, Keyboard)
4. External Power Supply
5. RJ45 Ethernet Connector (Web Interface)
6. DB15 VGA Video Connector (Host Video)
7. Video Compression Memory
8. Non-Serviceable Clock Battery
9. Video Compression Chip
10. Remote Floppy, Mouse, Keyboard Chip
11. ATI Radeon 7000VE (a.k.a. RV-100) (Video)
12. PCI Connector (System Video)
13. Ethernet PHY
14. Flash Memory (Service Processor)
15. PowerPC CPU (Service Processor)
16. Video Memory (System Video)
17. CPU Memory (Service Processor)
18. Real-time Clock
The RSA2 adapter replaced the RSA adapter starting in 2003 and currently comes in several slightly different variants. The photos above show some of the complex features of the full RSA II adapter (e.g. mounted on its own video card). The RSA2 SlimLine adapter mounts on an existing video adapter in many of the newer System x servers. There is also an RSA2 SlimLine Refresh 1 and an RSA2-EXA adapter. The essential differences between these versions can be found on the following website: http://www.redbooks.ibm.com/abstracts/tips0146.html
The RSA2 adapter is a complex, half-size adapter that needs to be flashed for the supported server in which it is installed. Depending on the level of code installed in the RSA II, the adapter can be reset using a paper clip with either a 5-5-10 second sequence (5 seconds pushed, 5 seconds not pushed, 10 seconds pushed) or a straight 10-second push. A reset sets the adapter back to factory defaults and causes it to reboot. The adapter then tries for two (2) minutes to obtain a DHCP address before falling back to 192.168.70.125 if it cannot find a DHCP server.
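The address behavior after a reset matters in the field because it tells you where to look for the adapter on the network. The Python sketch below models it under stated assumptions: the two-minute wait and the 192.168.70.125 fallback come from the text, while the retry interval, the function name `address_after_reset` and the `try_dhcp` callable are hypothetical stand-ins (no real DHCP traffic or waiting takes place).

```python
FALLBACK_IP = "192.168.70.125"   # from the text: default when DHCP fails
DHCP_WAIT_SECONDS = 120          # from the text: tries for two minutes


def address_after_reset(try_dhcp, wait_seconds=DHCP_WAIT_SECONDS,
                        retry_interval=10):
    """Return the address the adapter would end up with after a reset.

    try_dhcp is a callable that returns a leased address as a string, or
    None if no DHCP server answered. The retry interval is an assumption;
    elapsed time is simulated rather than actually waited.
    """
    elapsed = 0
    while elapsed < wait_seconds:
        lease = try_dhcp()
        if lease is not None:
            return lease
        elapsed += retry_interval
    return FALLBACK_IP


# No DHCP server on the management LAN: expect the factory default address.
print(address_after_reset(lambda: None))
# DHCP server present: the adapter takes whatever address it is leased.
print(address_after_reset(lambda: "10.0.0.5"))
```

In practice this means that after a reset on a network without DHCP, a laptop configured on the 192.168.70.x subnet should be able to reach the adapter once the two minutes have passed.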
In this picture you can see an example of the interface that will be presented to the user when connecting an RSA II through a Web browser.
Through the RSA, you can interrogate and manage event logs to assist in problem isolation and repair. Note: you can access the RSA II event logs even if the host is in standby power mode.
The IBM Remote Supervisor Adapter II has three different update package options: a Windows update package, a Linux update package, and a Zip file package. (e.g. sample web link is
The Windows and Linux update packages can be installed from one of these NOSs (provided that the NOS driver for the RSA2 is installed). The Zip file package is used to update the RSA2 adapter from the Web interface. The package consists of a readme, a change history and the Zip file containing the following PKT files:
-PAETBRUS.PKT is traditionally the name of the Boot ROM file
-PAETMNUS.PKT is traditionally the name of the Main Application file
If access to the server is possible, these components can be updated with the use of flash images. Images can be downloaded from the IBM support Web site and used to make the necessary diskettes. If access to the server is not possible, or if the RSA2 is under management through a Web browser, updates can also be applied via the Web browser connection. In this case, the update images are different but can still be downloaded from the IBM support Web site.
-Software requirements for remote POST screens, remote Setup and remote Diagnostics:
Terminal program or IBM Director or a Web browser
Supported Java engine
-During Boot, the RSA2 adapter can be loaded with a Disk or CD image/file and the server can boot from this image file.
Console redirection can be very useful for diagnosing problems where access to the server console is required. Using a variety of connection methods and software interfaces, the RSA gives full remote control capabilities. Depending on the level of access to the hardware, you can perform almost any task that you could perform while actually standing at the server itself. If the RSA Ethernet port is connected to the customer LAN, you can even take control of the server from another location (in theory, anywhere in the world), provided you know the IP address of the adapter and have the necessary security permissions to access the interface. This facility is very powerful and must be used with extreme care. Also, accessing a server console in this way should only be undertaken with the permission of the customer.
One other very important capability of the RSA2's Remote Disk feature is that a file (a diskette or CD image file) can be accessed by the server via the RSA2 adapter. The image is first loaded on the RSA2 adapter. Then, when the server is rebooted, its boot sequence can be altered (e.g. by pressing F12) to boot from the image. The server will now boot from the remote file as if it were really an attached diskette drive or CD-ROM drive. This can be used to flash the various server hardware features remotely.
Feature / Function                BMC             RSA II
Monitoring
  Automatic Server Restart        Yes             Yes
  Capture Windows Blue Screens    No              Yes
  Environmental Monitors          Yes             Yes
  Interface with Light-Path       Yes             Yes
  Optional Power Source           No              Yes
  PFA on system components        Yes             Yes
  POST, Loader, O/S Timeouts      Yes             Yes
Alerting
  Alert to pager                  Yes             Yes
  SMTP Email                      No              Yes
  SNMP Traps                      No              Yes*
  SNMP via PPP                    No              No
Management/configuration
  ANSI-based Management           Yes (via SoL)   Yes
  Director-based Management       Yes             Yes
  Telnet-based Management         Yes             Yes
  Web-based Management            No              Yes
  Remote BIOS Update              No              Yes**
  Remote Control                  No              Yes
  Remote POST / Diagnostics       No              Yes
  View Status Logs                Yes             Yes
  View Vital Product Data         Yes             Yes
Connectivity
  10/100 Ethernet                 Yes (shared)    Yes
  DHCP support                    No              Yes
  DNS support                     No              Yes
  PPP                             No              Yes
  Shared serial support           No              Yes
*Only SNMPv1 traps supported.
**Direct flashing of BIOS/Diags firmware is not supported (can be done using the remote disk feature instead).
Here is an at-a-glance comparison of the monitoring and alerting capabilities of the Service Processors found in System x servers.
This topic discussed the technologies incorporated into high-performance System x servers.
This topic discusses the service implications of working with scalable systems.
When servicing multi-node systems, it is important to understand the relationship between nodes in the partition and how the partition is wired together.
Scalable
-A system that is able to join with another computing resource to act as a single, larger server
Node
-A single computing resource (a server)
Capable of operating alone or joined (scaled)
Complex
-Two or more nodes
Joined together physically
Partition
A complex that is running a single instance of an OS
The term scalable is used to describe a device that has the ability to operate in a joined fashion, along with another computing device, to appear as a single, large server. A node is the smallest unit of a scaled system. A node can operate standalone, as well as in a complex. A complex is a collection of nodes, physically joined together to form a large computing resource. A partition is a complex that is running a single instance of an OS across all processors and memory in the complex.
x3950 M2 Scalability Schematics 8-way and 16-way
[Figure: cabling schematics for 8-way and 16-way partitions. Each x3950/460/MXE node is shown with scalability Ports 1 to 3 and its BMC and RSA connections.]
Here are the cabling schematics for 8-way and 16-way operation across all the supported scalable systems. When scaling the x3950, x3950E, 460 or MXE 460 to an 8-way partition, or any of these systems to a 16-way partition, the RSAs play a key part in creating and maintaining the partition, as they hold the partition data and maintain communications between all nodes in the partition. The data flows across the scalability cables. Each node contains a scalability controller (part of the XA chipset) that is effectively a high-speed switch. Each node above is directly connected to every other node, so much of the switching technology embedded in the controller is not used. Note that Ethernet hubs are used in all but the simplest partitions, because many devices need to connect to a common management LAN for scaling to work while still providing real-time access to management processors and functions.
[Figure: cabling schematic for a 32-way partition of eight x460/MXE nodes, each shown with scalability Ports 1 to 3 and its BMC and RSA connections.]
Here is the cabling schematic for a 32-way xSeries x3950, x3950E, 460/MXE 460 partition. As you can see from the scalability cabling in this schematic, each node is directly connected to three other nodes in the partition. This time, each node acts as a router to the nodes that are not directly connected, fully exploiting the switching capabilities of the scalability controllers in the nodes. Without the ability to maintain routing tables, it would not be possible to scale eight nodes together.
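The difference between the four-node and eight-node cases can be checked with a small graph model. In the Python sketch below, the three scalability ports per node come from the schematics; the eight-node wiring used here (a cube, where two nodes are cabled together if their IDs differ in exactly one bit) is only an assumed example of a layout in which every node has three links, not the actual IBM cabling plan.

```python
from collections import deque

PORTS_PER_NODE = 3   # from the schematics: scalability Ports 1 to 3


def full_mesh(node_count):
    """Every node cabled directly to every other node."""
    return {a: {b for b in range(node_count) if b != a}
            for a in range(node_count)}


def cube_wiring():
    """Hypothetical 8-node layout: link nodes whose IDs differ in one bit."""
    return {a: {a ^ (1 << bit) for bit in range(3)} for a in range(8)}


def max_hops(links):
    """Worst-case number of cable hops between any two nodes (BFS)."""
    worst = 0
    for start in links:
        hops = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for peer in links[node]:
                if peer not in hops:
                    hops[peer] = hops[node] + 1
                    queue.append(peer)
        worst = max(worst, max(hops.values()))
    return worst


four = full_mesh(4)
print(max(len(p) for p in four.values()), max_hops(four))
print(max(len(p) for p in full_mesh(8).values()))
eight = cube_wiring()
print(max(len(p) for p in eight.values()), max_hops(eight))
```

With four nodes, three ports are enough for a direct link to every peer, so no traffic is forwarded. A full mesh of eight nodes would need seven ports per node, so with only three ports some traffic must be relayed through intermediate nodes, which is the routing role described above.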
All scalability cables must be fitted
BIOS and firmware levels must match across all nodes
Previous partition information must be deleted
-Stale partition descriptor data may cause nodes to fail to merge
Here are the basic rules that will allow multiple nodes to merge into a partition. Before a partition can merge, however, parameters must be set to identify all nodes in the partition.
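These rules lend themselves to a pre-merge checklist. The Python sketch below is a hypothetical helper for note-taking during a service call, not an IBM tool: the `Node` fields, the function name `merge_blockers` and the firmware level strings are all invented for illustration, and the values would have to be collected by hand from each node.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    bios: str                    # BIOS level as read from the node
    bmc: str                     # BMC firmware level
    rsa: str                     # RSA firmware level
    cables_fitted: bool          # all scalability cables attached
    stale_partition_data: bool   # old partition descriptor still present


def merge_blockers(nodes):
    """List the reasons, if any, why these nodes may fail to merge."""
    problems = []
    for field in ("bios", "bmc", "rsa"):
        levels = {getattr(node, field) for node in nodes}
        if len(levels) > 1:
            problems.append(f"{field} levels differ: {sorted(levels)}")
    for node in nodes:
        if not node.cables_fitted:
            problems.append(f"{node.name}: scalability cable missing")
        if node.stale_partition_data:
            problems.append(f"{node.name}: old partition data not deleted")
    return problems


nodes = [
    Node("node1", "1.05", "2.10", "3.00", True, False),
    Node("node2", "1.03", "2.10", "3.00", True, True),
]
for problem in merge_blockers(nodes):
    print(problem)
```

An empty list means none of the three basic rules is violated; it does not guarantee a successful merge, since the partition parameters still have to be set.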
Static partitions are those that require a reboot to change the configuration. This is a simplified model that fits well with existing OSes, which rely on hardware to mask the fact that they are running on processors and memory from several physical nodes.
To create a complex:
-Flash BIOS, BMC and RSA of all nodes to same levels -Gather IP addresses (static or dynamic) for all RSAs
SP networks must either have static IP addresses or have DHCP leases to maintain consistent IP addressing
-The IP addresses that are assigned to the RSAs must not change once nodes are scaled and running
-This is true for static IP addresses and DHCP leases
-Partition tables still exist and are stored on each local RSA
-Partition tables still exist and are stored on each local BMC
Before attempting to create a complex, ensure that BIOS, BMC and RSA firmware match across all nodes and that RSA clocks match. Then, if a failure occurs, the information written to the event logs will correlate. The configuration of a complex is performed in one of two places, depending on the node type. For older systems, the configuration is created and stored via the <F1> Setup program. On newer systems, all configuration tasks are performed through the RSA II Web interface.
Scalable Partitioning Using <F1> Setup
Scalable Partitioning Using the RSA II Web Interface Sub Menus
Status
View current and new scalable partition data in the graphical user interface provided by the RSA-2 Scalable Partitioning Web interface. This menu is automatically displayed after each task below (create, control and delete) has completed.
Create Partition Task
Create new scalable partitions with the RSA-2 Scalable Partitioning Web interface.
Control Partition Task
Control new and current scalable partitions with the RSA-2 Scalable Partitioning Web interface. Controls are:
1. Moving new partition to current partition. The new partition is a staging area for current partitions; new partitions can be created while current partitions are running.
2. Starting current partitions
3. Stopping current partition
Delete Partition Task
Selections are:
1. Delete Partition Settings on all ASM's members of New Scalable Partition.
2. Delete Partition Settings on all ASM's members of Current Scalable Partition.
3. Delete Partition Settings only for this (local) ASM Member of Current Scalable Partition.
Partitions are supported by L4 cache to speed communications across the processor busses
-On earlier Scalable systems (the xSeries 440, 445, and 455), L4 cache is separate from main memory -On the x3950, x3950E, xSeries 460 and MXE 460, the scalability chip has an integrated L4 Scalability Memory Cache (SMC) which utilizes main memory
When BIOS reports available memory per node to the O/S, it must first subtract the scalability cache size (256MB)
On first and second generation scalable systems, the L4 cache was physically separate from main memory. All main memory is available to the OS. On third generation scalable systems, the cache controller utilizes host memory for the cache. The customer will notice a difference between reported memory (that which is available to the OS) and physically installed memory.
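If a customer queries the gap between installed and reported memory on a third generation system, the expected figure can be worked out as below. This Python sketch is illustrative only: the 256 MB per-node cache size comes from the slide, the function name `reported_memory_mb` is hypothetical, and other reservations the BIOS or OS may make are ignored.

```python
SCALABILITY_CACHE_MB = 256   # from the slide: subtracted per node by BIOS


def reported_memory_mb(installed_mb_per_node):
    """Memory the OS should see, given installed memory for each node."""
    return sum(installed - SCALABILITY_CACHE_MB
               for installed in installed_mb_per_node)


# Example: a four-node partition with 16 GB (16384 MB) installed per node.
installed = [16384, 16384, 16384, 16384]
print(sum(installed), reported_memory_mb(installed))
```

In this example the OS would be expected to report 1024 MB less than the physically installed total, which is normal behavior rather than a sign of failed memory.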
Scaled System Management Considerations
Here are some things to remember when working with partitions and scaled systems.
Scalability Port Test from Diagnostics
Scalability ports can be tested using System Diagnostics <F2>, under the Basic menu option, from each chassis. The new Scalability Port Test is an interactive diagnostic test that requires the user to follow the text on the screen.
The x3950 M2 can be scaled to create a complex partition that runs a single instance of an OS
The term scalable describes a device that can operate in a joined fashion, along with another computing device, to appear as a single, large server. A node is the smallest unit of a scaled system. A node can operate standalone, as well as in a complex. A complex is a collection of nodes, physically joined together to form a large computing resource. A partition is a complex that is running a single instance of an OS across all processors and memory in the complex.
Supported scalability configurations are 2, 3, and 4 nodes. Port cabling is the same as on the x3950
-New cables (deep plug with iPass connectors)
Only a USB keyboard and mouse are supported for booting standalone
-Press the Remind button to initiate a standalone boot, because USB devices are not initialized at the start of the merge process
A configuration can have one or more scalable partitions. Each scalable partition supports an independent operating system installation. The scalable partition uses a single, contiguous memory space and provides access to all associated adapters and hard disk drives. PCI slot numbering starts with the primary node and continues with the secondary nodes, in numeric order of the logical node IDs.
Before you create scalable partitions, read the following information:
Make sure that all nodes in the multi-node configuration contain the following software and hardware:
-The current level of BIOS code, SAS BIOS code, service processor firmware, BMC firmware, and FPGA firmware. Note: To check for the latest firmware levels and to download firmware updates, go to http://www.ibm.com/systems/support/.
-Microprocessors that are the same cache size and type, and the same clock speed.
Make sure that each node contains the following hardware:
-A minimum of one microprocessor and one memory card with one pair of DIMMs. Note: The nodes can vary in the number of microprocessors and the amount of memory each contains, above the minimum.
-A ScaleXpander key on the microprocessor board to enable multi-node operation.
Make sure that the primary node contains a minimum of 4 GB of memory.
The Scalability installation Option documentation is available
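The hardware checklist above can be expressed as a simple pre-check. This is a minimal sketch only: the Node record and its field names are hypothetical, the first node in the list is assumed to be the primary node, and firmware currency is not checked because it has to be compared against the IBM support site.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical inventory record for one node; field names are illustrative."""
    name: str
    cpu_count: int
    cpu_type: str
    cpu_cache_mb: int
    cpu_clock_mhz: int
    memory_cards_with_dimm_pair: int
    memory_gb: int
    has_scalexpander_key: bool


def check_prerequisites(nodes):
    """Return a list of problems found; nodes[0] is treated as the primary node."""
    if not nodes:
        return ["no nodes supplied"]
    problems = []
    primary = nodes[0]
    reference_cpu = (primary.cpu_type, primary.cpu_cache_mb, primary.cpu_clock_mhz)
    for node in nodes:
        if node.cpu_count < 1:
            problems.append(f"{node.name}: needs at least one microprocessor")
        if node.memory_cards_with_dimm_pair < 1:
            problems.append(f"{node.name}: needs one memory card with one pair of DIMMs")
        if not node.has_scalexpander_key:
            problems.append(f"{node.name}: ScaleXpander key is missing")
        if (node.cpu_type, node.cpu_cache_mb, node.cpu_clock_mhz) != reference_cpu:
            problems.append(f"{node.name}: microprocessors differ from the primary node")
    if primary.memory_gb < 4:
        problems.append(f"{primary.name}: primary node needs at least 4 GB of memory")
    return problems


# Example data is invented for illustration.
nodes = [
    Node("node 1", 4, "example-cpu", 16, 2660, 2, 16, True),
    Node("node 2", 2, "example-cpu", 16, 2660, 1, 8, False),
]
for problem in check_prerequisites(nodes):
    print(problem)
```

With the example data, only the missing ScaleXpander key on node 2 is reported. The differing microprocessor count and memory size are accepted, because nodes may vary above the minimum.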
Chassis Scalability requires ScaleXpander Option Kit
ScaleXpander Option Kit
-The scalability icon lights up when active
-The x3850 M2 can be upgraded to an x3950 M2 with the ScaleXpander Option Kit
Closer look
Notes: The IBM ScaleXpander Option Kit can be used to upgrade the x3850 M2 for scalability. The kit can also be used to interconnect the SMP Expansion Ports of two or more servers to form multi-node configurations. With the ScaleXpander Option Kit, the non-scalable x3850 M2 becomes a scalable x3950 M2. This scalable configuration supports up to 16 sockets and 96 processor cores.
-The ScaleXpander Option Kit is installed in a slot near the front of the system board
-During POST, the BMC reads VPD on the chip to verify the system can scale
-Each chassis to be scaled requires the kit to be installed
ScaleXpander Option Key
Notes: In order to merge chassis, the ScaleXpander Option Kit needs to be installed in a slot near the front of the system board. During POST, the BMC will read VPD on the chip to verify the system can scale. Each chassis to be scaled requires the kit to be installed.
Processor Board Scalability Connectors
This slide shows the processor board connections. The scalability key, which is required to enable scalability, plugs into the processor board at connector J14. Three connectors on the rear of the system are used to connect the physical systems together. A management network consisting of the RSA and BMC from each of the systems to be scaled is required.
Rear view scalability connections and cable
Scalability Cable
Notes: This slide shows the scalability cable and SMP connectors on the rear of the x3950 M2.
The cabling information is for multi-node configurations that consist of two or (when supported) three servers, for up to 12-socket operation. A node is a server that is interconnected with other servers or nodes through the SMP Expansion Ports to share system resources.
Two-node configuration
A two-node configuration requires two 3.0 m (9.8-foot) ScaleXpander cables. Attach the scalability cables from port 1 to port 1, and from port 2 to port 2.
Rear view scalability cables connected
Notes: This slide shows the deep-plug scalability cables installed into the SMP ports on the rear of the x3950 M2. Note the location of the scalability release levers.
Two Node Scalability Cable Layout
Two-node configuration
A two-node configuration requires two 3.0 m (9.8-foot) ScaleXpander cables. To cable a two-node configuration for up to eight-socket operation, complete the following steps:
1. Label each end of each ScaleXpander cable according to where it will be connected to each server.
2. Connect the ScaleXpander cables to node 1:
a. Connect one end of a ScaleXpander cable to port 1 on node 1; then, route the cable through the node 1 wire-form clips on the cable-management arm.
b. Connect one end of a ScaleXpander cable to port 2 on node 1; then, route the cable through the node 1 wire-form clips on the cable-management arm.
3. Connect the ScaleXpander cables to node 2:
a. Locate the ScaleXpander cable that is connected to port 1 on node 1; then, connect the opposite end of the cable to port 1 of node 2. Next, route the cable through the node 2 wire-form clips on the cable-management arm.
b. Locate the ScaleXpander cable that is connected to port 2 on node 1; then, connect the opposite end of the cable to port 2 of node 2. Next, route the cable through the node 2 wire-form clips on the cable-management arm.
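The two-node layout is simple enough to generate as a labeling aid for the first step of the procedure. This is a minimal sketch, assuming only the port 1 to port 1 and port 2 to port 2 rule stated above; the function name and node labels are illustrative. Three-node and four-node layouts are deliberately not modeled, because they are documented in the Problem Determination and Service Guide.

```python
def two_node_cable_plan(node_a="node 1", node_b="node 2"):
    """Return one (end_a, end_b) pair per cable; each end is a (node, port) tuple."""
    return [((node_a, port), (node_b, port)) for port in (1, 2)]


# Print a label for both ends of each ScaleXpander cable.
for number, (end_a, end_b) in enumerate(two_node_cable_plan(), start=1):
    print(f"cable {number}: {end_a[0]} port {end_a[1]} <-> {end_b[0]} port {end_b[1]}")
```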
Three-node or four-node configuration
A three-node configuration requires three 3.0 m (9.8-foot) ScaleXpander cables. To cable a three-node configuration for up to 12-socket operation, refer to the IBM System x3850 M2 and System x3950 M2 Type 7141 Problem Determination and Service Guide for detailed instructions and the cable layout.
x3950 and x3950 M2 Scalability Comparison
x3950
-RSA managed partitioning
-Complex descriptor and partition descriptor
-Partitioning done across Ethernet
-No topology awareness
-Manual system discovery required to set up RSA IP addresses
-No cable status or debug reporting
Unlike previous scalable systems, the IBM System x3950 M2 BMC manages the scalable partitioning (rather than the RSA).
External components can create and control partitions through nine available scalability commands
The new architecture uses an RSA connected to one of the nodes to act as a web based scalable complex management console from which partitions can be created and controlled. Cable topology and scalable port status will also be available from this complex management console. Partition creation and control may be performed from an RSA or IPMI client; partition management will be handled within each BMC. The new architecture will perform automatic node topology discovery using the FPGA and BMC, so that every node will be able to communicate with every other node using the scalable management bus. The previous architecture required the user to set up in advance the Ethernet IP addresses of all the RSAs before partitions could be created, and further required partition creation be performed from the boot node of each partition. The new architecture has removed all of these cumbersome requirements, making it possible to connect systems out of the box and go directly to partition creation. Partition creation is now streamlined to a single RSA web page where partition configuration data can be distributed to target member BMCs and stored in NVRAM. A test for pre-existing partitions is performed and their status is checked to ensure that the partition is powered off prior to reconfiguration. Partition IDs are utilized by the FPGA to enable uniform behavior by all nodes in a partition during power and reset operations. Partition-wide platform options such as mirroring are also distributed so that each BIOS can have consistent settings in advance of partition merging during the system boot phase.
Unlike previous scalable systems, the role of the RSA has changed
-Reads scalable complex information from the BMC
-Displays scalable complex topology to the user, including:
Incorrect cabling displayed and noted
Port problems displayed and noted
Non-scaled systems displayed and noted
Provides partition and system
In multi-node integration, the partition configuration is written once through the RSA II to the BMC and then to the FPGA interface. The FPGA interface routes the partition configuration to each partition member's BMC and FPGA interface over the virtual ICMB (Intelligent Chassis Management Bus). This complex configuration is stored in each local node's BMC NVRAM. The partition configurations are contained in the complex configuration. During complex or partition configuration, the BMC uses only one buffer for all data and no longer holds two buffers (active and candidate) as previous scalable systems did. The data structure of the complex descriptor is stored in each local node's BMC NVRAM and has a version check to ensure consistency. This data structure is shared between all the user applications that create and control static partitioning. Note: The RSA is still required for partition definition.
RSA II Interface (Create Partition)
To create a scalable partition, complete the following steps:
1. Connect the ScaleXpander cables.
2. Connect all nodes to an ac power source and make sure that they are not running an operating system. Note: If the nodes are part of an existing partition, all nodes must be in Standby mode, which means that the nodes are part of the partition but operate independently. Click Force under Standalone Boot on the Scalable Complex Management page to enable the Standby mode.
3. Connect and log in to the Remote Supervisor Adapter II Web interface. See the Remote Supervisor Adapter II SlimLine and Remote Supervisor Adapter II User's Guide for more information; then, continue with the procedure to create a scalable partition.
4. In the navigation pane, click Manage Partition(s) under Scalable Partitioning. Use the Scalable Complex Management page to create, delete, control, and view scalable partitions.
Select the primary node; then, automatically or manually create a scalable partition:
-Click Auto under Partition Configure to automatically create a single partition that uses all nodes in the multi-node configuration.
-Click Create under Partition Configure to manually assign nodes to the partition.
Scalable Complex Management page
To create a scalable partition, complete the following steps:
1. Connect the ScaleXpander cables.
2. Connect all nodes to an ac power source and make sure that they are not running an operating system. Note: If the nodes are part of an existing partition, all nodes must be in Standby mode, which means that the nodes are part of the partition but operate independently. Click Force under Standalone Boot on the Scalable Complex Management page to enable the Standby mode.
3. Connect and log in to the Remote Supervisor Adapter II Web interface. See the Remote Supervisor Adapter II SlimLine and Remote Supervisor Adapter II User's Guide for more information; then, continue with the procedure to create a scalable partition.
4. In the navigation pane, click Manage Partition(s) under Scalable Partitioning. Use the Scalable Complex Management page to create, delete, control, and view scalable partitions. A page similar to the one in the following illustration is displayed.
RSA II Interface (Partition Started)
Select the primary node; then, automatically or manually create a scalable partition:
1. Click Auto under Partition Configure to automatically create a single partition that uses all nodes in the multi-node configuration.
2. Click Create under Partition Configure to manually assign nodes to the partition.
Note: Click Redraw to reorder the sequence in which the nodes appear in the diagram on the page. You can, for example, reorder the diagram to reflect the order in which the nodes are installed in a rack. The nodes are reordered according to the ScaleXpander cabling, with the node that you select in the top position.
Click Partition ID to define operation of the partition and view information about the partition. A page similar to the one in the following illustration is displayed.
The following non-selectable fields display information about the partition:
1. The Partition Count field displays the number of nodes in the partition.
2. The Partition Validity field displays the following status: Valid (which indicates the configuration is correct).
3. The Partition field displays one of the following statuses:
-Stopped: The partition is inactive, and the nodes can be reassigned to a partition.
-Started: The partition is active.
-Resetting: The configuration is resetting.
-Unknown: The partition contains unidentified port or chassis IDs.
a) In the Partition merge timeout minutes field, select the number of minutes POST waits for the scalable nodes to merge resources. The default value is 6 minutes. Allow at least 8 seconds for each GB of memory in the scalable partition.
b) In the On merge failure, attempt partial merge? field, select whether POST should attempt a partial merge if one error is detected during full merge. Yes is the default value.
c) In the Memory Mirroring? field, select whether memory mirroring is enabled in all nodes in the partition. Yes is the default value.
d) Click Save.
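The 8 seconds per GB guideline can be turned into a quick estimate when choosing a merge timeout. The following is a minimal sketch: rounding up to whole minutes and never suggesting less than the 6 minute default are assumptions made for illustration, and the values that the RSA II field actually accepts should be checked on the page itself.

```python
import math

DEFAULT_TIMEOUT_MIN = 6  # default value quoted in the text above
SECONDS_PER_GB = 8       # guideline quoted in the text above


def minimum_merge_timeout_minutes(total_memory_gb):
    """Smallest whole number of minutes that allows 8 seconds per GB."""
    return math.ceil(total_memory_gb * SECONDS_PER_GB / 60)


def suggested_merge_timeout_minutes(total_memory_gb):
    """Assumption: never suggest less than the default timeout."""
    return max(DEFAULT_TIMEOUT_MIN, minimum_merge_timeout_minutes(total_memory_gb))


print(suggested_merge_timeout_minutes(32))  # 6
print(suggested_merge_timeout_minutes(64))  # 9
```

A 32 GB partition needs about 256 seconds, so the default of 6 minutes is sufficient. A 64 GB partition needs about 512 seconds, which exceeds the default.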
Notes: In order to merge chassis, all secondary nodes must contain the same core count as the primary node. They can have different speeds, but not a different core count. The screen shows the error message that appears when the processors in chassis number 2 do not match those in the primary node.
Notes: In addition, in order to merge chassis, all chassis must have at least 4 GB of memory installed. The screen shows the error message if this condition is not met.
Notes: Any chassis can boot into standalone mode, and there are several ways to do so. Because USB support is not available at merge time, you cannot press the ESC key to bypass the merge; press the blue Remind button instead. Alternatively, you can reconfigure the partition information through the RSA II interface to force standalone mode.
Notes: Another way to boot into standalone mode is to wait until the chassis merge, then press the ESC key to force a reboot to standalone mode.
Notes: This is a sample Scalable Complex Management screen showing how you would modify the settings to boot into standalone.
The following non-selectable fields display information about the partition:
1. The Partition Count field displays the number of nodes in the partition.
2. The Partition Validity field displays the following status: Valid (which indicates the configuration is correct).
3. The Partition field displays one of the following statuses:
-Stopped: The partition is inactive, and the nodes can be reassigned to a partition.
-Started: The partition is active.
-Resetting: The configuration is resetting.
-Unknown: The partition contains unidentified port or chassis IDs.
a) In the Partition merge timeout minutes field, select the number of minutes POST waits for the scalable nodes to merge resources. The default value is 6 minutes. Allow at least 8 seconds for each GB of memory in the scalable partition.
b) In the On merge failure, attempt partial merge? field, select whether POST should attempt a partial merge if one error is detected during full merge. Yes is the default value.
c) In the Memory Mirroring? field, select whether memory mirroring is enabled in all nodes in the partition. Yes is the default value.
d) Click Save.
Notes: One of the changes in the System x3950 M2 BIOS screens is that you can now see all the processors in a multi-node complex.
This topic discusses Dynamic System Analysis (DSA) and how it can be used to provide service on high-performance System x servers.
This topic discusses the significant aspects of DSA and what you need to know in order to use it to solve problems.
Dynamic System Analysis (DSA) Overview
The information is collected into a compressed XML file. The file can be sent to IBM Support to assist in finding and resolving problems. In addition, DSA provides a local viewer and can display the contents of the XML file in a Web browser.
Dynamic System Analysis (DSA) is a collection of probes that search the system for information. It can plug itself into drivers and firmware to pull logs, and it then interprets the information into a usable format. IPMI and RSA drivers must be installed prior to using DSA. If there is no RSA present, DSA is able to pull information from the BMC as long as the IPMI mapping layer and driver are installed.
Preboot DSA
-A blend of the diagnostic routines behind the F2 option and the DSA data gathering capabilities
There are several editions of IBM DSA:
-The portable edition runs on a supported system without altering any system files or system settings. No files are installed on the system under investigation.
-The installable edition installs directly on the system. This edition can be run directly from the console of the system under investigation.
DSA is supported on Windows and Linux operating systems. The readme file lists the specific information regarding NOS support and installation instructions for the different NOSes.
Running DSA with the default options creates an XML file that can be sent to IBM support. The XML file is stored locally on the system under investigation. Command-line switches are used to run DSA in a way that creates the HTML files needed to read the results locally.
Preboot Diagnostics (DSA) is installed on an internal USB key in some IBM high-performance servers. Preboot DSA is activated by pressing F2 at the BIOS prompt screen, the same procedure used to enter Diagnostics on older systems.
DSA versions are available for download from the IBM support Web site.
Note: Linux portable and installable versions are for Linux / VMware. VMware ESX 3.0 users should run the Red Hat 3, 32-bit version of DSA.
Portable and Installable DSA Prerequisites
DSA will run without any additional software but may not include all of the available logs without the installation of device drivers
To read a BMC SEL, the system must have the following device drivers installed and running:
-IPMI Device Driver
-IPMI Mapping Layer
-Note: The installation sequence of these drivers is critical. They MUST be installed in the order shown above
To read the RSA event log, the RSA driver must be installed
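The driver prerequisites can be summarized as a small lookup that tells you which logs to expect in a DSA report. This is a minimal sketch with illustrative names only; it does not call DSA or query the operating system. Treating a wrong installation order as "BMC SEL not readable" is an assumption, since the text states only that the order is critical.

```python
IPMI_DRIVER = "IPMI Device Driver"
IPMI_MAPPING = "IPMI Mapping Layer"
RSA_DRIVER = "RSA driver"


def readable_logs(install_order):
    """Return the logs DSA could be expected to read.

    install_order lists the driver names in the order they were installed.
    """
    logs = []
    if IPMI_DRIVER in install_order and IPMI_MAPPING in install_order:
        # The device driver must be installed before the mapping layer.
        if install_order.index(IPMI_DRIVER) < install_order.index(IPMI_MAPPING):
            logs.append("BMC SEL")
    if RSA_DRIVER in install_order:
        logs.append("RSA event log")
    return logs


print(readable_logs([IPMI_DRIVER, IPMI_MAPPING, RSA_DRIVER]))
print(readable_logs([IPMI_MAPPING, IPMI_DRIVER]))
```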
DSA has the ability to compare a report for a system against known firmware and driver levels that are available from IBM
-This feature compares DSA outputs for firmware and device drivers with those found on the UpdateXpress CD-ROM set
-To run the comparison tool, the relevant UpdateXpress CD-ROM must be in the system CD-ROM drive
DSA has the ability to compare code levels against a set of code levels on an UpdateXpress CD-ROM. This can be useful if code mismatches are suspected to be the cause of problems. DSA can also compare two DSA reports to track changes for two points in time. The difference checker will highlight any significant changes to the system environment.
Preboot DSA is integrated into the System x3850 M2 and x3950 M2. It is accessed via the F2 key sequence when the IBM splash screen loads.
Preboot DSA can be accessed if the system reaches system state 4 (completion of POST).
Preboot DSA - Capabilities
System Data Collection Providers
- System Overview: Mfr, version, prod name, serial no, UUID, critical details
- Network Settings: Hostname, physical network port info, global settings
- Hardware Inventory: Processor, memory, disk info, monitor info, system card info, devices (SCSI, USB, optical, other)
- PCI Information: Devices, bridges, slots
- Firmware/VPD: Network, SP, BIOS, other VPD
- SP Configurations: Settings (general, TCP/IP, SNMP, dial-out, dial-in)
- LSI Controller: Controller info, physical & logical drive info
- System Management: Data, logs, Light Path LED settings
- BIST results: RSA, IPMI
- Event logs: ASM, IPMI
- Merged devices
- Memory diagnostics log
- DSA Error log
Diagnostic Tests
- Memory Test (runs in standalone mode)
- BMC I2C Test
- Check Point Panel Test
- Optical Test: Read Error Test, Self Test, Verify Media Installed
- RSA Restart Test
- TPM Test
- Ethernet Test: Control Registers, EEPROM, Internal Memory, Interrupt, LEDs, MAC Loopback, PHY Loopback, MII Registers
- Stress Tests: CPU Stress Test, Memory Stress Test
- HDD Test
Initiating a Preboot DSA Session
Notes:
By default, you will be taken to the Memory Test main menu screen. The tests and options available are:
-Quick Memory test
-Full Memory test
-Change Options
To exit the Memory test and enter DSA from here, select Quit to DSA.
By default, you will be taken to the diagnostic menu screen. To run DSA from here, select Quit to DSA.
Preboot DSA Command Line
Preboot DSA offers a command menu from which you make a selection:
-GUI: takes you to a graphical environment
-CMD: offers various commands as options
-COPY: copies DSA results to removable media
-EXIT: exits the program
-HELP: is also available
Preboot DSA Command menu:
1. COLLECT: collects system information
2. VIEW: displays the collected data on a local console in a text viewer
3. ENUMTESTS: lists the available tests
4. EXECTEST: menu used to select a test to execute
5. GETEXTENDEDRESULTS: retrieves and displays diagnostic results
6. TRANSFER: sends the collected data to IBM support
7. QUIT: exits Preboot DSA
The COPY command will be used most by customers and the field community to capture all the logs to a USB key and then have those logs emailed to IBM support for analysis. In the lab session of this course you will run this command to capture the logs and then analyze the data.
Preboot DSA Graphical Interface
Select Diagnostics from the main menu to load the diagnostic tests page
From this page, you can select and run a variety of diagnostic tests on system hardware.
Preboot DSA provides the following data in System Information:
-System configuration
-Installed applications and hot fixes
-Device drivers and system services
-Network interfaces and settings
-Hardware inventory including PCI information
-Vital Product Data and BIOS and firmware information
-Drive health information
-LSI, RAID controller configuration
-Event logs for ServeRAID controller and service processors
Scaled System Information Gathering
The primary node's Preboot Diagnostic (DSA) gathers and displays the systems that are in the scaled partition.
In a scaled system configuration, Preboot Diagnostic (DSA) on the primary node gathers system information for all the scaled systems in the partition.
Two Node Graphical Diagnostics
In a scaled configuration, the primary node's Preboot Diagnostic tests the systems that are scaled.
The Preboot Diagnostic on the primary node will test the scaled systems. Pay close attention to the Ethernet test in the screen shot above.
DSA Automated Report Submission
Preboot DSA can automatically transmit the DSA data to IBM support for analysis. Here is a list of requirements that MUST be met in order for this process to be successful.
Almost all System x and xSeries servers support some of the more advanced technologies that IBM has designed and developed. This topic discusses these service processor technologies and describes the implications of working with them in the field.
This topic discusses how to solve problems on the System x3850 M2 and x3950 M2.
This topic deals with information gathering and analysis. Without information, you cannot understand what is wrong and you cannot apply solutions.
Service and Support Tools
The list of tools available for this system is quite large. The most important aspect of using the tools is to recognize which tools you should place your trust in. You also need to understand when and how to use them. The following pages in this topic explain those interactions.
The Six System States
The six system states are used as the basis for problem analysis and repair
-Each state offers new information gathering and analysis tools
-Each state builds on the last state for tool availability
System State
1. There is no AC power
2. There is AC power but no DC output
4. There is AC and DC power, the system completes POST but the NOS fails to start loading
5. There is AC and DC power, the system completes POST but the NOS fails to complete loading
6. There is AC and DC power, the system completes POST and the NOS completes loading but stops during operation
Data Gathering (in chart order): Visual; BMC; RSA; Light path; Checkpoint codes; F1 and F2 (possibly); Beep codes; Adapter BIOS msgs (Adaptec, LSI, etc.); ServeRAID Manager; MegaRAID Storage Manager; F2 Preboot Diagnostics (DSA); NOS boot messages; Blue screen; Safe mode; DSA; NOS event logs
Data Analysis (in chart order): PDSG/HMM; SvcCon, SMBridge; RSA event log; PDSG; RETAIN tips; IBM support Web site; F2 Preboot Diagnostics (DSA); PDSG; RETAIN tips; F2 Preboot Diagnostics (DSA); NOS vendor messages; DSA
All IBM System x servers start in a uniform manner. All have a common set of interfaces that indicate how far the server has progressed in the power-up sequence. This chart shows the possible information gathering tools on the left and the possible information analysis tools on the right. All servers are supported by documentation, which forms part of the tool set for both information gathering and information analysis. For example, a Problem Determination and Service Guide (PDSG) contains lists of errors that may occur during POST (information gathering) but also contains probable causes of each error (information analysis). It is also important to realize from the chart that each state builds on the previous state. For example, in system state 2 the most important tool is the RSA, but the BMC and Light Path are also available, along with the PDSG and visual symptoms from state 1. Each state therefore adds to the data gathering tools and resources of the previous states. It is important to stress that not all information sources are available in all system states. This page summarizes what tools are available and when.
Service and Support Tools
-SSR
Preboot DSA and/or RSA logs and diagnostic results
Here is a summary of what you can expect to see when engaged on a service call with this system. Note. Available data sources will depend on the system state.
The RSA adapter is alive from system state 1 to system state 6 and is available to log into without any interruption to the customer or OS environment. As you will see in the following pages, DSA in all versions from Preboot to installable will capture the RSA logs and data into its logs to report findings. The RSA in this system is similar to all previous systems. Logon and information capture is the same as before.
For DSA installable and portable, the customer must install the drivers prior to running DSA for RSA data.
Preboot DSA can automatically transmit the DSA data to IBM support for analysis. Here is a list of requirements that MUST be met in order for this process to be successful.
Although the BMC gathers an event log, the RSA event log is the preferred log because the system has an RSA II as standard. However, the system information light will be illuminated if the BMC log reaches 75% full. Following any service activity, use either SvcCon or SMBridge to clear the BMC log in readiness for any future problems and log reporting.
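The 75% threshold can be checked with simple arithmetic when you know how many entries the BMC log holds. This is a minimal sketch: the function name and the entry counts are hypothetical, and reading the real counts requires a tool such as SvcCon or SMBridge, which is not modeled here.

```python
SEL_WARNING_FRACTION = 0.75  # threshold quoted in the text above


def bmc_log_status(entries_used, capacity):
    """Return how full the BMC log is and whether the information light is expected."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    fraction = entries_used / capacity
    return {
        "fraction_full": fraction,
        "information_light_expected": fraction >= SEL_WARNING_FRACTION,
    }


# Hypothetical log with room for 512 entries.
print(bmc_log_status(384, 512))
print(bmc_log_status(100, 512))
```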
CP Codes on Light Path Card
The client will now see the CP (checkpoint) codes from the Light Path Diagnostic panel
-CP codes are not documented in the PDSG
Explain to the client that this is a service-only display used only by support personnel
-The BMC records CP codes and the RSA displays them in the log
This only occurs if the system is connected to AC for a minimum of two minutes before the power-on button is pressed (to give the BMC/RSA2 time to boot and communicate)
When a system is connected to AC for the first time, the BMC takes up to two minutes to initialize internally. Until this is complete, the BMC cannot communicate with the RSA, and the RSA cannot capture any power-on failures or CP codes.
As with any new product announcement it is extremely important to search/query RETAIN for any tips that match the symptoms displayed
-In some cases, not all features are available from the initial product release but are added to the system after product GA (General Availability) date.
Published capabilities are contained in the announcement letters
Review RETAIN for those features that are not enabled yet
New products may not have all of their possible features available on the GA date. The announcement letter for the product lists all of the features that are supported at GA, as well as a prediction of when new features will be forthcoming. The RETAIN tip database contains up-to-date information on the status of new features in the product.
This topic has identified the support tools available on the System x3850 M2 and x3950 M2.
This topic discusses where to go for help once this course is finished.
Support information can take many forms. Here, we will discuss the key information sources for these systems and how to access them.
The system documentation, which ships with every new system, may also prove useful for verifying the basic setup of the server or I/O expansion drawer. As many of the components of modern servers are customer replaceable units (CRUs) as well as FRUs, some setup instructions are contained in the system manuals. If you are called to a newly installed server, you will want to verify that the customer has, in fact, correctly installed everything. The Problem Determination and Service Guide (PDSG), formerly known as the Hardware Maintenance Manual (HMM), is the primary reference document for the systems covered in this course. All PDSG/HMMs are now available electronically in Adobe Acrobat Portable Document Format (PDF). The PDSG contains all the disassembly and reassembly steps, beep codes, and error descriptions to assist you in isolating a failed FRU or FRUs. You will need Adobe Acrobat Reader version 4 or higher to view the contents properly, as this is the minimum supported revision of the reader.
XW0001 - Servicing IBM System x Servers Part II Server Support Web Site
IBM has launched a new central support site for all products. The address is listed above. Web addresses change from time to time, but IBM normally links older web addresses to the new address for at least several months after the old site closes. If you bookmark this site in your browser, be sure to maintain your bookmarks as site addresses change. The navigation bar on the left provides the main topics available on the web site.
XW0001 - Servicing IBM System x Servers Part II Software and Device Drivers
IBM System x provides quick access to a wide range of firmware updates, as well as the software and device drivers for supported operating systems, for each System x server, BladeCenter, or storage enclosure. If you are an authorized servicer, there is also a dealer support site with a collection of the more popular links for each product (e.g. https://www304.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=SERVOPTN&brandind=5000008#x460 ).
Whilst IBM extensively tests third party hardware and software and, in many cases, approves them for use with System x servers, not all devices or combinations of devices are tested or supported. If you are working with a server which contains third party devices, you can check for compatibility here. You may also find assistance here that is not contained in the primary documentation and that can help you to isolate a fault.
This site is the central repository for a collection of information and photographs of many IBM System x, BladeCenter, eServer, and xSeries servers, intended for support personnel. (Note: This site and the subsequent one do NOT cover the full list of IBM products. Their content was largely assembled from the products for which the education group provided updated training materials.)
XW0001 - Servicing IBM System x Servers Part II IBM Server - BIOS Simulators
Support personnel often do not have immediate physical access to the machine that someone is asking for help with. These pages contain one of the ship-level BIOS files with a simulator that shows how many of the System x, BladeCenter, eServer, and xSeries machines can be configured. The simulator shows screens similar to the ones that the customer would use to configure their server after pressing F1 during the system boot. (Note: The BIOS level may be different between the simulator version and the one installed on the customer's machine.) We have also included an Options simulator for the BladeCenter management module and numerous adapters. (Note: As of this writing, several servers are still missing from the entire support matrix.)
This site contains links to COG, xRef and other helpful configuration tools
http://www.ibm.com/systems/x/hardware/configtools.html
This Web site contains links to, and descriptions of, several configuration tools. (Note: While these pages are intended for pre-sale support, they are often useful for Business Partners and in a service/post-sale environment.) The COG contains general information about IBM products and supported options for currently shipping equipment (updated each month). The xRef documents provide a brief technical overview of each of the servers in the System x/BladeCenter, IntelliStation, and withdrawn systems families (e.g. past servers are removed from the originals and made available in the withdrawn systems xRef). Other configuration tools deal with BladeCenter interoperability, rack configuration, and power/equipment sizing.
This topic has discussed several helpful support sites for configuring, maintaining, and troubleshooting IBM servers.
This course is now complete. Thank you for attending. System x and BladeCenter Service and Support Education hopes you have enjoyed it and found it both interesting and valuable to your job. If you have any comments or suggestions regarding this education, please let your instructor know and s/he will pass them on to the education development teams. We ALWAYS act on comments and suggestions as we constantly seek to improve our education offerings.