Share 2011 - SFM


Sysplex Failure Management:

The Good-The Bad-The Ugly – Almost!


Mark Wilson
RSM Partners

2nd March 2011


Session Number: 8685
Agenda

•  Introduction
•  Language
•  Overview
•  Sysplex Failure Manager (SFM)
•  Automatic Restart Manager (ARM)
•  Summary
•  Questions

© RSM Education LLP 2011


Introduction

•  I am a mainframe technician with some knowledge of z/OS & Sysplex
•  I have been doing this for almost 30 years
•  When creating this presentation, I found it difficult to talk about SFM alone,
so the content is a little broader than the abstract!
•  Happy to take questions as we go



Language!

•  And I don't mean bad language!
•  Two countries separated by a common language!
•  When is a ZEE not a ZEE?
•  When it's a ZED
•  What is PARMLIB(e)?
•  When it's PARMLIB



What's this?

•  Zeebra?
•  No, it's a Zebra!
•  Hopefully this will help you understand me

Acknowledgements
•  This material is extracted from a formal education class:
•  Parallel Sysplex: Operations, Troubleshooting & Recovery
•  www.rsm.co.uk/view_course.php?code=MPOR
•  There are more slides than we can cover in this session; some are hidden from the presentation
•  There is a PDF of the actual course material available for download, with all of the slides and comprehensive notes


Parallel Sysplex: Operations, Troubleshooting & Recovery

Runtime Problem Determination


Objectives
On completing this segment of the course, you will be able to:

•  Identify the different types of sysplex-related error conditions
•  Deal with the connectivity problems in the sysplex
•  Respond correctly to ‘status update missing’ conditions
•  Manage the Sysplex Failure Manager environment
•  Respond appropriately to sysplex timer related problems
•  Handle Coupling Facility environment errors
•  Recognise and respond to structure-related errors for the major application systems
•  Operate successfully in the Automatic Restart Manager environment


Parallel Sysplex: Operations, Troubleshooting & Recovery

Overview



It’s the sysplex that counts...

A parallel sysplex:
•  may consist of up to 32 systems
•  and can accept new systems up to that limit dynamically
•  but can provide a ‘single image’ for the workloads
•  can recover failing work units automatically, anywhere in the sysplex
•  can provide continuous availability for application workloads

[Diagram: CICS, IMS, DB2, TSO and Batch workloads spread across z/OS systems connected by CTCs and two Coupling Facilities]

z/OS operations & functionality

So how do we keep things running?


...not the individual systems
•  If a processor fails in an MP it will be removed; the work unit may be recoverable
•  But if certain components fail in an MP, or z/OS dies, all work units are lost
•  If an image fails in a sysplex, IT will be removed! Affected work units may be recoverable
•  A sysplex can be configured to provide continuous availability, regardless of component failure

[Diagram: a multiprocessor (CPUs, Central Storage, Expanded Storage) running a workload made up of several dispatchable elements (W), compared with a sysplex of z/OS images sharing two Coupling Facilities]


Murphy’s Law – No Redundancy!!

[Diagram: System A, System B, System C ... System ‘n’, each with its own connections, joined by CTCs to a single Coupling Facility (CPs, Coupling Facility Control Code, dump space, structures IXC_SIG01, ISTMNPS and OPERLOG, control and non-control storage), with one Sysplex Couple Data Set and one CFRM Couple Data Set holding Policy 1]

If it CAN go wrong … it WILL!

Redundancy is good for you….But!

[Diagram: the same configuration, now with a second Coupling Facility and duplicated Sysplex and CFRM Couple Data Sets]

If you’ve got backup … it doesn’t matter!

Redundancy is good for you….But!

•  It’s expensive

•  So it’s a Risk/Security vs Cost Debate

•  So the hardware sales guys like this!



Example configuration
Full redundancy
•  All systems fully connected by two sets of SCTCs
•  XCF also using list structures
•  In both of the two Coupling Facilities
•  Which are connected to each system by at least two CFCs
•  Two sysplex timers cross-linked
•  All Couple Data Sets have alternates

[Diagram: BP01 and BP02 on BOX1, BP03 to BP08 on BOX2 and BOX3, the Coupling Facilities on BOX4, CTCs routed through switches (SW), and a primary (CDS01) and alternate (CDS02) Couple Data Set for each of sysplex, CFRM, SFM, ARM, WLM and LOGR]


Failure events & recovery options
Failing component                              Have backup           No backup             Note
                                               or alternate          or alternate
XCF path (via CTC)                             lose capacity         isolate system        1
XCF path (via List Structures)
  Coupling Facility                            rebuild structures    isolate system(s)     1
  Coupling Facility Channel failure            lose capacity         isolate system(s)     1
  Structure failure (CF ok)                    lose capacity         rebuild structures    5
MVS system (“status update missing”)           n/a                   isolate system        2
Sysplex Timer                                  carry on              wait state            3
Couple Data Set                                duplexed pair         wait state            4
Coupling Facility environment                  rebuild structures    appl dependent        5
  Application (non-signalling) structure loss
Application (batch job or STC)                 n/a                   invoke ARM            6


Failure events & recovery options
1.  Isolating a system due to a physical connectivity problem
–  This can be automated by the Sysplex Failure Manager using information
provided in an SFM policy
2.  Isolating a system when a system fails
–  This can also be automated by SFM in conjunction with some COUPLEnn
parameters
Both of the above situations can be managed automatically by the Sysplex Failure
Manager component of XCF
3.  Dropping a system into a wait state due to ETR failure
–  Not much of a recovery option, you might think. And you’d be right
4.  Dealing with Couple Data Set loss
–  One of the great ‘it depends’ in the recovery environment
5.  Rebuilding a structure
–  This is handled by a combination of SFM action and activities initiated by the
affected connections themselves
6.  Restarting failed applications
–  This is handled via the Automatic Restart Manager
CTC signalling path reconfiguration - 1
(more than one CTC path available)

BP01 D XCF,PO
IXC355I 17.10.40 DISPLAY XCF
PATHOUT TO SYSNAME: BP03
DEVICE (LOCAL/REMOTE): 8038/7018 8030/7010

IOS102I DEVICE 8038 BOXED, PERMANENT ERROR

IXC467I RESTARTING PATHOUT DEVICE 8038
USED TO COMMUNICATE WITH SYSTEM BP03
RSN: I/O ERROR WHILE WORKING

IXC467I STOPPING PATHOUT DEVICE 8038
USED TO COMMUNICATE WITH SYSTEM BP03
RSN: HALT I/O FAILED
DIAG073:08220003 0000000C 00000001 00000000

IXC307I STOP PATHOUT REQUEST FOR DEVICE 8038 COMPLETED
SUCCESSFULLY: HALT I/O FAILED

BP01 D XCF,PO
IXC355I 17.10.30 DISPLAY XCF
PATHOUT TO SYSNAME: BP03
DEVICE (LOCAL/REMOTE): ????/7018 8030/7010


CTC signalling path reconfiguration - 2
(more than one CTC path available)

BP01 D XCF,PO
IXC355I 17.10.30 DISPLAY XCF
PATHOUT TO SYSNAME: BP03
DEVICE (LOCAL/REMOTE): 8030/7010

BP01 VARY 8038,ONLINE,UNCOND
IEE302I 8038 ONLINE

IXC306I START PATHOUT REQUEST FOR DEVICE 8038 COMPLETED
SUCCESSFULLY: DEVICE CAME ONLINE

IXC466I OUTBOUND SIGNAL CONNECTIVITY ESTABLISHED WITH SYSTEM
BP03 VIA DEVICE 8038 WHICH IS CONNECTED TO DEVICE 7018

BP01 D XCF,PO
IXC355I 11.13.329 DISPLAY XCF
PATHOUT TO SYSNAME: BP03
DEVICE (LOCAL/REMOTE): 8038/7018 8030/7010


Losing the last or only CTC signalling path
(only one CTC path available and no structure paths)

BP01 D XCF,PO
IXC355I 17.10.40 DISPLAY XCF
PATHOUT TO SYSNAME: BP03
DEVICE (LOCAL/REMOTE): 8030/7010

IXC467I STOPPING PATHOUT DEVICE 8030
USED TO COMMUNICATE WITH SYSTEM BP03
RSN: RETRY LIMIT EXCEEDED

IXC307I STOP PATHOUT REQUEST FOR DEVICE 8030 COMPLETED
SUCCESSFULLY: RETRY LIMIT EXCEEDED

IXC409D SIGNAL PATHS BETWEEN BP03 AND BP01 ARE LOST. REPLY
RETRY OR SYSNAME=SYSNAME OF THE SYSTEM TO BE REMOVED

Decision time!
•  CTCs are point to point connections
•  If two systems can’t communicate directly, one of them must be removed from the sysplex


Losing the last or only CTC signalling path - 2
(only one CTC path available and no structure paths)

Come on, make your mind up!
IXC409D SIGNAL PATHS BETWEEN BP03 AND BP01 ARE LOST. REPLY
RETRY OR SYSNAME=SYSNAME OF THE SYSTEM TO BE REMOVED

•  The system continues processing and awaits your reply
•  XCF attempts to restart the signalling path anyway; if successful, the message is removed

Reply “retry”
•  Gives you time to SETXCF START another path if you’ve got one

Reply “sysname=BP0n”
IXC417D CONFIRM REQUEST TO REMOVE BP0n FROM THE SYSPLEX.
REPLY SYSNAME=BP0n TO REMOVE BP0n OR C TO CANCEL
IXC458I SIGNAL PATHOUT DEVICE 8030 STOPPED: RETRY LIMIT EXCEEDED
IXC220W XCF IS UNABLE TO CONTINUE: WAIT STATE CODE: 0A2
REASON CODE: 08, LOSS OF CONNECTIVITY DETECTED

0A2-08 is non-restartable. SYSTEM RESET should be performed


Structure signalling path ‘reconfiguration’
(more than one list structure, Coupling Facility, and CFC to each CF)

1) CF Channel failure
IXL518I PATH chpid IS NOW NOT OPERATIONAL TO CUID: CF cuid
COUPLING FACILITY 009672.IBM.00.000020040104
PARTITION: 1 CPCID: 00
(Probably accompanied by an IOSnnnx message)

2) Coupling Facility failure
IXL518I (as above)
IXC518I BP01 NOT USING COUPLING FACILITY (description) NAMED
CF01 REASON: CONNECTIVITY LOST
or maybe
IXC519I COUPLING FACILITY DAMAGE RECOGNIZED FOR COUPLING
FACILITY (description) NAMED CF01

3) Structure failure
IXC467I REBUILDING PATH STRUCTURE IXC_STR1. RSN: STRUCTURE
FAILURE
(see “Sysplex Operations” topic, OPS00310, for remainder of messages)

In all cases, signalling continues using alternate facilities


Losing the only CFC to a signalling structure
(only one list structure, Coupling Facility, CFC to each CF and no CTC paths)

CF Channel failure
IXL518I PATH chpid IS NOW NOT OPERATIONAL TO CUID: CF cuid
COUPLING FACILITY 009672.IBM.00.000020040104
PARTITION: 1 CPCID: 00
(Probably accompanied by an IOSnnnx message)

Who’s affected?
•  If the CFC definitions are shared via EMIF, all LPARs on the affected processor
•  If dedicated, just the affected system

IXC467I STOPPING PATH STRUCTURE IXC_STR1
RSN: LOST CONNECTIVITY TO STRUCTURE
IXC409D SIGNAL PATHS BETWEEN nnnn AND BP01 ARE LOST. REPLY
RETRY OR SYSNAME=SYSNAME OF THE SYSTEM TO BE REMOVED

•  IXC409D will be issued on BP01 (and BP02 if CFC shared) once for each system to which connectivity has been lost
•  Same options and results as before


Losing the only CF (using a structure for signalling)
(only one list structure, Coupling Facility, CFC to each CF and no CTC paths)

Coupling Facility failure
IXL518I (as before)
IXC518I nnnn NOT USING COUPLING FACILITY (description) NAMED
CF01 REASON: CONNECTIVITY LOST
Or maybe

Who’s affected?
•  All systems!

IXC467I STOPPING PATH STRUCTURE IXC_STR1
RSN: LOST CONNECTIVITY TO FACILITY
IXC409D SIGNAL PATHS BETWEEN nnnn AND nnnn ARE LOST. REPLY
RETRY OR SYSNAME=SYSNAME OF THE SYSTEM TO BE REMOVED

•  IXC409D will be issued on all systems
•  Only one system allowed to remain active, all others must be removed via the 0A2 wait state


Losing the only signalling structure
(only one list structure, Coupling Facility, CFC to each CF and no CTC paths)

Structure failure
IXC467I REBUILDING PATH STRUCTURE IXC_STR1. RSN: STRUCTURE
FAILURE
(see Sysplex Operation segment, OPS00310, for remainder of messages)

Who’s affected?
•  All systems, but only temporarily until the structure is rebuilt
•  Of course, if you see this

IXC467I STOPPING PATH STRUCTURE IXC_STR1
RSN: REBUILD FAILED, UNABLE TO USE ORIGINAL

•  Then it will be as if you’ve lost the Coupling Facility!


‘Status update missing’ conditions

COUPLExx
INTERVAL(25)
OPNOTIFY(28)

[Timeline: last “check in” time; next “check in” time reached after INTERVAL but the system hasn’t checked in; IXC402D issued at OPNOTIFY; operator replied “INTERVAL=SSSSS”]

“Status update missing”
•  If an XCF image fails to update the couple datasets within the INTERVAL
time, the other XCFs raise a status update missing condition
•  After the (OPNOTIFY-INTERVAL) time, the following message is issued:
IXC402D BP01 LAST OPERATIVE AT hh:mm:ss. REPLY DOWN AFTER
SYSTEM RESET OR INTERVAL=SSSSS TO SET A REPROMPT TIME

Implications?
•  BP01 could theoretically be working fine, apart from XCF, but is probably
in a disabled condition
•  Could be a restartable condition, or may need re-IPL!


Monitor detected ‘Stop’ status

IXC335I 17.04.41 DISPLAY XCF 479
SYSPLEX RSMPLX
SYSTEM TYPE SERIAL LPAR STATUS TIME          SYSTEM STATUS
RSMA   2086 722D   03   06/13/2010 17:04:40  ACTIVE TM=SIMETR
RSMB   2086 722D   04   06/13/2010 17:04:15  MONITOR-DETECTED STOP


Removing the system and replying “down”

COUPLExx
INTERVAL(25)
OPNOTIFY(28)

[Timeline: last “check in” time; system hasn’t checked in after INTERVAL; IXC402D issued at OPNOTIFY; reply “DOWN”; 0A2-20 wait state]

Removing the system
IXC402D BP01 LAST OPERATIVE AT hh:mm:ss. REPLY DOWN AFTER
SYSTEM RESET OR INTERVAL=SSSSS TO SET A REPROMPT TIME

If BP01 is dead, reply DOWN, but only AFTER
•  SYSTEM RESET-NORMAL
•  LOAD-NORMAL (to re-IPL z/OS or IPL SAD)
•  SYSTEM RESET-CLEAR or LOAD-CLEAR
•  SYSIM or POR
•  Loss of power to BP01 box
•  LPAR deactivation or LPAR reset

“DOWN” removes BP01 from sysplex and loads 0A2-20 wait


D XCF after system has been removed

IXC335I 17.17.02 DISPLAY XCF 559
SYSPLEX RSMPLX
SYSTEM TYPE SERIAL LPAR STATUS TIME          SYSTEM STATUS
RSMA   2086 722D   03   06/13/2010 17:17:01  ACTIVE TM=SIMETR


SPINTIME & INTERVAL

EXSPATnn
SPINTIME(10)

COUPLExx
INTERVAL(25)
OPNOTIFY(28)

[Timeline: spin loop entered; after one SPINTIME an excessive spin loop is detected and SPIN taken; after a further SPINTIME the spin is freed, or ABEND, TERM or ACR taken. BP01 checks in at the start of INTERVAL and must check in again by the end of it; IXC402D is issued at OPNOTIFY]

•  SPINTIME is an ‘internal’ value, it represents problems ‘inside’ the system
•  INTERVAL is an ‘external’ value, it represents a problem at the sysplex level
•  SPINTIME should be less than INTERVAL
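
The rule that SPINTIME should be less than INTERVAL can be checked mechanically. The Python sketch below is illustrative only: `check_timing` is a hypothetical helper, not an IBM tool, and the second check (OPNOTIFY not smaller than INTERVAL) is an assumption drawn from the slide values rather than a rule the slides state.

```python
def check_timing(spintime: int, interval: int, opnotify: int) -> list:
    """Return warnings for a set of timing values (all in seconds)."""
    warnings = []
    # Rule from the slide: the 'internal' spin timer should expire first.
    if spintime >= interval:
        warnings.append("SPINTIME should be less than INTERVAL")
    # Assumption: the operator prompt should not precede the
    # status update missing condition itself.
    if opnotify < interval:
        warnings.append("OPNOTIFY should not be less than INTERVAL")
    return warnings


# Slide values: EXSPATnn SPINTIME(10), COUPLExx INTERVAL(25) and OPNOTIFY(28)
print(check_timing(10, 25, 28))
# A deliberately bad combination
print(check_timing(30, 25, 28))
```

With the slide values the list of warnings is empty; raising SPINTIME above INTERVAL produces the first warning.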


System Isolation techniques

Connectivity failures
•  If BP01 loses its last or only connection to one or more systems in the sysplex it will be isolated
•  BP01 is a working system, it’s just lost communications with the sysplex
•  So XCF on BP01 will load 0A2 wait state for itself

Status update missing
•  If BP01 goes into a disabled condition, it will miss its update interval and will be isolated
•  But in this case BP01 is dead and can’t post its own wait state
•  System isolation:
   •  is performed via the Coupling Facility from another system in the sysplex
   •  is done via the channel subsystem on the target system (BP01)


Parallel Sysplex: Operations, Troubleshooting & Recovery

Sysplex Failure Manager



SFM & ARM – Optional XCF Features

Sysplex Failure Manager deals with XCF level sysplex failures
Automatic Restart Manager restarts failed jobs

Related components
•  Implemented via XCF
•  Policies for dealing with failures in the sysplex
•  Different policies for different workloads (overnight, etc.)
•  Policies can be switched with SETXCF START,POLICY

[Diagram: SFM and ARM Couple Data Sets (primary, alternate and spare) holding the policies PRODDAY, PRODEVE and TESTDAY, with PRODDAY active; systems BP01 to BP08 on BOX1, BOX2 and BOX3, each running LOGGER, XCF and WLM]


The Sysplex Failure Manager (SFM)

COUPLExx (SYS1.PARMLIB)
INTERVAL(25)
OPNOTIFY(28)
CLEANUP(60)

SFM Policy (SFM Couple Data Set)
CONNFAIL(YES)
SYSTEM NAME(BP01)
  WEIGHT(500)
  PROMPT
SYSTEM NAME(BP02)
  etc

Sysplex Failure Manager
•  SFM policy in SFM Couple Data Set can be used to automate system isolation events caused by:
   •  lost connectivity
   •  status update missing conditions
•  can also be used for automatic LPAR reconfiguration
•  Works in connection with COUPLEnn parameters
•  Policies can be switched via SETXCF START,POLICY
•  Active policy can be deactivated via SETXCF STOP,POLICY


SFM policy options

SFM Policy (SFM Couple Data Set)

1  CONNFAIL(YES)           should SFM use system weights to automate connectivity failure processing?

2  SYSTEM NAME(*)          this defines default values for all systems
     WEIGHT(1)             all systems have equal values for connectivity failure processing
     PROMPT                IXC402D should be issued for status update missing conditions

3  SYSTEM NAME(BP01)       if this system loses connectivity to another
     WEIGHT(500)           it is much more important than the other!

4  SYSTEM NAME(BP02)       if this system misses its status update don’t issue IXC402D, but instead:
     ISOLATETIME(nnnnn)    isolate automatically after “nnnnn” secs
     DEACTTIME(nnnnn)      or deactivate its LPAR after “nnnnn” secs
     RESETTIME(nnnnn)      or system reset its LPAR after “nnnnn” secs
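
One way to read the policy above is as a set of defaults, SYSTEM NAME(*), with per-system overrides. The Python sketch below models that reading. It is an illustration, not how XCF stores a policy: the dictionary layout, the `effective_settings` helper and the 30 second ISOLATETIME for BP02 are all assumptions (the slide leaves the value as “nnnnn”).

```python
# Hypothetical in-memory model of the SFM policy shown on the slide.
DEFAULTS = {"WEIGHT": 1, "SSUM_ACTION": "PROMPT"}      # SYSTEM NAME(*)
OVERRIDES = {
    "BP01": {"WEIGHT": 500},                           # SYSTEM NAME(BP01)
    # 30 is an illustrative value; the slide shows ISOLATETIME(nnnnn).
    "BP02": {"SSUM_ACTION": "ISOLATETIME", "SECONDS": 30},
}


def effective_settings(system: str) -> dict:
    """Start from the NAME(*) defaults, then apply any per-system entry."""
    settings = dict(DEFAULTS)
    settings.update(OVERRIDES.get(system, {}))
    return settings


for name in ("BP01", "BP02", "BP03"):
    print(name, effective_settings(name))
```

BP03 has no entry of its own, so it picks up the defaults unchanged; BP01 keeps PROMPT but gains the higher weight.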


Starting & stopping the SFM policy
SETXCF START,POLICY,POLNAME=SFMPOL1,TYPE=SFM

IXC616I SFM POLICY SFMPOL1 INDICATES CONNFAIL(YES) FOR SYSPLEX RSMPLX


IXC602I SFM POLICY SFMPOL1 INDICATES ISOLATETIME(0) 485
SSUMLIMIT(25) FOR SYSTEM RSMA FROM THE DEFAULT POLICY ENTRY.
IXC609I SFM POLICY SFMPOL1 INDICATES FOR SYSTEM RSMA A SYSTEM WEIGHT OF
5 SPECIFIED BY POLICY DEFAULT
IXC614I SFM POLICY SFMPOL1 INDICATES MEMSTALLTIME(NO) FOR SYSTEM RSMA AS
SPECIFIED BY SYSTEM DEFAULT
IXC601I SFM POLICY SFMPOL1 HAS BEEN STARTED BY SYSTEM RSMA

TYPE: SFM
POLNAME: SFMPOL1
STARTED: 06/13/2010 17:06:55
LAST UPDATED: 06/13/2010 10:36:34

SETXCF STOP,POLICY,TYPE=SFM
IXC607I SFM POLICY HAS BEEN STOPPED BY SYSTEM RSMA

TYPE: SFM
POLICY NOT STARTED



SFM processing for connectivity failures
(only one CTC path available and no structure paths)

SFM Policy (SFM Couple Data Set)
CONNFAIL(YES)
SYSTEM NAME(*)
  WEIGHT(1)
SYSTEM NAME(BP01)
  WEIGHT(500)

[Diagram: BP01 has Weight=500; BP02 to BP08 each have Weight=1]

SFM processing for connectivity failure
•  CTC path from BP03 to BP01 lost, no alternate available
•  One of the systems must be removed
•  SFM is active for connectivity failures (CONNFAIL(YES))
•  WEIGHTS are used, and BP03 removed
(IXC458I SIGNAL PATHIN DEVICE 7030 STOPPED: reason)
IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR BP03
IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR BP03
PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGER
BECAUSE OF A SIGNALLING CONNECTIVITY FAILURE IN THE
SYSPLEX – REASON FLAGS: flags


SFM processing for connectivity failures

•  We have a CTC only signalling configuration, with only one pair of
connections between each system
•  The path from BP03 to BP01 fails
•  No alternate path is available, so one of the two systems must be
removed
•  We have an active SFM policy which includes the CONNFAIL(YES)
setting, so SFM takes over
•  SFM checks the weights of BP01 and BP03, so it looks like BP03 is
the loser here
•  SFM removes BP03, and issues the messages to indicate what has
happened
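
The weight check in this two-system case can be sketched in Python as follows. This is a deliberately simplified illustration of the scenario described above, not SFM's actual algorithm: `system_to_remove` is a hypothetical helper, and returning `None` on a tie is an assumption, since the slides do not say how equal weights are resolved.

```python
from typing import Optional


def system_to_remove(weights: dict, a: str, b: str,
                     default_weight: int = 1) -> Optional[str]:
    """Pick the lower-weight system of a pair that can no longer signal.

    Systems without an explicit entry use the SYSTEM NAME(*) weight.
    """
    weight_a = weights.get(a, default_weight)
    weight_b = weights.get(b, default_weight)
    if weight_a == weight_b:
        return None  # assumption: a tie is left undecided in this sketch
    return a if weight_a < weight_b else b


# Policy from the slides: WEIGHT(1) by default, WEIGHT(500) for BP01
policy_weights = {"BP01": 500}
print(system_to_remove(policy_weights, "BP01", "BP03"))  # BP03
```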



Displaying SFM parameters

SFM Active

INTERVAL  OPNOTIFY  MAXMSG  CLEANUP  RETRY  CLASSLEN
      85        88    2000       15     10       956

SSUM ACTION  SSUM INTERVAL  SSUM LIMIT  WEIGHT  MEMSTALLTIME
    ISOLATE              0          25       5            NO


CF signalling, connectivity failures & SFM’s weights
(signalling via single CF structure, lose only CFC to only CF)

SFM Policy (SFM Couple Data Set)
CONNFAIL(YES)
SYSTEM NAME(*)
  WEIGHT(1)
SYSTEM NAME(BP01)
  WEIGHT(500)

[Diagram: BP01 (Weight=500) stays up; BP02 to BP08 (Weight=1 each) are shown in wait state 0A2-110]

CF signalling, connectivity failure & weights
•  If a system loses its connection to all other systems
•  Keep all the other systems up?
•  Or keep this system up?
•  Compare weight of this system to the sum of other systems weights

Be careful what values you use!
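
The comparison of one system's weight against the sum of the others can be sketched as below. This is a minimal Python illustration of the rule on the slide under stated assumptions: `survivors` is a hypothetical helper, and keeping the other systems on a tie is an assumption the slides do not confirm.

```python
def survivors(weights: dict, isolated: str) -> list:
    """Decide who stays up when `isolated` loses contact with every other system."""
    others = {name: w for name, w in weights.items() if name != isolated}
    # Rule from the slide: compare this system's weight with the sum of the rest.
    if weights[isolated] > sum(others.values()):
        return [isolated]
    return sorted(others)  # assumption: a tie keeps the larger group


# Slide values: BP01 has WEIGHT(500), BP02 to BP08 have WEIGHT(1)
slide_weights = {"BP01": 500}
slide_weights.update({f"BP0{n}": 1 for n in range(2, 9)})
print(survivors(slide_weights, "BP01"))   # BP01 alone outweighs the other seven

# A more modest weight for BP01 changes the outcome
modest_weights = dict(slide_weights, BP01=5)
print(survivors(modest_weights, "BP01"))
```

With WEIGHT(500), BP01 outweighs the combined weight of 7 for the other systems, so seven systems are lost to keep one running. That is the reason for the warning about choosing values carefully.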


SFM processing for status update missing

COUPLExx
INTERVAL(25)
OPNOTIFY(28)
CLEANUP(60)

SFM Policy (SFM Couple Data Set)
CONNFAIL(YES)
SYSTEM NAME(*)
  WEIGHT(1)
  PROMPT
SYSTEM NAME(BP01)
  WEIGHT(500)

SFM processing for status update missing
•  by default, the IXC402D is still issued after the OPNOTIFY time
IXC402D BP01 LAST OPERATIVE AT hh:mm:ss. REPLY DOWN AFTER
SYSTEM RESET OR INTERVAL=SSSSS TO SET A REPROMPT TIME

(system reset and reply down)

IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR BP01

•  the Group Exits of any associated XCF applications on the other systems
are notified in case any application recovery is needed
•  when the CLEANUP interval expires or the group exits finish

IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR BP01
PRIMARY REASON: SYSTEM STATUS UPDATE MISSING


SFM processing for status update missing

•  In this example, BP01 enters a status update missing condition
•  By default, our active SFM policy issues the IXC402D message. This
is done after the COUPLExx OPNOTIFY period expires
•  BP01 can’t be restarted, so SYSTEM RESET is performed and the
operators reply DOWN
•  XCF starts the partitioning process and issues IXC101I
•  XCF then notifies the group exits of all the members on BP02
through BP08 of those XCF groups that also had members on BP01
–  The idea here is that applications might need to ‘clean up’ before BP01 is removed
from the sysplex
•  When all the group exits have responded, or the COUPLExx
CLEANUP interval expires, whichever comes first, BP01 will be
placed into the 0A2 wait state and IXC105I issued
Sysplex partitioning
SFM in action
IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR RSMB REQUESTED BY
XCFAS. REASON: SFM STARTED DUE TO STATUS UPDATE MISSING
*22 IXC102A XCF IS WAITING FOR SYSTEM RSMB DEACTIVATION. REPLY DOWN WHEN
MVS ON RSMB HAS BEEN SYSTEM RESET

RESPONSE=RSMA
IXC335I 17.14.23 DISPLAY XCF 498
SYSPLEX RSMPLX
SYSTEM TYPE SERIAL LPAR STATUS TIME SYSTEM STATUS
RSMA 2086 722D 03 06/13/2010 17:14:22 ACTIVE TM=SIMETR
RSMB 2086 722D 04 06/13/2010 17:11:57 BEING REMOVED - RSMA



SFM processing for status update missing - 2

COUPLExx
INTERVAL(25)
OPNOTIFY(28)
CLEANUP(60)

SFM Policy (SFM Couple Data Set)
CONNFAIL(YES)
SYSTEM NAME(*)
  WEIGHT(1)
  PROMPT
SYSTEM NAME(BP01)
  WEIGHT(500)
  ISOLATETIME(30)

If ISOLATETIME is coded:
•  the default PROMPT value is ignored and IXC402D is not issued
•  XCF waits for the ISOLATETIME interval, then starts isolating BP01

IXC101I SYSPLEX PARTITIONING IN PROGRESS FOR BP01

•  the Group Exits of any associated XCF applications on the other systems
are notified in case any application recovery is needed
•  when the CLEANUP interval expires or the group exits finish

IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR BP01
PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGER
BECAUSE ITS STATUS UPDATE WAS MISSING -


SFM – system isolation

COUPLExx
INTERVAL(25)
OPNOTIFY(28)
CLEANUP(60)

SFM Policy (SFM Couple Data Set)
CONNFAIL(YES)
SYSTEM NAME(*)
  WEIGHT(1)
  PROMPT
SYSTEM NAME(BP01)
  WEIGHT(500)
  ISOLATETIME(30)

ISOLATETIME
•  If the system has missed a status update and is not signalling, things proceed as
just described
•  If the system has missed a status update but is still signalling other systems,
it is not isolated immediately

IXC427A SYSTEM BP01 HAS NOT UPDATED STATUS SINCE hh:mm:ss BUT IS SENDING
XCF SIGNALS. XCF SYSPLEX FAILURE MANAGEMENT WILL REMOVE BP01 IF
NO SIGNALS ARE RECEIVED WITHIN A SECOND INTERVAL
IXC426D SYSTEM BP01 IS SENDING SIGNALS BUT NOT UPDATING STATUS.
REPLY SYSNAME BP01 TO REMOVE THE SYSTEM [OR R TO RETRY]

•  Processing continues meanwhile
•  If signalling stops, BP01 will be isolated as before
•  reply BP01 will isolate BP01
•  do nothing (or reply R), BP01 remains in sysplex for a further interval
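
The two ISOLATETIME cases above reduce to a small decision on two observations: has the status update been missed, and is the system still sending XCF signals? The Python sketch below captures only that branching. The function name and the returned strings are hypothetical labels for this illustration, not XCF terminology.

```python
def sfm_response(status_update_missing: bool, still_signalling: bool) -> str:
    """Summarise what the slide says happens when ISOLATETIME is coded."""
    if not status_update_missing:
        return "no action"
    if still_signalling:
        # Not isolated immediately: IXC427A and IXC426D are issued instead.
        return "issue IXC427A/IXC426D and wait a further interval"
    return "isolate after ISOLATETIME"


print(sfm_response(True, False))
print(sfm_response(True, True))
```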


Time interval relationships with SFM

COUPLExx
INTERVAL(25)
OPNOTIFY(28)
CLEANUP(60)

SFM Policy
CONNFAIL(YES)
SYSTEM NAME(*)
  WEIGHT(1)
  PROMPT
SYSTEM NAME(BP01)
  WEIGHT(500)
SYSTEM NAME(BP02)
  ISOLATETIME(nnnnn)
  DEACTTIME(nnnnn)
  RESETTIME(nnnnn)

[Timeline with ISOLATETIME, DEACTTIME or RESETTIME: checks in; INTERVAL; hasn’t checked in (“status update missing”); ISOLATETIME, DEACTTIME or RESETTIME; CLEANUP; IXC105I issued, “partitioning complete”]

[Timeline with PROMPT: INTERVAL; OPNOTIFY; IXC402D issued, system reset, reply “down”; CLEANUP; IXC105I issued, “partitioning complete”]

If all XCF applications clean up before the CLEANUP time expires the system
is partitioned at that point
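
Reading the two timelines as simple sums gives an upper bound on the time from the last check-in to IXC105I. The Python sketch below is an illustration under that assumption: the helper names are hypothetical, the operator delay is an invented input, and the real elapsed time is shorter whenever the group exits finish before CLEANUP expires.

```python
def isolate_path_limit(interval: int, isolatetime: int, cleanup: int) -> int:
    """Upper bound in seconds when ISOLATETIME is coded.

    Assumes the sequence INTERVAL, then ISOLATETIME, then CLEANUP.
    """
    return interval + isolatetime + cleanup


def prompt_path_limit(opnotify: int, operator_delay: int, cleanup: int) -> int:
    """Upper bound in seconds when PROMPT is in effect.

    Assumes IXC402D appears OPNOTIFY seconds after the last check-in and
    the operator takes `operator_delay` seconds to reset and reply DOWN.
    """
    return opnotify + operator_delay + cleanup


# Slide values: INTERVAL(25), OPNOTIFY(28), CLEANUP(60), ISOLATETIME(30)
print(isolate_path_limit(25, 30, 60))
# 120 seconds of operator delay is an illustrative figure
print(prompt_path_limit(28, 120, 60))
```

The comparison shows why automation matters here: the ISOLATETIME path is bounded by policy values alone, while the PROMPT path depends on how quickly someone answers the console.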


The SFM environment

COUPLExx
INTERVAL(25)
OPNOTIFY(28)
CLEANUP(60)

SFM Policy (SFM Couple Data Set)
CONNFAIL(YES)
SYSTEM NAME(*)
  WEIGHT(1)
  PROMPT
SYSTEM NAME(BP01)
  WEIGHT(500)
  ISOLATETIME(30)

DISPLAY XCF,POLICY,TYPE=SFM
IXC364I 20:22:04 DISPLAY XCF
TYPE: SFM
POLNAME: SFMPOL99
STARTED: 05/25/97 18:03:22
LAST UPDATED: 05/25/97 18:03:22

DISPLAY XCF,COUPLE
IXC357I 20:28:14 DISPLAY XCF
SYSTEM BP01 DATA
INTERVAL  OPNOTIFY  MAXMSG  CLEANUP  RETRY  CLASSLEN
      25        28     500       60     10       956
Can be changed via SETXCF

SSUM ACTION  SSUM INTERVAL  WEIGHT
    ISOLATE             25     500
Can only be changed by changing SFM policy

SETXCF COUPLE,INTERVAL=nn [,OPNOTIFY=nn, CLEANUP=nn]
IXC309I SETXCF COUPLE,INTERVAL REQUEST WAS ACCEPTED
SETXCF STOP,POLICY,TYPE=SFM
IXC607I SFM POLICY HAS BEEN STOPPED BY SYSTEM BP01
SETXCF START,POLICY,TYPE=SFM,POLNAME=SFMPOL77
IXC601I SFM POLICY HAS BEEN STARTED BY SYSTEM BP01


Enabling SFM, switching SFM data sets

Turn SFM on dynamically
SETXCF COUPLE,TYPE=SFM,PCOUPLE=
SETXCF COUPLE,TYPE=SFM,ACOUPLE=
SETXCF START,POLICY,TYPE=SFM,POLNAME=
•  the above commands have sysplex scope if all systems have access to data sets
•  all systems must have access for SFM to become active
•  COUPLExx should be updated to reflect SFM data sets

COUPLExx
COUPLE SYSPLEX(&SYSPLEX)
       PCOUPLE etc
       INTERVAL(25)
       OPNOTIFY(28)
       CLEANUP(60)
       etc.
DATA   TYPE(SFM)
       PCOUPLE(SYS1.SFM.CDS01)
       ACOUPLE(SYS1.SFM.CDS02)

SFM status retained across IPLs
•  if you shut down a system or the sysplex, the last used data sets and
policy are activated on re-IPL

Switching SFM data sets
SETXCF COUPLE,PSWITCH,TYPE=SFM
SETXCF COUPLE,ACOUPLE=(SYS1.SFM.CDS03),TYPE=SFM

Re-IPL one system (e.g. BP04) after SFM CDS switch
•  no problems rejoining sysplex, even though COUPLEnn specifies
‘wrong’ SFM Couple Data sets
•  SFM on BP04 is told by other SFMs that CDS02 and CDS03 are
currently in use instead

Status Information
Sysplex name: BPPLEX01
SFM status: active
Couple member: COUPLExx
Maxsystem: 8
Active policy: SFMPOL0


Other SFM considerations
Re-IPL sysplex after SFM data sets switched
•  SFM switched to CDS02/03 from CDS01/02
•  Shutdown whole sysplex, re-IPL first system

IXC287I THE COUPLE DATASETS SPECIFIED IN COUPLEnn ARE
        INCONSISTENT WITH THOSE LAST USED FOR SFM
IXC288I COUPLE DATASETS SPECIFIED IN COUPLEnn FOR SFM ARE
        PRIMARY: SYS1.SFM.CDS01 ON VOLSER volser
        ALTERNATE: SYS1.SFM.CDS02 ON VOLSER volser
IXC288I COUPLE DATASETS LAST USED FOR SFM ARE
        PRIMARY: SYS1.SFM.CDS02 ON VOLSER volser
        ALTERNATE: SYS1.SFM.CDS03 ON VOLSER volser
IXC289D REPLY U TO USE THE DATA SETS LAST USED FOR SFM
        OR C TO USE THE DATA SETS SPECIFIED IN COUPLEnn

•  Also, a number of SFM confirmation messages are issued on each system at IPL

COUPLExx
COUPLE SYSPLEX(&SYSPLEX)
       PCOUPLE(SYS1.CDS01)
       ACOUPLE(SYS1.CDS02)
       INTERVAL(25)
       OPNOTIFY(28)
       etc.
DATA TYPE(SFM)
     PCOUPLE(SYS1.SFM.CDS01)
     ACOUPLE(SYS1.SFM.CDS02)

Status Information
Sysplex name: BPPLEX01
Couple member: COUPLExx
Maxsystem: 8
SFM status: active
SFM data sets: CDS02/CDS03
(Diagram: sysplex Couple Data Sets SYS1.CDS01 and SYS1.CDS02; SFM Couple Data Sets SYS1.SFM.CDS02 and SYS1.SFM.CDS03)

Varying a system offline with SFM active
•  If SFM is active...
•  ...and ISOLATION is specified for the target system
•  V XCF,sysname,OFFLINE will result in automatic isolation for that system, i.e. no IXC102A (reply “down”) message is issued



Clocks
Time for an explanation!
(Sequence from initial power on)
1)  Sysplex timer set initially, from the 9037 console or External Time Source
2)  Support Element Battery Operated Clock (SE BOC) set initially from HMC
3)  If 9037 attached, SE BOC updated from 9037
4)  CPC physical TOD clock set initially from SE BOC
5)  PR/SM maintains a logical TOD clock (LTOD) for each LPAR, set from the CPC TOD when the LPAR is activated

When z/OS IPLs, CLOCKnn checked
ETRMODE=NO?
•  Use LTOD or issue “SET CLOCK”
•  LTOD not synchronised with 9037
ETRMODE=YES?
•  LTOD synchronised with 9037
•  System now in ETR synchronisation mode, 9037 will maintain synchronisation from here on

1)  SE BOC clock reset to TOD at 23:00 daily
2)  HMC clock reset to BOC at 23:15 daily

SYS1.PARMLIB(CLOCKnn)   TESTA (not in sysplex)   BP01 (in sysplex)
ETRMODE                 NO                       YES
ETRDELTA                n/a                      10
ETRZONE                 NO                       YES
TIMEZONE                local time               n/a

(Diagram: External Time Source, Sysplex Timer, 9672 Support Element Battery Operated Clock, CPC physical TOD clock, logical TOD clocks for LPAR1 and LPAR2)



ETR / TOD synchronisation
The synchronisation process
1) OSCILLATOR signal
   ensures all clocks run at same ‘speed’
2) Data signal
   actual time, zone offset, status, every few usec
   Data signal stored by CPC
3) OTE signal
   every 1.048576 secs, acts as reference time
4) Compare OTE with TOD. NE? ETR SYNCH CHK irpt
•  If OTE/TOD delta < ETRDELTA, reset TOD using current data signal value
   •  Forward? Just reset
   •  Backward? Spin all CPs for appropriate time

TOD clock is just a microsecond counter
1101000111010010 1 01010101001 1…
Bit 32: every 1.048576 sec
Bit 51: every usec

SYS1.PARMLIB(CLOCKnn), BP01 in LPAR1
ETRMODE YES
ETRDELTA= 10
ETRZONE= YES
TIMEZONE (n/a)

(Diagram: Sysplex Timer signals feeding the CPC TOD, LP TOD and SE BOC)
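The 1.048576 second figure can be checked with a little arithmetic. The sketch below assumes only what the slide states, that bit 51 of the TOD clock ticks once per microsecond, and that bits are numbered from 0 at the most significant end. The function names are illustrative, not part of any z/OS interface.

```python
# Sketch: timing of TOD clock bit positions, assuming bit 51 = 1 microsecond.

def bit_period_usec(bit: int) -> int:
    """Microseconds between successive increments at the given bit position."""
    if not 0 <= bit <= 51:
        raise ValueError("bit must be between 0 and 51")
    return 2 ** (51 - bit)


def carry_out_period_sec(bit: int) -> float:
    """Seconds for one full cycle of the bit, i.e. between carries out of it."""
    return 2 * bit_period_usec(bit) / 1_000_000


if __name__ == "__main__":
    print(bit_period_usec(51))       # microseconds per tick of bit 51
    print(carry_out_period_sec(32))  # seconds between carries out of bit 32
```

Under these assumptions a carry out of bit 32 happens every 2**20 microseconds, which matches the OTE interval quoted on the slide.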



ETRDELTA
1) OSCILLATOR signal
   ensures all clocks run at same ‘speed’, so shouldn’t ‘drift’
2) Data signal (very frequent)
   actual time, zone offset, status, every few usec
   Data signal stored by CPC
3) OTE signal
   every 1.048576 secs, acts as reference time
4) Compare OTE with TOD. NE? ETR SYNCH CHK irpt
•  If OTE/TOD delta < ETRDELTA, reset TOD using current data signal value (small correction)
   •  Forward? Just reset
   •  Backward? Spin all CPs for appropriate time

TOD clock is just a microsecond counter
1101000111010010 1 01010101001 1…
Bit 32: every 1.048576 sec
Bit 51: every usec

How could a discrepancy occur?
•  Hardware malfunction in CPC
•  Resetting the 9037
•  No new OTE from the 9037

ETRDELTA is not the maximum discrepancy allowed in the sysplex, it’s the maximum amount of spin time when adjusting the TOD!

SYS1.PARMLIB(CLOCKnn), BP01 in LPAR1
ETRMODE YES
ETRDELTA= 10
ETRZONE= YES
TIMEZONE (n/a)
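The decision described in step 4 can be sketched as a small function. This is only an illustration of the bullets above, not the actual z/OS logic: the function name, the return strings and the use of a single shared time unit for all three arguments are assumptions, and the case where the delta reaches ETRDELTA is deliberately left open because the slide does not cover it.

```python
# Sketch of the OTE/TOD comparison, using hypothetical names and one time unit.

def resync_action(tod: int, ote: int, etrdelta: int) -> str:
    """Describe what happens when the TOD is compared with the OTE reference."""
    delta = ote - tod
    if delta == 0:
        return "in sync: nothing to do"
    if abs(delta) >= etrdelta:
        # The slide only describes the delta < ETRDELTA case.
        return "outside ETRDELTA: not covered by the slide"
    if delta > 0:
        # TOD is behind the reference, so it can simply be moved forward.
        return "forward: reset TOD from data signal"
    # TOD is ahead of the reference: wait out the difference before resetting.
    return f"backward: spin all CPs for {abs(delta)}, then reset TOD"


if __name__ == "__main__":
    print(resync_action(tod=100, ote=103, etrdelta=10))
    print(resync_action(tod=103, ote=100, etrdelta=10))
```

The backward case is why ETRDELTA bounds spin time: the larger the permitted correction, the longer every CP may have to spin.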



Sysplex timer connectivity problems
If the sysplex timer is lost
•  On the affected systems

IEA015A THE SYSTEM HAS LOST ALL CONNECTIONS TO THE SYSPLEX TIMER
(plus lots of other text – see notes for actual text)
(reply RETRY or ABORT)

•  Can fix the connection?
   •  Do so, reply RETRY and should be ok
   •  If connection not fixed, message repeats
•  Can’t fix the connection?
   •  Reply ABORT, will get

IXC462W XCF IS UNABLE TO ACCESS THE ETR AND HAS PLACED THIS SYSTEM
        INTO NON-RESTARTABLE WAIT STATE CODE: 0A2 REASON CODE: 114

(Diagram: BOX1 with BP01 and BP02, BOX2 and BOX3 with BP03 to BP08, each box with its own TOD clock, all attached to the Sysplex Timer)



Losing the sysplex timer
If the last connection or the sysplex timer itself fails
•  On all systems

IEA015A THE SYSTEM HAS LOST ALL CONNECTIONS TO THE SYSPLEX TIMER
(plus lots of other text – see notes for actual text)
(reply RETRY or ABORT)

•  Same as before: if you can’t fix the problem, reply ABORT, will get

IXC462W XCF IS UNABLE TO ACCESS THE ETR AND HAS PLACED THIS SYSTEM
        INTO NON-RESTARTABLE WAIT STATE CODE: 0A2 REASON CODE: 114

•  Can’t re-IPL the sysplex until sysplex timer problem resolved

You could bring up, for example, a 4-way sysplex (BOX3), using ‘SIMETRID’

(Diagram: BOX1 with BP01 and BP02, BOX2 and BOX3 with BP03 to BP08, each box with its own TOD clock, all attached to the Sysplex Timer)



Couple Data Set problems
Sysplex Couple Data Set: lose this, all systems load 0A2 wait state
CFRM Couple Data Set (Policy 1): lose this, all systems also load 0A2 wait state

Losing access to (or just losing) a Couple Data Set without an alternate -
If either of the above, you (or all systems) are placed into non-restartable 0A2
•  Check any messages involving Couple Data Sets very carefully
•  Other Couple Data Sets (below) involve loss of facility, rather than loss of systems
Use alternates!

WLM Couple Data Set (Policy 1): lose this, and each WLM stays in Goal mode, but runs independently
ARM Couple Data Set (Policy 1): lose this, and you’re ARMless
SFM Couple Data Set (Policy 1): lose this, and the systems continue without an active SFM policy
LOGR Couple Data Set (Policy 1): lose this, and you lose access to System Logger services
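For quick reference during problem determination, the consequences listed above can be held in a simple lookup. This is only a restatement of the slide as data; the dictionary and function names are hypothetical.

```python
# Sketch: impact of losing each Couple Data Set type without an alternate,
# exactly as listed on the slide.
CDS_LOSS_IMPACT = {
    "SYSPLEX": "all systems load 0A2 wait state",
    "CFRM": "all systems load 0A2 wait state",
    "WLM": "each WLM stays in Goal mode, but runs independently",
    "ARM": "no ARM restarts available",
    "SFM": "systems continue without an active SFM policy",
    "LOGR": "access to System Logger services is lost",
}


def loses_systems(cds_type: str) -> bool:
    """True if losing this Couple Data Set takes systems down (0A2 wait state)."""
    return "0A2" in CDS_LOSS_IMPACT[cds_type.upper()]


if __name__ == "__main__":
    for name, impact in CDS_LOSS_IMPACT.items():
        kind = "loss of systems" if loses_systems(name) else "loss of facility"
        print(f"{name:8} {kind:17} {impact}")
```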
Changing COUPLE parameters

RO RSMB,SETXCF COUPLE,INTERVAL=20
RO RSMB,SETXCF COUPLE,OPNOTIFY=23



Failures in the Coupling Facility environment
Coupling Facility failure
•  Without alternate CF
   •  some users, e.g. JES, can survive this, some, e.g. IMS, can’t
•  With alternate CF
   •  structures can be rebuilt into alternate CF
Coupling Facility Channel failure
•  Without alternate CFC
   •  this is like losing the CF, so same as above
Structure failure
•  ‘losing’ a structure could be due to:
   •  above conditions
   •  structure failure
   •  a need for structure ‘reconfiguration’, e.g. a new CFRM policy required to increase the maximum size of a structure
Different applications respond differently to these conditions

(Diagram: BOX1 with BP01 and BP02, BOX2 and BOX3 with BP03 to BP08, all connected to CF01 holding Structure X)



Coupling Facility & CFC error indicators
Coupling Facility failure
IXC519E COUPLING FACILITY DAMAGE RECOGNIZED FOR
        COUPLING FACILITY 009672.IBM.02.000020040104
        PARTITION: 1 CPCID: 00 NAMED: CF01

Coupling Facility Channel failure
IXL158I PATH nn IS NOW NOT OPERATIONAL TO CUID nnnn
        COUPLING FACILITY 009672.IBM.02.000020040104
        PARTITION: 1 CPCID: 00
IXC518I SYSTEM nnnn NOT USING
        COUPLING FACILITY 009672.IBM.00.000020040104
        PARTITION: 1 CPCID: 00 NAMED CF01
        REASON: CONNECTIVITY LOST
        REASON FLAG: 13300001

•  If second CF available, will see ‘structure build’ messages
•  If not, application error messages likely

(Diagram: BOX1 with BP01 and BP02, BOX2 and BOX3 with BP03 to BP08, all connected to CF01 holding Structure X)



Structure rebuild - overview
Only permitted if “ALLOWREBLD=YES” specified on original connect request
Same process whether application initiated (as here) or system initiated
Participants: Application A (connected), System B (XCF initiated), Application C (failed)

Rebuild started with IXLREBLD REQUEST=START (can also use SETXCF START,REBUILD)

EVENT EXIT notification      Action taken
REBUILD quiesce              Stop the rebuild or disconnect, or stop activity against structure and issue IXLEERSP
REBUILD connect              IXLCONN REBUILD
IXLCONN REBUILD complete     None required but options are: let new structure ‘fill up’ as normal, or restore data from local buffers, or move data from old to new structure; then IXLREBLD COMPLETE
REBUILD cleanup              Remove references to old structure, issue IXLEERSP
REBUILD complete             None - normal operations resume (old structure deleted, ENF 35 issued)

(Diagram: CFCC in the Coupling Facility holding Structure A (original – failed) and Structure A (new version), with the CFRM Couple Data Set and CFRM policy)

Structure rebuild – why?

(Diagram: application instances A to D on z/OSA to z/OSD, each with XCF, XES and data, linked by CTCs to coupling facilities CF1 and CF2; a structure moves from its original location to a new one. Structure rebuild is controlled by CFRM & SFM, through the CFRM policy and SFM policy held in the Couple Data Sets.)



Structure rebuild controls
CFRM Policy
STRUCTURE NAME(STR01)
          REBUILDPERCENT(50)

SFM Policy
CONNFAIL(YES)
SYSTEM NAME(*) WEIGHT(1)
SYSTEM NAME(SystemA) WEIGHT(10)

Rebuild calculation
•  System A loses connection to CF01
•  A = weight of systems that have lost connectivity
•  B = weight of all systems with connections to CF01
•  C = REBUILDPERCENT value
•  The calculation is: is A / B * 100 ge C?
•  If yes – rebuild structure

(Diagram: System A connected to Str01 and Str04, System B to Str03 and Str06, System C to Str01 and Str06, System D to Str04 and Str07; CF01 holds Str 01 and Str 02, CF02 holds Str 03 and Str 04; each Coupling Facility has CPs, Coupling Facility Control Code, dump space, and control and non-control storage)
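The rebuild calculation above can be sketched in a few lines. The function name is hypothetical, and the worked example assumes that all four systems in the diagram are connected to the structure, with System A weighted 10 and the others taking the WEIGHT(1) default from the SFM policy shown.

```python
# Sketch of the REBUILDPERCENT test: is A / B * 100 >= C?

def should_rebuild(lost_weights, connected_weights, rebuild_percent):
    """Return True if the weighted loss of connectivity reaches REBUILDPERCENT.

    lost_weights      -- SFM weights of systems that lost connectivity (A)
    connected_weights -- SFM weights of all systems with connections (B)
    rebuild_percent   -- REBUILDPERCENT from the CFRM policy (C)
    """
    a = sum(lost_weights)
    b = sum(connected_weights)
    if b <= 0:
        raise ValueError("total weight of connected systems must be positive")
    return a * 100 / b >= rebuild_percent


if __name__ == "__main__":
    # Assumed weights: System A = 10, Systems B, C and D = 1 each.
    print(should_rebuild([10], [10, 1, 1, 1], 50))  # System A loses connectivity
    print(should_rebuild([1], [10, 1, 1, 1], 50))   # a weight-1 system loses it
```

With these assumed weights, losing System A gives 10 / 13 * 100, roughly 77, which exceeds 50, whereas losing a weight-1 system gives roughly 8 and does not trigger a rebuild.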



Structure rebuild – applications support

Application     Structure                      Rebuild    ‘rebuildpercent’   Comments
                                               allowed?   supported?
IRLM            IMS lock structure             Yes        Yes
IMS             OSAM cache structure           Yes        Yes
IMS             VSAM cache structure           Yes        Yes
IRLM            DB2 lock structure             Yes        Yes
DB2             GBP cache structure            Yes        Yes
DB2             SCA list structure             Yes        Yes
SMSVSAM         lock structure                 Yes        Yes
SMSVSAM         VSAM cache structures          Yes        Yes
JES2/3          CHKPT list structure           No         No                 Checkpoint reconfiguration dialog
RACF            cache structures               Yes        Yes
System Logger   Logstream list structures      Yes        No                 Rebuilt if any connectivity loss
GRS STAR        lock structure                 Yes        Yes
XCF             signalling list structure      Yes        No                 Rebuilt if any connectivity loss
VTAM            generic resources structure    Yes        No                 Rebuilt if any connectivity loss



Parallel Sysplex: Operations, Troubleshooting & Recovery

Automatic Restart Manager



Automatic Restart Manager

•  ARM provides the ability to restart subsystem address spaces such as VTAM, CICS, DB2, etc., whether they’re running as batch or started tasks
•  The Automatic Restart Manager:
–  restarts failed batch jobs or started tasks after a system or job failure
–  supports job inter-dependencies on the restarts
•  Although it will support batch jobs, what we’re really talking about here is the ability to restart subsystem products rather than your general batch workload:
–  if the application fails, it will be restarted on the same system
–  if a system fails, its applications will be started on a different system

Automatic Restart Manager

•  ARM is controlled through an ARM policy in an ARM Couple Data Set, but there is an additional step involved here
•  Programs wishing to use ARM services must also register with ARM via the IXCARM service macro
•  This means that programs have to be coded to use ARM
•  The newer releases of the IBM products like CICS, IMS etc. do this
•  If you set up the ARM environment for them, these products will be automatically restarted in the event of the failures described above

Automatic Restart Manager

Automatic Restart Management
•  A set of XCF services controlled by a policy in an ARM Couple Data Set
•  Provides automatic job restart in the event of job abends and system failure
•  Jobs have to issue IXCARM service requests to register with ARM to use the services
•  Most of the IBM products register if ARM is active

(Diagram: BOX1 with BP01 and BP02, BOX2 and BOX3 with BP03 to BP08, connected by CTCs to CF01 (IXC_STR1) and CF02 (Structure X), sharing the Sysplex Couple Data Set)



The ARM policy
(Diagram: Primary, Alternate and (Spare) ARM Couple Data Sets)

Batch Job or Started Task
IXCARM REQUEST=REGISTER,ELEMENT=name

ARM policy options
RESTART_ORDER
  LEVEL(2)
    ELEMENT_NAME(sys??pay*,abcjob)
  LEVEL(3)
    ELEMENT_NAME(xyztask)
RESTART_GROUP(name of group)
  TARGET_SYSTEM(SystemA,SystemB)
  FREE_CSA(below,above)
  RESTART_PACING(delay in secs between restarts in grp)
  ELEMENT(element name)
    RESTART_ATTEMPTS(max #, over what period)
    RESTART_TIMEOUT(interval between restart and REGISTER)
    READY_TIMEOUT(interval between REGISTER and READY)
    TERMTYPE(ALLTERM or ELEMTERM)
    RESTART_METHOD(“if this type of error”, “restart this way”)
  ELEMENT
    (restart parameters for next element)



The ARM defaults
(Diagram: Primary and Alternate ARM Couple Data Sets)

ARM defaults
RESTART_ORDER
  LEVEL(1)
    (DB2, IMS, VTAM always restarted first)
  LEVEL(2)
    ELEMENT_TYPE(SYSLVL2)
RESTART_GROUP(DEFAULT)
  TARGET_SYSTEM(*)
  FREE_CSA(0,0)
  RESTART_PACING(0)
  ELEMENT(*)
    RESTART_ATTEMPTS(3,300)
    RESTART_TIMEOUT(300)
    READY_TIMEOUT(300)
    TERMTYPE(ALLTERM)
    RESTART_METHOD(BOTH,PERSIST)

Beware the defaults
•  Although you can activate ARM and use the defaults, you should not do so
•  The defaults are effectively random and won’t necessarily work for individual applications like CICS, DB2, etc.
•  In any policy you create, to nullify the defaults, include
     RESTART_GROUP(DEFAULT)
       ELEMENT(*)
         RESTART_ATTEMPTS(0,300)
•  Code explicit group/element statements for the work you actually want covered
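To see why RESTART_ATTEMPTS(0,300) nullifies the defaults, it helps to model the "max #, over what period" threshold. The sketch below is an illustration only: the class name is hypothetical, and the sliding-window counting is an assumption about how the threshold behaves, not a description of how ARM implements it.

```python
# Sketch: a "max restarts within a period" threshold, modelled as a sliding window.
from collections import deque


class RestartThreshold:
    def __init__(self, max_attempts: int, period_secs: int):
        self.max_attempts = max_attempts
        self.period_secs = period_secs
        self._restart_times = deque()  # timestamps of restarts still in the window

    def allow_restart(self, now: float) -> bool:
        """Record and permit a restart at time `now`, or refuse it."""
        # Forget restarts that have aged out of the period.
        while self._restart_times and now - self._restart_times[0] >= self.period_secs:
            self._restart_times.popleft()
        if len(self._restart_times) >= self.max_attempts:
            return False
        self._restart_times.append(now)
        return True


if __name__ == "__main__":
    default = RestartThreshold(3, 300)   # like RESTART_ATTEMPTS(3,300)
    print([default.allow_restart(t) for t in (0, 10, 20, 30)])
    nullified = RestartThreshold(0, 300)  # like RESTART_ATTEMPTS(0,300)
    print(nullified.allow_restart(0))
```

With a maximum of zero, no restart is ever permitted, so only elements covered by explicit group/element statements are restarted.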
Manipulating the ARM environment
Activating the defaults (bad idea)
DISPLAY XCF,POLICY,TYPE=ARM
IXC364I 00.25.03 DISPLAY XCF
        TYPE: ARM
        POLICY NOT STARTED

SETXCF START,POLICY,TYPE=ARM
IXC805I ARM POLICY HAS BEEN STARTED BY SYSTEM BP01
        POLICY DEFAULTS ARE NOW IN EFFECT

DISPLAY XCF,POLICY,TYPE=ARM
IXC364I 00.27.22 DISPLAY XCF
        TYPE: ARM
        POLNAME: POLICY DEFAULTS ARE IN EFFECT
        STARTED: 05/30/09 00.26.12
        LAST UPDATED: -- --

Activating an installation defined policy
SETXCF START,POLICY,TYPE=ARM,POLNAME=ARMPOL01
IXC805I ARM POLICY HAS BEEN STARTED BY SYSTEM BP01
        POLICY NAMED ARMPOL01 IS NOW IN EFFECT

DISPLAY XCF,POLICY,TYPE=ARM
IXC364I 00.35.25 DISPLAY XCF
        TYPE: ARM
        POLNAME: ARMPOL01
        STARTED: 05/30/09 00.26.12
        LAST UPDATED: 05/30/09 00.34.53

(Diagram: Primary and Alternate ARM Couple Data Sets SYS1.ARM.CDS01 and SYS1.ARM.CDS02, first holding the defaults, then ARMPOL01)

ARM element states

D XCF,ARMSTATUS,DETAIL
IXC392I . . .
------- ELEMENT STATE SUMMARY -------
(counts of elements in the different states)
RESTART GROUP : nnnnn
ELEMENT NAME : nnnnn
(details of the individual elements)

Be careful, ‘D XCF,ARMSTATUS,DETAIL’ can be a big display!

Element state flow
Event                        Resulting state
Starts JOB/STC               (none)
IXCARM REQUEST=REGISTER      Starting
IXCARM REQUEST=READY         Available
Fails                        Failed
Restarted by ARM             Restarting
IXCARM REQUEST=REGISTER      Recovering
IXCARM REQUEST=WAITPRED      (still Recovering)
IXCARM REQUEST=READY         Available
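The element state flow above can be sketched as a small state machine. Only the transitions drawn on the slide are modelled; the event names FAIL and RESTART are labels chosen here for "Fails" and "Restarted by ARM", and the function names are hypothetical.

```python
# Sketch: ARM element states and the transitions shown on the slide.
TRANSITIONS = {
    ("NONE", "REGISTER"): "STARTING",
    ("STARTING", "READY"): "AVAILABLE",
    ("AVAILABLE", "FAIL"): "FAILED",
    ("FAILED", "RESTART"): "RESTARTING",
    ("RESTARTING", "REGISTER"): "RECOVERING",
    ("RECOVERING", "WAITPRED"): "RECOVERING",
    ("RECOVERING", "READY"): "AVAILABLE",
}


def next_state(state: str, event: str) -> str:
    """Return the state reached from `state` on `event`, per the slide."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {event} in state {state}") from None


def run(events, state: str = "NONE") -> str:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


if __name__ == "__main__":
    lifecycle = ["REGISTER", "READY", "FAIL", "RESTART", "REGISTER", "WAITPRED", "READY"]
    print(run(lifecycle))
```

Tracing a job through these events mirrors what the STATE field of ‘D XCF,ARMSTATUS,DETAIL’ reports at each point.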



“D XCF,ARMSTATUS”
D XCF,ARMSTATUS
IXC392I 00.52.12 DISPLAY XCF
NO ARM ELEMENTS ARE DEFINED

$HASP373 BEPEJOB1 STARTED – INIT A – CLASS F – SYS BP01
IEF403I BEPEJOB1 STARTED – TIME=00.53.14
(job registers with ARM)

D XCF,ARMSTATUS,DETAIL
IXC392I 00.54.32 DISPLAY XCF
ARM RESTARTS ARE ENABLED
-------------- ELEMENT STATE SUMMARY -------------- -TOTAL- -MAX-
STARTING AVAILABLE FAILED RESTARTING RECOVERING
       0         1      0          0          0       1    20
RESTART GROUP:DEFAULT PACING : 0 FREE CSA: 0 0
ELEMENT NAME :BEPEJOB1 JOBNAME :BEPEJOB1 STATE :AVAILABLE
CURR SYS :BP01 JOBTYPE :JOB ASID :002D
INIT SYS :BP01 JESGROUP:BPPLEX01 TERMTYPE:ALLTERM
EVENTEXIT:GOSSIP99 ELEMENTYPE:*NONE* LEVEL : 2
TOTAL RESTARTS : 0 INITIAL START:05/30/09 00.53.14
RESTART THRESH : 0 OF 3 FIRST RESTART:*NONE*
RESTART TIMEOUT: 300 LAST RESTART :*NONE*

(Diagram: element state flow for BEPEJOB1, from (none) through Starting on REGISTER to Available on READY)



ARM restart, same system
(Batch job starts)
$HASP373 BEPEJOB1 STARTED – INIT A – CLASS F – SYS BP01
IEF403I BEPEJOB1 STARTED – TIME=00.53.14
(job registers with ARM)

C BEPEJOB1,ARMRESTART
IEE301I BEPEJOB1 CANCEL COMMAND ACCEPTED
IXC812I JOBNAME BEPEJOB1, ELEMENT BEPEJOB1 FAILED
        THE ELEMENT WAS RESTARTED WITH PERSISTENT JCL
$HASP373 BEPEJOB1 STARTED – INIT A – CLASS F – SYS BP01
IEF403I BEPEJOB1 STARTED – TIME=00.56.54
(job re-registers with ARM)

D XCF,ARMSTATUS,DETAIL
IXC392I 00.58.32 DISPLAY XCF
ARM RESTARTS ARE ENABLED
-------------- ELEMENT STATE SUMMARY -------------- -TOTAL- -MAX-
STARTING AVAILABLE FAILED RESTARTING RECOVERING
       0         1      0          0          0       1    20
RESTART GROUP:DEFAULT PACING : 0 FREE CSA: 0 0
ELEMENT NAME :BEPEJOB1 JOBNAME :BEPEJOB1 STATE :AVAILABLE
CURR SYS :BP01 JOBTYPE :JOB ASID :007F
INIT SYS :BP01 JESGROUP:BPPLEX01 TERMTYPE:ALLTERM
EVENTEXIT:GOSSIP99 ELEMENTYPE:*NONE* LEVEL : 2
TOTAL RESTARTS : 1 INITIAL START:05/30/09 00.53.14
RESTART THRESH : 0 OF 3 FIRST RESTART:05/30/09 00.56.54
RESTART TIMEOUT: 300 LAST RESTART :05/30/09 00.56.54

(Diagram: element state flow for BEPEJOB1, from Available through Failed, Restarting and Recovering back to Available)



ARM restart, cross system
Batch job still running after restart, on BP01
$HASP373 BEPEJOB1 STARTED – INIT A – CLASS F – SYS BP01
IEF403I BEPEJOB1 STARTED – TIME=00.53.14
(job registers with ARM)

BP01 fails and is fenced out of the sysplex; on BP02:
IXC812I JOBNAME BEPEJOB1, ELEMENT BEPEJOB1 FAILED DUE TO
        THE FAILURE OF SYSTEM BP01
        THE ELEMENT WAS RESTARTED WITH PERSISTENT JCL
$HASP373 BEPEJOB1 STARTED – INIT A – CLASS F – SYS BP02
IEF403I BEPEJOB1 STARTED – TIME=01.05.23
(job re-registers with ARM)

D XCF,ARMSTATUS,DETAIL
IXC392I 01.08.32 DISPLAY XCF
ARM RESTARTS ARE ENABLED
-------------- ELEMENT STATE SUMMARY -------------- -TOTAL- -MAX-
STARTING AVAILABLE FAILED RESTARTING RECOVERING
       0         1      0          0          0       1    20
RESTART GROUP:DEFAULT PACING : 0 FREE CSA: 0 0
ELEMENT NAME :BEPEJOB1 JOBNAME :BEPEJOB1 STATE :AVAILABLE
CURR SYS :BP02 JOBTYPE :JOB ASID :015E
INIT SYS :BP01 JESGROUP:BPPLEX01 TERMTYPE:ALLTERM
EVENTEXIT:GOSSIP99 ELEMENTYPE:*NONE* LEVEL : 2
TOTAL RESTARTS : 2 INITIAL START:05/30/09 00.53.14
RESTART THRESH : 0 OF 3 FIRST RESTART:05/30/09 00.56.54
RESTART TIMEOUT: 300 LAST RESTART :05/30/09 01.05.23

(Diagram: element state flow for BEPEJOB1, failing on BP01 and passing through Restarting and Recovering to Available on BP02)



ARM considerations
(Diagram: systems across BOX1, BOX2 and BOX3 sharing CTCs and devices; Application “X” – CICS/DB2 and Application “G” – CICS/DL1 each run across several systems)

•  Don’t run with the defaults, create an installation policy
•  Include a default RESTART_GROUP definition with RESTART_ATTEMPTS(0) to exclude restarts for non-explicit elements
•  To use cross-system restart, pre-define all subsystems to all systems
•  If you lose BP01 above, can you support all your CICS/DB2 transactions on the remaining regions, or should you restart the lost regions across the other images?
•  Subsystems like IRLM may need to be cross-system restarted to recover lost resources, but are then no longer required
A proper ARM environment requires a lot of planning!



Summary
Sysplex Failure Management
•  keeps the sysplex up and running
•  is concerned with the ‘mechanics’ of the sysplex
Automatic Restart Management
•  Restarts failed applications (batch and started tasks)
•  Supports job interdependencies

(Diagram: Systems A, B and C with Connections 1 to 6, linked by CTCs and the Coupling Facility holding IXC_SIG01, ISTMNPS and OPERLOG, plus the ARM, SFM, CFRM and Sysplex Couple Data Sets with their policies)



Questions??



And finally…

Now you can...

...get some coffee!!

