Share 2011 - SFM
• Introduction
• Language
• Overview
• Sysplex Failure Manager (SFM)
• Automatic Restart Manager (ARM)
• Summary
• Questions
• Zeebra?
• No, it’s a Zebra!
• Hopefully this will help you understand me
© RSM Education LLP 2011
Acknowledgements
• This material is extracted from a formal education class:
• www.rsm.co.uk/view_course.php?code=MPOR
Runtime Problem
Determination
Overview
A parallel sysplex:
• may consist of up to 32 systems
• and can accept new systems up to that limit dynamically
• but can provide a ‘single image’ for the workloads
• can recover failing work units automatically, anywhere in the sysplex
• can provide continuous availability for application workloads
[Figure: CICS, IMS, DB2, TSO and Batch workloads on systems linked by CTCs and the Coupling Facility]
[Figure: Systems A, B, C … ‘n’ with Connections 1 to ‘n’, linked by CTCs to the Coupling Facility (CPs, Coupling Facility Control Code, dump space, control and non-control storage; structures IXC_SIG01, ISTMNPS, OPERLOG), plus the CFRM and Sysplex Couple Data Sets]
If it CAN go wrong … it WILL!
Redundancy is good for you….But!
[Figure: Systems A, B, C … ‘n’ with a duplicated Coupling Facility and duplicated CFRM and Sysplex Couple Data Sets (Policy 1)]
If you’ve got backup … it doesn’t matter!
Redundancy is good for you….But!
• It’s expensive
[Figure: BP01 and systems BP03 to BP08 on BOX2 and BOX3, connected by CTCs through switches (SW)]
D XCF,PO
IXC355I 11.13.329 DISPLAY XCF
PATHOUT TO SYSNAME: BP03
DEVICE (LOCAL/REMOTE): 8038/7018 8030/7010
IXC409D SIGNAL PATHS BETWEEN BP03 AND BP01 ARE LOST. REPLY
RETRY OR SYSNAME=SYSNAME OF THE SYSTEM TO BE REMOVED
Reply “sysname=BP0n”
IXC417D CONFIRM REQUEST TO REMOVE BP0n FROM THE SYSPLEX.
REPLY SYSNAME=BP0n TO REMOVE BP0n OR C TO CANCEL
IXC458I SIGNAL PATHOUT DEVICE 8030 STOPPED: RETRY LIMIT EXCEEDED
IXC220W XCF IS UNABLE TO CONTINUE: WAIT STATE CODE: 0A2
REASON CODE: 08, LOSS OF CONNECTIVITY DETECTED
or maybe
IXC519I COUPLING FACILITY DAMAGE RECOGNIZED FOR COUPLING
FACILITY (description) NAMED CF01
3) Structure failure
[Figure: signalling structures IXC_STR1 and IXC_STR2 in Coupling Facilities CF01 and CF02]
IXC467I REBUILDING PATH STRUCTURE IXC_STR1. RSN: STRUCTURE
FAILURE
(see “Sysplex Operations” topic, OPS00310, for remainder of messages)
In all cases, signalling continues using alternate facilities
• IXC409D will be issued on all systems
• Only one system allowed to remain active, all others must be
removed via the 0A2 wait state
• Then it will be as if you’ve lost the Coupling Facility!
OPNOTIFY
Removing the system
IXC402D BP01 LAST OPERATIVE AT hh:mm:ss. REPLY DOWN AFTER
SYSTEM RESET OR INTERVAL=SSSSS TO SET A REPROMPT TIME
[Figure: status update timeline for BP01: BP01 checks in, must check in again before INTERVAL expires, and IXC402D is issued when OPNOTIFY expires]
COUPLExx
INTERVAL(25)
OPNOTIFY(28)
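The timeline above can be sketched in a few lines of Python. This is only an illustrative model, not XCF's actual logic: it assumes that INTERVAL and OPNOTIFY are both measured in seconds from the system's last status update and that no SFM policy is active. The class name `CoupleTimers` and the function `status_at` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoupleTimers:
    """Failure-detection values as coded in COUPLExx (seconds)."""
    interval: int = 25   # INTERVAL(25) from the slide
    opnotify: int = 28   # OPNOTIFY(28) from the slide


def status_at(seconds_since_checkin: float, timers: CoupleTimers) -> str:
    """Classify a system by how long ago it last updated its status.

    Assumption: both timers count from the last status update.
    """
    if seconds_since_checkin < timers.interval:
        return "ok"
    if seconds_since_checkin < timers.opnotify:
        return "status update missing"
    return "IXC402D issued"


if __name__ == "__main__":
    timers = CoupleTimers()
    for elapsed in (10, 26, 30):
        print(elapsed, status_at(elapsed, timers))
```

With the slide's values, a system that last checked in 26 seconds ago is flagged as missing its status update, and the operator prompt follows once 28 seconds have passed.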
[Figure: LOGGER and XCF on each system, connected by CTCs and CF01 (IXC_STR1), with the Sysplex Couple Data Set]
SYSTEM NAME(BP02) if this system misses its status update, don’t issue IXC402D, but instead:
ISOLATETIME(nnnnn) isolate automatically after “nnnnn” secs
DEACTTIME(nnnnn) or deactivate its LPAR after “nnnnn” secs
RESETTIME(nnnnn) or system reset its LPAR after “nnnnn” secs
TYPE: SFM
POLNAME: SFMPOL1
STARTED: 06/13/2010 17:06:55
LAST UPDATED: 06/13/2010 10:36:34
SETXCF STOP,POLICY,TYPE=SFM
IXC607I SFM POLICY HAS BEEN STOPPED BY SYSTEM RSMA
TYPE: SFM
POLICY NOT STARTED
SFM Active
Be careful what values you use!
[Figure: XCF and applications (Apps) on each system, with labels 0A2 and 110]
• the Group Exits of any associated XCF applications on the other systems are notified in case any application recovery is needed
• when the CLEANUP interval expires or the group exits finish:
IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR BP01
PRIMARY REASON: SYSTEM STATUS UPDATE MISSING
RESPONSE=RSMA
IXC335I 17.14.23 DISPLAY XCF 498
SYSPLEX RSMPLX
SYSTEM TYPE SERIAL LPAR STATUS TIME SYSTEM STATUS
RSMA 2086 722D 03 06/13/2010 17:14:22 ACTIVE TM=SIMETR
RSMB 2086 722D 04 06/13/2010 17:11:57 BEING REMOVED - RSMA
• the Group Exits of any associated XCF applications on the other systems are notified in case any application recovery is needed
• when the CLEANUP interval expires or the group exits finish:
IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR BP01
PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGER
BECAUSE ITS STATUS UPDATE WAS MISSING -
[Figure: removal timeline: INTERVAL, then OPNOTIFY (IXC402D issued, system reset, reply “down”) or ISOLATETIME / DEACTTIME under SFM, then CLEANUP, then IXC105I issued, “partitioning complete”]
COUPLExx
INTERVAL(25)
OPNOTIFY(28)
CLEANUP(60)
SFM Policy
CONNFAIL(YES)
SYSTEM NAME(*)
WEIGHT(1)
PROMPT
SYSTEM NAME(BP01)
WEIGHT(500)
SYSTEM NAME(BP02)
ISOLATETIME(nnnnn)
DEACTTIME(nnnnn)
RESETTIME(nnnnn)
If all XCF applications clean up before the CLEANUP time expires, the system is partitioned at that point
DISPLAY XCF,COUPLE
IXC357I 20:28:14 DISPLAY XCF
SYSTEM BP01 DATA
INTERVAL OPNOTIFY MAXMSG CLEANUP RETRY CLASSLEN
      25       28    500      60    10      956
Can be changed via SETXCF
[Figure: BOX1 TOD Clock and the Sysplex Timer]
IXC462W XCF IS UNABLE TO ACCESS THE ETR AND HAS PLACED THIS SYSTEM
INTO NON-RESTARTABLE WAIT STATE CODE: 0A2 REASON CODE: 114
• Same as before, it can’t fix the problem, reply ABORT, will get
IXC462W XCF IS UNABLE TO ACCESS THE ETR AND HAS PLACED THIS SYSTEM
INTO NON-RESTARTABLE WAIT STATE CODE: 0A2 REASON CODE: 114
You could bring up, for example, a 4-way sysplex (BOX3), using
Sysplex Timer ‘SIMETRID’
Losing access to (or just losing) a Couple Data Set without an alternate -
If either of the above, you (or all systems) are placed into non-restartable 0A2
• Check any messages involving Couple Data Sets very carefully
• Other Couple Data Sets (below) involve loss of facility, rather than loss of systems
Use alternates!
SFM Couple Data Set (Policy 1): lose this, and the systems continue without an active SFM policy
LOGR Couple Data Set (Policy 1): lose this, and you lose access to System Logger services
Changing COUPLE parameters
RO RSMB,SETXCF COUPLE,INTERVAL=20
RO RSMB,SETXCF COUPLE,OPNOTIFY=23
REBUILD complete
issue IXLEERSP
CFRM Policy ?
Old structure deleted, ENF 35 issued None - normal operations resume
Structure rebuild – why?
[Figure: application instances A to D on z/OSA to z/OSD, each with XCF, XES and DATA, linked by CTCs to Coupling Facilities CF1 (original structure) and CF2 (new?); structure rebuild is controlled by CFRM & SFM, with the CFRM Policy in the CFRM CDS and the SFM Policy]
Rebuild calculation
[Figure: Systems C and D with connections to structures (Str01, Str03, Str04, Str06, Str07) in Coupling Facilities CF01 and CF02]
• System A loses connection to CF01
• A = weight of systems that have lost connectivity
• B = weight of all systems with connections to CF01
• The calculation is: is A / B * 100 ge C?
• If yes – rebuild structure
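The rebuild calculation can be written out as a small Python function. This is a sketch under stated assumptions rather than the XES implementation: the slide does not define C, and here it is assumed to be the threshold percentage coded for the structure (presumably REBUILDPERCENT in the CFRM policy). B is assumed to include the systems that have just lost connectivity. The function name `should_rebuild`, the system names and the example weights are illustrative only.

```python
def should_rebuild(weights: dict[str, int], lost: set[str], threshold_pct: float) -> bool:
    """Return True when A / B * 100 >= C.

    weights       -- SFM WEIGHT of every system that had a connection (B is their sum)
    lost          -- names of the systems that have lost connectivity (A is their sum)
    threshold_pct -- C, assumed here to be the structure's rebuild threshold percentage
    """
    a = sum(weights[name] for name in lost)
    b = sum(weights.values())
    if b == 0:
        raise ValueError("total weight of connected systems must be non-zero")
    return a / b * 100 >= threshold_pct


if __name__ == "__main__":
    # Hypothetical weights in the style of the SFM policy slide.
    weights = {"BP01": 500, "BP02": 1, "BP03": 1}
    print(should_rebuild(weights, {"BP02"}, threshold_pct=1))  # low-weight system lost
    print(should_rebuild(weights, {"BP01"}, threshold_pct=1))  # high-weight system lost
```

The example shows why weights matter: losing a system with WEIGHT(1) out of a total of 502 stays below a 1 percent threshold, while losing the WEIGHT(500) system clears it easily.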
• Programs wishing to use ARM services must also register with ARM
via the IXCARM service macro
• The newer releases of the IBM products like CICS, IMS etc do this
• If you set up the ARM environment for them, these products will be
automatically restarted in the event of the failures described above
Automatic Restart Manager
[Figure: two ARM Couple Data Sets]
Beware the defaults
RESTART_ORDER
(LEVEL(1) DB2, IMS, VTAM always restarted first)
LEVEL(2)
ELEMENT_TYPE(SYSLVL2)
RESTART_GROUP(DEFAULT)
TARGET_SYSTEM(*)
FREE_CSA(0,0)
RESTART_PACING(0)
ELEMENT(*)
RESTART_ATTEMPTS(3,300)
RESTART_TIMEOUT(300)
READY_TIMEOUT(300)
TERMTYPE(ALLTERM)
RESTART_METHOD(BOTH,PERSIST)
• Although you can activate ARM and use the defaults, you should not do so
• The defaults are effectively random and won’t necessarily work for individual applications like CICS, DB2, etc.
• In any policy you create, to nullify the defaults, include
RESTART_GROUP(DEFAULT)
ELEMENT(*)
RESTART_ATTEMPTS(0,300)
• Code explicit group/element statements for the work you actually want covered
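To see why RESTART_ATTEMPTS(0,300) switches restarts off for anything not explicitly named, it helps to model the two values as "at most N restarts within a window of S seconds". The Python sketch below is a simplified model under that assumption, not ARM's actual bookkeeping; the class name `RestartBudget` and its method are hypothetical.

```python
from collections import deque


class RestartBudget:
    """Simplified model of RESTART_ATTEMPTS(max_attempts, interval_secs)."""

    def __init__(self, max_attempts: int, interval_secs: int) -> None:
        self.max_attempts = max_attempts
        self.interval_secs = interval_secs
        self._restart_times: deque[float] = deque()

    def allow_restart(self, now: float) -> bool:
        """Record and permit a restart at time `now` if the budget allows it."""
        # Forget restarts that have aged out of the window.
        while self._restart_times and now - self._restart_times[0] >= self.interval_secs:
            self._restart_times.popleft()
        if len(self._restart_times) >= self.max_attempts:
            return False
        self._restart_times.append(now)
        return True


if __name__ == "__main__":
    default = RestartBudget(3, 300)    # RESTART_ATTEMPTS(3,300)
    nullified = RestartBudget(0, 300)  # RESTART_ATTEMPTS(0,300)
    print([default.allow_restart(t) for t in (0, 10, 20, 30)])
    print(nullified.allow_restart(0))
```

Under this model a budget of zero attempts never permits a restart, which is the effect the DEFAULT group override is after.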
Manipulating the ARM environment
Activating the defaults (bad idea)
[Figure: Primary and Alternate ARM couple data sets, labelled SYS1.ARM CDS01 and SYS1.ARM CDS02]
DISPLAY XCF,POLICY,TYPE=ARM
IXC364I 00.25.03 DISPLAY XCF
TYPE: ARM
POLICY NOT STARTED
SETXCF START,POLICY,TYPE=ARM
IXC805I ARM POLICY HAS BEEN STARTED BY SYSTEM BP01
POLICY DEFAULTS ARE NOW IN EFFECT
[Figure: ARM element state diagram: Available, Fails, Failed, Restarted by ARM, Restarting, REGISTER, Recovering, WAITPRED, READY, Available]
Be careful, ‘D XCF,ARMSTATUS,DETAIL’
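The element state diagram can be read as a small state machine. The Python sketch below is one interpretation of the diagram, using the state names that appear in the ARMSTATUS display (STARTING, AVAILABLE, FAILED, RESTARTING, RECOVERING). The event names and the transition table are assumptions made for illustration, and the optional WAITPRED step is left out.

```python
# Transition table: (current state, event) -> next state.
# Assumed from the slide's diagram; WAITPRED is omitted for simplicity.
TRANSITIONS = {
    ("STARTING", "READY"): "AVAILABLE",
    ("AVAILABLE", "FAIL"): "FAILED",
    ("FAILED", "RESTART"): "RESTARTING",       # restarted by ARM
    ("RESTARTING", "REGISTER"): "RECOVERING",  # job re-registers with ARM
    ("RECOVERING", "READY"): "AVAILABLE",
}


def step(state: str, event: str) -> str:
    """Apply one event, rejecting transitions the table does not define."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None


def run(state: str, events: list[str]) -> str:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    # One full restart cycle brings the element back to AVAILABLE.
    print(run("AVAILABLE", ["FAIL", "RESTART", "REGISTER", "READY"]))
```

A table-driven form keeps the legal transitions in one place, so an out-of-sequence event (for example READY while FAILED) is rejected rather than silently accepted.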
C BEPEJOB1,ARMRESTART
IEE301I BEPEJOB1 CANCEL COMMAND ACCEPTED
IXC812I JOBNAME BEPEJOB1, ELEMENT BEPEJOB1 FAILED
THE ELEMENT WAS RESTARTED WITH PERSISTENT JCL
$HASP373 BEPEJOB1 STARTED – INIT A – CLASS F – SYS BP01
IEF403I BEPEJOB1 STARTED – TIME=00.56.54
(job re-registers with ARM)
D XCF,ARMSTATUS,DETAIL
IXC392I 00.58.32 DISPLAY XCF
ARM RESTARTS ARE ENABLED
--------------- ELEMENT STATE SUMMARY --------- TOTAL- -MAX-
STARTING AVAILABLE FAILED RESTARTING RECOVERING
0 1 0 0 0 1 20
RESTART GROUP:DEFAULT PACING : 0 FREE CSA: 0 0
ELEMENT NAME:BEPEJOB1 JOBNAME :BEPEJOB1 STATE :AVAILABLE
CURR SYS :BP01 JOBTYPE :JOB ASID :007F
INIT SYS :BP01 JESGROUP:BPPLEX01 TERMTYPE:ALLTERM
EVENTEXIT:GOSSIP99 ELEMENTYPE:*NONE* LEVEL : 2
TOTAL RESTARTS : 1 INITIAL START:05/30/09 00.53.14
RESTART THRESH : 0 OF 3 FIRST RESTART:05/30/09 00.56.54
RESTART TIMEOUT: 300 LAST RESTART :05/30/09 00.56.54
BP01 fails, is fenced out of sysplex, on BP02
IXC812I JOBNAME BEPEJOB1, ELEMENT BEPEJOB1 FAILED DUE TO
THE FAILURE OF SYSTEM BP01
THE ELEMENT WAS RESTARTED WITH PERSISTENT JCL
$HASP373 BEPEJOB1 STARTED – INIT A – CLASS F – SYS BP02
IEF403I BEPEJOB1 STARTED – TIME=01.05.23
D XCF,ARMSTATUS,DETAIL
IXC392I 01.08.32 DISPLAY XCF
ARM RESTARTS ARE ENABLED
--------------- ELEMENT STATE SUMMARY --------- TOTAL- -MAX-
STARTING AVAILABLE FAILED RESTARTING RECOVERING
0 1 0 0 0 1 20
RESTART GROUP:DEFAULT PACING : 0 FREE CSA: 0 0
ELEMENT NAME:BEPEJOB1 JOBNAME :BEPEJOB1 STATE :AVAILABLE
CURR SYS :BP02 JOBTYPE :JOB ASID :015E
INIT SYS :BP01 JESGROUP:BPPLEX01 TERMTYPE:ALLTERM
EVENTEXIT:GOSSIP99 ELEMENTYPE:*NONE* LEVEL : 2
TOTAL RESTARTS : 2 INITIAL START:05/30/09 00.53.14
RESTART THRESH : 0 OF 3 FIRST RESTART:05/30/09 00.56.54
RESTART TIMEOUT: 300 LAST RESTART :05/30/09 01.05.23
ARM considerations
• Don’t run with the defaults, create an installation policy
• Include a default RESTART_GROUP definition with RESTART_ATTEMPTS(0) to exclude restarts for non-explicit elements
• To use cross-system restart, pre-define all subsystems to all systems
• If you lose BP01 above, can you support all your CICS/DB2 transactions on the remaining regions, or should you restart
the lost regions across the other images?
• Subsystems like IRLM may need to be cross-system restarted to recover lost resources, but are then no longer required
A proper ARM environment requires a lot of planning!
[Figure: the Coupling Facility (CPs, Coupling Facility Control Code, dump space, control and non-control storage; structures IXC_SIG01, ISTMNPS, OPERLOG) and the ARM, SFM, CFRM and Sysplex Couple Data Sets, each holding Policy 1]
Automatic Restart Management
• Restarts failed applications (batch and started tasks)
• Supports job interdependencies