Everything You Ever Wanted To Know About CHM
• Installation
– Of the Tool
– Of the GUI
• CHM in Action
• Administration
• Possible outcomes:
– Oracle Support finds the answer in one of the logs
– Oracle Support needs more node specific information to answer the question
• For the latter: This is why you need a tool such as Cluster Health Monitor (CHM)
Why should you use CHM?
Because you want to prevent another incident
• Based on the previous scenario:
– It is determined that the reboot was caused by
an abnormally high CPU load in conjunction with extreme IO waits.
– Your manager asks you:
What caused the high CPU load? What can we do to prevent this in future?
• For the latter: CHM provides a historical view of the collected data for analysis
– >crfgui -d "00:05:00" -m 192.168.2.8
– Cluster Health Analyzer V1.10 Look for Loggerd via node 192.168.2.8
...reading 300 sec from the past
Connected to Loggerd on rac1
Note: Node rac1 is now up
Cluster 'MyCluster', 2 nodes. Ext time=2010-08-18 23:22:30
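The -d value in the example above reads as HH:MM:SS, since "00:05:00" produces "reading 300 sec from the past". The following is a minimal sketch of building such a string from a time span. The helper name to_crfgui_duration is hypothetical, and the HH:MM:SS interpretation is an assumption inferred from this single example, not taken from the tool documentation.

```python
from datetime import timedelta


def to_crfgui_duration(delta: timedelta) -> str:
    """Format a time span as HH:MM:SS (assumed format of the crfgui -d value)."""
    total = int(delta.total_seconds())
    if total < 0:
        raise ValueError("duration must not be negative")
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# 300 seconds corresponds to the "00:05:00" used in the example above.
print(to_crfgui_duration(timedelta(seconds=300)))
```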
Where can I get CHM?
Free Download
• Direct download link:
– http://www.oracle.com/technetwork/database/clustering/downloads/ipd-download-homepage-087212.html
Installation
How to Install CHM?
Use the documentation
• Overview of Cluster Health Monitor (CHM)
(http://www.oracle.com/technetwork/database/enterprise-edition/ipd-overview-130032.pdf)
[Diagram: the CHM daemons osysmond, ologgerd, and oproxyd]
How to install the GUI?
Use the documentation + tips & tricks
• The GUI needs to be installed separately.
• It is recommended to install the GUI on a separate (client) machine
– If necessary, the GUI can also be installed on one or more nodes of the cluster
• If your client is a Windows client, download the Windows version of the tool
• Unzip and install the GUI using:
Administration
Administration part 1
The main administration tool for CHM: oclumon
----------------------------------------
Node: rac1 Clock: '08-19-10 03.53.53 UTC' SerialNo:63193
----------------------------------------
SYSTEM:
#cpus: 2 cpu: 4.5 cpuq: 1 physmemfree: 13896 mcache: 959952 swapfree: 1900208 ior: 0 iow: 297 ios: 17 netr: 57.9 netw: 43.56 procs: 187 rtprocs: 11 #fds: 2658 #sysfdlimit:
6815744 #disks: 7 #nics: 4 nicErrors: 0
TOP CONSUMERS:
topcpu: 'osysmond(13446) 0.66' topprivmem: 'ologgerd(13532) 102260' topshm: 'ologgerd(13532) 46680' topfd: 'crsd.bin(10754) 102' topthread: 'crsd.bin(10754) 58'
PROCESSES:
name: 'osysmond' pid: 13446 #procfdlimit: 1024 cpuusage: 0.66 memusage: 78912 shm: 41196 #fd: 22 #threads: 9 priority: 139
name: 'orarootagent.bi' pid: 10890 #procfdlimit: 65536 cpuusage: 0.66 memusage: 6420 shm: 10032 #fd: 7 #threads: 34 priority: 19
name: 'ologgerd' pid: 13532 #procfdlimit: 1024 cpuusage: 0.0 memusage: 102260 shm: 46680 #fd: 19 #threads: 9 priority: 139
…
DEVICES:
sdf ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
sdf1 ior: 0.0 iow: 0.0 ios: 0 qlen: - wait: - type: SYS
sde ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
sde1 ior: 0.0 iow: 0.0 ios: 0 qlen: - wait: - type: SYS
sdd ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
…
NICS:
lo netrr: 21.3 netwr: 21.3 neteff: 42.7 nicerrors: 0 pktsin: 7 pktsout: 7 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 7 innonunicast: 0 type:
PUBLIC
eth0 netrr: 25.65 netwr: 15.94 neteff: 41.60 nicerrors: 0 pktsin: 13 pktsout: 13 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 13 innonunicast:
0 type: PRIVATE latency: <1
eth1 netrr: 10.27 netwr: 6.58 neteff: 16.85 nicerrors: 0 pktsin: 30 pktsout: 22 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 30 innonunicast: 0
type: PRIVATE latency: <1
eth2 netrr: 0.12 netwr: 0.0 neteff: 0.12 nicerrors: 0 pktsin: 0 pktsout: 0 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 0 innonunicast: 0
type: PUBLIC latency: <1
PROTOCOL ERRORS:
IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 TCPFailedConn: 50 TCPEstRst: 13 TCPRetraSeg: 69 UDPUnkPort: 41 UDPRcvErr: 0
End of data
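The SYSTEM section of the node view is a flat run of "name: value" pairs, which makes it easy to post-process. The sketch below turns that line into a dictionary of numbers. It assumes the wrapped output lines have been joined into one string first; the function name parse_metrics and the regular expression are illustrative and not part of CHM.

```python
import re

# Sample copied from the SYSTEM section above, with the wrapped lines joined.
SYSTEM_LINE = (
    "#cpus: 2 cpu: 4.5 cpuq: 1 physmemfree: 13896 mcache: 959952 "
    "swapfree: 1900208 ior: 0 iow: 297 ios: 17 netr: 57.9 netw: 43.56 "
    "procs: 187 rtprocs: 11 #fds: 2658 #sysfdlimit: 6815744 "
    "#disks: 7 #nics: 4 nicErrors: 0"
)

# A metric name (optionally prefixed with '#'), a colon, then one token.
PAIR = re.compile(r"(#?\w+):\s*(\S+)")


def parse_metrics(line: str) -> dict[str, float]:
    """Return the numeric name/value pairs found in one node view line."""
    metrics = {}
    for key, value in PAIR.findall(line):
        try:
            metrics[key] = float(value)
        except ValueError:
            continue  # skip non-numeric values such as '-'
    return metrics


metrics = parse_metrics(SYSTEM_LINE)
print(metrics["cpu"], metrics["iow"], metrics["#sysfdlimit"])
```

Collecting these dictionaries per snapshot gives a simple time series, for example to find the moment the cpu or iow values started to climb.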
Administration part 4
Time is crucial – “the clock”
> oclumon dumpnodeview -n rac1 -s "2010-08-19 02.00.01" -e "2010-08-19 02.00.03"
----------------------------------------
Node: rac1 Clock: '08-19-10 02.00.01 UTC' SerialNo:58695
----------------------------------------
SYSTEM:
#cpus: 2 cpu: 4.20 cpuq: 4 physmemfree: 17728 mcache: 953248 swapfree: 1900208 ior: 0 iow: 103
ios: 7 netr: 46.36 netw: 39.29 procs: 187 rtprocs: 11 #fds: 2658 #sysfdlimit: 6815744
#disks: 7 #nics: 4 nicErrors: 0
TOP CONSUMERS:
topcpu: 'osysmond(13446) 1.31' topprivmem: 'ologgerd(13532) 102260' topshm: 'ologgerd(13532)
46680' topfd: 'crsd.bin(10754) 102' topthread: 'crsd.bin(10754) 58'
End of data
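Each snapshot starts with a header line carrying the node name, the clock, and a serial number. Since the -s/-e window "2010-08-19 ..." returns a snapshot stamped '08-19-10 ...', the clock appears to use a MM-DD-YY HH.MM.SS layout. The sketch below parses the header under that assumption; parse_header is a hypothetical helper, and only the UTC zone label seen in the sample is handled.

```python
import re
from datetime import datetime, timezone

HEADER = re.compile(r"Node:\s*(\S+)\s+Clock:\s*'([^']+)'\s+SerialNo:\s*(\d+)")


def parse_header(line: str) -> tuple[str, datetime, int]:
    """Split a node view header into (node, clock as aware datetime, serial)."""
    match = HEADER.search(line)
    if match is None:
        raise ValueError(f"not a node view header: {line!r}")
    node, clock, serial = match.groups()
    stamp, zone = clock.rsplit(" ", 1)
    if zone != "UTC":
        raise ValueError(f"unexpected time zone label: {zone}")
    # Assumed layout: month-day-two-digit-year, then dot-separated time.
    when = datetime.strptime(stamp, "%m-%d-%y %H.%M.%S")
    return node, when.replace(tzinfo=timezone.utc), int(serial)


node, when, serial = parse_header(
    "Node: rac1 Clock: '08-19-10 02.00.01 UTC' SerialNo:58695"
)
print(node, when.isoformat(), serial)
```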
Alternative:
oclumon dumpnodeview -allnodes -s "2010-08-19 02.00.01" -e "2010-08-19 02.00.03"
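The time window must be written as "YYYY-MM-DD HH.MM.SS", with dots rather than colons in the time part, which is easy to get wrong by hand. The following sketch builds the argument list for the two command forms shown above from Python datetime values. Only the options that appear in these slides (-n, -allnodes, -s, -e) are used; the helper name dumpnodeview_cmd is hypothetical, and actually running the command (for example via subprocess) is left out.

```python
import shlex
from datetime import datetime
from typing import Optional

TIME_FORMAT = "%Y-%m-%d %H.%M.%S"  # dot-separated time, as in the examples above


def dumpnodeview_cmd(
    start: datetime, end: datetime, node: Optional[str] = None
) -> list[str]:
    """Build the oclumon dumpnodeview argument list for one node or all nodes."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    target = ["-n", node] if node else ["-allnodes"]
    return [
        "oclumon", "dumpnodeview", *target,
        "-s", start.strftime(TIME_FORMAT),
        "-e", end.strftime(TIME_FORMAT),
    ]


cmd = dumpnodeview_cmd(
    datetime(2010, 8, 19, 2, 0, 1), datetime(2010, 8, 19, 2, 0, 3), node="rac1"
)
print(shlex.join(cmd))  # shell-quoted form for copy and paste
```

Returning a list instead of a single string avoids quoting problems with the space inside the timestamps when the command is later handed to a process launcher.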
• The sampling rate of the tool depends on the currently active processes
and the devices on the system. On an ideal system with up to a total of 1000
active processes and disks, the sampling interval is approximately 1 second.
• The GUI refreshes every 1 second by default, but a different refresh
interval can be specified using the -r parameter followed by the time in seconds.
– Example: crfgui -r 5 -m 192.168.2.8
Frequently Asked Questions (FAQ)
The most common FAQs…
…Are answered in the tool readme
• Direct download link:
– http://www.oracle.com/technetwork/database/clustering/downloads/ipd-download-homepage-087212.html
• NO
• CVU is a separate tool with
a completely different purpose.
• YES
• YES
More Information
Future Development of CHM
What you will find in Oracle Grid Infrastructure 11.2.0.2
• http://www.oracle.com/goto/rac
– Download link: Cluster Health Monitor - Download
• http://www.oracle.com/goto/clusterware
– Technical White Paper
Oracle Clusterware 11g Release 2 Technical Overview
• For OS Watcher
– My Oracle Support doc ID 301137.1 - OS Watcher User Guide
OTN Migration
A migration with some impact
• Note that the Oracle Technology Network (also known as OTN) has been migrated
– URLs containing http://otn.oracle.com/ have moved
– Individual items (e.g. papers) are migrated to a new Content Management System
– Direct links using the old URL to those items may therefore not work anymore
• Some links to main pages should be redirected to some new pages – e.g.:
– http://otn.oracle.com/rac (might go away over time)
http://www.oracle.com/technetwork/database/clustering/overview/index.html