RAC Frequently Asked Questions
General RAC
RAC Assistance
High Availability
• Why am I seeing the following warnings in my listener.log for my RAC 10g environment?
WARNING: Subscription for node down event still pending
• Will FAN work with SQLPlus?
• Do I need to install the ONS on all my mid-tier servers in order to enable JDBC Fast
Connection Failover (FCF)?
• Will FAN/FCF work with the default database service?
• Can I use the 10.2 JDBC driver with 10.1 database for FCF?
• What clients provide integration with FAN through FCF?
• Can I use TAF and FAN/FCF?
• How do the datasource properties initialLimit, minLimit, and maxLimit affect Fast
Connection Failover processing with JDBC?
• Will FAN/OCI work with Instant Client?
• What type of callbacks are supported with OCI when using FAN/FCF?
• Does FCF for OCI react to FAN HA UP events?
• Can I use FAN/OCI with Pro*C?
• Do I have to link my OCI application with a thread library? Why?
Scalability
• I am seeing the wait events 'ges remote message', 'gcs remote message', and/or 'gcs for
action'. What should I do about these?
• What are the changes in memory requirements from moving from single instance to
RAC?
• Will adding a new instance to my Oracle RAC database (new node to the cluster) allow
me to scale the workload?
• What do I do if I see GC CR BLOCK LOST in my top 5 Timed Events in my AWR
Report?
• How do I change my Veritas SF RAC installation to use UDP instead of LLT?
• A customer is currently using RAC in a 2 node environment. How should one review the
ability to scale out to 4, 6, 8 or even more nodes? What should the requirements of a
scale-out test be?
• What is the Load Balancing Advisory?
• How do I enable the load balancing advisory?
• What are my options for setting the Load Balancing Advisory GOAL on a Service?
• How can I validate the scalability of my shared storage? (Tightly related to RAC /
Application scalability)
• How many nodes are supported in a RAC Database?
• How do I measure the bandwidth utilization of my NIC or my interconnect?
• Does Database blocksize or tablespace blocksize affect how the data is passed across
the interconnect?
• What is Runtime Connection Load Balancing?
Manageability
• I found in 10.2 that the EM "Convert to Cluster Database" wizard would always fall over
on the last step where it runs emca and needs to log into the new cluster database as
dbsnmp to create the cluster database targets etc. I changed the password for the
dbsnmp account to be dbsnmp (same as username) and it worked OK. Is this a known
issue?
• What storage option should I use for RAC 10g on Linux? ASM / OCFS / Raw Devices /
Block Devices / Ext3 ?
• How do I stop the GSD?
• What is the purpose of the gsd service in Oracle 9i RAC?
• How should I deal with space management? Do I need to set free lists and free list
groups?
• I was installing RAC and my Oracle files did not get copied to the remote node(s). What
went wrong?
• If I am using Vendor Clusterware such as Veritas, IBM, Sun or HP, do I still need Oracle
Clusterware to run Oracle RAC 10g?
• Srvctl cannot start an instance and I get the errors PRKP-1001 and CRS-0215; however,
sqlplus can start it on both nodes. What is the problem?
• When I look at ALL_SERVICES view in my database I see services I did not create, what
are they for?
• Does RAC work with NTP (Network Time Protocol)?
• I have 2 clusters named "crs" (the default), how do I get Grid Control to recognize them
as targets?
• If using plsql native code, the plsql_native_library_dir needs to be defined. In a RAC
environment, must the directory be in the shared storage?
• How do I determine whether or not a one-off patch is "rolling upgradeable"?
• What is the Cluster Verification Utility (cluvfy)?
• What versions of the database can I use the cluster verification utility (cluvfy) with?
• What are the implications of using srvctl disable for an instance in my RAC cluster? I
want to have it available to start if I need it, but at this time do not want to run this extra
instance for this database.
Platform Specific
• Is HACMP needed for RAC on AIX 5.2 using GPFS file system?
• Do I need HACMP/GPFS to store my OCR/Voting file on a shared device?
• Is VIO supported with RAC on IBM AIX?
• Can I run Oracle RAC 10g on my IBM Mainframe Sysplex environment (z/OS)?
• Can I use Oracle Clusterware for failover of the SAP Enqueue and VIP services when
running SAP in a RAC environment?
• Are Oracle Applications certified with RAC?
Diagnosability
• How do I gather all relevant Oracle and OS log/trace files in a RAC cluster to provide to
Support?
• What are the cdmp directories in the background_dump_dest used for?
Oracle Clusterware
Answers
I have changed my spfile with alter system set <parameter_name>
=.... scope=spfile. The spfile is on ASM storage and the database
will not start.
How to recover:
In $ORACLE_HOME/dbs:
. oraenv <instance_name>
sqlplus "/ as sysdba"
startup nomount
create pfile='<pfile_name>' from spfile
/
shutdown immediate
quit
Now edit the newly created pfile to change the parameter to something sensible, then recreate the
spfile from the edited pfile and restart the instance.
You can check which network interfaces are registered for the cluster with oifcfg:
$ oifcfg getif
eth0 138.2.236.0 global public
eth2 138.2.238.0 global cluster_interconnect
Can we designate the place of archive logs on both ASM disk and
regular file system, when we use SE RAC?
Yes. Customers may want to create a standby database for their SE RAC database, so placing
the archive logs additionally outside ASM is OK.
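For illustration, a minimal sketch of such a dual destination (the disk group name and local path are assumptions, not from this FAQ):
SQL> alter system set log_archive_dest_1='LOCATION=+FRA' scope=both sid='*';
SQL> alter system set log_archive_dest_2='LOCATION=/u02/arch' scope=both sid='*';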
Why does netca always create a listener that listens on the public
IP and not only on the VIP?
This is for backward compatibility with existing clients: consider a pre-10g to 10g server upgrade. If
the upgraded listener listened only on the VIP, then clients that have not been upgraded would no
longer be able to reach this listener.
Look for:
* Indexes with right-growing characteristics
--> Use reverse key indexes
--> Eliminate indexes which are not needed
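For example, a right-growing index on a sequence-populated key can be made a reverse key index (table and column names here are hypothetical):
SQL> create index orders_id_ix on orders (order_id) reverse;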
Datafiles will need to be moved to either a clustered file system (CFS) or raw devices so that all
nodes can access them. Also, the MAXINSTANCES parameter in the control file must be greater
than or equal to the number of instances you will start in the cluster.
For more detailed information, please see Migrating from single-instance to RAC in the Oracle
Documentation
With Oracle Database 10g Release 2, the $ORACLE_HOME/bin/rconfig tool can be used to convert a
single-instance database to RAC. This tool takes an XML input file and converts the single-instance
database whose information is provided in the XML. You can run this tool in "verify only"
mode prior to performing the actual conversion. This is documented in the RAC admin book and a
sample XML can be found at $ORACLE_HOME/assistants/rconfig/sampleXMLs/ConvertToRAC.xml.
This tool only supports databases using a clustered file system or ASM; you cannot use it with
raw devices. Grid Control 10g Release 2 provides an easy-to-use wizard to perform this function.
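A rough sketch of the rconfig workflow (paths are illustrative; the verify-only switch is an attribute inside the XML):
$ cp $ORACLE_HOME/assistants/rconfig/sampleXMLs/ConvertToRAC.xml /tmp/convert.xml
$ vi /tmp/convert.xml      # edit the Convert element, setting its verify attribute to "ONLY" for a test run
$ $ORACLE_HOME/bin/rconfig /tmp/convert.xml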
Note: Please be aware that you may hit bug 4456047 (shutdown immediate hangs) as you
convert the database. The bug is updated with a workaround, and the workaround should be
release noted as well.
What are the dependencies between OCFS and ASM in Oracle
Database 10g ?
In an Oracle RAC 10g environment, there is no dependency between Automatic Storage
Management (ASM) and Oracle Cluster File System (OCFS).
OCFS is not required if you are using Automatic Storage Management (ASM) for database files.
You can use OCFS on Windows (Version 2 on Linux) for files that ASM does not handle -
binaries (shared Oracle home), trace files, etc. Alternatively, you could place these files on local
file systems, even though that is not as convenient given the multiple locations.
If you do not want to use ASM for your database files, you can still use OCFS for database files in
Oracle Database 10g.
Please refer to ASM and OCFS Positioning
With Oracle Database 10g, RAC is an option of EE and available as part of SE. Oracle provides
Oracle Clusterware on its own CD included in the database CD pack.
Please check the certification matrix (Note 184875.1) or with the appropriate platform vendor for
more information.
How can a NAS storage vendor certify their storage solution for
RAC ?
As of January 2007 the OSCP has been discontinued. Please refer to the RAC Technologies Matrix on
OTN for details (storage being part of it).
Previously, NAS vendors would obtain an OCE test kit, complete the required RAC tests, and submit the
request for an OCE kit to ocesup_ie@oracle.com. The list of certified NAS vendors/solutions was posted
on OTN under the OSCP program.
For example on Solaris, your 9i RAC will be using Sun Cluster. You can install Oracle
Clusterware and RAC 10g in the same cluster that is running Sun Cluster and 9i RAC.
Should the SCSI-3 reservation bit be set for our Oracle Clusterware
only installation?
If you are using only Oracle Clusterware (no Veritas CM), then you do not need to have SCSI-3
PGR enabled, since Oracle Clusterware does not require it for IO fencing. If the reservation is set,
you will get inconsistent results, so ask your storage vendor to disable the reservation.
Veritas RAC requires that the storage array support SCSI-3 PGR, since this is how Veritas
handles IO fencing. This SCSI-3 PGR is set at the array level; for example EMC hypervolume
level.
$HOME
$HOME/.rhosts
$HOME/.shosts
$HOME/.ssh
$HOME/.ssh/authorized_keys
$HOME/.ssh/authorized_keys2 # OpenSSH specific for the ssh2 protocol.
SSH (from OUI) will also fail if you have not connected to each machine in your cluster as per the
note in the installation guide:
The first time you use SSH to connect to a node from a particular system, you may see a
message similar to the following:
The authenticity of host 'node1 (140.87.152.153)' can't be established. RSA key fingerprint is
7z:ez:e7:f6:f4:f2:4f:8f:9z:79:85:62:20:90:92:z9.
Are you sure you want to continue connecting (yes/no)?
Enter yes at the prompt to continue. You should not see this message again when you connect
from this system to that node. Answering yes to this question causes an entry to be added to a
"known_hosts" file in the .ssh directory, which is why subsequent connection requests do not ask
again.
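As an illustration only (node names are made up), a typical way to set up user equivalence and prime known_hosts before running OUI is:
$ ssh-keygen -t rsa                    # run on every node as the oracle user, empty passphrase
$ cat ~/.ssh/id_rsa.pub | ssh node2 'cat >> ~/.ssh/authorized_keys'   # repeat for each node pair
$ ssh node2 hostname                   # answer yes once so the host key lands in known_hosts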
This is known to work on Solaris and Linux but may work on other platforms as well.
Probably a better alternative (than the generic documentation, bug 5929611 filed) for a remove
node is Note 269320.1
Can we output the backupset onto regular file system directly (not
onto flash recovery area) using RMAN command, when we use SE
RAC?
Yes. Customers might want to back up their database to offline storage, so this is also supported.
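For example, a backup written straight to a regular file system rather than the flash recovery area might look like this (the destination path is an assumption):
RMAN> backup database format '/u03/backup/%U';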
More information: Metalink Note 296874.1 and Auto Port Aggregation (APA) Support Guide
More information: Metalink Note 283107.1 - IPMP in general. When IPMP is used for the
interconnect: Metalink Note 368464.1
Related RAC FAQ entries: In Solaris 10, do we need Sun Clusterware to provide redundancy for
the interconnect and multiple switches?
• Bonding
• Teaming
On Windows teaming solutions to ensure NIC availability are usually part of the network card
driver.
Thus, they depend on the network card used. Please, contact the respective hardware vendor for
more information.
$ chrt -p 31193
pid 31193's current scheduling policy: SCHED_OTHER
pid 31193's current scheduling priority: 0
Are there any issues for the interconnect when sharing the same
switch as the public network by using VLAN to separate the
network?
RAC and Clusterware deployment best practices recommend that the interconnect be deployed
on a stand-alone, physically separate, dedicated switch. Many customers have consolidated
these stand-alone switches into larger managed switches. A consequence of this consolidation is
a merging of IP networks on a single shared switch, segmented by VLANs. There are caveats
associated with such deployments. RAC cache fusion exercises the IP network more rigorously
than non-RAC Oracle databases. The latency and bandwidth requirements as well as availability
requirements of the RAC/Clusterware interconnect IP network are more in-line with high
performance computing. Deploying the RAC/Clusterware interconnect on a shared switch,
segmented VLAN may expose the interconnect links to congestion and instability in the larger IP
network topology. If deploying the interconnect on a VLAN, there should be a 1:1 mapping of
VLAN to non-routable subnet, and the VLAN should not span multiple VLANs (VLAN tagging) or multiple
switches. Deployment concerns in this environment include Spanning Tree loops when the larger
IP network topology changes, asymmetric routing that may cause packet flooding, and lack of fine-
grained monitoring of the VLAN/port.
Can I run RAC 10g with RAC 11g?
Yes. Oracle Clusterware should always run at the highest level. With Oracle Clusterware
11g, you can run both RAC 10g and RAC 11g databases. If you are using ASM for storage, you
can use either Oracle Database 10g ASM or Oracle Database 11g ASM; however, to get the 11g
features, you must be running Oracle Database 11g ASM. It is recommended to use Oracle
Database 11g ASM.
Yes, you can run 9iRAC in the cluster as well. 9i RAC requires the clusterware that is certified
with 9i RAC to be running in addition to Oracle Clusterware 11g.
Are block devices supported for OCR, Voting Disks, ASM devices?
Block devices are only supported on Linux. For Unix platforms, direct I/O semantics are not
applicable (or rather not implemented) for block devices on those platforms.
Note: On Linux, raw devices are being deprecated, so you should move to using block devices.
Note that the Oracle Database 10g OUI does not support block devices; however, Oracle Clusterware
and ASM do.
We are using Transparent Data Encryption (TDE).
We create a wallet on node 1 and copy to nodes 2 & 3. Open the
wallet and we are able to select encrypted data on all three nodes.
Now, we want to REKEY the MASTER KEY. What do we have to
do?
After a re-key on node one, issue 'alter system set wallet close' on all other nodes, copy the wallet with
the new master key to all other nodes, then issue 'alter system set wallet open identified by "password"' on
all other nodes to load the (obfuscated) master key into each node's SGA.
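A minimal sketch of that sequence, assuming a three-node cluster (the wallet password shown is illustrative):
-- node 1: re-key the master key
SQL> alter system set encryption key identified by "wallet_pwd";
-- nodes 2 and 3: close the wallet, copy the updated wallet files from node 1, then re-open it
SQL> alter system set wallet close;
SQL> alter system set wallet open identified by "wallet_pwd";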
Why does the NOAC attribute need to be set on NFS mounted RAC
Binaries?
The noac attribute is required because the installer determines sharedness by creating a file and
checking for that file's existence on the remote node. If the noac attribute is not enabled, this
test will incorrectly fail, which will confuse the installer and OPatch. Some other minor issues, such as
an spfile kept in the default $ORACLE_HOME/dbs, will also be affected.
Obviously if you only have one copy of the OCR and it is lost or corrupt then you must restore a
recent backup, see ocrconfig utility for details, specifically -showbackup and -restore flags. Until a
valid backup is restored the Oracle Clusterware will not startup due to the corrupt/missing OCR
file.
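A sketch of that recovery, assuming the default automatic backup location (the path shown is illustrative):
# as root, with Oracle Clusterware down on all nodes
ocrconfig -showbackup
ocrconfig -restore /u01/app/oracle/product/10.2.0/crs/cdata/crs/backup00.ocr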
The interesting discussion is what happens if you have the OCR mirrored and one of the copies
gets corrupt? You would expect that everything will continue to work seamlessly. Well... almost.
The real answer depends on when the corruption takes place.
If the corruption happens while the Oracle Clusterware stack is up and running, then the
corruption will be tolerated and the Oracle Clusterware will continue to function without
interruption, despite the corrupt copy. The DBA is advised to repair the hardware/software problem
that prevents OCR from accessing the device as soon as possible; alternatively, the DBA can replace
the failed device with another healthy device using the ocrconfig utility with the -replace flag.
If however the corruption happens while the Oracle Clusterware stack is down, then it will not be
possible to start it up until the failed device comes online again or some administrative action is
taken using the ocrconfig utility with the -overwrite flag. When the Clusterware attempts to start you will
see messages similar to:
This is because the software cannot determine which OCR copy is the valid one. In the above
example one of the OCR mirrors was lost while the Oracle Clusterware was down. There are 3
ways to fix this failure:
a) Fix whatever problem (hardware/software) prevents OCR from accessing the device.
b) Issue "ocrconfig -overwrite" on any one of the nodes in the cluster. This command
overrides the vote check built into OCR when it starts up. Basically, if the OCR device is configured
with a mirror, OCR assigns each device one vote. The rule is that more than 50% of the total
votes (a quorum) is required in order to safely make sure the available devices contain the latest data.
In 2-way mirroring, the total vote count is 2, so it requires 2 votes to achieve the quorum. In the example
above there are not enough votes to start if only one device with one vote is available. (In the earlier
example, while OCR was running when the device went down, OCR assigned 2 votes to the surviving
device, and that is why this surviving device, now with two votes, can start after the cluster is
down.) See the warning below.
EXTREME CAUTION should be exercised if choosing option b or c above, since data loss can
occur if the wrong file is manipulated; please contact Oracle Support for assistance before
proceeding.
Bug 5055145 was the basis for this FAQ, also thanks to Ken Lee for his valuable feedback.
For more information on troubleshooting this error, see the following Metalink note:
1. The new node re-arps the world indicating a new MAC address for this IP address. For
directly connected clients, this usually causes them to see errors on their connections to
the old address;
2. Subsequent packets sent to the VIP go to the new node, which will send error RST
packets back to the clients. This results in the clients getting errors immediately.
In the case of existing SQL connections, errors will typically be in the form of ORA-3113 errors,
while a new connection using an address list will select the next entry in the list. Without using
VIPs, clients connected to a node that died will often wait for a TCP/IP timeout period before
getting an error. This can be as long as 10 minutes or more. As a result, you don't really have a
good HA solution without using VIPs.
What are my options for load balancing with RAC? Why do I get an
uneven number of connections on my instances?
All the types of load balancing available currently (9i-10g) occur at connect time.
This means that it is very important how one balances connections and what these connections
do on a long term basis.
Since establishing connections can be very expensive for your application, it is good
programming practice to connect once and stay connected. This means one needs to be careful
as to what option one uses. Oracle Net Services provides load balancing or you can use external
methods such as hardware based or clusterware solutions.
The following options exist prior to Oracle RAC 10g Release 2 (for 10g Release 2 see Load
Balancing Advisory):
Random
Either client side load balancing or hardware based methods will randomize the connections to
the instances.
On the negative side this method is unaware of load on the connections or even if they are up
meaning they might cause waits on TCP/IP timeouts.
Load Based
Server side load balancing (by the listener) redirects connections by default depending on the
RunQ length of each of the instances. This is great for short-lived connections but terrible for
persistent connections or login storms. Do not use this method for connections from connection
pools or application servers.
Session Based
Server side load balancing can also be used to balance the number of connections to each
instance. Session count balancing is the method used when you set the listener parameter
prefer_least_loaded_node_<listener_name>=off. Note that the listener name is the actual name of the
listener, which is different on each node in your cluster and by default is listener_<nodename>.
Session based load balancing takes into account the number of sessions connected to each node
and then distributes the connections to balance the number of sessions across the different
nodes.
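For instance, on a node whose listener is named LISTENER_NODE1 (a hypothetical name), session-count balancing would be selected by adding the following to that node's listener.ora:
PREFER_LEAST_LOADED_NODE_LISTENER_NODE1 = OFF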
Can our 10g VIP fail over from NIC to NIC as well as from node to
node ?
Yes, the 10g VIP implementation is capable of failing over within a node from NIC to NIC and
back if the failed NIC comes back online again, and it also fails over between nodes. The NIC-to-NIC
failover is fully redundant if redundant switches are installed.
What is CLB_GOAL and how should I set it?
CLB_GOAL is the connection load balancing goal for a service. There are 2 options,
CLB_GOAL_SHORT and CLB_GOAL_LONG (default).
Long is for applications that have long-lived connections. This is typical for connection pools and
SQL*Forms sessions. Long is the default connection load balancing goal.
Short is for applications that have short-lived connections.
The GOAL for a service can be set with EM or DBMS_SERVICE.
Note: You must still configure load balancing with Oracle Net Services
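For example, CLB_GOAL can be set with DBMS_SERVICE as follows (the service name is hypothetical):
SQL> exec DBMS_SERVICE.MODIFY_SERVICE(service_name => 'oltp_svc', clb_goal => DBMS_SERVICE.CLB_GOAL_LONG);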
How do I configure FCF with BPEL so I can use RAC 10g in the
backend?
Note:372456.1 describes the procedure to set up BPEL with an Oracle RAC 10g Release 1
database.
If you are using SSL, ensure the SSL enable attribute of ONS in opmn.xml file has same value,
either true or false, for all OPMN servers in the Farm. To troubleshoot OPMN at the application
server level, look at appendix A in Oracle Process Manager and Notification Server
Administrator's Guide.
If you are deploying Oracle RAC and require high availability, you must make the entire
infrastructure of the application highly available. This requires detailed planning to ensure there
are no single points of failure throughout the infrastructure. Oracle Clusterware constantly
monitors any process that is under its control, which includes all the Oracle software such as the
Oracle instance, listener, etc. Oracle Clusterware has been programmed to recover from failures
which occur for the Oracle processes. In order to do its monitoring and recovery, various
system activities happen on a regular basis such as user authentication, sudo, and hostname
resolution. In order for the cluster to be highly available, it must be able to perform these activities
at all times. For example, if you choose to use the Lightweight Directory Access Protocol (LDAP)
for authentication, then you must make the LDAP server highly available as well as the network
connecting the users, application, database and LDAP server. If the database is up but the users
cannot connect to the database because the LDAP server is not accessible, then the entire
system is down in the eyes of your users. When using external authentication such as LDAP or
NIS (Network Information Service), a public network failure will cause failures within the cluster.
Oracle recommends that the hostname, vip, and interconnect are defined in the /etc/hosts file on
all nodes in the cluster.
During the testing of the RAC implementation, you should include a destructive testing phase.
This is a systematic set of tests of your configuration to ensure that 1) you know what to expect if the
failure occurs and how to recover from it and 2) that the system behaves as expected during the
failure. This is a good time to review operating procedures and document recovery procedures.
Destructive testing should include tests such as node failure, instance failure, public network
failure, interconnect failures, storage failure, storage network failure, voting disk failure, loss of an
OCR, and loss of ASM.
Using features of Oracle Real Application Clusters and Oracle Clients including Fast Application
Notification (FAN), Fast Connection Failover (FCF), Oracle Net Service Connection Load
Balancing, and the Load Balancing Advisory, applications can mask most failures and provide a
very highly available application. For details on implementing best practices, see the MAA
document Client Failover Best Practices for Highly Available Oracle Databases and the Oracle
RAC Administration and Deployment Guide.
Why am I seeing the following warnings in my listener.log for my
RAC 10g environment?
WARNING: Subscription for node down event still pending
This message indicates that the listener was not able to subscribe to the ONS events which it
uses to do the connection load balancing. This is most likely due to starting the listener using
lsnrctl from the database home. When you start the listener using lsnrctl, make sure you have set
the environment variable ORACLE_CONFIG_HOME = {Oracle Clusterware HOME}, and also set it in
racgwrap in $ORACLE_HOME/bin for the database.
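A sketch of what that looks like in practice (the Clusterware home path and listener name are assumptions):
$ export ORACLE_CONFIG_HOME=/u01/app/oracle/product/10.2.0/crs
$ lsnrctl start LISTENER_NODE1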
An UP event is processed both for (a) new instance joins and (b) a down followed by an
instance UP. This has no relevance to initialLimit or even minLimit. When an UP event comes into
the JDBC Implicit Connection Cache, we will create some new connections. Assuming you have
your listener load balancing set up properly, those connections should go to the instance
that was just started. When your application gets a connection from the pool, it will be given an
idle connection; if you are running 10.2 and have the Load Balancing Advisory turned on for the
service, we will allocate the session based on the defined goal to provide the best service level.
MaxLimit, when set, defines the upper boundary limit for the connection cache. By default,
maxLimit is unbounded - your database sets the limit.
Can I use the 10.2 JDBC driver with 10.1 database for FCF?
Yes, with the patch for Bug 5657975 on top of 10.2.0.3, the 10.2 JDBC driver will work with a 10.1
database. The fix will be part of the 10.2.0.4 patchset. If you do not have the patch, then for FCF
use the 10.2 JDBC driver with a 10.2 database; if the database is 10.1, use the 10.1 JDBC driver.
Will FAN/FCF work with the default database service?
No. If you want the advanced features of RAC provided by FAN and FCF, then create a cluster
managed service for your application. Use the Clustered Managed Services Page in Enterprise
Manager DBControl to do this.
But in general, please take into consideration that memory requirements per instance are
reduced when the same user population is distributed over multiple nodes. Assuming the same
user population, N nodes, and a buffer cache of M on the original single system, each instance needs
roughly (M / N) + ((M / N) * 0.10), i.e. one Nth of the original buffer cache plus about 10% overhead.
For example, with M = 2G, N = 2 and no extra memory for failed-over users:
( 2G / 2 ) + (( 2G / 2 ) * 0.10) = 1G + 100M
What is the Load Balancing Advisory?
To assist in the balancing of application workload across designated resources, Oracle Database
10g Release 2 provides the Load Balancing Advisory. This Advisory monitors the current
workload activity across the cluster and, for each instance where a service is active, provides a
percentage value of how much of the total workload should be sent to that instance as well as a
service quality flag. The feedback is provided as an entry in the Automatic Workload Repository
and a FAN event is published. The easiest way for an application to take advantage of the load
balancing advisory is to enable Runtime Connection Load Balancing with an integrated client.
A more reliable, interactive way on Linux is to use the iptraf utility or the prebuilt rpms from
Red Hat or Novell (SuSE); another option on Linux is Netperf. On other Unix platforms: "snoop -S
-tr -s 64 -d hme0"; AIX's topas can show this as well. Try to look for the peak (not average)
usage and see if that is acceptably fast.
Remember that NIC bandwidth is measured in Mbps or Gbps (which is BITS per second) and
output from the above utilities can sometimes come in BYTES per second, so for comparison do the
proper conversion (divide the bps value by 8 to get bytes/sec, or multiply the bytes value by 8 to get the
bps value).
Additionally, you cannot expect a network device to run at full capacity with 100% efficiency, due to
concurrency, collisions and retransmits that happen more frequently as the utilization gets
higher. If you are reaching high levels, consider a faster interconnect or NIC bonding (multiple
NICs all servicing the same IP address).
Finally, the above measures bandwidth utilization (how much), not latency (how fast), of the
interconnect; you may still be suffering from a high-latency connection (slow link) even though there
is plenty of bandwidth to spare. Most experts agree that low latency is by far more important than
high bandwidth with respect to specifications of the private interconnect in RAC. Latency is best
measured by the actual user of the network link (RAC in this case); review statspack for stats on
latency. Also, in 10gR2 Grid Control you can view Global Cache Block Access Latency, and you can
also drill down to the Cluster Cache Coherency page to see the cluster cache coherency metrics
for the entire cluster database.
Keep in mind that RAC uses the private interconnect like it was never used before, to
synchronize memory regions (SGAs) of multiple nodes (remember, since 9i, entire data blocks
are shipped across the interconnect). If the network is utilized at 50% bandwidth, this means that
50% of the time it is busy and not available to potential users. In this case delays (due to
collisions and concurrency) will increase the latency even though the bandwidth might look
"reasonable", hiding the real issue.
Does Database blocksize or tablespace blocksize affect how the
data is passed across the interconnect?
Oracle ships database block buffers, i.e. blocks in a tablespace configured for 16K will result in a
16K data buffer being shipped, blocks residing in a tablespace with the base block size (8K) will be
shipped as base blocks, and so on; the data buffers are broken down into packets of MTU size.
Storage vendors may sometimes discourage such testing, boasting about their amazing front- or
back-end battery-backed memory caches that "eliminate" all I/O bottlenecks. This is all great, and
you should take advantage of such caches as much as possible; however, there is no substitute
for a real-world test. You may uncover that the HBA (Host Bus Adapter) firmware or the driver
versions are outdated (before you claim poor RAC / Application scalability).
It is highly recommended to test this storage scalability early on so that expectations are set
accordingly. On Linux there is a freely available tool released on OTN called ORION (Oracle I/O
test tool) which simulates Oracle I/O.
On other Unix platforms (as well as Linux) one can use IOzone; if a prebuilt binary is not available
you should build it from source. Make sure to use version 3.271 or later and, if testing raw/block
devices, add the "-I" flag.
In a basic read test you will try to demonstrate that a certain IO throughput can be maintained as
nodes are added. Try to simulate your database io patterns as much as possible, i.e. blocksize,
number of simultaneous readers, rates, etc.
For example, on a 4 node cluster, from node 1 you measure 20MB/sec, then you start a read
stream on node 2 and see another 20MB/sec while the first node shows no decrease. You then
run another stream on node 3 and get another 20MB/sec; in the end you run 4 streams on 4
nodes and get an aggregated 80MB/sec or close to that. This will prove that the shared storage
is scalable. Obviously if you see poor scalability in this phase, that will be carried over and be
observed or interpreted as poor RAC / Application scalability.
In many cases RAC / Application scalability is blamed for no real reason when, in fact, the underlying
IO subsystem is simply not scalable.
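A very rough sketch of one such read stream (the device path and sizes are illustrative; run additional copies from the other nodes and watch the aggregate throughput):
$ dd if=/dev/sde of=/dev/null bs=1M count=20000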
How many nodes are supported in a RAC Database?
With 10g Release 2, we support 100 nodes in a cluster using Oracle Clusterware, and 100
instances in a RAC database. Currently DBCA has a bug where it will not go beyond 63
instances. There is also a documentation bug for the max-instances parameter. With 10g
Release 1 the Maximum is 63. In 9i it is platform specific due to the different clusterware support
by vendors. See the platform specific FAQ for 9i.
After copying the above mentioned files from the CD, you can relink oracle using:
make -f ins_rdbms.mk rac_on ipc_udp ioracle
NOTE: Oracle RAC 11g will not support LLT for interconnect.
Ip:
84884742 total packets received
1201 fragments dropped after timeout
3384 packet reassembles failed
ifconfig -a:
What storage option should I use for RAC 10g on Linux? ASM /
OCFS / Raw Devices / Block Devices / Ext3 ?
The recommended way to manage large amounts of storage in a RAC environment is ASM
(Automatic Storage Management). If you really need/want a clustered filesystem, then Oracle offers
OCFS (Oracle Cluster File System); for the 2.4 kernel (RHEL3/SLES8) use OCFS Version 1 and
for the 2.6 kernel (RHEL4/SLES9) use OCFS2. All these options are free to use and completely
supported; ASM is bundled in the RDBMS software, and OCFS as well as ASMLib are freely
downloadable from Oracle's OSS (Open Source Software) website.
EXT3 is out of the question, since its data structures are not cluster aware; that is, if you mount
an ext3 filesystem from multiple nodes, it will quickly get corrupted.
Other options are NFS and iSCSI; both are outside the scope of this FAQ but are included
for completeness.
If for any reason the above options (ASM/OCFS) are not good enough and you insist on using
'raw devices' or 'block devices' here are the details on the two (This information is still very useful
to know in the context of ASM and OCFS).
Block devices (/dev/sde9) are **BUFFERED** devices! Unless you explicitly open them with
O_DIRECT, you will get buffered (Linux buffer cache) IO.
Character devices (/dev/raw/raw9) are **UNBUFFERED** devices! No matter how you open
them, you always get unbuffered IO, hence there is no need to specify O_DIRECT on the file open call.
The above is not a typo: block devices on Unix do buffered IO by default (cached in the Linux buffer
cache), which means that RAC cannot operate on them (unless they are opened with O_DIRECT), since
the IOs will not be immediately visible to other nodes.
You may check if a device is block or character device by the first letter printed with the "ls -l"
command:
crw-rw---- 1 root disk 162, 1 Jan 23 19:53 /dev/raw/raw1
brw-rw---- 1 root disk 8, 112 Jan 23 14:51 /dev/sdh
Above, "c" stands for character device, and "b" for block devices.
Starting with Oracle 10.1, an RDBMS fix added the O_DIRECT flag to the open call (the O_DIRECT
flag tells the Linux kernel to bypass the Linux buffer cache and write directly to disk); in the case
of a block device, that meant that a create datafile on '/dev/sde9' would succeed (you need to set
filesystemio_options=directio in the init.ora). This enhancement was well received, and shortly after,
bug 4309443 was fixed (by adding the O_DIRECT flag on the OCR file open call), meaning that
starting with 10.2 (there are several 10.1 backports available) the Oracle OCR file could also
access block devices directly. For the voting disk to be opened with O_DIRECT you need the fix for
bug 4466428 (5021707 is a duplicate). This means that both voting disks and OCR files can live
on block devices. However, due to OUI bug 5005148, there is still a need to configure raw
devices for the voting or OCR files during installation of RAC; this is not such a big deal, since it is just 5
files in most cases. It is not possible to ask for a backport of this bug since it would mean a full re-
release of 10g; one alternative, if raw devices are not a good option, is to use the 11g Clusterware
(with a 10g RAC database).
By using block devices you no longer have to live with the limitation of 255 raw devices per
node; you can access as many block devices as the system can support. Block devices also
carry persistent permissions across reboots, while with raw devices one would have to customize
that after installation, otherwise the Clusterware stack or database would fail to start up due to
permission issues.
ASM or ASMLib can be given the raw devices (/dev/raw/raw2), as was done in the initial
deployment of 10g Release 1, or, the more recommended way, ASM/ASMLib can be given the
block devices directly (e.g. /dev/sde9).
Since raw devices are being phased out of Linux in the long term, it is recommended that everyone
switch to using block devices (meaning, pass these block devices to ASM, OCFS/2 or Oracle
Clusterware).
$ gsdctl stop
What is the purpose of the gsd service in Oracle 9i RAC?
GSD is only needed for configuration/management of the cluster database. Once the database has been
configured and is up, it can be safely stopped provided you don't run any of the 'srvctl', 'dbca' or 'dbua'
tools. In 9i RAC, the GSD doesn't write anywhere unless tracing was turned on, in which case
traces go to stdout.
Once the database has been configured and started, and you don't use 'srvctl or EM' to manage
it, 'dbca' to extend/remove it, or 'dbua' to upgrade it, GSD can be stopped.
We recommend using Automatic Segment Space Management rather than trying to manage
space manually. Unless you are migrating from an earlier database version with OPS and have
already built and tuned the necessary structures, Automatic Segment Space Management is the
preferred approach.
Automatic Segment Space Management is NOT the default, you need to set it.
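For example (the tablespace name and disk group are hypothetical):
SQL> create tablespace app_data datafile '+DATA' size 10g
     extent management local segment space management auto;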
I was installing RAC and my Oracle files did not get copied to the
remote node(s). What went wrong?
First make sure the cluster is running and is available on all nodes. You should be able to see all
nodes when running an 'lsnodes -v' command.
If lsnodes shows that all members of the cluster are available, then you may have an rcp/rsh
problem on Unix or shares have not been configured on Windows.
You can test rcp/rsh on Unix by issuing the following from each node:
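As an illustration (node names and paths are made up), a basic check that must succeed without a password prompt is:
$ rsh node2 hostname
$ touch /tmp/rcp_test && rcp /tmp/rcp_test node2:/tmp/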
More information can be found in the Step-by-Step RAC notes available on Metalink. To find
these search Metalink for 'Step-by-Step Installation of RAC'.
Each machine has a different clock frequency and, as a result, a slightly different time drift. NTP
computes this time drift about every 15 minutes and stores the information in a "drift" file; it then
adjusts the system clock based on this known drift, as well as comparing it to a given time server
the sys-admins set up. This is the recommended approach.
• Minor changes in time (in the seconds range) are harmless for RAC and the Oracle
clusterware. If you intend on making large time changes it is best to shutdown the instances and
the entire Clusterware stack on that node to avoid a false eviction, especially if you are using the
10g low-brownout patches, which allow really low misscount settings.
• Backup/recovery aspects of large time changes are documented in note 77370.1; basically you
cannot use RECOVER DATABASE UNTIL TIME to reach the second recovery point, though it is possible
to overcome this with RECOVER DATABASE UNTIL CANCEL or UNTIL CHANGE. If you are doing
complete recovery (most of the time) then this is not an issue, since the Oracle recovery code
uses SCNs (System Change Numbers) to advance in the redo/archive logs. SCNs
never go back in time (unless a reset-logs operation is performed); there is always an association
of an SCN to a human-readable timestamp (which may change forward or backwards), hence the
issue with recovery until a point in time vs. until SCN/Cancel.
• If DBMS_SCHEDULER is in use, it will be affected by time changes, as it uses the actual clock
rather than SCNs.
• On platforms with OPROCD get fix for bug 5015469 "OPROCD REBOOTS NODE WHEN
TIME IS SET BACK BY XNTPD"
Apart from these issues, the Oracle server is immune to time changes, i.e. they will not affect
transaction/read consistency operations.
On Linux the "-x" flag can be added to the ntpd daemon to prevent the clock from going
backwards.
> pwd
/ora/install/4933522
Then use the following OPatch command:
> opatch query -is_rolling
...
Query ...
Please enter the patch location:
/ora/install/4933522
---------- Query starts ------------------
Patch ID: 4933522
....
Rolling Patch: True.
---------- Query ends -------------------
b) Prior to performing the Grid Control agent install, just set the CLUSTER_NAME environment
variable and run the install. This variable needs to be set only for that install session; there is no need to
set it every time the agent starts.
Sun: 8
HP UX: 16
HP Tru64: 8
IBM AIX:
<> Please note that certifications for Real Application Clusters are performed against the
Operating System and Clusterware versions. The corresponding system hardware is offered by
System vendors and specialized Technology vendors. Some system vendors offer pre-installed,
pre-configured RAC clusters. These are included below under the corresponding OS platform
selection within the certification matrix.
DBCA cannot be used to create databases on file systems on Oracle 9i Release 1. The user can
choose to set up a database on raw devices, and have DBCA output a script. The script can then
be modified to use cluster file systems instead.
With Oracle 9i RAC Release 2 (Oracle 9.2), DBCA can be used to create databases on a cluster
filesystem. If the ORACLE_HOME is stored on the cluster filesystem, the tool will work directly. If
ORACLE_HOME is on local drives on each system, and the customer wishes to place database
files onto a cluster file system, they must invoke DBCA as follows: dbca -datafileDestination
/oradata where /oradata is on the CFS filesystem. See 9iR2 README and bug 2300874 for more
info.
Detailed Reasons:
1) cross-cabling limits the expansion of RAC to two nodes
2) cross-cabling is unstable:
a) Some NIC cards do not work properly with it. They are not able
to negotiate the DTE/DCE clocking, and will thus not function. These
NICS were made cheaper by assuming that the switch was going to have
the clock. Unfortunately there is no way to know which NICs do not
have that clock.
b) Media sense behaviour on various OS's (most notably Windows) will
bring a NIC down when a cable is disconnected.
Either of these issues can lead to cluster instability and to ORA-29740 errors (node evictions).
Please carefully read the following new information about configuring Oracle Cluster
Management on Linux, provided as part of the patch README:
[5000(msec) is hardcoded]
Note that the soft_margin is measured in seconds, -m and WatchMarginWait are measured in
milliseconds.
If CPU utilization in your system is high and you experience unexpected node reboots, check the
wdd.log file. If there are any 'ping came too late' messages, increase the value of the above
parameters.
Can RAC 10g and 9i RAC be installed and run on the same
physical Linux cluster?
Yes - However Oracle Clusterware (CRS) will not support a 9i RAC database so you will have to
leave the current configuration in place. You can install Oracle Clusterware and RAC 10g into the
same cluster. On Windows and Linux, you must run the 9i Cluster Manager for the 9i Database
and the Oracle Clusterware for the 10g Database. When you install Oracle Clusterware, your 9i
srvconfig file will be converted to the OCR. Both 9i RAC and 10g will use the OCR. Do not restart
the 9i gsd after you have installed Oracle Clusterware. Remember to check certify for details of
what vendor clusterware can be run with Oracle Clusterware.
Is the hangcheck timer still needed with Oracle RAC 10g and 11g?
YES! The hangcheck-timer module monitors the Linux kernel for extended operating system
hangs that could affect the reliability of the RAC node ( I/O fencing) and cause database
corruption. To verify the hangcheck-timer module is running on every node:
as root user:
/sbin/lsmod | grep hangcheck
To ensure the module is loaded every time the system reboots, verify that the local system
startup file (/etc/rc.d/rc.local) contains the command above.
For additional information please review the Oracle RAC Install and Configuration Guide (5-41)
and note:726833.1.
How to configure bonding on Suse SLES8.
Please see note:291958.1
The way to fix this on RHEL4, OEL4 and SLES9 is to create /etc/udev/permission.d/40-
udev.permissions (you must choose a number that is lower than 50). You can do this by copying
/etc/udev/permission.d/50-udev.permissions and removing the lines that are not needed (50-
udev.permissions gets replaced with upgrades, so you do not want to edit it directly; also, a typo in
50-udev.permissions can render the system non-usable). Example permissions file:
# raw devices
raw/raw[1-2]:root:oinstall:0640
raw/raw[3-5]:oracle:oinstall:0660
Note that this applies to all raw device files; here just the voting and OCR devices were specified.
On RHEL5, OEL5 and SLES10 a different file, /etc/udev/rules.d/99-raw.rules, is used; notice that
now the number must be (any number) higher than 50. Also, the syntax of the rules is different
from the permissions file; here's an example:
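A sketch of what such a rules file might contain (device names, raw bindings and ownerships are illustrative):
# /etc/udev/rules.d/99-raw.rules
ACTION=="add", KERNEL=="sdb1", RUN+="/bin/raw /dev/raw/raw1 %N"
ACTION=="add", KERNEL=="raw1", OWNER="root", GROUP="oinstall", MODE="0640"
ACTION=="add", KERNEL=="raw2", OWNER="oracle", GROUP="oinstall", MODE="0660"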
You can check your interconnect through the alert log at startup. Check for the string cluster
interconnect IPC version:Oracle RDS/IP (generic) in the alert.log file.
11g Clusterware doesn't require this configuration since the installer can handle block devices
directly.
In local containers, you cannot manipulate hardware in any way, shape or form. You can't plumb
and unplumb network interfaces .... nothing ... even as the local container root user. You can only
do this in the global container. We rely on the uadmin command to quickly bring down a node if
an urgent condition is detected. As I recall, you can't do this from the local container either. CRS
has to maintain the ability to manipulate hardware and this just is not going to happen in a local
container.
The answer is the same if you are using Vendor Clusterware such as Veritas SF RAC or Sun
Cluster.
Can I run my 9i RAC and RAC 10g on the same Windows cluster?
Yes, but the 9i RAC database must use the 9i Cluster Manager and you must run Oracle
Clusterware for the Oracle Database 10g. 9i Cluster Manager can coexist with Oracle
Clusterware 10g.
Be sure to use the same 'cluster name' in the appropriate OUI field for both 9i and 10g when you
install both together in the same cluster.
The OracleCMService9i service will remain intact during the Oracle Clusterware 10g install; as a
9i RAC database requires the 9i OracleCMService9i, it should be left running. The
information for the 9i database will get migrated to the OCR during the Oracle Clusterware
installation. Then, for future database management, you would use the 9i srvctl to manage the 9i
database, and the 10g srvctl to manage any new 10g databases. Both srvctl commands will use
the OCR.
Is HACMP needed for RAC on AIX 5.2 using GPFS file system?
The newest version of GPFS can be used without HACMP; if it is available for AIX 5.2, then you
do not need HACMP.
Newer versions of RDA (Remote Diagnostic Agent) have the RAC-DDT functionality, so going
forward RDA is the tool of choice. The RDA User Guide is in <>
If your processing requirements are extreme and your testing proves you must partition your
workload in order to reduce internode communications, you can use Profile Options to designate
that sessions for certain applications Responsibilities are created on a specific middle tier server.
That middle tier server would then be configured to connect to a specific database instance.
To determine the correct partitioning for your installation you would need to consider several
factors like number of concurrent users, batch users, modules used, workload characteristics etc.
We also recommend you configure the forms error URL to identify a fallback middle tier server for
Forms processes, if no router is available to accomplish switching across servers.
2. Use a Clustered File System for all database files or migrate all database files to raw devices.
(Use dd for Unix or ocopy for NT.)
3. Install/upgrade to the latest available e-Business suite.
5. In step 4, install the RAC option while installing Oracle9i and use the Installer to perform the install
for all the nodes.
Reference Documents:
Oracle E-Business Suite Release 11i with 9i RAC: Installation and Configuration : Metalink <>
E-Business Suite 11i on RAC : Configuring Database Load balancing & Failover: Metalink <>
Oracle E-Business Suite 11i and Database - FAQ : Metalink# <>
- Datafiles
- Control Files
- Redo Logs
- Archive Logs
- SPFILE
The Oracle Clusterware files (OCR and voting disk) can be put on OCFS2; however, the Best Practice
is to put them on raw or block devices.
Is Sun QFS supported with RAC? What about Sun GFS?
From certify, check there for the latest details.
Sun Cluster - Sun StorEdge QFS (9.2.0.5 and higher,10g and 10gR2):
No restrictions on placement of files on QFS
Sun StorEdge QFS is supported for Oracle binary executables, database data files, archive logs,
the Oracle Cluster Registry (OCR), and the Oracle Cluster Ready Services voting disk; the recovery
area can also be placed on QFS.
Solaris Volume Manager for Sun Cluster can be used for host-based mirroring
Supports up to 8 nodes
to
# Allow the daemon to drop a diagnostic core file
ulimit -c unlimited
ulimit -n 65536
In cases where root.sh fails to execute on an initial install (or a new node joining an
existing cluster), it is OK to re-run root.sh after the cause of the failure is corrected (permissions,
paths, etc.). In this case, please run rootdelete.sh to undo the local effects of root.sh before re-
running root.sh.
The private interconnect enforcement page determines which private interconnect will be used by
the RAC instances.
It is equivalent to setting the CLUSTER_INTERCONNECTS init.ora parameter, but is more
convenient because it is a cluster-wide setting that does not have to be adjusted every time you
add nodes or instances. RAC will use all of the interconnects listed as private on this screen, and
they all have to be up, just as their IP addresses have to be when specified in the init.ora
parameter. RAC does not fail over between cluster interconnects; if one is down then the instances
using it will not start.
What should the permissions be set to for the voting disk and ocr
when doing a RAC Install?
The Oracle Real Application Clusters install guide is correct. It describes the PRE-INSTALL
ownership/permission requirements for the OCR and voting disk. This step is needed to make sure that
the CRS install succeeds. Please don't use those values to determine what the
ownership/permissions should be POST-INSTALL. The root script will change the
ownership/permissions of the OCR and voting disk as part of the install. The POST-INSTALL permissions
will end up being: OCR - root:oinstall - 640; Voting Disk - oracle:oinstall - 644.
As long as you can confirm via the CSS daemon logfile that it thinks the voting disk is bad, you
can restore the voting disk from backup while the cluster is online. This is the backup that you
took with dd (by the manual's request) after the most recent addnode, deletenode, or install
operation. If by accident you restore a voting disk that the CSS daemon thinks is NOT bad, then
the entire cluster will probably go down.
crsctl add css votedisk - adds a new voting disk
crsctl delete css votedisk - removes a voting disk
Note: the cluster has to be down. You can also restore the backup via dd when the cluster is
down.
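For illustration, with a hypothetical raw device and backup path:
# backup, taken after an addnode/deletenode or install while the cluster is healthy
$ dd if=/dev/raw/raw3 of=/backup/votedisk1.bak bs=4k
# restore, once CSS has confirmed the voting disk is bad (or with the cluster down)
$ dd if=/backup/votedisk1.bak of=/dev/raw/raw3 bs=4k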
If you want to take a logical copy of the OCR at any time, use: ocrconfig -export <file_name>,
and use the -import option to restore the contents back.
Note: Customers should pay close attention to the bonding setup/configuration/features and
ensure their objectives are met, since some solutions provide only failover, some only
load balancing, and still others claim to provide both. As always, it is important to test your setup
to ensure it does what it was designed to do.
When bonding with network interfaces that connect to separate switches (for redundancy) you
must test whether the NICs are configured for active/active mode. The most reliable configuration for
this architecture is to configure the NICs for active/passive.
"Non-synchronized access" (i.e. database corruption) is prevented by ensuring that the remote
node is down before reassigning its locks. The voting disk, network, and the control file are used
to determine when a remote node is down, in different, parallel, independent ways that allow
each to provide additional protection compared to the other. The algorithms used for each of
these three things are quite different.
As far as voting disks are concerned, a node must be able to access strictly more than half of the
voting disks at any time. So if you want to be able to tolerate a failure of n voting disks, you must
have at least 2n+1 configured. (n=1 means 3 voting disks). You can configure up to 32 voting
disks, providing protection against 15 simultaneous disk failures, however it's unlikely that any
customer would have enough disk systems with statistically independent failure characteristics
that such a configuration is meaningful. At any rate, configuring multiple voting disks increases
the system's tolerance of disk failures (i.e. increases reliability).
Configuring a smaller number of voting disks on some kind of RAID system can allow a customer
to use some other means of reliability than the CSS's multiple voting disk mechanisms. However
there seem to be quite a few RAID systems that decide that 30-60 second (or 45 minutes in the
case of Veritas) IO latencies are acceptable, yet we have to wait for at least the longest IO
latency before we can declare a node dead and allow the database to reassign database blocks.
So while using an independent RAID system for the voting disk may appear appealing,
sometimes there are failover latency consequences.
How can I register the listener with Oracle Clusterware in RAC 10g
Release 2?
NetCA is the only tool that configures listener and you should be always using it. It will register
the listener with Oracle Clusterware. There are no other supported alternatives.
Similarly, all private NICs must also have the same names on all nodes (ER 6785792 filed to
remove this requirement).
Do not mix NICs with different interface types (infiniband, ethernet, hyperfabric, etc.) for the same
subnet/network.
Why does Oracle still use the voting disks when other cluster
software is present?
Voting disks are still used when 3rd party vendor clusterware is present, because vendor
clusterware is not able to monitor/detect all failures that matter to Oracle Clusterware and the
database. For example one known case is when the vendor clusterware is set to have its
heartbeat go over a different network than RAC traffic. Continuing to use the voting disks allows
CSS to resolve situations which would otherwise end up in cluster hangs.
I made a mistake when I created the VIP during the install of Oracle
Clusterware, can I change the VIP?
Yes The details of how to do this are described in <>
What are the licensing rules for Oracle Clusterware? Can I run it
without RAC?
Check the Oracle Database Licensing Information 11g Release 1 (11.1), Part Number B28287-
01, and look in the Special Use section under Oracle Database Editions.
What is a stage?
CVU supports the notion of Stage verification. It identifies all the important stages in RAC
deployment and provides each stage with its own entry and exit criteria. The entry criteria for a
stage define a specific set of verification tasks to be performed before initiating that stage. This
pre-check saves the user from entering into a stage unless its pre-requisite conditions are met.
The exit criteria for a stage define another specific set of verification tasks to be performed after
completion of the stage. The post-check ensures that the activities for that stage have been
completed successfully. It identifies any stage-specific problem before it propagates to
subsequent stages, where it would be difficult to find its root cause. An example of a stage is "pre-
check of database installation", which checks whether the system meets the criteria for a RAC
install.
What is a component?
CVU supports the notion of Component verification. The verifications in this category are not
associated with any specific stage. The user can verify the correctness of a specific cluster
component. A component can range from a basic one, like free disk space, to a complex one like the
CRS stack. The integrity check for the CRS stack will transparently span verification of multiple
sub-components associated with the CRS stack. This encapsulation of a set of tasks within a specific
component verification should be of great ease to the user.
What is nodelist?
Nodelist is a comma-separated list of hostnames without the domain. Cluvfy will ignore any domain
while processing the nodelist. If duplicate entries exist after removing the domain, cluvfy will
eliminate the duplicate names while processing. Wherever supported, you can use '-n all' to
check on all the cluster nodes. Check this for more information on nodelist and shortcuts.
What are the default values for the command line arguments?
Here are the default values and behavior for different stage and component commands:
Do I have to type the nodelist every time for the CVU commands?
Is there any shortcut?
You do not have to type the nodelist every time for the CVU commands. Typing the nodelist for a
large cluster is painful and error prone. Here are a few shortcuts. To provide all the nodes of the
cluster, type '-n all'. Cluvfy will attempt to get the nodelist in the following order:
1. If a vendor clusterware is available, it will pick all the configured nodes from the vendor
clusterware using the lsnodes utility.
2. If CRS is installed, it will pick all the configured nodes from Oracle Clusterware using the
olsnodes utility.
3. If none of the above, it will look for the CV_NODE_ALL environmental variable. If this variable is
not defined, it will complain.
To provide a partial list (some of the nodes of the cluster), you can set an environmental variable
and use it in the CVU command. For example:
setenv MYNODES node1,node3,node5
cluvfy comp nodecon -n $MYNODES
How do I get detail output of a check?
Cluvfy supports a verbose feature. By default, cluvfy reports in non-verbose mode and just
reports the summary of a test. To get detailed output of a check, use the flag '-verbose' on the
command line. This will produce detailed output of the individual checks and, where applicable, will
show per-node results in a tabular fashion.
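For example (node names are hypothetical):
$ cluvfy stage -pre crsinst -n node1,node2 -verbose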
Why does the peer comparison with -refnode say passed when the
group or user does not exist?
Peer comparison with the -refnode feature acts like a baseline feature. It compares the system
properties of other nodes against the reference node. If a value does not match (is not equal to the
reference node value), then it flags that as a deviation from the reference node. If a group or user
does not exist on the reference node as well as on the other node, it will report this as 'matched',
since there is no deviation from the reference node. Similarly, it will report 'mismatched' for a
node with higher total memory than the reference node, for the same reason.