Database Alert Log Metrics

Monitoring 11g Database Alert Log Errors in Enterprise Manager [ID 949858.1]
Modified: 13-SEP-2011    Type: BULLETIN    Status: PUBLISHED

In this Document
Purpose
Scope and Application
Monitoring 11g Database Alert Log Errors in Enterprise Manager
Overview of the New Mechanism for 11g DB Alert Log Monitoring
Errors and their related Categories
Alert Log Monitoring in previous versions of EM and for previous versions of the DB
Known Issues
Gotchas
FAQ
References

Applies to:
Enterprise Manager Grid Control - Version: 10.2.0.5 to 10.2.0.5 - Release: 10.2 to 10.2
Enterprise Manager for RDBMS - Version: 11.1.0.6 and later [Release: 11.1 and later]
Information in this document applies to any platform.

Purpose
This bulletin describes in detail how Enterprise Manager monitors the alert log of 11g database targets.

Scope and Application


Enterprise Manager users who need to monitor the alert log of 11g databases for errors.

Monitoring 11g Database Alert Log Errors in Enterprise Manager

Overview of the New Mechanism for 11g DB Alert Log Monitoring
As of Enterprise Manager 10.2.0.5.0, when a version 11g or later database target is monitored by a version 10.2.0.5.0 or later Oracle Management Agent, the mechanism and metrics that raise EM alerts for errors in the alert log have changed significantly. The mechanism is now tightly integrated with the Support Workbench, so a package can be generated for each reported problem/incident and quickly uploaded to Oracle Support.

The new mechanism no longer parses the text version of the alert log. Instead, it reads the log.xml file, which replaces the text-based alert log going forward. Currently both the text-based alert_[SID].log file and the log.xml file are maintained, but alert_[SID].log is deprecated and may disappear in future releases of the DB.

As a consequence of the integration with the Support Workbench, errors have been categorized into classes and groups, and out of the box only those classed as significant enough to warrant further investigation raise alerts. At the highest level there are two classes of error, incidents and operational errors:

Incidents are defined as "incidents, for example, generic internal error, access violation, and so on as recorded in the database alert log file ... [signifying] that the database being monitored has detected a critical error condition about the database and has generated an incident to the alert log file".

Operational Errors are defined as "errors that may affect the operation of the database, for example, archiver hung, media failure, and so on as recorded in the database alert log file ... [signifying] that the database being monitored has detected a critical error condition that may affect the normal operation of the database and has generated an error to the alert log file".

In other words, the errors identified as important are split into two classes. Incidents are errors that logically require further investigation: they can be rolled up into problems in ADR and sent to Oracle, with the related trace files, via the Support Workbench. Operational errors, although important, do not require further investigation by Support. For example, it does not normally make sense to send trace files to Support about an Archiver Hung alert; you simply need to free up space so that the archiver can proceed.

Within each class of error (incident or operational error) there are several groups, or categories, of errors. In Enterprise Manager, the 'Generic Alert Log Error' metric has been replaced with several new metrics, each mapping to one of these categories. Alerts are only raised for errors that have been given a category in the Support Workbench and have therefore generated incidents or operational errors. Consequently, out of the box, some errors that raised an EM alert under the previous mechanism will not raise an alert under the new one.
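To make the change concrete: the Agent now works from structured entries in log.xml instead of matching lines of alert_[SID].log. The sketch below is a minimal Python illustration, not the Agent's implementation. It assumes that log.xml is a series of <msg> elements with no single root element, and that incidents carry type='INCIDENT_ERROR' together with group and prob_key attributes; these attribute names and the sample entries are assumptions to check against your own log.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample resembling 11g log.xml entries. The attribute names
# (type, group, prob_key) are assumptions, not taken from this note.
SAMPLE_LOG_XML = """
<msg time='2011-09-13T10:00:00.000+00:00' org_id='oracle' comp_id='rdbms'
 type='INCIDENT_ERROR' group='Generic Internal Error' level='1'
 prob_key='ORA 600 [hypothetical]'>
 <txt>ORA-00600: internal error code, arguments: [hypothetical]</txt>
</msg>
<msg time='2011-09-13T10:05:00.000+00:00' org_id='oracle' comp_id='rdbms'
 type='UNKNOWN' level='16'>
 <txt>Thread 1 advanced to log sequence 42</txt>
</msg>
"""


def iter_messages(log_xml_text):
    """Yield (attributes, text) for every <msg> entry.

    log.xml is assumed to have no single root element, so the content is
    wrapped in a synthetic <log> element before parsing.
    """
    root = ET.fromstring("<log>" + log_xml_text + "</log>")
    for msg in root.iter("msg"):
        yield dict(msg.attrib), msg.findtext("txt", default="").strip()


def alertable_entries(log_xml_text, wanted_types=("INCIDENT_ERROR",)):
    """Return (group, problem key, text) for entries whose type is wanted.

    Ordinary informational entries are skipped, mirroring the rule that
    only categorized errors can raise an alert under the new mechanism.
    """
    return [
        (attrs.get("group"), attrs.get("prob_key"), text)
        for attrs, text in iter_messages(log_xml_text)
        if attrs.get("type") in wanted_types
    ]


if __name__ == "__main__":
    for group, key, text in alertable_entries(SAMPLE_LOG_XML):
        print(group, "|", key, "|", text)
```

If your log.xml marks operational errors with their own type value, that value would be added to wanted_types; this note does not say what it is.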

Errors and their related Categories


Category / New Metric        Class               ORA- Errors / Notes
Generic Internal Error       Incident            ORA-600
Access Violation             Incident            ORA-7445
Session Terminated           Incident            ORA-3113, ORA-603
Out of Memory                Incident            ORA-4030, ORA-4031
Redo Log Corruption          Incident            ORA-353, ORA-355, ORA-356
Inconsistent DB State        Incident            ORA-1410, ORA-8103
Deadlock                     Incident            ORA-4020 (see note on Deadlock below)
Internal SQL Error           Incident            ORA-604
Cluster Error                Incident            ORA-29740
File Access Error            Incident            No errors currently raise alerts in this category.

Note on Deadlock: ORA-0060 does not raise an incident and is not part of this group. This group is reserved for system deadlocks, not application-level deadlocks, that cannot be resolved automatically and so indicate a bug that should be followed up with Oracle Support.

https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=BUL...

11.11.2011.

Data Block Corruption        Operational Error   ORA-1578, ORA-1157, ORA-27048
Archiver Hung                Operational Error   No errors currently raise alerts in this category.
Media Failure                Operational Error   ORA-1242, ORA-1243
Generic Incident             Incident            Catch-all for all errors that generate incidents but do not fall under any of the other groups
Generic Operational Error    Operational Error   Catch-all for all errors that generate operational errors but do not fall under any of the other groups; see NOTE 1, below

The errors listed for the two catch-all categories include ORA-700, ORA-255, ORA-239, ORA-601, ORA-602, ORA-240, ORA-494, ORA-3137, ORA-202, ORA-214, ORA-227, ORA-1103, ORA-312, ORA-313, ORA-1110, ORA-1542, ORA-32701, ORA-32703 and ORA-12751.
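The routing described in this section (a dedicated metric when the error belongs to a named category, a catch-all metric otherwise, and no alert at all when ADR created neither an incident nor an operational error) can be sketched as a lookup with a fallback. This is a hypothetical Python illustration: in practice ADR assigns the category and EM only reports it, and the CATEGORY_BY_ERROR dictionary holds just a few rows of the table above.

```python
from typing import Optional

# Hypothetical partial lookup copied from a few rows of the table above.
CATEGORY_BY_ERROR = {
    600: "Generic Internal Error",
    7445: "Access Violation",
    4030: "Out of Memory",
    4031: "Out of Memory",
    1242: "Media Failure",
    1243: "Media Failure",
}


def em_metric_for(ora_error: int, adr_result: Optional[str]) -> Optional[str]:
    """Name the new EM metric that would alert, or None if none would.

    adr_result is "incident", "operational" or None, standing in for what
    ADR recorded in log.xml for this occurrence of the error.
    """
    if adr_result is None:
        # Neither an incident nor an operational error: no EM alert,
        # even though the text still appears in the alert log.
        return None
    if ora_error in CATEGORY_BY_ERROR:
        return CATEGORY_BY_ERROR[ora_error]
    if adr_result == "incident":
        return "Generic Incident"
    return "Generic Operational Error"
```

For example, em_metric_for(60, None) returns None, which matches the note that ORA-0060 does not raise an incident, while an uncategorized error that does create an incident falls through to "Generic Incident".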

NOTE 1: The following errors raise EM alerts under the 'Generic Operational Error' category when they occur in log.xml. They are listed here for clarity, because they cannot be referred to by simple ORA- numbers:
1) "Checker run found %d new persistent data failures"
2) "ASM Health Checker found %d new failures"
3) "ERROR: ORA-48178 encountered when checking if the process is able to "
   "ERROR: create the ADR schema in the specified ADR Base directory [%s]"
   "ERROR: The ORA-48178 error is caused by the ORA-%d error. \n%s"
   "ERROR: Check if the directory is readable and check if the OS version is supported for ADR."
   "ERROR: The process will switch back to the pre-ADR method of tracing and logging."
4) "dbgtfRecWriteFailure: max write recursion depth, %u, reached"
   "current file name: %.*s\n"
   "default file name: %.*s\n"
5) "Process 0x%p appears to be hung in Auto SQL Tuning task"
   "Current time = %u, process death time = %u\n"
   "Attempting to kill process 0x%p with OS pid = %s\n"
   "OSD kill skipped for process %p\n"
   "OSD kill succeeded for process %p\n"
   "OSD kill failed for process %p\n"
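Because these messages carry no ORA- number, anything that needs to recognize them has to match their text. A minimal Python sketch, assuming the templates use C printf-style conversions (%d, %u, %s, %p, %.*s) as they appear in the note above; the conversion-to-regex mapping and the helper names are assumptions made for illustration, not how ADR or the Agent works internally.

```python
import re

# Assumed mapping from printf-style conversions to regular expressions.
_CONVERSIONS = {
    "d": r"-?\d+",                # signed decimal
    "u": r"\d+",                  # unsigned decimal
    "s": r".*?",                  # string, also used for %.*s
    "p": r"(?:0x)?[0-9a-fA-F]+",  # pointer, printed in hex
}
_SPEC = re.compile(r"%(\.\*)?([dusp])")


def template_to_regex(template):
    """Compile a message template into a regex; literal text is escaped."""
    parts, pos = [], 0
    for spec in _SPEC.finditer(template):
        parts.append(re.escape(template[pos:spec.start()]))
        parts.append(_CONVERSIONS[spec.group(2)])
        pos = spec.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


# First lines of some of the messages listed in the note above.
TEMPLATES = [
    "Checker run found %d new persistent data failures",
    "ASM Health Checker found %d new failures",
    "dbgtfRecWriteFailure: max write recursion depth, %u, reached",
    "Process 0x%p appears to be hung in Auto SQL Tuning task",
]
PATTERNS = [template_to_regex(t) for t in TEMPLATES]


def is_generic_operational_error(text):
    """True if the text contains one of the templated messages."""
    return any(pattern.search(text) for pattern in PATTERNS)
```

Matching on the first line of each multi-line message is usually enough to identify it; the continuation lines ("current file name: ...", "OSD kill ...") only add detail.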

Alert Log Monitoring in previous versions of EM and for previous versions of the DB
The new metrics take effect only from Grid Control version 10.2.0.5.0, and only for 11g database targets monitored by Agents of version 10.2.0.5.0 or later. This means that an 11g DB monitored by a 10.2.0.4.0 Agent still uses the old mechanism.
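The version rule above reduces to three comparisons. A minimal Python sketch with hypothetical function names; it compares dotted versions numerically rather than as strings, because as strings "10.2.0.10" would sort before "10.2.0.5".

```python
def parse_version(text: str) -> tuple:
    """Turn '10.2.0.5.0' into (10, 2, 0, 5, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def uses_new_alert_log_metrics(grid_control: str, agent: str, database: str) -> bool:
    """True when Grid Control, Agent and DB versions all meet the note's minimums."""
    minimum_em = parse_version("10.2.0.5.0")
    return (
        parse_version(grid_control) >= minimum_em
        and parse_version(agent) >= minimum_em
        and parse_version(database) >= (11,)
    )
```

For example, an 11.1.0.7.0 database monitored by a 10.2.0.4.0 Agent gives False, so it stays on the old mechanism.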

Known Issues
Bug 8482764 VIEWING/SEARCHING DATABASE ODL ALERT LOG DISPLAYS NO RESULTS
The new mechanism requires some Java library files to be accessible in the Agent's sysman/jlib directory. Unfortunately, because of Bug 8482764, 10.2.0.5.0 Agents do not have these libraries in the correct path, which makes it impossible to view the alert log contents. Two workarounds have been seen to work for this issue; if one does not work for you, try the other.

Workaround 1: Create a symbolic link in the <AGENT_HOME>/sysman/jlib directory to the Java library ojdl.jar in the diagnostics/lib directory:
cd <AGENT_HOME>/sysman/jlib
ln -s ../../diagnostics/lib/ojdl.jar
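Workaround 1, including the copy fallback for platforms without symbolic links that is described next, could be scripted as follows. This is a hypothetical Python sketch (the function name and return values are invented for illustration); run it as the Agent software owner with your real <AGENT_HOME>.

```python
import os
import shutil
from pathlib import Path


def expose_ojdl(agent_home: str) -> str:
    """Make ojdl.jar visible in <AGENT_HOME>/sysman/jlib (Workaround 1).

    Tries the relative symbolic link first and falls back to copying the
    jar where links cannot be created. Returns "present", "linked" or "copied".
    """
    home = Path(agent_home)
    source = home / "diagnostics" / "lib" / "ojdl.jar"
    target = home / "sysman" / "jlib" / "ojdl.jar"
    if os.path.lexists(target):
        return "present"  # leave any existing file or link alone
    if not source.is_file():
        raise FileNotFoundError(source)
    try:
        # Same relative target as: cd sysman/jlib; ln -s ../../diagnostics/lib/ojdl.jar
        os.symlink(os.path.join("..", "..", "diagnostics", "lib", "ojdl.jar"), target)
        return "linked"
    except (OSError, NotImplementedError):
        # A copy is NOT refreshed by later patches; re-copy after patching.
        shutil.copy2(source, target)
        return "copied"
```

A link is preferred because it always points at the patched jar; a copy has to be refreshed by hand after each patch, as the BEWARE notes explain.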

On platforms where symbolic links are not supported, simply copy ojdl.jar from <AGENT_HOME>/diagnostics/lib into <AGENT_HOME>/sysman/jlib. BEWARE: the copied .jar will not be patched when upgrades/patches are applied later, so the patched .jar will need to be copied to <AGENT_HOME>/sysman/jlib again until Bug 8482764 is fixed.

Workaround 2: Copy the files <DB_HOME>/sysman/jlib/ojdl.jar and ojdl2.jar to <AGENT_HOME>/sysman/jlib. BEWARE: the copied .jars will not be patched when upgrades/patches are applied later, so the patched .jars will need to be copied to <AGENT_HOME>/sysman/jlib again until Bug 8482764 is fixed.

BUG 9174076 search in view alert log contents returns no rows
Because of this bug, searching the alert log returns no results unless the 'Regular Expression' option is used. The bug is still being investigated by Development; until a fix is available, use the 'Regular Expression' option to search the alert log.

Non-incident level Errors do not raise EM alerts

The new mechanism only raises EM alerts when the alert log error is classed as an incident or operational error. This differs from the old mechanism, which raised an EM alert for any string in the alert log that matched the threshold setting. This is being investigated in ER BUG 8930257 OPTION TO RECEIVE OR NOT ALERTS FOR ALL ORA- ERRORS IN LOG XML.

NOTE: On the DB Home page, the 'Diagnostics Summary' section contains an 'Alert Log' field. This field displays 'No ORA- errors' unless an error classed as an incident/operational error is logged. Enhancement Request Bug 9174791 "make diagnostics summary alert log field more meaningful" has been logged to request that this be made clearer.

There is a solution: OMS Patch 8930257 and Agent Patch 8694165 can be applied to re-introduce the 10g-style Alert Log metric.
NOTE 1: A password is required to download each patch. To get the patches, raise an SR via My Oracle Support.
NOTE 2: The re-introduced metrics are disabled out of the box. If you enable them, it is advisable to disable the new metrics to avoid duplicate alerts. The best way to do this is to create a monitoring template that sets the required thresholds and filter expression for the re-introduced metrics and also disables the new metrics.

There is a workaround: from version 11.1.0.7.0, any ORA- error can be configured to raise an incident by setting an event in the database with the following commands:
-- to set the event for the running instance, for the lifetime of the instance only
alter [system|session] set events '[ERRNO] incident([INCIDENTNAME])';
-- to configure the event to be set whenever the DB starts up
alter system set event='[ERRNO] incident([INCIDENTNAME])' scope=spfile;

where ERRNO is the event/error number and INCIDENTNAME is a user-defined name for the incident.

NOTE: The 'alter system set event=(...) SCOPE=SPFILE' syntax replaces all previous EVENT settings in the SPFILE. You need to specify ALL the events that you want set in a single command, separated by colons, e.g.:
-- The following command unsets the event for ORA-942 errors and sets an event for ORA-60 errors
alter system set event='942 trace name context off:60 incident (APPDEADLOCK)' scope=SPFILE;
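Since the SCOPE=SPFILE form overwrites every existing EVENT entry, it helps to keep the complete list of events in one place and generate the statement from it. A minimal Python sketch; the function name is hypothetical, and it only builds the SQL text (it does not connect to a database).

```python
def spfile_event_statement(events):
    """Build one ALTER SYSTEM ... SCOPE=SPFILE statement carrying every event.

    This form replaces ALL previous EVENT settings in the SPFILE, so the
    list must contain every event that should remain set.
    """
    if not events:
        raise ValueError("at least one event is required")
    for event in events:
        # Colons separate events, and a quote would end the SQL literal.
        if ":" in event or "'" in event:
            raise ValueError(f"unexpected ':' or quote in event {event!r}")
    return "alter system set event='{}' scope=SPFILE;".format(":".join(events))
```

Called with ["942 trace name context off", "60 incident (APPDEADLOCK)"], it reproduces the example command above.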

NOTE 2: The 'alter system set events ...' command sets the event for the currently running instance. The 'alter system set event ...' command sets the event in the SPFILE, so that it persists after the database is restarted; it does not set the event in the currently running instance. Both commands therefore need to be run if the intention is to set the event from now on and to keep that setting across database restarts.

When the event occurs, an incident is created with the following problem key, which defines the problem that the incident belongs to:
"ORA 700 [EVENT_CREATED_INCIDENT] [ERRNO] [INCIDENTNAME]"

For example, if you set the following event for ORA-942:
SQL> alter system set events '942 incident(NONEXISTTBV)';
SQL> alter system set event='942 incident(NONEXISTTBV)' scope=SPFILE;
System altered.
SQL> drop table foobar;
drop table foobar
           *
ERROR at line 1:
ORA-00942: table or view does not exist

You will get an incident with the following problem key:
"ORA 700 [EVENT_CREATED_INCIDENT] [942] [NONEXISTTBV]"

To unset the event, run the following commands:
alter [system|session] set events '[ERRNO] trace name context off';
alter system set event='[ERRNO] trace name context off' SCOPE=SPFILE;

e.g.:
SQL> alter system set events '942 trace name context off';
System altered.
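The statement pair and the resulting problem key follow a fixed pattern, so they can be generated and recognized mechanically, for example to pick out event-created incidents in a list of problem keys. A minimal Python sketch with hypothetical function names, based only on the formats shown above.

```python
import re
from typing import List, Optional, Tuple

_PROBLEM_KEY = re.compile(r"ORA 700 \[EVENT_CREATED_INCIDENT\] \[(\d+)\] \[([^\]]+)\]")


def incident_event_statements(errno: int, name: str) -> List[str]:
    """Both statements from NOTE 2: set the event now, and persist it in the SPFILE.

    Caution: the SPFILE statement replaces all other EVENT settings there.
    """
    event = f"{errno} incident({name})"
    return [
        f"alter system set events '{event}';",
        f"alter system set event='{event}' scope=SPFILE;",
    ]


def expected_problem_key(errno: int, name: str) -> str:
    """Problem key of the incident created when the event fires."""
    return f"ORA 700 [EVENT_CREATED_INCIDENT] [{errno}] [{name}]"


def parse_problem_key(key: str) -> Optional[Tuple[int, str]]:
    """Return (errno, incident name) for an event-created key, else None."""
    match = _PROBLEM_KEY.fullmatch(key)
    return (int(match.group(1)), match.group(2)) if match else None
```

With 942 and NONEXISTTBV, the two statements and the problem key match the example session above.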

There is also an alternative workaround: see Note 961682.1 How to - Monitor Non Critical 11g Database Alert Log Errors Using a SQL UDM.

Unpublished Bug 7163411 ADR requires all clients to be the oracle user to get full services
This bug results in errors when clicking on the number next to 'Active Incidents' under 'Diagnostics Summary' on the DB Home page. It is caused by the Agent software owner being different from the DB software owner. See Note 973214.1 Problem - HTTP-404 Error after clicking on the number next to Active incidents on the DB homepage.

Gotchas
If you monitor a database in an ORACLE_HOME that is not owned by the Agent software owner, make sure that the diagnostic_dest directory and the log.xml file itself are accessible to the Agent software owner; otherwise no errors will be reported. This can catch you out if you use a non-standard location and an older version of log.xml exists in the standard location that is accessible to the Agent software owner, because the metric will then not complain.
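A quick way to rule this out is to check, as the Agent software owner, that the log.xml the instance really writes is readable. A minimal Python sketch; the function name is hypothetical, and the directory layout in the docstring is the usual ADR layout, stated as an assumption. Pointing it at the directory under the instance's actual diagnostic_dest, rather than the default location, is what catches the stale-log.xml case described above.

```python
import os
from pathlib import Path


def check_alert_dir(alert_dir: str) -> str:
    """Report whether log.xml in alert_dir is readable by the current user.

    Run it as the Agent software owner. alert_dir is assumed to be the ADR
    alert directory under the instance's diagnostic_dest, typically
    <diagnostic_dest>/diag/rdbms/<db_name>/<instance>/alert.
    """
    directory = Path(alert_dir)
    log_xml = directory / "log.xml"
    if not os.access(directory, os.R_OK | os.X_OK):
        return f"directory not accessible: {directory}"
    if not log_xml.is_file():
        return f"log.xml missing: {log_xml}"
    if not os.access(log_xml, os.R_OK):
        return f"log.xml not readable: {log_xml}"
    return f"ok: {log_xml}"
```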

FAQ
Question: What does the threshold ".*" refer to in Metrics and Policy Settings for the new alert-log-related metrics?
Answer: ".*" is a regular expression meaning 'everything'. It is compared against the problem key returned by the metric and, by default, raises an EM alert for all (hence ".*") alerts that are returned. If you are not getting alerts that you expect, verify that valid thresholds are defined. As a sanity check, reset the thresholds to their default value ".*" and check whether an alert is raised with the default thresholds in place.

Question: If ".*" matches everything, how do the metrics limit themselves to "ORA-" errors rather than any text?
Answer: This metric works differently from the old one. Previously, the threshold value was used to pull errors from the alert log: every line in the alert log was compared against the thresholds to decide whether an EM alert should be raised, so the old mechanism blindly raised alerts whenever the alert log contained text that matched the threshold. The new mechanism has already pulled the incident from the alert log and knows that it is an error of a certain category. The thresholds are therefore compared against something already known to be an error/incident, as opposed to other alert log content. In this sense the thresholds work more like the Alert Log Filter of the previous mechanism, which has now been removed.

Question: The new metrics have an option for specifying multiple monitored objects via Metrics and Policy Settings. What does the time/line number refer to, and what is the comparison syntax?
Answer: The ability to add a monitored object for these metrics is simply a consequence of their being displayed in a generic interface shared by all metrics. It does not make sense to create monitored objects for these metrics, because their key is 'Time/Line Number'. Creating a monitored object would amount to asking for different thresholds for different times and line numbers in the log, which does not make sense.

Question: How can I test that a given metric successfully raises an alert when a given error is raised in the alert log? In the past I used sys.dbms_system.ksdwrt (see Note 850320.1).
Answer: It is not possible to use sys.dbms_system.ksdwrt to test this mechanism. Errors written by sys.dbms_system.ksdwrt are not raised as incidents in log.xml.
This is currently being investigated via Bug 10213998 dbms_system.ksdwrt does not write to log.xml, no effective way to simulate errors; please monitor the bug or this note for updates.

Question: Where has the Alert Log Filter Expression gone? How do I ignore specific errors that are known issues?
Answer: The 11g DB target metric was released without an Alert Log Filter Expression. This was a design decision, based on the logic that the metric would only raise alerts for critical errors (incidents), which therefore should not be ignored, combined with the fact that ADR has 'flood' control, which stops repeating alerts from firing. Based on customer feedback, a Filter Expression has been added to the 11g DB target metric via the fix for Bug 9383791 - UNABLE TO FILTER ALERTS FOR ERRORS IN ALERT LOG OF 11G DATABASE TARGETS. Agents that include the fix for Bug 9383791, including 10.2.0.5.0 Agents with Patch 9383791, can have the property 'adrAlertLogErrorCodeExcludeRegex' added to emd.properties, specifying a regular expression in the same format as those used for the 10g Alert Log Metric, e.g.:
adrAlertLogErrorCodeExcludeRegex=.*ORA-0*(54|1142|1146)\D.*

The example, taken from the default value of the 10g Alert Log Filter expression, can be broken down as follows:
  .*              any string
  ORA-            followed by the string 'ORA-'
  0*              followed by zero or more zeros
  (54|1142|1146)  then either 54, 1142 or 1146
  \D              followed by any character other than a digit
  .*              followed by any string
This translates to the following errors: ORA-54, ORA-1142 and ORA-1146.

The Alert Log Filter Expression for 11g database targets is not target-specific: it is configured for all 11g database targets monitored by the Agent, and it cannot be configured via monitoring templates. emcli can be used to deploy the Alert Log Filter Expression to multiple Agents with the set_agent_property verb, for example:
emcli set_agent_property -agent_name="agent.example.com:3872" -name=adrAlertLogErrorCodeExcludeRegex -value=".*ORA-0*(54|1142|1146)\D.*" -new
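To push the property to many Agents at once, the same call can be looped. A minimal Python sketch, assuming emcli is on the PATH, you are already logged in to the OMS, and emcli exits with a non-zero status when a call fails; the helper names are hypothetical. It retries each Agent once, because on a 10.2.0.5 OMS the first attempt can fail with 'Invalid Property Name' even after Patch 8985371 has been applied.

```python
import subprocess
from typing import Callable, Iterable, List

PROPERTY = "adrAlertLogErrorCodeExcludeRegex"


def emcli_argv(agent_name: str, value: str) -> List[str]:
    # One argv element per option, so the regex needs no shell quoting.
    return [
        "emcli", "set_agent_property",
        f"-agent_name={agent_name}",
        f"-name={PROPERTY}",
        f"-value={value}",
        "-new",
    ]


def run_emcli(argv: List[str]) -> subprocess.CompletedProcess:
    return subprocess.run(argv, capture_output=True, text=True)


def deploy_exclude_regex(
    agents: Iterable[str],
    value: str,
    runner: Callable[[List[str]], subprocess.CompletedProcess] = run_emcli,
    attempts: int = 2,
) -> List[str]:
    """Set the property on each Agent; return the Agents that still failed."""
    failed = []
    for agent in agents:
        for _ in range(attempts):
            if runner(emcli_argv(agent, value)).returncode == 0:
                break
        else:
            failed.append(agent)
    return failed
```

Passing the arguments as a list sidesteps shell quoting of the backslash in \D; the runner parameter exists so the loop can be tried out without a live OMS.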

Note: The above emcli command will not work unless OMS Patch 8985371 is applied. This patch enables emcli to set the values of Agent properties that do not yet exist in the Agent's emd.properties, by specifying the '-new' flag. Patch 8985371 is available for the 10.2.0.5.0 OMS and is not required for the 11g OMS. Even after applying the patch to a 10.2.0.5 OMS, you may still get the following error on the first attempt to set the property:
error 'Invalid Property Name - adrAlertLogErrorCodeExcludeRegex'

In that case, run the command once again and it should succeed on the second attempt. This behavior is being investigated via an unpublished Bug. This Note will be updated as that investigation progresses.
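Before deploying an exclusion expression such as the adrAlertLogErrorCodeExcludeRegex example above, it can be checked offline against sample lines. A minimal Python sketch: it assumes the Agent matches the expression against the whole string, as Java's String.matches() does, and it feeds in whole alert log lines, although this note does not say exactly which string the Agent tests. For this particular expression, Python's re syntax behaves the same as Java's.

```python
import re

# The default 10g filter expression, as used for adrAlertLogErrorCodeExcludeRegex.
EXCLUDE = re.compile(r".*ORA-0*(54|1142|1146)\D.*")


def is_excluded(line: str) -> bool:
    """True when the whole line matches, like Java's String.matches()."""
    return EXCLUDE.fullmatch(line) is not None


if __name__ == "__main__":
    for line in (
        "ORA-00054: resource busy",
        "ORA-01142: hypothetical message text",
        "ORA-00541: hypothetical message text",
        "ORA-11420: hypothetical message text",
        "ORA-00054",
    ):
        print(is_excluded(line), line)
```

The last sample shows a side effect of \D: it needs one more character after the number, so a string ending in exactly "ORA-00054" is not excluded, while ORA-541 and ORA-11420 are correctly left alone.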

References
BUG:11728205 - ARCHIVER HUNG ERROR NOT TRAPPED BY ADR ALERT LOG METRIC
BUG:8482764 - VIEWING/SEARCHING DATABASE ODL ALERT LOG DISPLAYS NO RESULTS
BUG:9174076 - SEARCH IN VIEW ALERT LOG CONTENTS RETURNS NO ROWS
BUG:9174791 - MAKE DIAGNOSTICS SUMMARY ALERT LOG FIELD MORE MEANINGFUL
BUG:9383791 - UNABLE TO FILTER ALERTS FOR ERRORS IN ALERT LOG OF 11G DATABASE TARGETS
NOTE:961682.1 - How to - Monitor Non Critical 11g Database Alert Log Errors Using a SQL UDM
NOTE:973214.1 - Problem - HTTP-404 Error after clicking on the number next to Active incidents on the DB homepage
NOTE:976982.1 - Monitoring 10g Database Alert Log Errors in Enterprise Manager
PATCH:9383791 - UNABLE TO FILTER ALERTS FOR ERRORS IN ALERT LOG OF 11G DATABASE TARGETS

Related

Products:
Enterprise Management > Enterprise Manager Products > Enterprise Manager > Enterprise Manager Base Platform
Enterprise Management > Enterprise Manager Products > Managing Databases using Enterprise Manager > Enterprise Manager for Oracle Database
Keywords: ADR; CONTROL; ENTERPRISE MANAGER; GRID CONTROL; LOG FILTER; METRIC

Copyright (c) 2007, 2010, Oracle. All rights reserved.