NAGIOS User Manual v1 - 2


NAGIOS User Manual

V1.2

NAGIOS User Manual 1


Table of Contents

NAGIOS Application Overview
NAGIOS Distribution
Types of Service Checks Suitable for Liberate
Liberate Services Monitored by NAGIOS
Logging on to NAGIOS
Navigation
Tactical Overview
Hosts
Services
Host Groups
Service Groups
NAGIOS Documentation
PNP4NAGIOS
    Search Panel
    Actions Panel
    My Basket
    Time ranges
    Services
Appendix A – Host Commands
Appendix B – Service Commands
Glossary

NAGIOS Application Overview

NAGIOS is a computer system and network monitoring software application. It watches hosts and services,
alerting users when things go wrong and again when they get better. NAGIOS provides the ability to handle
and notify of problems within a monitored computer system or network.

NAGIOS Features
Some of the main benefits of using NAGIOS are:

• Monitoring of Network Services (HTTP, PING)
• Monitoring of Host resources (CPU Usage, Memory, Disk Space)
• Monitoring of Apache2, Apache Tomcat, and MySQL (via Plug-ins)
• Ability to schedule Host & Service Downtime
• Automatic Notification of Problems with hosts and services
• Web Interface for monitoring services and hosts
• Extended Web Interface for monitoring Trends in hosts and Services (PNP Plug-in)
• Ability to handle failing services, and attempt to recover failures prior to notification of issues
• Ability to develop custom service checks specific to the Liberate environment, for example the
  current active Liberate sessions

NAGIOS Distribution

The NSCA plug-in allows a distributed network of LOCAL NAGIOS instances to operate with full
functionality of a NAGIOS server (Execute service checks, allow local web-interface and management)
while providing a Central NAGIOS server with all the information required for a central team to monitor the
state of all Distributed location services.

Each Local NAGIOS server will actively check the state of hosts and services within the local (Business
Unit) environment. Via NSCA (send_nsca) this server then passes a “passive” check result to the Central
NAGIOS server.
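
As a rough illustration of what the local server hands to send_nsca, a service check result is a single tab-separated line: host name, service description, return code, and plug-in output. The helper name and the example host and service below are hypothetical and are not taken from the Liberate configuration.

```python
def format_passive_result(host, service, return_code, output):
    """Build one tab-separated service check result line for send_nsca."""
    # Plug-in return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    return f"{host}\t{service}\t{return_code}\t{output}\n"


# Hypothetical example values, for illustration only.
line = format_passive_result("app-server-1", "CPU Usage", 1, "WARNING - CPU usage 88%")
print(repr(line))
```

The resulting line would normally be piped into send_nsca on the local server, which forwards it to the NSCA daemon on the central server.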

This configuration allows local IT teams greater control over monitoring and managing their local resources,
while still providing constant state monitoring to the London (central) server/team.

The main benefit of this configuration is that if a link between London and a remote Business Unit is ‘down’,
the local team can still monitor the state of their infrastructure. With previous configurations, if a link
failed then neither the Business Unit nor the London team could monitor the state of the infrastructure.

This configuration also provides a ‘failsafe’ in case a local NAGIOS server fails or becomes unavailable. If
a ‘passive’ check result is not received from the Business Unit NAGIOS server within a given timeframe, the
central NAGIOS server will actively try to perform the host checks for any result not received.
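
The failsafe can be pictured as a staleness test on the last passive result. The sketch below only illustrates that decision; the function names and the 300-second threshold are assumptions, and on a real server this behaviour is set up through NAGIOS freshness checking rather than custom code.

```python
import time


def is_stale(last_result_time, freshness_threshold, now=None):
    """Return True if the last passive result is older than the threshold (in seconds)."""
    now = time.time() if now is None else now
    return (now - last_result_time) > freshness_threshold


def next_action(last_result_time, freshness_threshold, now=None):
    """Decide whether the central server should fall back to an active check."""
    if is_stale(last_result_time, freshness_threshold, now):
        return "run_active_check"
    return "wait_for_passive_result"


# Hypothetical values: last result received at t=0, threshold of 300 seconds.
print(next_action(last_result_time=0, freshness_threshold=300, now=301))
```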

Types of Service Checks Suitable for Liberate

This section introduces the types of checks that NAGIOS provides.

Physical Host Checks

The physical host checks provide an insight into the state of the host and its hardware, such as CPU Usage
or Free Disk Space. These checks are executed by way of a daemon called NRPE (NAGIOS Remote Plug-in
Executor) running on the host to be checked. The NAGIOS server sends an NRPE command to the host on port
5666. This command is linked to an executable plug-in / script that runs on the host and returns a result.
The NRPE daemon then returns this result to the NAGIOS server.
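
The request/response flow above can be sketched as follows. The sketch assumes the check_nrpe client is on the PATH of the NAGIOS server (its install location varies) and that an NRPE command named check-CPU is defined on the target host; both are assumptions for illustration. The return-code meanings are the standard NAGIOS plug-in convention.

```python
# Standard NAGIOS plug-in return codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_nrpe_command(host, command, port=5666):
    """Build the argument list for a check_nrpe call (-H host, -p port, -c command)."""
    return ["check_nrpe", "-H", host, "-p", str(port), "-c", command]


def describe_exit_code(code):
    """Translate a plug-in exit code into its state name."""
    return STATES.get(code, "UNKNOWN")


# Hypothetical host address and command name.
print(build_nrpe_command("10.0.0.5", "check-CPU"))
print(describe_exit_code(2))
```

The argument list could be passed to subprocess.run on the NAGIOS server; the exit code of that process is the check result.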

Apache Checks

The Apache checks are executed by the NAGIOS server directly. NAGIOS executes the plug-in
check_apachestatus.pl, providing arguments for the Host IP Address, the port to test (80 by default), as well
as warning and critical thresholds for the number of free slots available to be used by Apache.
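
To illustrate what "free slots" means, the Apache status page exposes a scoreboard string in which "_" marks a worker waiting for a connection and "." marks an open slot with no current process. The sketch below counts both as free, which is an assumption; check_apachestatus.pl may count differently, and the scoreboard string shown is made up.

```python
def free_slot_percentage(scoreboard, free_chars="_."):
    """Return the percentage of scoreboard slots that are free."""
    total = len(scoreboard)
    if total == 0:
        raise ValueError("empty scoreboard")
    free = sum(1 for slot in scoreboard if slot in free_chars)
    return 100.0 * free / total


# Made-up scoreboard: 2 idle workers, 2 busy workers (W), 6 open slots.
print(free_slot_percentage("__WW......"))
```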

Tomcat/JMX Checks

The Apache Tomcat checks connect to the Tomcat Server and retrieve values exposed as MBeans.
These values are then compared to the warning and critical thresholds provided to determine alerting. The
Tomcat/JMX checks are executed by the NAGIOS server directly. A standard Tomcat/JMX check requires
parameters for the host IP address or FQDN, the port on which the Tomcat Server publishes the server
data, the JMX object type, and the attribute to be checked, as well as the warning and critical alert
thresholds.

MySQL Database Checks

The MySQL checks are executed by the NAGIOS server; they require that a user with sufficient privileges
is configured within the MySQL Server to be monitored. These checks connect to the MySQL database (by
default on port 3306) and, once connected, retrieve the required information to be compared to thresholds.
These checks require parameters for the host IP address or FQDN, the user name and password for a user
with access to the database, and the warning and critical levels. If no warning or critical levels are supplied,
the MySQL checks use default parameters.

Webinject Checks

Webinject checks are used to verify the status of the CIS WebServices. These test cases are executed from
the NAGIOS server and post a SOAP request; the returned output is then checked to ensure the result is
correct and that the WebServices are available. These checks communicate with the remote host over
HTTP (port 8080 unless defined differently in the test case configuration file).

Other Checks

In addition to the script/check types above, we have developed a script that extends the functionality of
the Tomcat JMX checks to calculate the total number of users concurrently active across all Liberate
Application Servers. This check takes four parameters: the Liberate application URI (e.g. liberate-11AT), a
text file containing a line-break-separated list of host addresses and JMX ports (format ADDRESS:JMXport),
and the Warning and Critical thresholds for the maximum number of concurrent users the overall system can
support.
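
A minimal sketch of how such a host list could be read and the per-server counts totalled. This is not the actual script: the function names, addresses, ports and session counts are hypothetical, and the JMX query that retrieves each server's active sessions is deliberately left out.

```python
def parse_host_list(text):
    """Parse lines of the form ADDRESS:JMXport into (address, port) tuples."""
    hosts = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        address, _, port = line.rpartition(":")
        if not address or not port.isdigit():
            raise ValueError(f"expected ADDRESS:JMXport, got {line!r}")
        hosts.append((address, int(port)))
    return hosts


def total_sessions(sessions_by_host):
    """Sum the active sessions already retrieved from each application server."""
    return sum(sessions_by_host.values())


# Hypothetical file contents and session counts.
hosts = parse_host_list("10.0.0.1:9004\n10.0.0.2:9004\n")
print(hosts)
print(total_sessions({"10.0.0.1": 120, "10.0.0.2": 95}))
```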

Liberate Services Monitored by NAGIOS

This section discusses the various services monitored by the NAGIOS application for the layers of the
Liberate Application Architecture.

Note: Checks are provided for both Linux and Windows OS Hosts.

Liberate Load Balancer Servers

Liberate uses Apache2 as a load balancer. This allows requests to the different layers of the application to
be distributed across multiple hosts that provide the required service for the appropriate request. For the
Load Balancer Servers, we monitor the following with NAGIOS:

1. Physical Host Health – via NRPE (NAGIOS Remote Plug-in Executor)

   a. CPU Usage (check-CPU) – Provides a check on the CPU Usage Load.
      i. Alerts: Warning if CPU Usage > 85%
      ii. Alerts: Critical if CPU Usage > 95%

   b. Memory Usage (check-memory) – Provides a check on the Memory Usage of the Host. Provides
      alerts if a minimum amount of memory is not available to the system.
      i. Alerts: Warning if Available Memory < 15%
      ii. Alerts: Critical if Available Memory < 5%

   c. Disk Space (Root Drive Space) – Provides a check on the Available Disk Space. Provides
      alerts if a minimum amount of disk space is not available.
      i. Alerts: Warning if Available Disk Space < 15%
      ii. Alerts: Critical if Available Disk Space < 5%

   d. Disk IO (Basic IO Check sda) – Checks the I/O usage of the specified disk, using the iostat
      external program (requires iostat, which is provided by the sysstat package – yum install sysstat).
      i. Alerts are based on transactions per second (tps), kilobytes read from the disk per second
         (KB_read/s) and kilobytes written to the disk per second (KB_written/s).
      ii. Default Warning and Critical values are used by this check (set by the plug-in).

2. Apache Load Balancer Status (Apache Load Balancer Health) – Checks if a connection can be
   made to the Load Balancer's status page, and the number of slots free to process requests.
   i. Alerts: Warning if Available Slots < 15% of Total
   ii. Alerts: Critical if Available Slots < 15% of Total
   iii. Alerts: Unknown if not able to connect to the load balancer status page

3. HTTP connection to the Load Balancer (Check HTTP) – Checks if an HTTP connection is available
   to the load balancer.
   i. Alerts: Warning NONE
   ii. Alerts: Critical if connection is refused or times out.

4. HTTP connection via the Load Balancers to the Liberate application (Check HTTP /
   ‘liberateAppURI’)
   i. Alerts: Warning NONE
   ii. Alerts: Critical if connection is refused or times out.

5. Total Liberate Sessions – This check is not directly related to the load balancer, but has
   been associated with it as a central check. When executed, it uses a text file storing the details of
   all Liberate application servers, retrieves the active sessions for each application server, and then
   calculates the total active sessions.
   i. Alerts: Warning if total active sessions > 85% of the calculated max concurrent user load
   ii. Alerts: Critical if total active sessions > 95% of the calculated max concurrent user load
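
The percentage thresholds listed above all follow one of two patterns: alert when usage rises above a limit, or alert when an available resource falls below a limit. A minimal sketch of both comparisons, using the 85%/95% and 15%/5% values from this section as defaults; the function names are illustrative, and the returned codes follow the standard plug-in convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
def classify_usage(value, maximum, warn_pct=85.0, crit_pct=95.0):
    """Alert when value exceeds a percentage of its maximum (e.g. CPU, sessions)."""
    pct = 100.0 * value / maximum
    if pct > crit_pct:
        return 2  # CRITICAL
    if pct > warn_pct:
        return 1  # WARNING
    return 0  # OK


def classify_available(available_pct, warn_pct=15.0, crit_pct=5.0):
    """Alert when an available resource drops below a percentage (e.g. memory, disk)."""
    if available_pct < crit_pct:
        return 2  # CRITICAL
    if available_pct < warn_pct:
        return 1  # WARNING
    return 0  # OK


# Hypothetical figures: 900 active sessions against a maximum of 1000, 10% disk free.
print(classify_usage(900, 1000))
print(classify_available(10.0))
```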

Liberate Front End (Application) Servers

These are the servers that host the Liberate Web Application. These servers run Apache Tomcat Servers.
NAGIOS provides the functionality to monitor these servers in the following way:

1. Physical Host Health – via NRPE (NAGIOS Remote Plug-in Executor)

   a. CPU Usage (check-CPU) – Provides a check on the CPU Usage Load.
      i. Alerts: Warning if CPU Usage > 85%
      ii. Alerts: Critical if CPU Usage > 95%

   b. Memory Usage (check-memory) – Provides a check on the Memory Usage of the Host. Provides
      alerts if a minimum amount of memory is not available to the system.
      i. Alerts: Warning if Available Memory < 15%
      ii. Alerts: Critical if Available Memory < 5%

   c. Disk Space (Root Drive Space) – Provides a check on the Available Disk Space. Provides
      alerts if a minimum amount of disk space is not available.
      i. Alerts: Warning if Available Disk Space < 15%
      ii. Alerts: Critical if Available Disk Space < 5%

   d. Disk IO (Basic IO Check sda) – Checks the I/O usage of the specified disk, using the iostat
      external program (requires iostat, which is provided by the sysstat package – yum install sysstat).
      i. Alerts are based on transactions per second (tps), kilobytes read from the disk per second
         (KB_read/s) and kilobytes written to the disk per second (KB_written/s).
      ii. Default Warning and Critical values are used by this check (set by the plug-in).

2. Apache Tomcat Health – via JMX. NAGIOS can retrieve data regarding the state of a Tomcat Server
   via JMX. Each of the checks below has been written to monitor an attribute that often causes
   errors or causes a Tomcat Server to fail.

   a. UNLESS OTHERWISE STATED, all JMX checks:
      i. Alerts: Warning if Attribute > 85% of its Maximum
      ii. Alerts: Critical if Attribute > 95% of its Maximum

   b. Open File Descriptors (Linux Only) – Checks the number of open file descriptors.

   c. HTTP Threads – Checks the number of HTTP Threads in use.

   d. Heap Memory – Checks the usage of the Java Heap Memory.

   e. Non Heap Memory – Checks the usage of the Java Non Heap Memory.

   f. OS Memory – Checks the amount of free memory available to the OS.
      i. Alerts: Warning if Available Memory < 85% of the total Memory
      ii. Alerts: Critical if Available Memory < 95% of the total Memory

   g. Active Liberate Sessions – Checks the active sessions for the Liberate application.
      i. Alerts: Warning if total active sessions > 85% of the calculated max concurrent user load
      ii. Alerts: Critical if total active sessions > 95% of the calculated max concurrent user load

3. HTTP connection to the Tomcat Server (Check HTTP) – Checks if an HTTP connection is available
   to the Tomcat Server.
   i. Alerts: Warning NONE
   ii. Alerts: Critical if connection is refused or times out.

4. HTTP connection to the Liberate application (Check HTTP / ‘liberateAppURI’)
   i. Alerts: Warning NONE
   ii. Alerts: Critical if connection is refused or times out.

Logging on to NAGIOS
The site is protected, so login credentials will be required. The level of information available will be
restricted for each login, for example a LIME regional login will only allow access to monitors on the servers
within the LIME BU, and users such as the Liberate Support team in India will be able to see all monitors.
The Liberate Console is not meant to replace any local monitoring already in place. Installation should have
no impact on the servers.

Liberate Console can be accessed via the web as follows:

1. Navigate to http://nagios.cwigintra.com/nagios/

2. Enter your User Name and Password

The Nagios homepage is displayed:

Navigation

Navigate to different screens in NAGIOS by using the left-hand navigation menu.

Tactical Overview

The Tactical Overview displays a full view of host and service states. It allows users to view an overall
summary of the full monitored system.

The Network Health indicator to the right of this screen provides a quick view of the overall percentage of
services and hosts currently in an OK state.

Hosts
The Hosts screen displays the status of all hosts monitored by NAGIOS. This status is based on a simple
PING check to see if the hosts are alive and active on the network.

Select a Host from the list to view additional information.

Depending on user permissions, the Host Commands are available on the right of the screen (see
Appendix A).

There are a number of options available in the Host Commands menu, but one of the main ones is
Schedule Downtime on all services on this host. This is useful because it allows the local Liberate IT
staff to temporarily suspend checks on services for the duration of maintenance/configuration changes.
Example: A Tomcat server is going to be reconfigured and restarted between 11:00 and 12:00. Scheduling
Downtime will prevent NAGIOS sending Warning and Critical notifications.

The Schedule downtime for all services on this host screen allows for 2 different downtime Type
options:

• Fixed: A start and end time for the downtime are entered.

• Flex: A duration and start time for the downtime are entered. Checks are disabled from the first failed
  check, and re-enabled after the duration.
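
For reference, downtime scheduled through the web form corresponds to a NAGIOS external command, SCHEDULE_HOST_SVC_DOWNTIME. The sketch below formats that command line with the fixed/flexible flag and duration described above. The host name, author, comment and timestamps are hypothetical, and where the line is written (the external command file) depends on the installation.

```python
import time


def schedule_host_svc_downtime(host, start, end, author, comment,
                               fixed=True, duration=0, now=None):
    """Format a SCHEDULE_HOST_SVC_DOWNTIME external command line.

    start and end are Unix timestamps; duration (seconds) applies to flexible downtime.
    """
    now = int(time.time()) if now is None else int(now)
    fixed_flag = 1 if fixed else 0
    trigger_id = 0  # 0 means the downtime is not triggered by another downtime
    return (f"[{now}] SCHEDULE_HOST_SVC_DOWNTIME;{host};{int(start)};{int(end)};"
            f"{fixed_flag};{trigger_id};{int(duration)};{author};{comment}\n")


# Hypothetical one-hour fixed downtime for a Tomcat restart.
print(schedule_host_svc_downtime("tomcat-1", 1000, 4600, "admin", "Tomcat restart", now=900))
```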

All commands have Command Descriptions displayed.

Always add a comment where possible.

Services

The Service screen lists hosts and all services running on each individual host. The following is displayed
for each service:

Status – Displays the status of each service. The status is colour coded: Green is OK, Yellow is a
warning and Red means there is a critical error.

Last Check – Displays the date and time that NAGIOS last checked the service.

Duration – Displays the duration for which a service has been in the current state.

Attempt – Displays the number of times NAGIOS will check a service with a status of Critical
before notifying the contact for that host/service.

Status Information – Displays additional information on the status of the service.

The Service Status Tools in the top right of the screen display up-to-date information on the state of all
services. Click on a specific state to filter services. For example, click on Critical to view all 13 services with
a current state of Critical.

Click a service to view additional information and perform further actions for that service. Once a service
has been selected, a screen for that service is displayed.

The Service Commands on the right of the screen allow for a number of actions to be performed on the
service (see Appendix B).

For example, you can click Send Custom Service Notification if you want to send out a notification to all
contacts for the selected service. An example of a notification would be warning contacts that there will be a
change to a server, or any other information that they need to be aware of.

Host Groups
The Host Groups screen displays the state of hosts. Hosts are sorted into individual host groups. This
allows users to easily identify the different hosts by group, for example Live or Test.

From this view users can see a high-level view of service statuses for each host. For example, Live host
CWP_VMLIBWS1 has:

• 22 Services checked as OK
• 1 Warning Service Check
• 1 Critical Service Check

Click the OK/WARNING/CRITICAL box to view more details on the services which are in an
OK/WARNING/CRITICAL state.

Service Groups
The Service Groups screen displays all services on a host grouped by the type of Service. For example,
services can be grouped by Tomcat checks or host checks.

NAGIOS Documentation

Further information on NAGIOS can be found by clicking Documentation from the left-hand menu.

The NAGIOS Documentation screen is displayed. Click Table of Contents to access the content.

The Table of Contents lists all NAGIOS documentation.

PNP4NAGIOS

PNP4NAGIOS is a NAGIOS add-on which analyses performance data provided by plug-ins and stores
it automatically in RRD databases (Round Robin Databases).
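
Performance data is the part of a plug-in's output that follows the "|" character, with each item in the form label=value[UOM];warn;crit;min;max. The sketch below parses one such item to show what PNP4NAGIOS has to work with. It is a simplified illustration, not PNP4NAGIOS code, and the sample values are made up.

```python
import re

# label=value[UOM], optionally followed by ;warn;crit;min;max
_PERF_RE = re.compile(
    r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;]*)(?:;(?P<rest>.*))?$"
)


def parse_perfdata_item(item):
    """Parse a single performance data item into its fields."""
    match = _PERF_RE.match(item.strip())
    if match is None:
        raise ValueError(f"not a performance data item: {item!r}")
    fields = (match.group("rest") or "").split(";")
    fields += [""] * (4 - len(fields))  # pad missing trailing fields
    warn, crit, minimum, maximum = fields[:4]
    return {
        "label": match.group("label").strip("'"),
        "value": float(match.group("value")),
        "uom": match.group("uom"),
        "warn": warn,
        "crit": crit,
        "min": minimum,
        "max": maximum,
    }


# Made-up sample: 42 HTTP threads in use, thresholds 170/190, range 0-200.
print(parse_perfdata_item("threads=42;170;190;0;200"))
```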

Benefits

The graphs and metrics produced by PNP4NAGIOS allow users to view trends, and view the effect that
different usage periods have on host/service performance.

Using PNP4NAGIOS

PNP4NAGIOS can be opened by clicking the graph icon (Exfoliation skin) or the red dot (standard skin)
displayed next to a service or host on NAGIOS.

Click on the icon to open the PNP4NAGIOS web interface for the required host or service.

The Display

PNP4NAGIOS displays the graphs for whichever host or service is selected. If a host is selected, the display
shows all services for that host using the default time period (4 hours). If a specific service is selected,
graphs for all time periods are displayed.

Graphs

Graphs show the metrics collected for each NAGIOS check.

The image below shows the 4-hour and 25-hour graphs for a JMX Tomcat check, HTTP Threads:

Graph Options

PNP4NAGIOS provides functionality to:

View the recent NAGIOS alerts for the time period on that service/host.

View the NAGIOS availability report for the service/host.

Add the item/graph to the basket for quick viewing of custom graph sets.

Zoom in on the graph.

From the zoomed view the user can click and drag on the graph to zoom in further. The zoomed-in image
is then displayed.

Search Panel

The search panel allows you to search for a host and view the graphs for the selected host.

Actions Panel

This panel contains a number of icons that can be used to perform actions on the PNP configuration.

Create Custom Time Period Panel

Opens a panel to create a custom time period.

This can then be viewed to see the stats for hosts and services for the specific period. For example, if a
peak in network traffic and server usage is expected over a set period, the graphs could be customised to
show that period.

View PDF

Opens a PDF view of the graphs currently displayed. This allows users to print graphs.

View XML

Displays the XML that the displayed page is based on (not very useful).

View PNP Internal Statistics

View graphs on the performance of the PNP system.

View Documentation

Opens the PNP4NAGIOS documentation.

My Basket

My basket contains a list of graphs that have been manually added via the Add to basket icon.

The image below displays a basket with Active sessions for four Application servers:

Click Show basket to view the 4 graphs simultaneously.

Time ranges

The standard time periods for which PNP generates graphs are:

• 4 hours
• 25 hours
• 7 days (One Week)
• One Month
• One Year

By clicking a time range in the list, the user can change the view of all graphs displayed to show the metrics
for the selected time range.

Services

When viewing a host in PNP4NAGIOS, the services list is displayed. The user can choose a specific
service to view more details. By clicking a service the view is reloaded to show the selected service metric
graphs. If the NAGIOS service check provides multiple aspects of performance data all graphs produced by
this service are displayed.

Appendix A – Host Commands

Disable active checks of this host - Stops ‘this’ NAGIOS server actively checking the host

Re-schedule the next check of this host - Allows the user to set the time of the next execution of the
check of this host.

Submit passive check result for this host - Allows the user to submit a check result for the host (NOT
RECOMMENDED).

Stop accepting passive checks for this host - Stops ‘this’ NAGIOS server accepting check results from
other NAGIOS servers for this host.

Stop obsessing over this host - Stops this NAGIOS server sending check results via NSCA to a central
NAGIOS server.

Enable notifications for this host - Enables notifications of hard state changes for this host.

Send custom host notification - Sends a notification to all contacts for the host. This can be used to notify
the contacts of changes to a host, or to schedule maintenance/downtime.

Schedule downtime for this host - Schedule a time period where this host will be offline/unavailable.
During this time checks and notifications will be disabled.

Schedule downtime for all services on this host - Schedule a time period where services on this host will
be unavailable. During this time checks and notifications will be disabled.

Disable notifications for all services on this host - Stops notifications being sent for all services running
on this host.

Enable notifications for all services on this host – Enables NAGIOS to send notifications for hard state
changes for all services running on this host.

Schedule a check of all services on this host - Allows the user to schedule the next execution of all
checks on services running on this host from this NAGIOS server.

Disable checks of all services on this host - Stops this NAGIOS server checking all services defined for
this host.

Enable checks of all services on this host - Starts this NAGIOS server checking all services defined for
this host

Disable event handler for host - Disables any event handlers defined for this host in this NAGIOS Server.

Disable flap detection for this host - Disables flap detection for this host. This means that if the host
starts ‘flapping’ then notifications will not be disabled automatically.

Appendix B – Service Commands

Disable active checks of this service - Stops ‘this’ NAGIOS server actively checking this service

Reschedule - Allows the user to set the time of the next execution of the check of this service

Submit Passive - Allows the user to submit a check result for the service (NOT RECOMMENDED)

Stop Accepting Passive - Stops ‘this’ NAGIOS server accepting check results from other NAGIOS servers
for this service

Start obsessing - This starts ‘this’ NAGIOS server sending check results via NSCA to a central NAGIOS
server

Disable Notifications for this service - Disables notifications of hard state changes for this service

Send Custom service notification for this service - Sends a notification to all contacts for the service.
This can be used to notify the contacts of changes to a service, or of scheduled maintenance/downtime.

Schedule Downtime for this service - Schedule a time period where this service will be
offline/unavailable (during this time checks and notifications will be disabled)

Disable Event handler for this service - Disables any event handlers defined for this service in this
NAGIOS Server.

Disable Flap Detection - Disables flap detection for this service. This means that if the service starts
‘flapping’ then notifications will not be disabled automatically.

Glossary

Acknowledgement: A special comment added to a host/service check. These can be used to notify others
of a known problem that is causing check warning/critical results. For example a host with low memory
warnings could be acknowledged with a comment defining an action to increase the memory for that host.

Active Checks: A service/host check that is actively executed by the NAGIOS Server.

Alert(s): Any host/service state change with warning/critical check result. This is different from a
notification.

Flapping: Flapping occurs when a service or host changes state too frequently, resulting in a storm of
problem and recovery notifications. Flapping can be indicative of configuration problems.

Hard Check State: Occurs for hosts and services in the following situations:
• When a host or service check results in a non-UP or non-OK state and it has been (re)checked the
  number of times specified by the max_check_attempts option in the host or service definition. This
  is a hard error state.
• When a host or service transitions from one hard error state to another error state (e.g. WARNING
  to CRITICAL).
• When a service check results in a non-OK state and its corresponding host is either DOWN or
  UNREACHABLE.
• When a host or service recovers from a hard error state. This is considered to be a hard recovery.
• When a passive host check is received. Passive host checks are treated as HARD unless the
  passive_host_checks_are_soft option is enabled.
The following things occur when hosts or services experience HARD state changes:
• The HARD state is logged.
• Event handlers are executed to handle the HARD state.
• Contacts are notified of the host or service problem or recovery.

Host: A physical server, workstation, device, etc. that resides on your network

Host Group: A group of one or more hosts, used to simplify the display in the Web Interface

Notification: A message sent (via email) to contacts for a specific service/host, in the case where it
reaches a HARD warning or critical state. Notifications can also be enabled to notify when a check recovers
to an OK or UP State.

NRPE (check_nrpe / NRPE daemon): NAGIOS Remote Plug-in Executor – executes plug-ins on a host and
returns a check result to the NAGIOS server.

NSCA (send_nsca / NSCA daemon): NAGIOS Service Check Acceptor – Application to report distributed
monitoring check results to a Central NAGIOS server.

Obsessing (Over host/Service): If enabled this provides the interface for a NAGIOS server to use NSCA
to report Check Results to a central NAGIOS server.

Passive Check: Checks that are executed by another NAGIOS server, and then passed to the NAGIOS
server.

Service: A "service" is something that runs on a host. The term "service" is used very loosely. It can
mean an actual service that runs on the host (POP, SMTP, HTTP, etc.) or some other type of metric
associated with the host (response to a PING, number of logged in users, free disk space, etc.)

Service Group: A service group definition is used to group one or more services together for simplifying
display purposes in the Web Interface.

Soft Check State: Occurs in the following situations:

• When a service or host check results in a non-OK or non-UP state and the check has not
  yet been (re)checked the number of times specified by the max_check_attempts directive in the
  service or host definition. This is called a soft error.
• When a service or host recovers from a soft error. This is considered a soft recovery.
The following things occur when hosts or services experience SOFT state changes:
• The SOFT state is logged.
• Event handlers are executed to handle the SOFT state (if defined).
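
The soft and hard rules above can be sketched as a small state tracker. This is a simplified illustration only: it covers a single service, assumes its host stays UP, ignores passive host checks, and the function name is hypothetical.

```python
def classify_states(results, max_check_attempts):
    """Label each check result in sequence as a SOFT or HARD state."""
    classified = []
    attempt = 0              # consecutive non-OK results since the last OK
    in_hard_problem = False  # True once a hard error state has been reached
    for state in results:
        if state == "OK":
            # Recovering from a soft error is a soft recovery; any other OK is hard.
            soft_recovery = attempt > 0 and not in_hard_problem
            state_type = "SOFT" if soft_recovery else "HARD"
            attempt = 0
            in_hard_problem = False
        elif in_hard_problem:
            # Moving from one hard error state to another error state stays hard.
            state_type = "HARD"
        else:
            attempt += 1
            if attempt >= max_check_attempts:
                state_type = "HARD"
                in_hard_problem = True
            else:
                state_type = "SOFT"
        classified.append((state, state_type))
    return classified


for state, state_type in classify_states(
        ["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK"], max_check_attempts=3):
    print(state, state_type)
```

Contacts would only be notified for the entries labelled HARD that represent a problem or a recovery from one.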
