Contents

Introduction
The role of clustering
NonStop systems
A fundamental problem
Background information
Our approach
Self-configuration
Self-optimization
Self-diagnosis
Self-healing
Self-protection
Recent advances
Lessons learned
Looking ahead
Conclusion
For more information
Self-Managing Systems Make Unplanned Downtime History
Abstract: According to Gartner, 40 percent of all system
downtime is caused by operator error. Imagine a computer
system that has virtually zero planned or unplanned
downtime and can expand its capabilities dynamically in
response to an increasing workload. A system that ensures
no single failure will cause denials of service or data
corruption. This may sound impossible, but in fact these
benefits are derived from an industrial-strength
implementation of one fundamental concept across the
computer industry: self-management.
This paper describes a variety of self-management
technologies that have been implemented on HP's NonStop
system. These functions help lower the total cost of
ownership of the NonStop server while continuing to
improve user application availability.
Learn about a new approach to self-management that
encompasses five distinct areas:
Self-configuration
Self-optimization
Self-diagnosis
Self-healing
Self-protection
Introduction
Imagine a computer system that has virtually zero planned or unplanned downtime: a system that
can run for decades without failing once. That's not to say that the individual components in such a
system don't fail. Rather, the system (and its applications) continues to operate reliably, with no loss
of data, through component failure, problem diagnosis, and component replacement.
Further imagine a system that can expand its capabilities dynamically in response to an increasing
workload. As more processors are connected to the system, the workload is distributed to the
additional processors, transparently and automatically, without requiring programming or even
configuration changes. And as processors are removed, the load managers immediately restrict the
workload to the available processors, again without any manual intervention.
And while all this is going on, the system sees to it that no single failure will cause requests for service
to be denied, and no existing data will become corrupted.
Sound impossible? Actually, all of these benefits and others are derived from an industrial-strength
implementation of one fundamental concept that is sweeping the computer industry: clustering.
The role of clustering
In simple terms, clustering is about connecting a group of computers so that they can share
the workload (scalability) and back each other up to hide failures from users (fault tolerance). In
reality, it's fairly straightforward to connect computers together. Where the process gets difficult is in
meeting mutually exclusive goals, such as
Maintaining the ability to scale to hundreds or thousands of CPUs
Verifying that the performance of each individual subsystem remains acceptable
Promoting the integrity of data across the system
Maintaining the manageability of the overall system, so that the system detects changes and
handles them automatically (self-healing), and replicates simple management tasks across different
subsystems
From a high-level perspective, in order to provide the capabilities listed above, there are several
things that clusters need to do extremely well:
Rock-solid messaging, to enable information to be propagated between application instances
without fail
A message-based operating system
Heartbeat mechanisms, to enable various parts of the system to tell other parts of the system about
their operational state. The absence of a heartbeat implies that the originator has encountered
some problem, which needs to be corrected
Application containers, which host multiple application instances across the
cluster, distribute work to the various instances, provide various services to the instances (such as
atomicity, consistency, isolation, and durability, or ACID, transaction services) to protect the
integrity of the database, detect when the workload has grown enough to warrant the
creation of additional instances, manage the life cycle of all application instances, and so on
A flexible, cluster-aware database
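The heartbeat mechanism listed above can be sketched in a few lines. The following Python model is purely illustrative: the class name, timeout policy, and node labels are invented for the example and are not part of the NonStop implementation.

```python
import time

class HeartbeatMonitor:
    """Tracks when each peer was last heard from; a peer that falls
    silent past the timeout is presumed to have encountered a problem.
    Hypothetical sketch, not NonStop code."""

    def __init__(self, timeout_s=2.0):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        # Record an "I'm alive" message from a peer.
        self.last_seen[node] = time.monotonic() if now is None else now

    def failed_nodes(self, now=None):
        # The absence of a heartbeat implies the originator has a
        # problem that needs to be corrected.
        now = time.monotonic() if now is None else now
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.heartbeat("cpu0", now=0.0)
monitor.heartbeat("cpu1", now=0.0)
monitor.heartbeat("cpu0", now=1.5)    # cpu1 goes silent
print(monitor.failed_nodes(now=3.0))  # -> ['cpu1']
```

In a real cluster the monitor's verdict would trigger takeover and rerouting rather than merely reporting the node.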
NonStop systems
There are as many different clustering implementations as there are computer and software vendors.
Only one stands out for the quality of its implementation across functionality, performance,
scalability, fault tolerance, manageability, and data integrity: the HP Integrity NonStop and NonStop
server platforms (subsequently referred to as NonStop systems or NonStop servers).
Figure 1 illustrates the way in which hardware and software fault tolerance and linear scalability are
designed into NonStop systems.
This paper describes a variety of self-management technologies that have been implemented on the
NonStop system. Some of the functionality was designed into the system from the beginning, and
much has been added over the years. This combination of self-management functions helps lower the
total cost of ownership (TCO) of the NonStop server while continuing to improve user application
availability. In this paper, you will find examples of specific self-management techniques that we've
implemented, lessons that we've learned along the way, and discussions about future opportunities to
improve system self-management.
Figure 1. Both hardware and software fault tolerance and linear scalability are designed into the NonStop system together.
A fundamental problem
As systems become more interconnected and diverse, architects
are less able to anticipate and design interactions among
components, leaving such issues to be dealt with at runtime. Soon
systems will become too massive and complex for even the most
skilled system integrators to install, configure, optimize, maintain,
and merge. And there will be no way to make timely, decisive
responses to the rapid stream of changing and conflicting
demands.
IBM manifesto (2001)
Today, the effect of system complexity is most easily measured in the TCO of a system, especially
when measuring all the costs associated with purchasing and operating a customer environment,
including the cost of system downtime. System downtime has a huge effect on TCO, and much of that
cost is directly associated with operator errors. According to Gartner, for example, 40 percent of all
system downtime is caused by operator error. The amount of time it takes for an operator to perform a
task correctly also affects TCO. The simpler the system is to operate, the fewer tasks are required,
and the less time is needed to perform them.
As noted earlier, complexity is directly related to system downtime and the cost of the overall IT
environment. Given that there are no signs of the IT environment becoming simpler, we have designed
the system to hide as much of the complexity as possible by automating as many operational tasks as
possible. Devices and systems manage themselves in order to reduce operator time and operator
errors.
Background information
IBM's gloomy manifesto about the IT industry warned that software complexity was causing a
looming crisis. Specifically, IBM said, the IT industry will collapse under its own weight if it
continues to rely on applications and environments, often with millions of lines of code that require
skilled professionals to get them running and keep them running.
Managing a system is a difficult business, and IBM predicted that even professionals will soon be
unable to keep up with system complexity. Interconnectivity, integration, making different systems
work together as one, and sheer scale all introduce new levels of complexity. In addition, extending
systems beyond company boundaries to the Internet introduces even more complexity. The only option
remaining, IBM suggested, is autonomic computing: systems that, given high-level objectives from
administrators, can manage themselves.
Autonomic computing suggests hierarchies, autonomy, interactivity, and cascading levels of smaller
and smaller systems, each of which can govern itself: in other words, self-management. The idea of
self-management is to free system administrators from the details of system operation and
maintenance and to provide users with a machine that runs at peak performance 24 x 7. Autonomic
systems are expected to maintain and adjust their operation in the face of changing components,
workloads, demands, and external conditions, as well as in the face of hardware or software failures,
whether innocent or malicious.
Figure 2 shows some of the many forms of self-management built into NonStop systems.
Figure 2. Some of the many forms of self-management within NonStop systems.
Our approach
HP's approach to self-management encompasses five distinct areas:
Self-configuration: Automated configuration of components and systems that follow high-level
policies. The rest of the system adjusts automatically and seamlessly.
Self-optimization: Components and systems continually seek to improve their own performance and
efficiency.
Self-diagnosis: The system can detect and diagnose its own problems.
Self-healing: The system automatically repairs localized software and hardware problems,
sometimes also reintegrating the repaired resource back into itself.
Self-protection: The system automatically defends against malicious attacks and cascading failures.
It anticipates and prevents system-wide failures.
The original design goal of the NonStop server was to create a system that could survive single faults
while hiding hardware and software errors from the application to the greatest possible extent. To
achieve this, three basic interdependent techniques were, and continue to be, developed:
Clustering of relatively autonomous processors: A NonStop system consists of two to 16 processors,
configured in a shared-nothing cluster that, in turn, can be aggregated into a two- to 255-way
group of clusters.
System self-management: The system is capable of automated configuration changes, optimization,
diagnosis, repair, and protection. Such capabilities are natural for a system that is designed from
the ground up to be tolerant of single faults.
Resource virtualization: From an application perspective, every resource in the NonStop cluster is
virtualized: no aspects of software or hardware redundancy are made visible to the application.
Virtualizing resources enables us to provide transparent system self-management that is hidden from
the application.
Figure 3. Backed-up data is restored automatically upon corruption.
In other words, to deliver the highest possible system availability, a clustered solution is needed.
Furthermore, to deliver such a solution at the lowest possible TCO, self-management techniques are
needed. To deliver transparent system self-management, the cluster's resources need to be virtualized
from an application perspective, thereby allowing automatic changes to the computing environment
without forcing the application to implement its own self-management techniques, too.
We have developed many self-management technologies over more than 25 years and continue to
make advances in all aspects of this self-management approach: self-configuration, self-optimization,
self-diagnosis, self-healing, and self-protection.
Figure 3 shows one aspect of the self-healing capability of NonStop systems.
Self-configuration
Automated system reconfiguration on expansion or reduction (system resizing): processors and
enclosures can be added to and removed from the system online, with the system adjusting its
configuration automatically. Also, switches and clusters can be added to the group of clusters.
Automated configuration of controllers and disk drives: a controller of a disk drive added to the
system is automatically configured and started. In the case of a host-based mirror (NonStop systems
use host-based mirroring for disks), the online data copy is started automatically when a mirrored
drive is configured.
In figure 3 step 1, scanning software detects bad data. In step 2, auto-repair deletes the bad data. In
step 3, good data is copied to the location of bad data. In step 4, data is back in sync and in a good
state. The application does not need to do anything for this process to occur.
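The four repair steps can be modeled with a short sketch. The CRC-based check, the block layout, and the function name below are assumptions made for illustration; the system's actual scrubbing and repair logic is internal and not published.

```python
import zlib

def scrub_and_repair(primary, mirror):
    """Each entry is (data, crc) as originally written. The scrubber
    re-reads each block and recomputes its CRC (step 1); where the CRC
    no longer matches, the block is replaced with the mirror's good
    copy (steps 2 and 3), leaving both drives back in sync (step 4).
    Hypothetical sketch of host-based mirror repair."""
    repaired = []
    for i, (data, crc) in enumerate(primary):
        if zlib.crc32(data) != crc:   # step 1: scan detects bad data
            primary[i] = mirror[i]    # steps 2-3: repair from the mirror
            repaired.append(i)
    return repaired

good = lambda d: (d, zlib.crc32(d))
mirror  = [good(b"alpha"), good(b"beta"), good(b"gamma")]
# block 1 of the primary has rotted: its data no longer matches its CRC
primary = [good(b"alpha"), (b"b\x00ta", zlib.crc32(b"beta")), good(b"gamma")]
print(scrub_and_repair(primary, mirror))  # -> [1]
print(primary == mirror)                  # -> True: back in a good state
```

As in the figure, the application takes no part in this: the scrubber runs in the background and the repair is invisible above the data-access layer.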
Figure 4. There is no overhead for synchronizing cache data.
Self-optimization
Mixed workload environment: the NonStop system allows the user to establish workload priorities,
and automatically responds to priority contention to help ensure that low-priority workloadssuch
as decision supportdo not impact higher-priority transaction response time. This design means
that lower-priority workloads can utilize free resources without impacting response time.
Automated workload distribution: the application environment (the middleware) automatically and
dynamically distributes work to server processes, depending on workload and resource availability.
Data on disk are often distributed, and with the Data Access Manager shown in figure 4, the system
automatically self-optimizes, avoiding the overhead of cache synchronization.
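The priority behavior described above can be illustrated with a minimal dispatcher: work is always taken in strict priority order, so low-priority jobs such as decision support consume only capacity that higher-priority transactions leave free. This is an illustrative sketch, not the NonStop workload manager; the class and job names are invented.

```python
import heapq

class PriorityDispatcher:
    """Dispatch work strictly by user-assigned priority; within a
    priority level, jobs run in arrival (FIFO) order."""

    def __init__(self):
        self._q = []
        self._seq = 0  # tie-breaker preserving FIFO order per priority

    def submit(self, priority, job):
        # heapq is a min-heap, so negate priority: larger = more urgent.
        heapq.heappush(self._q, (-priority, self._seq, job))
        self._seq += 1

    def next_job(self):
        return heapq.heappop(self._q)[2] if self._q else None

d = PriorityDispatcher()
d.submit(1, "nightly-report")   # low priority: decision support
d.submit(9, "payment-txn")      # high priority: OLTP transaction
d.submit(9, "balance-query")
print([d.next_job() for _ in range(3)])
# -> ['payment-txn', 'balance-query', 'nightly-report']
```

The report only runs once nothing more urgent is queued, which is the sense in which lower-priority work uses free resources without hurting transaction response time.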
Self-diagnosis
Detection of latent failure: alternate paths to devices and processors are either used in a ping-pong
fashion (for example, the system switches between available paths on a predetermined time
interval) or checked periodically. Data in memory and on disk are checked periodically (this is known
as data scrubbing).
Incident analysis with automated data collection: Based on a highly structured common-event
system, incident analysis software is able to automatically diagnose 91 percent of all hardware
failures with a 94.3 percent level of accuracy. Data needed for problem analysis is collected
automatically and sent to the service organization.
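The ping-pong path-alternation idea can be sketched as follows. The point is that a latent fault on an idle path is found while the other path still works, not at the worst possible moment. Names and the probe interface are assumptions for illustration.

```python
import itertools

class PathProber:
    """Alternate I/O between redundant paths round-robin so that a
    latent failure on either path is exercised and detected early.
    Illustrative sketch of the ping-pong technique described above."""

    def __init__(self, paths):
        self._cycle = itertools.cycle(paths)
        self.healthy = set(paths)

    def issue_io(self, probe):
        # Pick the next path in rotation; `probe` returns True if the
        # path answered. A failed probe marks the path down now, while
        # a working alternate still exists.
        path = next(self._cycle)
        if not probe(path):
            self.healthy.discard(path)
        return path

prober = PathProber(["path-A", "path-B"])
for _ in range(4):
    prober.issue_io(probe=lambda p: p != "path-B")  # path-B has a latent fault
print(sorted(prober.healthy))  # -> ['path-A']
```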
Self-healing
Process pairs and per-processor processes: Based on resource virtualization, many of the system
services are implemented as process pairs (two collaborating processes running in different
processors, where the primary checkpoints its state to the backup) or as per-processor processes. If
one process or its processor fails, the request is automatically rerouted to the remaining process,
thereby fully hiding the processor failure from the application (see figure 5).
Automated data repair: If the disk data-scrubbing software detects a data error, the incorrect data
is repaired from a second data disk (in the case of host-based mirroring; see figure 3).
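The process-pair takeover described above can be sketched in miniature: the primary checkpoints each state change to its backup, so when the primary's processor halts, reads are rerouted and answered from the checkpointed state. This is a hypothetical single-machine model with invented names, not the actual NonStop process-pair protocol.

```python
class ProcessPair:
    """A primary process checkpoints state to a backup running in a
    different processor; on primary failure, requests are rerouted to
    the backup, invisibly to the application."""

    def __init__(self):
        self.primary_state = {}
        self.backup_state = {}
        self.primary_up = True

    def update(self, key, value):
        self.primary_state[key] = value
        self.backup_state[key] = value  # checkpoint to the backup

    def primary_fails(self):
        self.primary_up = False         # simulate a processor halt

    def read(self, key):
        # The reroute is what makes the failure invisible above.
        state = self.primary_state if self.primary_up else self.backup_state
        return state.get(key)

pair = ProcessPair()
pair.update("balance", 100)
pair.primary_fails()
print(pair.read("balance"))  # -> 100, served by the backup
```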
Figure 5. Process pairs take over automatically and retain access to data.
Automated reinstatement of repaired hardware with sanity checks: Repaired hardware inserted
into the system is detected automatically. For some types of hardware, a sanity check is performed
(for example, by sending test packets) before the hardware is fully reinstated.
Self-protection
End-to-end data checksums: All messages in the system are checked from beginning to end (see
figure 6). All system data buffers are protected with buffer tags.
Fail-fast technology: If a checksum error is encountered, the action is retried. If an overwritten buffer
tag is detected when an operating system kernel buffer is deallocated, the processor is halted to
maintain data integrity.
Recent advances
Recent self-management improvements include
A new file-type-dependent checksum technology for data stored on industry-standard
512-byte-per-sector disks (see figure 6).
The NonStop Advanced Architecture, which introduces the next-generation self-checking processor
technology implementing an optional processor triplex. (In this system, each logical processor can
consist of one to three processor elements.)
Disk path probing, that is, periodic checking of alternate paths to a disk drive. This technology,
combined with other implementations of latent-failure technology, allows us to always know that a
component can be replaced or upgraded safely.
Figure 6. As data flows through the system, checksums are continually validated, providing end-to-end data integrity.
Delta-based mirrored-disk copy, which allows us to, for example, gracefully handle the failure of an
enterprise storage box that hosts a whole set of backup logical unit numbers (LUNs): The data-
access manager keeps track of changes that occur to the remaining LUNs and can therefore copy
only the delta once the failed enterprise storage box is restored. Without this technology, it could
take days before the storage subsystem would be restored to full fault tolerance with all of the
backup LUNs brought back up to date. With this technology, this process takes just minutes.
Enhanced background quality scans of data stored on disks allow us to detect and repair latent
failures.
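The delta-based copy above amounts to dirty-region tracking: while the backup is offline, record which blocks changed, then on restore copy only those blocks instead of the whole volume. The sketch below is illustrative, with invented names and an in-memory stand-in for the LUNs.

```python
class DeltaMirror:
    """Track which blocks change while the backup LUN is offline, then
    restore by copying only that delta. Hypothetical sketch of the
    delta-based mirrored-disk copy idea."""

    def __init__(self, nblocks):
        self.primary = [b""] * nblocks
        self.backup = [b""] * nblocks
        self.backup_up = True
        self.dirty = set()

    def write(self, i, data):
        self.primary[i] = data
        if self.backup_up:
            self.backup[i] = data
        else:
            self.dirty.add(i)        # remember the delta, not a full copy

    def restore_backup(self):
        copied = sorted(self.dirty)  # minutes of copying, not days
        for i in copied:
            self.backup[i] = self.primary[i]
        self.dirty.clear()
        self.backup_up = True
        return copied

m = DeltaMirror(nblocks=1000)
m.write(0, b"base")
m.backup_up = False        # the enterprise storage box fails
m.write(5, b"new")
m.write(7, b"new")
print(m.restore_backup())  # -> [5, 7]: only the changed blocks move
print(m.backup == m.primary)  # -> True: full fault tolerance restored
```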
Lessons learned
Self-management is not easy, and it takes time to identify what tasks can be automated. Furthermore,
our experience shows that not all tasks should be automated, and that there are different ways to
provide system self-management. System self-management is a matter of continuous improvement, but if
it is done correctly, much of the complexity of system management can be removed, as is illustrated
by the following examples.
The dependency chain for a disk drive (HP ServerNet adapter, disk controller, disk paths, and disk
drives, that is, in which processors to place the data-access manager, and so on) is quite complex.
Therefore, we decided to create a configuration manager that configures adapters and controllers
automatically, that knows the rules for the dependencies, and that can check them when the disk
configuration is created or changed (all configuration changes may be done online). As a result,
disk-configuration errors are rare to nonexistent.
The following points describe some of the ongoing work that we are doing to improve our self-
managing capability, including some areas we are still learning about:
Wide area network (WAN) subnetworks: In comparison, we initially did not implement the same
level of self-configuration for the WAN subsystem, which forced us to rely on our support resources
to sort out configuration errors. The many configuration problems in the WAN subsystem
eventually led us to add some self-configuring capabilities to it, along with a WAN configuration
wizard. Once implemented, these features helped alleviate the situation, reducing
configuration-related problems significantly.
Automation not needed: When a disk is inserted, it can be automatically configured (using
predefined templates), labeled, started, and, if applicable, an online disk copy can be launched.
This has turned out to be interesting in demonstrations, but we have no evidence that the feature is
used by customers. Furthermore, it has proved impossible to carry the feature forward when moving
to disks that are external to the system (JBOD and enterprise storage).
Self-diagnosis: The self-diagnosing software can detect hardware failures (using events), but
currently cannot always call out the exact failing component in the case of storage. Our objective is
to improve our ability to pinpoint the failing component, whether it is inside a NonStop system
enclosure or in an external subsystem such as a storage array.
Looking ahead
As mentioned earlier, system self-management is a technology that must be improved continually. As
can be seen in figure 7, where one or more applications may be distributed across processors
and across nodes, managing such complex distributed applications calls for continuous improvement
in the techniques described above. Furthermore, the architecture of the NonStop server continues to
evolve, which is especially obvious where specialized devices with their own management
architectures are used for core system functions, such as data storage and networking. The system
itself is both viewed and implemented as a heterogeneous architecture. Thus, we are working on ways
to continue to provide self-management technology without owning all components in the system.
Today, the system already has to handle the management of four different operating environments,
which may increase in the future. Examples include
Enterprise storage server
Database and transaction server
System management console
NonStop operating system and POSIX personalities
HP and third-party developed value-added middleware
In such heterogeneous system designs, a large problem that needs to be solved is how to aggregate
information from a number of different components to pinpoint the source of a problem at the same
level as we're capable of doing today (or at an even higher level). Achieving this goal will
require increased levels of cooperation with other divisions of HP to share information and
technologies in the interest of a common goal for both HP and our customers: simplified system
operation using a combination of self-management and adaptive-management technologies.
Figure 7. Applications can be distributed effortlessly across multiple systems.
Conclusion
In the end, IBM is right: Systems are becoming too complex for humans to manage, and that effect is
quite visible in the TCO of an IT infrastructure. Self-management is a technology that addresses this
problem, but it takes a long time to get right and it must by its very nature always continue to evolve.
The good news is that we have a long history of experience with this technology, experience that
delivers continuous operation of NonStop systems and removes the worry and headaches from the
customer. Removing management complexity through automation reduces cost of ownership,
simplifies the overall task of running a sophisticated system, and results in world-class levels of
availability.
For more information
www.hp.com/go/nonstop
© 2005 Hewlett-Packard Development Company, L.P. The information contained
herein is subject to change without notice. The only warranties for HP products and
services are set forth in the express warranty statements accompanying such
products and services. Nothing herein should be construed as constituting an
additional warranty. HP shall not be liable for technical or editorial errors or
omissions contained herein.
Linux is a U.S. registered trademark of Linus Torvalds.
06/2005