ZTC Endurance ZEN Availability Whitepaper


Evolutionary Fault Tolerance

A new approach to availability in x86 architectures

Stratus ztC Endurance is an innovative new family of fault-tolerant computing platforms that enable intelligent predictive failover and 99.99999% compute platform availability.

For an Always-On World www.stratus.com


Evolutionary Fault Tolerance: White Paper
Stratus ztC Endurance™ Availability

Executive Summary

This white paper discusses the evolutionary fault tolerance delivered by the availability architecture of the Stratus ztC Endurance™ computing platform. The paper provides a high-level overview of the redundant, fault-tolerant platform and its availability architecture, underscoring ztC Endurance’s unique design. Combined with Stratus’ proven, award-winning service and support, ztC Endurance provides our partners and customers the highest possible levels of uptime, availability, and reliability for mission-critical applications and data.

The Stratus ztC Endurance availability architecture is based on a number of factors and was carefully designed to provide the optimal combination of protection, performance, intelligence, modularity, flexibility, and serviceability for a modern next-generation mission-critical computing system. ztC Endurance utilizes redundant modular hardware, advanced health monitoring and predictive diagnostics, and automatic self-healing and failover capabilities. Additionally, the platform proactively identifies and isolates faults, facilitating system service and repairs through simple online replacements. This identify, isolate, and service availability architecture provides maximum uptime and availability for workloads and ensures that critical applications will continue to run with no downtime or data loss, even if a hardware failure should occur.

The Stratus ztC Endurance Platform

Stratus ztC Endurance is an innovative new family of fault-tolerant computing platforms that enable intelligent predictive failover and 99.99999% compute platform availability. This family of platforms evolves from Stratus’ proven combination of built-in fault tolerance, proactive health monitoring, and unmatched serviceability. The ztC Endurance platform, together with its Automated Uptime Layer with Smart Exchange™, delivers the predictable, protected performance needed by today’s and tomorrow’s data center and edge environments by leveraging Intel RAS capabilities, embedded hardware and software security, and increased manageability and serviceability via a modular architecture.

The Stratus ztC Endurance platform is available in three distinct models:

ztC Endurance 3100 - The entry-level ztC Endurance model is designed to provide affordable performance for critical applications in remote offices, branch offices, or shop floor locations. It is a reliable, fault-tolerant system with a single-socket processor architecture that is roughly suitable for 12 medium-sized virtual machines or applications. This model will offer 1 x 12-core Hyper-Threaded Intel® Xeon® Silver processor (providing 24 threads, or vCPUs) and will include either 64 GB, 128 GB, or 256 GB of memory.

ztC Endurance 5100 - The mid-range ztC Endurance model offers a versatile compute platform for rapidly growing or evolving application requirements in regional offices, remote plants, or regional data centers. It is a reliable, fault-tolerant system that is roughly suitable for 24 medium-sized virtual machines or applications. This model will offer a dual-processor architecture with 2 x 12-core Hyper-Threaded Intel Xeon Silver processors (providing 48 threads, or vCPUs) and will include either 128 GB, 256 GB, or 512 GB of memory. The mid-range ztC Endurance system will provide a balance of application performance and value to meet the demands of mission-critical applications in traditional data centers as well as at the edge.

ztC Endurance 7100 - The high-performance ztC Endurance model provides the highest level of performance for data-intensive / transaction-intensive applications in larger remote plants or corporate data centers and for compute-intensive applications such as AI and ML. This ztC Endurance model offers a dual-processor architecture with 2 x 24-core Hyper-Threaded Intel Xeon Gold processors (providing 96 threads, or vCPUs) and will include either 256 GB, 512 GB, or 1024 GB of memory. The high-end model will be roughly suitable for 40+ medium-sized virtual machines or applications.

This family of fault-tolerant platforms is another advance by Stratus in providing simple, protected, and autonomous computing.

Simple

Stratus computing platforms are designed to be simple to use, to install, and to maintain. For rapid time-to-value and simple management, including by non-technical staff, Stratus computing solutions are easy to install, deploy, and manage across applications and infrastructure with zero-touch operation. Stratus computing platforms support both bare metal and virtualized architectures to provide quick application deployment, to offer flexibility, to maximize computing resources, and to lower the total cost of ownership of the system.

Easy To Use: The Stratus ztC Endurance platform has been designed for simplicity and ease of use for both IT and OT personnel. Features such as automatic deployment scripts, a simple user interface, industry-standard interfaces for remote monitoring and management, support for standard off-the-shelf operating systems and hypervisors, automatic local and remote notification capabilities, and hot-swappable plug-and-play components ensure that a ztC Endurance system is easy to deploy, easy to configure, easy to operate, easy to monitor, and easy to service.

Modular / Serviceable: The Stratus ztC Endurance system design builds upon the redundant, hot-swappable Customer Replaceable Units (CRUs) offered in previous Stratus computing platforms, but expands that concept into an even more modular, more serviceable design. A single ztC Endurance system chassis includes eight CRU modules:

2 x Compute Modules
2 x Storage Modules
2 x I/O Modules
2 x Power Supply Units (PSUs)

Each of these eight CRUs can be independently hot-replaced (i.e., a failed CRU can be hot-removed as indicated by the CRU’s “safe-to-pull” LED and a replacement CRU can be hot-inserted) to restore a ztC Endurance system to a healthy, fully redundant configuration in the event of a hardware failure. This allows for ease of service whereby replacement parts can be dispatched to a site and hot-installed in the system, restoring the system to full health while it is running and with no impact to operations, applications, or data. This also allows for highly granular serviceability whereby only the failed CRU is removed and replaced. This means that any one subsystem (compute, storage, I/O, or power) can be independently serviced without affecting the other subsystems.

Protected

Stratus computing solutions are designed to protect operations, applications, and data. Stratus solutions are redundant and fault-tolerant, and they reduce operational, financial, and reputational risk by ensuring “always on” availability, zero downtime, and zero data loss. The ztC Endurance platform is a redundant, fault-tolerant, hardened, secure system with no single points of failure, ensuring continued operation with no loss of in-flight data if a hardware failure should occur.

Redundant Architecture: A core concept of the ztC Endurance architecture is its fully redundant hardware design, including, for example, mainboards, processors, memory, disk drives, network interfaces, and power supplies. To achieve this redundancy, a ztC Endurance system includes eight CRU modules, as described above: a pair of identical compute modules, a pair of identical storage modules, a pair of identical I/O modules, and a pair of identical Power Supply Units. Each pair of CRU modules provides redundancy for a ztC Endurance subsystem, so that one CRU in each pair can fail without causing a system outage.
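
The pairing rule above can be illustrated with a small model. This is only a sketch for reasoning about the redundancy claim, not Stratus software: the `Chassis` class, its method names, and the subsystem keys are hypothetical names chosen for this example. It encodes the stated rule that the system stays up as long as at least one CRU in every pair is healthy.

```python
from dataclasses import dataclass, field

# The four subsystems named in the paper; each holds a pair of CRUs.
SUBSYSTEMS = ("compute", "storage", "io", "power")


@dataclass
class Chassis:
    """Hypothetical model of an eight-CRU chassis (True = healthy CRU)."""
    crus: dict = field(
        default_factory=lambda: {name: [True, True] for name in SUBSYSTEMS}
    )

    def fail(self, subsystem: str, slot: int) -> None:
        """Mark one CRU of a pair as failed."""
        self.crus[subsystem][slot] = False

    def replace(self, subsystem: str, slot: int) -> None:
        """Model a hot-replacement: the slot becomes healthy again."""
        self.crus[subsystem][slot] = True

    def is_operational(self) -> bool:
        """Up if every subsystem still has at least one healthy CRU."""
        return all(any(pair) for pair in self.crus.values())

    def is_fully_redundant(self) -> bool:
        """Fully redundant only when all eight CRUs are healthy."""
        return all(all(pair) for pair in self.crus.values())
```

In this model, one failure in each of the four pairs still leaves the system operational, while losing both CRUs of any single pair does not. Replacing a failed CRU returns that pair to a redundant state.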

Redundancy for the compute modules is provided via an active / standby availability architecture. The active compute module handles all processing while the standby compute module stays ready to be promoted to active status via an automatic compute failover process, called Smart Exchange™ (described below), should the active compute module begin to fail.

Redundancy for the storage modules, I/O modules, and PSUs is provided via an active / active availability architecture whereby both storage modules, both I/O modules, and both PSUs are active and operational in a healthy ztC Endurance system. This active / active redundancy allows the associated subsystem (storage, I/O, or power) to continue operation in a seamless, bumpless fashion (i.e., with no system outage, no downtime, and no failover process required) if one of the redundant modules in that subsystem should fail. This redundancy also allows a failed module to be serviced (removed/replaced) while the healthy active module continues operation without any disruption to the operation of the overall system or to the applications.

Fault-Tolerant Approach: If the Automated Uptime Layer with Smart Exchange™ identifies a hardware failure or predicts a potential hardware failure, the Stratus ztC Endurance platform will utilize its redundant hardware and built-in failover capabilities to automatically take action to avoid a system outage.

[Figure: Identify, Isolate, Service]

The ztC Endurance platform utilizes an “Identify -> Isolate -> Service” approach whereby any internal components that are identified as failed or likely to fail are automatically removed from operation without impacting the compute workload. Despite the failure (or potential failure) of a component, the applications will continue to run, and the data will continue to be accessible. The failed component can then be serviced via online CRU replacement while the server is running, with no interruption to business operations.

The ztC Endurance platform automatically provides this fault-tolerant protection in an application-transparent fashion. This means that the system’s hardware redundancy and fault-tolerant capabilities are abstracted from the operating system / hypervisor, virtual machines, and applications, allowing ztC Endurance to run standard operating systems / hypervisors and the same applications that would run on a typical commodity server, with no special setup, custom configuration, or code modifications required. This automatically protects the operating system / hypervisor, virtual machines, applications, and data from outages and downtime, and requires no additional work.

Hardened Hardware and Software: The ztC Endurance solution makes use of hardened hardware components where possible. One example of this is the utilization of SMART Zefr™ memory. Zefr stands for Zero Failure Rate. All memory modules utilized in ztC Endurance systems will undergo a Zefr screening process. The process involves extended-runtime testing of each memory module on a server-class motherboard at elevated operational temperature with high-speed data exchange driven by demanding test scripts. This screening process filters out potentially weak memory modules and reduces the memory-related Defective Parts Per Million (DPPM) metric from an industry-standard level of 3,000 DPPM down to 200 DPPM. This dramatically increases the reliability of the memory modules and further increases the overall reliability and availability of the ztC Endurance platform.
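
To give a feel for what the DPPM reduction means at the system level, the following sketch converts a DPPM figure into the chance that a server contains at least one defective memory module. It rests on assumptions that are not in the paper: each module is treated as independently defective with probability DPPM / 1,000,000, and the module count of 16 is a hypothetical value picked for illustration.

```python
def prob_any_defective(dppm: float, modules: int) -> float:
    """Chance that at least one of `modules` parts is defective.

    Assumes independent parts, each defective with probability dppm / 1e6.
    """
    p_defective = dppm / 1_000_000
    return 1 - (1 - p_defective) ** modules


MODULES_PER_SYSTEM = 16  # hypothetical module count, not a ztC Endurance spec

unscreened = prob_any_defective(3_000, MODULES_PER_SYSTEM)  # industry-standard level
screened = prob_any_defective(200, MODULES_PER_SYSTEM)      # after Zefr screening

print(f"unscreened: {unscreened:.2%}  screened: {screened:.2%}")
```

Under these assumptions, the per-module defect rate falls by a factor of 15 (3,000 / 200), and the chance of a system containing any defective module falls by nearly the same factor.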

Additionally, Stratus develops and deploys hardened device drivers to support the robust redundancy and fault tolerance of the ztC Endurance storage and I/O subsystems. These hardened device drivers allow a ztC Endurance platform to continue running and automatically self-heal in the event of a firmware lockup, a driver-related error condition, or a surprise removal or insertion of a storage or I/O hardware component. Competing solutions do not typically include hardened device drivers and cannot typically recover or continue running if a similar error condition should occur.

Secure Operation: Ensuring the security of our computing platforms is of utmost importance to Stratus. Cybersecurity has become increasingly important in the modern computing world due to an increasing reliance on technology, a growing need for interconnectivity and networking between distributed sites, the adoption of remote access technologies, the emergence of IoT, and the convergence of IT and OT. These trends are exposing previously isolated systems to new and existing vulnerabilities and are creating difficulties in securing computing architectures using traditional methods and tools.

Stratus reduces risk by delivering platforms with multi-layered defense-in-depth approaches to minimize viable attack vectors. Stratus focuses on both process security and product security to ensure maximum security.

For more information on Stratus process security and on the security of the Stratus ztC Endurance platform, please download the Stratus Product Security Whitepaper.

Autonomous

Stratus zero-touch computing platforms require zero human intervention for identification and isolation of faults and minimal human intervention for system support, service, maintenance, and repair. Stratus platforms offer “call home” features for remote issue notifications and 24x7x365 support to further minimize any chance of unplanned downtime for equipment and applications. Stratus also supports remote management and monitoring to provide for flexibility in system management and maintenance activities.

Self-Monitoring: The Stratus ztC Endurance platform actively monitors hundreds of internal data points from multiple data sources and continuously analyzes those data points to identify failures or to predict failures before they occur. If a hardware component fails or is about to fail, ztC Endurance will automatically identify the issue via these health monitoring capabilities.

Once a system issue has been identified, ztC Endurance provides both local and remote notifications to ensure that issue resolution can begin immediately. A ztC Endurance system can locally alert the user of a system issue using industry-standard methods and protocols such as e-mail alerts, SNMP, and REST APIs. ztC Endurance can also “call home” to notify Stratus Support of an issue via secure, encrypted communication channels.

Self-Healing: As described above, the Stratus ztC Endurance platform will automatically take action, with no user intervention required, to ensure continued operation of the system if a component should fail or if a component is predicted to fail. To achieve this, the system utilizes built-in health monitoring and predictive analysis capabilities and will leverage hardware redundancy and online failover functionality to continue operating through component failures with zero downtime or data loss.

If a component or module has failed, ztC Endurance will run diagnostics on the failed component or module after it has been taken out of service. The management subsystem will use algorithms to determine whether the error condition was persistent or transient. In the case of a persistent error or a frequent transient error, the component or module would be left out of service and would need to be replaced.
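
The triage rule in this passage can be sketched as a small decision function. This is an illustration only, not the algorithm the management subsystem uses: the function name, the idea of counting transient errors within an observation window, and the threshold of 3 are all assumptions made for the example. It also covers the remaining case, an infrequent transient error, for which the module is reset and returned to service.

```python
from enum import Enum


class Action(Enum):
    LEAVE_OUT_OF_SERVICE = "leave out of service; replace the module"
    RESET_AND_RETURN = "reset the module and return it to service"


def triage(persistent: bool,
           transient_errors_in_window: int,
           frequent_threshold: int = 3) -> Action:
    """Decide what to do with a module that was taken out of service.

    `frequent_threshold` is a hypothetical cut-off: at or above it,
    transient errors count as "frequent".
    """
    if persistent or transient_errors_in_window >= frequent_threshold:
        return Action.LEAVE_OUT_OF_SERVICE
    return Action.RESET_AND_RETURN
```

For example, `triage(False, 1)` yields `Action.RESET_AND_RETURN`, while `triage(True, 0)` and `triage(False, 5)` both yield `Action.LEAVE_OUT_OF_SERVICE`.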

In the case of an infrequent transient error, by contrast, the failed component or module may be reset and returned to service to self-heal the system.

Additionally, ztC Endurance’s self-healing capabilities assist the operator if replacement of a module is ever required. No additional operator action or intervention is required beyond physically plugging the replacement module into the ztC Endurance system chassis. There is no need for a keyboard, mouse, or monitor or for any operator actions (such as performing diagnostics, restoring device configurations, flashing firmware, synchronizing data, running scripts, balancing loads, etc.) that are typically associated with component replacement. The ztC Endurance management subsystem will automatically detect the replacement module and will automatically perform any actions required to return the component to service and to return the system to full health. This self-healing capability ensures that simple module replacements can be done by OT staff without requiring support from IT.

Remote Monitoring and Management: The Stratus ztC Endurance management subsystem provides several web-based user interfaces for remote monitoring and management. These web-based user interfaces include a BMC UI (used for system health monitoring, console access, and power control), a management subsystem UI (used for system health monitoring and system management), and an OS/hypervisor UI (used for system management and for configuration of virtual machines, applications, etc.).

Additionally, ztC Endurance provides built-in support for remote monitoring by Stratus. This provides automatic notification of any system issue to Stratus Support via secure internet-based connectivity with no additional implementation required.

Remote Service and Support: The Stratus ztC Endurance platform provides a cloud-based capability for remote access that can enable Stratus Support to connect remotely for diagnostics, troubleshooting, and service. This is done through secure internet-based connectivity, and only if explicitly permitted by the user.

Additionally, Stratus Managed Services provides a wide range of further remote service and support features, including server management, health monitoring, database administration, reporting, and additional services.

Interoperability: In modern data center and edge computing environments, interoperability (the ability of different computerized systems to connect and communicate with one another freely and easily using standard, coordinated methods, with minimal restriction and without custom implementation effort or specialized support) has become increasingly important. The Stratus ztC Endurance platform provides broad support for industry-standard protocols, such as SNMP, SMTP, and REST APIs, that can be utilized by standard off-the-shelf system management platforms to enable centralized remote monitoring and management.

The Stratus Availability Architecture

The Stratus ztC Endurance platform’s availability architecture was specifically designed to leverage the increased reliability and intelligence of modern compute hardware components to deliver the highest possible levels of server availability and performance.

The ztC Endurance platform ensures the highest possible availability by utilizing internal health monitoring, component diagnostics, predictive failure analysis, redundancy, and automatic failover capabilities. At the same time, the ztC Endurance system provides maximum compute performance by utilizing the advanced features and full functionality of modern hardware components, such as fourth-generation Intel® Xeon® processors, while shielding the system from spurious alarms, false data divergences, and unnecessary hardware recovery / resync cycles resulting from common, transient, correctable errors.

The overall uptime percentage of a ztC Endurance system is 7 nines (or 99.99999%). This uptime percentage equates to an expected average system downtime of less than 3.15 seconds per year. The availability of ztC Endurance is significantly higher than the expected availability of a typical High Availability (HA) cluster system with an uptime percentage of 99.95%, or an expected average downtime of 4 hours and 23 minutes per year.

It is also greater than the expected availability of a conventional standalone server with an uptime percentage of 99%, or an expected average downtime of 87 hours and 40 minutes per year.
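
The downtime figures quoted for the three availability levels follow from simple arithmetic, which the sketch below reproduces. The year length of 365.25 days is an assumption of this example (the paper does not state which convention it uses), and the helper names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length: 31,557,600 s


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected yearly downtime implied by an uptime percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def as_hours_minutes(seconds: float) -> tuple:
    """Round to the nearest minute and split into (hours, minutes)."""
    total_minutes = round(seconds / 60)
    return divmod(total_minutes, 60)


seven_nines = downtime_seconds_per_year(99.99999)              # ztC Endurance
ha_cluster = as_hours_minutes(downtime_seconds_per_year(99.95))  # HA cluster
standalone = as_hours_minutes(downtime_seconds_per_year(99.0))   # single server

print(f"{seven_nines:.2f} s", ha_cluster, standalone)
```

With these assumptions, the HA cluster and standalone-server cases come out at (4, 23) and (87, 40) hours and minutes, matching the figures in the text. The seven-nines case comes out at roughly 3.16 seconds per year (about 3.15 seconds with a 365-day year), in line with the approximately three-second figure the paper quotes.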

ztC Endurance provides an active / standby availability architecture from a compute (i.e., motherboard, processors, and memory) perspective with fully redundant active / active storage (i.e., disk drives), I/O (i.e., network interfaces and PCIe peripherals), and power.

Stratus Automated Uptime Layer with Smart Exchange

The Stratus ztC Endurance platform includes Stratus-provided firmware and software known as the Automated Uptime Layer with Smart Exchange (AUL - Smart Exchange) to support the system’s availability features. Unique in the industry, AUL - Smart Exchange provides reliability and fault-tolerance features for motherboards, processors, memory, buses, storage devices, and I/O devices. AUL - Smart Exchange also simplifies monitoring and management of the system and enables local monitoring and remote service / support.

Service and Support

The redundancy, fault tolerance, and availability architecture of the Stratus ztC Endurance platform ensure that a workload will continue operation with no significant downtime, outage, or interruption in the event of a hardware failure. Ensuring continued operation despite a hardware failure is an essential benefit. It is also critical to identify that an issue has occurred and to resolve the issue quickly, returning the system to a fully healthy, redundant, fault-tolerant mode as soon as possible.

As such, Stratus Customer Support plays a significant role in achieving the highest possible levels of uptime and availability for a ztC Endurance system. In addition to providing proactive monitoring and next-day replacement parts in the event of a hardware failure, Stratus Customer Support makes other value-added services available, including 24x7x365 technical support for all Stratus-provided hardware and software, system software upgrades and patches, root cause failure analysis services, emergency onsite response, vendor collaboration for OEM components, and full support for the operating system / hypervisor.

For additional details on the Stratus ztC Endurance, the Stratus Availability Architecture, or Stratus Customer Support, please contact us to schedule a meeting with a Stratus expert, or reach out to your local representative. You can also visit www.stratus.com to learn more.

About Stratus

For leaders digitally transforming their operations in order to drive predictable, peak performance with minimal risk, Stratus ensures the continuous availability of business-critical applications by delivering zero-touch computing platforms that are simple to deploy and maintain, protected from interruptions and threats, and autonomous. For over 40 years, we have provided reliable and redundant zero-touch computing, enabling global Fortune 500 companies and small-to-medium sized businesses to securely and remotely turn data into actionable intelligence at the Edge, cloud, and data center, driving uptime and efficiency. For more information, please visit www.stratus.com or follow us on Twitter @StratusAlwaysOn and LinkedIn @StratusTechnologies.

Specifications and descriptions are summary in nature and subject to change without notice.

Stratus and the Stratus Technologies logo are trademarks or registered trademarks of Stratus Technologies Ireland Limited.
All other marks are the property of their respective owners. ©2023 Stratus Technologies Ireland Limited. All rights reserved.
www.stratus.com
