ZTC Endurance ZEN Availability Whitepaper
Hyper-Threaded Intel Xeon Gold processors (providing 96 threads, or vCPUs) and will include either 256 GB, 512 GB, or 1024 GB of memory. The high-end model will be roughly suitable for 40+ medium-sized virtual machines or applications.

This family of fault-tolerant platforms is another advance by Stratus in providing simple, protected, and autonomous computing.

Simple

Stratus computing platforms are designed to be simple to use, to install, and to maintain. For rapid time-to-value and simple management, including by non-technical staff, Stratus computing solutions are easy to install, deploy, and manage across applications and infrastructure with zero-touch operation. Stratus computing platforms support both bare metal and virtualized architectures to provide quick application deployment, to offer flexibility, to maximize computing resources, and to lower the total cost of ownership of the system.

Easy To Use: The Stratus ztC Endurance platform has been designed for simplicity and ease of use for both IT and OT personnel. Features such as automatic deployment scripts, a simple user interface, industry-standard interfaces for remote monitoring and management, support for standard off-the-shelf operating systems and hypervisors, automatic local and remote notification capabilities, and hot-swappable plug-and-play components ensure that a ztC Endurance system is easy to deploy, easy to configure, easy to operate, easy to monitor, and easy to service.

Modular / Serviceable: The Stratus ztC Endurance system design builds upon the redundant, hot-swappable Customer Replaceable Units (CRUs) offered in previous Stratus computing platforms, but expands that concept into an even more modular, more serviceable design.

A single ztC Endurance system chassis includes eight CRU modules:

2 x Compute Modules
2 x Storage Modules
2 x I/O Modules
2 x Power Supply Units (PSUs)

Each of these eight CRUs can be independently hot-replaced (i.e., a failed CRU can be hot-removed as indicated by the CRU’s “safe-to-pull” LED and a replacement CRU can be hot-inserted) to restore a ztC Endurance system to a healthy, fully redundant configuration in the event of a hardware failure. This allows for ease of service whereby replacement parts can be dispatched to a site and hot-installed in the system, restoring the system to full health while it is running and with no impact to operations, applications, and data. This also allows for highly granular serviceability whereby only the failed CRU is removed and replaced. This means that any one subsystem (compute, storage, I/O, or power) can be independently serviced without affecting the other subsystems.

Protected

Stratus computing solutions are designed to protect operations, applications, and data. Stratus solutions are redundant and fault-tolerant, and they reduce operational, financial, and reputational risk by ensuring “always on” availability, zero downtime, and zero data loss. The ztC Endurance platform is a redundant, fault-tolerant, hardened, secure system with no single points of failure, ensuring continued operation with no loss of in-flight data if a hardware failure should occur.

Redundant Architecture: A core concept of the ztC Endurance architecture is its fully redundant hardware design, including, for example, mainboards, processors, memory, disk drives, network interfaces, and power supplies. To achieve this redundancy, a ztC Endurance system includes eight CRU modules, as described above, including a pair of identical compute modules, a pair of identical storage modules, a pair of identical I/O modules, and a pair of identical Power Supply Units. Each pair of CRU modules provides redundancy for a ztC Endurance subsystem, so that one CRU in each pair can fail without causing a system outage.
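
The pairwise redundancy described above can be illustrated with a small model. The following is a minimal sketch and not Stratus software: the `Chassis` class, its method names, and the slot numbering are hypothetical and exist only to show why one CRU per pair can fail (and later be hot-replaced) without a system outage.

```python
from dataclasses import dataclass, field

# The four subsystems named in the text; each is backed by a pair of CRUs.
SUBSYSTEMS = ("compute", "storage", "io", "power")


def _healthy_pairs():
    # Every subsystem starts with both CRUs healthy.
    return {name: [True, True] for name in SUBSYSTEMS}


@dataclass
class Chassis:
    """Hypothetical model of a chassis with four redundant CRU pairs."""
    health: dict = field(default_factory=_healthy_pairs)

    def fail(self, subsystem: str, slot: int) -> None:
        # Mark one CRU (slot 0 or 1) of a subsystem as failed.
        self.health[subsystem][slot] = False

    def replace(self, subsystem: str, slot: int) -> None:
        # Model a hot-replacement: the new CRU returns to service.
        self.health[subsystem][slot] = True

    def is_operational(self) -> bool:
        # The system stays up while every pair has at least one healthy CRU.
        return all(any(pair) for pair in self.health.values())

    def is_fully_redundant(self) -> bool:
        # Full health means both CRUs in every pair are healthy.
        return all(all(pair) for pair in self.health.values())


chassis = Chassis()
chassis.fail("storage", 0)
print(chassis.is_operational(), chassis.is_fully_redundant())
chassis.replace("storage", 0)
print(chassis.is_operational(), chassis.is_fully_redundant())
```

In this model a single failure leaves the system operational but no longer fully redundant, and replacing the failed CRU restores full redundancy. An outage would occur only if both CRUs of the same pair failed at the same time.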
Redundancy for the compute modules is provided via an active / standby availability architecture. The active compute module handles all processing while the standby compute module stays ready to be promoted to active status via an automatic compute failover process, called Smart Exchange™ (described below), should the active compute module begin to fail.

Redundancy for the storage modules, I/O modules, and PSUs is provided via an active / active availability architecture whereby both storage modules, both I/O modules, and both PSUs are active and operational in a healthy ztC Endurance system. This active / active redundancy allows the associated subsystem (storage, I/O, or power) to continue operation in a seamless, bumpless fashion (i.e., with no system outage, no downtime, and no failover process required) if one of the redundant modules in that subsystem should fail. This redundancy also allows a failed module to be serviced (removed/replaced) while the healthy active module continues operation without any disruption to the operation of the overall system or to the applications.

Fault-Tolerant Approach: If the Automated Uptime Layer with Smart Exchange™ identifies a hardware failure or predicts a potential hardware failure, the Stratus ztC Endurance platform will utilize its redundant hardware and built-in failover capabilities to automatically take action to avoid a system outage.

[Figure: Identify, Isolate, Service]

The ztC Endurance platform utilizes an “Identify -> Isolate -> Service” approach whereby any internal components that are identified as failed or likely to fail are automatically removed from operation without impacting the compute workload.

Despite the failure (or potential failure) of a component, the applications will continue to run, and the data will continue to be accessible. The failed component can then be serviced via online CRU replacement while the server is running with no interruption to the business operations.

The ztC Endurance platform automatically provides this fault-tolerant protection in an application-transparent fashion. This means that the system’s hardware redundancy and fault-tolerant capabilities are abstracted from an operating system / hypervisor, virtual machine, or application, allowing the ztC Endurance to run standard operating systems / hypervisors and the same applications that would run on a typical commodity server, with no special setup, custom configuration, or code modifications required. This automatically protects the operating system / hypervisor, virtual machines, applications, and data from outages and downtime, and requires no additional work.

SMART Zefr™ memory reduces the memory-related Defective Parts Per Million (DPPM) metric from an industry-standard level of 3,000 DPPM down to 200 DPPM.

Hardened Hardware and Software: The ztC Endurance solution makes use of hardened hardware components where possible. One example of this is the utilization of SMART Zefr memory. Zefr stands for Zero Failure Rate. All memory modules utilized in ztC Endurance systems will undergo a Zefr screening process. The process involves extended-runtime testing of each memory module on a server-class motherboard at elevated operational temperature with high-speed data exchange driven by demanding test scripts. This screening process filters out potentially weak memory modules and reduces the memory-related Defective Parts Per Million (DPPM) metric from an industry-standard level of 3,000 DPPM down to 200 DPPM. This dramatically increases the reliability of the memory modules and further increases the overall reliability and availability of the ztC Endurance platform.
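
To put the Zefr screening figures quoted above (3,000 DPPM versus 200 DPPM) in perspective, the following sketch estimates the chance that a system contains at least one defective memory module. It rests on two assumptions that are not in the source: that DPPM can be read as an independent per-module defect probability, and that a system holds 16 memory modules (`MODULES_PER_SYSTEM` is a hypothetical figure chosen only for illustration).

```python
def defect_probability(dppm: float) -> float:
    # Convert Defective Parts Per Million into a per-module probability.
    return dppm / 1_000_000


def prob_any_defective(dppm: float, modules: int) -> float:
    # Probability that at least one of `modules` independent modules
    # is defective: 1 - P(all modules are good).
    p = defect_probability(dppm)
    return 1 - (1 - p) ** modules


# Hypothetical module count, not taken from the whitepaper.
MODULES_PER_SYSTEM = 16

baseline = prob_any_defective(3000, MODULES_PER_SYSTEM)
screened = prob_any_defective(200, MODULES_PER_SYSTEM)
reduction_factor = 3000 / 200

print(f"industry-standard memory: {baseline:.2%} chance of a defective module")
print(f"screened memory:          {screened:.2%} chance of a defective module")
print(f"per-module defect rate reduced by a factor of {reduction_factor:.0f}")
```

Under these assumptions the per-module defect rate drops by a factor of 15, and the system-level probability falls by a similar factor because both probabilities are small.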
For more information on Stratus process security and on the security of the Stratus ztC Endurance platform, please download the Stratus Product Security Whitepaper.

Self-Healing: As described above, the Stratus ztC Endurance platform will automatically take action, with no user intervention required, to ensure continued operation of the system if a component should fail or if a component is predicted to fail. To achieve this, the system utilizes built-in health monitoring and predictive analysis capabilities and will leverage hardware redundancy and online failover functionality to continue operating through component failures with zero downtime or data loss.

But in the case of an infrequent transient error, the failed component or module may be reset and returned to service to self-heal the system.

Additionally, ztC Endurance’s self-healing capabilities assist the operator if replacement of a module is ever required. No additional operator action or intervention is required beyond physically plugging the replacement module into the ztC Endurance system chassis. There is no need for a keyboard, mouse, or monitor or for any operator actions (such as performing diagnostics, restoring device configurations, flashing firmware, synchronizing data, running scripts, balancing loads, etc.) that are typically associated with component replacement. The ztC Endurance management subsystem will automatically detect the replacement module and will automatically perform any actions required to return the component to service and to return the system to full health. This self-healing capability ensures that simple module replacements can be done by OT staff without requiring support from IT.

Remote Monitoring and Management: The Stratus ztC Endurance management subsystem provides several web-based user interfaces for remote monitoring and management. These web-based user interfaces include a BMC UI (used for system health monitoring, console access, and power control), a management subsystem UI (used for system health monitoring and system management), and an OS/hypervisor UI (used for system management and for configuration of virtual machines, applications, etc.).

Additionally, ztC Endurance provides built-in support for remote monitoring by Stratus. This provides automatic notification of any system issue to Stratus Support via secure internet-based connectivity with no additional implementation required.

Remote Service and Support: The Stratus ztC Endurance platform provides a cloud-based capability for remote access that can enable Stratus Support to connect remotely for diagnostics, troubleshooting, and service. This can be done through secure internet-based connectivity, only if explicitly permitted by the user.

Additionally, Stratus Managed Services provides a wide range of additional remote service and support features, including server management, health monitoring, database administration, reporting, and additional services.

Interoperability: In modern data center and Edge Computing environments, interoperability (the ability of different computerized systems to connect and communicate with one another freely and easily using standard, coordinated methods with minimal restriction and without requiring custom implementation effort and specialized support) has become increasingly important. The Stratus ztC Endurance platform provides broad support for industry-standard protocols, such as SNMP, SMTP, and REST APIs, that can be utilized by standard off-the-shelf system management platforms to enable centralized remote monitoring and management.

The Stratus Availability Architecture

The Stratus ztC Endurance platform’s availability architecture was specifically designed to leverage the increased reliability and intelligence of modern compute hardware components to deliver the highest possible levels of server availability and performance.

The ztC Endurance platform ensures the highest possible availability by utilizing internal health monitoring, component diagnostics, predictive failure analysis, redundancy, and automatic failover capabilities. At the same time, the ztC Endurance system provides maximum compute performance by utilizing the advanced features and full functionality of modern hardware components, such as the fourth-generation Intel® Xeon® processors, while shielding the system from spurious alarms, false data divergences, and unnecessary hardware recovery / resync cycles resulting from common, transient, correctable errors.

The overall uptime percentage of a ztC Endurance system is 7 nines (or 99.99999%). This uptime percentage model equates to an expected average system downtime of approximately 3.15 seconds per year. The availability of ztC Endurance is significantly higher than the expected availability of a typical High Availability (HA) cluster system with an uptime percentage of 99.95%, or an expected average downtime of 4 hours and 23 minutes per year.
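
The downtime figures quoted above follow directly from the uptime percentages. The sketch below reproduces the arithmetic, assuming a 365-day year; the function names are illustrative and not part of any Stratus tooling.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year


def annual_downtime_seconds(availability_percent: float) -> float:
    # Expected downtime is the unavailable fraction of the year.
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def to_hours_minutes(seconds: float) -> tuple:
    # Round to the nearest minute, then split into (hours, minutes).
    total_minutes = round(seconds / 60)
    return divmod(total_minutes, 60)


seven_nines = annual_downtime_seconds(99.99999)
ha_cluster = annual_downtime_seconds(99.95)
hours, minutes = to_hours_minutes(ha_cluster)

print(f"99.99999% uptime: about {seven_nines:.2f} seconds of downtime per year")
print(f"99.95% uptime:    about {hours} h {minutes} min of downtime per year")
```

With a 365-day year, 99.99999% uptime works out to roughly 3.15 seconds of downtime per year, and 99.95% uptime to roughly 4 hours and 23 minutes, which matches the figures in the text.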
Specifications and descriptions are summary in nature and subject to change without notice.
Stratus and the Stratus Technologies logo are trademarks or registered trademarks of Stratus Technologies Ireland Limited.
All other marks are the property of their respective owners. ©2023 Stratus Technologies Ireland Limited. All rights reserved.
www.stratus.com