HCIP-Storage V5.0 Learning Guide


Huawei Storage Certification Training

HCIP-Storage
Course Notes

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2020. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without
prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their
respective holders.

Notice
The purchased products, services and features are stipulated by the contract made between
Huawei and the customer. All or part of the products, services and features described in this
document may not be within the purchase scope or the usage scope. Unless otherwise specified in
the contract, all statements, information, and recommendations in this document are provided
"AS IS" without warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made
in the preparation of this document to ensure accuracy of the contents, but all statements,
information, and recommendations in this document do not constitute a warranty of any kind,
express or implied.

Huawei Technologies Co., Ltd.


Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129
People's Republic of China
Website: http://e.huawei.com


Huawei Certification System


Huawei Certification follows the "platform + ecosystem" development strategy, which is a new
collaborative architecture of ICT infrastructure based on "Cloud-Pipe-Terminal". Huawei has set up
a complete certification system comprising three categories: ICT infrastructure, Platform and
Service, and ICT vertical. Huawei's technical certification system is the only one in the industry
covering all of these fields.
Huawei offers three levels of certification: Huawei Certified ICT Associate (HCIA), Huawei
Certified ICT Professional (HCIP), and Huawei Certified ICT Expert (HCIE).
Huawei Certified ICT Professional-Storage (HCIP-Storage) is designed for Huawei engineers,
students and ICT industry personnel. HCIP-Storage covers storage system introduction, flash
storage technology and application, distributed storage technology and application, storage design
and implementation, and storage maintenance and troubleshooting.
The HCIP-Storage certificate introduces you to the storage industry and markets, helps you
understand sector innovation, and helps you stand out among your industry peers.

1 Storage System Introduction

1.1 Introduction to All-Flash Storage


1.1.1 Product Positioning
Huawei storage systems can be classified into all-flash storage, hybrid flash storage, and distributed
storage.
Huawei all-flash storage is built on the next-generation Kunpeng hardware platform and uses
SmartMatrix to establish a full-mesh, end-to-end NVMe architecture. It supports multiple advanced
protection technologies such as RAID-TP to tolerate failure of seven out of eight controllers. In
addition, the use of FlashLink and intelligent chips accelerates service processing from end to end.
Huawei hybrid flash storage leverages a brand-new hardware architecture and intelligent processors to
accelerate services. It supports flexible scale-out and balances loads among controllers for hot backup
to ensure system reliability. Storage faults are transparent to hosts. In addition, it converges SAN and
NAS on a unified platform for easy resource management.
Huawei distributed storage integrates HDFS, block, object, and file storage services. It supports
erasure coding and FlashLink and allows coexistence of x86 and Kunpeng platforms. It also provides
service acceleration and intelligent I/O scheduling.
Huawei OceanStor all-flash storage systems are designed for medium- and large-size enterprise
storage environments. The storage systems focus on the core services of enterprise data centers,
virtual data centers, and cloud data centers to meet their requirements for robust reliability, excellent
performance, and high efficiency.
Huawei OceanStor all-flash storage systems leverage a SmartMatrix full-mesh architecture, which
guarantees service continuity in the event that one out of two controller enclosures fails or seven out
of eight controllers fail, meeting the reliability requirements of enterprises' core services. In addition,
OceanStor Dorado 8000 V6 and Dorado 18000 V6 storage systems incorporate AI chips to meet the
requirements of various service applications such as online transaction processing (OLTP), online
analytical processing (OLAP), high-performance computing (HPC), digital media, Internet operations,
centralized storage, backup, disaster recovery, and data migration.
Huawei OceanStor all-flash storage systems provide comprehensive data backup and disaster
recovery solutions to ensure the smooth and secure running of data services. The storage systems also
offer various methods for easy management and convenient local/remote maintenance, remarkably
reducing management and maintenance costs.

1.1.2 Software and Hardware Architectures


First, let's look at the hardware architecture.
Controller enclosure specifications:
Supports 100 V to 240 V AC and 240 V high-voltage DC. The power module must be removed
during BBU replacement; the BBU does not need to be removed during power module replacement.
Port types: 12 Gbit/s SAS, 8 Gbit/s, 16 Gbit/s, and 32 Gbit/s Fibre Channel, GE, 10GE, 25GE, 40GE,
and 100GE. The scale-out interface module can only be installed in slot 2.

 System architecture
Pangea V6 Arm hardware platform, fully autonomous and controllable
Huawei-developed HiSilicon Kunpeng 920 CPU
2 U controller enclosure with integrated disks
The controller enclosure can house 25 x 2.5-inch SAS SSDs or 36 x palm-sized NVMe SSDs.
Two controllers in an enclosure work in active-active mode.
 Disk enclosure
If the controller enclosure uses NVMe SSDs, it must connect to NVMe disk enclosures. If the
controller enclosure uses SAS SSDs, it must connect to SAS disk enclosures.
The disk enclosure (including the entry-level controller enclosure used as a disk enclosure) is
powered on and off with the controller enclosure. The power button on the disk enclosure is
invalid and cannot control disk enclosure power separately.
The smart disk enclosure has Arm CPUs and 8 GB or 16 GB memory, providing computing
capability to offload reconstruction tasks.
Next, let's look at the software architecture. Huawei all-flash storage supports multiple advanced
features, such as HyperSnap, HyperMetro, and SmartQoS. Maintenance terminal software such
as SmartKit and eService can access the storage system through the management network port or
serial port. Application server software such as OceanStor BCManager and UltraPath can access
the storage system through iSCSI or Fibre Channel links.
OceanStor Dorado 8000 V6 and Dorado 18000 V6 storage systems use the SmartMatrix full-
mesh architecture, which leverages a high-speed, fully interconnected passive backplane to
connect to multiple controllers. Interface modules (Fibre Channel and back-end expansion) are
shared by all controllers over the backplane, allowing hosts to access any controller via any port.
The SmartMatrix architecture allows close coordination between controllers, simplifies software
models, and achieves active-active fine-grained balancing, high efficiency, and low latency.
 Front-end full interconnection
The high-end product models of Huawei all-flash storage support front-end interconnect I/O
modules (FIMs), which can be simultaneously accessed by four controllers in a controller
enclosure. Upon reception of host I/Os, the FIM directly distributes the I/Os to appropriate
controllers.
 Full interconnection among controllers
Controllers in a controller enclosure are connected by 100 Gbit/s RDMA links (40 Gbit/s for
OceanStor Dorado 3000 V6) on the backplane.
For scale-out to multiple controller enclosures, any two controllers are directly connected to
avoid data forwarding.
 Back-end full interconnection
Huawei OceanStor Dorado 8000 and 18000 V6 support back-end interconnect I/O modules
(BIMs), which allow a smart disk enclosure to be connected to two controller enclosures and
accessed by eight controllers simultaneously. This technique, together with continuous mirroring,
allows the system to tolerate failure of 7 out of 8 controllers.
Huawei OceanStor Dorado 3000, 5000, and 6000 V6 do not support BIMs. Disk enclosures
connected to OceanStor Dorado 3000, 5000, and 6000 V6 can be accessed by only one controller
enclosure. Continuous mirroring is not supported.
The active-active architecture with multi-level intelligent balancing algorithms balances service
loads and data in the entire storage system. Customers only need to consider the total storage
capacity and performance requirements of the storage system.

LUNs are not owned by any specific controller. LUN data is divided into 64 MB slices. Slices are
distributed to different vNodes (each vNode matches a CPU) based on the hash (LUN ID + LBA)
result.
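To make this concrete, the following minimal Python sketch illustrates the idea of hashing a slice address to a vNode. The 64 MB slice size and the hash(LUN ID + LBA) inputs come from the description above; the helper names and the use of CRC32 as the hash function are illustrative assumptions, not the storage system's actual algorithm.

import zlib

SLICE_SIZE = 64 * 1024 * 1024  # 64 MB slices, as described above

def slice_to_vnode(lun_id: int, lba: int, vnode_count: int) -> int:
    """Map a (LUN ID, LBA) address to a vNode index.

    Illustrative only: CRC32 stands in for the real hash, and the
    modulo step stands in for the real distribution algorithm.
    """
    slice_index = lba // SLICE_SIZE          # which 64 MB slice the LBA falls in
    key = f"{lun_id}:{slice_index}".encode()
    return zlib.crc32(key) % vnode_count     # hash(LUN ID + LBA) -> vNode

# Example: adjacent slices of the same LUN land on different vNodes,
# so no single controller or CPU owns the whole LUN.
for lba in range(0, 4 * SLICE_SIZE, SLICE_SIZE):
    print(lba, "->", slice_to_vnode(lun_id=1, lba=lba, vnode_count=8))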
The balancing algorithms on Huawei all-flash storage include:
 Front-end load balancing
Huawei UltraPath selects proper physical links to send each slice to the corresponding vNode.
The FIMs forward the slices to the corresponding vNodes.
If there is no UltraPath or FIM, the controllers forward I/Os to the corresponding vNodes.
 Global write cache load balancing
Data volumes received by the global write cache are balanced, and data hotspots are evenly
distributed on all vNodes.
 Global storage pool load balancing
Disk utilization, disk service life, data distribution, and hotspot data are evenly distributed.

1.1.3 Key Technologies


We will discuss the key technologies of all-flash storage in terms of high performance, reliability, and
security.
1.1.3.1 High Performance
 I/O acceleration
Huawei OceanStor all-flash storage systems support end-to-end NVMe and provide high-
performance I/O channels.
First, the storage-host network supports NVMe over FC and will evolve to NVMe over RoCE v2.
Second, the network between storage controllers and disk enclosures supports NVMe over RoCE v2.
NVMe provides reliable NVMe commands and data transmission. NVMe over Fabrics extends
NVMe to various storage networks to reduce the overhead for processing storage network
protocol stacks and achieve high concurrency and low latency. Huawei uses self-developed ASIC
interface modules, SSDs, and enclosures for high-speed end-to-end NVMe channels. This takes
full advantage of the unique designs in protocol parsing, I/O forwarding, service priority, and
hardware acceleration.
The Huawei-developed ASIC interface module offloads TCP/IP protocol stack processing, reducing
latency by 50%. It responds to the host directly from its chip to reduce I/O interactions and evenly
distributes I/Os. In addition, it supports lock-free processing with multi-queue polling.
The Huawei-developed ASIC-based SSDs and enclosures prioritize read requests on SSDs for prompt
responses to hosts. Smart disk enclosures have CPUs, memory, and hardware acceleration engines
to offload data reconstruction for lower latency. They also support lock-free processing with
multi-queue polling.
 Protocol offload with DTOE
When a traditional NIC is used, the CPU must spend significant resources processing each MAC
frame and the TCP/IP protocol (checksum and congestion control).
With TOE, the NIC offloads the TCP/IP protocol. The system only processes the actual TCP data
flow. High latency overhead still exists in kernel mode interrupts, locks, system calls, and thread
switching.
DTOE has the following advantages: Each TCP connection has an independent hardware queue
to avoid the lock overhead. The hardware queue is operated in user mode to avoid the context
switching overhead. In addition, the polling mode reduces the latency, and better performance
and reliability can be achieved.

 Intelligent multi-level cache


Data IQ identifies the access frequency of metadata and data, and uses the DRAM cache to
accelerate reads of LUN and pool metadata. Reads of file system metadata and data are
accelerated by using DRAM for the hottest data and the SCM cache for the second-hottest data.
This reduces latency by about 30%.
 SmartCache
Storage systems identify and store hot data to the SmartCache pool to accelerate read requests
and improve overall system performance.
 Round robin scheduling algorithm for metadata
The round robin cache scheduling algorithm improves the metadata hit ratio by 30%. It works as
follows:
Cache resources are managed using cyclic buffers, with an alloc cursor for allocation and a swept
cursor for scanning.
Each subsystem allocates read cache pages at the alloc cursor position. The weights of the cached
pages are set based on their hit ratios.
When cache pool usage triggers background reclamation scanning, the swept cursor traverses the
cached pages and reduces their weights. A page is evicted when its weight reaches 0. (A simplified
sketch of this scheduling follows.)
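The Python sketch below is a toy model of the cursor-and-weight idea just described. Only the two-cursor, weight-decay eviction concept comes from the text; the ring size, the initial weight of 1, and the reclamation trigger are invented for the example.

class CyclicReadCache:
    """Toy model of the round-robin (clock-like) metadata cache described above."""

    def __init__(self, size: int):
        self.pages = [None] * size   # each slot holds (key, weight) or None
        self.index = {}              # key -> slot
        self.alloc = 0               # allocation cursor
        self.swept = 0               # scanning cursor

    def access(self, key):
        if key in self.index:                      # cache hit: raise the page's weight
            slot = self.index[key]
            k, w = self.pages[slot]
            self.pages[slot] = (k, w + 1)
            return True
        self._insert(key)                          # miss: allocate at the alloc cursor
        return False

    def _insert(self, key):
        # If the target slot is occupied, run reclamation sweeps until the
        # occupying page's weight decays to 0 and it is evicted.
        while self.pages[self.alloc] is not None:
            self._sweep_once()
        self.pages[self.alloc] = (key, 1)          # new page starts with weight 1
        self.index[key] = self.alloc
        self.alloc = (self.alloc + 1) % len(self.pages)

    def _sweep_once(self):
        # The swept cursor visits one page and reduces its weight;
        # a page whose weight reaches 0 is evicted.
        entry = self.pages[self.swept]
        if entry is not None:
            k, w = entry
            if w <= 1:
                del self.index[k]
                self.pages[self.swept] = None
            else:
                self.pages[self.swept] = (k, w - 1)
        self.swept = (self.swept + 1) % len(self.pages)

# Example: repeated accesses raise a page's weight so it survives sweeps longer.
cache = CyclicReadCache(size=4)
for key in ["lun-meta-1", "lun-meta-2", "lun-meta-1", "pool-meta", "fs-meta", "lun-meta-1"]:
    print(key, "hit" if cache.access(key) else "miss")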
 File system distribution
When a directory is created, an owner file system partition (FSP) is selected for it. By default, the
ownership of files is the same as that of the directory. The owner FSP of the root directory is
determined by the hash value of the file system ID. The FSP of a directory is determined by
whether the system uses affinity mode or load balancing mode. In affinity mode, the directory is
preferentially allocated to the FSP on the local controller accessed by the client. In load balancing
mode, directories are evenly distributed to FSPs of all controllers based on DHT. Hot files from
large directories can be distributed to FSPs of different vNodes to improve performance.
Affinity mode: When a client accesses a controller through an IP address, the client's directories
and files are processed by that controller locally. Directories and files created by a client are
evenly distributed to the vNodes on the controller connected by the client. Directories and files
from the same IP address are locally processed on the allocated vNode to reduce access across
vNodes or controllers. When the vNodes have a capacity difference greater than 5%, new
directories are allocated to less loaded vNodes for balancing.
Load balancing mode: A client accesses a controller through an IP address, and directories are
balanced among the vNodes of multiple controllers to make full use of the storage system's
performance. Directories created by the client are evenly distributed to multiple vNodes. Files in a
directory are processed by the same vNode as that directory. (A simplified sketch of both placement
modes follows.)
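As a rough illustration of the two placement modes, the sketch below assigns a new directory to a file system partition (FSP) either on the client's local controller (affinity mode) or across all controllers by hash (load balancing mode). The data structures, the way the 5% threshold is checked, and the use of MD5 as the hash are simplified assumptions, not the storage system's actual logic.

import hashlib

def pick_fsp(directory: str, client_controller: int, fsps_by_controller: dict,
             capacity_used: dict, mode: str = "affinity") -> str:
    """Return the owner FSP for a new directory (simplified model)."""
    digest = int(hashlib.md5(directory.encode()).hexdigest(), 16)
    if mode == "affinity":
        # Prefer FSPs on the controller the client is connected to.
        candidates = fsps_by_controller[client_controller]
        usages = [capacity_used[f] for f in candidates]
        # If local usage differs by more than 5%, place the new directory
        # on the least-loaded local FSP for balancing.
        if max(usages) - min(usages) > 0.05:
            return min(candidates, key=lambda f: capacity_used[f])
        return candidates[digest % len(candidates)]
    # Load balancing mode: hash the directory across FSPs of all controllers (DHT-like).
    all_fsps = [f for fsps in fsps_by_controller.values() for f in fsps]
    return all_fsps[digest % len(all_fsps)]

# Example layout: 2 controllers, 2 FSPs each, with per-FSP usage ratios.
fsps = {0: ["fsp-0a", "fsp-0b"], 1: ["fsp-1a", "fsp-1b"]}
usage = {"fsp-0a": 0.30, "fsp-0b": 0.42, "fsp-1a": 0.35, "fsp-1b": 0.33}
print(pick_fsp("/project/dirA", 0, fsps, usage, mode="affinity"))
print(pick_fsp("/project/dirA", 0, fsps, usage, mode="load_balancing"))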
 FlashLink
OceanStor Dorado 8000 and 18000 V6 storage systems take advantage of the flash-dedicated
FlashLink® technology to deliver millions of input/output operations per second (IOPS) while
maintaining consistently low latency.
FlashLink® employs a series of optimizations for flash media. It associates controller CPUs with
SSD CPUs to coordinate SSD algorithms between these CPUs, thereby achieving high system
performance and reliability. The key technologies of FlashLink® include:
1. Multi-core technology
Huawei OceanStor all-flash storage systems use self-developed CPUs. The controller
contains more CPUs and cores than any other controller in the industry. The intelligent
multi-core technology drives linear growth of storage performance with CPUs and cores.
Service processing by vNodes: I/O requests from hosts are distributed to the vNodes based
on the intelligent distribution algorithm and are processed in the vNodes from end to end.
This eliminates the overhead of communication across CPUs and accessing the remote
memory, as well as conflicts between CPUs, allowing performance to increase linearly with
the number of CPUs.
Service grouping: All CPU cores of a vNode are divided into multiple core groups. Each
service group matches a CPU core group. The CPU cores corresponding to a service group
run only the service code of this group, and different service groups do not interfere with
each other. Service groups isolate various services on different cores, preventing CPU
contention and conflicts.
Lock-free: In a service group, each core uses an independent data organization structure to
process service logic. This prevents the CPU cores in a service group from accessing the
same memory structure, and implements lock-free design between CPU cores.
2. Sequential writes of large blocks
Flash chips on SSDs can be erased only a limited number of times. In traditional RAID
overwrite mode, hot data on an SSD is continuously rewritten, and the flash chips that it
maps to wear out quickly.
Huawei OceanStor Dorado V6 supports ROW-based sequential writes of large blocks.
Controllers detect data layouts in Huawei-developed SSDs and aggregate multiple small and
discrete blocks into a large sequential block. Then the large blocks are written into SSDs in
sequence. RAID 5, RAID 6, and RAID-TP perform just one I/O operation and do not
require the usual multiple read and write operations for small and discrete write blocks. In
addition, RAID 5, RAID 6, and RAID-TP deliver similar write performance.
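To illustrate the idea (not the actual controller code), the Python sketch below aggregates small, discrete writes into a large sequential block before issuing a single full-stripe write. The 8 KB grain, the stripe geometry, and the buffer names are assumptions made for the example.

class RowWriteAggregator:
    """Toy redirect-on-write (ROW) aggregator: small random writes are
    buffered and flushed as one large sequential full-stripe write."""

    def __init__(self, stripe_data_blocks: int, block_size: int = 8 * 1024):
        self.stripe_bytes = stripe_data_blocks * block_size
        self.buffer = []          # list of (lun, lba, data) small writes
        self.buffered_bytes = 0
        self.next_offset = 0      # append-only position on the SSD log

    def write(self, lun: int, lba: int, data: bytes):
        self.buffer.append((lun, lba, data))
        self.buffered_bytes += len(data)
        if self.buffered_bytes >= self.stripe_bytes:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One sequential full-stripe write: parity is computed once for the
        # whole stripe, so RAID 5/6/TP need no read-modify-write of old data.
        stripe = b"".join(data for _, _, data in self.buffer)
        print(f"full-stripe write of {len(stripe)} bytes at offset {self.next_offset} "
              f"({len(self.buffer)} small writes aggregated)")
        self.next_offset += len(stripe)
        self.buffer.clear()
        self.buffered_bytes = 0

# Example: 64 random 8 KB writes become a few large sequential writes.
agg = RowWriteAggregator(stripe_data_blocks=16)
for i in range(64):
    agg.write(lun=1, lba=i * 7919, data=b"x" * 8192)
agg.flush()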
3. Hot and cold data separation
The controller works with SSDs to identify hot and cold data in the system, improve
garbage collection efficiency, and reduce the program/erase (P/E) cycles on SSDs to
prolong their service life.
Garbage collection: In an ideal situation, garbage collection would expect all data in a block
to be invalid so that the whole block could be erased without data movement. This would
minimize write amplification.
Multi-streaming: Data with different change frequencies is written to different SSD blocks,
reducing garbage collection.
Separation of user data and metadata: Metadata is frequently modified and is written to
different SSD blocks from user data.
Separation of new data and garbage collection data: Data to be reclaimed by garbage
collection is saved in different SSD blocks from newly written data.
4. I/O priority adjustment
I/O priority adjustment functions like a highway. A highway has normal lanes for general
traffic, but it also has emergency lanes for vehicles which need to travel faster. Similarly,
priority adjustment lowers latency by granting different types of I/Os different priorities by
their SLAs for resources.
5. Smart disk enclosure
The smart disk enclosure is equipped with CPU and memory resources, and can offload
tasks, such as data reconstruction upon a disk failure, from controllers to reduce the
workload on the controllers and eliminate the impact of such tasks on service performance.
Reconstruction process of a common disk enclosure, using RAID 6 (21+2) as an example: If
disk D1 is faulty, the controller must read D2 to D21 and P, and then recalculate D1. A total
of 21 data blocks must be read from disks. The read operations and data reconstruction
consume great CPU resources.
Reconstruction of a smart disk enclosure: The smart disk enclosure receives the
reconstruction request and reads data locally to calculate the parity data. Then, it only needs
to transmit the parity data to the controller. This saves the network bandwidth.

Load sharing of controller tasks: Each smart disk enclosure has two expansion modules with
Kunpeng CPUs and memory resources. The smart disk enclosure takes over some
workloads from the controller enclosure to save controller resources.
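The sketch below illustrates, in simplified single-parity (XOR) terms, why the reconstruction offload described above helps: the enclosure reads the surviving blocks locally and ships only one XOR'ed block to the controller instead of all 21 blocks. Real RAID 6 and RAID-TP use Reed-Solomon style parity, so this is only a model of the data-movement saving, not of the actual parity math.

import os
from functools import reduce

BLOCK = 4 * 1024  # assumed block size for the example

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def rebuild_without_offload(surviving_blocks):
    # Common enclosure: the controller must pull every surviving block
    # over the network and do the XOR itself.
    transferred = len(surviving_blocks) * BLOCK
    return xor_blocks(surviving_blocks), transferred

def rebuild_with_offload(surviving_blocks):
    # Smart enclosure: blocks are read and XOR'ed locally; only the
    # single partial result crosses the network to the controller.
    return xor_blocks(surviving_blocks), BLOCK

# Example with a 21+1 XOR group (a stand-in for the 21+2 RAID 6 case above).
data = [os.urandom(BLOCK) for _ in range(21)]
parity = xor_blocks(data)
surviving = data[1:] + [parity]          # D1 is lost; D2..D21 and P survive

lost_a, bytes_a = rebuild_without_offload(surviving)
lost_b, bytes_b = rebuild_with_offload(surviving)
assert lost_a == lost_b == data[0]       # both paths recover D1
print(f"network bytes: {bytes_a} without offload vs {bytes_b} with offload")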
6. AI
Huawei OceanStor all-flash storage systems use the Ascend 310 AI chip to boost computing
power and accelerate services. Ascend 310 is a highly efficient, flexible, and programmable
AI processor that provides multiple data precisions for different devices and supports both
training and inference. It balances AI computing power and energy efficiency and can
analyze data access frequencies, including cold and hot data, health status, and data
association. The intelligent analysis performed by this AI chip enables functions such as
intelligent cache, intelligent QoS, and intelligent deduplication.
1.1.3.2 High Reliability
Next, let's look at high reliability technologies. OceanStor Dorado 8000 and 18000 V6 offer
protection measures against component and power failures, and use advanced technologies to
minimize risks of disk failures and data loss, ensuring system reliability. In addition, the storage
systems provide multiple advanced protection technologies to protect data against catastrophic
disasters and ensure continuous system running.
 High availability architecture
Tolerating simultaneous failure of two controllers: The global cache provides three cache copies
across controller enclosures. If two controllers fail simultaneously, at least one cache copy is
available. A single controller enclosure can tolerate simultaneous failure of two controllers with
the three-copy mechanism.
Tolerating failure of a controller enclosure: The global cache provides three cache copies across
controller enclosures. A smart disk enclosure connects to 8 controllers (in 2 controller
enclosures). If a controller enclosure fails, at least one cache copy is available.
Tolerating successive failure of 7 out of 8 controllers: The global cache provides continuous
mirroring to tolerate successive failure of 7 out of 8 controllers (on 2 controller enclosures).
 Zero interruption upon controller failure
The front-end ports are the same as common Ethernet ports. Each physical port provides one host
connection and has one MAC address.
Local logical interfaces (LIFs) are created for internal links. Four internal links connect to all
controllers in an enclosure. Each controller has a local LIF.
IP addresses are configured on the LIFs of the controllers. The host establishes IP connections
with the LIFs.
If the LIF goes down upon a controller failure, the IP address automatically fails over to the LIF
of another controller.
 Non-disruptive upgrade with a single link
The key time metrics are as follows: I/O process upgrade time < 1.5s; host reconnection time < 3.5s;
total service suspension time < 5s.
 SMB advanced features
Server Message Block (SMB) is a protocol used for network file access. It allows a local PC to
access files and request services on remote PCs over the local area network (LAN). CIFS is a public
version of SMB.

Protocol | File Handle | Usage
SMB 2.0 | Durable handle | Used to prevent intermittent link disconnection
SMB 3.0 | Persistent handle | Used for a failover

SMB 2.0 implements a failover as follows: SmartMatrix continuously mirrors SMB 2.0 durable
handles across controllers. If a controller or an interface module is faulty, the system performs
transparent migration of NAS logical interfaces. When the host restores the SMB 2.0 service
from the new controller, the controller obtains the handle from the controller on which the
durable handle is backed up to ensure service continuity.
SMB 3.0 implements a failover as follows: SmartMatrix continuously mirrors SMB 3.0 persistent
handles across controllers. If a controller or an interface module is faulty, the system performs
transparent migration of NAS logical interfaces. The host restores the persistent handle that was
backed up on a controller to a specified controller based on the SMB 3.0 failover standards.
 Failover group
A failover group is a group of ports that are used for IP address failover in a storage system. The
storage system supports the default failover group, VLAN failover group, and user-defined
failover group. Manual and automatic failbacks are supported. A failback takes about 5 seconds.
Default failover group: If a port is faulty, the storage device fails over the LIFs of this port to a
port with the same location, type (physical or bond), rate (GE or 10GE), and MTU on the peer
controller. If the port is faulty again, the storage device finds a proper port on another controller
using the same rule. On a symmetric network, select this failover group when creating LIFs.
VLAN failover group: The system automatically creates a VLAN failover group when a VLAN
port is created. If a VLAN port is faulty, the storage device fails over the LIFs to a normal VLAN
port that has the same tag and MTU in the failover group. Use this failover group for easier
deployment of LIFs when VLANs are used.
User-defined failover group: The user manually specifies the ports in a failover group. If a port is
faulty, the system finds a proper port from the specified group member ports.
 Data reliability solution

Dual mappings for directory metadata: Directories and inodes have dual logical mappings for
redundancy.
Data redundancy with snapshots: Snapshots provide local redundancy for file system data and
data recovery when needed.
Data redundancy on disks: Data is redundantly stored on disks using RAID 2.0+ to prevent loss
in the event of disk failures. The system automatically recovers the data using RAID as long as
the amount of corrupted data is within the permitted range.
Data redundancy across sites: Corrupted data at the local site can be recovered from the remote
site.
1.1.3.3 High Security
 Trusted and secure boot of hardware
Secure boot establishes a hardware root of trust (which is tamperproof) to implement
authentication layer by layer. This builds a chain of trust across the entire system and makes
system behavior predictable.
Huawei OceanStor all-flash storage systems use this methodology to avoid loading tampered
software during the boot process.
Software verification and loading process for secure boot:
Verify the signed public key of Grub. BootROM verifies the integrity of the signed public key of
Grub. If the verification fails, the boot process is terminated.
Verify and load Grub. BootROM verifies the Grub signature and loads Grub if the verification is
successful. If the verification fails, the boot process is terminated.
Verify the status of the software signature certificate. Grub verifies the status of the software
signature certificate based on the certificate revocation list. If the certificate is invalid, the boot
process is terminated.
Verify and load the OS. Grub verifies the OS signature and loads the OS if the verification is
successful. If the verification fails, the boot process is terminated.
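The following Python sketch models the layer-by-layer verification chain (BootROM, then Grub, then the OS) described above, using SHA-256 digests as a stand-in for real digital signatures and certificate revocation checks. The stage names, image contents, and "trusted digest" table are assumptions made for illustration only.

import hashlib

# Digests anchored in the hardware root of trust (illustrative values).
TRUSTED_DIGESTS = {}

def digest(image: bytes) -> str:
    return hashlib.sha256(image).hexdigest()

def provision(stage: str, image: bytes):
    """Record the known-good digest for a boot stage (done at build time)."""
    TRUSTED_DIGESTS[stage] = digest(image)

def verify_and_load(stage: str, image: bytes) -> bool:
    """Verify a stage before handing control to it; abort the boot on failure."""
    if digest(image) != TRUSTED_DIGESTS.get(stage):
        print(f"{stage}: verification failed, boot terminated")
        return False
    print(f"{stage}: verified, loading")
    return True

# Build-time provisioning of the chain of trust.
grub_image, os_image = b"grub-bootloader-image", b"operating-system-image"
provision("grub", grub_image)
provision("os", os_image)

# Boot sequence: each layer verifies the next before loading it.
if verify_and_load("grub", grub_image):
    if verify_and_load("os", os_image):
        print("system booted from verified software only")

# A tampered image breaks the chain and stops the boot.
verify_and_load("os", b"tampered-operating-system")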
 Role-based permission management
Preset default roles: The system provides default roles for system and vStore administrators.
 Default roles of system administrators:
Super administrator: super_administrator
Administrator: Administrator
Security administrator: security_administrator
SAN administrator: san_administrator
NAS administrator: nas_administrator
Data protection administrator: dataProtection_administrator
Network administrator: network_administrator
 Default roles of vStore administrators:
vStore administrator: vStore_administrator
vStore data protection administrator: vStore_dataProtection
vStore protocol administrator: vStore_protocol

User-defined role: Users customize roles based on service requirements. During customization,
users can select multiple functions for a role and multiple objects for each function. User-defined
roles can be deleted and modified.
 Security log audit
Technical principles of the native audit log:
Users can specify the file systems and file operations to be audited, such as create, delete,
rename, modify, and chmod.
Audit logs and read/write I/Os are processed in the same process to record the I/Os and logs at
the same time.
Audit logs are stored as metadata in the Audit-Dtree directory of each file system to ensure I/O
performance.
The system converts the log metadata from the *.bin format to the *.xml format in the
background for reads and writes.
Audit logs in the *.xml format are stored in the Audit-Log-FS file system of each vStore.
Asynchronous remote replication provides disaster recovery for the audit logs.

1.1.4 Application Scenarios


Storage for virtual environments: OceanStor Dorado V6 supports virtualization environments. It
incorporates server virtualization optimization technologies such as vStorage APIs for Array
Integration (VAAI), vStorage APIs for Storage Awareness (VASA), and Site Recovery Manager
(SRM). It employs numerous key technologies to increase VM deployment efficiency, enhance VMs'
bearing capability and running speed, and streamline storage management in virtual environments,
helping you easily cope with storage demands in virtual environments.
Multi-protocol access: The storage system allows users to configure both NFS sharing and CIFS
sharing for a file system to support both SMB and NFS services.

1.2 Introduction to Hybrid Flash Storage


1.2.1 Product Positioning
Business development leads to an increasing amount of service data, which poses ever higher demands
on storage systems. Traditional storage systems are unable to meet these demands and encounter
bottlenecks such as inflexible storage performance expansion, complex management of various
devices, difficulty in reusing legacy devices, and maintenance costs that occupy an increasing share of
the total cost of ownership (TCO). To address these problems, Huawei has launched OceanStor hybrid
flash storage systems.
The storage systems incorporate file-system- and block-level data and storage protocols and provide
industry-leading performance and a variety of efficiency improvement mechanisms. All these
advantages provide customers with comprehensive high-performance storage solutions, maximize
customers' return on investment (ROI), and meet the requirements of large-scale databases for online
transaction processing (OLTP) or online analytical processing (OLAP), high-performance computing,
digital media, Internet operation, central storage, backup, disaster recovery, and data migration.
Featuring the cutting-edge hardware structure and block-and-file unified software architecture,
Huawei OceanStor hybrid flash storage systems combine advanced data application and protection
technologies to meet the storage requirements of medium- and large-sized enterprises for high
performance, scalability, reliability, and availability.

Brand-new architecture: The latest-generation multi-core CPU and SmartMatrix 3.0 architecture
enable the storage systems to support up to 32 controllers and 192 PB of all-flash capacity for linear
performance increase.
Ultimate convergence: SAN and NAS are converged to provide elastic storage, simplify service
deployment, improve storage resource utilization, and reduce TCO.
Outstanding performance: The flash-optimized technology gives full play to SSD performance. Inline
deduplication and compression are supported. Loads are balanced among controllers that serve as hot
backup for each other, delivering higher reliability. Resources are centrally stored and easily
managed.

1.2.2 Software and Hardware Architectures


 Hardware architecture
The 7-nanometer Arm processors with high performance and low power consumption simplify
the design of the storage printed circuit board (PCB), occupy less internal space, and offer better
heat dissipation. With a compact hardware design, the storage systems can provide more
interface modules for customers in smaller footprints and less power consumption.
OceanStor V5 Kunpeng series storage systems have the following changes: The CPUs and
control modules are switched to the Huawei-developed Kunpeng architecture. Onboard fan
modules and BBUs are smaller in size. Two hot-swappable interface modules are added, but
FCoE and IB ports are not supported. Back-end disk enclosures support SAS 3.0 and Huawei-
developed RDMA high-speed ports.
Log in to the support website (support.huawei.com/enterprise/) to obtain the product
documentation. In the product documentation, see section "General Information > Product
Description > Hardware Architecture" to view the hardware architecture of the corresponding
storage product, such as information about controller enclosures, disk enclosures, and interface
modules.
 Software architecture
In hybrid flash storage, NAS and SAN use parallel software protocol stacks but are converged on
the resource allocation and management planes.
1. SmartMatrix 3.0 for full load balancing: The architecture features full switching,
virtualization, and redundancy, and native load balancing. It can work with advanced
technologies such as end-to-end (E2E) data integrity field (DIF), memory error checking
and correcting (ECC), and transmission channel cyclic redundancy check (CRC). By
combining the architecture and these technologies, the storage systems support linear
performance growth, maximum scalability, 24/7 high availability, and high system security,
thereby satisfying critical service demands of medium- and large-sized data centers.
2. Front-end and back-end fully shared architecture: The front-end and back-end interconnect
I/O modules work with SmartMatrix to balance data flows and workloads among multiple
controllers. The interface modules of the new-generation high-end storage are fully shared,
and are smaller than the previous ones. Their deployment is more flexible and convenient,
and the bandwidth is higher.
Four-controller interconnection: The Fibre Channel front-end interconnect I/O modules
(FIMs), controllers, and back-end interconnect I/O modules (BIMs) are fully interconnected.
I/Os can reach any controller via any interface module without forwarding.
Single-link upgrade: When a host connects to a single controller and the controller is
upgraded, interface modules automatically forward I/Os to other controllers without
affecting the host.
Non-disruptive reset: When a controller is reset or faulty, interface modules automatically
forward I/Os to other controllers without affecting the host.

Multi-controller redundancy: The storage system tolerates the failures of three out of four
controllers.
Next-generation power protection: BBUs are built into controllers. When a controller is
removed, the BBU provides power for flushing cache data to system disks. Even when
multiple controllers are removed concurrently, data is not lost.
3. Controller faults are transparent to hosts.
Port: Each front-end port provides one Fibre Channel session for a host. The host detects
only one Fibre Channel session and WWN from each storage port.
Chip: Four internal links are established, each connecting to a controller in a controller
enclosure. Each controller establishes its own Fibre Channel session with the host.
FIMs enable full interconnection between front-end links and all storage controllers. When any
controller fails, they ensure continuous front-end access without affecting hosts. Let's take a
moment to examine exactly how FIMs work.
From the host's perspective, each front-end port provides the host with one Fibre Channel
session, so the host identifies only one Fibre Channel session and WWN from each storage
port.
From the storage system's perspective, four internal links are established, each connecting to a
controller in a controller enclosure. Each controller establishes its own Fibre Channel
session with the host.
Controller failures: When any controller in a controller enclosure fails, FIMs redirect the
I/Os to the remaining controllers. The host remains unaware of the fault, the Fibre Channel
links remain up, and services run properly. No alarm or event is reported.

1.2.3 Key Technologies


Huawei hybrid flash storage supports parallel access of SAN and NAS, providing optimal access
paths for different services and achieving optimal access performance. Convergence of block and file
storage eliminates the need for NAS gateways and reduces procurement costs. Huawei hybrid flash
storage products are suitable for government, transportation, finance, and carrier industries in
scenarios such as databases, video surveillance, and VDI virtual desktops.
 Intelligent tiering technology for SAN and NAS
Throughout the data lifecycle, hot data gradually becomes cold. If cold data occupies a
large amount of cache and SSD space, high-speed storage resources are wasted and the storage
system's performance is adversely affected.
In some cases, cold data may become hot. If the hot data is still stored on slow-speed storage
media such as tapes and NL-SAS disks, the I/O response will be slow, severely deteriorating
service efficiency.
To solve this problem, the storage system uses the intelligent tiering technology to flexibly
allocate data storage media in the background.
To use this technology, services must be deployed on a storage system containing different media
types. This technology monitors data in real time. Data that is not accessed for a long time is
marked as cold data and is gradually relocated to lower-performance media, ensuring
outstanding service performance over the long term.
After cold data is activated and frequently accessed, it can be quickly relocated to the high-speed
media, ensuring stable system performance. The relocation policy can be manually or
automatically executed in the granularity of both LUN and file system. The storage system with
this feature enabled is cost-effective for customers.
 RAID 2.0+ software architecture
The RAID 2.0+ technology combines bottom-layer media virtualization and upper-layer resource
virtualization to provide fast data reconstruction and intelligent resource allocation. Time
required for data reconstruction is shortened from 10 hours to 30 minutes. Data reconstruction
speed is accelerated by 20 times, greatly reducing the impact on services and probabilities of
multi-disk failures in the reconstruction process. All disks in the storage pool participate in
reconstruction, and only service data is reconstructed. The reconstruction mode is changed from
traditional RAID's many-to-one to many-to-many.
 Huawei-developed chips
Front-end transmission: The intelligent multi-protocol interface chip supports the industry's
fastest 32 Gbit/s Fibre Channel and 100 Gbit/s Ethernet protocol for hardware offloading. It
enables interface modules to implement protocol parsing previously performed by the CPU to
reduce the CPU workloads and improve the transmission performance. It reduces the front-end
access latency from 160 μs to 80 μs. The parsed data interacts with the CPU to implement
advanced features, such as traffic control.
Controller chip: The Kunpeng 920 processor is the first 7-nm Arm CPU in the industry and
integrates the southbridge, network adapter, and SAS controller chips.
SSD storage chip: The core FTL algorithm is embedded in the Huawei-developed chip, which
directly determines the read/write location and reduces the write latency from 40 μs to 20 μs.
Intelligent management chip: This chip serves the management plane throughout the entire life
cycle of a storage system. Its built-in library contains storage fault patterns accumulated over
more than 10 years, enabling problems to be quickly identified and rectified. Once a fault is
detected, the management chip matches it against the fault models in the library, locates the
fault with an accuracy of 93%, and provides a solution.
 RDMA scale-out
Four controllers are expanded to eight controllers without using any switches. The networking is
simple.
100 Gbit/s RDMA ports transmit data between the two controller enclosures.
VLANs are used for logical data communication to ensure data security and reliability on the I/O
plane and management and control plane.
 Self-encrypting drive (SED)
SEDs use the AES-256 encryption algorithm to encrypt data stored on the disks without affecting
performance.
Internal Key Manager is a key management application embedded in storage systems. OceanStor
18000 V5 and 18000F V5 use the trusted platform module (TPM) to protect keys.
External Key Manager uses the standard KMIP + TLS protocols. Internal Key Manager is
recommended when the key management system is only used by the storage systems in a data
center.
OceanStor V5 storage systems combine SEDs with Internal Key Manager (built-in key
management system) or External Key Manager (independent key management system) to
implement static data encryption and ensure data security.
The AES algorithm is based on permutation and substitution, and uses several different methods
to perform these operations. It is an iterative, symmetric-key algorithm with a fixed block size of
128 bits (16 bytes) and a key size of 128, 192, or 256 bits. Unlike public-key ciphers, which use
key pairs, symmetric-key ciphers use the same key to encrypt and decrypt data. The ciphertext
returned by AES has the same number of bits as the input data. The key size used for an AES
cipher determines the number of transformation rounds that convert the input, called the
plaintext, into the final output, called the ciphertext.
Internal Key Manager is easy to deploy, configure, and manage. There is no need to deploy an
independent key management system.
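As a minimal illustration of the symmetric-key property described above (the same key encrypts and decrypts), the sketch below performs an AES-256 encrypt/decrypt round trip with the third-party cryptography package. It is unrelated to the storage system's actual SED firmware or key management, and note that the GCM mode used here appends an authentication tag, whereas the raw AES block cipher preserves data length as described above.

# Requires the third-party "cryptography" package (pip install cryptography).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # 256-bit symmetric key (AES-256)
nonce = os.urandom(12)                      # unique nonce per encryption
plaintext = b"block of user data written to the drive"

aesgcm = AESGCM(key)
ciphertext = aesgcm.encrypt(nonce, plaintext, None)   # encrypt on write
recovered = aesgcm.decrypt(nonce, ciphertext, None)   # decrypt on read

assert recovered == plaintext
# Without the key, the ciphertext is unreadable, which is the basis of
# static (at-rest) data encryption on SEDs.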

 Advanced features
The block service and file service support a wide range of advanced features. For details, see the
training slides.

1.2.4 Application Scenarios


 Multi-site disaster recovery
Hybrid flash storage is applicable to the cascading and parallel architecture of the geo-redundant
DR solution. Solution highlights are as follows: interoperability among entry-level, mid-range,
and high-end storage arrays; second-level RPO and minute-level RTO for asynchronous
replication (HyperReplication/A); DR Star.
If the DR center fails, the remaining sites automatically establish the replication relationship for
continuous data protection. After the standby replication relationship is activated, incremental
data is replicated without changing the RTO.
Configuration of DR Star can be done at a single site for simplified management.
 Application scenarios for storage tiering
Different service applications have different requirements on performance and reliability. For
example, the CRM and billing services are hot data applications while backup is a cold data
application. Huawei all-flash storage, hybrid flash storage, and distributed storage are applicable
to those applications to implement data consolidation and tiering and provide data storage with
different SLA levels.

1.3 Introduction to Distributed Storage


1.3.1 Product Positioning
Designed for mass data scenarios, Huawei distributed storage series provides diversified storage
services for various applications, such as virtualization/cloud resource pools, mission-critical
databases, big data analysis, high-performance computing (HPC), video, content storage, backup, and
archiving, helping enterprises release the value of mass data.
Huawei intelligent distributed storage is a Huawei-developed, scale-out distributed storage product.
A cluster provides standard interfaces for upper-layer applications, such as block, HDFS, and object
services, eliminating the complex operation problems caused by siloed storage systems. Its diverse
and adaptable range of features provides a stable bearer for complex services, maximized efficiency
for diversified data, and cost-effective storage for mass data.
Block service supports SCSI and iSCSI interfaces and provides upper-layer applications with mass
storage pools that can be obtained on demand and elastically expanded, greatly improving the
preparation efficiency of application environments. It is an ideal storage platform for private clouds,
containers, virtualization, and database applications.
HDFS service provides a decoupled storage-compute big data solution based on native HDFS. The
solution implements on-demand configuration of storage and compute resources, provides consistent
user experience, and helps reduce the total cost of ownership (TCO). It can coexist with the legacy
coupled storage-compute architecture. Typical application scenarios include big data for finance,
carriers (log retention), and governments.
The object service supports a single bucket containing up to 100 billion objects without performance
deterioration. This eliminates the trouble of bucket reconstruction for large-scale applications.
Typical scenarios include production, storage, backup, and archiving of financial
electronic check images, audio and video recordings, medical images, government and enterprise
electronic documents, and Internet of Vehicles (IoV).

Scale-Out NAS adopts a fully symmetric distributed architecture. Scale-Out NAS is used for storing
mass unstructured data with its industry-leading performance, large-scale scale-out capability, and
ultra-large single file system. Huawei Scale-Out NAS can improve the storage efficiency of IT
systems, simplify the workload and migration process, and cope with the growth and evolution of
unstructured data.

1.3.2 Software and Hardware Architectures


First, let's look at the hardware architecture of distributed storage.
The following table lists the hardware configuration when the standard hardware is used.

Hardware | Model | Description | Applicable To
Cabinet | Standard IT cabinet | Provides 42 U space for device installation. | -
Storage nodes | P100 | 2 U 12-slot EXP node equipped with two Kunpeng 920 CPUs (48-core, 2.6 GHz). | Converged, object, HDFS, and block services
Storage nodes | P100 | 2 U 25-slot EXP node equipped with two Kunpeng 920 CPUs (48-core, 2.6 GHz). | Block service
Storage nodes | C100 | 4 U 36-slot passthrough node equipped with two Kunpeng 920 CPUs (48-core, 2.6 GHz). | Converged, object, HDFS, and block services
Storage nodes | F100 | 2 U 12-slot EXP NVMe all-flash node equipped with two Kunpeng 920 CPUs (48-core, 2.6 GHz). | Block service
Storage nodes | P110 | 2 U 12-slot node equipped with x86 CPUs. | Converged, object, HDFS, and block services
Storage nodes | P110 | 2 U 25-slot node equipped with x86 CPUs. | Block service
Storage nodes | C110 | 4 U 36-slot node equipped with x86 CPUs. | Converged, object, HDFS, and block services
Storage nodes | F110 | 2 U 12-slot NVMe all-flash node equipped with x86 CPUs. | Block service
Storage nodes | F110 | 2 U 24-slot NVMe all-flash node equipped with x86 CPUs. | Block service
Network devices | S5731-H48T4XC | GE BMC/management switch providing four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports. | -
Network devices | S5720-56C-EI-AC | GE BMC/management switch providing four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports. | -
Network devices | S5331-H48T4XC | GE BMC/management switch providing four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports. | -
Network devices | S5320-56C-EI-AC | GE BMC/management switch providing four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports. | -
Network devices | CE6881-48S6CQ | 10GE storage switch providing forty-eight 10GE SFP+ Ethernet optical ports and six 40GE QSFP28 Ethernet optical ports. | -
Network devices | CE6855-48S6Q-HI | 10GE storage switch providing forty-eight 10GE SFP+ Ethernet optical ports and six 40GE QSFP+ Ethernet optical ports. | -
Network devices | CE6857-48S6CQ-EI | 10GE storage switch providing forty-eight 10GE SFP+ Ethernet optical ports and six 40GE/100GE QSFP28 Ethernet optical ports. | -
Network devices | CE6863-48S6CQ | 25GE storage switch providing forty-eight 10GE/25GE SFP28 Ethernet optical ports and six 40GE/100GE QSFP28 Ethernet optical ports. | -
Network devices | CE6865-48S8CQ-EI | 25GE storage switch providing forty-eight 25GE SFP28 Ethernet optical ports and eight 100GE QSFP28 Ethernet optical ports. | -
Network devices | CE8850-64CQ-EI | 100GE aggregation switch providing two 10GE SFP+ Ethernet optical ports and sixty-four 100GE QSFP28 Ethernet optical ports. | -
Network devices | SB7800 | 100 Gbit/s IB storage switch providing thirty-six 100 Gbit/s QSFP28 optical ports. | -
KVM | Keyboard, video, and mouse (KVM) | Provides eight KVM ports. | -

Note: Converged services refer to scenarios where multiple storage services are deployed at a site.

If Scale-Out NAS is used, the hardware contains storage nodes, network devices, KVM, and short
message service (SMS) modems. The following table lists the hardware components.

Hardware | Model | Description
Cabinet | FR42612L (recommended) | Provides 42 U space for device installation.
Storage nodes | P12E | 2 U 12-slot node (configuration example: 12 x SATA disks or 1 x SSD + 11 x SATA disks)
Storage nodes | P25E | 2 U 25-slot node with higher performance (configuration example: 1 x SSD + 24 x SAS disks)
Storage nodes | P36E | 4 U 36-slot node with higher performance (configuration example: 1 x SSD + 35 x SATA disks)
Storage nodes | C36E | 4 U 36-slot node with higher performance (configuration example: 36 x SATA disks)
Storage nodes | P12 | 2 U 12-slot node (configuration example: 12 x SATA disks or 1 x SSD + 11 x SATA disks)
Storage nodes | P25 | 2 U 25-slot node (configuration example: 1 x SSD + 24 x SAS disks)
Storage nodes | P36 | 4 U 36-slot node (configuration example: 1 x SSD + 35 x SATA disks)
Storage nodes | C36 | 4 U 36-slot node (configuration example: 36 x SATA disks)
Storage nodes | C72 | 4 U 72-slot node (configuration example: 72 x SATA disks)
Network devices | CE6810-48S4Q-EI/CE6810-24S2Q-LI | 10GE switch
Network devices | S5700-52C-SI/S5352C-SI | GE switch
Network devices | SX6018 | InfiniBand (IB) switch
8-port KVM | - | Provides eight KVM ports.
Modem | - | Provides SMS-based alarm notification.

Log in to https://support.huawei.com/enterprise/ to obtain the product documentation. Choose
Product Documentation > Basic Information > Product Description > Hardware Architecture to
view the hardware architecture of the corresponding storage product.
Next, let's look at the software architecture of distributed storage.
The following uses an example to describe the key concepts of the software architecture:
Protocol refers to the storage protocol layer. The block, object, HDFS, and file services support local
mounting access over iSCSI or VSC, S3/Swift access, HDFS access, and NFS access, respectively.
VBS is a block access layer of the block service. User I/Os are delivered to VBS through iSCSI or
VSC.
EDS-B provides the block service with enterprise-level features, and receives and processes I/Os from
VBS.
EDS-F provides the HDFS service.
OBS service provides the object service.
DP protects data.
The persistence layer provides the persistent storage capability. It implements EC and multi-copy
protection and uses Plog clients to provide append-only access to Plogs.
Infrastructure provides infrastructure capabilities for the storage system, such as scheduling and
memory allocation.
OAM is the management plane of storage, which provides functions such as deployment, upgrade,
capacity expansion, monitoring, and alarming.

The system supports rich enterprise-class features, such as second-level HyperReplication and
HyperMetro for the block service. A microservice-based architecture is adopted, and the block,
HDFS, and object services can share the persistence service.
The block service supports a wide range of virtualization platforms and database applications with
standard access interface protocols such as SCSI and iSCSI, and delivers high performance and
scalability to meet SAN storage requirements of virtualization, cloud resource pools, and databases.
Key features of the block service include HyperMetro (active-active storage), HyperReplication
(remote replication), HyperSnap (snapshot), SmartQoS (intelligent service quality control),
SmartDedupe (deduplication), and SmartCompression (compression).
The object service supports mainstream cloud computing ecosystems with standard object service
APIs for content storage, cloud backup and archiving, and public cloud storage service operation. Key
features of the object service include HyperReplication (remote replication), Protocol-Interworking
(object/file interworking), SmartDedupe (deduplication), SmartQuota (quota management), and
SmartQoS (intelligent service quality control).
The HDFS service supports native HDFS interfaces without plug-ins and provides a cloud-enabled
decoupled storage-compute solution for big data analysis. It enables you to efficiently process
massive amounts of data, deploy and use resources on demand, and reduce TCO. Key features of the
HDFS service include SmartTier (tiered storage), SmartQuota (quota), and recycle bin.

1.3.3 Key Technologies


 DHT technology
The block service uses the distributed hash table (DHT) routing algorithm. Each storage node
stores a small proportion of the data, and DHT determines where each piece of data in the system
is stored.
Traditional storage systems typically employ the centralized metadata management mechanism,
which allows metadata to record the disk distribution of the LUN data with different offsets. For
example, the metadata may record that the first 4 KB data in LUN1 + LBA1 is distributed on
LBA2 of the 32nd disk. Each I/O operation initiates a query request for the metadata service. As
the system scale grows, the metadata size also increases. However, the concurrent operation
capability of the system is subject to the capability of the server accommodating the metadata
service. In this case, the metadata service will become a performance bottleneck of the system.
Unlike traditional storage systems, the block service uses DHT for data addressing. The
following figure shows the implementation.

The DHT ring of the block service contains 2^32 logical space units which are evenly divided
into n partitions. The n partitions are evenly allocated on all disks in the system. For example, n
is 3600 by default. If the system has 36 disks, each disk is allocated 100 partitions. The system
configures the partition-disk mapping relationship during system initialization and will adjust the
mapping relationship accordingly after the number of disks in the system changes. The partition-
disk mapping table occupies only a small space, and block service nodes store the mapping table
in the memory for rapid routing. The routing mechanism of the block service is different from
that of the traditional storage array. It does not employ the centralized metadata management
mechanism and therefore does not have performance bottlenecks incurred by the metadata
service.
An example is provided as follows: If an application needs to access the 4 KB of data identified by
an address starting with LUN1 + LBA1, the system first constructs the key "key = LUN1 +
LBA1/1M", calculates the hash value of this key, performs a modulo-N operation on the hash value
(where N is the number of partitions) to obtain the partition number, and then obtains the disk to
which the data belongs from the partition-disk mapping, as sketched below.
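As a minimal model of this addressing path, the Python sketch below hashes a LUN/LBA key onto one of N partitions and then looks the partition up in an in-memory partition-to-disk table. The 3600 default partition count and the 1 MB granule come from the text above; the CRC32 hash, the even-assignment rule, and the table contents are illustrative assumptions.

import zlib

N_PARTITIONS = 3600                     # default partition count from the text
GRANULE = 1024 * 1024                   # 1 MB addressing granule (the LBA1/1M term above)

def build_partition_map(disk_count: int) -> dict:
    """Evenly assign the N partitions across all disks (e.g. 3600 / 36 = 100 each)."""
    return {p: p % disk_count for p in range(N_PARTITIONS)}

def locate(lun_id: int, lba: int, partition_map: dict) -> tuple:
    key = f"LUN{lun_id}+{lba // GRANULE}".encode()   # key = LUN + LBA/1M
    partition = zlib.crc32(key) % N_PARTITIONS       # hash, then modulo N
    return partition, partition_map[partition]       # partition -> disk lookup

pmap = build_partition_map(disk_count=36)
partition, disk = locate(lun_id=1, lba=4096, partition_map=pmap)
print(f"LUN1+LBA4096 -> partition {partition} -> disk {disk}")
# Routing needs only this small in-memory table, so there is no central
# metadata service to become a performance bottleneck.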
In addition, the DHT routing algorithm has the following characteristics:
Balance: Data is distributed to all nodes as evenly as possible, thereby balancing loads among
nodes.
Monotonicity: If new nodes are added to the system, the system redistributes data among nodes.
Data migration is implemented only on the new nodes, and the data on the existing nodes is not
significantly adjusted.
 Range segmentation and WAL aggregation
Data to be stored is distributed to different nodes in range mode. Write Ahead Log (WAL) is an
intermediate storage technique used before data persistence. Once data has been stored in the
WAL, a write success message can be returned to upper-layer applications. The key principle of
WAL is that modifications to data files (the carriers of tables and indexes) may occur only after
those modifications have been logged, that is, after the log records describing the changes have
been flushed to persistent storage.
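The sketch below shows the write-ahead rule in miniature: the change is appended and flushed to a log before the data file is modified, and an acknowledgement could already be returned once the log record is durable. The file names and record format are invented for the example and are not the storage system's actual on-disk layout.

import json
import os

LOG_PATH = "wal.log"        # illustrative file names
DATA_PATH = "data.json"

def load_data() -> dict:
    if os.path.exists(DATA_PATH):
        with open(DATA_PATH) as f:
            return json.load(f)
    return {}

def write(key: str, value: str) -> None:
    # 1. Append the change to the log and flush it to persistent storage first.
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())
    # At this point a write success message could already be returned to the caller.
    # 2. Only then modify the data file itself.
    data = load_data()
    data[key] = value
    with open(DATA_PATH, "w") as f:
        json.dump(data, f)

write("volume-1/offset-0", "hello")
print(load_data())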
 Multi-NameNode concurrency
The NameNode is the metadata request processing node of the HDFS, and the DataNode is the
data request processing node of the HDFS.
Traditional HDFS NameNode model:
Only one active NameNode provides the metadata service. The active and standby NameNodes
are not consistent in real time and have a synchronization period.
After the current active NameNode breaks down, the new NameNode cannot provide metadata
services for several hours until the new NameNode loads logs.
The number of files supported by a single active NameNode depends on the memory of a single
node. A maximum of 100 million files can be supported by a single active NameNode.
If a namespace is under heavy pressure, concurrent metadata operations consume a large number
of CPU and memory resources, resulting in poor performance.
Huawei HDFS multi-NameNode concurrency has the following features:
Multiple active NameNodes provide metadata services, ensuring real-time data consistency
among multiple nodes.
It avoids metadata service interruption caused by traditional HDFS NameNode switchover.
The number of files supported by multiple active NameNodes is no longer limited by the memory
of a single node.
Multi-directory metadata operations are concurrently performed on multiple nodes.
 Append Only Plog technology
HDDs and SSDs can be used in the same system, but the two media differ significantly in bandwidth, IOPS, and latency, so the I/O patterns that suit each of them also differ greatly. The Append Only Plog technology manages HDDs and SSDs in a unified way and provides an optimal disk-write model for both media: small I/O blocks are aggregated into large ones, which are then written to disk sequentially. This write pattern matches the characteristics of both media, as the toy model below illustrates.
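Below is a toy model of the aggregation idea only; the 512 KB chunk size, in-memory buffering, and list-based "log" are assumptions made for illustration.

```python
class AppendOnlyPlog:
    """Toy model: aggregate small writes and append them as large sequential chunks."""

    def __init__(self, chunk_size: int = 512 * 1024):
        self.chunk_size = chunk_size  # assumed aggregation granularity
        self.buffer = bytearray()     # small I/Os accumulate here
        self.plog = []                # stands in for the on-disk append-only log

    def write(self, data: bytes) -> None:
        self.buffer.extend(data)
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            # One large sequential append instead of many small random writes.
            self.plog.append(bytes(self.buffer))
            self.buffer.clear()

log = AppendOnlyPlog()
for _ in range(200):
    log.write(b"x" * 4096)  # 200 small 4 KB I/Os
log.flush()
print(len(log.plog), "large sequential writes issued")  # -> 2
```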
 EC intelligent aggregation technology
Intelligent EC aggregation based on append-only writes always produces full-stripe EC writes, reducing read/write network amplification and disk amplification severalfold. Data is aggregated in a single pass, which lowers CPU overhead and delivers high peak performance.
 Multi-level cache technology
The following figure shows the write cache.

The detailed procedure is as follows:
Step 1 The storage system writes data to the RAM-based write cache (memory write cache).
Step 2 The storage system writes data to the SSD WAL cache (for large I/Os, data is written to the HDD) and returns a message to the host indicating that the write operation is complete.
Step 3 When the memory write cache reaches a certain watermark, the storage system writes data to the SSD write cache.
Step 4 For large I/Os, the storage system writes data to the HDD. For small I/Os, the system first writes data to the SSD write cache, and then writes data to the HDD after aggregating the small I/Os into large I/Os.
Note: If the data written in Step 1 exceeds 512 KB, it is directly written to the HDD in Step 4. A simplified sketch of this write path follows.
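The sketch below mimics that decision flow. The 512 KB threshold comes from the note above; the watermark value, list-based tiers, and synchronous destage are illustrative assumptions.

```python
LARGE_IO = 512 * 1024  # large-I/O threshold from the note above
WATERMARK = 64         # assumed destage watermark (number of cached I/Os)

ram_cache, ssd_wal, ssd_write_cache, hdd = [], [], [], []

def host_write(io: bytes) -> str:
    if len(io) > LARGE_IO:
        hdd.append(io)                # large I/Os go straight to the HDD tier
    else:
        ram_cache.append(io)          # Step 1: memory write cache
        ssd_wal.append(io)            # Step 2: SSD WAL cache, then ack the host
    if len(ram_cache) >= WATERMARK:   # Step 3: destage once the watermark is hit
        ssd_write_cache.extend(ram_cache)
        ram_cache.clear()             # Step 4 would later aggregate these to HDD
    return "write complete"

print(host_write(b"a" * 4096))           # small I/O: cached and acknowledged
print(host_write(b"b" * (1024 * 1024)))  # 1 MB I/O: written directly to the HDD
```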
The following figure shows the read cache.
The detailed procedure is as follows:
Step 1 The storage system reads data from the memory write cache. If the read I/O is hit, the message that the data is read successfully is returned. Otherwise, the storage system proceeds to Step 2.
Step 2 The storage system reads data from the memory read cache. If the read I/O is hit, the message that the data is read successfully is returned. Otherwise, the storage system proceeds to Step 3.
Step 3 The storage system reads data from the SSD write cache. If the read I/O is hit, the message that the data is read successfully is returned. Otherwise, the storage system proceeds to Step 4.
Step 4 The storage system reads data from the SSD read cache. If the read I/O is hit, the message that the data is read successfully is returned. Otherwise, the storage system proceeds to Step 5.
Step 5 The storage system reads data from the HDD.
Note: Pre-fetched data, such as sequential data, is cached in the memory read cache. Hotspot data identified during the read process is cached in the SSD read cache. The lookup order is sketched below.
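Here is a minimal sketch of the five-step lookup; representing each tier as a Python dict is purely an illustrative assumption.

```python
def read(addr, mem_write_cache, mem_read_cache, ssd_write_cache, ssd_read_cache, hdd):
    """Query each cache tier in order and return data from the first hit."""
    tiers = [
        ("memory write cache", mem_write_cache),  # Step 1
        ("memory read cache", mem_read_cache),    # Step 2
        ("SSD write cache", ssd_write_cache),     # Step 3
        ("SSD read cache", ssd_read_cache),       # Step 4
    ]
    for name, cache in tiers:
        if addr in cache:
            return cache[addr], name              # cache hit
    return hdd[addr], "HDD"                       # Step 5: fall through to the HDD

data, source = read("LBA42", {}, {}, {}, {"LBA42": b"hot"}, {"LBA42": b"cold"})
print(source)  # -> "SSD read cache"
```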
 Distributed metadata access
The following figure shows the access process.
Key concepts are described as follows:


CA: Client Agent
MDS: Metadata Service
DS: Data Service
The process is described as follows:
1. The client initiates an access request and queries metadata from MDS 1, which holds the root directory of the metadata service (MDS).
2. The root directory refers the client to MDS 2 for the next level of metadata.
3. The metadata query continues along the referral chain.
4. The metadata query is finally served on MDS 4.
5. After obtaining the metadata, the client reads the data from the data service (DS) as indicated by the metadata. The referral walk is sketched below.
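The sketch below models the referral walk with a hypothetical three-hop chain; the table layout, the data-service name DS3, and the extent numbering are invented for illustration.

```python
# Hypothetical referral chain: each MDS either returns the metadata or the name
# of the next MDS to ask.
mds_cluster = {
    "MDS1": ("refer", "MDS2"),                            # root directory
    "MDS2": ("refer", "MDS4"),
    "MDS4": ("meta", {"data_service": "DS3", "extent": 7}),
}
data_services = {"DS3": {7: b"file contents"}}

def open_and_read(start: str = "MDS1") -> bytes:
    node = start
    while True:
        kind, payload = mds_cluster[node]
        if kind == "meta":                                # metadata found
            ds = data_services[payload["data_service"]]
            return ds[payload["extent"]]                  # read data as indicated
        node = payload                                    # follow the referral

print(open_and_read())
```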
 Intelligent load balancing technology
Load balancing works based on domain name access (in active-standby mode) and supports partitions. Each partition can be configured with an independent domain name and load balancing policy.
A client accesses the file system using the primary or a secondary domain name, for example, fx.tx.com.
The system resolves the domain name and returns an IP address selected according to the load balancing policy.
The client then accesses the service through the returned IP address, as sketched below.
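A round-robin resolver is sketched below as one possible policy; the second domain name, the IP addresses, and the per-partition pools are hypothetical.

```python
import itertools

# Hypothetical per-partition front-end node pools, keyed by domain name.
partitions = {
    "fx.tx.com": itertools.cycle(["10.0.0.11", "10.0.0.12", "10.0.0.13"]),
    "archive.example.com": itertools.cycle(["10.0.1.21", "10.0.1.22"]),
}

def resolve(domain: str) -> str:
    """Return the next IP address for the domain under a round-robin policy."""
    return next(partitions[domain])

print(resolve("fx.tx.com"))  # the client then accesses the service via this IP
print(resolve("fx.tx.com"))  # the next request is spread to another node
```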
 Single file system
Resources are centrally managed in a unified resource pool and can be easily shared. When
accessing a single file system, users do not need to pay attention to the specific data storage
location.
1. The system provides a unified file system for accessing all available space.
2. In a single file system, a file set is presented as a directory.
3. A unified file system is automatically created when the system is started.

1.3.4 Application Scenarios


 Private cloud and virtualization
Data storage resource pools for mass data storage are provided. The pools feature on-demand
resource provisioning and elastic capacity expansion in private cloud and virtualization
environments, improving storage deployment, expansion, and operation and maintenance (O&M)
efficiency with general-purpose servers. Typical scenarios include Internet-finance channel
access clouds, development and testing clouds, cloud-based services, B2B cloud resource pools
in carriers' BOM domains, and e-Government cloud.
 Mission-critical database
Enterprise-grade capabilities, such as distributed active-active storage and consistent low latency,
ensure efficient and stable running of data warehouses and mission-critical databases, including
online analytical processing (OLAP) and online transaction processing (OLTP).
 Big data analysis
An industry-leading decoupled storage-compute solution is provided for big data, which
integrates traditional data silos and builds a unified big data resource pool for enterprises. It also
leverages enterprise-grade capabilities, such as elastic large-ratio erasure coding (EC) and on-
demand deployment and expansion of decoupled compute and storage resources, to improve big
data service efficiency and reduce the TCO. Typical scenarios include big data analysis for
finance, carriers (log retention), and governments.
 Content storage and backup archiving
Superb-performance and high-reliability enterprise-grade object storage resource pools are
provided to meet the requirements of real-time online services such as Internet data, online audio
and video data, and enterprise web disks. It delivers large throughput, enables frequent access to hotspot data, and implements long-term storage and online access. Typical scenarios include storage, backup, and archiving of financial electronic check images, audio and video recordings, medical images, government and enterprise electronic documents, and Internet of Vehicles (IoV) data.
For example, the distributed storage block service can be used in scenarios such as BSS, MSS, OSS, and VAS. The object service can also be applied in these scenarios, with the following advantages:
Stable, low latency for customer access: latency remains below 80 ms, meeting the latency-stability requirements of continuous video writes and improving the access experience of end users.
High concurrency: millions of video connections are supported with stable performance.
On-demand use: storage resources can be used and paid for on demand as services grow, reducing the TCO.

1.4 Introduction to Hyper-Converged Storage


1.4.1 Product Positioning
Most traditional IT architectures use midrange computers + Fibre Channel storage. The IT
infrastructure based on such architecture is expensive to deploy and maintain. In addition, its poor
scalability cannot match the exponential data growth in large Internet companies. For example, every
day, Facebook has about two billion new photos that need to be stored and processed in time.
Web-scale IT is a concept proposed by Gartner to describe all of the things happening at the large
cloud service firms, such as Facebook, Google, and LinkedIn, that cope with explosive growth of
services and data through computing virtualization and distributed storage capabilities. Most
enterprises cannot build their own IT systems using the web-scale IT architecture because they do not
have enough IT capabilities to support applications of distributed storage software and complex IT
system management.
The hyper-converged architecture is a small-scale web-scale architecture. It is further optimized to
avoid web-scale complexity through integrated architecture and unified O&M. The hyper-converged
architecture delivers the same flexibility and scalability as the web-scale architecture.
What is hyper-converged infrastructure (HCI)? Here are the definitions in the industry.
HCI, similar to the web-scale infrastructure used by Google and Facebook, provides optimal efficiency, high flexibility, excellent scalability, low cost, and proven reliability. Both Arm and x86 hardware platforms are supported.
Nutanix: HCI is a conglomeration of devices consolidating not only compute, network, storage, and
server virtualization resources, but also elements such as backup software, snapshot technology, data
deduplication, and inline compression. Multiple sets of devices can be aggregated by a network to
achieve modular, seamless scale-out and form a unified resource pool. HCI is the ultimate technical
approach to implementing SDDC.
Gartner: HCI is a software-defined architecture that tightly integrates compute, storage, network, and
virtualization resources (and possibly other technologies) into a single hardware device provided by a
single vendor.
IDC: A hyper-converged system integrates core storage, computing, and storage network functions
into a single software solution or device. It is an emerging integration system.
Summary: In all definitions, virtualization plus software-defined distributed storage is the minimum
subset of HCI.
How does Huawei define HCI? What are the advantages and features of Huawei-defined HCI?
The Huawei HCI is an IT platform based on a hyper-converged architecture. It converges compute
and storage resources, and preintegrates a distributed storage engine, virtualization platform, and
cloud management software. It supports on-demand resource scheduling and linear expansion. It is
mainly used in mixed workload scenarios, such as databases, desktop clouds, containers, and
virtualization.
 Preintegration
FusionCube has all its components installed before delivery, simplifying onsite installation and
commissioning. The commissioning time is slashed from several weeks or even months to only
hours.
Preintegration includes:
Hardware preinstallation: Devices are installed in cabinets and cables are connected properly
(supported only for E9000).
Software preinstallation: The BIOS and system disk RAID are configured, and the management
software FusionCube Center and storage software FusionStorage Block are preinstalled.
Cabinet shipment: The cabinet is delivered with all devices installed (supported only for E9000).
 Compatibility with Mainstream Virtualization Platforms
FusionCube supports mainstream virtualization platforms such as VMware vSphere, and
provides unified compute, storage, and network resources for virtualization platforms.
FusionCube incorporates resource monitoring for virtualization platforms and provides unified
O&M through one management interface.
 Convergence of Compute, Storage, and Network Resources
FusionCube is prefabricated with compute, network, and storage devices in an out-of-the-box
package, eliminating the need for users to purchase extra storage or network devices.
Distributed storage engines are deployed on compute nodes (server blades) to implement the
convergence of compute and storage resources, which reduces data access delay and improves
overall access efficiency.
Automatic network deployment and network resource configuration are supported to implement
the convergence of compute and network resources. In addition, network resources are
dynamically associated with compute and storage resources.
 Distributed Block Storage
FusionCube employs FusionStorage block storage to provide distributed storage services.
FusionStorage block storage uses an innovative cache algorithm and adaptive data distribution
algorithm based on a unique parallel architecture, which eliminates high data concentration and
improves system performance. FusionStorage block storage also allows rapid and automatic self-
recovery and ensures high system availability and reliability.
1. Linear scalability and elasticity: FusionStorage block storage uses the distributed hash table
(DHT) to distribute all metadata among multiple nodes. This prevents performance
bottlenecks and allows linear expansion. FusionStorage block storage leverages an
innovative data slicing technology and a DHT-based data routing algorithm to evenly
distribute volume data to fault domains of large resource pools. This balances load across hardware devices and gives each volume higher IOPS and bandwidth (MB/s, abbreviated as MBPS below) performance.
2. High performance: FusionStorage block storage uses a lock-free scheduled I/O software
subsystem to prevent conflicts of distributed locks. The delay and I/O paths are shortened
because there is no lock operation or metadata query on I/O paths. By using distributed
stateless engines, hardware nodes can be fully utilized, greatly increasing the concurrent
IOPS and MBPS of the system. In addition, the distributed SSD cache technology and large-
capacity SAS/SATA disks (serving as the main storage) ensure high performance and large
storage capacity.
3. High reliability: FusionStorage block storage supports multiple data redundancy and
protection mechanisms, including two-copy backup and three-copy backup. FusionStorage
block storage supports the configuration of flexible data reliability policies, allowing data
copies to be stored on different servers. Data will not be lost and can still be accessed even
in case of server faults. FusionStorage block storage also protects valid data slices against
loss. If a disk or server is faulty, valid data can be rebuilt concurrently. It takes less than 30
minutes to rebuild data of 1 TB. All these measures improve system reliability.
4. Rich advanced storage functions: FusionStorage block storage provides a wide variety of
advanced functions, such as thin provisioning, volume snapshot, and linked clone. The thin
provisioning function allocates physical space to volumes only when users write data to the
volumes, providing more virtual storage resources than physical storage resources. The
volume snapshot function saves the state of the data on a logical volume at a certain time
point. The number of snapshots is not limited, and performance is not compromised. The
linked clone function is implemented based on incremental snapshots. A snapshot can be
used to create multiple cloned volumes. When a cloned volume is created, the data on the
volume is the same as the snapshot. Subsequent modifications on the cloned volume do not
affect the original snapshot and other cloned volumes.
 Automatic Deployment
FusionCube supports automatic deployment, which simplifies operations on site and increases
deployment quality and efficiency.
FusionCube supports preinstallation, preintegration, and preverification before the delivery,
which simplifies onsite installation and deployment and reduces the deployment time.
Devices are automatically discovered after the system is powered on. Wizard-based system
initialization configuration is provided for the initialization of compute, storage, and network
resources, accelerating service rollout.
An automatic deployment tool is provided to help users conveniently switch and upgrade
virtualization platforms.
 Unified O&M
FusionCube supports unified management of hardware devices (such as servers and switches)
and resources (including compute, storage, and network resources). It can greatly improve O&M
efficiency and QoS.
A unified management interface is provided to help users perform routine maintenance on
hardware devices such as chassis, servers, and switches and understand the status of compute,
storage, and network resources in a system in real time.
The IT resource usage and system operating status are automatically monitored. Alarms are
reported for system faults and potential risks in real time, and alarm notifications can be sent to
O&M personnel by email.
Rapid automatic capacity expansion is supported. Devices to be added can be automatically
discovered, and wizard-based capacity expansion configuration is supported.
 Typical Application Scenarios
Server virtualization: Integrated FusionCube virtualization infrastructure is provided without
requiring other application software.
Desktop cloud: Virtual desktop infrastructures (VDIs) or virtualization applications run on the
virtualization infrastructure to provide desktop cloud services.
Enterprise office automation (OA): Enterprise OA service applications such as Microsoft
Exchange and SharePoint run on the virtualization infrastructure.

1.4.2 Software and Hardware Architectures


First, let's look at the hardware architecture of hyper-converged storage.
Hardware can be blade servers, high-density servers, and rack servers.
 Blade Servers
Huawei E9000 is supported. It is a 12 U blade server that integrates compute nodes, switch
modules, and management modules in flexible configuration. The features are as follows:
1. A chassis can house up to 8 full-width or 16 half-width compute nodes in flexible
configuration.
2. A half-width compute node offers 850 W cooling capacity.
3. A full-width compute node offers 1700 W cooling capacity.
4. A half-width compute node supports up to 2 processors and 24 DIMMs.
5. A full-width compute node supports up to 4 processors and 48 DIMMs.
6. A chassis supports up to 32 processors and 24 TB memory.
7. The midplane delivers a maximum of 5.76 Tbit/s switch capacity.
8. The server provides two pairs of slots for switch modules, supports a variety of switching
protocols, such as Ethernet and InfiniBand, and provides direct I/O ports.
E9000 supports the following blades: 2-socket CH121 V3 compute node, 2-socket CH222 V3 compute and storage node, 2-socket CH220 V3 compute and I/O expansion node, 2-socket CH225 V3 compute and storage node, 4-socket CH242 V3 compute node, 2-socket CH121 V5 compute node, 2-socket CH225 V5 compute and storage node, and 4-socket CH242 V5 compute node.
 High-Density Servers
X6000 and X6800 servers are supported. The X6000 server provides high compute density. It
comes with four nodes in a 2 U chassis. Each node supports six 2.5-inch disks (including the
system disk), with an LOM supporting two GE and two 10GE ports and an NVMe SSD card
serving as the cache. The X6800 provides high compute and storage density. It comes with four
nodes in a 4 U chassis. Each node supports two system disks, ten 3.5-inch disks, and two rear
PCIe x8 slots.
 Rack Servers
FusionServer servers (x86) and TaiShan servers (Kunpeng) are supported. FusionCube supports
1-socket, 2-socket, and 4-socket rack servers, which can be flexibly configured based on
customer requirements.
Next, let's look at the software architecture of hyper-converged storage.
The overall architecture of hyper-converged storage consists of the hardware platform,
distributed storage software, installation, deployment, and O&M management platforms,
virtualization platforms, and backup and disaster recovery (DR) software. The virtualization
platforms can be Huawei FusionSphere and VMware vSphere. In addition, in the FusionSphere
scenario, FusionCube supports the hybrid deployment of the virtualization and database
applications.

The main components are as follows:
FusionCube Center: Manages FusionCube virtualization and hardware resources, and implements system monitoring and O&M.
FusionCube Builder: Enables quick installation and deployment of FusionCube software. It can be used to replace or update the virtualization platform software.
FusionStorage: Provides high-performance and high-reliability block storage services by using distributed storage technologies to schedule local disks on servers in an optimized manner.
Virtualization platform: Implements system virtualization management. The Huawei FusionSphere and VMware virtualization platforms are supported.
Backup: Provides the service virtualization function of backup systems, including the Huawei-developed backup software eBackup and mainstream third-party backup software such as Veeam, Commvault, and EISOO.
DR: Provides DR solutions based on active-active storage and asynchronous storage replication. The DR software includes the Huawei-developed BCManager and UltraVR.
Hardware platform: Supports E9000, X6800, X6000, and rack servers. The servers integrate compute, storage, switch, and power modules and allow on-demand configuration of compute and storage nodes. FusionCube supports GPU and SSD PCIe acceleration and expansion, as well as 10GE and InfiniBand switch modules to meet different configuration requirements.

In the traditional architecture, centralized SAN controllers create a performance bottleneck. FusionCube eliminates this bottleneck through its distributed architecture and distributed storage: each node contains both compute and storage resources, so every node acts as a distributed storage controller.
In the decoupled compute-storage architecture, all data needs to be read from and written to the
storage array through the network. As a result, the network limit becomes another bottleneck.
FusionCube removes this bottleneck by using the InfiniBand network, the fastest in the industry,
to provide 56 Gbit/s bandwidth with nodes interconnected in P2P mode.
The third bottleneck in the traditional architecture is the slow disk read/write speed. The Huawei
HCI architecture uses ES3000 SSD cards, the fastest in the industry, as the cache, which
effectively solves the problems of local disk reads/writes.
Logical structure of distributed storage: In the entire system, all modules are deployed in a
distributed and decentralized manner, which lays a solid foundation for high scalability and high
performance of the system. The functions of some key components are as follows:
1. The VBS module provides standard SCSI/iSCSI services for VMs and databases at the stateless interface layer. It plays a role similar to the controller of a traditional disk array, but whereas the number of controllers in a traditional array is limited, VBS modules can be deployed on every server that requires storage services.
2. The OSD module manages disks and is deployed on all servers with disks. It provides data
read and write for VBS and advanced storage services, including thin provisioning,
snapshot, linked clone, cache, and data consistency.
3. The MDC module manages the storage cluster status and is deployed in each cluster. It is
not involved in data processing. It collects the status of each module in the cluster in real
time and controls the cluster view based on algorithms.

1.4.3 Key Technologies


 DHT Algorithm
FusionStorage block storage employs the DHT architecture to distribute metadata onto storage
nodes according to predefined rules, preventing metadata bottlenecks caused by cross-node
access. This core architecture ensures large-scale linear expansion.
FusionStorage block storage leverages an innovative data slicing technology and a DHT-based
data routing algorithm to evenly distribute volume data to fault domains of large resource pools.
The load sharing across hardware resources enables each volume to deliver better IOPS and
MBPS performance. In addition, multiple volumes share the disks in a resource pool. Resources
can be flexibly allocated to each application as the load changes, preventing load imbalance
incurred by traditional RAID.
 Adaptive Global Deduplication and Compression
FusionStorage block storage supports global adaptive inline and post-process deduplication and compression for maximum space reduction, low TCO, and minimal resource consumption. Deduplication reduces disk write amplification because duplicate data is eliminated before it is written to disks. FusionStorage uses an opportunity table and a fingerprint table: after data enters the cache, it is divided into 8 KB fragments, and the SHA-1 algorithm is used to calculate a fingerprint for each 8 KB fragment. The opportunity table reduces the number of fingerprint entries that must be kept, thereby reducing memory cost. Deduplication is adaptive: when system resource usage reaches the threshold, inline deduplication automatically stops and data is written directly to disks for persistent storage; when system resources are idle, post-process deduplication starts. After deduplication is complete, compression starts. Compression output is aligned to 1 KB. The LZ4 algorithm is used, and HZ9 deep compression is supported for a higher compression ratio. The data reduction ratio can reach 5:1 for VDI system disks and is 1.4:1 by default in VSI and database scenarios, with performance deteriorating by less than 15%. A loose sketch of the two-table idea follows.
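The sketch below captures only the broad idea of fingerprinting 8 KB blocks and promoting repeated fingerprints from an opportunity table to a fingerprint table; the exact promotion policy, on-disk layout, and compression stage of the real system are not modeled.

```python
import hashlib

BLOCK = 8 * 1024
fingerprint_table = {}  # fingerprint -> location of a deduplicated block
opportunity_table = {}  # fingerprint -> location, seen once and not yet promoted
storage = []            # stands in for persistent block storage

def write_block(block: bytes) -> int:
    """Return the storage location, writing the block only if it is new."""
    fp = hashlib.sha1(block).hexdigest()      # fingerprint of the 8 KB fragment
    if fp in fingerprint_table:
        return fingerprint_table[fp]          # known duplicate: no new write
    if fp in opportunity_table:
        # Second occurrence: promote to the fingerprint table and deduplicate.
        fingerprint_table[fp] = opportunity_table.pop(fp)
        return fingerprint_table[fp]
    storage.append(block)                     # new unique block: write it
    opportunity_table[fp] = len(storage) - 1  # track it cheaply for now
    return len(storage) - 1

a = write_block(b"A" * BLOCK)
b = write_block(b"A" * BLOCK)
print(a == b, len(storage))  # True 1 -> the duplicate consumed no extra space
```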
 Multiple Data Security Mechanisms
EC, multi-copy, PCIe SSD cache, strong-consistency replication protocol, and storage DR can be
used to protect data by redundancy.
 Rapid Parallel Data Rebuilding
Each disk in the FusionCube distributed storage system stores multiple data blocks, whose data copies are scattered across other nodes in the system according to defined distribution rules. If the system detects a disk or server fault, it automatically repairs the data in the background. Because data copies are stored on different nodes, reconstruction runs on those nodes concurrently and each node rebuilds only a small amount of data. This prevents the performance bottleneck that would result from reconstructing a large amount of data on a single node and minimizes the impact on upper-layer services, as the toy model below illustrates.
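The toy model below only demonstrates how scattering replicas spreads the rebuild work; the node count, block count, and random placement are made-up parameters.

```python
import random
from collections import Counter

random.seed(0)
nodes = [f"node-{i}" for i in range(12)]
failed = nodes[0]

# Each block on the failed node has a replica somewhere else (toy placement).
blocks_on_failed_node = [f"blk-{i}" for i in range(1200)]
replica_location = {b: random.choice(nodes[1:]) for b in blocks_on_failed_node}

# Every surviving node rebuilds only its own share, in parallel.
work = Counter(replica_location.values())
print(max(work.values()), "blocks at most per node, instead of 1200 on one node")
```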
 Dynamic EC
EC Turbo delivers higher space utilization and provides a data redundancy solution that keeps performance and reliability stable when faults occur. Dynamic EC is supported: when a node is faulty, EC reduction preserves data redundancy and performance. As shown in the following figure, if the 4+2 EC scheme is used and a node fails, the scheme is reduced to 2+2 so that new data can still be written with full redundancy.
EC folding is supported. Although a cluster requires at least three nodes, those three nodes can still be configured with a 4+2 EC scheme by using EC folding, improving space utilization.
Incremental EC is provided. The system supports writing data increments and parity for partially full stripes, as well as append writes, such as D1+D2+D3+D4+P1+P2.
Storage utilization is high. N+2, N+3, and N+4 redundancy levels (up to 22+2) are supported, for a maximum storage utilization of about 90%, as the calculation below shows.

 Cabinet-Level Reliability
For the traditional SAN, when Cabinet 2 is faulty, services of App 2 running on Cabinet 2 are
interrupted and need to be manually recovered, as shown in the following figure.
For the hyper-converged storage, when Cabinet 1 is faulty, services are not affected because the
storage pool is shared, as shown in the following figure.

1.4.4 Application Scenarios


 Private Cloud Scenario
Unified service catalog and rich cloud service experience
Self-help service provisioning, enabling users to quickly obtain required resources
Unified display of the management information (such as alarms, topology, performance, and
report) of multiple cloud service resource pools
Unified management of physical and virtual resources, and of heterogeneous virtual resources
 LightDC ROBO Branch Solution
As companies go global, remote and branch offices are gaining importance and affecting
company revenues. However, IT solutions for such remote sites face the following challenges:
1. Lack of standardization: There are various types and large quantities of devices with low
integration, resulting in difficult management.
2. Long deployment cycle: It usually takes 30 days to deploy a site.
3. High O&M cost: Personnel at remote sites must be highly skilled at O&M operations.
4. Low line utilization: The private line utilization ratio is less than 50% at most sites.
To address the preceding challenges, Huawei introduces FusionCube. In conjunction with
FusionCube Center Vision, FusionCube offers integrated cabinets, service rollout, O&M
management, and troubleshooting services in a centralized manner. It greatly shortens the
deployment cycle, reduces the O&M cost, and improves the private line utilization.
 Cloud Infrastructure Scenario
The virtualization platforms can be Huawei FusionSphere or VMware vSphere to implement
unified management of physical resources.
 Asynchronous Replication Scenario
The Huawei asynchronous replication architecture consists of two sets of FusionCube distributed
storage that build the asynchronous replication relationship and the UltraVR or BCManager DR
management software. Data on the primary and secondary volumes is periodically synchronized based on snapshot comparison: all data generated on the primary volume after the last synchronization is written to the secondary volume in the next synchronization.
Storage DR clusters can be deployed on demand. The storage DR cluster is a logical object that
provides replication services. It manages cluster nodes, cluster metadata, replication pairs,
consistency groups, and performs data migration. The DR cluster and system service storage are
deployed on storage nodes. DR clusters offer excellent scalability. One system supports a
maximum of eight DR clusters. A single DR cluster contains three to 64 nodes. A single DR
cluster supports 64,000 volumes and 16,000 consistency groups, meeting future DR
requirements.
The UltraVR or BCManager manages DR services from the perspective of applications and
protects service VMs of the FusionCube system. It provides process-based DR service
configuration, including one-click DR test, DR policy configuration, and fault recovery
operations at the active site.
 Summary of Features
RPO within seconds without differential logs is supported, helping customers recover services
more quickly and efficiently.
Replication network type: GE, 10GE, or 25GE (TCP/IP)
Replication link between sites: It is recommended that the replication link between sites be
within 3000 km, the minimum bidirectional connection bandwidth be at least 10 Mbit/s, and the
average write bandwidth of replication volumes be less than the remote replication bandwidth.
System RPO: The minimum RPO is 15 seconds, and the maximum RPO is 2880 minutes (15
seconds for 512 volumes per system; 150 seconds for 500 volumes per node).
2 Flash Storage Technology and Application

2.1 Hyper Series Technology and Application


2.1.1 HyperSnap for Block
2.1.1.1 Overview
With the rapid development of information technologies, enterprises' business data has exploded,
making data backups more important than ever. Traditionally, mission-critical data is periodically
backed up or replicated for data protection. However, traditional data backup approaches have the
following issues:
 A large amount of time and system resources are consumed, leading to high backup costs. In
addition, the recovery time objective (RTO) and recovery point objective (RPO) for data backup
are long.
 The backup window and service suspension time are relatively long, unable to meet mission-
critical service requirements.
Note:
 RTO is the duration of time within which service data must be restored after a disaster. For
example, an RTO of 1 hour means that in case of a disaster, the service data needs to be restored
in 1 hour.
 RPO is a defined period of time in which data can be lost but services can still continue. For
example, if a service could handle an RPO of 20 minutes, it would be able to experience a
disaster, lose 20 minutes of data, and still be able to work normally.
 A backup window is the optimal time to perform a backup without seriously affecting application
operations.
Facing exponential data growth, enterprises' system administrators must shorten the backup window.
To address these backup issues, numerous data backup and protection technologies, characterized by a
short or even zero backup window, have been developed.
Snapshot is one of these data backup technologies. Like taking a photo, taking a snapshot is to
instantaneously make a point-in-time copy of the target application state, enabling zero-backup-
window data backup and thereby meeting enterprises' high business continuity and data reliability
requirements.
Snapshot can be implemented using the copy-on-write (COW) or redirect-on-write (ROW)
technology:
 COW snapshots copy data during the initial data write. This data copy will affect the storage
write performance.
 ROW snapshots do not copy data during a write operation, but frequent overwrites cause the source LUN's data to become discretely scattered, which degrades sequential read performance.
Legacy storage systems use hard disk drives (HDDs) and take COW snapshots, which introduce a
data pre-copy process resulting in a storage write performance penalty. Comparatively, Huawei
OceanStor Dorado V6 series all-flash storage systems use solid state drives (SSDs), take ROW
snapshots, and offer high random read/write performance capabilities; that is, the OceanStor Dorado
V6 series storage systems eliminate the necessity for both the data backup process and the sequential
read operations done in legacy storage snapshots, thereby delivering lossless storage read/write
performance.
HyperSnap is the snapshot feature developed by Huawei. Huawei HyperSnap creates a point-in-time
consistent copy of original data (LUN) to which the user can roll back, if and when it is needed. It
contains a static image of the source data at the data copy time point. In addition to creating snapshots
for a source LUN, the OceanStor Dorado V6 series storage systems can also create a snapshot (child)
for an existing snapshot (parent); these child and parent snapshots are called cascading snapshots.
Once created, snapshots become accessible to hosts and serve as a data backup for the source data at
the data copy time.
HyperSnap provides the following advantages:
 Supports online backup, without the need to stop services.
 Provides writable ROW snapshots with no performance compromise.
 If the source data is unchanged since the previous snapshot, the snapshot occupies no extra
storage space. If the source data has been changed, only a small amount of space is required to
store the changed data.
2.1.1.2 Working Principle
A snapshot is a copy of the source data at a point in time. Snapshots can be generated quickly and
only occupy a small amount of storage space.
2.1.1.2.1 Basic Concepts
ROW: This is a core technology used to create snapshots. When a storage system receives a write
request to modify existing data, the storage system writes the new data to a new location and directs
the pointer of the modified data block to the new location.
Data organization: The LUNs created in the storage pool of the OceanStor Dorado V6 series storage
systems consist of metadata volumes and data volumes.
Metadata volume: records the data organization information (LBA, version, and clone ID) and data
attributes. A metadata volume is organized in a tree structure.
Logical block address (LBA) indicates the address of a logical block. The version corresponds to the
snapshot time point and the clone ID indicates the number of data copies.
Data volume: stores user data written to a LUN.
Source volume: A volume that stores the source data requiring a snapshot. It is represented to users as
a source LUN or an existing snapshot.
Snapshot volume: A logical data duplicate generated after a snapshot is created for a source LUN. A
snapshot volume is represented to users as a snapshot LUN. A single LUN in the storage pool uses the
data organization form (LBA, version, or clone ID) to construct multiple copies of data with the same
LBA. The source volume and shared metadata of the snapshot volume are saved in the same shared
tree.
Snapshot copy: It copies a snapshot to obtain multiple snapshot copies at the point in time when the
snapshot was activated. If data is written into a snapshot and the snapshot data is changed, the data in
the snapshot copy is still the same as the snapshot data at the point in time when the snapshot was
activated.
Snapshot cascading: Snapshot cascading is to create snapshots for existing snapshots. Different from a
snapshot copy, a cascaded snapshot is a consistent data copy of an existing snapshot at a specific point
in time, including the data written to the source snapshot. In comparison, a snapshot copy preserves
the data at the point in time when the source snapshot was activated, excluding the data written to the
source snapshot. The system supports a maximum of eight levels of cascaded snapshots.
Snapshot consistency group: Protection groups ensure data consistency between multiple associated
LUNs. OceanStor Dorado V6 supports snapshots for protection groups. That is, snapshots are
simultaneously created for each member LUN in a protection group.
Snapshot consistency groups are mainly used by databases. Typically, databases store different types
of data on different LUNs (such as the online redo log volume and data volume), and these LUNs are
associated with each other. To back up the databases using snapshots, the snapshots must be created
for these LUNs at the same time point, so that the data is complete and available for database
recovery.
2.1.1.2.2 Implementation
ROW is a core technology used for snapshot implementation. The working principle is as follows:
Creating a snapshot: After a snapshot is created and activated, a data copy that is identical to the
source LUN is generated. Then the storage system copies the source LUN's pointer to the snapshot so
that the snapshot points to the storage location of the source LUN's data. This enables the source LUN
and snapshot to share the same LBA.
Writing data to the source LUN: When an application server writes data to the source LUN after the
snapshot is created, the storage system uses ROW to save the new data to a new location in the
storage pool and directs the source LUN's pointer to the new location. The pointer of the snapshot still
points to the storage location of the original source data, so the source data at the snapshot creation
time is saved.
Reading snapshot data: After a snapshot is created, client applications can access the snapshot to read
the source LUN's data at the snapshot creation time. The storage system uses the pointer of the
snapshot to locate the requested data and returns it to the client.
Figure 2-1 shows the metadata distribution in the source LUN before a snapshot is created.

Figure 2-1 Metadata distribution in the source LUN


Figure 2-2 shows an example snapshot read and write process, covering one snapshot period (during
which only one snapshot is created) serving as an example.
Figure 2-2 Read and write process of a snapshot


1. Both the original volume (source LUN) and snapshot use a mapping table to access the physical
space. The original data in the source LUN is ABCDE and is saved in sequence in the physical
space. Before the original data is modified, the mapping table for the snapshot is empty. All read
requests to the snapshot are redirected to the source LUN.
2. When the source LUN receives a write request that changes C to F, the new data is written into a
new physical space P5 instead of being overwritten in P2.
3. After the data is written into the new physical space, the L2->P2 entry is added to the mapping
table of the snapshot. When the logical address L2 of the snapshot is read subsequently, the read
request will not be redirected to the source LUN. Instead, the requested data is directly read from
physical space P2.
4. In the mapping table of the source LUN, the system changes L2->P2 to L2->P5. Data in the
source LUN is changed to ABFDE and data in the snapshot is still ABCDE.
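The four numbered steps above can be reproduced with a small toy model; the five-slot physical space, the dictionary mapping tables, and the single-snapshot case mirror the ABCDE example only and are not the product's metadata structures.

```python
# Physical space and per-volume mapping tables (logical address -> physical slot).
physical = {"P0": "A", "P1": "B", "P2": "C", "P3": "D", "P4": "E"}
source_map = {f"L{i}": f"P{i}" for i in range(5)}  # source LUN holds ABCDE
snapshot_map = {}                                  # empty right after activation

def read(volume_map: dict, lba: str) -> str:
    # An address missing from the snapshot's table is redirected to the source LUN.
    return physical[volume_map.get(lba, source_map[lba])]

def write_source(lba: str, data: str) -> None:
    # Redirect-on-write: keep the old block for the snapshot, write to new space.
    snapshot_map.setdefault(lba, source_map[lba])  # snapshot records the old pointer
    new_slot = f"P{len(physical)}"                 # allocate new physical space
    physical[new_slot] = data
    source_map[lba] = new_slot                     # source LUN now points to new data

write_source("L2", "F")  # change C to F, as in step 2
print("".join(read(source_map, f"L{i}") for i in range(5)))    # -> ABFDE
print("".join(read(snapshot_map, f"L{i}") for i in range(5)))  # -> ABCDE
```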
Figure 2-3 shows how the metadata of the source LUN and snapshot is distributed after the data
changes.
Figure 2-3 Metadata distribution in the source LUN and snapshot


HyperSnap supports quick recovery of the source LUN's data. If data on a source LUN is lost due to
accidental deletion, corruption, or virus attacks, you can roll back the source LUN to the point in time
when the snapshot was created, minimizing data loss.
2.1.1.3 Application Scenarios
HyperSnap can be used in various scenarios, including rapid data backup and restoration, continuous
data protection, and repurposing of backup data.
Rapid data backup and restoration by working with BCManager eReplication: (Only snapshots of
source LUNs support this scenario.)
BCManager eReplication is a piece of host software developed by Huawei. Running on an application
server, the software works with value-added features provided by storage systems to implement
backup, protection, and disaster recovery for mission-critical data of mainstream application systems.
HyperSnap enables you to complete data backup and restoration within a few seconds. Working with
BCManager eReplication, HyperSnap allows you to configure snapshot policies more flexibly,
improving operation efficiency and ensuring consistency among multiple snapshots.
You can roll back a source LUN using a snapshot or reconstruct a snapshot as required.
 If the data on a source LUN encounters non-physical damage due to a virus, or if data is
accidentally deleted or overwritten, you can use a snapshot to roll back the source LUN to the
data state at a specific snapshot activation point in time.
 If data on a source LUN is changed, you can reconstruct the snapshot of the source LUN to make
the snapshot quickly synchronize data changes made to the source LUN.
Repurposing of backup data:
LUNs serve different purposes in different service scenarios, such as report generation, data testing,
and data analysis. If multiple application servers write data to a LUN simultaneously, changes to the
data may adversely affect services on these application servers. Consequently, the data testing and
analysis results may be inaccurate.
OceanStor Dorado V6 supports the creation of multiple duplicates of a snapshot LUN. The duplicates
are independent of each other and have the same attributes as the snapshot LUN, so they can be used
by different application servers. In this way, data on the source LUN is protected while the snapshot
LUN can be used for other purposes.
2.1.2 HyperSnap for File


2.1.2.1 Overview
A file system snapshot is an available point-in-time copy of a source file system. Application servers
can read file system snapshots.
HyperSnap for file systems has the following advantages:
 Supports online backup, without the need to stop services.
 If the source file system is unchanged since the previous snapshot, the snapshot occupies no
extra storage space. If the source file system has been changed, only a small amount of space is
required to store the changed data.
 Enables quick restoration of the source data at multiple points in time.
2.1.2.2 Working Principle
A file system snapshot is a point-in-time copy of a source file system. File system snapshots are
generated quickly and only occupy a small amount of storage space.
2.1.2.2.1 Basic Concepts
Source file system: A file system that contains the source data for which a snapshot needs to be
created.
Read-only file system snapshot: A point-in-time copy of a source file system. Based on NFS sharing,
an application server can read a file system snapshot.
Block pointer (BP): Metadata that records storage locations of data blocks in a file system.
ROW: A core technology used to create file system snapshots. When a source file system receives a
write request to modify existing data, the storage system writes the new data to a new location and
directs the BP of the modified data block to the new location. Figure 2-1 illustrates ROW
implementation.

Figure 2-1 ROW implementation


2.1.2.2.2 Implementation
An application server can access a file system snapshot to read the data of the source file system at the
point in time when the snapshot was created.
ROW is the core technology used to create file system snapshots. When a source file system receives
a write request to modify existing data, the storage system writes the new data to a new location and
directs the BP of the modified data block to the new location. The BP of the file system snapshot still
points to the original data of the source file system. That is, a file system snapshot always preserves
the original state of the source file system.
Figure 2-1 shows the process of reading a file system snapshot (one snapshot is created in this
example).
Figure 2-1 Reading a file system snapshot


 Creating a snapshot
After a snapshot is created for a file system, a data copy that is identical to the source file system
is generated. Then the storage system copies the source file system's BPs to the snapshot so that
the snapshot points to the storage locations of the source file system's data. In addition, the
storage system reserves space in the source file system to store data of the file system snapshot. If
files in the source file system are modified or deleted, the system retains these files in the
reserved space to keep the snapshot consistent with the state of the source file system at the
snapshot creation time.
 Writing data to the source file system
When an application server writes data to the source file system after the snapshot is created, the
storage system uses ROW to save the new data to a new location in the storage pool and directs
the source file system's BP to the new location. The BP of the snapshot still points to the storage
location of the original source data, so the source data at the snapshot creation time is saved.
 Reading the snapshot
After a file system snapshot is created, it can be shared to clients using NFS. Client applications
can access the snapshot to read the source file system's data at the snapshot creation time. The
storage system uses the BPs of the snapshot to locate the requested data and returns it to the
client.
HyperSnap for file systems supports quick recovery of the source file system's data. If data in a
source file system is lost due to accidental deletion, corruption, or virus attacks, you can roll back
the source file system to the point in time when the snapshot was created, minimizing data loss.
Figure 2-2 illustrates the rollback process.
Figure 2-2 Rolling back a source file system


2.1.2.3 Application Scenarios
HyperSnap for file systems is applicable to scenarios such as data protection and emulation testing
(for example, data analysis and archiving).
2.1.2.3.1 Data Protection
HyperSnap can create multiple snapshots for a source file system at different points in time and
preserve multiple recovery points for the file system. If data in a source file system is lost due to
accidental deletion, corruption, or virus attacks, a snapshot created before the loss can be used to
recover the data, ensuring data security and reliability. Figure 2-1 shows a schematic diagram of
HyperSnap used for data protection.
Figure 2-1 HyperSnap used for data protection


2.1.2.3.2 Emulation Testing
After snapshots are created for a file system at different points in time and shared to clients, the clients
can access these snapshots to read the source file system's data at these time points, facilitating
testing, data analysis, and archiving. In this way, data of the production system is protected while
backup data is fully utilized. Figure 2-1 shows a schematic diagram of HyperSnap used for emulation
testing.
Figure 2-1 HyperSnap used for emulation testing

2.1.3 HyperReplication
2.1.3.1 Overview
As digitalization advances in various industries, data has become critical to the operation of
enterprises, and customers impose increasingly demanding requirements on stability of storage
systems. Although some storage devices offer extremely high stability, they fail to prevent
irrecoverable damage to production systems upon natural disasters. To ensure continuity,
recoverability, and high reliability of service data, remote DR solutions emerge. The remote
replication technology is one of the key technologies used in remote DR solutions.
HyperReplication is a core technology for remote DR and backup of data.
It supports the following replication modes:
 Synchronous remote replication
In this mode, data is synchronized between two storage systems in real time to achieve full
protection for data consistency, minimizing data loss in the event of a disaster.
 Asynchronous remote replication
In this mode, data is synchronized between two storage systems periodically to minimize the service performance deterioration caused by the latency of long-distance data transmission. A simplified contrast of the two modes is sketched below.
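The sketch below contrasts the two modes at the level of a single write and a single replication cycle; the dictionaries standing in for the two sites, the in-process "remote" write, and the delta computation are illustrative assumptions, not the replication protocol itself.

```python
primary, secondary = {}, {}

def synchronous_write(key: str, value: str) -> str:
    """Synchronous mode: the host is acknowledged only after both sites hold the data."""
    primary[key] = value
    secondary[key] = value  # stands in for the remote write and its acknowledgement
    return "ack"

def asynchronous_cycle() -> int:
    """Asynchronous mode: periodically ship only the data changed since the last cycle."""
    delta = {k: v for k, v in primary.items() if secondary.get(k) != v}
    secondary.update(delta)  # at most one cycle of changes can be lost in a disaster
    return len(delta)

synchronous_write("blk1", "v1")  # replicated immediately
primary["blk2"] = "v2"           # a local write not yet replicated
print(asynchronous_cycle())      # -> 1 changed block shipped in this period
```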
2.1.3.2 Working Principles
2.1.3.2.1 Basic Concepts
This section describes basic concepts related to HyperReplication, including pair, consistency group,
synchronization, splitting, primary/secondary switchover, data status, and writable secondary LUN.
To enable service data backup and recovery on the secondary storage system, a remote replication
task is implemented in four phases, as shown in Figure 2-1.
Figure 2-1 Implementation of a remote replication task


 Pair
A pair is the relationship between a primary logical unit number (LUN) and a secondary LUN in
a remote replication task. In remote replication, data can be synchronized only from the primary
LUN to the secondary LUN through a remote replication link. Before data synchronization, a pair
must be established between two LUNs. To be paired, the primary and secondary LUNs must be
in different storage systems, namely, primary storage system and secondary storage system.
The running status of a pair may change throughout the implementation of a remote replication
task. By viewing the running status of a pair, you can determine whether the pair requires further
actions and, if so, what operation is required. After performing an operation, you can view the
running status of the pair to check whether the operation has succeeded. Table 2-1 describes the
running status of a pair involved in a remote replication task.

Table 2-1 Running status of a remote replication pair


Normal: Data synchronization between the primary and secondary LUNs in the pair is complete. Note: If you set Initial Synchronization to "The data on primary and secondary resources is consistent and data synchronization is not required" when creating a remote replication pair, the status of the pair becomes Normal as soon as the pair is created.
Split: Data replication between the primary and secondary LUNs in the pair is suspended. The running status of a pair changes to Split after the primary and secondary LUNs are manually split.
Interrupted: The running status of a remote replication pair changes to Interrupted after the pair relationship between the primary and secondary LUNs is interrupted. This occurs when the links used by a remote replication task are down or either LUN fails.
To be recovered: If a remote replication pair requires restoration using a manual policy after the fault that caused a pair interruption is rectified, the pair running status changes to To be recovered. This status reminds users to perform manual data synchronization between the primary and secondary LUNs to restore the pair relationship between them.
Invalid: If the properties of a remote replication pair are changed at the primary or secondary site after the pair is interrupted, the running status of the pair becomes Invalid because the configurations of the pair at the primary and secondary sites are inconsistent.
Synchronizing: The running status of a remote replication pair is Synchronizing when data is being synchronized from the primary LUN to the secondary LUN. In this state, data on the secondary LUN is unavailable and cannot be used for data recovery if a disaster occurs. Only a secondary LUN in the consistent state can be used to recover data of the primary LUN.

 Data status
By determining data differences between the primary and secondary LUNs in a remote
replication pair, HyperReplication identifies the data status of the pair. If a disaster occurs,
HyperReplication determines whether a primary/secondary switchover is allowed based on the
data status of the pair. The data status values are Consistent and Inconsistent.
 Writable secondary LUN
A writable secondary LUN refers to a secondary LUN to which host data can be written. After
HyperReplication is configured, the secondary LUN is read-only by default. If the primary LUN
is faulty, the administrator can cancel write protection for the secondary LUN and set the
secondary LUN to writable. In this way, the secondary storage system can take over host
services, ensuring service continuity. The secondary LUN can be set to writable in the following
scenarios:
− The primary LUN fails and the remote replication links are in disconnected state.
− The primary LUN fails but the remote replication links are in normal state. The pair must be
split before you enable the secondary LUN to be writable.
 Consistency group
A consistency group is a collection of pairs that have a service relationship with each other. For
example, the primary storage system has three primary LUNs that respectively store service data,
logs, and change tracking information of a database. If data on any of the three LUNs becomes
invalid, all data on the three LUNs becomes unusable. For the pairs in which these LUNs exist,
you can create a consistency group. Upon actual configuration, you need to create a consistency
group and then manually add pairs to the consistency group.
 Synchronization
Synchronization is a process of copying data from the primary LUN to the secondary LUN.
Synchronization can be performed for a single remote replication pair or for multiple remote
replication pairs in a consistency group at the same time.
Synchronization of a remote replication pair involves initial synchronization and incremental
synchronization.
After an asynchronous remote replication pair is created, initial synchronization is performed to
copy all data from the primary LUN to the secondary LUN. After the initial synchronization is
complete, if the remote replication pair is in normal state, incremental data will be synchronized
from the primary LUN to the secondary LUN based on the specified synchronization mode
(manual or automatic). If the remote replication pair is interrupted due to a fault, incremental data
will be synchronized from the primary LUN to the secondary LUN based on the specified
recovery policy (manual or automatic) after the fault is rectified.
After a synchronous remote replication pair is created, initial synchronization is performed to
copy all data from the primary LUN to the secondary LUN. After the initial synchronization is
complete, if the remote replication pair is in normal state, host I/Os will be written into both the
primary and secondary LUNs, not requiring data synchronization. If the remote replication pair is
interrupted due to a fault, incremental data will be synchronized from the primary LUN to the
secondary LUN based on the specified recovery policy (manual or automatic) after the fault is
rectified.
 Splitting
Splitting is a process of stopping data synchronization between primary and secondary LUNs.
This operation can be performed only by an administrator. Splitting can be performed for a single
remote replication pair or multiple remote replication pairs in a consistency group at one time.
After the splitting, the pair relationship between the primary LUN and the secondary LUN still
exists and the access permission of hosts for the primary and secondary LUNs remains
unchanged.
At some time, for example when the bandwidth is insufficient to support critical services, you
probably do not want to synchronize data from the primary LUN to the secondary LUN in a
remote replication pair. In such cases, you can split the remote replication pair to suspend data
synchronization.
You can effectively control the data synchronization process of HyperReplication by performing
synchronization and splitting.
 Primary/secondary switchover
A primary/secondary switchover is a process of exchanging the roles of the primary and
secondary LUNs in a pair relationship. You can perform a primary/secondary switchover for a
single remote replication pair or for multiple remote replication pairs in a consistency group at
the same time. A primary/secondary switchover is typically performed in the following scenarios:
After the primary site recovers from a disaster, the remote replication links are re-established and
data is synchronized between the primary and secondary sites.
When the primary storage system requires maintenance or an upgrade, services at the primary
site must be stopped, and the secondary site takes over the services.
 Link compression
Link compression is an inline compression technology. In an asynchronous remote replication
task, data is compressed on the primary storage system before transfer. Then the data is
decompressed on the secondary storage system, reducing bandwidth consumption in data
transfer. Link compression has the following highlights:
− Inline data compression
Data is compressed when being transferred through links.
− Intelligent compression
The system preemptively determines whether data can be compressed, preventing
unnecessary compression and improving transfer efficiency.
− High reliability and security
The lossless compression technology is used to ensure data security. Multiple check
methods are used to ensure data reliability. After receiving data, the secondary storage
system verifies data correctness and checks data consistency after the data is decompressed.
− User unawareness
Link compression does not affect services running on the hosts and is transparent to users.
− Compatibility with full backup and incremental backup
Link compression compresses all data that is transferred over the network regardless of
upper-layer services.
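The idea of intelligent inline compression with post-decompression verification can be sketched as follows. This is a minimal illustration in Python with invented function names, not Huawei's implementation; zlib and SHA-256 simply stand in for the actual compression and check algorithms.

import zlib
import hashlib

def send_block(raw: bytes) -> dict:
    """Prepare a replication block for transfer with inline link compression."""
    compressed = zlib.compress(raw)
    # Intelligent compression: fall back to the raw data if compression does not help.
    use_compressed = len(compressed) < len(raw)
    return {
        "payload": compressed if use_compressed else raw,
        "compressed": use_compressed,
        "checksum": hashlib.sha256(raw).hexdigest(),  # used by the receiver to verify the data
    }

def receive_block(block: dict) -> bytes:
    """Decompress (if needed) and verify the block on the secondary storage system."""
    data = zlib.decompress(block["payload"]) if block["compressed"] else block["payload"]
    if hashlib.sha256(data).hexdigest() != block["checksum"]:
        raise ValueError("data verification failed after decompression")
    return data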
 Protected object
For customers, the protected objects are LUNs or protection groups. That is, HyperReplication is
configured for LUNs or protection groups for data backup and disaster recovery.
LUN: Data protection can be implemented for each individual LUN.
Protection group: Data protection can be implemented for a protection group, which consists of
multiple independent LUNs or a LUN group.
How to distinguish a protection group and a LUN group:
A LUN group applies to mapping scenarios in which the LUN group can be directly mapped to a
host. You can group LUNs for different hosts or applications.
A protection group applies to data protection with consistency groups. You can plan data
protection policies for different applications and components in the applications. In addition, you
can enable the LUNs used by multiple applications in the same protection scenario to be
protected in a unified manner. For example, you can group the LUNs to form a LUN group, map
the LUN group to a host or host group, and create a protection group for the LUN group to
implement unified data protection of the LUNs used by multiple applications in the same
protection scenario.
2.1.3.2.2 Implementation
 Data replication
Data replication is a process of writing service data generated by hosts to the secondary LUNs in
the secondary storage system. The writing process varies depending on the remote replication
mode. This section describes data replication performed in synchronous and asynchronous
remote replication modes.
 Writing process in synchronous remote replication
Synchronous remote replication replicates data in real time from the primary storage system to
the secondary storage system. The characteristics of synchronous remote replication are as
follows:
After receiving a write I/O request from a host, the primary storage system sends the request to
the primary and secondary LUNs.
The data write result is returned to the host only after the data is written to both primary and
secondary LUNs. If data fails to be written to the primary LUN or secondary LUN, the primary
LUN or secondary LUN returns a write I/O failure to the remote replication management module.
Then, the remote replication management module changes the mode from dual-write to single-
write, and the remote replication pair is interrupted. In this case, the data write result is
determined by whether the data is successfully written to the primary LUN and is irrelevant to
the secondary LUN.
After a synchronous remote replication pair is created between a primary LUN and a secondary
LUN, you need to manually perform synchronization so that data on the two LUNs is consistent.
Every time a host writes data to the primary storage system after synchronization, the data is
copied from the primary LUN to the secondary LUN of the secondary storage system in real
time.
The specific process is as follows:
1. Initial synchronization
After a remote replication pair is created between a primary LUN on the primary storage
system at the production site and a secondary LUN on the secondary storage system at the
DR site, initial synchronization is started.
All data on the primary LUN is copied to the secondary LUN.
During initial synchronization, if the primary LUN receives a write request from a host and
data is written to the primary LUN, the data is also written to the secondary LUN.
2. Dual-write
After initial synchronization is complete, the data on the primary LUN is the same as that on
the secondary LUN. Then an I/O request is processed as follows:
Figure 2-1 shows how synchronous remote replication processes a write I/O request.
Figure 2-1 Writing process in synchronous remote replication mode
a. The primary storage system at the production site receives the write request.
HyperReplication records the write request in a log. The log contains the address
information instead of the specific data.
b. The write request is written to both the primary and secondary LUNs. Generally, the
LUNs are in write back status. The data is written to the primary cache and secondary
cache.
c. HyperReplication waits for the primary and secondary LUNs to return the write result.
If data write to the secondary LUN times out or fails, the remote replication pair
between the primary and secondary LUNs is interrupted. If data write succeeds, the log
is cleared. Otherwise, the log is stored in the DCL, and the remote replication pair is
interrupted. In the follow-up data synchronization, the data block to which the address
of the log corresponds will be synchronized.
d. HyperReplication returns the data write result to the host. The data write result of the
primary LUN prevails.
LOG: data write log
DCL: data change log
Note:
The DCL is stored on all disks and all DCL data has three copies for protection. Storage
system logs are stored on coffer disks.
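The dual-write flow in steps a to d can be condensed into a toy sketch. The data structures below (dictionaries for caches, sets for the log and DCL) are assumptions for illustration and do not reflect the storage system's internal interfaces.

def write_cache(cache, lba, data):
    """Toy cache write that always succeeds; a real write may time out or fail."""
    cache[lba] = data
    return True

def sync_replication_write(lba, data, primary_cache, secondary_cache, dcl):
    """Illustrative dual-write for one host I/O in synchronous remote replication."""
    log = {lba}                                              # step a: record the address only
    ok_primary = write_cache(primary_cache, lba, data)       # step b: write to the primary and
    ok_secondary = write_cache(secondary_cache, lba, data)   #         secondary caches
    if ok_primary and ok_secondary:                          # step c: both writes succeeded
        log.discard(lba)
        pair_interrupted = False
    else:
        dcl.add(lba)               # keep the changed address for follow-up synchronization
        log.discard(lba)
        pair_interrupted = True    # the pair is interrupted; single-write continues
    return ok_primary, pair_interrupted                      # step d: the primary result prevails

# Example: ok, interrupted = sync_replication_write(2, b"F", {}, {}, set())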
 Writing process in asynchronous remote replication
Asynchronous remote replication periodically replicates data from the primary storage system to
the secondary storage system. The characteristics of asynchronous remote replication are as
follows:
Asynchronous remote replication relies on the snapshot technology. A snapshot is a point-in-time
copy of source data.
When a host successfully writes data to a primary LUN, the primary storage system returns a
response to the host declaring the successful write.
Data synchronization is triggered manually or automatically at preset intervals to ensure data
consistency between the primary and secondary LUNs.
HyperReplication in asynchronous mode adopts the multi-time-segment caching technology. The
working principle of the technology is as follows:
1. After an asynchronous remote replication relationship is set up between primary and
secondary LUNs, the initial synchronization begins by default. The initial synchronization
copies all data from the primary LUN to the secondary LUN to ensure data consistency.
2. After the initial synchronization is complete, the secondary LUN data status becomes
consistent (data on the secondary LUN is a copy of data on the primary LUN at a certain
past point in time). Then the I/O process shown in the following figure starts. Figure 2-2
shows the writing process in asynchronous remote replication mode.
Figure 2-2 Writing process in asynchronous remote replication mode
a. When a synchronization task starts in asynchronous remote replication, a snapshot is generated on both the primary and secondary LUNs (snapshot X on the primary LUN and snapshot X - 1 on the secondary LUN), and the point in time is updated.
b. New data from the host is stored to the cache of the primary LUN at the X + 1 point in
time.
c. A response is returned to the host, indicating that the data write is complete.
d. The differential data that is stored on the primary LUN at the X point in time is copied
to the secondary LUN based on the DCL.
e. The primary LUN and secondary LUN store the data that they have received to disks.
After data is synchronized from the primary LUN to the secondary LUN, the latest data
on the secondary LUN is a full copy of data on the primary LUN at the X point in time.
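The multi-time-segment idea can be sketched as one synchronization cycle. Plain dictionaries stand in for the LUNs and a set stands in for the DCL; the names are assumptions for illustration only.

def host_write(primary, dcl, lba, data):
    """Steps b and c: the write lands in the primary cache and is acknowledged immediately."""
    primary[lba] = data
    dcl.add(lba)                        # remembered for the next synchronization cycle
    return "write complete"

def async_sync_cycle(primary, secondary, dcl):
    """Steps a, d, and e of one asynchronous replication cycle."""
    snapshot_x = dict(primary)          # step a: freeze primary data at time point X
    changed = set(dcl)
    dcl.clear()                         # new writes now belong to time segment X + 1
    for lba in changed:                 # step d: copy only the differential data
        secondary[lba] = snapshot_x[lba]
    return snapshot_x                   # the secondary LUN now matches the primary at point X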
 Service switchover
When a disaster occurs at the primary site, HyperReplication enables the secondary site to
quickly take over services from the primary site to ensure service continuity.
HyperReplication not only implements remote data backup but also recovers services as soon as
possible in the event of a disaster to keep service continuity. The following two indicators need to
be considered before a service switchover:
− RPO: The maximum acceptable time period prior to a failure or disaster during which
changes to data may be lost as a consequence of recovery. Data changes preceding the
failure or disaster by at least this time period are preserved by recovery. Synchronous
remote replication copies data from a primary LUN to a secondary LUN in real time,
ensuring that the RPO is zero. Zero is a valid value and is equivalent to a "zero data loss"
requirement. A remote DR system built based on synchronous remote replication
implements data-level DR. In asynchronous remote replication scenarios, the RPO is the
time period that you set for the synchronization interval.
− RTO: The maximum acceptable time period required to bring one or more applications and
associated data back from an outage to a correct operational state. The indicated recovery
time serves as the objective and ensures that the standby host takes over services as quickly
as possible. RTO depends on host services and disasters in remote replication scenarios.
Choose a remote replication mode based on the RPO and RTO requirements of users.
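As a rough illustration of that choice, the helper below encodes the guidance in this section (an RPO of zero and a distance within 300 km suggest synchronous replication; otherwise asynchronous). The function name and inputs are assumptions for teaching purposes, not product specifications.

def choose_replication_mode(rpo_seconds: float, distance_km: float) -> str:
    """Toy decision helper based on the RPO and distance guidance above."""
    if rpo_seconds == 0 and distance_km <= 300:
        return "synchronous remote replication"
    return "asynchronous remote replication"

print(choose_replication_mode(0, 100))     # same-city DR with zero data loss -> synchronous
print(choose_replication_mode(900, 2000))  # cross-region DR, RPO = sync interval -> asynchronous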
Service switchover through remote replication:
Services can run on the secondary storage system only when the following conditions are met:
− Before a disaster occurs, data in the primary LUN is consistent with that in the secondary
LUN. If data in the secondary LUN is incomplete, services may fail to be switched.
− Services on the production host have also been configured on the standby host.
− The secondary storage system allows a host to access a LUN in a LUN group mapped to the
host.
When a disaster occurs at the primary site, the remote replication links between the primary LUN
and the secondary LUN go down. If this occurs, an administrator needs to manually change the
access permission of the secondary LUN to writable to enable a service switchover. Figure 2-3
shows how a service switchover is implemented through remote replication.
Figure 2-3 Service switchover through remote replication
 Data recovery
If the primary site fails, the secondary site takes over its services. When the primary site recovers,
it takes control of those services again.
After the primary site recovers from a disaster, it is required to rebuild a remote replication
relationship between the primary and secondary storage systems. You can use the data of the
secondary site to recover that of the primary site. Figure 2-4 shows how the storage system
recovers data at the primary site after a disaster.
Figure 2-4 Process of recovering data at the primary site after a disaster
 Functions of a consistency group
In medium- and large-sized database applications, data, logs, and change records are stored on
associated LUNs of storage systems. The data correlation between those LUNs is ensured by
upper-layer host services at the primary site. When data is replicated to the secondary site, the
data correlation must be maintained. Otherwise, the data at the secondary site cannot be used to
recover services. To maintain the data correlation, you can add the remote replication pairs of
those LUNs to the same consistency group. This section compares storage systems running a
consistency group with storage systems not running a consistency group to show you how a
consistency group ensures service continuity.
Users can perform synchronization, splitting, and primary/secondary switchovers for a single
remote replication pair or perform these operations for multiple remote replication pairs using a
consistency group. Note the following when using a consistency group:
− Remote replication pairs can be added to a consistency group only on the primary storage
system. In addition, secondary LUNs in all remote replication pairs must reside in the same
remote storage system.
− LUNs in different remote replication pairs in a consistency group can belong to different
working controllers.
− Remote replication pairs in one consistency group must work in the same remote replication
mode.
2.1.3.3 Application Scenarios
HyperReplication is used for data backup and DR by working with BCManager eReplication. The
typical application scenarios include central backup and DR as well as 3DC.
Different remote replication modes apply to different application scenarios.
 Synchronous remote replication
Applies to backup and DR scenarios where the primary site is very near to the secondary site, for
example, in the same city (same data center or campus).
 Asynchronous remote replication
Applies to backup and DR scenarios where the primary site is far from the secondary site (for
example, across countries or regions) or the network bandwidth is limited.
Table 2-1 Typical application scenarios of HyperReplication
Scenario characteristics
 Central Backup and DR:
− Backup data is managed centrally so that data analysis and data mining can be performed without affecting services.
− When a disaster occurs at a service site, the central backup site can quickly take over services and recover data, achieving unified service data management.
− The remote replication mode can be selected for a service site flexibly based on the distance between the service site and the central backup site.
 3DC:
− Three data centers (DCs) are deployed in two locations. Real-time backup and remote backup are available concurrently.
− Service data is backed up to the intra-city DR center in real time through a high-speed link.
− When data in the production center is unavailable, services are quickly switched to the intra-city DR center.
− If a large-scale disaster occurs at both the production center and the DR center in the same city, the remote DR center can take over services and implement DR.
Remote replication mode
 Central Backup and DR: synchronous remote replication or asynchronous remote replication
 3DC: synchronous or asynchronous remote replication between the same-city sites; asynchronous remote replication to the remote site
Maximum distance between primary and secondary sites
 Central Backup and DR: within 300 km for synchronous remote replication; no restriction for asynchronous remote replication
 3DC: within 300 km for synchronous remote replication; no restriction for asynchronous remote replication
2.1.4 HyperMetro
2.1.4.1 Overview
HyperMetro is Huawei's active-active storage solution that enables two storage systems to process
services simultaneously, establishing a mutual backup relationship between them. If one storage
system malfunctions, the other one will automatically take over services without data loss or
interruption. With HyperMetro being deployed, you do not need to worry about your storage systems'
inability to automatically switch over services between them and will enjoy rock-solid reliability,
enhanced service continuity, and higher storage resource utilization.
Huawei's active-active solution supports both single-data center (DC) and cross-DC deployments.
 Single-DC deployment
In this mode, the active-active storage systems are deployed in two equipment rooms in the same
campus.
Hosts are deployed in a cluster and communicate with storage systems through a switched fabric
(Fibre Channel or IP). Dual-write mirroring channels are deployed on the storage systems to
ensure continuous operation of active-active services.
Figure 2-1 shows an example of the single-DC deployment mode.
Figure 2-1 Single-DC deployment
 Cross-DC deployment
In this mode, the active-active storage systems are deployed in two DCs in the same city or in
two nearby cities. The distance between the two DCs is within 300 km. Both of the DCs
can handle service requests concurrently, thereby accelerating service response and improving
resource utilization. If one DC fails, its services are automatically switched to the other DC.
In cross-DC deployment scenarios involving long-distance transmission (≥ 25 km for Fibre
Channel; ≥ 80 km for IP), dense wavelength division multiplexing (DWDM) devices must be
used to ensure a short transmission latency. In addition, mirroring channels must be deployed
between the active-active storage systems for data synchronization.
Figure 2-2 shows an example of the cross-DC deployment mode.
Figure 2-2 Cross-DC deployment
Highlights:
 Dual-write ensures data redundancy on the storage systems. In the event of a storage system or
DC failure, services are switched over to the other DC at zero RTO and RPO, ensuring service
continuity without any data loss.
 Both DCs provide services simultaneously, fully utilizing DR resources.
 The active-active solution ensures 24/7 service continuity. The gateway-free design reduces
potential fault points, further enhancing system reliability.
 HyperMetro can work with HyperReplication to form a geo-redundant layout comprising three
DCs.
With these highlights, HyperMetro is well suited to various industries such as health care, finance,
and social insurance.
2.1.4.2 Working Principle
2.1.4.2.1 Basic Concepts
Protected object: For customers, the protected objects are LUNs or protection groups. That is,
HyperMetro is configured for LUNs or protection groups for data backup and disaster recovery.
LUN: Data protection can be implemented for each individual LUN.
Protection group: Data protection can be implemented for a protection group, which consists of
multiple independent LUNs or a LUN group.
How to distinguish a protection group and a LUN group:
A LUN group can be directly mapped to a host for the host to use storage resources. You can group
LUNs for different hosts or applications.
A protection group applies to data protection with consistency groups. You can plan data protection
policies for different applications and components in the applications. In addition, you can enable
unified protection for LUNs used by multiple applications in the same protection scenario. For
example, you can group the LUNs to form a LUN group, map the LUN group to a host or host group,
and create a protection group for the LUN group to implement unified data protection of the LUNs
used by multiple applications in the same protection scenario.
HyperMetro domain: A HyperMetro domain allows application servers to access data across DCs. It
consists of a quorum server and the local and remote storage systems.
HyperMetro pair: A HyperMetro pair is created between a local and a remote LUN within a
HyperMetro domain. The two LUNs in a HyperMetro pair have an active-active relationship. You can
examine the state of the HyperMetro pair to determine whether operations such as synchronization,
suspension, or priority switchover are required by its LUNs and whether such an operation is
performed successfully.
HyperMetro consistency group: A HyperMetro consistency group is created based on a protection
group. It is a collection of HyperMetro pairs that have a service relationship with each other. For
example, the service data, logs, and change tracking information of a medium- or large-size database
are stored on different LUNs of a storage system. Placing these LUNs in a protection group and then
creating a HyperMetro consistency group for that protection group can preserve the integrity of their
data and guarantee write-order fidelity.
Dual-write: Dual-write enables the synchronization of application I/O requests with both local and
remote LUNs.
DCL: DCLs record changes to the data in the storage systems.
Synchronization: HyperMetro synchronizes differential data between the local and remote LUNs in a
HyperMetro pair. You can also synchronize data among multiple HyperMetro pairs in a consistency
group.
Pause: Pause is a state indicating the suspension of a HyperMetro pair.
Force start: To ensure data consistency in the event that multiple elements in the HyperMetro
deployment malfunction simultaneously, HyperMetro stops hosts from accessing both storage
systems. You can forcibly start the local or remote storage system (depending on which one is
normal) to restore services quickly.
Preferred site switchover: Preferred site switchover indicates that during arbitration, precedence is
given to the storage system which has been set as the preferred site (by default, this is the local
storage system). If the HyperMetro replication network is down, the storage system that wins
arbitration continues providing services to hosts.
FastWrite: FastWrite uses the First Burst Enabled function of the SCSI protocol to optimize data
transmission between storage devices, reducing the number of interactions in a data write process by
half.
Functions of a HyperMetro consistency group: A consistency group ensures that the read/write control
policies of the multiple LUNs on a storage system are consistent.
In medium- and large-size databases, the user data and logs are stored on different LUNs. If data on
any LUN is lost or becomes inconsistent in time with the data on other LUNs, data on all of the LUNs
becomes invalid. Creating a HyperMetro consistency group for these LUNs can preserve the integrity
of their data and guarantee write-order fidelity.
 HyperMetro I/O processing mechanism
− Write I/O Process
Dual-write and locking mechanisms are essential for data consistency between storage
systems.
Dual-write and DCL technologies synchronize data changes while services are running.
Dual-write enables hosts' I/O requests to be delivered to both local and remote caches,
ensuring data consistency between the caches. If the storage system in one DC
malfunctions, the DCL records data changes. After the storage system recovers, the data
changes are synchronized to the storage system, ensuring data consistency across DCs.
Two HyperMetro storage systems can process hosts' I/O requests concurrently. To prevent
conflicts when different hosts access the same data on a storage system simultaneously, a
locking mechanism is used to allow only one storage system to write data. The storage
system denied by the locking mechanism must wait until the lock is released and then obtain
the write permission.
1. A host delivers a write I/O to the HyperMetro I/O processing module.
2. The write I/O applies for write permission from the optimistic lock on the local storage
system. After the write permission is obtained, the system records the address
information in the log but does not record the data content.
3. The HyperMetro I/O processing module writes the data to the caches of both the local
and remote LUNs concurrently. When data is written to the remote storage system, the
write I/O applies for write permission from the optimistic lock before the data can be
written to the cache.
4. The local and remote caches return the write result to the HyperMetro I/O processing
module.
5. The system determines whether dual-write is successful.
If writing to both caches is successful, the log is deleted.
If writing to either cache fails, the system:
a. Converts the log into a DCL that records the differential data between the local and
remote LUNs. After conversion, the original log is deleted.
b. Suspends the HyperMetro pair. The status of the HyperMetro pair becomes To be
synchronized. I/Os are only written to the storage system on which writing to its
cache succeeded. The storage system on which writing to its cache failed stops
providing services for the host.
6. The HyperMetro I/O processing module returns the write result to the host.
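The write steps above can be modeled with a small sketch. The class, its attributes, and the single threading.Lock that stands in for the cross-array optimistic lock are all illustrative assumptions, not Huawei's internal design.

import threading

class MetroPair:
    """Toy model of HyperMetro dual-write (steps 1 to 6 above)."""

    def __init__(self):
        self.local_cache, self.remote_cache = {}, {}
        self.lock = threading.Lock()          # stands in for the optimistic lock
        self.dcl = set()
        self.state = "normal"

    def write(self, lba, data):
        with self.lock:                       # step 2: obtain write permission for this address
            log = {lba}                       # record the address, not the data
            ok_local = self._write_cache(self.local_cache, lba, data)    # step 3: dual-write to
            ok_remote = self._write_cache(self.remote_cache, lba, data)  # both caches
            if ok_local and ok_remote:        # step 5: dual-write succeeded, delete the log
                log.discard(lba)
            else:
                self.dcl.add(lba)             # record differential data for later synchronization
                self.state = "to be synchronized"   # the pair is suspended
        return ok_local or ok_remote          # step 6: result returned to the host

    @staticmethod
    def _write_cache(cache, lba, data):
        cache[lba] = data                     # toy cache write that always succeeds here
        return True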
 Read I/O Process
The data of LUNs on both storage systems is synchronized in real time. Both storage systems are
accessible to hosts. If one storage system malfunctions, the other one continues providing
services for hosts.
1. A host delivers a read I/O to the HyperMetro I/O processing module.
2. The HyperMetro I/O processing module enables the local storage system to respond to the
read request of the host.
3. If the local storage system is operating properly, it returns data to the HyperMetro I/O
processing module.
4. If the local storage system is not operating properly, the HyperMetro I/O processing module
enables the host to read data from the remote storage system. Then the remote storage
system returns data to the HyperMetro I/O processing module.
5. The HyperMetro I/O processing module returns the requested data to the host.
 Arbitration Mechanism
If links between two HyperMetro storage systems are disconnected or either storage system
breaks down, real-time data synchronization will be unavailable to the storage systems and only
one storage system of the HyperMetro relationship can continue providing services. To ensure
data consistency, HyperMetro uses the arbitration mechanism to determine which storage system
continues providing services.
HyperMetro provides two arbitration modes:
− Static priority mode: Applies when no quorum server is deployed.
If no quorum server is configured or the quorum server is inaccessible, HyperMetro works
in static priority mode. When an arbitration occurs, the preferred site wins the arbitration
and provides services.
If links between the two storage systems are down or the non-preferred site of a HyperMetro
pair breaks down, LUNs of the storage system at the preferred site continue providing
HyperMetro services and LUNs of the storage system at the non-preferred site stop.
If the preferred site of a HyperMetro pair breaks down, the non-preferred site does not take
over HyperMetro services automatically. As a result, the services stop. You must forcibly
start the services at the non-preferred site.
− Quorum server mode (recommended): Applies when quorum servers are deployed.
In this mode, an independent physical server or VM is used as the quorum server. You are
advised to deploy the quorum server at a dedicated quorum site that is in a different fault
domain from the two DCs.
In the event of a DC failure or disconnection between the storage systems, each storage
system sends an arbitration request to the quorum server, and only the winner continues
providing services. The preferred site takes precedence in arbitration.
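Both arbitration modes can be condensed into a toy decision function. It assumes each storage system is simply alive or down and the quorum server is simply reachable or not; it illustrates the rules above and is not the actual arbitration algorithm.

def arbitrate(local_alive, remote_alive, quorum_reachable, local_is_preferred=True):
    """Toy arbitration: decide which storage system keeps serving hosts after a split."""
    if quorum_reachable:
        # Quorum server mode: surviving systems request arbitration; the preferred site
        # takes precedence when both systems are still alive.
        if local_alive and remote_alive:
            return "local" if local_is_preferred else "remote"
        return "local" if local_alive else "remote" if remote_alive else "none"
    # Static priority mode: the preferred site wins. If the preferred site itself is down,
    # services stop until an administrator forcibly starts the other site.
    preferred_alive = local_alive if local_is_preferred else remote_alive
    if preferred_alive:
        return "local" if local_is_preferred else "remote"
    return "none (force start required)"

print(arbitrate(True, True, quorum_reachable=False))   # preferred (local) site wins
print(arbitrate(False, True, quorum_reachable=False))  # services stop; force start needed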
2.1.5 HyperCDP
2.1.5.1 Overview
HyperCDP is a continuous data protection feature developed by Huawei. A HyperCDP object is
similar to a common writable snapshot, which is a point-in-time consistent copy of original data to
which the user can roll back, if and when it is needed. It contains a static image of the source data at
the data copy time point.
HyperCDP has the following advantages:
 HyperCDP provides data protection at an interval of seconds, with zero impact on performance
and small space occupation.
 Support for scheduled tasks
You can specify HyperCDP schedules by day, week, month, or a specific interval, meeting
different backup requirements.
 Intensive and persistent data protection
HyperCDP provides higher specifications than common writable snapshots. It achieves
continuous data protection by generating denser recovery points with a shorter protection interval
and longer protection period.
2.1.5.2 Working Principle
2.1.5.2.1 Basic Concepts
ROW: ROW is the core technology to implement HyperCDP. When a storage system receives a write
request to modify existing data, the storage system writes the new data to a new location and directs
the pointer of the modified data block to the new location.
Data organization: The LUNs created in the storage pool of OceanStor Dorado V6 consist of metadata
volumes and data volumes.
Metadata volume: records the data organization information (logical block address (LBA), version,
and clone ID) and data attributes. A metadata volume is organized in a tree structure.
LBA indicates the address of a logical block. The version corresponds to the HyperCDP time point
and the clone ID indicates the number of data copies.
Data volume: stores user data written to a LUN.
Source volume: The volume that stores the source data for which HyperCDP objects are generated. It
is presented as a source LUN to users.
HyperCDP duplicate: HyperCDP objects can be replicated to generate multiple duplicates. These
duplicates preserve the same data as the HyperCDP objects at the time when the objects were created.
HyperCDP consistency group: Protection groups ensure data consistency between multiple associated
LUNs. OceanStor Dorado V6 supports HyperCDP for protection groups. That is, HyperCDP objects
are created for member LUNs in a protection group at the same time.
HyperCDP consistency groups are mainly used by databases. Typically, databases store different data
(for example, data files, configuration files, and log files) on different LUNs, and these LUNs are
associated with each other. To back up the databases using HyperCDP, HyperCDP objects must be
created for these LUNs at the same time point to ensure consistency of the application data during
database restoration.
2.1.5.2.2 Implementation
Creating a HyperCDP object: After a HyperCDP object is created, it is activated immediately and a
data copy that is identical to the source LUN is generated. Then the storage system copies the source
LUN's pointer to the HyperCDP object so that the HyperCDP object points to the storage location of
the source LUN's data. This enables the source LUN and HyperCDP object to share the same LBA.
Writing data to the source LUN: When an application server writes data to the source LUN after the
HyperCDP object is created, the storage system uses ROW to save the new data to a new location in
the storage pool and directs the source LUN's pointer to the new location. The pointer of the
HyperCDP object still points to the storage location of the source data, so the source data at the object
creation time is saved.
Figure 2-1 shows the metadata distribution in the source LUN before a HyperCDP object is created.
Figure 2-1 Metadata distribution in the source LUN
Figure 2-2 shows the storage locations of data after a HyperCDP object is created and new data is
written to the source LUN.
Figure 2-2 Data storage locations
1. Both the source LUN and HyperCDP object use a mapping table to access the physical space.
The original data in the source LUN is ABCDE and is saved in sequence in the physical space.
Before the original data is modified, the mapping table for the HyperCDP object is empty.
2. When the source LUN receives a write request that changes C to F, the new data is written into a
new physical space P5 instead of being overwritten in P2.
3. After the data is written into the new physical space, the L2->P2 entry is added to the mapping
table of the HyperCDP object. When the logical address L2 of the HyperCDP object is read
subsequently, the read request will not be redirected to the source LUN; instead, the requested
data is directly read from physical space P2.
4. In the mapping table of the source LUN, the system changes L2->P2 to L2->P5. Data in the
source LUN is changed to ABFDE and data in the HyperCDP object is still ABCDE.
Figure 2-3 shows how the metadata of the source LUN and HyperCDP object is distributed after the
data changes.
Figure 2-3 Metadata distribution in the source LUN and HyperCDP object
Hosts cannot read the data in a HyperCDP object directly. To allow a host to access the HyperCDP
data, you must create a duplicate for the HyperCDP object and map the duplicate to the host. If you
want to access the HyperCDP data for the same LUN at another time point, you can recreate the
duplicate using the HyperCDP object generated at that time point to obtain its data immediately.
HyperCDP supports quick recovery of the source LUN's data. If data on a source LUN suffers
incorrect deletion, corruption, or virus attacks, you can roll back the source LUN to the point in time
when the HyperCDP object was created, minimizing data loss.
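The redirect-on-write behavior in steps 1 to 4 can be modeled with a short sketch in which a list stands in for the physical space and dictionaries stand in for the mapping tables. This is purely illustrative and does not reflect the product's actual data layout.

class RowVolume:
    """Toy ROW model of a source LUN with one HyperCDP object."""

    def __init__(self, initial):
        self.physical = list(initial)                            # physical space P0..Pn
        self.source_map = {i: i for i in range(len(initial))}    # source LUN mapping table
        self.cdp_map = {}                                        # HyperCDP mapping table (empty at first)

    def write(self, lba, data):
        """New data goes to new physical space; the old block is preserved for the CDP object."""
        if lba not in self.cdp_map:
            self.cdp_map[lba] = self.source_map[lba]             # step 3: CDP object keeps the old block
        self.physical.append(data)                               # step 2: write into new physical space
        self.source_map[lba] = len(self.physical) - 1            # step 4: source LUN points to new space

    def read_source(self, lba):
        return self.physical[self.source_map[lba]]

    def read_cdp(self, lba):
        # Unmodified blocks are still shared with the source LUN.
        return self.physical[self.cdp_map.get(lba, self.source_map[lba])]

lun = RowVolume("ABCDE")
lun.write(2, "F")              # change C to F
print(lun.read_source(2))      # F: the source LUN now reads ABFDE
print(lun.read_cdp(2))         # C: the HyperCDP object still reads ABCDE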
2.1.5.3 Application Scenarios
HyperCDP can be used for various scenarios, for example, rapid data backup and restoration,
continuous data protection, and repurposing of backup data.
 Rapid data backup and restoration
HyperCDP objects can be generated periodically for service data to implement quick data
backup.
You can use the latest HyperCDP object to roll back data within several seconds. This protects
data against the following situations:
− Virus infection
− Incorrect deletion
− Malicious tampering
− Data corruption caused by system breakdown
− Data corruption caused by application bugs
− Data corruption caused by storage system bugs
In terms of data backup and restoration, HyperCDP has the following advantages:
− The RTO is significantly reduced. Even a large amount of data can be restored in a few
seconds.
− Data can be frequently backed up without service interruption. Applications can run
correctly without performance compromise.
− The backup window is notably shortened or eliminated.
 Repurposing of Backup Data
LUNs serve different purposes in different service scenarios, such as report generation, data
testing, and data analysis. If multiple application servers write data to a LUN simultaneously,
changes to the data may adversely affect services on these application servers. Consequently, the
data testing and analysis results may be inaccurate.
OceanStor Dorado V6 supports multiple duplicates of a HyperCDP object, which can be used by
different application servers for report generation, data testing, and data analysis.
Figure 2-1 shows how HyperCDP duplicates are used for various purposes.
Figure 2-1 Purposes served by HyperCDP duplicates
2.1.6 Other Hyper Series Technologies
2.1.6.1 HyperMirror
HyperMirror is a continuous data protection technology that creates two physical mirror copies of a
LUN for redundant backup protection against host service interruption.
HyperMirror has the following characteristics:
 HyperMirror creates two physical mirror copies of a LUN. The data of the two mirror copies is
the same. If one mirror copy fails, applications on the LUN will continue to run and hosts
running the applications remain normal.
 HyperMirror can be applied to LUNs in an OceanStor series storage system (the local storage
system) and those in a third-party storage system (the heterogeneous storage system). Therefore,
HyperMirror can improve the reliability of LUNs in both types of storage systems.
 Other value-added features can be applied to LUNs protected by HyperMirror, protecting data at
multiple layers.
By generating two mirror copies of either a local or external LUN, HyperMirror implements
continuous data protection for that LUN. If the LUN or either mirror copy is damaged, host
applications will remain unaffected. Therefore, HyperMirror improves the reliability of LUNs and
lowers disaster recovery risks and maintenance costs.
HyperMirror applies to the following scenarios:
 Works with the SmartVirtualization feature to generate redundant data backup in heterogeneous
storage systems.
 Works with SmartVirtualization to back up data between heterogeneous storage systems.
 Provides continuous protection for data on local LUNs.
2.1.6.2 HyperLock
With the rapid development of technology and the explosive growth of information, secure access
to and use of data have become critically important. As required by laws and regulations,
important data such as court case documents, medical records, and financial documents can only
be read but cannot be written within a specific period. Therefore, measures must be taken to prevent
such data from being tampered with. In the storage industry, WORM is the most common method
used to archive and back up data, ensure secure data access, and prevent data tampering.
A file protected by WORM enters the read-only state immediately after data is written to it. In read-
only state, the file can be read, but cannot be deleted, modified, or renamed. The WORM feature can
prevent data from being tampered with, meeting data security requirements of enterprises and
organizations.
File systems with the WORM feature configured are called WORM file systems. WORM can only be
configured by administrators. There are two WORM modes: Regulatory Compliance WORM
(WORM-C for short) and Enterprise WORM (WORM-E).
The WORM feature implements read-only protection for important data in archived documents to
prevent data tampering, meeting regulatory compliance requirements.
WORM is used to protect important data in archived documents that cannot be tampered with or
damaged, for example, case documents of courts, medical records, and financial documents.
For example, a large number of litigation files are generated in courts. According to laws and
regulations, the protection periods of litigation files can be set to permanent, long-term, and short-
term based on the characteristics of the files.
2.1.6.3 HyperVault
Based on file systems, HyperVault enables data backup and recovery within a storage system and
between different storage systems.
Data backup involves local backup and remote backup. With file systems' snapshot or remote
replication technology, HyperVault backs up the data at a specific point in time to the source storage
system or backup storage system based on a specified backup policy.
Data recovery involves local recovery and remote recovery. Using file system snapshot rollback or
remote replication technology, HyperVault rolls a file system back to a specified local backup
snapshot or recovers data from a specified remote snapshot on the backup storage system.
HyperVault has the following characteristics:
 Time-saving local backup and recovery: A storage system can generate a local snapshot within
several seconds to obtain a consistent copy of the source file system, and roll back the snapshot
to quickly recover data to that at the desired point in time.
 Incremental backup for changed data: In remote backup mode, an initial full backup followed by
permanent incremental backups saves bandwidth.
 Flexible and reliable data backup policy: HyperVault supports user-defined backup policies and a
threshold for the number of backup copies. An invalid backup copy does not affect subsequent
backup tasks.
HyperVault applies to data backup, data recovery, and other scenarios.
2.2 Smart Series Technology and Application
2.2.1 SmartPartition
2.2.1.1 Overview
An IT system typically consists of three components: computer, network, and storage. In the
traditional architecture, just a small number of applications run on each storage system. These
applications run independently of one another and have minor impact on each other's performance.
Traditional storage systems that carry only one service application are being replaced by more
advanced storage systems that are capable of provisioning storage services for thousands of
applications at the same time. As the service applications running on each storage system grow
sharply and have different I/O characteristics, resource preemption among service applications
undermines the performance of mission-critical service applications.
To meet demanding QoS requirements and guarantee the service performance of storage systems,
storage vendors have introduced a number of techniques, such as I/O priority, application traffic
control, and cache partitioning. In particular, cache partitioning provides an efficient way to meet QoS
requirements as cache resources are indispensable to data transmissions between storage systems and
applications.
SmartPartition is an intelligent cache partitioning feature developed by Huawei. It is also a
performance-critical feature that allows you to assign cache partitions of different capacities to users
of different levels. Service applications assigned a cache partition receive a guaranteed cache
capacity, which improves their performance.
SmartPartition applies to LUNs (block services) and file systems (file services).
2.2.1.2 Working Principle
2.2.1.2.1 Concepts
SmartPartition ensures the service quality for mission-critical services by isolating cache resources
among services. In a storage system, a cache capacity indicates the amount of cache resources that a
service can use. Cache capacity is a major factor in the performance of a storage system and also
affects services with different I/O characteristics to different extents.
 For data writes, a larger cache capacity means a higher write combining rate, a higher write hit
ratio, and better sequential disk access.
 For data reads, a larger cache capacity means a higher read hit ratio.
 For a sequential service, its cache capacity should only be enough for I/O request merging.
 For a random service, a larger cache capacity enables better sequential disk access, which
improves service performance.
Cache resources are divided into read cache and write cache:
Read cache effectively improves the read hit ratio of a host by means of read prefetching.
Write cache improves the disk access performance of a host by means of combining, hitting, and
sequencing.
On a storage system, you can set a dedicated read cache and write cache for each SmartPartition
partition to meet the requirements of different types of services. The cache partitions on a storage
system include SmartPartition partitions and a default partition.
SmartPartition partitions are created by users and provide cache services for service applications in
the partitions.
The default partition is a cache partition automatically reserved by the system and provides cache
services for system operation and other applications for which no SmartPartition partition is assigned.
2.2.1.2.2 Implementation
The following figure shows the implementation of SmartPartition.
Figure 2-1 Scheduling multiple services with SmartPartition
Users can create SmartPartition partitions based on LUNs or file systems. SmartPartition partitions
are independent of one another. Resources in each partition are exclusively accessible by the
applications for which the partition is assigned. After the cache capacity of a partition is manually
configured, the SmartPartition feature periodically counts the number of I/Os in each partition so that
the configuration can be optimized to maintain high service quality for mission-critical services.
2.2.1.3 Application Scenarios
 Ensuring the Performance of Mission-Critical Services in a Multi-Service System
Storage systems now provide increasingly better performance and larger capacity. Therefore,
multiple service systems are typically deployed on one storage system. On one hand, this practice
simplifies the storage system architecture and cuts configuration and management costs. On the
other hand, this practice incurs storage resource preemption among service systems, which
greatly undermines the performance of each service system. SmartPartition resolves this problem
by specifying cache partitions for different service systems to isolate cache resources and
guarantee the normal operation of mission-critical services.
For example, a production system and its test system are running on the same storage system.
The following table lists the I/O characteristics of the two service systems.
Table 2-1 Service characteristics
− Production system: frequent read and write I/Os
− Test system: frequent read I/Os and infrequent write I/Os
SmartPartition allows you to configure independent cache partitions for the production and test
systems separately. In addition, appropriate read and write cache capacities can be configured for
the two systems according to their respective read and write I/O frequencies. This approach
improves the read and write I/O performance of the production system while maintaining the
normal operation of the test system.
Example:
SmartPartition policy A: SmartPartition partition 1 is created for the production system. The read
cache is 2 GB and the write cache is 1 GB. The read and write caches are enough for processing
frequent read and write I/Os in the production system.
SmartPartition policy B: SmartPartition partition 2 is created for the test system. The read cache
is 1 GB and the write cache is 512 MB. The cache resources are limited for the test system but
are enough to maintain its normal operation while not affecting the performance of the
production system.
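The production/test example can be modeled with a toy partition scheduler, shown below. The class name, methods, capacities of the default partition, and the use of gigabytes as a unit are assumptions for illustration; they are not product interfaces.

class CachePartitionScheduler:
    """Toy model of cache partitioning with a default partition for unassigned objects."""

    def __init__(self, default_read_gb, default_write_gb):
        self.partitions = {"default": {"read_gb": default_read_gb, "write_gb": default_write_gb}}
        self.assignment = {}   # LUN or file system name -> partition name

    def create_partition(self, name, read_gb, write_gb):
        self.partitions[name] = {"read_gb": read_gb, "write_gb": write_gb}

    def assign(self, obj_name, partition_name):
        self.assignment[obj_name] = partition_name

    def budget_for(self, obj_name):
        """Return the isolated cache budget used to serve I/O for this object."""
        return self.partitions[self.assignment.get(obj_name, "default")]

sched = CachePartitionScheduler(default_read_gb=4, default_write_gb=2)
sched.create_partition("production", read_gb=2, write_gb=1)    # SmartPartition policy A
sched.create_partition("test", read_gb=1, write_gb=0.5)        # SmartPartition policy B
sched.assign("LUN_prod", "production")
print(sched.budget_for("LUN_prod"))    # {'read_gb': 2, 'write_gb': 1}
print(sched.budget_for("LUN_other"))   # falls back to the default partition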
 Meeting the QoS Requirements of High-Level Users in VDI Scenarios
In virtual desktop infrastructure (VDI) scenarios, different users use different services and have
different QoS requirements. How to meet the QoS requirement of each user while making full
use of resources is a pressing problem that data centers must address.
SmartPartition allows you to create cache partitions of different capacities for different users.
When resources are limited, SmartPartition preferentially meets the QoS requirements of high-
level users.
For example, multiple users share the storage resources provided by a data center. The following
table lists the QoS requirements of user A and user B.
Table 2-2 User characteristics
− User A (gold level): high QoS requirement
− User B (silver level): low QoS requirement
SmartPartition allows you to create cache partitions for users A and B, respectively, and define
different cache read/write policies.
SmartPartition policy A: SmartPartition partition 1 is created for user A. The read cache is 2 GB
and the write cache is 1 GB. The read and write caches are enough to guarantee the normal
operation and excellent data read and write performance of the applications used by user A.
SmartPartition policy B: SmartPartition partition 2 is created for user B. The read cache is 1 GB
and the write cache is 512 MB. The cache resources are limited for user B but are enough to
maintain the normal operation of the applications used by user B while meeting user A's
demanding requirements on applications.
2.2.2 SmartQuota
2.2.2.1 Overview
IT systems are in urgent need of improving resource utilization and management to ride on the
advancements in virtualization and cloud computing. In a typical IT storage system, all available
storage resources (disk space) will be used up. Therefore, we must find a way to control storage
resource utilization and growth to save costs.
In a network attached storage (NAS) file service environment, resources are provisioned as directories
to departments, organizations, and individuals. Each department or individual has unique resource
requirements or limitations, and therefore, storage systems must allocate and limit resources based on
actual conditions. SmartQuota perfectly meets this requirement by limiting the directory resources
that users can use.
SmartQuota is a file system quota technology. It allows system administrators to control storage
resource usage by limiting the disk space that each user can use and accordingly, preventing users
from excessively using resources.
2.2.2.2 Working Principle
In each I/O operation, SmartQuota checks the sum of used space and file quantity plus additional
space and file quantity required for this operation. If the sum exceeds the hard quota, the operation
will fail. If the sum does not exceed the hard quota, this operation will succeed. If the I/O operation
succeeds, SmartQuota updates the used space and file quantity under the quotas and writes the quota
update together with the data generated in the I/O operation to the file system. Either both the I/O
operation and quota update succeed or both fail. This approach guarantees that the used space checked
in each I/O operation is correct.
SmartQuota checks the hard quota as well as the soft quota. If the sum of used and incremental space
and file quantity does not exceed the hard quota, SmartQuota checks whether used space or file
quantity exceeds the soft quota. If yes, an alarm will be reported. After used space or file quantity
drops below the soft quota, the alarm will be cleared. The alarm is sent to the alarm center after an I/O
operation success is returned to the file system.
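A minimal sketch of this per-I/O check is shown below, assuming a simple dictionary holds the used amounts and the soft and hard limits; the field names are invented for illustration.

def check_and_update_quota(quota, add_space, add_files):
    """Toy version of the per-I/O SmartQuota check described above."""
    new_space = quota["used_space"] + add_space
    new_files = quota["used_files"] + add_files

    # Hard quota check: the operation fails if either hard limit would be exceeded.
    if new_space > quota["hard_space"] or new_files > quota["hard_files"]:
        return False, "operation failed: hard quota exceeded"

    # The operation succeeds; the quota update is committed together with the data.
    quota["used_space"], quota["used_files"] = new_space, new_files

    # Soft quota check: over-usage only raises an alarm, the operation still succeeds.
    alarm = new_space > quota["soft_space"] or new_files > quota["soft_files"]
    return True, ("alarm: soft quota exceeded" if alarm else "ok")

q = {"used_space": 90, "soft_space": 100, "hard_space": 120,
     "used_files": 10, "soft_files": 50, "hard_files": 100}
print(check_and_update_quota(q, add_space=15, add_files=1))   # succeeds, raises a soft quota alarm
print(check_and_update_quota(q, add_space=30, add_files=1))   # fails: hard quota would be exceeded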
 Alarm Generation and Clearance Policies
When the amount of used resources (space or file quantity) exceeds the space or file quantity soft
quota, SmartQuota generates an alarm to notify administrators for handling. A soft quota is
designed to allow administrators to handle the resource over-usage problem by deleting
unnecessary files or applying for additional quotas before a file operation fails due to insufficient
quota.
SmartQuota clears the resource over-usage alarm only when the amount of resources used by a
user is less than 90% of the soft quota. This way, frequent generation and clearance of alarms can
be prevented as the amount of used resources is kept remarkably below the soft quota.
− Quota trees are critical to the implementation of SmartQuota. Directory quotas can only be
configured on quota trees. Quota trees are a special kind of directory:
− Quota trees can only be created, deleted, or renamed by administrators in the CLI or GUI.
Only empty quota trees can be deleted.
− Quota trees can be shared through a protocol and cannot be renamed or deleted when they
are being shared.
− Files cannot be moved (through NFS) or cut (through CIFS) between quota trees.
− A hard link cannot be created between quota trees.
 Supporting Directory Quotas
SmartQuota limits resource usage by setting one or more resource quotas for each user.
SmartQuota principally employs directory quotas to limit resource usage:
A directory quota limits the maximum available space of all files under a directory. SmartQuota
supports only directory quotas on special level-1 directories (level-1 directories created by
running the specific management command). Such level-1 directories are called quota trees.
The following figure shows a typical configuration of SmartQuota.
Figure 2-1 Typical configuration of SmartQuota
To facilitate administrators to configure quotas, SmartQuota provides the following mechanisms
for configuring quotas for the same type of objects in batches:
Default directory quota: A default directory quota is configured for a file system and applies to
all quota trees. If a new quota tree is created and no directory quota is configured for it, resource
usage on this quota tree will be checked and limited based on the default directory quota.
It should be noted that a default directory quota is just a configuration item and does not record
and update usage information.
 Limiting Space and File Quantity Usage
The following describes the quotas that you can configure for each quota object:
1. Space soft quota: A space soft quota is a space over-usage alarm threshold. When the space
used by a quota object reaches the configured space soft quota, SmartQuota reports an alarm
indicating insufficient space and suggests deleting unnecessary files or applying for
additional quotas. The user can continue to write data to the directory in this case. When the
used space drops below the space soft quota, the system clears the alarm. To prevent
frequent alarm generation and clearance as the used space frequently changes below and
above the soft quota, the system clears an alarm only when the used space drops to less than
90% of the soft quota. For SmartQuota, space is measured in bytes and calculated based on
logical space of files.
2. Space hard quota: A space hard quota limits the maximum space that a quota object can
use. When the space used by a quota object reaches the configured hard quota, SmartQuota
reports an error indicating insufficient space. This way, the used space will not exceed this
quota. A space hard quota can be considered as the total space of a disk. Therefore, the used
space can never exceed the quota.
3. File quantity soft quota: A file quantity soft quota is a file quantity over-usage alarm
threshold. When the number of files used by a quota object reaches the configured file
quantity soft quota, SmartQuota reports an alarm indicating insufficient file resources and
suggests deleting unnecessary files or applying for additional quotas. Users can continue to
create files or directories in this case. When the number of used files drops below the file
quantity soft quota, the system clears the alarm. For SmartQuota, a file quantity quota is
measured by the number of files. A file, directory, soft link (NFS), or shortcut (CIFS) can
be considered as a file. Once a file is created, it is counted in the file quantity quota.
4. File quantity hard quota: A file quantity hard quota limits the maximum number of files that
a quota object can use. A file quantity hard quota functions similarly to a space hard quota.
When the number of files used by a quota object reaches the configured hard quota,
SmartQuota reports an error indicating insufficient file resources. This way, the number of used
files will not exceed the quota.
When configuring quotas for an object, you must configure at least one of the four quotas.
2.2.2.3 Application Scenarios
SmartQuota applies to file service scenarios. An administrator allocates quota trees to the
departments, organizations, or individuals in an enterprise and configures quotas on these quota trees
based on service requirements to properly limit resource usage.
Figure 2-1 Application scenarios for SmartQuota
2.2.3 SmartVirtualization
2.2.3.1 Overview
As the amount of user data grows, efficient management and capacity expansion of existing storage
systems become increasingly important. However, these operations are impeded by the following
problems:
If a user replaces an existing storage system with a new storage system, service data stored on the
existing storage system must be migrated to the new storage system. However, incompatibility
between storage systems of different vendors prolongs data migration duration and even causes data
loss during migration.
If a user acquires a new storage system and manages storage systems separately, the maintenance
costs will increase with the addition of the new system. In addition, storage resources provided by
existing storage systems and the new storage system cannot be effectively integrated and uniformly
managed.
SmartVirtualization can effectively address these problems. Physical attributes of different storage
systems are shielded for easy configuration and management of storage systems and efficient
utilization of storage resources.
SmartVirtualization is a heterogeneous virtualization feature developed by Huawei. After a local
storage system is connected to a heterogeneous storage system, the local storage system can use the
storage resources provided by the heterogeneous storage system as local storage resources and
manage them in a unified manner, regardless of different software and hardware architectures
between storage systems.
SmartVirtualization applies only to LUNs (block services).
SmartVirtualization resolves incompatibility between storage systems so that a user can manage the
storage resources provided by the local storage system and the heterogeneous storage system in a
unified manner. Meanwhile, a user can still use the storage resources provided by a legacy storage system to protect existing investments.
In this section, the local storage system refers to an OceanStor V5 series storage system. The
heterogeneous storage system can be a Huawei (excluding an OEM storage system commissioned by
Huawei) or third-party storage system.
SmartVirtualization allows only management but not configuration of the storage resources on a
heterogeneous storage system.
SmartVirtualization allows online or offline takeover of a heterogeneous storage system.
SmartVirtualization offers the following benefits:
 Broad compatibility: The local storage system is compatible with mainstream heterogeneous
storage systems to facilitate planning and managing storage resources in a unified manner.
 Conserving storage space: When a local storage system uses the storage space provided by the external LUNs on a heterogeneous storage system, it does not create a full physical mirror of the data, which significantly saves storage space on the local storage system.
 Scalable functions: A local storage system can not only use external LUNs as local storage
resources, but also configure value-added functions, such as HyperReplication and HyperSnap,
for these LUNs, to meet higher data security and reliability requirements.
2.2.3.2 Working Principle
2.2.3.2.1 Concepts
 Data organization
A local storage system uses a storage virtualization technology. Each LUN in the local storage
system consists of a metadata volume and a data volume.
A metadata volume records data storage locations.
A data volume stores user data.
 External LUN
It is a LUN on a heterogeneous storage system, which is displayed as a remote LUN in
DeviceManager.
 eDevLUN
In the storage pool of the local storage system, the mapped external LUNs are created as raw
storage devices based on the virtualization data organization form. The raw storage devices
created in this way are called eDevLUNs. An eDevLUN consists of a metadata volume and a
data volume. The physical space that an eDevLUN occupies on the local storage system is only the space needed by its metadata volume. Application servers can use eDevLUNs to access data on external LUNs
and configure value-added features, such as HyperSnap, HyperReplication, SmartMigration, and
HyperMirror, for the eDevLUNs.
 LUN masquerading
When encapsulating a LUN on a heterogeneous storage system into an eDevLUN, you can
configure the LUN masquerading property. An application server will identify the eDevLUN as a
LUN on the heterogeneous storage system. The WWN and host LUN ID of the eDevLUN
detected by a host are the same as those of the external LUN. The masquerading property of an
eDevLUN is configured to implement online takeover.
 Takeover
LUNs on a heterogeneous storage system are mapped to the local storage system to allow the
local storage system to use and manage these LUNs.
 Relationship between an eDevLUN and an external LUN
An eDevLUN consists of a data volume and a metadata volume. The data volume is a logical
abstract object of the data on an external LUN. The physical space needed by the data volume is
provided by a heterogeneous storage system instead of the local storage system. The metadata
volume manages the storage locations of data on an eDevLUN. The physical space needed by the
metadata volume is provided by the local storage system. A metadata volume requires only a small amount of storage space. If no value-added feature is configured for eDevLUNs, each eDevLUN consumes about
130 MB of space in the storage pool of the local storage system. A mapping is configured
between each eDevLUN created on the local storage system and each external LUN on a
heterogeneous storage system. An application server accesses the data on an external LUN by
reading and writing data from and to an eDevLUN.
Figure 2-1 Relationship between an eDevLUN and an external LUN
2.2.3.2.2 Implementation
 Data Read and Write Process
With SmartVirtualization, an application server can read and write data from and to an external
LUN on a heterogeneous storage system through the local storage system. The process of reading
and writing data from and to an external LUN is similar to the process of reading and writing
data from and to a local LUN.
 Data Read Process
After an external LUN on a heterogeneous storage system is taken over using
SmartVirtualization, hot data on the external LUN is cached to the cache of an eDevLUN. When
an application server sends a request to read data from the external LUN, data will be read from
the eDevLUN on the local storage system. If the read cache misses in the eDevLUN on the local
storage system, data will be read from the external LUN on the heterogeneous storage system.
The following figure shows the data read process.
Figure 2-1 Data read process
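As a rough illustration of this read path, the following sketch (with assumed object names, not a real interface) checks the eDevLUN cache on the local storage system first and falls back to the external LUN only on a miss.
```python
# Hedged sketch; edevlun_cache and external_lun are assumed abstractions.
def read_block(block_id, edevlun_cache, external_lun):
    data = edevlun_cache.get(block_id)        # hot data cached on the local system
    if data is None:                          # cache miss
        data = external_lun.read(block_id)    # read from the heterogeneous array
        edevlun_cache.put(block_id, data)     # warm the cache for later reads
    return data
```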
 Data Write Process
After a heterogeneous storage system is taken over using SmartVirtualization, both the
eDevLUNs and other LUNs on the local storage system support the write-back and write-through
policies. The following figure shows the data write process.
Figure 2-2 Data write process
Write-back: When an application server sends a write request and the write-back policy is used, data blocks are written to the local storage system, which then returns a write success to the application server. The local storage system later writes the same data blocks to the heterogeneous storage system.
Write-through: When an application server sends a write-through request, data blocks are written
to the local storage system and then from the local storage system to a heterogeneous storage
system. After the data blocks are successfully written to the heterogeneous storage system, the
heterogeneous storage system returns a write success to the local storage system. Then the local
storage system returns a write success to the application server.
The default write policy for eDevLUNs is write-through. The write policy for an eDevLUN can
be modified by changing its LUN attribute. A write policy will affect the performance of a
heterogeneous storage system.
The write-through policy is the most secure write policy because data is ultimately stored on the heterogeneous storage system. If the load capacity of the heterogeneous storage system cannot keep up with the data write speed for short periods of time, the write-through policy is recommended, because in that situation the write-back policy would adversely affect the overall performance of the local storage system. Because an eDevLUN supports a number of value-added features, consider adjusting the write policy of an eDevLUN to ensure performance when value-added features are enabled.
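The ordering difference between the two policies can be sketched as follows. The code is an assumption-based illustration (the local and remote objects are hypothetical), not the storage system's implementation.
```python
# Hedged sketch of the two write policies for an eDevLUN.
def write_back(io, local, remote):
    local.write(io)        # 1. data lands in the local storage system
    io.ack_host()          # 2. the host receives a write success immediately
    remote.write(io)       # 3. data is written to the external LUN afterwards

def write_through(io, local, remote):
    local.write(io)        # 1. data lands in the local storage system
    remote.write(io)       # 2. ...and is written to the external LUN
    remote.wait_ack()      # 3. the heterogeneous array confirms the write
    io.ack_host()          # 4. only then does the host receive a write success
```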
2.2.3.3 Application Scenarios
SmartVirtualization resolves incompatibility between storage systems. Therefore, SmartVirtualization
applies to a wide range of scenarios, such as migration of service data and management of storage
resources between storage systems.
 Unified Management of Storage Resources
If multiple heterogeneous storage systems have been installed on site, the following two
problems may arise:
The multipathing software on an application server may be incompatible with one or more
storage systems.
In a certain network environment, for example, a Fibre Channel network, an application server
can only be connected to one storage system. However, in practice, an application server needs to
distribute services to multiple storage systems.
SmartVirtualization functions similarly to a virtual gateway. SmartVirtualization allows you to
detect the storage resources provided by multiple heterogeneous storage systems through the
local storage system, deliver commands to read and write data from and to these storage
resources, and manage these storage resources in a unified manner.
The storage resources provided by heterogeneous storage systems can be managed in a unified
manner by the following two means:
Offline takeover: During an offline takeover, a heterogeneous storage system and application
servers are disconnected and accordingly, services are interrupted for a short duration of time.
Online takeover: In contrast with an offline takeover, an online takeover can be performed
without disconnecting a heterogeneous storage system and application servers and interrupting
services. This maintains service continuity and guarantees data integrity.
 Migration of Service Data Between Storage Systems
Migrating service data to a new storage system: With services growing day by day, the amount of
data that must be stored also increases. The storage spaces provided by existing storage systems
are no longer enough to meet the current requirements for data storage capacity and performance.
In this case, users need to acquire storage systems that provide larger capacity and better
performance to upgrade or replace their existing storage systems. As two storage systems use
different software and hardware components, data migration may interrupt services and even
cause data loss. SmartVirtualization helps shield the differences between the two storage systems
by mapping the external LUNs in the original storage system to the eDevLUNs in the new
storage system. The SmartVirtualization and SmartMigration features then work together to
migrate service data from the original storage system to the new storage system while
maintaining data integrity and reliability without interrupting services.
Migrating cold data to the original storage system: After a new storage system is installed and
runs for a period of time, cold data is found being stored on the storage system. If a large amount
of cold data is stored on the new storage system, the utilization of storage resources will be
adversely affected and accordingly, storage space will be unnecessarily wasted.
SmartVirtualization can work with SmartMigration to reduce operating expense by migrating
cold data to the original heterogeneous storage system to fully utilize existing resources.
 Heterogeneous Data DR
When a user stores its service data in two data centers and requires excellent service continuity, it
typically employs the asynchronous remote replication feature to allow the two data centers to
mutually back up the data stored in each data center. When a disaster occurs in one data center,
the other data center can take over the services from the faulty data center and recover data.
If storage systems from different vendors are deployed in the two data centers, the two data
centers cannot mutually back up the data stored in each data center due to different hardware and
software architectures. As a result, user requirements cannot be met. In this case,
SmartVirtualization is useful in implementing mutual data backup between heterogeneous
storage systems and cross-site data DR. The implementation process is as follows: First, take
over the LUNs on the heterogeneous storage system in each data center and create eDevLUNs.
Then, create an asynchronous remote replication pair between each eDevLUN and each LUN on
a Huawei storage system deployed at the other site.
 Heterogeneous Data Protection
After a heterogeneous storage system is taken over using SmartVirtualization, data on the LUNs
in the heterogeneous storage system may still be subject to damage due to viruses or other
reasons. To this end, HyperSnap can be used to create snapshots of eDevLUNs for backing up
the data on external LUNs. Damaged data on an external LUN can be swiftly recovered by
recovering the data on an eDevLUN from a specified snapshot point in time by means of quick
snapshot rollback.
 Heterogeneous Local HA
After a heterogeneous storage system is taken over using SmartVirtualization, service data is still
stored on the heterogeneous storage system. A variety of heterogeneous storage systems may be
incompatible with one another, which may cause service interruption and even data loss. The
HyperMirror feature can be enabled on the local storage system to create a mirror LUN for each
eDevLUN. Then, two mirror copies of each mirror LUN are saved on the local storage system.
Data on an external LUN is written to both mirror copies at the same time, preventing service
interruption and data loss.
2.2.4 Other Smart Series Technologies
2.2.4.1 SmartCache
SmartCache is an intelligent data cache feature developed by Huawei.
This technology creates a SmartCache pool of SSDs and moves hot data featuring frequent small
random read I/Os from conventional hard disk drives (HDDs) to the high-speed cache pool. SSDs
provide much faster data reads than HDDs, so SmartCache remarkably reduces the response time to
hot data and improves system performance.
SmartCache divides the cache pool into multiple partitions to provide fine-grained SSD cache
resources. Different services can share one partition or use different partitions. The partitions are
independent of each other. More cache resources can be allocated to mission-critical applications to
ensure application performance.
SmartCache neither interrupts services nor compromises data reliability.
The SmartCache feature applies to scenarios characterized by hot data and random small read I/Os. In
such scenarios, SmartCache can considerably improve the read performance.
As SSDs provide efficient response and high IOPS, SmartCache can improve the read performance,
especially in scenarios characterized by hot data and random small I/Os with more frequent data reads
than data writes. These scenarios include database, OLTP, web, and file services.
2.2.4.2 SmartMulti-Tenant
The requirements for XaaS in public and private clouds emerge with the soaring development of
cloud services. As the number of end users increases constantly, one physical storage system may be
used by multiple enterprises or individuals. The following challenges arise:
The logical resources of enterprises or individuals who use the same storage system may interfere
with each other or may be subject to unauthorized access, impairing data security.
IT service providers need to pay extra costs to manage users.
Data migration without affecting services is required.
Developed to deal with these challenges, the multi-tenancy technology allows storage resource
sharing among tenants and at the same time simplifies configuration and management, as well as
enhances data security.
Huawei's SmartMulti-Tenant allows tenants to create multiple virtual storage systems in one physical
storage system. With SmartMulti-Tenant, tenants can share hardware resources and safeguard data
security and confidentiality in a multi-protocol unified storage architecture.
SmartMulti-Tenant enables users to implement flexible, easy-to-manage, and cost-effective storage
sharing among multiple vStores in a multi-protocol unified storage infrastructure. SmartMulti-Tenant
supports performance tuning and data protection settings for each vStore to meet different SLA
requirements.
vStore-based service isolation: The development of cloud technology brings a higher sharing level of
underlying resources. There is also an increasing demand for data resource isolation. With
SmartMulti-Tenant, multiple vStores can be created in a physical storage system, providing
independent services and configuration space for each vStore, and isolating services, storage
resources, and networks among vStores. Different vStores can share the same hardware resources,
without affecting data security and privacy.
Example: An enterprise allocates a physical storage system to several business departments. These
business departments manage and allocate their own storage resources while meeting the requirement
for secure storage resource access and isolation.
2.2.4.3 SmartQoS
SmartQoS is an intelligent service quality control feature developed by Huawei. It dynamically
allocates storage system resources to meet the performance requirement of certain applications.
SmartQoS extends the information lifecycle management (ILM) strategy to control the performance
level for each application within a storage system. SmartQoS is an essential add-on to a storage
system, especially when certain applications have demanding SLA requirements. In a storage system
serving two or more applications, SmartQoS helps derive the maximum value from the storage
system:
SmartQoS controls the performance level for each application, preventing interference between
applications and ensuring the performance of mission-critical applications.
SmartQoS prioritizes mission-critical applications in storage resource allocation by limiting the
resources allocated to non-critical applications.
SmartQoS applies to LUNs (block services) and file systems (file services).
SmartQoS dynamically allocates storage resources to ensure performance for mission-critical services
and high-priority users.
 Ensuring Performance for Mission-Critical Services
SmartQoS is useful in specifying the performance objectives for different services to guarantee
the normal operation of mission-critical services.
You can ensure the performance of mission-critical services by setting I/O priorities or creating
SmartQoS traffic control policies.
The services running on a storage system can be categorized into the following types:
− Online Transaction Processing (OLTP) service is a mission-critical service and
requires excellent real-time performance.
− Archive and backup service involves a large amount of data but requires general real-
time performance.
The OLTP service runs between 08:00 a.m. and 00:00 a.m. and the archive and backup service
runs between 00:00 a.m. and 08:00 a.m.
Adequate system resources must be provided for those two types of services when they are
running in specific periods.
As the OLTP service is a mission-critical service, you can modify LUN I/O priorities to give a
higher priority to the OLTP service than the archive and backup service. This practice guarantees
the normal operation of the OLTP service and prevents the archive and backup service from
affecting the running of the OLTP service.
To meet service requirements, you can leverage the following two policies:
Setting two upper limits:
Traffic control policy A: Limits the bandwidth for the archive and backup service (for example, ≤
50 MB/s) between 08:00 a.m. and 00:00 a.m. to reserve adequate system resources for the normal
operation of the OLTP service during daytime.
Traffic control policy B: Limits the IOPS for the OLTP service (for example, ≤ 200) between
00:00 a.m. and 08:00 a.m. to reserve adequate system resources for the normal operation of the
archive and backup service during night.
Setting a lower limit:
Traffic control policy C: Sets the latency for the OLTP service (for example, ≤ 10 ms) between
08:00 a.m. and 00:00 a.m. to reserve adequate system resources for the normal operation of the
OLTP service during daytime.
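One way to picture these time-ranged policies is the sketch below. The policy table, field names, and check function are illustrative assumptions rather than the product's configuration interface.
```python
from datetime import time

# Hypothetical representation of traffic control policies A and B described above.
policies = [
    {"name": "A", "target": "archive_backup", "metric": "bandwidth_MBps",
     "limit": 50,  "window": (time(8, 0), time(0, 0))},   # daytime cap
    {"name": "B", "target": "oltp",           "metric": "iops",
     "limit": 200, "window": (time(0, 0), time(8, 0))},   # nighttime cap
]

def in_window(now, start, end):
    # A window such as 08:00-00:00 wraps past midnight and is handled explicitly.
    return start <= now < end if start < end else (now >= start or now < end)

def within_limit(service, metric, value, now):
    for p in policies:
        if p["target"] == service and p["metric"] == metric \
                and in_window(now, *p["window"]):
            return value <= p["limit"]
    return True   # no active policy for this service/metric at this time
```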
 Ensuring Performance for High-Priority Subscribers
To reduce the total cost of ownership (TCO) and maintain service continuity, some subscribers
tend to run their services on the storage platforms offered by a storage service provider instead of
building their own storage systems. However, storage resource preemption may occur among
different types of services with different service characteristics. This may prevent high-priority
subscribers from using adequate storage resources.
SmartQoS is useful in creating SmartQoS policies and setting I/O priorities for different
subscribers. This way, when resources become insufficient, high-priority subscribers can
maintain normal and satisfactory operation of their services.
2.2.4.4 SmartDedupe and SmartCompression
SmartDedupe and SmartCompression are the intelligent data deduplication and compression features
developed by Huawei.
SmartDedupe is a data reduction technology that removes redundant data blocks from a storage
system to reduce the physical storage space used by data and meet the increasing data storage
requirements. OceanStor storage systems support inline deduplication, that is, only new data is
deduplicated.
SmartCompression reorganizes data while maintaining data integrity to reduce data amount, save
storage space, and improve data transmission, processing, and storage efficiency. Storage systems
support inline compression, that is, only new data is compressed.
SmartDedupe and SmartCompression implement data deduplication and compression to reduce the
storage space occupied by data. In application scenarios such as databases, virtual desktops, and email
services, SmartDedupe and SmartCompression can be used independently or jointly to improve
storage efficiency as well as reduce investments and O&M costs.
 Application Scenarios for SmartDedupe
Virtual Desktop Infrastructure (VDI) is a common application scenario for SmartDedupe. In VDI
applications, multiple virtual images are created on a storage system. These images contain a
large amount of duplicate data. As the amount of duplicate data increases, the storage space
provided by the storage system becomes insufficient for the normal operation of services.
SmartDedupe removes duplicate data between images to release storage resources for more
service data.
 Application Scenarios for SmartCompression
Data compression occupies CPU resources, which increase with the amount of data to be
compressed.
Databases are the best application scenario for SmartCompression. To store a large amount of data in
databases, it is wise to trade a little service performance for more than 65% increase in available
storage space.
File services are also a common application scenario for SmartCompression. A typical example
is a file service system that is only busy for half of its service time and has a 50% compression
ratio for datasets.
Engineering, seismic, and geological data: With similar characteristics to database backups, these
types of data are stored in the same format but contain little duplicate data. Such data can be
compressed to save the storage space.
 Application Scenarios for Using Both SmartDedupe and SmartCompression
SmartDedupe and SmartCompression can be used together to save more storage space in a wide
range of scenarios, such as data testing or development systems, file service systems, and
engineering data systems.
In VDI applications, multiple virtual images are created on a storage system. These images
contain a large amount of duplicate data. As the amount of duplicate data increases, the storage
space provided by the storage system becomes insufficient for the normal operation of services.
SmartDedupe and SmartCompression remove or compress duplicate data between images to
release storage resources for more service data.
3 Distributed Storage Technology and Application
3.1 Block Service Features
3.1.1 SmartDedupe and SmartCompression
The block service intelligently switches between inline and post-process deduplication. When the
service load is heavy, inline deduplication is disabled automatically to ensure service performance and
post-process deduplication is implemented to delete duplicate data. When the service load is light,
inline deduplication is enabled automatically to prevent read/write amplification of post-process
deduplication. The intelligent self-adaptive deduplication technology automatically switches between
inline and post-process deduplication based on service loads without user awareness, making the most
of the two deduplication modes and providing excellent read and write performance even when
deduplication is enabled.
To obtain better deduplication and compression effects, the block service adopts global deduplication.
The distributed storage space is huge. To reduce the memory space consumed by the fingerprint table,
an opportunity table is introduced, as shown in the following figure. The fingerprints of data blocks
are first recorded in the opportunity table for counting. When the number of data blocks with the same
fingerprint reaches a specific threshold (3 by default and modifiable), the system promotes the
fingerprint to the fingerprint table and implements data deduplication accordingly.
Figure 3-1 Deduplication in the block service
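The promotion logic can be pictured with the short sketch below; the dictionaries and the storage.write call are illustrative assumptions, not the block service's data structures.
```python
PROMOTE_THRESHOLD = 3        # default promotion threshold (modifiable), as above

opportunity = {}             # fingerprint -> occurrence count (lightweight table)
fingerprint_table = {}       # fingerprint -> address of the stored block

def ingest_block(fingerprint, data, storage):
    """Return the address a logical block should reference after deduplication."""
    if fingerprint in fingerprint_table:
        return fingerprint_table[fingerprint]      # duplicate: reuse stored block
    opportunity[fingerprint] = opportunity.get(fingerprint, 0) + 1
    address = storage.write(data)                  # store this copy normally
    if opportunity[fingerprint] >= PROMOTE_THRESHOLD:
        fingerprint_table[fingerprint] = address   # promote: later copies dedupe
        del opportunity[fingerprint]
    return address
```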
If compression is disabled, the system directly applies for storage space to store data blocks. If
compression is enabled, compression will be performed for the data blocks before storage. The data
blocks will be compressed by the compression engine at the granularity of 512 bytes and then saved in
the system.
The compression engine combines two different compression algorithms: one with a high compression speed but low compression ratio, and the other with a high compression ratio but low compression speed. By configuring different execution ratios for the two algorithms, you can obtain different levels of performance and data reduction. Only one compression algorithm can be selected for a storage pool. Changing the compression algorithm of a storage pool does not affect already compressed data. During data reads, compressed data is decompressed using the same algorithm with which it was compressed.
Figure 3-2 Multi-policy compression in the block service
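The idea of mixing a fast algorithm with a high-ratio algorithm at a configurable ratio is sketched below; zlib compression levels merely stand in for the two real algorithms, and the ratio handling is an assumption for illustration.
```python
import zlib

def make_compressor(fast_share=0.8):
    """Compress 512-byte blocks, sending fast_share of them to the fast algorithm."""
    state = {"count": 0}
    def compress(block: bytes) -> bytes:
        state["count"] += 1
        use_fast = (state["count"] % 10) < fast_share * 10
        level = 1 if use_fast else 9   # level 1: fast/low ratio; level 9: slow/high ratio
        return zlib.compress(block, level)
    return compress

# zlib.decompress() recovers the data regardless of level; in the real engine the
# system must decompress each block with the same algorithm that compressed it.
```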
3.1.2 SmartQoS
SmartQoS enables you to set upper limits on IOPS or bandwidth for certain applications. Based on the
upper limits, SmartQoS can accurately limit performance of these applications, preventing them from
contending for storage resources with critical applications.
SmartQoS extends the information lifecycle management (ILM) strategy to implement application
performance tiering in the block service. When multiple applications run on one storage system,
proper QoS configurations ensure the performance of critical services:
 SmartQoS controls storage resource usage by limiting the performance upper limits of non-
critical applications so that critical applications have sufficient storage resources to achieve
performance objectives.
 Some services are prone to traffic bursts or storms in specified time periods, for example, daily
backup, database sorting, monthly salary distribution, and periodic bill settlement. The traffic
bursts or storms will consume a large number of system resources. If the traffic bursts or storms
occur at production time, interactive services will be affected. To avoid this, you can limit the
maximum IOPS or bandwidth of these services during traffic burst occurrence time to control
array resources consumed by the services, preventing production or interactive services from
being affected.
3.1.2.1 Functions and Principles of SmartQoS
SmartQoS enables you to set performance objectives for volumes and storage pools by specifying
bandwidth and IOPS upper limits. The total read and write performance, read performance, or write
performance can be limited. QoS policies can take effect in specified time ranges based on service
loads to prevent I/O storms from affecting production services.
SmartQoS leverages a self-adaptive adjustment algorithm based on negative feedback and a volume-
based I/O traffic control management algorithm to limit traffic based on the performance control
objectives (such as IOPS and bandwidth) specified by users. The I/O traffic control mechanism
prevents certain services from affecting other services due to heavy traffic and supports burst traffic
functions within specified time ranges. QoS traffic control is implemented as follows in the block
service:
 Self-adaptive adjustment algorithm based on negative feedback
When a volume is mounted to multiple VBS nodes, the system resources consumed by services
on the volume need to be controlled. That is, the overall performance of the volume needs to be
limited. This requires coordination of distributed traffic control parameters.
Suppose that the system is in the initial state and the IOPS upper limit of Volume 0 is 1000, as
shown in the following figure. The initial IOPS pressure of each host accessing Volume 0 is
1000 and the total IOPS pressure on Volume 0 is 2000, exceeding the specified IOPS upper
limit. To prevent this, the system will detect the service pressure from Host 0 and Host 1 on
Volume 0 and adaptively adjust the number of tokens in Token bucket 0 and Token bucket 1 to
limit the maximum IOPS to 1000.
Figure 3-1 Self-adaptive adjustment algorithm based on negative feedback
 Volume-based I/O traffic control management algorithm
QoS traffic control management is implemented by volume I/O queue management, token
allocation, and dequeuing control. After you set a performance upper limit objective for a QoS
policy, the system determines the performance upper limit of each VBS node by coordinating the
distributed traffic control parameters and then converts the performance upper limit into a
specified number of tokens. If the traffic to be restricted is IOPS, one I/O consumes one token. If
the traffic to be restricted is bandwidth, one byte consumes one token. Volume-based I/O queue
management uses the token mechanism to allocate storage resources. The more tokens the I/O queue of a volume obtains, the more I/O resources are allocated to that volume.
As shown in the following figure, I/Os from application servers first enter I/O queues of volumes.
SmartQoS periodically processes I/Os waiting in the queues. It dequeues the head element in a
queue, and attempts to obtain tokens from a token bucket. If the number of remaining tokens in
the token bucket meets the token requirement of the head element, the system delivers the
element to another module for processing and continues to process the next head element. If the
number of remaining tokens in the token bucket does not meet the token requirement of the head
element, the system puts the head element back in the queue and stops I/O dequeuing.
Figure 3-2 Volume-based I/O traffic control management algorithm
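A minimal token-bucket sketch of this dequeuing rule is shown below (names and structures are assumptions): an I/O costs one token when IOPS is limited, or one token per byte when bandwidth is limited, and dequeuing stops as soon as the bucket cannot cover the head element.
```python
from collections import deque

class TokenBucket:
    def __init__(self, tokens_per_period):
        self.rate = tokens_per_period
        self.tokens = 0
    def refill(self):
        self.tokens += self.rate
    def take(self, n):
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

def process_volume_queue(queue: deque, bucket: TokenBucket, limit_bandwidth=False):
    bucket.refill()                            # called once per QoS period
    while queue:
        io = queue[0]                          # head element of the volume's queue
        cost = io["bytes"] if limit_bandwidth else 1
        if not bucket.take(cost):
            break                              # leave the head in the queue; stop dequeuing
        queue.popleft()
        io["dispatch"]()                       # hand the I/O to the next module
```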
3.1.3 HyperSnap
HyperSnap is the snapshot feature in the block service that captures the state of volume data at a
specific point in time. The snapshots created using HyperSnap can be exported and used for restoring
volume data.
The system uses the ROW mechanism to create snapshots, which imposes no adverse impact on
volume performance.
Figure 3-1 HyperSnap in the block service
SCSI volumes with multiple mount points are shared volumes. All iSCSI volumes are shared
volumes. To back up shared volumes, the block service can create snapshots for the shared volumes.
The procedure for creating snapshots for shared volumes is the same as that for common volumes.
The block service supports the consistency snapshot capability. Specifically, the block service can
ensure that the snapshots of multiple volumes used by an upper-layer application are at the same point
in time. Consistency snapshots are used for VM backup. A VM is usually mounted with multiple
volumes. When a VM is backed up, all volume snapshots must be at the same time point to ensure
data restoration reliability.
Figure 3-2 Consistency snapshot in the block service
3.1.4 HyperClone
HyperClone is the clone feature in the block service that provides the linked clone function to create
multiple clone volumes from one snapshot. Data on each clone volume is consistent with that of the
snapshot. Data writes and reads on a clone volume have no impact on the source snapshot or other
clone volumes.
The system supports a linked clone ratio of 1:2048, effectively improving storage space utilization.
A clone volume has all functions of a common volume. You can create snapshots for a clone volume,
use the snapshots to restore the clone volume, and clone the clone volume.
Figure 3-1 Linked clone in the block service
 Clone application scenario
By creating clones from volumes or snapshots and then reading and writing the clones, you can perform data mining and testing without affecting service data. For example:
A snapshot is generated at 11:00 a.m. for the data to be tested.
A test server creates a clone from the snapshot and tests the clone (reads and writes the clone). During the test, the source data and the services running on the source data are not affected.
One hour later, both the source data and the cloned data have changed compared with the data captured at 11:00 a.m.
3.1.5 HyperReplication
HyperReplication is the asynchronous remote replication feature in the block service that periodically
synchronizes differential data on primary and secondary volumes of block service clusters. All the
data generated on primary volumes after the last synchronization will be synchronized to the
secondary volumes.
Periodic synchronization: Based on the preset synchronization period, the primary replication cluster
periodically initiates a synchronization task and breaks it down to each working node based on the
balancing policy. Each working node obtains the differential data generated at specified points in time
and synchronizes the differential data to the secondary end.
No differential logs: HyperReplication does not provide the differential log function. The LSM log
(ROW) mechanism supports data differences at multiple time points, saving memory space and
reducing impacts on host services.
Each logical address mapping entry (metadata) records the time point at which the data is written.
For write requests to the same address at the same time point, the new data is appended: it is written to a new address, new metadata is recorded, the old metadata is deleted, and the old data space is reclaimed.
For write requests to the same address at different time points, the new data is also appended: it is written to a new address and new metadata is recorded. If no snapshot exists at the original time point, the old metadata is deleted and its space is reclaimed; otherwise, the old metadata is retained.
The metadata mapping entry itself can identify an incremental data modification address within a
specified time period.
You can deploy DR clusters as required. A DR cluster provides replication services and manages DR
nodes, cluster metadata, replication pairs, and replication consistency groups. DR nodes can be
deployed on the same servers as storage nodes or on independent servers. DR clusters have excellent
scalability. A single DR cluster contains three to 64 nodes. One system supports a maximum of eight
DR clusters. A single DR cluster supports 64000 volumes and 16000 consistency groups, meeting
future DR requirements.
After an asynchronous remote replication relationship is established between a primary volume at the
primary site and a secondary volume at the secondary site, initial synchronization is implemented.
After initial synchronization, the data status of the secondary volume becomes consistent. Then, I/Os
are processed as follows:
1. The primary volume receives a write request from a production host.
2. The system writes the data to the primary volume, and returns a write completion response to the
host.
3. The system automatically synchronizes incremental data from the primary volume to the
secondary volume at a user-defined interval, which ranges from 60 seconds to 1440 minutes in
the standard license and from 10 seconds to 1440 minutes in the advanced license. If the
synchronization mode is manual, you need to trigger synchronization manually. When the
synchronization starts, the system generates a synchronization snapshot for the primary volume
to ensure that the data read from the primary volume during the synchronization remains
unchanged.
4. The system generates a synchronization snapshot for the secondary volume to back up the
secondary volume's data in case that the data becomes unavailable if an exception occurs during
the synchronization.
5. During synchronization, the system copies data in the synchronization snapshot of the primary
volume to the secondary volume. After synchronization, the system automatically deletes the
synchronization snapshots of the primary and secondary volumes.
Figure 3-1 Asynchronous remote replication in the block service
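Steps 1 to 5 above can be condensed into the following sketch. The objects and method names are hypothetical; they only illustrate the order of snapshot creation, incremental copy, and snapshot deletion.
```python
def replication_cycle(primary, secondary):
    snap_p = primary.create_sync_snapshot()      # stable read source on the primary
    snap_s = secondary.create_sync_snapshot()    # fallback copy of secondary data
    for block in snap_p.changed_blocks_since_last_sync():
        secondary.write(block)                   # copy only the incremental data
    # after a successful synchronization both synchronization snapshots are deleted
    snap_p.delete()
    snap_s.delete()
```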
3.1.6 HyperMetro
HyperMetro is the active-active storage feature that establishes active-active DR relationships
between two block service clusters in two data centers. It provides HyperMetro volumes by
virtualizing volumes in the two block service clusters and enables the HyperMetro volumes to be read
and written by hosts in the two data centers at the same time. If one data center fails, the other
automatically takes over services without data loss and service interruption.
HyperMetro in the block service supports incremental synchronization. If a site fails, the site winning
arbitration continues to provide services. I/O requests change from the dual-write state to the single-
write state. After the faulty site recovers, incremental data can be synchronized to it to quickly restore
the system.
The block service supports logical write error handling. If the system is running properly but one site
fails to process a write I/O, the system will redirect the write I/O to a normal site for processing. After
the fault is rectified, incremental data can be synchronized from the normal site to the one that fails to
process the I/O. By doing so, upper-layer applications do not need to switch sites for I/O processing
upon logical write errors.
HyperMetro supports a wide range of upper-layer applications, including Oracle RAC and VMware.
It is recommended that the distance between two HyperMetro storage systems be less than 100 km in
database scenarios and be less than 300 km in VMware scenarios. For details about supported upper-
layer applications, access Storage Interoperability Navigator.
Figure 3-1 HyperMetro in the block service
3.1.6.1 Key Technologies
 I/O Reads and Writes
When HyperMetro functions properly, both sites provide read and write services for hosts. When
either of the two sites receives read and write I/O requests from a host, the operations shown in
the following figures are performed:
After receiving a host read I/O request, the system accesses the data on volumes at the local site
and returns the data to the host.
After receiving a write I/O request, the system writes the data into both the local and remote
volumes at the same time and returns a write success message to the host.
When HyperMetro functions properly but a storage pool at the local site is faulty, the active-
active relationship is disconnected, the remote site continues providing services, and volumes at
the local site cannot be read or written. After the I/O redirection function is enabled at the local
site, read and write I/Os delivered to the local site will be redirected to the remote site for
processing (only the SCSI protocol is supported).
 Multiple Replication Clusters and Elastic Expansion
HyperMetro provides active-active DR services. Replication clusters can be deployed only when
HyperMetro is enabled. Replication clusters provide linear scalability for the performance and
number of HyperMetro pairs. When the performance or number of HyperMetro pairs cannot
meet service requirements due to service growth, replication nodes can be added without
interrupting services. A single replication cluster can contain a maximum of 64 nodes. A
replication cluster can establish active-active storage relationships with a maximum of two
remote replication clusters (one volume can only be used to create one HyperMetro pair). Active-
active storage relationships cannot be established among replication clusters in the block service.
A new replication cluster can be deployed to meet customers' requirements for a dedicated
replication cluster or when one cluster cannot meet service requirements. The block service
supports a maximum of 8 replication clusters.
 Cross-Site Data Reliability
1. Cross-Site Data Mirroring
Data on the HyperMetro volumes must be consistent in the storage pools at the two sites in
real time. Therefore, when a host delivers a write request, the data must be successfully
written to storage pools at the two sites. If data is successfully written in the storage pool at
only one site because the storage pool at the other site is faulty or the link between the two
sites is down, logs containing basic information about the write request instead of the
specific data will be recorded. After the fault is rectified, incremental data synchronization is
performed based on the logs.
For example, when site A receives a write I/O, the mirroring process is as follows:
A host at site A delivers the write I/O request.
A pre-write log is recorded in the storage pool at site A.
The pre-write log is successfully processed.
The block cluster at site A writes data to the local storage pool and delivers the write request
to the remote cluster at the same time.
After the data is successfully written to the remote cluster, the remote cluster returns a write
success message to the local cluster.
Data is successfully written to both the local and remote clusters. The system deletes the
pre-write log, and returns a write success message to the host.
If the data fails to be written to either of the local or remote site, the active-active storage
relationship is disconnected. Only the site to which the data is successfully written continues
providing services and the pre-write log is converted to a data change record. After the
active-active storage relationship is recovered, incremental data will be synchronized
between the two sites based on the data change record.
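The mirroring flow and its failure handling can be summarized in the sketch below. All objects (local_pool, remote_cluster, prewrite_log, change_log) are illustrative assumptions, not HyperMetro interfaces.
```python
def dual_write(io, local_pool, remote_cluster, prewrite_log, change_log):
    prewrite_log.record(io.address, io.length)   # pre-write log comes first
    local_ok = local_pool.write(io)              # in the real system both writes
    remote_ok = remote_cluster.write(io)         # are issued concurrently
    if local_ok and remote_ok:
        prewrite_log.remove(io.address)          # both sites are consistent
        return "success"
    # one site failed: the surviving site keeps serving; the pre-write log is
    # converted into a change record for incremental resynchronization later
    change_log.record(io.address, io.length)
    prewrite_log.remove(io.address)
    return "success (degraded, change recorded)"
```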
2. Data Consistency Assurance
In the HyperMetro DR scenario, read and write operations can be concurrently performed at
both sites. If data is read from or written to the same storage address on a volume
simultaneously, the storage layer must ensure data consistency at both sites.
HyperMetro enables the storage systems at the two sites to provide concurrent accesses. The
two sites can perform read and write operations on the same volume concurrently. If
different hosts perform read and write operations to the same storage address on a volume
concurrently, the storage system must ensure data consistency between the two sites.
Traditional distributed storage systems use the distributed locking mechanism to resolve
concurrent write I/O conflicts. In the conventional solution, when a host accesses a volume,
it applies for a lock from the cross-site distributed lock service. Data can be written to the volume only after a cross-site lock is obtained; other write requests that do not hold the lock can be processed only after the lock is released. The problem with this mechanism is that each write operation requires a cross-site lock, which increases cross-site interactions.
In addition, the system concurrency is poor. Even when two write requests are concurrently
delivered to two storage addresses, they are still processed in a serial manner, decreasing the
efficiency of dual-write and affecting the system performance.
HyperMetro uses the optimistic locking mechanism to reduce write conflicts. Write requests
initiated by each host are processed independently without applying for locks. Request
conflicts are checked until data write submission. When the block service detects that the
data in the same storage address is modified by two concurrent write requests, one of the write requests is forwarded to the other site to be processed in a serial manner, ensuring data consistency at the two sites.
As shown in the preceding figure, hosts in the two DCs both write data to the HyperMetro
volume. I/O dual writes are successfully performed on the volume. A host in DC A delivers
I/O 2 to modify the data in a storage address on the HyperMetro volume. The system then
detects that I/O 1 delivered by the host in DC B is also modifying the data in the same
address (the local scope lock can be used to detect whether data modifications conflict)
during the submission. In this case, I/O 2 is forwarded to DC B and will be written to both
DCs after I/O 1 is processed.
In the optimistic locking mechanism, the cross-site lock service is not required. Write
requests do not need to apply for a lock from the distributed lock service or even the cross-
site lock service, improving the concurrency performance of active-active clusters.
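A toy version of the optimistic approach is sketched below (the data structures and forwarding call are assumptions): writes proceed without any cross-site lock, and only a commit-time conflict on the same address causes one request to be forwarded to the peer site for serial processing.
```python
local_inflight = {}   # address -> write currently being committed at this site

def commit_write(io, peer_site, dual_write):
    """dual_write is the normal two-site write path (see the earlier sketch)."""
    addr = io["address"]
    if addr in local_inflight:            # concurrent write to the same address
        peer_site.forward(io)             # serialize it at the peer site
        return "forwarded"
    local_inflight[addr] = io             # local-scope lock only; no cross-site lock
    try:
        dual_write(io)
    finally:
        del local_inflight[addr]
    return "committed"
```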
3. Cross-Site Bad Block Repair
If a storage pool at a site has bad data blocks, that is, multiple data copies in the storage pool
have bad blocks, you can use the data at the peer site to repair the bad blocks.
The process is as follows:
A production host reads data from site A.
Site A detects bad blocks when reading data.
Site A delivers a request to read data from site B after verifying that the data in site B is
consistent.
After successfully reading data from site B, site A uses the data to repair the bad blocks in
site A.
Site A returns the correct data to the production host.
4. Performance Optimization
To ensure real-time data consistency of two sites, a write success message is returned to
hosts only when the data has been written to the storage systems at both sites. Real-time
dual-write increases the latency of active-active I/Os. To address this, HyperMetro employs
various I/O performance optimization solutions to mitigate the impact on write latency and
improve the overall active-active service performance.
Initial data synchronization performance optimization: In the initial synchronization of
active-active mirroring data, only a small amount of data is written to the local volume. To
improve the copy performance and reduce the impact on hosts, the initial synchronization is
optimized when the remote volume has no data. Only data written to the local volume is
synchronized to the remote site. Suppose that the size of the local volume is 1 TB and only
100 GB data is written to the volume. If the remote volume is a new volume that has no
data, only the 100 GB data is synchronized to the remote volume during initial data
synchronization.
FastWrite: A write I/O operation is generally divided into two steps: write allocation and
write execution. In this way, to perform a remote write operation, the local site needs to
communicate with the remote site twice. To reduce the communication latency between
sites and improve the write performance, HyperMetro combines write allocation and write
execution as one request and delivers it to the remote site. In addition, the interaction for the
write allocation completion is canceled. This halves the interactions of a cross-site write I/O
operation. For example, the RTT is 1 ms. FastWrite reduces the transmission time for
delivering requests to the remote site from 2 ms to 1 ms.
Optimistic lock: When HyperMetro functions properly, both sites support host access. To
ensure data consistency, write operations need to be locked. In the traditional distributed
lock solution, each write request needs to obtain a cross-site distributed lock, increasing the
host write latency. To improve the write performance, HyperMetro uses the local optimistic
lock to replace the traditional distributed lock, reducing the time for cross-site
communication.
Load balancing: HyperMetro is in the active-active storage mode and both sites support host
access. You can set third-party multipathing software to the load balancing mode to balance
the read and write operations delivered to both sites, improving host service performance.
 Arbitration Mechanism
1. Dual Arbitration Mode
HyperMetro supports the static priority arbitration mode and quorum server mode. If a third-
place quorum server is faulty, the systems automatically switch to the static priority
arbitration mode. When the link between the two sites fails, the quorum function is still
available.
2. Static Priority Mode
When creating a HyperMetro pair, you can specify the preferred site of the pair. If the link
between the two sites is abnormal, the preferred site continues providing services. The
principles are as follows:
The storage at each site periodically sends a heartbeat message to the peer site to check
whether the peer cluster is working properly.
When the heartbeat between the local and remote clusters is abnormal, the pair provides
services only at site A according to the configuration.
If site A itself is faulty, site B does not take over and also stops providing services. In this case, services are interrupted.
3. Third-Place Arbitration Mode
A quorum server needs to be deployed on a physical or virtual machine at a third place. In
addition to sending heartbeats to each other, storage systems at each site periodically send
heartbeats to the quorum server. Arbitration is triggered only when the heartbeat exchanged
with the peer site is abnormal. During the arbitration process, the non-preferred site initiates
the arbitration request later than the preferred site to ensure that the preferred site wins the
arbitration first and continues providing services. When heartbeats between the two sites are
normal, the failure of the arbitration service does not affect the HyperMetro services and the
systems automatically switch to the static priority arbitration mode.
4. Service-based Arbitration
HyperMetro provides consistency groups. If services running on multiple pairs are mutually
dependent, you can add the pairs into a consistency group. All member pairs in a
consistency group can be arbitrated to the same site when a link is faulty to ensure service
continuity. The arbitration is implemented as follows:
Preferred sites are independently specified for consistency groups. The preferred sites of
some consistency groups are site A while the preferred sites of other consistency groups are
site B.
If a link is down, some services are running at site A while other services are running at site
B. Service performance is not degraded.
After the link is recovered, differential data is synchronized between the two sites by
consistency group.
3.1.7 Application Scenarios of the Block Service and Kubernetes Integration Solution
The block service provides a CSI plug-in that supports Kubernetes, the mainstream container management platform.
The process for Kubernetes to use the Huawei distributed storage CSI plug-in to provide volumes is as follows:
1. The Kubernetes Master instructs the CSI plug-in to create a volume. The CSI plug-in invokes the storage interface to create the volume.
2. The Kubernetes Master instructs the CSI plug-in to map the volume to the specified node. The CSI plug-in invokes the storage interface to map the volume to the specified node host.
3. The target Kubernetes node to which the volume is mapped instructs the CSI plug-in to mount the volume. The CSI plug-in formats the volume and mounts it to the specified directory of Kubernetes.
Note:
The CSI plug-in auxiliary service provided by the Kubernetes community is implemented based on
the CSI specifications. The CSI plug-ins are deployed on all Kubernetes nodes to create, delete, map,
unmap, mount, or unmount volumes.
The CSI plug-in is used to connect to the management and control plane to manage the storage
system.
The SCSI (private clients are deployed on compute nodes) or iSCSI (private clients are deployed on
storage devices) mode is used to implement data plane access.
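The three-step flow above can be expressed as a small pseudo-flow. This is not the real CSI interface; the function and object names are assumptions used only to show the order of operations.
```python
def provision_and_attach(controller_plugin, node_plugin, storage, size_gb, node, mount_path):
    volume = controller_plugin.create_volume(storage, size_gb)   # step 1: create volume
    storage.map_volume(volume, node)                             # step 2: map to the node host
    node_plugin.format_if_needed(volume)                         # step 3: format and
    node_plugin.mount(volume, mount_path)                        #         mount on the node
    return volume
```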
3.2 Object Service Features
3.2.1 Online Aggregation of Small Objects
Traditional object storage systems face the following challenges with small objects: three copies are kept for each small object, so the system space utilization is only about 33%; and when small objects are encoded using EC, the system must read these objects from HDDs, imposing high demands on performance. To address these challenges, the object service aggregates small objects online,
significantly improving space utilization without compromising performance. The following figure
shows the aggregation process.
Figure 3-1 Small-object aggregation
As shown in the preceding figure, small objects (such as Obj1 to Obj7) uploaded by clients are
written into the SSD cache first. After the total size of the small objects reaches the size of an EC
stripe, the system calculates the objects using EC and stores generated data fragments (such as Strip1)
and parity fragments (such as Parity1) onto HDDs. In this way, small objects are erasure coded, and
the space utilization is significantly improved. For example, if the EC scheme is 12+3, the space utilization is 80%, about 2.4 times the roughly 33% utilization of the traditional three-copy mode.
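The utilization figures and the flush-on-full-stripe behavior can be checked with the small sketch below; the buffer class is an illustrative assumption.
```python
def ec_utilization(k, m):
    return k / (k + m)

print(ec_utilization(12, 3))   # 0.8  -> 80% usable space with EC 12+3
print(1 / 3)                   # ~0.33 -> usable space with three copies

class AggregationBuffer:
    """Buffer small objects in SSD cache and flush a full EC stripe to HDDs."""
    def __init__(self, stripe_size):
        self.stripe_size = stripe_size
        self.pending, self.size = [], 0
    def add(self, obj_bytes, flush_stripe):
        self.pending.append(obj_bytes)
        self.size += len(obj_bytes)
        if self.size >= self.stripe_size:
            flush_stripe(self.pending)   # EC-encode one stripe and write to HDDs
            self.pending, self.size = [], 0
```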
3.2.2 Quota and Resource Statistics
The object service supports bucket and tenant capacity quotas as well as object resource statistics. The
following figure shows a capacity quota example where company departments represent tenants and
employees in the departments represent buckets. You can set a 40 TB quota for the financial
department (tenant 2) and a 10 TB quota for employee b (bucket 2) in the department.
Figure 3-1 Object service quota
The capacity quota function in the object service has the following characteristics:
 Bucket capacity quota: specifies the maximum size of a bucket. When the bucket size reaches
the specified upper limit, new data cannot be written into the bucket.
 Tenant capacity quota: specifies the maximum capacity assigned to a tenant. When the total
size of buckets in a tenant reaches the specified upper limit, the tenant and all its users cannot
write new data.
The object service can use REST APIs to obtain resource statistics of tenants and buckets, such as the
number and capacity of objects:
 Bucket resource statistics: includes bucket sizes and the number of objects in buckets. Users
can query their own bucket resources.
 Tenant resource statistics: includes the tenant quotas, number of buckets and objects in tenants,
and the total capacity.
3.2.3 Access Permission Control
The object service implements access permission control for buckets and objects using Access
Control Lists (ACLs) and bucket policies. You can only access resources for which you have
permissions:
 ACL: grants tenants the permissions to access resources. Each entry in an ACL specifies
permissions (read-only, write, or read and write) of specific tenants. ACLs can grant permissions
but cannot deny permissions. The following figure shows an example.
Figure 3-1 Access permission control by ACL
 Bucket policy: controls access from accounts and users to buckets and objects. Bucket policies
can both grant and deny permissions. Bucket policies provide more refined permission control
than ACLs. For example, bucket policies can control specific operations (such as PUT, GET, and
DELETE), forcibly enable HTTPS access, control access from specific IP address segments,
allow access to objects with specific prefixes, and grant access permissions to specific clients.
The following figure shows an example.

Figure 3-2 Access permission control by bucket policy
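The following sketch illustrates, in simplified form, how a bucket policy of the kind described above might be evaluated. The policy structure and field names are assumptions made for illustration and do not reflect the product's actual policy syntax; the rule that an explicit deny overrides any allow is a common convention assumed here.

# Simplified, hypothetical bucket policy evaluation: deny statements win over allows.
import ipaddress

policy = {
    "statements": [
        {"effect": "Allow", "users": ["user-a"], "actions": ["GET", "PUT"],
         "prefix": "logs/", "source_cidr": "192.168.10.0/24"},
        {"effect": "Deny", "users": ["*"], "actions": ["DELETE"],
         "prefix": "", "source_cidr": "0.0.0.0/0"},
    ]
}

def matches(stmt, user, action, key, source_ip):
    return ((stmt["users"] == ["*"] or user in stmt["users"])
            and action in stmt["actions"]
            and key.startswith(stmt["prefix"])
            and ipaddress.ip_address(source_ip) in ipaddress.ip_network(stmt["source_cidr"]))

def is_allowed(user, action, key, source_ip):
    allowed = False
    for stmt in policy["statements"]:
        if matches(stmt, user, action, key, source_ip):
            if stmt["effect"] == "Deny":
                return False          # an explicit deny always wins
            allowed = True
    return allowed

print(is_allowed("user-a", "GET", "logs/2020-01.txt", "192.168.10.5"))     # True
print(is_allowed("user-a", "DELETE", "logs/2020-01.txt", "192.168.10.5"))  # False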



3.2.4 Multi-Tenancy

Figure 3-1 Multi-tenancy


As shown in the preceding figure, the object service provides multi-tenant management. Data of
different tenants is logically isolated to facilitate resource allocation. Multi-tenant management has
the following benefits:
 A single system provides a variety of client services, reducing initial investments.
 The system is centrally managed, data is logically isolated, and online storage is supported.
Encrypted HTTPS transmission and user authentication are supported to ensure data transmission
security.

3.2.5 SmartQoS
The object service provides SmartQoS to properly allocate system resources and deliver better service
capabilities.

Figure 3-1 SmartQoS


In multi-tenant scenarios such as cloud environments, customers require that transactions per second
(TPS) and bandwidth resources in storage pools be properly allocated to tenants or buckets with
different priorities and that the TPS and bandwidth resources of mission-critical services be sufficient.
To meet customer requirements, the object service provides the following refined QoS capabilities:
 Refined I/O control: enables the system to provide differentiated services for tenants and
buckets with different priorities.
 TPS- and bandwidth-based QoS for tenants and buckets: accurately controls operations, such
as PUT, GET, DELETE, and LIST.
QoS in the object service allocates buckets with different TPS and bandwidth capabilities for
applications of different priorities. This maximizes storage pool resource utilization and prevents
mission-critical services from being affected by other services. Different QoS policies can be
configured for VIP and common tenants in the same system to ensure service quality for high-priority
tenants.
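A common way to realize TPS- and bandwidth-based QoS of this kind is a token bucket per tenant or bucket. The sketch below is a generic illustration under that assumption, not the product's actual algorithm; the tenant names and rates are hypothetical.

# Generic token-bucket rate limiter: one bucket per tenant (or per object bucket) and per metric.
import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate            # tokens added per second (e.g. TPS limit or bytes/s)
        self.capacity = burst       # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def try_consume(self, amount=1.0):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= amount:
            self.tokens -= amount
            return True             # request may proceed
        return False                # request should be queued or rejected

# Example: a VIP tenant gets 1000 TPS, a common tenant 100 TPS.
limits = {"vip_tenant": TokenBucket(rate=1000, burst=1000),
          "common_tenant": TokenBucket(rate=100, burst=100)}

def admit(tenant):
    # One token per operation (PUT, GET, DELETE, LIST); a bandwidth limiter would consume bytes.
    return limits[tenant].try_consume(1)

print(admit("vip_tenant"), admit("common_tenant"))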

3.2.6 Object-Level Deduplication


Object-level deduplication enables the system to automatically detect and delete duplicate objects.
After object-level deduplication is enabled for accounts, the system automatically searches for
duplicate objects belonging to the accounts, retains one copy of an object, and replaces its duplicates
with pointers indicating the location of the one remaining copy. By doing so, redundant data is deleted
and storage space is freed up. For example, when clients upload identical files (or images, videos,
software) onto a web disk, the system with object-level deduplication enabled will save only one copy
and replace all the other identical files with pointers indicating the location of that copy. The
following figure shows an example.

Figure 3-1 Object-level deduplication


The system identifies duplicate objects by comparing their MD5 values, data protection levels, and
sizes. If the objects are duplicate, the system retains one copy of the objects. Other duplicate objects
point to the copy. This saves storage space and increases space utilization.
Enabling and disabling object-level deduplication is done at the account level. Object-level
deduplication is disabled by default. Once object-level deduplication is enabled for an account, the
system automatically scans all objects of the account for deduplication.
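The duplicate-detection logic described above (matching MD5 value, data protection level, and size, then replacing duplicates with pointers) can be sketched as follows. This is an illustrative model with hypothetical class names, not the product implementation.

# Illustrative object-level deduplication: identical (md5, protection level, size) keeps one copy.
import hashlib

class DedupStore:
    def __init__(self):
        self.objects = {}   # object name -> fingerprint (the "pointer")
        self.copies = {}    # fingerprint -> stored data (one physical copy)

    def put(self, name, data, protection_level="EC-12+3"):
        fingerprint = (hashlib.md5(data).hexdigest(), protection_level, len(data))
        if fingerprint not in self.copies:
            self.copies[fingerprint] = data        # first copy is stored physically
        self.objects[name] = fingerprint           # duplicates only store a pointer

    def get(self, name):
        return self.copies[self.objects[name]]

store = DedupStore()
store.put("video_a.mp4", b"same-content")
store.put("video_b.mp4", b"same-content")          # duplicate: no extra data stored
print(len(store.copies), len(store.objects))        # 1 physical copy, 2 object entries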

3.2.7 WORM
Write Once Read Many (WORM) is a technology that makes data read-only once it has been written.
Users can set protection periods for objects. During protection periods, objects can be read but cannot
be modified or deleted. After protection periods expire, objects can be read or deleted but cannot be
modified. WORM is mandatory for archiving systems.
The WORM feature in the object service does not provide any privileged interfaces or methods to
delete or modify object data that has the WORM feature enabled.
WORM policies can be configured for buckets. Different buckets can be configured with different
WORM policies. In addition, you can specify different object name prefixes and protection periods in
WORM policies. For example, you can set a 100-day protection period for objects whose names start
with prefix1 and a 365-day protection period for objects whose names start with prefix2.
The object service uses built-in WORM clocks to time protection periods. After a WORM clock is
set, the system times protection periods according to the clock. This ensures that objects are properly
protected even if the local clock time is changed. Each object has creation time and expiration time
measured by its WORM clock. After WORM properties are set for an object, the object uses a
WORM clock for timing, preventing its protection period from being changed due to local node time
changes.
A WORM clock can automatically adjust its time according to the local node time:
 If the local node time is earlier than the WORM clock time, the WORM clock winds back its
time 128 seconds or less every hour.
 If the local node time is later than the WORM clock time, the WORM clock adjusts its time to
the local node time.
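A minimal sketch of the adjustment rules above, assuming the clock is evaluated once per hour (the function is illustrative, not product code):

# Illustrative hourly adjustment of a WORM clock against the local node time.
MAX_BACKWARD_PER_HOUR = 128  # seconds the WORM clock may wind back per hour

def adjust_worm_clock(worm_time: int, local_time: int) -> int:
    """Both times are Unix timestamps in seconds; assumed to be called once per hour."""
    if local_time < worm_time:
        # Local time is earlier: wind back at most 128 seconds per hour.
        return worm_time - min(worm_time - local_time, MAX_BACKWARD_PER_HOUR)
    # Local time is later (or equal): adjust to the local time.
    return local_time

print(adjust_worm_clock(worm_time=10_000, local_time=9_000))   # 9872 (back by 128 s)
print(adjust_worm_clock(worm_time=10_000, local_time=10_500))  # 10500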
Objects enabled with WORM have three states: unprotected, protected, and protection expired, as
shown in the following figure.

Figure 3-1 WORM


 Unprotected: Objects in the unprotected state can be read, modified, and deleted, same as
common objects.
 Protected: After a WORM policy is enabled for a bucket, the objects that meet the WORM policy
enter the protected state and can only be read.
 Protection expired: When the WORM protection period of objects expires, the objects enter the
protection expired state. In this state, the objects can only be read or deleted.

3.2.8 HyperReplication
HyperReplication is a remote replication feature provided in the object service that implements
asynchronous remote replication to periodically synchronize data between primary and secondary
storage systems for system DR. This minimizes service performance deterioration caused by the
latency of long-distance data transmission.
Remote replication is a core technology for DR and backup, and the basis for data synchronization and disaster recovery. It maintains a remote data copy through the remote data connection function of storage devices that reside in different places. Even when a disaster occurs, the data backups on remote storage devices are unaffected and can be used for data restoration, ensuring service continuity. Remote replication is divided into synchronous and asynchronous remote replication, depending on whether a write request from the client must be confirmed by the secondary storage system before it is acknowledged.
Asynchronous remote replication: When a client sends data to the primary storage system, the primary
storage system writes the data. After the data is successfully written to the primary storage system, a
write success message is returned to the client. The primary storage system periodically synchronizes
data to the secondary storage system, minimizing service performance deterioration caused by the
latency of long-distance data transmission.
Synchronous remote replication: The client sends data to the primary storage system. The primary
storage system synchronizes the data to the secondary storage system in real time. After the data is
successfully written to both the primary and secondary storage systems, a write success message is
returned to the client. Synchronous remote replication maximizes data consistency between the
primary and secondary storage systems and reduces data loss in the event of a disaster.
Default cluster and non-default cluster: A region has only one default cluster; all other clusters are non-default clusters. In the default cluster, the LS and POE services are active, so the default cluster has read and write permissions for LS and POE operations. In a non-default cluster, these services are standby, so the cluster has only read permission for LS and POE operations. The default and non-default clusters are unrelated to replication groups and to the primary and secondary clusters introduced below.
Primary and secondary clusters: Primary and secondary roles are defined per bucket. The cluster where a source bucket resides is the primary cluster for that bucket, and the backup of the source bucket is stored in the secondary cluster. Therefore, a cluster is not absolutely primary or secondary. Assume that there are clusters Cluster1 and Cluster2 and buckets Bucket1 and Bucket2. For Bucket1, the primary cluster is Cluster1 and the secondary cluster is Cluster2; at the same time, for Bucket2, the primary cluster is Cluster2 and the secondary cluster is Cluster1. Primary and secondary clusters are unrelated to the default and non-default clusters: the default cluster is not necessarily a primary cluster, and a non-default cluster is not necessarily a secondary cluster.
Replication group: A replication group is the DR attribute of a bucket. It defines the primary and
secondary clusters, as well as the replication link of the bucket. The bucket and all objects in it are
synchronized between the primary and secondary clusters. If one cluster is faulty, data can be
recovered using backups in the other cluster. A bucket belongs to only one replication group. When
creating a bucket, you need to select the replication group to which the bucket belongs. Then, the
system performs remote replication for the bucket based on the replication group's definition. If you
do not select an owning replication group when creating a bucket, the system will add the bucket to
the default replication group. The system has only one default replication group, which is specified by
the user.
A replication relationship is established between the primary and secondary clusters. After the relationship is established, data written to the primary cluster is asynchronously replicated to the secondary cluster. The process is as follows:
The client puts an object to the primary cluster. After the object is successfully uploaded, a replication
log is generated. The log records information required by the replication task, such as the bucket name
and object name.
The synchronization task at the primary cluster reads the replication log, parses the bucket name and
object name, reads the data, and writes the data to the secondary site to complete the replication of an
object. If the secondary cluster is faulty or the network between the primary and secondary clusters is
faulty, the replication task fails for a short period of time. The primary cluster retries until the object
replication is successful.
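The replication flow above can be summarized in Python-like pseudocode. The queue, client calls, and names used here are assumptions made purely for illustration.

# Illustrative asynchronous replication loop: read replication logs, copy objects, retry on failure.
import queue, time

replication_log = queue.Queue()   # entries written after each successful PUT on the primary

def on_put_success(bucket, key):
    # Step 1: a replication log entry records what must be copied (bucket name, object name).
    replication_log.put({"bucket": bucket, "key": key})

def replicate_forever(read_from_primary, write_to_secondary, retry_interval=5):
    # Step 2: the synchronization task drains the log and copies each object to the secondary site.
    while True:
        entry = replication_log.get()
        while True:
            try:
                data = read_from_primary(entry["bucket"], entry["key"])
                write_to_secondary(entry["bucket"], entry["key"], data)
                break                       # object replicated successfully
            except ConnectionError:
                time.sleep(retry_interval)  # secondary or link fault: keep retrying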
The primary and secondary clusters are in the same region. After asynchronous replication is configured, data written to the primary cluster is synchronized to the secondary cluster. If the primary cluster is faulty, the secondary cluster can be promoted to primary with one click to continue providing services.
The object service can be accessed through a unified domain name and supports seamless failover after a primary cluster fails. Users do not need to change the domain name or URL used to access the object service.

3.2.9 Protocol-Interworking
The feature of interworking between object and file protocols (Protocol-Interworking) provided by Huawei distributed storage adds NAS capabilities to the object storage system. By providing NFS access on top of the distributed object service, the storage system can receive I/O requests from standard NFS clients, parse them, convert requests for files into requests for objects, and process the I/O requests using the storage capabilities of the object storage system. In-depth software optimization further improves the NFS experience.
In the object storage system, buckets are classified into common object buckets and file buckets. A common object bucket can be accessed only through the object protocol. A file bucket can be accessed through both the standard NFS protocol and the object protocol.
The object storage system has a built-in NFS protocol parsing module. This module receives I/O
requests from standard NFS clients, parses the requests, converts operations on files into operations
on the objects, and then uses built-in object clients to send the requests to the storage system for
processing.
Data reads and writes are performed by the object service. The NFS module only parses and converts the NFS protocol and caches data; it does not store data. All data is stored in buckets of the object storage system, and the data protection level is determined by the bucket configuration.
All directories and files seen by customers in a file system are objects to the object storage system. A directory is an object whose name is the full path of the directory followed by a slash (/). In this way, listing a directory is converted to a ListObject request in which the prefix is the parent directory and the delimiter is /. Read/write access to a file is converted to read/write access to the object named after the full path of the file. Protocol-Interworking parses the accessed file or directory to obtain the object name and converts the operation on the file/directory into an operation on the object/bucket.
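The path-to-object-name mapping described above can be illustrated as follows; the helper names are hypothetical and the example paths are invented.

# Illustrative mapping between NFS paths and object names/requests.

def file_to_object_name(path: str) -> str:
    # A file maps to an object named after its full path (leading slash dropped).
    return path.lstrip("/")

def directory_to_object_name(path: str) -> str:
    # A directory maps to an object named "<full path>/".
    return path.strip("/") + "/"

def list_directory_request(path: str) -> dict:
    # Listing a directory becomes a ListObject request with prefix = parent directory, delimiter = "/".
    return {"operation": "ListObject",
            "prefix": directory_to_object_name(path),
            "delimiter": "/"}

print(file_to_object_name("/media/video/clip01.mp4"))   # media/video/clip01.mp4
print(directory_to_object_name("/media/video"))         # media/video/
print(list_directory_request("/media/video"))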

3.2.10 Application Scenarios of the Object Service: Video Surveillance Cloud Solution

With the continuous expansion of urbanization, city security is facing diversified threats, including
social security, public health, natural disasters, and accidents. The video surveillance system is an
indispensable part of a safe city.
The client software that supports unified view display can consolidate storage clusters in different
areas into a storage resource pool. When unexpected events occur, cross-area video surveillance data
can be easily checked.
The object service video surveillance cloud solution has the following features:
Stable, low-latency access: Write latency remains low and stable, meeting the requirements of continuous video writes and improving the end-user access experience.
High concurrent connections: Millions of video connections are supported with stable performance.
On-demand use: Storage resources can be dynamically allocated on demand as services grow.

3.3 HDFS Service Features


3.3.1 Decoupled Storage-Compute Big Data Solution
The decoupled storage-compute big data solution provides an efficient big data foundation with
decoupled storage and compute based on a highly scalable distributed architecture. The solution has
the following highlights:
 On-demand storage and compute configuration, protecting customer investments
The decoupled storage-compute big data solution organizes storage media, such as HDDs and
SSDs, into large-scale storage pools using distributed technologies. It decouples storage
resources from compute resources, achieving flexible storage and compute resource
configuration, on-demand capacity expansion, investment reduction, and customer investment
protection. Because storage resources are decoupled from compute resources, data is separated
from compute clusters. This enables fast capacity expansion and reduction for compute clusters
without data migration and flexible allocation of compute resources.
 Multi-tenancy, helping you build unified storage resource pools
The decoupled storage-compute big data solution allows multiple namespaces to connect to
multiple compute clusters. Authentication is isolated among compute clusters and each compute
cluster is authenticated with its namespaces in a unified manner. Storage resource pools are fully
utilized through logical data isolation among namespaces, flexible space allocation, and storage
capability sharing. The solution provides multiple storage capabilities using storage tiering
policies. Hot, warm, and cold data is online in real time, and applications are unaware of data
flow.
 Distributed data and metadata management, elastically and effectively meeting future data
access requirements
The decoupled storage-compute big data solution adopts a fully distributed architecture. It
enables a linear growth in system capacity and performance by increasing storage nodes,
requiring no complex resource requirement plans. It can be easily expanded to contain thousands
of nodes and provide EB-level storage capacity, meeting storage demands of fast-growing
services. The native HDFS uses active and standby NameNodes and a single NameNode only
supports a maximum of 100 million files. Different from the native HDFS, the decoupled
storage-compute big data solution adopts a fully distributed NameNode mechanism, enabling a
single namespace to support ten billion files and the whole cluster to support trillions of files.
 Full compatibility between EC and native HDFS semantics, helping you migrate services
smoothly
The native HDFS EC does not support interfaces such as append, truncate, hflush, and fsync.
Different from the native HDFS EC, EC adopted in the decoupled storage-compute big data
solution is fully compatible with native HDFS semantics, facilitating smooth service migration
and supporting a wide range of Huawei and third-party big data platforms. The solution even
supports the 22+2 large-ratio EC scheme with a utilization rate of 91.6%, significantly higher
than the utilization achieved by using the native HDFS EC and three-copy mechanism. This
reduces investment costs.
 Enterprise-grade reliability, ensuring service and data security
The decoupled storage-compute big data solution provides a reconstruction speed of 2 TB/hour,
preventing data loss caused by subsequent faults. The solution supports faulty and sub-healthy
disk identification and fault tolerance processing, token-based flow control, as well as silent data
corruption check, ensuring service and data security with enterprise-grade reliability.

3.3.2 SmartTier
Data stored in big data platforms can be classified into cold and hot data. For example, charging data
records (CDRs) and network access logs are frequently accessed by CDR query systems, accounting
systems, or customer behavior analysis systems on the day or in the month when they are generated
and thus become hot data. However, such data will be accessed less frequently or even no longer
accessed in the next month and thus become cold data. Assume that network access logs need to be
stored for 12 months and the logs are frequently accessed only in one month. Storing all the logs on
high-performance media is costly and storing them on low-performance media affects service
performance.
To address this, the HDFS service provides SmartTier, a storage tiering feature, to store hot and cold
data on different tiers. Hot data is stored on high-performance SSDs to ensure service performance
and cold data is stored on SATA disks to reduce costs, providing high energy efficiency.
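A minimal sketch of the kind of age-based classification described above, assuming a simple rule that data not accessed within a configurable number of days is treated as cold (illustrative only; the tier names and window are assumptions):

# Illustrative age-based tiering decision: recent data goes to SSD, old data to SATA.
from datetime import datetime, timedelta

HOT_WINDOW_DAYS = 30   # assumption: data accessed within the last month is "hot"

def choose_tier(last_access: datetime, now: datetime = None) -> str:
    now = now or datetime.now()
    if now - last_access <= timedelta(days=HOT_WINDOW_DAYS):
        return "ssd-tier"    # hot data: high-performance media
    return "sata-tier"       # cold data: high-capacity, low-cost media

print(choose_tier(datetime.now() - timedelta(days=3)))    # ssd-tier
print(choose_tier(datetime.now() - timedelta(days=120)))  # sata-tier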

Figure 3-1 Principles


 General principles:
− The compute layer is decoupled from data and is unaware of flows of hot, warm, and cold
data.
− The lifecycle management layer configures and manages data tiering.
− The resource pool management layer manages resource pools with different storage media.
 Key features:
− Data tiering: You can customize policies to write data into different resource pools. For
example, you can write data of one directory to a hot resource pool and that of another
directory to a warm resource pool.
− Data migration: You can customize migration policies (such as by creation time) to
automatically migrate data among hot, warm, and cold resource pools.
− Distribution query: You can query, manage, and monitor the space usage of hot, warm,
and cold resource pools.
 Advantages:
Many storage tiering solutions require data conversion between different protocols and data
migration at the application layer. Even cold data needs to be migrated before being analyzed.
Unlike such solutions, the HDFS service provides the following advantages:
− Consistent EC utilization: EC is used for hot, warm, and cold storage, improving storage utilization from 33% up to 91.67%. This utilization can be achieved regardless of file size.
− Unified HDFS semantics: Data is online in real time and EC is fully compatible with native
HDFS semantics, requiring no S3 storage to reduce costs.
− Unified namespace: Data migration does not require directory changes and applications are
unaware of the migration.
− Unified storage resource pool:
 Manages hundreds of billions of files.
 Achieves optimal utilization and performance for both large and small files.

3.3.3 Quota and Resource Statistics


The HDFS service supports file system capacity quotas as well as resource statistics. The following
figure shows a capacity quota example where employees in company departments represent file
systems. You can set a 2 TB quota for employee a (file system 1) and a 10 TB quota for employee b
(file system 2). The quotas are modifiable at any time.

Figure 3-1 Quota in the HDFS service


The capacity quota function in the HDFS service has the following characteristics:
File system capacity quota: specifies the maximum size of a file system. When the file system size
reaches the specified upper limit, new data cannot be written into the file system.
The HDFS service can collect statistics on file system resources, such as file quantity and capacity:
File system resource statistics: includes file system sizes and the number of files in the file systems.
Users can query their own file system resources.

3.3.4 Application Scenarios of the HDFS Service Solution


The HDFS service provides standard HDFS interfaces to interconnect with mainstream big data
application platforms in the industry. It supports interconnection with big data products from
mainstream vendors, such as Huawei FusionInsight, Cloudera, and Hortonworks.
On-demand configuration of storage and computing resources: The ratio of storage resources to computing resources can be flexibly configured.
Extensive compatibility: The solution adopts an EC mechanism that is fully compatible with native HDFS semantics, whereas the native HDFS EC does not support the append and truncate interfaces.
Distributed NameNode: provides a global namespace and supports massive numbers of files.
Efficient EC: delivers higher EC performance than native HDFS 3.0.
High reliability: Fast reconstruction prevents data loss caused by subsequent faults. Fault tolerance for sub-healthy disks, token-based flow control, and silent data corruption checks are supported.

3.4 File Service Features


3.4.1 InfoEqualizer
InfoEqualizer is a load balancing feature that intelligently distributes client access requests to storage
nodes to improve service performance and reliability. Before InfoEqualizer is used, all clients access a
node using the static IP address of the node, and the client connection load cannot be detected or
evenly distributed across nodes. As a result, nodes that have a heavy client connection load are prone
to performance bottlenecks.
 Domain Name Resolution
Each node has a physically bound static front-end service IP address and a dynamic front-end
service IP address that can float to another node upon node failure. Domain names that can be
used to send access requests to a group of nodes (in a zone) include a static domain name and a
dynamic domain name. These domain names are resolved to obtain static front-end service IP
addresses and dynamic front-end service IP addresses, respectively.
 Zone
Zones are used to separate nodes that provide different services. A zone consists of nodes with
the same static and dynamic domain names and the same load balancing policy. The system has a
default zone root. This zone contains all system nodes. You can create zones and add nodes to
them based on network requirements.
Zones enable load balancing policies to be flexibly configured for nodes carrying different types
of client services, improving service performance. In addition, each client can access a specific
zone for fault isolation. The load balancing policies are set based on node domains. The
following policies are supported: round robin, CPU usage, number of node connections, node
throughput, and comprehensive node load.
 Subnet
In some scenarios, clients residing on different subnets need to be connected to nodes whose IP
addresses reside on the same subnet. In this situation, you can define a subnet that contains one or
multiple zones, configure a DNS service IP address for this subnet, and reassign a static service
IP address for all nodes on this subnet and a dynamic IP address pool for this subnet.
A subnet can use either IPv4 or IPv6, and only one IP address type is allowed per subnet. After a subnet is created, its protocol cannot be changed. Subnets can be created with or without VLANs configured.
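As an illustration of the per-zone load balancing policies listed above (round robin, CPU usage, number of node connections, node throughput, and comprehensive node load), the sketch below picks a node for a new client connection. The weighting used for the comprehensive policy is an assumption; this is not product code.

# Illustrative node selection for a zone under different load balancing policies.
import itertools

class Node:
    def __init__(self, name, cpu, connections, throughput):
        self.name, self.cpu, self.connections, self.throughput = name, cpu, connections, throughput

def pick_node(nodes, policy, rr=itertools.count()):
    if policy == "round_robin":
        return nodes[next(rr) % len(nodes)]
    if policy == "cpu_usage":
        return min(nodes, key=lambda n: n.cpu)
    if policy == "connections":
        return min(nodes, key=lambda n: n.connections)
    if policy == "throughput":
        return min(nodes, key=lambda n: n.throughput)
    # "comprehensive" load: assumed equal weighting of normalized metrics.
    return min(nodes, key=lambda n: n.cpu / 100 + n.connections / 1000 + n.throughput / 10_000)

zone = [Node("node1", cpu=70, connections=300, throughput=4000),
        Node("node2", cpu=20, connections=800, throughput=2500)]
print(pick_node(zone, "cpu_usage").name)      # node2
print(pick_node(zone, "connections").name)    # node1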

3.4.2 InfoTier
InfoTier, also named dynamic storage tiering (DST), can store files on storage devices with different
performance levels according to file properties, and can automatically migrate files between devices.
InfoTier meets users' requirements on file processing speed and storage capacity, ensures optimized
space utilization, enhances access performance, and reduces deployment costs.
InfoTier focuses on the following file properties: file name, file path, file size, creation time,
modification time, last access time, owning user/user group, I/O count, I/O popularity, and SSD
acceleration.
 Storage Tier Composition
InfoTier enables files to be stored on different tiers based on file properties. A tier consists of one
or more node pools. A node pool consists of multiple nodes. A node pool is divided into multiple
disk pools. A partition is created for each disk pool. A node pool is the basic unit of a storage tier.
A node pool consists of multiple nodes. Nodes of different features form node pools of different
features, which are combined into tiers of different performance levels to implement classified
data management. After a node pool is successfully deployed, nodes in the node pool can no
longer be changed. If you want to change the node pool to which a node belongs, delete the node
from the node pool first and then add it into another node pool. After a node pool is created, it
can be migrated from one storage tier to another one without restriping data in the node pool.
Disks of all the nodes in each node pool form disk pools based on disk types. The disk pool
formed by SSDs is used to store data of small files if SSD acceleration is enabled. The disk pool
formed by HDDs is used to store data and metadata. Disk pool division is related to disk
configurations of nodes. In typical configurations, one SSD can be inserted into the first slot of
each node to be used by the underlying file system. No disk pool composed of SSDs is available.
To fully utilize the advantages in reading and writing small files, you need to configure SSDs in
slots 2 to N.
After system deployment, an administrator can set tiers based on service requirements and
specify the mappings between node pools and tiers. A default tier exists in the system. If no tier
is added, all node pools belong to the default tier. To leverage the advantages of InfoTier, you are
advised to configure multiple tiers and corresponding file pool policies.
You are advised to associate the node pool to which the nodes with high disk processing
capability and high response speed belong with the tier where the frequently accessed data is
stored. This accelerates the system's response to hotspot data and improves the overall storage
performance.
You are advised to associate the node pool to which the nodes with low response speed and large
storage capacity belong with the tier where the less frequently accessed data is stored. This fully
utilizes the advantages of different nodes and effectively reduces deployment and maintenance
costs.
It is recommended that one tier consist of node pools of the same type. Users can configure the
type of node pools in a tier based on site requirements.
 Restriping
Restriping means migrating data that has been stored to another tier or node pool.
The system periodically scans metadata and determines whether to restripe files that have been
stored in the storage system based on file pool policies. If files need to be restriped, the system
sends a restriping task.
Before restriping, the system determines whether the used space of the node pool in the target tier
reaches the read-only watermark. Restriping is implemented only when the used space is lower
than the read-only watermark. If the used spaces of all node pools in the target tier are higher
than the read-only watermark, restriping is stopped.
A restriping operation must not interrupt user access. If a user modifies data that is being
restriped, the restriping operation stops and rolls back. The data that has been restriped to the new
node pool is deleted and restriping will be performed later.
You can start another restriping operation only after the current one is complete. For example,
during the process of restriping a file from tier 1 to tier 2, if the file pool policy changes and the
file needs to be restriped to tier 3, you must wait until the restriping to tier 2 is complete and then
start the restriping to tier 3.
 Watermark Policy
InfoTier uses watermark policies to monitor the storage capacities in node pools. Based on the
available capacities in node pools, InfoTier determines where to store new data.
The watermark is the percentage of the used capacity in the available capacity of a disk in a node
pool. The available capacity of a disk is the same as the minimum capacity of the disk in a disk
pool. Watermarks include the high watermark and read-only watermark. A watermark enables
Scale-Out NAS to limit where a file is stored and restriped. In addition, you can set spillover for
a node pool to determine whether data can be written to other node pools when the read-only
watermark is reached.
 File Pool Policy
Administrators can create file pool policies to determine initial file storage locations and storage
tiers to which files are restriped.
Immediately after InfoTier is enabled, the storage matches a file pool policy and uses the file
pool policy to store and restripe files.
A file pool policy can be configured to be a combination of multiple parameters. A file pool
policy can be matched only when the file properties match all parameters of the file pool policy.
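The interaction between file pool policies and watermarks described above can be sketched as follows. The policy fields, pool names, and watermark values are assumptions made for illustration, not the product's configuration model.

# Illustrative placement decision: match a file pool policy, then honor the read-only watermark.

node_pools = {
    "ssd_pool":  {"used_pct": 62, "read_only_watermark": 90, "spillover_to": "sata_pool"},
    "sata_pool": {"used_pct": 71, "read_only_watermark": 95, "spillover_to": None},
}

file_pool_policies = [
    # A policy matches only if ALL of its conditions match the file's properties.
    {"conditions": {"suffix": ".mp4", "min_size": 100 * 1024 * 1024}, "target": "sata_pool"},
    {"conditions": {"suffix": ".idx"}, "target": "ssd_pool"},
]

def match_policy(name, size):
    for policy in file_pool_policies:
        cond = policy["conditions"]
        if "suffix" in cond and not name.endswith(cond["suffix"]):
            continue
        if "min_size" in cond and size < cond["min_size"]:
            continue
        return policy["target"]
    return "sata_pool"   # assumed default tier when no policy matches

def place(name, size):
    pool = match_policy(name, size)
    while pool is not None:
        info = node_pools[pool]
        if info["used_pct"] < info["read_only_watermark"]:
            return pool                      # below the read-only watermark: write here
        pool = info["spillover_to"]          # otherwise spill over if configured
    raise RuntimeError("all candidate node pools are read-only")

print(place("movie.mp4", 2 * 1024 ** 3))     # sata_pool
print(place("catalog.idx", 4 * 1024))        # ssd_pool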

3.4.3 InfoAllocator
InfoAllocator is a resource control technology that restricts the available resources (including storage space and file quantity) for a specified user or user group in a directory. Using the InfoAllocator feature, administrators can:
Plan storage space or file quantity for users or user groups properly.
Manage storage space or file quantity for users or user groups.
Make statistics on and check file quantity or storage space capacity consumed by users or user groups.
 Quota Types
Capacity quota: manages and monitors storage space usage.
File quantity quota: manages and monitors the number of files.
 Quota Modes
Calculate quota: only monitors storage capacity or file quantity.
Mandatory quota: monitors and controls storage capacity or file quantity.
1. Relationship between thresholds of a mandatory quota
Recommended threshold: When the used storage space or file quantity reaches the
recommended threshold, the storage system does not restrict writes but only reports an
alarm.
Soft threshold: When the used storage space or file quantity reaches the soft threshold, the
storage system generates an alarm but allows data writing before the grace period expires.
However, after the grace period expires, the system forbids data writes and reports an alarm.
You need to configure the soft threshold and grace period at the same time.
2. Effective thresholds in multi-quota applications
After you set a quota for a user and its owning user group or for a directory and its parent
directory, or set different types of quotas for a directory, the quotas are all valid and the
quota that reaches the hard threshold takes effect first.
For example, if quota A is configured for user group group1 and quota B is configured for user quota_user1, who belongs to group1, both quotas are effective; whichever threshold is reached first is the one that takes effect.
 Effective Quotas
Common quota: Quota of the specified directory that can be used by a specified user or quota that
can be used by all users in a user group for a specified directory. For example, if the hard
threshold of the common quota for a user group is 10 GB, the total space used by all users in the
user group is 10 GB.
Default quota: Quota that any user in a user group can use for a specified empty directory. The
default quota of the everyone user group applies to all users in the cluster that uses the directory.
For example, if the hard threshold of the default quota for a user group is 10 GB, the space used
by each user in the user group is 10 GB.
Associated quota: If a default quota is configured for a directory, when a user writes files to the
directory or creates a directory, the system automatically generates a new quota, which is called
an associated quota. The quota is associated with the default quota. The use of storage space or
file quantity is limited by the default quota.
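A minimal sketch of how the recommended and soft thresholds of a mandatory quota described above could behave, assuming writes are blocked only after the soft threshold's grace period expires (illustrative only; the class and threshold values are hypothetical):

# Illustrative mandatory-quota check with recommended and soft thresholds plus a grace period.
from datetime import datetime, timedelta

class MandatoryQuota:
    def __init__(self, recommended, soft, grace_days):
        self.recommended = recommended
        self.soft = soft
        self.grace = timedelta(days=grace_days)
        self.soft_exceeded_since = None

    def check_write(self, used, write_size, now=None):
        now = now or datetime.now()
        total = used + write_size
        if total >= self.soft:
            if self.soft_exceeded_since is None:
                self.soft_exceeded_since = now
            if now - self.soft_exceeded_since > self.grace:
                return "denied: soft threshold exceeded and grace period expired"
            return "allowed (alarm: soft threshold reached, grace period running)"
        self.soft_exceeded_since = None
        if total >= self.recommended:
            return "allowed (alarm: recommended threshold reached)"
        return "allowed"

GB = 1024 ** 3
quota = MandatoryQuota(recommended=8 * GB, soft=10 * GB, grace_days=7)
print(quota.check_write(used=7 * GB, write_size=2 * GB))   # recommended alarm
print(quota.check_write(used=9 * GB, write_size=2 * GB))   # soft alarm, grace period running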

3.4.4 InfoLocker
Definition of InfoLocker
InfoLocker is a Write Once Read Many (WORM) feature that allows a retention period to be set for files. During the retention period, the files can be read but cannot be modified or deleted. After the retention period expires, the files can be deleted but cannot be modified. This makes InfoLocker an essential feature for file archiving.
InfoLocker has the enterprise compliance mode and regulatory compliance mode. In enterprise
compliance mode, locked files can only be deleted by system administrators. In regulatory compliance
mode, no one can delete locked files.
There are four WORM file states, as described below:
 Unprotected state: A file in this state can be modified or deleted.
 Protected state: After the write permission for a file is disabled, the file enters the protected state. A file in this state can be read but cannot be deleted or modified. NOTE: The super administrator (admin) can perform privileged deletion of locked files.
 Appended state: After the write permission for an empty file in the protected state is enabled, the file enters the appended state. Data can be appended to a file in this state.
 Expired state: Files in this state cannot be modified, but they can be read and deleted and their properties can be viewed.

3.4.5 InfoStamper
InfoStamper is a directory-based snapshot function provided for scale-out file storage. It can create
snapshots for any directory (except the root directory) in a file system to provide precise on-demand
data protection for users. A single directory supports a maximum of 2048 snapshots while a system
supports a maximum of 8192 snapshots.
COW: Before protected snapshot data is changed, the original data is copied to another location or object and saved as snapshot data, and then the original data object is replaced with the new data. Scale-Out NAS implements COW for metadata because: 1. COW requires one read operation and two write operations per change. 2. Metadata receives few writes but many reads. 3. Metadata occupies little space.
ROW: Before protected snapshot data is changed, the new data is written to a new location or object without overwriting the original data. Scale-Out NAS uses ROW for file data: because the data volume is large, ROW reduces the impact on system performance.
In a file system, each file consists of metadata and data and each directory contains only metadata:
Metadata: defines data properties and includes dentries and index nodes. Dentries contain information
about file names, parent directories, and subdirectories and associate file names with inodes. Inodes
contain file size, creation time, access time, permissions, block locations, and other information.
Data: For a file, data is the content of the file. Scale-Out NAS divides data into stripes.

Figure 3-1 File system structure


Metadata COW refers to the process of copying the inodes and dentries of a node after a snapshot is
taken and before the metadata is modified for the first time. In this way, the snapshot version
corresponding to the metadata is generated.
Metadata can be modified in the following scenarios: file (including soft links) or directory creation,
file (including soft links) or directory deletion, attribute modification, and renaming.
The ROW process is as follows:

Figure 3-2 ROW process


After stripes stripe3 and stripe4 of File3 are changed, new data is written to the File3-1 object.
Stripe3 and stripe4 in File3-0 become snapshot data.
Other stripes are shared by File3-1 and File3-0.
COW is implemented for metadata of the original file to generate metadata of the snapshot, and the
metadata of the original file is updated.
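The ROW behavior on stripes can be modeled briefly. The object naming (File3-0 for the snapshot version, File3-1 for new data) follows the figure above, while the data structures are purely illustrative.

# Illustrative redirect-on-write (ROW) for file stripes after a snapshot.

class SnapshottedFile:
    def __init__(self, name, stripes):
        self.name = name
        self.current = dict(stripes)     # live view, e.g. "File3-1" after the first change
        self.snapshot = dict(stripes)    # frozen view, e.g. "File3-0"

    def write_stripe(self, index, data):
        # ROW: new data goes to a new object; the snapshot keeps the original stripe untouched.
        self.current[index] = data

f = SnapshottedFile("File3", {1: "s1", 2: "s2", 3: "s3", 4: "s4"})
f.write_stripe(3, "s3'")
f.write_stripe(4, "s4'")
print(f.current)    # stripes 3 and 4 redirected to new data; 1 and 2 still shared
print(f.snapshot)   # original stripes 3 and 4 preserved as snapshot data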

3.4.6 InfoScanner
InfoScanner is an antivirus feature. Scale-Out NAS provides Huawei Antivirus Agent and
interconnects with third-party antivirus software installed on external antivirus servers, thereby
protecting shared directories from virus attacks. The third-party antivirus software accesses shared
directories using the CIFS protocol and scans files in the directories for viruses (in real time or
periodically). If viruses are detected, the third-party antivirus software kills the viruses based on the
configured antivirus policy, providing continuous protection for data in storage.
With InfoScanner:
The antivirus software is installed on the antivirus proxy server.
The antivirus server reads files from CIFS shares for virus scanning and isolation.

3.4.7 InfoReplicator
InfoReplicator provides the directory-level asynchronous remote replication function for Scale-Out
NAS. Folders or files can be periodically or manually replicated between directories in different
storage systems through IP links over a local area network (LAN) or wide area network (WAN),
saving a data duplicate of the local cluster to the remote cluster.
In InfoReplicator, a remote replication pair is a replication relationship that specifies data replication
source, destination, frequency, and other rules.
Synchronization is an operation that copies the data of the primary directory to the secondary
directory. Replication synchronizes data based on pairs to maintain data consistency between the
primary and secondary directories.
InfoReplicator supports two types of synchronization:
Full synchronization: copies all data of the primary directory to the secondary directory.
Incremental synchronization: copies only the data that has changed in the primary directory since the
beginning of the last synchronization.
InfoReplicator allows you to split a pair to suspend replication between directories in the pair. If you
want to resume synchronization between the directories in a split pair to keep directory data
consistent, manually start synchronization for the pair again. By so doing, the suspended
synchronization resumes instead of starting at the beginning. This is called resumable data
transmission.
When data is replicated for the first time from the primary directory to the secondary directory in a
replication pair, the storage system automatically creates a snapshot for the primary directory at the
replication point in time. When data is replicated from the primary directory to the secondary
directory again, the storage system creates a snapshot for the primary directory, compares it with the
last one, analyzes differences between the two snapshots, and synchronizes the changed data to the
secondary directory. In this way, the storage system can easily locate the changed data without the
need to traverse all its directories, improving data synchronization efficiency.
Full synchronization refers to the process of fully replicating data from the primary directory to the
secondary directory. The initial synchronization of a pair uses the full synchronization mode.
Incremental synchronization indicates that only the incremental data that is changed after the previous
synchronization is complete and before the current synchronization is started is copied to the
secondary directory. After the initial full synchronization, each synchronization is in incremental
synchronization mode.
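Incremental synchronization based on comparing two directory snapshots can be sketched as a simple dictionary comparison. This is illustrative only: real snapshots diff on-disk metadata rather than in-memory dictionaries, and the fingerprint values here are invented.

# Illustrative snapshot comparison for incremental replication: copy only changed or new files.

def diff_snapshots(previous: dict, current: dict) -> dict:
    """Snapshots map relative file paths to content fingerprints (e.g. checksums)."""
    changed = {path: fp for path, fp in current.items() if previous.get(path) != fp}
    deleted = [path for path in previous if path not in current]
    return {"changed": changed, "deleted": deleted}

snap_t0 = {"dir/a.txt": "c1", "dir/b.txt": "c2"}
snap_t1 = {"dir/a.txt": "c1", "dir/b.txt": "c9", "dir/c.txt": "c3"}

delta = diff_snapshots(snap_t0, snap_t1)
print(delta["changed"])   # only b.txt (modified) and c.txt (new) are sent to the secondary
print(delta["deleted"])   # deletions are replayed on the secondary directory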
A replication zone is a collection of nodes that participate in remote replication. You can add nodes to
the replication zone by specifying front-end service IP addresses of the nodes in the replication zone.
Replication channel refers to a group of replication links between the replication zones of the primary
and secondary Scale-Out NAS systems.

You can create only one replication channel between two storage systems. This channel is shared by
all pairs that are used to replicate data between the two storage systems. This channel is also used to
authenticate and control traffic for all replication links between the two storage systems.

3.4.8 InfoRevive
InfoRevive is used to provide error tolerance for video surveillance systems. By using this feature,
when the number of faulty nodes or disks exceeds the upper limit, some video data can still be read
and new video data can still be written, protecting user data and improving the continuity of video
surveillance services.
InfoRevive supports the following operation modes:
 Read Fault Tolerance Mode
When the number of faulty nodes or disks exceeds the upper limit, the system can still read part
of damaged video file data, enhancing data availability and security. This mode applies to the
scenario where video surveillance data has been written to the storage system and only read
operations are required.
 Read and Write Fault Tolerance Mode
When the number of faulty nodes or disks exceeds the upper limit, the system can still read and
write part of damaged video file data, enhancing service continuity and availability. This mode
applies to the scenario where video surveillance data is not completely written or new
surveillance data needs to be written, and both write and read operations are required.
Assume that the faulty disks (in gray) are data disks, the 4+1 data protection level is used, and two data fragments in a stripe are damaged. With InfoRevive enabled, a read operation returns the three fragments that are read successfully and pads the two unreadable fragments with zeros before returning them. For a write operation, only three fragments report write success; no zeros are added for the two fragments that fail to be written, and the stripe write is still treated as successful.
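The zero-padding read behavior in the 4+1 example above can be sketched as follows (illustrative only; the fragment size and contents are invented):

# Illustrative fault-tolerant read of a 4+1 stripe when two fragments are unreadable.

STRIP_SIZE = 4  # bytes per fragment in this toy example

def fault_tolerant_read(fragments):
    """fragments: list of bytes, or None for a fragment that failed to read."""
    result = []
    for frag in fragments:
        if frag is None:
            result.append(b"\x00" * STRIP_SIZE)   # pad unreadable fragments with zeros
        else:
            result.append(frag)
    return b"".join(result)

# Fragments 2 and 4 are on faulty disks and cannot be read.
stripe = [b"AAAA", None, b"CCCC", None]           # the four data fragments of a 4+1 stripe
print(fault_tolerant_read(stripe))                # b'AAAA\x00\x00\x00\x00CCCC\x00\x00\x00\x00'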

Figure 3-1 An example of fault-tolerance read and write



3.4.9 InfoTurbo
InfoTurbo is a performance acceleration feature that supports intelligent prefetch, SMB3 Multichannel, and NFS protocol enhancement.
Intelligent prefetch provides a higher cache hit ratio for users in media assets scenarios. In latency-
sensitive scenarios, performance can be greatly improved.
The SMB3 Multichannel function greatly improves service performance and reliability. In addition, if
one channel fails, it transmits data over another channel to prevent services from being affected.
In CIFS file sharing scenarios, if a client that uses SMB 3.0 (delivered with Windows 8 and Windows
Server 2012 by default) is equipped with two or more GE/10GE/IB network ports of the same type or
with one GE/10GE/IB network port that supports Receive-Side Scaling (RSS), the client will set up
multiple channels with Scale-Out NAS. By bringing multi-core CPUs and bandwidth resources of
clients into full play, SMB3 Multichannel greatly improves service performance. In addition, after one
channel fails, SMB3 Multichannel transmits data over another channel, thereby improving service
reliability.
The NFS protocol enhancement feature is a performance acceleration feature provided by Scale-Out
NAS. By configuring multiple network ports and installing NFS protocol optimization plug-in
DFSClient on a client, concurrent connections can be established between the client and Scale-Out
NAS, thereby increasing the access bandwidth. Cache optimization is enabled for Mac OS X clients to
further improve access performance to adapt to 4K video editing in media assets scenarios.

3.4.10 Application Scenarios of the File Service


A media asset management system is the core of a television program production system and is used
to upload, download, catalog, retrieve, transcode, archive, store, invoke, and manage media materials.
The media asset management system involves the storage, invoking, and management of TV
programs. The features and requirements of the system for the storage system are as follows:
Media materials have a high bit rate and are large in size. The capacity of the storage systems must be
large and easy to be expanded.
Acquisition, editing, and synthesis of audios and videos require stable and low-latency storage
systems.
Concurrent editing requires storage systems that can deliver reliable and easy-to-use data sharing.
Small files are frequently processed in video rendering and special effect processing. This requires
storage systems to deliver high data read and write performance.
Scale-Out NAS in a media asset management system has the following features:
Supports high-speed data sharing to improve program production efficiency.
Supports on-demand and online capacity expansion.
Supports dynamic storage tiering to meet high performance requirements and reduce device purchase
costs.
Adopts an energy saving design to reduce the OPEX.
It is a centralized storage solution for ingesting, producing, and distributing media materials. It meets
the media industry's requirements for storage capacity, performance, stability, and scalability.

4 Storage Design and Implementation

4.1 Storage Planning and Design


4.1.1 Process
4.1.1.1 Concepts
Planning is a phase of an integration project that usually includes strategy and design.
A strategy specifies design principles, such as business objectives, development requirements, and
technology selection. A good design usually begins with clear and explicit goals, requirements,
assumptions, and restrictions. This information can make it easy for a design project to be
implemented without problems. It is often necessary to balance the best practices of technology and
the ultimate goals and requirements of the organization. Even a perfect technical design cannot be
delivered if it does not meet the requirements of the organization.
The design should be easily deployed, used, and understood, involving name standardization, port
group network switching standards, storage standards, and so on. Defining too complex standards will
increase management and maintenance costs in deployment and future use. Generally, manageability,
high availability, scalability, and data security must be considered during design. It is also necessary
to control cost and reduce it as much as is reasonably possible. Usability and scalability must also be considered during design: a scalable environment reduces expansion costs as the organization grows. Always consider data security; for users, no data security incident is acceptable. Design according to the user's service levels, and for important services, ensure high availability in the event of hardware failure and during maintenance.
4.1.1.2 Basic Phases
 Survey
The survey phase includes determining the project scope and collecting design data.
Understanding customer needs and project details and identifying key stakeholders are key
success factors in providing designs that meet customer needs. In this phase, the project scope
and business objectives are determined.
The most effective way to get information before design is to listen as much as possible. In the
early discussions with the business unit, objectives, restrictions and risks of all projects should be
covered.
If objectives, requirements, or restrictions cannot be met or risks are generated, discuss with key
personnel of the business unit about the problems and recommended solutions as soon as possible
to avoid project delay.
When making analysis and other design choices with business units, consider the performance,
availability, scalability, and other factors that may affect the project. Also consider how to reduce
risks during design, and discuss the costs brought thereof.
Record all discussion points during design for future acceptance.
A good design must meet several final goals. The final goals are influenced by some strategic
principles of the organization.
 Conceptual design
Conceptual design emphasizes content simplification, differentiates priorities, and focuses on long-term and overall benefits.
Conceptual design is to determine service objectives and requirements of a project. It determines
the entities affected by the project, such as business units, users, applications, processes,
management methods, and physical machines. It also determines how project objectives,
requirements, and constraints apply to each entity. A system architecture that meets both service
objectives and restrictions must be output. For example, the availability, scalability, performance,
security, and manageability must meet the requirements, the cost can be controlled, and other
restrictions are met.
 High level design (HLD)
HLD is to output the network scope and function deployment solution based on the network
architecture. It includes the interaction process of important services within the scope of the
contract, function allocation of all services on each NE, peripheral interconnection relationships
of NEs, interconnection requirements and rules, and quantitative calculation of interconnection
resources.
 Low level design (LLD)
Based on the network topology, signaling route, and service implementation solution output in
the network planning phase, LLD refines the design content, provides data configuration
principles, and guides data planning activities. In this way, the storage solution can be
successfully implemented and meet customer requirements. LLD includes detailed hardware
information and implementation information. For example:
− Distribution of data centers and layout of equipment rooms
− Server quantity, server type, service role, network configuration, and account information
− Storage hardware quantity and type, RAID name and level, and disk size and type
− Network topology, switch configuration, router configuration, access control, and security
policy
4.1.1.3 Implementation
Information collection: Collect service information (such as network topology and host information)
and detailed storage requirements (involving disk domains and storage pools) to know the service
running status and predict the service growth trend.
Requirement analysis: Based on the collected information and specific requirements, analyze storage
capacity, IOPS, storage pool, management, and advanced functions, examine the feasibility of the
implementation solution, and determine the implementation roadmap of key requirements.
Compatibility check: Check the compatibility based on the host operating system version, host
multipathing information, host application system information, Huawei storage version, storage
software, and multipathing software information provided by the customer.
LLD planning and design: Output the LLD solution and document.
Advanced function design: Plan and design purchased advanced storage functions based on customer
requirements.
LLD document submission: Submit the LLD document to the customer for review. Modify the
document based on the customer's comments.
HLD: HLD is to output the network scope and function deployment solution based on the network
architecture. It includes the interaction process of important services within the scope of the contract,
function allocation of all services on each NE, peripheral interconnection relationships of NEs,
interconnection requirements and rules, and quantitative calculation of interconnection resources.
LLD: LLD is the activities of further detailing the design content, providing data configuration rules,
and guiding data planning based on the network topology solution, signaling channel routing solution,
and service implementation solution developed at the network planning phase.

4.1.2 Content
4.1.2.1 Project Information
 Project information collection
Project information collection is the first step of planning and design and the basis for subsequent
activities. Comprehensive, timely, and accurate identification, filtering, and collection of raw
data are necessary for ensuring information correctness and effectiveness. Storage project
information to be collected involves live network devices, network topology, and service
information.
It also includes the schedule, project delivery time, and key time points. In the schedule, we need
to clarify the time needed to complete the work specified in the delivery scope of a certain phase,
tasks related to the delivery planned in each time period, and milestones as well as time points of
events.
Customer requirement collection: Collect information about the customer's current pain points of
services, whether the storage product (involving storage capacity and concurrency) meets the
service growth requirements, and system expansion analysis for the future.
 Requirement analysis
Availability: indicates the probability and duration of normal system running during a certain
period. It is a comprehensive feature that measures the reliability, maintainability, and
maintenance support of the system.
Manageability:
− Integrated console: integrates the management functions of multiple devices and systems
and provides end-to-end integrated management tools to simplify administrator operations.
− Remote management: manages the system over the network from a remote console, so devices and systems do not need to be managed by on-site personnel.
− Traceability: ensures that the management operation history and important events can be
recorded.
− Automation: The event-driven mode is used to implement automatic fault diagnosis,
periodic and automatic system check, and alarm message sending when the threshold is
exceeded.
Performance: Indicators of a physical system are designed based on the Service Level Agreement
(SLA) for the overall system and different users. Performance design includes not only
performance indicators required by normal services, but also performance requirements in
abnormal cases, such as the burst peak performance, fault recovery performance, and DR
switchover performance.
Security: Security design must provide all-round security protection for the entire system. The
following aspects must be included: physical layer security, network security, host security,
application security, virtualization security, user security, security management, and security
service. Multiple security protection and management measures are required to form a
hierarchical security design.
Cost: Cost is always an important factor. A good design should always focus on the total cost of
ownership (TCO). When calculating the TCO, consider all associated costs, including the
purchase cost, installation cost, energy cost, upgrade cost, migration cost, service cost,
breakdown cost, security cost, risk cost, reclamation cost, and handling cost. The cost and other
design principles need to be coordinated based on balance principles and best practices.
4.1.2.2 Hardware Planning
Storage device selection: Consider the following aspects: capacity, throughput, and IOPS. Different
scenarios have different requirements. The cost must be considered during the evaluation. If multiple
types of disks can meet the performance requirements, select the most cost-effective one.

Disk type: A disk type in a disk domain corresponds to a storage tier of a storage pool. If the disk
domain does not have a specific disk type, the corresponding storage tier cannot be created for a
storage pool.
Nominal capacity: Disk vendors and operating systems define capacity differently. As a result, the nominal capacity of a disk differs from the capacity displayed in the operating system.
Disk capacity defined by disk manufacturers: 1 GB = 1,000 MB, 1 MB = 1,000 KB, 1 KB = 1,000 bytes
Disk capacity calculated by operating systems: 1 GB = 1,024 MB, 1 MB = 1,024 KB, 1 KB = 1,024
bytes
Hot spare capacity: The storage system provides hot spare space to take over data from failed member
disks.
RAID usage: indicates the capacity used by parity data at different RAID levels.
Disk bandwidth performance: The total bandwidth provided by the back-end disks of a storage device is the sum of the bandwidth provided by all disks. Use the minimum (most conservative) per-disk bandwidth value when estimating this figure during device selection.
RAID level: A number of RAID levels have been developed, but just a few of them are still in use.
I/O characteristics: Write operations consume more disk resources than read operations. The read/write ratio describes the proportion of read and write requests, and the disk flushing ratio indicates the proportion of disk flushing operations performed when the system responds to read/write requests.
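The back-of-the-envelope sketch below ties the factors above together: nominal versus binary capacity, hot spare and RAID parity overhead, aggregate disk bandwidth, and the back-end IOPS implied by a read/write ratio. The parity fraction and write penalty shown are common rules of thumb, and every input figure is an assumption for illustration only.

```python
# Rough sizing sketch using common rules of thumb; all inputs are illustrative assumptions.

def nominal_to_binary_tib(nominal_tb: float) -> float:
    """Convert vendor-rated TB (10^12 bytes) to TiB (2^40 bytes) as reported by the OS."""
    return nominal_tb * 10**12 / 2**40

def usable_capacity_tib(disk_count: int, disk_nominal_tb: float,
                        hot_spare_disks: int, parity_fraction: float) -> float:
    """Capacity left after reserving hot spare space and RAID parity."""
    data_disks = disk_count - hot_spare_disks
    raw_tib = data_disks * nominal_to_binary_tib(disk_nominal_tb)
    return raw_tib * (1 - parity_fraction)

def backend_iops(frontend_iops: float, read_ratio: float, write_penalty: int) -> float:
    """Back-end IOPS needed for a given front-end load and read/write mix."""
    reads = frontend_iops * read_ratio
    writes = frontend_iops * (1 - read_ratio)
    return reads + writes * write_penalty

if __name__ == "__main__":
    # Example: 25 x 7.68 TB SSDs, one disk's worth of hot spare space, RAID 6 (8+2) parity overhead.
    print(f"Usable capacity: {usable_capacity_tib(25, 7.68, 1, 2/10):.1f} TiB")
    # Example: aggregate bandwidth, assuming a conservative 500 MB/s per SSD.
    print(f"Aggregate bandwidth: {25 * 500 / 1000:.1f} GB/s")
    # Example: 50,000 front-end IOPS, 70% reads, write penalty 6 (a common RAID 6 rule of thumb).
    print(f"Back-end IOPS: {backend_iops(50_000, 0.7, 6):,.0f}")
```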
Compatibility check: Use the Huawei Storage Interoperability Navigator to query the compatibility
between storage systems and application servers, switches, and cluster software, and evaluate whether
the live network environment meets the storage compatibility requirements.
4.1.2.3 Network Planning
 Flash storage
Direct-connection network: An application server is connected to different controllers of a
storage system to form two paths for redundancy. The path between the application server and
the owning controller of LUNs is the optimal path and the other path is a standby path.
Single-switch network:
Switches increase the number of ports to allow more access paths. Moreover, switches extend the
transmission distance by connecting remote application servers to the storage system. As only
one switch is available in this mode, a single point of failure may occur. There are four paths
between the application server and storage system. The two paths between the application server
and the owning controller of LUNs are the optimal paths, and the other two paths are standby
paths. In normal cases, the two optimal paths are used for data transmission. If one optimal path
is faulty, UltraPath selects the other optimal path for data transmission. If both optimal paths are
faulty, UltraPath uses the two standby paths for data transmission. After an optimal path
recovers, UltraPath switches data transmission back to the optimal path again.
Dual-switch network:
With two switches, single points of failure are prevented, improving network reliability. There are four paths between the application server and the storage system, and UltraPath works in the same way as in the single-switch networking environment (a simplified path-selection sketch is provided at the end of this network planning section).
 Distributed storage
Management plane: interconnects with the customer's management network for system
management and maintenance.
BMC plane: connects to management ports of management or storage nodes to enable remote
device management.
Storage plane: An internal plane used for service data communication among all nodes in the
storage system.
Service plane: interconnects with customers' applications and accesses storage devices through
standard protocols such as iSCSI and HDFS.
Replication plane: enables data synchronization and replication among replication nodes.
Arbitration plane: communicates with the HyperMetro quorum server. This plane is planned
when the HyperMetro function is planned for the block service.
Network port and VLAN planning:
On the firewall, allow the following ports: TCP ports (FTP (20), SSH (22), and iSCSI (3260)),
upper-layer network management port (5989), DeviceManager or CLI device connection port
(8080), DeviceManager service management port (8088), iSNS service port (24924), and UDP
port (SNMP (161)).
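As a quick sanity check after configuring the firewall, a maintenance terminal can probe the TCP ports listed above. The sketch below is illustrative only: the target management IP address is an assumed placeholder, and UDP port 161 (SNMP) cannot be verified with a plain TCP connect.

```python
# Illustrative TCP reachability check for the ports listed above.
# The target address is a placeholder; UDP 161 (SNMP) is not covered by a TCP connect test.
import socket

TCP_PORTS = {
    20: "FTP", 22: "SSH", 3260: "iSCSI", 5989: "network management",
    8080: "DeviceManager/CLI connection", 8088: "DeviceManager management", 24924: "iSNS",
}

def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    target = "192.168.128.101"  # assumed management IP address
    for port, name in TCP_PORTS.items():
        state = "open" if check_tcp_port(target, port) else "blocked/closed"
        print(f"{name:<32} TCP {port:<6} {state}")
```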
This example describes switch port planning when six nodes are deployed on a 10GE network,
and the service, storage, and management switches are deployed independently.
M-LAG implements link aggregation among multiple devices. In a dual-active system, one
device is connected to two devices through M-LAG to achieve device-level link reliability.
When the management network uses independent switches, the BMC switch and management
switch can be used independently or together.
The slides describe switch port planning when service and storage switches are used
independently and management and BMC switches are used independently.
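As referenced in the flash storage networking description above, the sketch below illustrates the failover behavior of multipathing: optimal paths (to the LUN's owning controller) are preferred, standby paths are used only when no optimal path is healthy, and traffic fails back once an optimal path recovers. This is a simplified illustration only, not UltraPath's actual implementation.

```python
# Simplified illustration of optimal/standby path selection; not the actual UltraPath implementation.
from dataclasses import dataclass

@dataclass
class Path:
    name: str
    optimal: bool      # True if the path leads to the LUN's owning controller
    healthy: bool = True

def select_paths(paths: list[Path]) -> list[Path]:
    """Prefer healthy optimal paths; fall back to healthy standby paths only if none are available."""
    optimal = [p for p in paths if p.optimal and p.healthy]
    if optimal:
        return optimal
    return [p for p in paths if p.healthy]

if __name__ == "__main__":
    paths = [Path("A-1", optimal=True), Path("A-2", optimal=True),
             Path("B-1", optimal=False), Path("B-2", optimal=False)]
    print([p.name for p in select_paths(paths)])   # both optimal paths in use
    paths[0].healthy = paths[1].healthy = False    # both optimal paths fail
    print([p.name for p in select_paths(paths)])   # traffic moves to the standby paths
    paths[0].healthy = True                        # one optimal path recovers
    print([p.name for p in select_paths(paths)])   # traffic fails back to the optimal path
```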
4.1.2.4 Service Planning
Block service planning: Disk domain planning is applicable during hybrid flash storage service
planning. For details, see the product documentation. Disk domain planning, disk read/write policy
planning, and iSCSI CHAP planning are optional. Disk domain planning does not involve space
allocation but the number of disks and disk types. The disk domain space size depends on the number
of disks. A disk domain is a collection of disks. Disks in different disk domains are physically
isolated. In this way, faults and storage resources of different disk domains can be isolated.
File service planning: Disk domain planning and user authentication planning are optional. User
permission: Users with the full control permission can not only read and write directories but also
have permissions to modify directories and obtain all permissions of directories. Users with the
forbidden permission can view only shared directories and cannot perform operations on any
directory. File systems can be shared using NFS, CIFS, FTP, and HTTP protocols.
4.1.3 Tools
4.1.3.1 eService LLDesigner
Service engineers spend a lot of time on project planning and design, device installation, and device configuration during project delivery. How can this work be made more efficient?
LLDesigner: provides functions such as hardware configuration, device networking, and resource
allocation to quickly complete product planning and design.
LLDesigner supports free creation, creation by importing a BOQ, and creation by using a template. It
outputs the LLD document and configuration files. LLDesigner provides wizard-based, visualized,
standardized, and automated services.
4.1.3.2 Other Tools
 Networking Assistant
Click the Networking Assistant, select a product model and configuration mode, and generate the networking diagram.
 Energy consumption calculation
Enter the power calculator page, select a product and component type, and view the result.
4.2 Storage Installation and Deployment
4.2.1 Flash Storage Installation and Deployment
4.2.1.1 System Installation
The storage system installation process consists of seven phases: installation preparation, device
installation, cable connection, hardware installation check, power-on, storage system initialization,
and remote O&M environment setup.
4.2.1.1.1 Preparing for Installation
Before installing the storage system, unpack the goods and check the installation environment, tools,
and materials.
 Checking auxiliary installation tools and materials
Before installing storage devices, ensure that necessary auxiliary materials are available,
including installation tools, meters, software tools, and documents.
 Checking the installation environment
Check the installation environment to ensure successful installation and proper running of
devices.
Environmental requirements: include temperature, humidity, altitude, vibration, shock, particle
contaminants, corrosive airborne contaminants, heat dissipation, and noise.
Power supply requirements: provide guidelines for best practices on power supply configuration.
Fire fighting requirements: To secure an equipment room, ensure that it has an effective fire fighting system.
 Unpacking and checking
When the equipment arrives, project supervisors and customer representatives must unpack the
equipment together and ensure that the equipment is intact and the quantity is correct.
Before unpacking goods, find the packing list and check goods based on it.
If the number of packages is incorrect or any carton is seriously damaged or soaked, stop unpacking, find out the cause, and report the situation to the Huawei local office.
If the number of packages is correct and the cartons are in good condition, unpack the equipment.
Inspect all components in each package against the packing list.
Check the packages of components for obvious damage.
Unpack the goods and check whether the components are intact and whether any are missing.
If any component is missing or damaged, contact the Huawei local office immediately.
4.2.1.1.2 Installing Devices
This section describes how to slide and secure a controller enclosure into the cabinet, and then check
the installation. Before installing the controller enclosure, ensure that the guide rails have been
properly installed and the protective cover of the controller enclosure has been removed. Install
interface modules in the controller enclosure before installing the controller enclosure. If the interface
modules have been installed on the controller enclosure, skip this procedure. Cable trays facilitate
maintenance and cabling of interface modules. Each controller enclosure requires one cable tray.
Before installing a disk enclosure into a cabinet, ensure that the guide rails have been installed
properly. Insert disks into a storage device before installing the storage device. If all disks have been
inserted into the storage device to be installed, skip this procedure.
4.2.1.1.3 Connecting Cables
Storage devices involve ground cables, power cables, network cables, optical fibers, RDMA cables,
and serial cables.
Connecting disk enclosures to the controller enclosure expands the storage space. Observe the cascading principles and then cascade the disk enclosures accordingly.
Bend cables naturally and reserve at least 97 mm space in front of each enclosure for wrapping cables.
Standard and smart disk enclosures cannot be connected to the same expansion loop.
If you want to connect two or more disk enclosures, create multiple loops according to the number of expansion ports on the controller enclosure and distribute the disk enclosures evenly among the loops.
The number of disk enclosures connected to the expansion ports on the controller enclosure and that
connected to the back-end ports cannot exceed the upper limit.
Connect the expansion module on controller A to expansion module A on each disk enclosure and the
expansion module on controller B to expansion module B on each disk enclosure.
A pair of SAS ports supports connection of up to two SAS disk enclosures; one is recommended.
A pair of RDMA ports supports connection of up to two smart disk enclosures; one is recommended.
4.2.1.1.4 Checking Hardware Installation and Powering On
Check that all components and cables are correctly installed and connected.
After hardware installation is complete, power on storage devices and check that they are working
properly. You can press the power buttons on all controller enclosures or remotely power on
controller enclosures.
Correct power-on sequence: Switch on the external power supplies connected to all devices; press the power button on either controller; switch on the switches; switch on the application servers.
4.2.1.1.5 Initializing the Storage System
After checking that the storage system is correctly powered on, initialize the storage system.
Initialization operations include: changing the management IP address, logging in to DeviceManager,
initializing the configuration wizard, configuring security policies, and handling alarms.
In the initial configuration wizard, configure basic information such as device information, time, license, and alarm settings; create a storage pool; scan for UltraPath hosts; and allocate resources.
4.2.1.1.6 Security Policies
System security policies include the account, login, access control, and user account audit policies.
Proper settings of the security policies improve system security.
Configuring account policies: user name, password complexity, and validity period
Configuring login policies: password locking and idle account locking
Configuring authorized IP addresses: This function specifies the IP addresses that are allowed to
access DeviceManager to prevent unauthorized access. After access control is enabled,
DeviceManager is accessible only to the authorized IP addresses or IP address segment.
Configuring user account auditing: After account auditing is enabled, the system periodically sends
account audit alarms to remind the super administrator to audit the number, role, and status of
accounts to ensure account security.
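As an illustration of the account policy settings above, the sketch below checks a candidate password against an assumed complexity rule (minimum length plus mixed character classes). The specific thresholds are examples, not the storage system's built-in defaults.

```python
# Illustrative password complexity check; thresholds are examples, not product defaults.
import string

def meets_policy(password: str, min_length: int = 8) -> bool:
    """Require minimum length plus upper-case, lower-case, digit, and special characters."""
    checks = [
        len(password) >= min_length,
        any(c.isupper() for c in password),
        any(c.islower() for c in password),
        any(c.isdigit() for c in password),
        any(c in string.punctuation for c in password),
    ]
    return all(checks)

if __name__ == "__main__":
    for candidate in ("admin123", "Adm!n_2020"):
        print(candidate, "->", "accepted" if meets_policy(candidate) else "rejected")
```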
4.2.1.1.7 Alarms and Events
To better manage and clear alarms and events, read this section to learn the alarming mechanism,
alarm and event notification methods, and alarm dump function. Alarm severities indicate the impact
of alarms on user services. In Huawei all-flash storage systems, alarm severities are classified into critical, major, and warning in descending order.
Alarm notifications can be sent by email, SMS message, Syslog, or trap.
Email notification: allows alarms of specified severities to be sent to preset email addresses.
SMS notification: allows alarms and events of specified severities to be sent to preset mobile phones
by SMS.
Syslog notification: allows you to view storage system logs on a Syslog server.
Trap notification: You can modify the addresses that receive trap notifications based on service
requirements. The storage system's alarms and events will be sent to the network management
systems or other storage systems specified by the trap servers.
Alarm dump: automatically dumps alarm messages to a specified FTP or SFTP server when the number of alarm messages exceeds a configurable threshold.
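The sketch below illustrates the two mechanisms described above: forwarding only alarms at or above a chosen severity to a notification channel, and dumping queued alarms once their number exceeds a threshold. Severity names follow the text; the threshold value and sample alarms are assumptions for illustration.

```python
# Illustrative alarm notification filtering and dump-threshold logic; values are assumptions.
SEVERITY_ORDER = {"critical": 0, "major": 1, "warning": 2}   # descending impact

def alarms_to_notify(alarms: list[dict], minimum_severity: str) -> list[dict]:
    """Keep only alarms whose severity is at least the configured minimum."""
    limit = SEVERITY_ORDER[minimum_severity]
    return [a for a in alarms if SEVERITY_ORDER[a["severity"]] <= limit]

def should_dump(queued_alarm_count: int, threshold: int = 1000) -> bool:
    """Dump queued alarms to an FTP/SFTP server once the count exceeds the threshold."""
    return queued_alarm_count > threshold

if __name__ == "__main__":
    alarms = [{"id": 1, "severity": "warning"}, {"id": 2, "severity": "critical"}]
    print(alarms_to_notify(alarms, "major"))   # only the critical alarm is forwarded
    print(should_dump(1200))                   # True: time to dump
```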
4.2.1.2 Service Deployment
The basic service configuration process involves block service configuration and file service
configuration.
Before configuring the block service, plan the configuration and perform checks. Check whether the
software installation, network connection status, and initial configuration meet the configuration
requirements. To configure the basic block service on DeviceManager, create disk domains, storage
pools, LUNs, LUN groups, hosts, and host groups, and then map LUNs or LUN groups to hosts or
host groups. Some processes and steps vary depending on products. For example, you may not need to
create a disk domain for Huawei all-flash storage devices. For details about the configuration
procedure, see the product documentation of the corresponding product.
Before configuring the file service, plan the configuration and perform checks as well. To configure
the basic file service on DeviceManager, create disk domains, storage pools, and file systems, and
share and access the file systems with application servers. You can create quota trees and quotas for
file systems. Some processes and steps vary according to products. For details about the configuration
procedure, see the product documentation of the corresponding product.
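The sketch below captures the block service configuration order described above as a simple checklist, so each object is created only after the objects it depends on. The step names mirror the text; the "create" actions are placeholders, not DeviceManager or REST API calls.

```python
# Illustrative checklist of the block service configuration order; the "create" actions
# are placeholders, not DeviceManager or REST API calls.
BLOCK_SERVICE_STEPS = [
    ("disk domain",  []),                      # may not be needed on all-flash systems
    ("storage pool", ["disk domain"]),
    ("LUN",          ["storage pool"]),
    ("LUN group",    ["LUN"]),
    ("host",         []),
    ("host group",   ["host"]),
    ("mapping",      ["LUN group", "host group"]),
]

def run_checklist(steps):
    created = set()
    for name, prerequisites in steps:
        missing = [p for p in prerequisites if p not in created]
        if missing:
            raise RuntimeError(f"Cannot create {name}: missing {missing}")
        print(f"create {name}")                # placeholder for the actual configuration action
        created.add(name)

if __name__ == "__main__":
    run_checklist(BLOCK_SERVICE_STEPS)
```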
4.2.2 Distributed Storage Installation and Deployment
4.2.2.1 System Installation
Before the installation, prepare the hardware, operating system, tools, software, and technical
documents.
The procedure for preparing hardware is similar to that in a flash storage system.
Before installing FusionStorage Manager (FSM) and FusionStorage Agent (FSA) nodes, verify the
software package, and configure and check the installation environment.
Installing FSM is to deploy management nodes of the storage service. The nodes work in
active/standby mode to improve reliability.
Installing FSA is to deploy storage nodes of the storage service.
After the installation, use the inspection tool or perform service dialing tests to check services.
The theory part focuses on hardware-related content, while software installation and deployment are covered in the lab exercises. The hardware part mainly covers recommended switch configurations and connection examples.
 Configuring a storage switch
Node port: You are advised to connect service ports of nodes in sequence.
Reserved ports: These ports are idle. You are advised to run the shutdown command to disable them.
M-LAG port: You are advised to use two 100GE ports for interconnection between switches.
Aggregation port: You are advised to use four 100GE ports to connect to the aggregation switch.
ETH management port: It is used to manage switches and connects to the BMC management
switch.
 Configuring a management switch
MGMT port: It connects to the MGMT port of each node.
Reserved ports: These ports are idle. You are advised to run the shutdown command to disable them.
NIC port: It connects to the NIC port of each node.
Aggregation port: You are advised to use two GE ports to connect to the management network.
ETH management port: It connects to the ETH management port of the storage switch.
 Cable connection in converged deployment (for block)
The figure describes the port usage of nodes when the storage network is 10GE/25GE and each
node is equipped with one 4-port 10GE/25GE NIC.
 Cable connection in separated deployment (for block)
The figure describes the port usage of nodes when the storage network is 10GE/25GE and each
node is equipped with one 4-port 10GE/25GE NIC.
 Object service node connection
The figure describes the port usage when the service network is GE, the storage network is
10GE, and each storage node is equipped with one 4-port 10GE/25GE NIC.
 HDFS service node connection
The figure describes the port usage when the service and storage networks are 10GE and each
storage node is equipped with one 4-port 10GE NIC.
 KVM signal cable connection
An idle VGA port must be connected with a KVM cable. The other end of the KVM cable is
bound to the mounting bar of the cabinet.
4.2.2.2 Service Deployment
The following describes port planning on a 48-port CE6800 service switch and a storage switch.
The deployment processes of the block, file, HDFS, and object services are the same. You select the service type when creating a storage pool, and you can import the licenses of different services to specify the service type provided by each cluster.
For the block service, you need to create a VBS client before configuring services. For the object
service, you need to initialize the object service before configuring services.
The following describes the configuration processes of different services. The training materials apply
to many scenarios but the specific configuration may vary based on actual needs. For details, see the
corresponding basic service configuration guide.
 Block storage configuration process
SCSI: The compute node must be configured with the VBS client, management network, and
front-end storage network. The front-end storage IP address and management IP address of the
added compute node must communicate with the network plane of existing nodes in the cluster.
iSCSI: A compute node must be configured with the multipathing software and an independent
service network is deployed between the host and storage system. To configure the iSCSI
service, you need to plan the IP address for the node to provide the iSCSI service.
 HDFS service configuration process
When configuring a Global zone/NameNode zone, you need to plan the IP address for the node
to provide the HDFS metadata service and data service for external systems.
 Object storage configuration process
The object service uses an independent installation package and needs to be deployed during
object service initialization.
When configuring the service network for the object service, you need to plan the IP addresses
for the nodes to provide the object service.
5 Storage Maintenance and Troubleshooting
5.1 Storage O&M
5.1.1 O&M Overview
O&M provides technical assurance for products to deliver quality services. Its definition varies with
companies and business stages.
 DeviceManager: device-level O&M software.
 SmartKit: a professional tool for Huawei technical support engineers. It provides functions
including compatibility evaluation, planning and design, one-click fault information collection,
inspection, upgrade, and FRU replacement.
 eSight: multi-device maintenance suite provided for customers. It allows fault monitoring and
visualized O&M.
 DME: intended for customers. It offers unified management of storage resources, service catalog orchestration, and on-demand supply of storage services and data application services.
 eService Client: deployed in a customer's equipment room. It detects storage device exceptions in real time and reports them to the Huawei maintenance center.
 eService cloud system: deployed at the Huawei maintenance center. It monitors devices on the network in real time, turning reactive maintenance into proactive maintenance, and can even perform maintenance on behalf of customers.
 DME Storage: a full-lifecycle automated management platform.
DME Storage adopts the service-oriented architecture design. Based on automation, AI analysis, and
policy supervision, DME Storage integrates all phases of the storage lifecycle, including planning,
construction, O&M, and optimization, to implement automated storage management and control.
Key characteristics of the architecture include:
 Adopts the distributed microservice architecture with high reliability and 99.9% availability.
 Manages large-scale storage devices. A single node can manage 16 storage devices, each with
1500 volumes.
 Supports open northbound and southbound systems. Northbound systems provide RESTful,
SNMP, and ecosystem plug-ins (such as Ansible) to interconnect with upper-layer systems.
Southbound systems manage storage resources through open interface protocols such as SNMP
and RESTful.
 Allows AI & policy engine-based proactive O&M by DME.
 Automatically analyzes detected issues in the maintenance phase.
 Automatically checks and detects potential issues from multiple dimensions, such as capacity,
performance, configuration, availability, and optimization, based on preset policies and AI
algorithm models.
 Supports user-defined check policies, such as capacity thresholds.
5.1.2 Information Collection
Collecting information in the event of a fault helps maintenance engineers quickly locate and rectify
the fault. Necessary information includes basic information, fault information, storage system
information, network information, and application server information. You must obtain the customer's
consent before collecting information.
5.1.3 O&M Operations
Routine maintenance is required during device running. The O&M process is as follows:
1. Check storage system indicators.
2. Check system information.
3. Check the service status.
4. Check storage system performance.
5. Check and handle alarms.
6. Collect information and report faults.
a. Collect information using DeviceManager.
b. Collect information using SmartKit.
c. Contact technical support.
5.2 Troubleshooting
5.2.1 Fault Overview
Storage system faults are classified into minor faults, major faults, and critical faults in terms of fault
impact.
Faults can be divided into storage faults and environment faults in terms of fault occurrence location.
 Storage fault
Storage system fault caused by hardware or software. The fault information can be obtained
using the alarm platform of the storage system.
 Environment fault
A software or hardware fault that occurs when data is transferred from the host to the storage system over the network. Such faults are typically caused by network links. The fault information can be obtained from operating system logs, application logs, and switch logs.
5.2.2 Troubleshooting Methods
Troubleshooting refers to measures taken when a fault occurs, including common troubleshooting and
emergency handling.
Basic fault locating principles help you exclude useless information and locate faults.
During troubleshooting, observe the following principles:
 Analyze external factors first, and then internal factors.
When locating faults, consider the external factors first.
External factor failures include failures in optical fibers, optical cables, power supplies, and
customers' devices.
Internal factors include disks, controllers, and interface modules.
 Analyze the alarms of higher severities first, and then those of lower severities (see the ordering sketch at the end of this subsection).
The alarm severity sequence from high to low is critical, major, and warning.
 Analyze common alarms and then uncommon alarms.
When analyzing an event, confirm whether it is an uncommon or common fault and then
determine its impact. Determine whether the fault occurred on only one component or on
multiple components.
To improve the emergency handling efficiency and reduce losses caused by emergency faults,
emergency handling must comply with the following principles:
− If a fault that may cause data loss occurs, stop host services or switch services to the standby
host, and back up the service data in time.
− During emergency handling, completely record all operations performed.
− Emergency handling personnel must take part in dedicated training courses and understand the related technologies.
− Recover core services before recovering other services.
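As referenced above, a simple way to apply the first two principles is to order the outstanding alarms by severity and then by how frequently they occur. The sketch below is illustrative only; the alarm records are assumed sample data.

```python
# Illustrative ordering of outstanding alarms: higher severity first, then more common alarms first.
SEVERITY_RANK = {"critical": 0, "major": 1, "warning": 2}

def analysis_order(alarms: list[dict]) -> list[dict]:
    """Sort alarms for analysis: critical before major before warning, frequent before rare."""
    return sorted(alarms, key=lambda a: (SEVERITY_RANK[a["severity"]], -a["occurrences"]))

if __name__ == "__main__":
    alarms = [
        {"name": "link down",        "severity": "major",    "occurrences": 12},
        {"name": "disk failure",     "severity": "critical", "occurrences": 1},
        {"name": "capacity warning", "severity": "warning",  "occurrences": 30},
    ]
    for alarm in analysis_order(alarms):
        print(alarm["severity"], alarm["name"])
```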
5.2.3 Troubleshooting Practices
Storage system troubleshooting is classified into common troubleshooting and emergency handling.
Common troubleshooting deals with faults of the following two types:
 Management software faults
 Basic storage service faults
Emergency handling deals with faults of the following types:
 Hardware faults
 Multipathing software faults
 Basic storage service faults
 Value-added service faults
 Other faults