HCIP-Storage V5.0 Learning Guide
HCIP-Storage
Course Notes
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their
respective holders.
Notice
The purchased products, services and features are stipulated by the contract made between
Huawei and the customer. All or part of the products, services and features described in this
document may not be within the purchase scope or the usage scope. Unless otherwise specified in
the contract, all statements, information, and recommendations in this document are provided
"AS IS" without warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made
in the preparation of this document to ensure accuracy of the contents, but all statements,
information, and recommendations in this document do not constitute a warranty of any kind,
express or implied.
System architecture
Pangea V6 Arm hardware platform, fully autonomous and controllable
Huawei-developed HiSilicon Kunpeng 920 CPU
2 U controller enclosure with integrated disks
The controller enclosure can house 25 x 2.5-inch SAS SSDs or 36 x palm-sized NVMe SSDs.
Two controllers in an enclosure work in active-active mode.
Disk enclosure
If the controller enclosure uses NVMe SSDs, it must connect to NVMe disk enclosures. If the
controller enclosure uses SAS SSDs, it must connect to SAS disk enclosures.
The disk enclosure (including the entry-level controller enclosure used as a disk enclosure) is
powered on and off with the controller enclosure. The power button on the disk enclosure is
invalid and cannot control disk enclosure power separately.
The smart disk enclosure has Arm CPUs and 8 GB or 16 GB memory, providing computing
capability to offload reconstruction tasks.
Next, let's look at the software architecture. Huawei all-flash storage supports multiple advanced
features, such as HyperSnap, HyperMetro, and SmartQoS. Maintenance terminal software such
as SmartKit and eService can access the storage system through the management network port or
serial port. Application server software such as OceanStor BCManager and UltraPath can access
the storage system through iSCSI or Fibre Channel links.
OceanStor Dorado 8000 V6 and Dorado 18000 V6 storage systems use the SmartMatrix full-
mesh architecture, which leverages a high-speed, fully interconnected passive backplane to
connect to multiple controllers. Interface modules (Fibre Channel and back-end expansion) are
shared by all controllers over the backplane, allowing hosts to access any controller via any port.
The SmartMatrix architecture allows close coordination between controllers, simplifies software
models, and achieves active-active fine-grained balancing, high efficiency, and low latency.
Front-end full interconnection
The high-end product models of Huawei all-flash storage support front-end interconnect I/O
modules (FIMs), which can be simultaneously accessed by four controllers in a controller
enclosure. Upon reception of host I/Os, the FIM directly distributes the I/Os to appropriate
controllers.
Full interconnection among controllers
Controllers in a controller enclosure are connected by 100 Gbit/s RDMA links (40 Gbit/s for
OceanStor Dorado 3000 V6) on the backplane.
For scale-out to multiple controller enclosures, any two controllers are directly connected to
avoid data forwarding.
Back-end full interconnection
Huawei OceanStor Dorado 8000 and 18000 V6 support back-end interconnect I/O modules
(BIMs), which allow a smart disk enclosure to be connected to two controller enclosures and
accessed by eight controllers simultaneously. This technique, together with continuous mirroring,
allows the system to tolerate failure of 7 out of 8 controllers.
Huawei OceanStor Dorado 3000, 5000, and 6000 V6 do not support BIMs. Disk enclosures
connected to OceanStor Dorado 3000, 5000, and 6000 V6 can be accessed by only one controller
enclosure. Continuous mirroring is not supported.
The active-active architecture with multi-level intelligent balancing algorithms balances service
loads and data in the entire storage system. Customers only need to consider the total storage
capacity and performance requirements of the storage system.
LUNs are not owned by any specific controller. LUN data is divided into 64 MB slices. Slices are
distributed to different vNodes (each vNode matches a CPU) based on the hash (LUN ID + LBA)
result.
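As a rough illustration of this hash-based placement, the following sketch distributes 64 MB slices across an assumed set of eight vNodes; the hash function and vNode count are examples, not the product's actual implementation.

```python
# Simplified sketch of LUN slice distribution (not the product's actual algorithm).
# LUN data is cut into 64 MB slices; hash(LUN ID + slice index) selects a vNode.
import hashlib

SLICE_SIZE = 64 * 1024 * 1024   # 64 MB slices
NUM_VNODES = 8                  # example: 8 vNodes (one per CPU)

def vnode_for_io(lun_id: int, lba: int) -> int:
    """Return the vNode index that owns the slice containing this LBA."""
    slice_index = lba // SLICE_SIZE
    key = f"{lun_id}:{slice_index}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VNODES

# Consecutive slices of one LUN spread across all vNodes:
print([vnode_for_io(lun_id=1, lba=i * SLICE_SIZE) for i in range(10)])
```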
The balancing algorithms on Huawei all-flash storage include:
Front-end load balancing
Huawei UltraPath selects proper physical links to send each slice to the corresponding vNode.
The FIMs forward the slices to the corresponding vNodes.
If there is no UltraPath or FIM, the controllers forward I/Os to the corresponding vNodes.
Global write cache load balancing
Data volumes received by the global write cache are balanced, and data hotspots are evenly
distributed on all vNodes.
Global storage pool load balancing
Disk utilization, disk service life, data distribution, and hotspot data are evenly distributed.
vNode partitioning avoids cross-CPU access to memory, as well as conflicts between CPUs,
allowing performance to increase linearly with the number of CPUs.
Service grouping: All CPU cores of a vNode are divided into multiple core groups. Each
service group matches a CPU core group. The CPU cores corresponding to a service group
run only the service code of this group, and different service groups do not interfere with
each other. Service groups isolate various services on different cores, preventing CPU
contention and conflicts.
Lock-free: In a service group, each core uses an independent data organization structure to
process service logic. This prevents the CPU cores in a service group from accessing the
same memory structure, and implements lock-free design between CPU cores.
2. Sequential writes of large blocks
Flash chips on SSDs can be erased for a limited number of times. In traditional RAID
overwrite mode, hot data on an SSD is continuously rewritten, and its mapping flash chips
wear out quickly.
Huawei OceanStor Dorado V6 supports ROW-based sequential writes of large blocks.
Controllers detect data layouts in Huawei-developed SSDs and aggregate multiple small and
discrete blocks into a large sequential block. Then the large blocks are written into SSDs in
sequence. RAID 5, RAID 6, and RAID-TP perform just one I/O operation and do not
require the usual multiple read and write operations for small and discrete write blocks. In
addition, RAID 5, RAID 6, and RAID-TP deliver similar write performance.
3. Hot and cold data separation
The controller works with SSDs to identify hot and cold data in the system, improve
garbage collection efficiency, and reduce the program/erase (P/E) cycles on SSDs to
prolong their service life.
Garbage collection: In an ideal situation, garbage collection would expect all data in a block
to be invalid so that the whole block could be erased without data movement. This would
minimize write amplification.
Multi-streaming: Data with different change frequencies is written to different SSD blocks,
reducing garbage collection.
Separation of user data and metadata: Metadata is frequently modified and is written to
different SSD blocks from user data.
Separation of new data and garbage collection data: Data to be reclaimed by garbage
collection is saved in different SSD blocks from newly written data.
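The separation rules above can be sketched as a simple stream-selection policy. The stream IDs and the rewrite-count threshold below are illustrative assumptions, not the storage system's real logic.

```python
# Illustrative stream-selection policy for multi-streaming: data with similar
# lifetimes/change frequency is written to the same SSD stream (block group).
STREAMS = {
    "metadata": 0,        # frequently modified metadata
    "hot_user_data": 1,   # frequently rewritten user data
    "cold_user_data": 2,  # rarely rewritten user data
    "gc_relocated": 3,    # valid data relocated by garbage collection
}

def pick_stream(is_metadata: bool, rewrite_count: int, from_gc: bool) -> int:
    if from_gc:
        return STREAMS["gc_relocated"]
    if is_metadata:
        return STREAMS["metadata"]
    # Example threshold: treat frequently rewritten blocks as hot.
    return STREAMS["hot_user_data"] if rewrite_count > 16 else STREAMS["cold_user_data"]

print(pick_stream(is_metadata=False, rewrite_count=3, from_gc=False))   # -> 2
```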
4. I/O priority adjustment
I/O priority adjustment functions like a highway. A highway has normal lanes for general
traffic, but it also has emergency lanes for vehicles which need to travel faster. Similarly,
priority adjustment lowers latency by granting different types of I/Os different resource
priorities based on their SLAs.
5. Smart disk enclosure
The smart disk enclosure is equipped with CPU and memory resources, and can offload
tasks, such as data reconstruction upon a disk failure, from controllers to reduce the
workload on the controllers and eliminate the impact of such tasks on service performance.
Reconstruction process of a common disk enclosure, using RAID 6 (21+2) as an example: If
disk D1 is faulty, the controller must read D2 to D21 and P, and then recalculate D1. A total
of 21 data blocks must be read from disks. The read operations and data reconstruction
consume substantial CPU resources.
Reconstruction of a smart disk enclosure: The smart disk enclosure receives the
reconstruction request and reads data locally to calculate the parity data. Then, it only needs
to transmit the parity data to the controller. This saves the network bandwidth.
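As a rough, assumed-numbers illustration of the bandwidth saved per stripe (the 4 MB chunk size is an example, not a product specification):

```python
# Rough illustration of the network traffic saved per stripe by in-enclosure
# reconstruction, assuming RAID 6 (21+2) and an example 4 MB chunk size.
CHUNK_MB = 4
chunks_read = 21                                  # D2-D21 plus P, read to rebuild D1

ordinary_enclosure_mb = chunks_read * CHUNK_MB    # all 21 chunks cross the network
smart_enclosure_mb = 1 * CHUNK_MB                 # only the locally computed result is sent
print(ordinary_enclosure_mb, smart_enclosure_mb)  # 84 vs 4 -> roughly 95% less traffic
```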
Load sharing of controller tasks: Each smart disk enclosure has two expansion modules with
Kunpeng CPUs and memory resources. The smart disk enclosure takes over some
workloads from the controller enclosure to save controller resources.
6. AI
Huawei OceanStor all-flash storage systems use the Ascend 310 AI chip to boost the
computing power and accelerate services. Ascend 310 is a highly efficient, flexible, and
programmable AI processor that supports multiple data precisions as well as both training and
inference. Ascend 310 balances AI computing power and energy efficiency and can analyze data
characteristics such as access frequency (hot and cold data), health status, and data association.
The intelligent analysis of this AI chip allows for implementation of
functions such as intelligent cache, intelligent QoS, and intelligent deduplication.
1.1.3.2 High Reliability
Next, let's look at high reliability technologies. OceanStor Dorado 8000 and 18000 V6 offer
protection measures against component and power failures, and use advanced technologies to
minimize risks of disk failures and data loss, ensuring system reliability. In addition, the storage
systems provide multiple advanced protection technologies to protect data against catastrophic
disasters and ensure continuous system running.
High availability architecture
Tolerating simultaneous failure of two controllers: The global cache provides three cache copies
across controller enclosures. If two controllers fail simultaneously, at least one cache copy is
available. A single controller enclosure can tolerate simultaneous failure of two controllers with
the three-copy mechanism.
Tolerating failure of a controller enclosure: The global cache provides three cache copies across
controller enclosures. A smart disk enclosure connects to 8 controllers (in 2 controller
enclosures). If a controller enclosure fails, at least one cache copy is available.
Tolerating successive failure of 7 out of 8 controllers: The global cache provides continuous
mirroring to tolerate successive failure of 7 out of 8 controllers (on 2 controller enclosures).
Zero interruption upon controller failure
The front-end ports are the same as common Ethernet ports. Each physical port provides one host
connection and has one MAC address.
Local logical interfaces (LIFs) are created for internal links. Four internal links connect to all
controllers in an enclosure. Each controller has a local LIF.
IP addresses are configured on the LIFs of the controllers. The host establishes IP connections
with the LIFs.
If the LIF goes down upon a controller failure, the IP address automatically fails over to the LIF
of another controller.
Non-disruptive upgrade with a single link
The process is as follows:
I/O process upgrade time < 1.5s; host reconnection time < 3.5s; service suspension time < 5s
SMB advanced features
Server Message Block (SMB) is a protocol used for network file access. It allows a local PC to
access files and request services on PCs over the local area network (LAN). CIFS is a public
version of SMB.
SMB 2.0 implements a failover as follows: SmartMatrix continuously mirrors SMB 2.0 durable
handles across controllers. If a controller or an interface module is faulty, the system performs
transparent migration of NAS logical interfaces. When the host restores the SMB 2.0 service
from the new controller, the controller obtains the handle from the controller on which the
durable handle is backed up to ensure service continuity.
SMB 3.0 implements a failover as follows: SmartMatrix continuously mirrors SMB 3.0 persistent
handles across controllers. If a controller or an interface module is faulty, the system performs
transparent migration of NAS logical interfaces. The host restores the persistent handle that was
backed up on a controller to a specified controller based on the SMB 3.0 failover standards.
Failover group
A failover group is a group of ports that are used for IP address failover in a storage system. The
storage system supports the default failover group, VLAN failover group, and user-defined
failover group. Manual and automatic failbacks are supported. A failback takes about 5 seconds.
Default failover group: If a port is faulty, the storage device fails over the LIFs of this port to a
port with the same location, type (physical or bond), rate (GE or 10GE), and MTU on the peer
controller. If the port is faulty again, the storage device finds a proper port on another controller
using the same rule. On a symmetric network, select this failover group when creating LIFs.
VLAN failover group: The system automatically creates a VLAN failover group when a VLAN
port is created. If a VLAN port is faulty, the storage device fails over the LIFs to a normal VLAN
port that has the same tag and MTU in the failover group. Use this failover group for easier
deployment of LIFs when VLANs are used.
User-defined failover group: The user manually specifies the ports in a failover group. If a port is
faulty, the system finds a proper port from the specified group member ports.
Data reliability solution
Dual mappings for directory metadata: Directories and inodes have dual logical mappings for
redundancy.
Data redundancy with snapshots: Snapshots provide local redundancy for file system data and
data recovery when needed.
Data redundancy on disks: Data is redundantly stored on disks using RAID 2.0+ to prevent loss
in the event of disk failures. The system automatically recovers the data using RAID as long as
the amount of corrupted data is within the permitted range.
Data redundancy across sites: Corrupted data at the local site can be recovered from the remote
site.
1.1.3.3 High Security
Trusted and secure boot of hardware
Secure boot establishes a hardware root of trust (which is tamperproof) to implement layer-by-layer
authentication. This builds a trust chain across the entire system to achieve predictable
system behavior.
Huawei OceanStor all-flash storage systems use this methodology to avoid loading tampered
software during the boot process.
Software verification and loading process for secure boot:
Verify the signed public key of Grub. BootROM verifies the integrity of the signed public key of
Grub. If the verification fails, the boot process is terminated.
Verify and load Grub. BootROM verifies the Grub signature and loads Grub if the verification is
successful. If the verification fails, the boot process is terminated.
Verify the status of the software signature certificate. Grub verifies the status of the software
signature certificate based on the certificate revocation list. If the certificate is invalid, the boot
process is terminated.
Verify and load the OS. Grub verifies the OS signature and loads the OS if the verification is
successful. If the verification fails, the boot process is terminated.
Role-based permission management
Preset default roles: The system provides default roles for system and vStore administrators.
Default roles of system administrators include the administrator role.
User-defined role: Users customize roles based on service requirements. During customization,
users can select multiple functions for a role and multiple objects for each function. User-defined
roles can be deleted and modified.
Security log audit
Technical principles of the native audit log:
Users can specify the file systems and file operations to be audited, such as create, delete,
rename, modify, and chmod.
Audit logs and read/write I/Os are processed in the same process to record the I/Os and logs at
the same time.
Audit logs are stored as metadata in the Audit-Dtree directory of each file system to ensure I/O
performance.
The system converts the log metadata from the *.bin format to the *.xml format in the
background for reads and writes.
Audit logs in the *.xml format are stored in the Audit-Log-FS file system of each vStore.
Asynchronous remote replication provides disaster recovery for the audit logs.
Brand-new architecture: The latest-generation multi-core CPU and SmartMatrix 3.0 architecture
enable the storage systems to support up to 32 controllers and 192 PB of all-flash capacity for linear
performance increase.
Ultimate convergence: SAN and NAS are converged to provide elastic storage, simplify service
deployment, improve storage resource utilization, and reduce TCO.
Outstanding performance: The flash-optimized technology gives full play to SSD performance. Inline
deduplication and compression are supported. Loads are balanced among controllers that serve as hot
backup for each other, delivering higher reliability. Resources are centrally stored and easily
managed.
Multi-controller redundancy: The storage system tolerates the failures of three out of four
controllers.
Next-generation power protection: BBUs are built into controllers. When a controller is
removed, the BBU provides power for flushing cache data to system disks. Even when
multiple controllers are removed concurrently, data is not lost.
3. Controller faults are transparent to hosts.
Port: Each front-end port provides one Fibre Channel session for a host. The host detects
only one Fibre Channel session and WWN from each storage port.
Chip: Four internal links are established, each connecting to a controller in a controller
enclosure. Each controller establishes its own Fibre Channel session with the host.
FIMs enable the full interconnection of front-end links and all storage controllers. When any
controller fails, they ensure continuous front-end access without affecting hosts. I'd like to
take a moment to examine exactly how FIMs work.
From the host's perspective, each front-end port provides the host with one Fibre Channel
session, so the host only identifies one Fibre Channel session and WWN from each storage
port.
From the storage system's perspective, four internal links are established, each connecting to a
controller in a controller enclosure. Each controller establishes its own Fibre Channel
session with the host.
Controller failures: When any controller in a controller enclosure fails, FIMs redirect the
I/Os to the remaining controllers. The host remains unaware of the fault, the Fibre Channel
links remain up, and services run properly. No alarm or event is reported.
The time required for data reconstruction is shortened from 10 hours to 30 minutes. Data reconstruction
speed is accelerated by 20 times, greatly reducing the impact on services and probabilities of
multi-disk failures in the reconstruction process. All disks in the storage pool participate in
reconstruction, and only service data is reconstructed. The reconstruction mode is changed from
traditional RAID's many-to-one to many-to-many.
Huawei-developed chips
Front-end transmission: The intelligent multi-protocol interface chip supports the industry's
fastest 32 Gbit/s Fibre Channel and 100 Gbit/s Ethernet protocol for hardware offloading. It
enables interface modules to implement protocol parsing previously performed by the CPU to
reduce the CPU workloads and improve the transmission performance. It reduces the front-end
access latency from 160 μs to 80 μs. The parsed data interacts with the CPU to implement
advanced features, such as traffic control.
Controller chip: The Kunpeng 920 processor is the first 7-nm Arm CPU in the industry and
integrates the southbridge, network adapter, and SAS controller chips.
SSD storage chip: The core FTL algorithm is embedded in the self-developed chip. The chip
directly determines the read/write location, reducing the write latency from 40 μs to 20 μs. The
last chip is the intelligent management chip, which is used on the management plane and is
important throughout a storage system's entire service life.
Intelligent management chip: Its built-in library contains more than 10 years of accumulated
storage fault cases, enabling problems to be identified and rectified quickly. Once a fault is
detected, the management chip quickly matches it against the fault models in the library, locates
the fault with an accuracy of 93%, and provides a solution.
RDMA scale-out
Four controllers are expanded to eight controllers without using any switches. The networking is
simple.
100 Gbit/s RDMA ports transmit data between the two controller enclosures.
VLANs are used for logical data communication to ensure data security and reliability on the I/O
plane and management and control plane.
Self-encrypting drive (SED)
SEDs use the AES-256 encryption algorithm to encrypt data stored on the disks without affecting
performance.
Internal Key Manager is a key management application embedded in storage systems. OceanStor
18000 V5 and 18000F V5 use the trusted platform module (TPM) to protect keys.
External Key Manager uses the standard KMIP + TLS protocols. Internal Key Manager is
recommended when the key management system is only used by the storage systems in a data
center.
OceanStor V5 storage systems combine SEDs with Internal Key Manager (built-in key
management system) or External Key Manager (independent key management system) to
implement static data encryption and ensure data security.
The principle of the AES algorithm is based on permutation and substitution. AES uses several
different methods to perform permutation and substitution operations. It is an iterative and
symmetric-key algorithm that has a fixed block size of 128 bits (16 bytes) and a key size of 128,
192, or 256 bits. Different from public-key ciphers that use key pairs, symmetric-key
ciphers use the same key to encrypt and decrypt data. The number of bits of the encrypted
data returned by AES is the same as that of the input data. The key size used for an AES cipher
specifies the number of repetitions of transformation rounds that convert the input, called the
plaintext, into the final output, called the ciphertext.
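For illustration only, the following sketch uses the Python cryptography package to show AES in XTS mode, the mode typically used for drive-level data-at-rest encryption: one symmetric key both encrypts and decrypts, and the ciphertext is the same length as the plaintext. It demonstrates the algorithm's general behavior, not the SED or key-manager implementation.

```python
# Illustrative AES-XTS encryption/decryption (XTS is the mode typically used for
# drive-level data-at-rest encryption). One symmetric key both encrypts and
# decrypts, and the ciphertext has the same length as the plaintext.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(64)                     # AES-256-XTS uses a double-length (512-bit) key
tweak = (1234).to_bytes(16, "little")    # per-sector tweak (example sector number)

def xts_cipher():
    return Cipher(algorithms.AES(key), modes.XTS(tweak))

sector = os.urandom(4096)                # one 4 KiB "sector" of user data
enc = xts_cipher().encryptor()
ciphertext = enc.update(sector) + enc.finalize()
dec = xts_cipher().decryptor()
assert dec.update(ciphertext) + dec.finalize() == sector   # the same key decrypts
assert len(ciphertext) == len(sector)                      # output size equals input size
```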
Internal Key Manager is easy to deploy, configure, and manage. There is no need to deploy an
independent key management system.
Advanced features
The block service and file service support a wide range of advanced features. For details, see the
training slides.
Scale-Out NAS adopts a fully symmetric distributed architecture. Scale-Out NAS is used for storing
mass unstructured data with its industry-leading performance, large-scale scale-out capability, and
ultra-large single file system. Huawei Scale-Out NAS can improve the storage efficiency of IT
systems, simplify the workload and migration process, and cope with the growth and evolution of
unstructured data.
Storage nodes:
F100: a 2 U 12-slot EXP NVMe all-flash node equipped with two Kunpeng 920 CPUs (48-core, 2.6 GHz); provides the block service.
A 2 U 12-slot node equipped with x86 CPUs; provides the converged, object, HDFS, and block services.
P110: a 2 U 25-slot node equipped with x86 CPUs; provides the block service.
C110: a 4 U 36-slot node equipped with x86 CPUs; provides the converged, object, HDFS, and block services.
F110: a 2 U 12-slot or 2 U 24-slot NVMe all-flash node equipped with x86 CPUs; provides the block service.
Network devices:
S5731-H48T4XC: functions as a GE BMC/management switch, and provides four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports.
S5720-56C-EI-AC: functions as a GE BMC/management switch, and provides four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports.
S5331-H48T4XC: functions as a GE BMC/management switch, and provides four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports.
S5320-56C-EI-AC: functions as a GE BMC/management switch, and provides four 10GE SFP+ Ethernet optical ports and forty-eight 10/100/1000BASE-T Ethernet electrical ports.
CE6881-48S6CQ: functions as a 10GE storage switch, and provides forty-eight 10GE SFP+ Ethernet optical ports and six 40GE QSFP28 Ethernet optical ports.
CE6855-48S6Q-HI: functions as a 10GE storage switch, and provides forty-eight 10GE SFP+ Ethernet optical ports and six 40GE QSFP+ Ethernet optical ports.
If Scale-Out NAS is used, the hardware contains storage nodes, network devices, KVM, and short
message service (SMS) modems. The preceding list describes the main hardware components.
It supports rich enterprise-class features, such as second-level HyperReplication and HyperMetro for
the block service. A microservice-based architecture is adopted, and the block, HDFS, and object
services can share the persistence service.
The block service supports a wide range of virtualization platforms and database applications with
standard access interface protocols such as SCSI and iSCSI, and delivers high performance and
scalability to meet SAN storage requirements of virtualization, cloud resource pools, and databases.
Key features of the block service include HyperMetro (active-active storage), HyperReplication
(remote replication), HyperSnap (snapshot), SmartQoS (intelligent service quality control),
SmartDedupe (deduplication), and SmartCompression (compression).
The object service supports mainstream cloud computing ecosystems with standard object service
APIs for content storage, cloud backup and archiving, and public cloud storage service operation. Key
features of the object service include HyperReplication (remote replication), Protocol-Interworking
(object/file interworking), SmartDedupe (deduplication), SmartQuota (quota management), and
SmartQoS (intelligent service quality control).
The HDFS service supports native HDFS interfaces without plug-ins and provides a cloud-enabled
decoupled storage-compute solution for big data analysis. It enables you to efficiently process
massive amounts of data, deploy and use resources on demand, and reduce TCO. Key features of the
HDFS service include SmartTier (tiered storage), SmartQuota (quota), and recycle bin.
The DHT ring of the block service contains 2^32 logical space units which are evenly divided
into n partitions. The n partitions are evenly allocated on all disks in the system. For example, n
is 3600 by default. If the system has 36 disks, each disk is allocated 100 partitions. The system
configures the partition-disk mapping relationship during system initialization and will adjust the
mapping relationship accordingly after the number of disks in the system changes. The partition-
disk mapping table occupies only a small space, and block service nodes store the mapping table
in the memory for rapid routing. The routing mechanism of the block service is different from
that of the traditional storage array. It does not employ the centralized metadata management
mechanism and therefore does not have performance bottlenecks incurred by the metadata
service.
An example is provided as follows: If an application needs to access the 4 KB data identified by
an address starting with LUN1 + LBA1, the system first constructs "key= LUN1 + LBA1/1M",
calculates the hash value of this key, performs a modulo-N operation on the hash value to get the
partition number, and then obtains the disk to which the data belongs from the partition-disk
mapping.
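A minimal sketch of this routing logic is shown below. The partition count (3600), disk count (36), hash function, and partition-to-disk map are example assumptions; the real system builds and maintains this map itself.

```python
# Simplified DHT routing sketch (example values, not the real implementation):
# key = "LUN ID + 1 MB-aligned LBA range" -> hash -> mod N -> partition -> disk.
import hashlib

N_PARTITIONS = 3600                     # default number of partitions (n)
N_DISKS = 36                            # example system: 100 partitions per disk
partition_to_disk = {p: p % N_DISKS for p in range(N_PARTITIONS)}   # built at initialization

def locate(lun_id: int, lba: int):
    """Return (partition, disk) for the 1 MB data range containing this LBA."""
    key = f"LUN{lun_id}+{lba // (1024 * 1024)}"          # e.g. "LUN1 + LBA1/1M"
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    partition = h % N_PARTITIONS                         # modulo-N operation
    return partition, partition_to_disk[partition]

print(locate(lun_id=1, lba=4096))       # a 4 KB read at LBA 4096 of LUN 1
```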
In addition, the DHT routing algorithm has the following characteristics:
Balance: Data is distributed to all nodes as evenly as possible, thereby balancing loads among
nodes.
Monotonicity: If new nodes are added to the system, the system redistributes data among nodes.
Only the data to be moved to the new nodes is migrated, and the data on the existing nodes is not
significantly adjusted.
Range segmentation and WAL aggregation
Data to be stored is distributed on different nodes in range mode. Write Ahead Log (WAL) is an
intermediate storage technology used before data persistence. After data is stored using WAL, the
message that data is written successfully can be returned to upper-layer applications. WAL
highlights that modifications to data files (they are carriers of tables and indexes) can only occur
after the modifications have been logged, that is, after the log records describing the changes
have been flushed to persistent storage.
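A minimal write-ahead-log sketch is shown below, using a plain file as the log for illustration; the product's WAL lives in SSD cache and is far more elaborate.

```python
# Minimal WAL sketch: the change is appended to the log and flushed to persistent
# storage first; only then is the write acknowledged and the data file updated.
import json
import os

def wal_write(log_path: str, data_path: str, record: dict) -> None:
    with open(log_path, "a") as log:             # 1. log the modification
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())                   #    ...and force it to persistent storage
    # 2. The write can now be acknowledged to the upper-layer application.
    with open(data_path, "a") as data:           # 3. apply it to the data file afterwards
        data.write(json.dumps(record) + "\n")

wal_write("demo.wal", "demo.dat", {"lba": 2048, "value": "A"})
```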
Multi-NameNode concurrency
The NameNode is the metadata request processing node of the HDFS, and the DataNode is the
data request processing node of the HDFS.
Traditional HDFS NameNode model:
Only one active NameNode provides the metadata service. The active and standby NameNodes
are not consistent in real time and have a synchronization period.
After the current active NameNode breaks down, the new NameNode cannot provide metadata
services for several hours until the new NameNode loads logs.
The number of files supported by a single active NameNode depends on the memory of a single
node. A maximum of 100 million files can be supported by a single active NameNode.
If a namespace is under heavy pressure, concurrent metadata operations consume a large number
of CPU and memory resources, resulting in poor performance.
Huawei HDFS multi-NameNode concurrency has the following features:
Multiple active NameNodes provide metadata services, ensuring real-time data consistency
among multiple nodes.
It avoids metadata service interruption caused by traditional HDFS NameNode switchover.
The number of files supported by multiple active NameNodes is no longer limited by the memory
of a single node.
Multi-directory metadata operations are concurrently performed on multiple nodes.
Append Only Plog technology
HDD and SSD media can be supported at the same time. Both media have significant differences
in technology parameters such as bandwidth, IOPS, and latency. Therefore, I/O patterns
applicable to both media are greatly different. The Append Only Plog technology is adopted for
unified management of HDDs and SSDs. It provides the optimal disk writing performance model
for media. Small I/O blocks are aggregated into large ones, and then large I/O blocks are written
to disks in sequence. This write mode complies with the characteristics of disks.
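The aggregation idea can be sketched as follows; the 2 MB flush threshold and the file-based "plog" are illustrative assumptions.

```python
# Sketch of append-only aggregation: small writes are buffered and flushed to the
# plog as one large sequential append (threshold and file layout are illustrative).
AGGREGATE_BYTES = 2 * 1024 * 1024      # assumed flush threshold (2 MB)

class AppendOnlyPlog:
    def __init__(self, path: str):
        self.path, self.buffer, self.buffered = path, [], 0

    def write(self, small_io: bytes) -> None:
        self.buffer.append(small_io)
        self.buffered += len(small_io)
        if self.buffered >= AGGREGATE_BYTES:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        with open(self.path, "ab") as plog:      # one large sequential append
            plog.write(b"".join(self.buffer))
        self.buffer, self.buffered = [], 0

plog = AppendOnlyPlog("demo.plog")
for _ in range(600):                             # 600 x 4 KB small I/Os
    plog.write(b"x" * 4096)
plog.flush()
```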
EC intelligent aggregation technology
The intelligent aggregation EC based on append write always ensures EC full-stripe write,
reducing read/write network amplification and disk amplification by several times. Data is
aggregated at a time, reducing the CPU computing overhead and providing ultimate peak
performance.
Multi-level cache technology
The following figure shows the write cache.
Step 1 The storage system writes data to the RAM-based write cache (memory write cache)
Step 2 The storage system writes data to the SSD WAL cache (for large I/Os, data is written to the
HDD) and returns a message to the host indicating that the write operation is complete.
Step 3 When the memory write cache reaches a certain watermark, the storage system writes data to
the SSD write cache.
Step 4 For large I/Os, the storage system writes data to the HDD. For small I/Os, the system first
writes data to the SSD write cache, and then writes data to the HDD after aggregating the small
I/Os into large I/Os.
Note: If the data written in Step 1 exceeds 512 KB, it is directly written to the HDD in Step 4.
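The decision logic of these steps can be sketched as follows. The 512 KB large-I/O rule comes from the note above; the watermark value and the aggregation details are simplified assumptions.

```python
# Sketch of the write-cache path described above (simplified assumptions).
LARGE_IO_BYTES = 512 * 1024

def handle_write(io, mem_cache, wal_cache, ssd_cache, hdd, watermark=64):
    mem_cache.append(io)                      # Step 1: RAM write cache
    if len(io) > LARGE_IO_BYTES:
        hdd.append(io)                        # large I/O: written directly to the HDD
    else:
        wal_cache.append(io)                  # Step 2: SSD WAL cache
    ack = "write complete"                    # acknowledgement returned to the host
    if len(mem_cache) >= watermark:           # Step 3: memory cache reaches its watermark
        small = [b for b in mem_cache if len(b) <= LARGE_IO_BYTES]
        if small:
            aggregated = b"".join(small)      # Step 4: aggregate small I/Os into a large one
            ssd_cache.append(aggregated)      # stage in the SSD write cache
            hdd.append(aggregated)            # then destage to the HDD
        mem_cache.clear()
    return ack

mem, wal, ssd, hdd = [], [], [], []
print(handle_write(b"y" * 4096, mem, wal, ssd, hdd))   # -> "write complete"
```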
The following figure shows the read cache.
Step 1 The storage system reads data from the memory write cache. If the read I/O is hit, the message
that the data is read successfully is returned. Otherwise, the storage system proceeds to Step 2.
Step 2 The storage system reads data from the memory read cache. If the read I/O is hit, the message
that the data is read successfully is returned. Otherwise, the storage system proceeds to Step 3.
Step 3 The storage system reads data from the SSD write cache. If the read I/O is hit, the message that
the data is read successfully is returned. Otherwise, the storage system proceeds to Step 4.
Step 4 The storage system reads data from the SSD read cache. If the read I/O is hit, the message that
the data is read successfully is returned. Otherwise, the storage system proceeds to Step 5.
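A minimal sketch of this layered lookup, assuming the final step falls through to reading from the HDD main storage:

```python
# Sketch of the layered read path: each cache tier is checked in order and the
# first hit is returned; the final fallback to main storage is an assumption.
def handle_read(key, mem_write_cache, mem_read_cache, ssd_write_cache,
                ssd_read_cache, main_storage):
    # Steps 1-4: check each cache tier in order and return on the first hit.
    for tier in (mem_write_cache, mem_read_cache, ssd_write_cache, ssd_read_cache):
        if key in tier:
            return tier[key]
    # Assumed final step: fall through to main storage (HDD).
    return main_storage[key]

caches = ({}, {"blk7": b"hot"}, {}, {})
print(handle_read("blk7", *caches, main_storage={"blk7": b"cold"}))   # -> b'hot'
```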
Typical scenarios include access clouds, development and testing clouds, cloud-based services,
B2B cloud resource pools in carriers' BOM domains, and e-Government clouds.
Mission-critical database
Enterprise-grade capabilities, such as distributed active-active storage and consistent low latency,
ensure efficient and stable running of data warehouses and mission-critical databases, including
online analytical processing (OLAP) and online transaction processing (OLTP).
Big data analysis
An industry-leading decoupled storage-compute solution is provided for big data, which
integrates traditional data silos and builds a unified big data resource pool for enterprises. It also
leverages enterprise-grade capabilities, such as elastic large-ratio erasure coding (EC) and on-
demand deployment and expansion of decoupled compute and storage resources, to improve big
data service efficiency and reduce the TCO. Typical scenarios include big data analysis for
finance, carriers (log retention), and governments.
Content storage and backup archiving
Superb-performance and high-reliability enterprise-grade object storage resource pools are
provided to meet the requirements of real-time online services such as Internet data, online audio
and video data, and enterprise web disks. It delivers large throughput, enables frequent access to
hotspot data, and implements long-term storage and online access. Typical scenarios include
storage, backup, and archiving of financial electronic check images, audio and video recordings,
medical images, government and enterprise electronic documents, and Internet of Vehicles (IoV).
For example, the distributed storage block service can be used in scenarios such as BSS, MSS,
OSS, and VAS. The object service can also be used in application scenarios. Its advantages are as
follows:
Stable and low latency for the customer access process: Latency is stable at less than 80 ms,
meeting the stability requirements for continuous video writes and improving the access
experience of end users.
High concurrent connections: Millions of video connections are supported, ensuring stable
performance.
On-demand use: Storage resources can be dynamically used and paid on demand based on
service growth at any time, reducing the TCO.
Automatic network deployment and network resource configuration are supported to implement
the convergence of compute and network resources. In addition, network resources are
dynamically associated with compute and storage resources.
Distributed Block Storage
FusionCube employs FusionStorage block storage to provide distributed storage services.
FusionStorage block storage uses an innovative cache algorithm and adaptive data distribution
algorithm based on a unique parallel architecture, which eliminates high data concentration and
improves system performance. FusionStorage block storage also allows rapid and automatic self-
recovery and ensures high system availability and reliability.
1. Linear scalability and elasticity: FusionStorage block storage uses the distributed hash table
(DHT) to distribute all metadata among multiple nodes. This prevents performance
bottlenecks and allows linear expansion. FusionStorage block storage leverages an
innovative data slicing technology and a DHT-based data routing algorithm to evenly
distribute volume data to fault domains of large resource pools. This allows load balancing
on hardware devices and higher IOPS and megabit per second (MBPS) performance of each
volume.
2. High performance: FusionStorage block storage uses a lock-free scheduled I/O software
subsystem to prevent conflicts of distributed locks. The delay and I/O paths are shortened
because there is no lock operation or metadata query on I/O paths. By using distributed
stateless engines, hardware nodes can be fully utilized, greatly increasing the concurrent
IOPS and MBPS of the system. In addition, the distributed SSD cache technology and large-
capacity SAS/SATA disks (serving as the main storage) ensure high performance and large
storage capacity.
3. High reliability: FusionStorage block storage supports multiple data redundancy and
protection mechanisms, including two-copy backup and three-copy backup. FusionStorage
block storage supports the configuration of flexible data reliability policies, allowing data
copies to be stored on different servers. Data will not be lost and can still be accessed even
in case of server faults. FusionStorage block storage also protects valid data slices against
loss. If a disk or server is faulty, valid data can be rebuilt concurrently. It takes less than 30
minutes to rebuild data of 1 TB. All these measures improve system reliability.
4. Rich advanced storage functions: FusionStorage block storage provides a wide variety of
advanced functions, such as thin provisioning, volume snapshot, and linked clone. The thin
provisioning function allocates physical space to volumes only when users write data to the
volumes, providing more virtual storage resources than physical storage resources. The
volume snapshot function saves the state of the data on a logical volume at a certain time
point. The number of snapshots is not limited, and performance is not compromised. The
linked clone function is implemented based on incremental snapshots. A snapshot can be
used to create multiple cloned volumes. When a cloned volume is created, the data on the
volume is the same as the snapshot. Subsequent modifications on the cloned volume do not
affect the original snapshot and other cloned volumes.
Automatic Deployment
FusionCube supports automatic deployment, which simplifies operations on site and increases
deployment quality and efficiency.
FusionCube supports preinstallation, preintegration, and preverification before the delivery,
which simplifies onsite installation and deployment and reduces the deployment time.
Devices are automatically discovered after the system is powered on. Wizard-based system
initialization configuration is provided for the initialization of compute, storage, and network
resources, accelerating service rollout.
An automatic deployment tool is provided to help users conveniently switch and upgrade
virtualization platforms.
Unified O&M
FusionCube supports unified management of hardware devices (such as servers and switches)
and resources (including compute, storage, and network resources). It can greatly improve O&M
efficiency and QoS.
A unified management interface is provided to help users perform routine maintenance on
hardware devices such as chassis, servers, and switches and understand the status of compute,
storage, and network resources in a system in real time.
The IT resource usage and system operating status are automatically monitored. Alarms are
reported for system faults and potential risks in real time, and alarm notifications can be sent to
O&M personnel by email.
Rapid automatic capacity expansion is supported. Devices to be added can be automatically
discovered, and wizard-based capacity expansion configuration is supported.
Typical Application Scenarios
Server virtualization: Integrated FusionCube virtualization infrastructure is provided without
requiring other application software.
Desktop cloud: Virtual desktop infrastructures (VDIs) or virtualization applications run on the
virtualization infrastructure to provide desktop cloud services.
Enterprise office automation (OA): Enterprise OA service applications such as Microsoft
Exchange and SharePoint run on the virtualization infrastructure.
Rack Servers
FusionServer servers (x86) and TaiShan servers (Kunpeng) are supported. FusionCube supports
1-socket, 2-socket, and 4-socket rack servers, which can be flexibly configured based on
customer requirements.
Next, let's look at the software architecture of hyper-converged storage.
The overall architecture of hyper-converged storage consists of the hardware platform,
distributed storage software, installation, deployment, and O&M management platforms,
virtualization platforms, and backup and disaster recovery (DR) software. The virtualization
platforms can be Huawei FusionSphere and VMware vSphere. In addition, in the FusionSphere
scenario, FusionCube supports the hybrid deployment of the virtualization and database
applications.
FusionCube Center: Manages FusionCube virtualization and hardware resources, and implements
system monitoring and O&M.
FusionCube Builder: Enables quick installation and deployment of FusionCube software. It can be
used to replace or update the virtualization platform software.
FusionStorage: Provides high-performance and high-reliability block storage services by using
distributed storage technologies to schedule local disks on servers in an optimized manner.
Virtualization platform: Implements system virtualization management. The Huawei FusionSphere
and VMware virtualization platforms are supported.
Backup: Provides the service virtualization function of backup systems, which include the
Huawei-developed backup software eBackup and mainstream third-party backup software, such as
Veeam, Commvault, and EISOO.
DR: Provides DR solutions based on active-active storage and asynchronous storage replication.
The DR software includes Huawei-developed BCManager and UltraVR.
Hardware platform: Supports E9000, X6800, X6000, and rack servers. The servers integrate
compute, storage, switch, and power modules and allow on-demand configuration of compute and
storage nodes. FusionCube supports GPU and SSD PCIe acceleration and expansion, as well as
10GE and InfiniBand switch modules to meet different configuration requirements.
In the traditional architecture, centralized SAN controllers cause a performance bottleneck. This
bottleneck is eliminated in FusionCube, which employs a distributed architecture and
distributed storage. This way, each machine contains compute and storage resources, so each
machine can be regarded as a distributed storage controller.
In the decoupled compute-storage architecture, all data needs to be read from and written to the
storage array through the network. As a result, the network limit becomes another bottleneck.
FusionCube removes this bottleneck by using the InfiniBand network, the fastest in the industry,
to provide 56 Gbit/s bandwidth with nodes interconnected in P2P mode.
The third bottleneck in the traditional architecture is the slow disk read/write speed. The Huawei
HCI architecture uses ES3000 SSD cards, the fastest in the industry, as the cache, which
effectively solves the problem of slow local disk reads/writes.
Logical structure of distributed storage: In the entire system, all modules are deployed in a
distributed and decentralized manner, which lays a solid foundation for high scalability and high
performance of the system. The functions of some key components are as follows:
1. The VBS module provides standard SCSI/iSCSI services for VMs and databases at the
stateless interface layer. It is similar to the controller of a traditional disk array, but unlike
the controller of a traditional disk array, the number of VBS modules is not limited. The
number of controllers in a traditional disk array is limited, but VBS modules can be
deployed on all servers that require storage services.
2. The OSD module manages disks and is deployed on all servers with disks. It provides data
read and write for VBS and advanced storage services, including thin provisioning,
snapshot, linked clone, cache, and data consistency.
3. The MDC module manages the storage cluster status and is deployed in each cluster. It is
not involved in data processing. It collects the status of each module in the cluster in real
time and controls the cluster view based on algorithms.
Distributed, many-to-many data reconstruction avoids performance bottlenecks caused by
reconstructing a large amount of data on a single node and minimizes adverse impacts on
upper-layer services.
Dynamic EC
EC Turbo drives higher space utilization and provides a data redundancy solution that ensures
stable performance and reliability when faults occur. Dynamic EC is supported. When a node is
faulty, EC reduction ensures data redundancy and performance. As shown in the following
figure, if the 4+2 EC scheme is used, when a node is faulty, the 4+2 EC scheme is reduced to 2+2
to ensure that the redundancy of newly written data on each node is not degraded.
EC folding is supported. While a cluster requires at least three nodes, the three nodes can also be
configured with the 4+2 EC scheme with the EC folding technology to improve space utilization.
Incremental EC is provided. For partially full stripes, the system supports writing incremental data
and parity bits as data is appended, for example, D1+D2+D3+D4+P1+P2. Storage utilization is
high: N+2, N+3, and N+4 redundancy levels (up to 22+2) are supported for a maximum storage
utilization of 90%.
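As a quick check of these ratios, the nominal utilization of an N+M scheme is N/(N+M); usable capacity in practice is somewhat lower once metadata and reserved space are deducted, which is consistent with the roughly 90% figure above.

```python
def ec_utilization(n: int, m: int) -> float:
    """Nominal usable fraction of raw capacity for an N+M EC scheme."""
    return n / (n + m)

for n, m in [(4, 2), (12, 3), (22, 2)]:
    print(f"{n}+{m}: {ec_utilization(n, m):.1%}")   # 66.7%, 80.0%, 91.7%
```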
Cabinet-Level Reliability
For the traditional SAN, when Cabinet 2 is faulty, services of App 2 running on Cabinet 2 are
interrupted and need to be manually recovered, as shown in the following figure.
For the hyper-converged storage, when Cabinet 1 is faulty, services are not affected because the
storage pool is shared, as shown in the following figure.
1. Lack of standardization: There are various types and large quantities of devices with low
integration, resulting in difficult management.
2. Long deployment cycle: It usually takes 30 days to deploy a site.
3. High O&M cost: Personnel at remote sites must be highly skilled at O&M operations.
4. Low line utilization: The private line utilization ratio is less than 50% at most sites.
To address the preceding challenges, Huawei introduces FusionCube. In conjunction with
FusionCube Center Vision, FusionCube offers integrated cabinets, service rollout, O&M
management, and troubleshooting services in a centralized manner. It greatly shortens the
deployment cycle, reduces the O&M cost, and improves the private line utilization.
Cloud Infrastructure Scenario
The virtualization platforms can be Huawei FusionSphere or VMware vSphere to implement
unified management of physical resources.
Asynchronous Replication Scenario
The Huawei asynchronous replication architecture consists of two sets of FusionCube distributed
storage that build the asynchronous replication relationship and the UltraVR or BCManager DR
management software. The data on the primary and secondary volumes are periodically
synchronized based on the comparison of snapshots. All the data generated on the primary
volume after the last synchronization will be written to the secondary volume in the next
synchronization.
Storage DR clusters can be deployed on demand. The storage DR cluster is a logical object that
provides replication services. It manages cluster nodes, cluster metadata, replication pairs,
consistency groups, and performs data migration. The DR cluster and system service storage are
deployed on storage nodes. DR clusters offer excellent scalability. One system supports a
maximum of eight DR clusters. A single DR cluster contains three to 64 nodes. A single DR
cluster supports 64,000 volumes and 16,000 consistency groups, meeting future DR
requirements.
The UltraVR or BCManager manages DR services from the perspective of applications and
protects service VMs of the FusionCube system. It provides process-based DR service
configuration, including one-click DR test, DR policy configuration, and fault recovery
operations at the active site.
Summary of Features
RPO within seconds without differential logs is supported, helping customers recover services
more quickly and efficiently.
Replication network type: GE, 10GE, or 25GE (TCP/IP)
Replication link between sites: It is recommended that the replication link between sites be
within 3000 km, the minimum bidirectional connection bandwidth be at least 10 Mbit/s, and the
average write bandwidth of replication volumes be less than the remote replication bandwidth.
System RPO: The minimum RPO is 15 seconds, and the maximum RPO is 2880 minutes (15
seconds for 512 volumes per system; 150 seconds for 500 volumes per node).
snapshots, and offer high random read/write performance capabilities; that is, the OceanStor Dorado
V6 series storage systems eliminate the necessity for both the data backup process and the sequential
read operations done in legacy storage snapshots, thereby delivering lossless storage read/write
performance.
HyperSnap is the snapshot feature developed by Huawei. Huawei HyperSnap creates a point-in-time
consistent copy of original data (LUN) to which the user can roll back, if and when it is needed. It
contains a static image of the source data at the data copy time point. In addition to creating snapshots
for a source LUN, the OceanStor Dorado V6 series storage systems can also create a snapshot (child)
for an existing snapshot (parent); these child and parent snapshots are called cascading snapshots.
Once created, snapshots become accessible to hosts and serve as a data backup for the source data at
the data copy time.
HyperSnap provides the following advantages:
Supports online backup, without the need to stop services.
Provides writable ROW snapshots with no performance compromise.
If the source data is unchanged since the previous snapshot, the snapshot occupies no extra
storage space. If the source data has been changed, only a small amount of space is required to
store the changed data.
2.1.1.2 Working Principle
A snapshot is a copy of the source data at a point in time. Snapshots can be generated quickly and
only occupy a small amount of storage space.
2.1.1.2.1 Basic Concepts
ROW: This is a core technology used to create snapshots. When a storage system receives a write
request to modify existing data, the storage system writes the new data to a new location and directs
the pointer of the modified data block to the new location.
Data organization: The LUNs created in the storage pool of the OceanStor Dorado V6 series storage
systems consist of metadata volumes and data volumes.
Metadata volume: records the data organization information (LBA, version, and clone ID) and data
attributes. A metadata volume is organized in a tree structure.
Logical block address (LBA) indicates the address of a logical block. The version corresponds to the
snapshot time point and the clone ID indicates the number of data copies.
Data volume: stores user data written to a LUN.
Source volume: A volume that stores the source data requiring a snapshot. It is represented to users as
a source LUN or an existing snapshot.
Snapshot volume: A logical data duplicate generated after a snapshot is created for a source LUN. A
snapshot volume is represented to users as a snapshot LUN. A single LUN in the storage pool uses the
data organization form (LBA, version, or clone ID) to construct multiple copies of data with the same
LBA. The source volume and shared metadata of the snapshot volume are saved in the same shared
tree.
Snapshot copy: It copies a snapshot to obtain multiple snapshot copies at the point in time when the
snapshot was activated. If data is written into a snapshot and the snapshot data is changed, the data in
the snapshot copy is still the same as the snapshot data at the point in time when the snapshot was
activated.
Snapshot cascading: Snapshot cascading is to create snapshots for existing snapshots. Different from a
snapshot copy, a cascaded snapshot is a consistent data copy of an existing snapshot at a specific point
in time, including the data written to the source snapshot. In comparison, a snapshot copy preserves
the data at the point in time when the source snapshot was activated, excluding the data written to the
source snapshot. The system supports a maximum of eight levels of cascaded snapshots.
Snapshot consistency group: Protection groups ensure data consistency between multiple associated
LUNs. OceanStor Dorado V6 supports snapshots for protection groups. That is, snapshots are
simultaneously created for each member LUN in a protection group.
Snapshot consistency groups are mainly used by databases. Typically, databases store different types
of data on different LUNs (such as the online redo log volume and data volume), and these LUNs are
associated with each other. To back up the databases using snapshots, the snapshots must be created
for these LUNs at the same time point, so that the data is complete and available for database
recovery.
2.1.1.2.2 Implementation
ROW is a core technology used for snapshot implementation. The working principle is as follows:
Creating a snapshot: After a snapshot is created and activated, a data copy that is identical to the
source LUN is generated. Then the storage system copies the source LUN's pointer to the snapshot so
that the snapshot points to the storage location of the source LUN's data. This enables the source LUN
and snapshot to share the same LBA.
Writing data to the source LUN: When an application server writes data to the source LUN after the
snapshot is created, the storage system uses ROW to save the new data to a new location in the
storage pool and directs the source LUN's pointer to the new location. The pointer of the snapshot still
points to the storage location of the original source data, so the source data at the snapshot creation
time is saved.
Reading snapshot data: After a snapshot is created, client applications can access the snapshot to read
the source LUN's data at the snapshot creation time. The storage system uses the pointer of the
snapshot to locate the requested data and returns it to the client.
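The pointer behavior described above can be condensed into a small sketch; the block pool and pointer tables below are illustrative data structures, not the storage system's metadata format.

```python
# Minimal redirect-on-write (ROW) sketch: the snapshot copies the source LUN's
# pointer table; new writes go to new locations and only the source LUN's
# pointers are redirected, so the snapshot keeps its point-in-time view.
class Pool:
    """Toy storage pool: allocates a new block location for every write (ROW)."""
    def __init__(self):
        self.blocks, self.next_id = {}, 0
    def alloc(self, data):
        self.blocks[self.next_id] = data
        self.next_id += 1
        return self.next_id - 1

pool = Pool()
source = {lba: pool.alloc(f"v0-{lba}") for lba in range(4)}   # source LUN pointer table
snapshot = dict(source)                # snapshot copies the pointers, sharing the data

source[2] = pool.alloc("v1-2")         # ROW: new data goes to a new location, and only
                                       # the source LUN's pointer is redirected
print(pool.blocks[source[2]])          # source LUN now reads "v1-2"
print(pool.blocks[snapshot[2]])        # snapshot still reads "v0-2" (point-in-time view)
```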
Figure 2-1 shows the metadata distribution in the source LUN before a snapshot is created.
2.1.2.2.2 Implementation
An application server can access a file system snapshot to read the data of the source file system at the
point in time when the snapshot was created.
ROW is the core technology used to create file system snapshots. When a source file system receives
a write request to modify existing data, the storage system writes the new data to a new location and
directs the BP of the modified data block to the new location. The BP of the file system snapshot still
points to the original data of the source file system. That is, a file system snapshot always preserves
the original state of the source file system.
Figure 2-1 shows the process of reading a file system snapshot (one snapshot is created in this
example).
2.1.3 HyperReplication
2.1.3.1 Overview
As digitalization advances in various industries, data has become critical to the operation of
enterprises, and customers impose increasingly demanding requirements on stability of storage
systems. Although some storage devices offer extremely high stability, they fail to prevent
irrecoverable damage to production systems upon natural disasters. To ensure continuity,
recoverability, and high reliability of service data, remote DR solutions emerge. The remote
replication technology is one of the key technologies used in remote DR solutions.
HyperReplication is a core technology for remote DR and backup of data.
It supports the following replication modes:
Synchronous remote replication
In this mode, data is synchronized between two storage systems in real time to achieve full
protection for data consistency, minimizing data loss in the event of a disaster.
Asynchronous remote replication
In this mode, data is synchronized between two storage systems periodically to minimize service
performance deterioration caused by the latency of long-distance data transmission.
2.1.3.2 Working Principles
2.1.3.2.1 Basic Concepts
This section describes basic concepts related to HyperReplication, including pair, consistency group,
synchronization, splitting, primary/secondary switchover, data status, and writable secondary LUN.
To enable service data backup and recovery on the secondary storage system, a remote replication
task is implemented in four phases, as shown in Figure 2-1.
The running status of a pair indicates whether the pair requires further actions and, if so, what operation is required. After performing an operation, you can view the running status of the pair to check whether the operation has succeeded. Table 2-1 describes the running status of a pair involved in a remote replication task.
Data status
By determining data differences between the primary and secondary LUNs in a remote
replication pair, HyperReplication identifies the data status of the pair. If a disaster occurs,
HyperReplication determines whether a primary/secondary switchover is allowed based on the
data status of the pair. The data status values are Consistent and Inconsistent.
Writable secondary LUN
A writable secondary LUN refers to a secondary LUN to which host data can be written. After
HyperReplication is configured, the secondary LUN is read-only by default. If the primary LUN
is faulty, the administrator can cancel write protection for the secondary LUN and set the
secondary LUN to writable. In this way, the secondary storage system can take over host
services, ensuring service continuity. The secondary LUN can be set to writable in the following
scenarios:
− The primary LUN fails and the remote replication links are in disconnected state.
− The primary LUN fails but the remote replication links are in normal state. The pair must be
split before you enable the secondary LUN to be writable.
Consistency group
A consistency group is a collection of pairs that have a service relationship with each other. For
example, the primary storage system has three primary LUNs that respectively store service data,
logs, and change tracking information of a database. If data on any of the three LUNs becomes
invalid, all data on the three LUNs becomes unusable. You can create a consistency group for the pairs to which these LUNs belong. During actual configuration, create the consistency group first and then manually add the pairs to it.
Synchronization
Synchronization is a process of copying data from the primary LUN to the secondary LUN.
Synchronization can be performed for a single remote replication pair or for multiple remote
replication pairs in a consistency group at the same time.
Synchronization of a remote replication pair involves initial synchronization and incremental
synchronization.
After an asynchronous remote replication pair is created, initial synchronization is performed to
copy all data from the primary LUN to the secondary LUN. After the initial synchronization is
complete, if the remote replication pair is in normal state, incremental data will be synchronized
from the primary LUN to the secondary LUN based on the specified synchronization mode
(manual or automatic). If the remote replication pair is interrupted due to a fault, incremental data
will be synchronized from the primary LUN to the secondary LUN based on the specified
recovery policy (manual or automatic) after the fault is rectified.
After a synchronous remote replication pair is created, initial synchronization is performed to
copy all data from the primary LUN to the secondary LUN. After the initial synchronization is
complete, if the remote replication pair is in normal state, host I/Os will be written into both the
primary and secondary LUNs, not requiring data synchronization. If the remote replication pair is
interrupted due to a fault, incremental data will be synchronized from the primary LUN to the
secondary LUN based on the specified recovery policy (manual or automatic) after the fault is
rectified.
Splitting
Splitting is a process of stopping data synchronization between primary and secondary LUNs.
This operation can be performed only by an administrator. Splitting can be performed for a single
remote replication pair or multiple remote replication pairs in a consistency group at one time.
After the splitting, the pair relationship between the primary LUN and the secondary LUN still
exists and the access permission of hosts for the primary and secondary LUNs remains
unchanged.
In some cases, for example, when bandwidth is insufficient to support critical services, you may not want to synchronize data from the primary LUN to the secondary LUN in a remote replication pair. In such cases, you can split the remote replication pair to suspend data synchronization.
You can effectively control the data synchronization process of HyperReplication by performing
synchronization and splitting.
Primary/secondary switchover
A primary/secondary switchover is a process of exchanging the roles of the primary and
secondary LUNs in a pair relationship. You can perform a primary/secondary switchover for a
single remote replication pair or for multiple remote replication pairs in a consistency group at
the same time. A primary/secondary switchover is typically performed in the following scenarios:
After the primary site recovers from a disaster, the remote replication links are re-established and
data is synchronized between the primary and secondary sites.
When the primary storage system requires maintenance or an upgrade, services at the primary
site must be stopped, and the secondary site takes over the services.
Link compression
Link compression is an inline compression technology. In an asynchronous remote replication
task, data is compressed on the primary storage system before transfer. Then the data is
decompressed on the secondary storage system, reducing bandwidth consumption in data
transfer. Link compression has the following highlights:
− Inline data compression
Data is compressed when being transferred through links.
− Intelligent compression
The system preemptively determines whether data can be compressed, preventing
unnecessary compression and improving transfer efficiency.
− High reliability and security
The lossless compression technology is used to ensure data security. Multiple check
methods are used to ensure data reliability. After receiving data, the secondary storage
system verifies data correctness and checks data consistency after the data is decompressed.
− User unawareness
Link compression does not affect services running on the hosts and is transparent to users.
− Compatibility with full backup and incremental backup
Link compression compresses all data that is transferred over the network regardless of
upper-layer services.
Protected object
For customers, the protected objects are LUNs or protection groups. That is, HyperReplication is
configured for LUNs or protection groups for data backup and disaster recovery.
LUN: Data protection can be implemented for each individual LUN.
Protection group: Data protection can be implemented for a protection group, which consists of
multiple independent LUNs or a LUN group.
How to distinguish a protection group from a LUN group:
A LUN group applies to mapping scenarios in which the LUN group can be directly mapped to a
host. You can group LUNs for different hosts or applications.
A protection group applies to data protection with consistency groups. You can plan data
protection policies for different applications and components in the applications. In addition, you
can enable the LUNs used by multiple applications in the same protection scenario to be
protected in a unified manner. For example, you can group the LUNs to form a LUN group, map
the LUN group to a host or host group, and create a protection group for the LUN group to
implement unified data protection of the LUNs used by multiple applications in the same
protection scenario.
2.1.3.2.2 Implementation
Data replication
Data replication is a process of writing service data generated by hosts to the secondary LUNs in
the secondary storage system. The writing process varies depending on the remote replication
mode. This section describes data replication performed in synchronous and asynchronous
remote replication modes.
Writing process in synchronous remote replication
Synchronous remote replication replicates data in real time from the primary storage system to
the secondary storage system. The characteristics of synchronous remote replication are as
follows:
After receiving a write I/O request from a host, the primary storage system sends the request to
the primary and secondary LUNs.
The data write result is returned to the host only after the data is written to both primary and
secondary LUNs. If data fails to be written to the primary LUN or secondary LUN, the primary
LUN or secondary LUN returns a write I/O failure to the remote replication management module.
Then, the remote replication management module changes the mode from dual-write to single-
write, and the remote replication pair is interrupted. In this case, the data write result is
determined by whether the data is successfully written to the primary LUN and is irrelevant to
the secondary LUN.
After a synchronous remote replication pair is created between a primary LUN and a secondary
LUN, you need to manually perform synchronization so that data on the two LUNs is consistent.
Every time a host writes data to the primary storage system after synchronization, the data is
copied from the primary LUN to the secondary LUN of the secondary storage system in real
time.
The specific process is as follows:
1. Initial synchronization
After a remote replication pair is created between a primary LUN on the primary storage
system at the production site and a secondary LUN on the secondary storage system at the
DR site, initial synchronization is started.
All data on the primary LUN is copied to the secondary LUN.
During initial synchronization, if the primary LUN receives a write request from a host and
data is written to the primary LUN, the data is also written to the secondary LUN.
2. Dual-write
After initial synchronization is complete, the data on the primary LUN is the same as that on
the secondary LUN. Then an I/O request is processed as follows:
Figure 2-1 shows how synchronous remote replication processes a write I/O request.
c. HyperReplication waits for the primary and secondary LUNs to return the write result. If the write to the secondary LUN succeeds, the log is cleared. If the write to the secondary LUN times out or fails, the log is converted into a DCL record and the remote replication pair between the primary and secondary LUNs is interrupted. The data block corresponding to the logged address will be synchronized during subsequent data synchronization.
d. HyperReplication returns the data write result to the host. The data write result of the
primary LUN prevails.
LOG: data write log
DCL: data change log
Note:
The DCL is stored on all disks and all DCL data has three copies for protection. Storage
system logs are stored on coffer disks.
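The dual-write and DCL behavior described above can be summarized in a short, hedged sketch. The SyncReplicationPair and StubLun classes below are hypothetical models, not the remote replication management module: a write goes to both LUNs, a secondary failure records the address in the DCL and interrupts the pair, and the result returned to the host follows the primary LUN.

```python
# Hedged sketch of the synchronous dual-write flow; illustrative classes only.

class StubLun:
    """Illustrative LUN that can be forced to fail writes."""
    def __init__(self, fail=False):
        self.data, self.fail = {}, fail
    def write(self, lba, data):
        if self.fail:
            return False
        self.data[lba] = data
        return True

class SyncReplicationPair:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.interrupted = False          # pair state
        self.dcl = set()                  # data change log: differential addresses

    def write(self, lba, data):
        if self.interrupted:
            # Single-write mode: only the primary is written; record the difference.
            ok = self.primary.write(lba, data)
            if ok:
                self.dcl.add(lba)
            return ok

        # Dual-write: send the request to both LUNs, then wait for both results.
        primary_ok = self.primary.write(lba, data)
        secondary_ok = self.secondary.write(lba, data)

        if not secondary_ok:
            # Secondary write timed out/failed: keep the address in the DCL and
            # interrupt the pair; the block will be synchronized later.
            self.dcl.add(lba)
            self.interrupted = True

        # The result returned to the host follows the primary LUN's result.
        return primary_ok

pair = SyncReplicationPair(StubLun(), StubLun(fail=True))
pair.write(7, b"x")           # secondary fails -> pair interrupted, LBA 7 in DCL
assert pair.interrupted and 7 in pair.dcl
```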
Writing process in asynchronous remote replication
Asynchronous remote replication periodically replicates data from the primary storage system to
the secondary storage system. The characteristics of asynchronous remote replication are as
follows:
Asynchronous remote replication relies on the snapshot technology. A snapshot is a point-in-time
copy of source data.
When a host successfully writes data to a primary LUN, the primary storage system returns a
response to the host declaring the successful write.
Data synchronization is triggered manually or automatically at preset intervals to ensure data
consistency between the primary and secondary LUNs.
HyperReplication in asynchronous mode adopts the multi-time-segment caching technology. The
working principle of the technology is as follows:
1. After an asynchronous remote replication relationship is set up between primary and
secondary LUNs, the initial synchronization begins by default. The initial synchronization
copies all data from the primary LUN to the secondary LUN to ensure data consistency.
2. After the initial synchronization is complete, the secondary LUN data status becomes
consistent (data on the secondary LUN is a copy of data on the primary LUN at a certain
past point in time). Then the I/O process shown in the following figure starts. Figure 2-2
shows the writing process in asynchronous remote replication mode.
Figure 2-4 Process of recovering data at the primary site after a disaster
Functions of a consistency group
In medium- and large-sized database applications, data, logs, and change records are stored on
associated LUNs of storage systems. The data correlation between those LUNs is ensured by
upper-layer host services at the primary site. When data is replicated to the secondary site, the
data correlation must be maintained. Otherwise, the data at the secondary site cannot be used to
recover services. To maintain the data correlation, you can add the remote replication pairs of
those LUNs to the same consistency group. This section compares storage systems running a
consistency group with storage systems not running a consistency group to show you how a
consistency group ensures service continuity.
Users can perform synchronization, splitting, and primary/secondary switchovers for a single
remote replication pair or perform these operations for multiple remote replication pairs using a
consistency group. Note the following when using a consistency group:
− Remote replication pairs can be added to a consistency group only on the primary storage
system. In addition, secondary LUNs in all remote replication pairs must reside in the same
remote storage system.
− LUNs in different remote replication pairs in a consistency group can belong to different
working controllers.
− Remote replication pairs in one consistency group must work in the same remote replication
mode.
2.1.3.3 Application Scenarios
HyperReplication is used for data backup and DR by working with BCManager eReplication. The
typical application scenarios include central backup and DR as well as 3DC.
Different remote replication modes apply to different application scenarios.
Synchronous remote replication
Applies to backup and DR scenarios where the primary site is close to the secondary site, for example, in the same city (same data center or campus).
Asynchronous remote replication
Applies to backup and DR scenarios where the primary site is far from the secondary site (for example, across countries or regions) or the network bandwidth is limited.
2.1.4 HyperMetro
2.1.4.1 Overview
HyperMetro is Huawei's active-active storage solution that enables two storage systems to process
services simultaneously, establishing a mutual backup relationship between them. If one storage
system malfunctions, the other one will automatically take over services without data loss or
interruption. With HyperMetro deployed, services switch over between the storage systems automatically, delivering rock-solid reliability, enhanced service continuity, and higher storage resource utilization.
Huawei's active-active solution supports both single-data center (DC) and cross-DC deployments.
Single-DC deployment
In this mode, the active-active storage systems are deployed in two equipment rooms in the same
campus.
Hosts are deployed in a cluster and communicate with storage systems through a switched fabric
(Fibre Channel or IP). Dual-write mirroring channels are deployed on the storage systems to
ensure continuous operation of active-active services.
Figure 2-1 shows an example of the single-DC deployment mode.
Cross-DC deployment
In this mode, the active-active storage systems are deployed in two DCs in the same city or in two nearby cities. The distance between the two DCs is within 300 km. Both of the DCs
can handle service requests concurrently, thereby accelerating service response and improving
resource utilization. If one DC fails, its services are automatically switched to the other DC.
In cross-DC deployment scenarios involving long-distance transmission (≥ 25 km for Fibre
Channel; ≥ 80 km for IP), dense wavelength division multiplexing (DWDM) devices must be
used to ensure a short transmission latency. In addition, mirroring channels must be deployed
between the active-active storage systems for data synchronization.
Figure 2-2 shows an example of the cross-DC deployment mode.
In medium- and large-sized databases, user data and logs are stored on different LUNs. If data on any LUN is lost or loses time consistency with the data on the other LUNs, the data on all of the LUNs becomes invalid. Creating a HyperMetro consistency group for these LUNs preserves the integrity of their data and guarantees write-order fidelity.
HyperMetro I/O processing mechanism
− Write I/O Process
Dual-write and locking mechanisms are essential for data consistency between storage
systems.
Dual-write and DCL technologies synchronize data changes while services are running.
Dual-write enables hosts' I/O requests to be delivered to both local and remote caches,
ensuring data consistency between the caches. If the storage system in one DC
malfunctions, the DCL records data changes. After the storage system recovers, the data
changes are synchronized to the storage system, ensuring data consistency across DCs.
Two HyperMetro storage systems can process hosts' I/O requests concurrently. To prevent
conflicts when different hosts access the same data on a storage system simultaneously, a
locking mechanism is used to allow only one storage system to write data. The storage
system denied by the locking mechanism must wait until the lock is released and then obtain
the write permission.
1. A host delivers a write I/O to the HyperMetro I/O processing module.
2. The write I/O applies for write permission from the optimistic lock on the local storage
system. After the write permission is obtained, the system records the address
information in the log but does not record the data content.
3. The HyperMetro I/O processing module writes the data to the caches of both the local
and remote LUNs concurrently. When data is written to the remote storage system, the
write I/O applies for write permission from the optimistic lock before the data can be
written to the cache.
4. The local and remote caches return the write result to the HyperMetro I/O processing
module.
5. The system determines whether dual-write is successful.
If writing to both caches is successful, the log is deleted.
If writing to either cache fails, the system:
a. Converts the log into a DCL that records the differential data between the local and
remote LUNs. After conversion, the original log is deleted.
b. Suspends the HyperMetro pair. The status of the HyperMetro pair becomes To be
synchronized. I/Os are only written to the storage system on which writing to its
cache succeeded. The storage system on which writing to its cache failed stops
providing services for the host.
6. The HyperMetro I/O processing module returns the write result to the host.
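As a rough illustration of steps 1 to 6, the sketch below models dual-write with an address-only log, conversion to a DCL entry, and pair suspension on a cache-write failure. MetroPair, StubCache, and the threading.Lock stand-in for the optimistic lock are all assumptions for illustration, not HyperMetro's internal implementation.

```python
# Sketch of the HyperMetro dual-write steps above (illustrative names only).
import threading

class StubCache:
    def __init__(self, fail=False):
        self.data, self.fail = {}, fail
    def write(self, lba, data):
        if self.fail:
            return False
        self.data[lba] = data
        return True

class MetroPair:
    def __init__(self, local_cache, remote_cache):
        self.local_cache = local_cache
        self.remote_cache = remote_cache
        self.lock = threading.Lock()      # stand-in for the optimistic lock
        self.log = set()                  # address-only log (no data content)
        self.dcl = set()                  # differential data between the two LUNs
        self.suspended = False            # "To be synchronized" state

    def write(self, lba, data):
        with self.lock:                   # step 2: obtain write permission
            self.log.add(lba)             # record the address, not the data
            local_ok = self.local_cache.write(lba, data)    # step 3: dual-write
            remote_ok = self.remote_cache.write(lba, data)

            if local_ok and remote_ok:    # step 5: both caches succeeded
                self.log.discard(lba)     # delete the log
            else:
                self.dcl.add(lba)         # convert the log into a DCL entry
                self.log.discard(lba)
                self.suspended = True     # pair enters "To be synchronized"
            return local_ok or remote_ok  # step 6: result returned to the host

pair = MetroPair(StubCache(), StubCache(fail=True))
pair.write(3, b"x")                       # remote cache fails
assert pair.suspended and 3 in pair.dcl   # pair is now "To be synchronized"
```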
Read I/O Process
The data of LUNs on both storage systems is synchronized in real time. Both storage systems are
accessible to hosts. If one storage system malfunctions, the other one continues providing
services for hosts.
1. A host delivers a read I/O to the HyperMetro I/O processing module.
2. The HyperMetro I/O processing module enables the local storage system to respond to the
read request of the host.
3. If the local storage system is operating properly, it returns data to the HyperMetro I/O
processing module.
4. If the local storage system is not operating properly, the HyperMetro I/O processing module
enables the host to read data from the remote storage system. Then the remote storage
system returns data to the HyperMetro I/O processing module.
5. The HyperMetro I/O processing module returns the requested data to the host.
Arbitration Mechanism
If links between two HyperMetro storage systems are disconnected or either storage system
breaks down, real-time data synchronization will be unavailable to the storage systems and only
one storage system of the HyperMetro relationship can continue providing services. To ensure
data consistency, HyperMetro uses the arbitration mechanism to determine which storage system
continues providing services.
HyperMetro provides two arbitration modes:
− Static priority mode: Applies when no quorum server is deployed.
If no quorum server is configured or the quorum server is inaccessible, HyperMetro works
in static priority mode. When an arbitration occurs, the preferred site wins the arbitration
and provides services.
If links between the two storage systems are down or the non-preferred site of a HyperMetro
pair breaks down, LUNs of the storage system at the preferred site continue providing
HyperMetro services and LUNs of the storage system at the non-preferred site stop.
If the preferred site of a HyperMetro pair breaks down, the non-preferred site does not take
over HyperMetro services automatically. As a result, the services stop. You must forcibly
start the services at the non-preferred site.
− Quorum server mode (recommended): Applies when quorum servers are deployed.
In this mode, an independent physical server or VM is used as the quorum server. You are
advised to deploy the quorum server at a dedicated quorum site that is in a different fault
domain from the two DCs.
In the event of a DC failure or disconnection between the storage systems, each storage
system sends an arbitration request to the quorum server, and only the winner continues
providing services. The preferred site takes precedence in arbitration.
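The decision logic of the two arbitration modes can be approximated with the small function below. It is only a sketch of the rules stated above (static priority versus quorum server, with the preferred site taking precedence); the function name and parameters are invented for illustration.

```python
# Sketch of the two arbitration modes (static priority vs. quorum server).
# All names are illustrative; this is not the product's arbitration protocol.

def arbitrate(preferred_alive, non_preferred_alive, quorum_reachable,
              quorum_grants_preferred=True):
    """Return which site keeps serving after a link failure or site failure."""
    if not quorum_reachable:
        # Static priority mode: the preferred site wins if it is still alive;
        # otherwise services stop until an administrator forcibly starts them.
        if preferred_alive:
            return "preferred site serves"
        return "services stop (manual forced start required at non-preferred site)"

    # Quorum server mode: the surviving sites race to the quorum server,
    # and only the arbitration winner keeps serving. The preferred site
    # takes precedence when both requests arrive.
    if preferred_alive and (quorum_grants_preferred or not non_preferred_alive):
        return "preferred site serves"
    if non_preferred_alive:
        return "non-preferred site serves"
    return "services stop"

# Example: preferred site down, quorum server reachable -> automatic failover.
print(arbitrate(preferred_alive=False, non_preferred_alive=True,
                quorum_reachable=True))
```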
2.1.5 HyperCDP
2.1.5.1 Overview
HyperCDP is a continuous data protection feature developed by Huawei. A HyperCDP object is similar to a common writable snapshot: it is a point-in-time consistent copy of the original data that the user can roll back to when needed, and it contains a static image of the source data at the data copy time point.
HyperCDP has the following advantages:
HyperCDP provides data protection at an interval of seconds, with zero impact on performance
and small space occupation.
Support for scheduled tasks
You can specify HyperCDP schedules by day, week, month, or a specific interval, meeting
different backup requirements.
Intensive and persistent data protection
HyperCDP provides higher specifications than common writable snapshots. It achieves
continuous data protection by generating denser recovery points with a shorter protection interval
and longer protection period.
Figure 2-3 Metadata distribution in the source LUN and HyperCDP object
Hosts cannot read the data in a HyperCDP object directly. To allow a host to access the HyperCDP
data, you must create a duplicate for the HyperCDP object and map the duplicate to the host. If you
want to access the HyperCDP data for the same LUN at another time point, you can recreate the
duplicate using the HyperCDP object generated at that time point to obtain its data immediately.
HyperCDP supports quick recovery of the source LUN's data. If data on a source LUN suffers
incorrect deletion, corruption, or virus attacks, you can roll back the source LUN to the point in time
when the HyperCDP object was created, minimizing data loss.
2.1.5.3 Application Scenarios
HyperCDP can be used for various scenarios, for example, rapid data backup and restoration,
continuous data protection, and repurposing of backup data.
Rapid data backup and restoration
HyperCDP objects can be generated periodically for service data to implement quick data
backup.
You can use the latest HyperCDP object to roll back data within several seconds. This protects
data against the following situations:
− Virus infection
− Incorrect deletion
− Malicious tampering
− Data corruption caused by system breakdown
− Data corruption caused by application bugs
− Data corruption caused by storage system bugs
In terms of data backup and restoration, HyperCDP has the following advantages:
− The RTO is significantly reduced. Even a large amount of data can be restored in a few
seconds.
− Data can be frequently backed up without service interruption. Applications can run
correctly without performance compromise.
− The backup window is notably shortened or eliminated.
Repurposing of Backup Data
LUNs serve different purposes in different service scenarios, such as report generation, data
testing, and data analysis. If multiple application servers write data to a LUN simultaneously,
changes to the data may adversely affect services on these application servers. Consequently, the
data testing and analysis results may be inaccurate.
OceanStor Dorado V6 supports multiple duplicates of a HyperCDP object, which can be used by
different application servers for report generation, data testing, and data analysis.
Figure 2-1 shows how HyperCDP duplicates are used for various purposes.
be read but cannot be written within a specific period. Therefore, measures must be taken to prevent
such data from being tampered with. In the storage industry, WORM is the most common method
used to archive and back up data, ensure secure data access, and prevent data tampering.
A file protected by WORM enters the read-only state immediately after data is written to it. In read-
only state, the file can be read, but cannot be deleted, modified, or renamed. The WORM feature can
prevent data from being tampered with, meeting data security requirements of enterprises and
organizations.
File systems with the WORM feature configured are called WORM file systems. WORM can only be
configured by administrators. There are two WORM modes: Regulatory Compliance WORM
(WORM-C for short) and Enterprise WORM (WORM-E).
The WORM feature implements read-only protection for important data in archived documents to
prevent data tampering, meeting regulatory compliance requirements.
WORM is used to protect important data in archived documents that cannot be tampered with or
damaged, for example, case documents of courts, medical records, and financial documents.
For example, a large number of litigation files are generated in courts. According to laws and
regulations, the protection periods of litigation files can be set to permanent, long-term, and short-
term based on the characteristics of the files.
2.1.6.3 HyperVault
Based on file systems, HyperVault enables data backup and recovery within a storage system and
between different storage systems.
Data backup involves local backup and remote backup. With file systems' snapshot or remote
replication technology, HyperVault backs up the data at a specific point in time to the source storage
system or backup storage system based on a specified backup policy.
Data recovery involves local recovery and remote recovery. Using file systems' snapshot rollback or remote replication technology, HyperVault rolls a file system back to a specified local backup snapshot, or recovers it from a specified remote snapshot on the backup storage system.
HyperVault has the following characteristics:
Time-saving local backup and recovery: A storage system can generate a local snapshot within
several seconds to obtain a consistent copy of the source file system, and roll back the snapshot
to quickly recover data to that at the desired point in time.
Incremental backup for changed data: In remote backup mode, an initial full backup followed by permanent incremental backups saves bandwidth.
Flexible and reliable data backup policy: HyperVault supports user-defined backup policies and thresholds for the number of copies. An invalid backup copy does not affect subsequent backup tasks.
HyperVault applies to data backup, data recovery, and other scenarios.
A storage system usually carries multiple service applications at the same time. As the service applications running on each storage system grow sharply and have different I/O characteristics, resource preemption among service applications undermines the performance of mission-critical service applications.
To meet demanding QoS requirements and guarantee the service performance of storage systems,
storage vendors have introduced a number of techniques, such as I/O priority, application traffic
control, and cache partitioning. In particular, cache partitioning provides an efficient way to meet QoS
requirements as cache resources are indispensable to data transmissions between storage systems and
applications.
SmartPartition is an intelligent cache partitioning feature developed by Huawei. It is a performance-critical feature that allows you to assign cache partitions of different capacities to users of different levels. Service applications assigned a cache partition obtain improved performance from the specified cache capacity.
SmartPartition applies to LUNs (block services) and file systems (file services).
2.2.1.2 Working Principle
2.2.1.2.1 Concepts
SmartPartition ensures the service quality of mission-critical services by isolating cache resources among services. In a storage system, the cache capacity indicates the amount of cache resources that a service can use. Cache capacity is a major factor in the performance of a storage system and affects services with different I/O characteristics to different extents.
For data writes, a larger cache capacity means a higher write combining rate, a higher write hit
ratio, and better sequential disk access.
For data reads, a larger cache capacity means a higher read hit ratio.
For a sequential service, its cache capacity should only be enough for I/O request merging.
For a random service, a larger cache capacity enables better sequential disk access, which
improves service performance.
Cache resources are divided into read cache and write cache:
Read cache effectively improves the read hit ratio of a host by means of read prefetching.
Write cache improves the disk access performance of a host by means of combining, hitting, and
sequencing.
On a storage system, you can set a dedicated read cache and write cache for each SmartPartition
partition to meet the requirements of different types of services. The cache partitions on a storage
system include SmartPartition partitions and a default partition.
SmartPartition partitions are created by users and provide cache services for service applications in
the partitions.
The default partition is a cache partition automatically reserved by the system and provides cache
services for system operation and other applications for which no SmartPartition partition is assigned.
2.2.1.2.2 Implementation
The following figure shows the implementation of SmartPartition.
SmartPartition allows you to configure independent cache partitions for the production and test
systems separately. In addition, appropriate read and write cache capacities can be configured for
the two systems according to their respective read and write I/O frequencies. This approach
improves the read and write I/O performance of the production system while maintaining the
normal operation of the test system.
Example:
SmartPartition policy A: SmartPartition partition 1 is created for the production system. The read
cache is 2 GB and the write cache is 1 GB. The read and write caches are enough for processing
frequent read and write I/Os in the production system.
SmartPartition policy B: SmartPartition partition 2 is created for the test system. The read cache
is 1 GB and the write cache is 512 MB. The cache resources are limited for the test system but
are enough to maintain its normal operation while not affecting the performance of the
production system.
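A minimal sketch of the two policies, assuming a hypothetical CachePartition object rather than the actual DeviceManager or CLI configuration objects:

```python
# Hedged sketch of the two SmartPartition policies above; the class and fields
# are illustrative, not real management-interface objects.
from dataclasses import dataclass

@dataclass
class CachePartition:
    name: str
    read_cache_gb: float
    write_cache_gb: float

# Policy A: the production system gets larger read/write caches.
production = CachePartition("SmartPartition-1 (production)", read_cache_gb=2.0,
                            write_cache_gb=1.0)
# Policy B: the test system gets just enough cache to keep running.
test = CachePartition("SmartPartition-2 (test)", read_cache_gb=1.0,
                      write_cache_gb=0.5)

for p in (production, test):
    print(f"{p.name}: read={p.read_cache_gb} GB, write={p.write_cache_gb} GB")
```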
Meeting the QoS Requirements of High-Level Users in VDI Scenarios
In virtual desktop infrastructure (VDI) scenarios, different users use different services and have
different QoS requirements. How to meet the QoS requirement of each user while making full
use of resources is a pressing problem that data centers must address.
SmartPartition allows you to create cache partitions of different capacities for different users.
When resources are limited, SmartPartition preferentially meets the QoS requirements of high-
level users.
For example, multiple users share the storage resources provided by a data center. The following
table lists the QoS requirements of user A and user B.
SmartPartition allows you to create cache partitions for users A and B, respectively, and define
different cache read/write policies.
SmartPartition policy A: SmartPartition partition 1 is created for user A. The read cache is 2 GB
and the write cache is 1 GB. The read and write caches are enough to guarantee the normal
operation and excellent data read and write performance of the applications used by user A.
SmartPartition policy B: SmartPartition partition 2 is created for user B. The read cache is 1 GB
and the write cache is 512 MB. The cache resources are limited for user B but are enough to
maintain the normal operation of the applications used by user B while meeting user A's
demanding requirements on applications.
2.2.2 SmartQuota
2.2.2.1 Overview
Driven by advances in virtualization and cloud computing, IT systems urgently need to improve resource utilization and management. In a typical IT storage system, all available storage resources (disk space) will eventually be used up. Therefore, storage resource usage and growth must be controlled to save costs.
In a network attached storage (NAS) file service environment, resources are provisioned as directories
to departments, organizations, and individuals. Each department or individual has unique resource
requirements or limitations, and therefore, storage systems must allocate and limit resources based on
actual conditions. SmartQuota perfectly meets this requirement by limiting the directory resources
that users can use.
SmartQuota is a file system quota technology. It allows system administrators to control storage
resource usage by limiting the disk space that each user can use and accordingly, preventing users
from excessively using resources.
2.2.2.2 Working Principle
In each I/O operation, SmartQuota checks the sum of used space and file quantity plus additional
space and file quantity required for this operation. If the sum exceeds the hard quota, the operation
will fail. If the sum does not exceed the hard quota, this operation will succeed. If the I/O operation
succeeds, SmartQuota updates the used space and file quantity under the quotas and writes the quota
update together with the data generated in the I/O operation to the file system. Either both the I/O
operation and quota update succeed or both fail. This approach guarantees that the used space checked
in each I/O operation is correct.
SmartQuota checks the hard quota as well as the soft quota. If the sum of used and incremental space
and file quantity does not exceed the hard quota, SmartQuota checks whether used space or file
quantity exceeds the soft quota. If yes, an alarm will be reported. After used space or file quantity
drops below the soft quota, the alarm will be cleared. The alarm is sent to the alarm center after an I/O
operation success is returned to the file system.
Alarm Generation and Clearance Policies
When the amount of used resources (space or file quantity) exceeds the space or file quantity soft
quota, SmartQuota generates an alarm to notify administrators for handling. A soft quota is
designed to allow administrators to handle the resource over-usage problem by deleting
unnecessary files or applying for additional quotas before a file operation fails due to insufficient
quota.
SmartQuota clears the resource over-usage alarm only when the amount of resources used by a user falls below 90% of the soft quota. This prevents frequent generation and clearance of alarms, because the alarm is cleared only after the amount of used resources has dropped well below the soft quota.
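The per-I/O quota check and the alarm hysteresis described above can be sketched as follows. The QuotaTree class and its fields are illustrative assumptions, not the file system's real quota metadata; the point is that an operation fails only when it would exceed the hard quota, the quota update is committed with the I/O, and the soft-quota alarm is cleared only below 90% of the soft quota.

```python
# Sketch of the per-I/O quota check and alarm hysteresis (illustrative structure;
# not the file system's on-disk quota format).

class QuotaTree:
    def __init__(self, space_hard, space_soft):
        self.space_hard = space_hard      # hard quota: exceeding it fails the I/O
        self.space_soft = space_soft      # soft quota: exceeding it raises an alarm
        self.used = 0
        self.alarm = False

    def apply(self, delta):
        """delta > 0 for writes/new files, delta < 0 for deletions."""
        if delta > 0 and self.used + delta > self.space_hard:
            return False                  # hard quota exceeded: the operation fails
        self.used += delta                # committed together with the I/O data
        if self.used > self.space_soft:
            self.alarm = True             # soft quota exceeded: report an alarm
        elif self.alarm and self.used < 0.9 * self.space_soft:
            self.alarm = False            # cleared only below 90% of the soft quota
        return True

qt = QuotaTree(space_hard=100, space_soft=80)
assert qt.apply(90) and qt.alarm              # over the soft quota -> alarm raised
assert not qt.apply(20)                       # would exceed the hard quota -> fails
qt.apply(-30)                                 # delete data: 60 < 0.9 * 80
assert not qt.alarm                           # -> alarm cleared
```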
− Quota trees are critical to the implementation of SmartQuota. Directory quotas can only be configured on quota trees. Quota trees are a special type of directory:
− Quota trees can only be created, deleted, or renamed by administrators in the CLI or GUI.
Only empty quota trees can be deleted.
− Quota trees can be shared through a protocol and cannot be renamed or deleted when they
are being shared.
− Files cannot be moved (through NFS) or cut (through CIFS) between quota trees.
− A hard link cannot be created between quota trees.
Supporting Directory Quotas
SmartQuota limits resource usage by setting one or more resource quotas for each user.
SmartQuota principally employs directory quotas to limit resource usage:
A directory quota limits the maximum available space of all files under a directory. SmartQuota
supports only directory quotas on special level-1 directories (level-1 directories created by
running the specific management command). Such level-1 directories are called quota trees.
The following figure shows a typical configuration of SmartQuota.
2.2.3 SmartVirtualization
2.2.3.1 Overview
As the amount of user data grows, efficient management and capacity expansion of existing storage
systems become increasingly important. However, these operations are impeded by the following
problems:
If a user replaces an existing storage system with a new storage system, service data stored on the
existing storage system must be migrated to the new storage system. However, incompatibility
between storage systems of different vendors prolongs data migration duration and even causes data
loss during migration.
If a user acquires a new storage system and manages storage systems separately, the maintenance
costs will increase with the addition of the new system. In addition, storage resources provided by
existing storage systems and the new storage system cannot be effectively integrated and uniformly
managed.
SmartVirtualization can effectively address these problems. Physical attributes of different storage
systems are shielded for easy configuration and management of storage systems and efficient
utilization of storage resources.
SmartVirtualization is a heterogeneous virtualization feature developed by Huawei. After a local
storage system is connected to a heterogeneous storage system, the local storage system can use the
storage resources provided by the heterogeneous storage system as local storage resources and
manage them in a unified manner, regardless of different software and hardware architectures
between storage systems.
SmartVirtualization applies only to LUNs (block services).
SmartVirtualization resolves incompatibility between storage systems so that a user can manage the storage resources provided by the local storage system and the heterogeneous storage system in a unified manner. Meanwhile, a user can still use the storage resources provided by a legacy storage system to protect the existing investment.
In this section, the local storage system refers to an OceanStor V5 series storage system. The
heterogeneous storage system can be a Huawei (excluding an OEM storage system commissioned by
Huawei) or third-party storage system.
SmartVirtualization allows only management but not configuration of the storage resources on a
heterogeneous storage system.
SmartVirtualization allows online or offline takeover of a heterogeneous storage system.
SmartVirtualization offers the following benefits:
Broad compatibility: The local storage system is compatible with mainstream heterogeneous
storage systems to facilitate planning and managing storage resources in a unified manner.
Conserving storage space: When a local storage system uses the storage space provided by the external LUNs on a heterogeneous storage system, it does not perform full physical data mirroring, which remarkably saves storage space on the local storage system.
Scalable functions: A local storage system can not only use external LUNs as local storage
resources, but also configure value-added functions, such as HyperReplication and HyperSnap,
for these LUNs, to meet higher data security and reliability requirements.
2.2.3.2 Working Principle
2.2.3.2.1 Concepts
Data organization
A local storage system uses a storage virtualization technology. Each LUN in the local storage
system consists of a metadata volume and a data volume.
A metadata volume records data storage locations.
A data volume stores user data.
External LUN
It is a LUN on a heterogeneous storage system, which is displayed as a remote LUN in
DeviceManager.
eDevLUN
In the storage pool of the local storage system, mapped external LUNs are encapsulated as raw storage devices based on the virtualized data organization form. The raw storage devices created in this way are called eDevLUNs. An eDevLUN consists of a metadata volume and a
data volume. Physical space needed by an eDevLUN on the local storage system is that needed
by the metadata volume. Application servers can use eDevLUNs to access data on external LUNs
and configure value-added features, such as HyperSnap, HyperReplication, SmartMigration, and
HyperMirror, for the eDevLUNs.
LUN masquerading
When encapsulating a LUN on a heterogeneous storage system into an eDevLUN, you can
configure the LUN masquerading property. An application server will identify the eDevLUN as a
LUN on the heterogeneous storage system. The WWN and host LUN ID of the eDevLUN
detected by a host are the same as those of the external LUN. The masquerading property of an
eDevLUN is configured to implement online takeover.
Takeover
LUNs on a heterogeneous storage system are mapped to the local storage system to allow the
local storage system to use and manage these LUNs.
Relationship between an eDevLUN and an external LUN
An eDevLUN consists of a data volume and a metadata volume. The data volume is a logical
abstract object of the data on an external LUN. The physical space needed by the data volume is
provided by a heterogeneous storage system instead of the local storage system. The metadata
volume manages the storage locations of data on an eDevLUN. The physical space needed by the
metadata volume is provided by the local storage system. A metadata volume requires only a small amount of storage space. If no value-added feature is configured for eDevLUNs, each eDevLUN consumes about
130 MB of space in the storage pool of the local storage system. A mapping is configured
between each eDevLUN created on the local storage system and each external LUN on a
heterogeneous storage system. An application server accesses the data on an external LUN by
reading and writing data from and to an eDevLUN.
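The following sketch models the eDevLUN relationship under the stated assumptions: only the metadata volume consumes local space, while reads and writes are served by the external LUN's physical space. ExternalLun and EDevLun are hypothetical names used for illustration only.

```python
# Illustrative model of the eDevLUN/external LUN relationship described above;
# the actual mapping is managed internally by the storage system.

class ExternalLun:
    """LUN on the heterogeneous storage system; it holds the user data."""
    def __init__(self):
        self.blocks = {}

class EDevLun:
    """Local raw device: local metadata volume + remote data volume."""
    def __init__(self, external_lun, metadata_space_mb=130):
        self.external = external_lun                # data volume -> remote space
        self.metadata_space_mb = metadata_space_mb  # only this lives locally
        self.metadata = {}                          # LBA -> location on external LUN

    def write(self, lba, data):
        self.metadata[lba] = lba                    # track the location (trivial 1:1 here)
        self.external.blocks[lba] = data            # data physically stays remote

    def read(self, lba):
        return self.external.blocks[self.metadata[lba]]

edev = EDevLun(ExternalLun())
edev.write(0, b"payload")
assert edev.read(0) == b"payload"
```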
storage systems and cross-site data DR. The implementation process is as follows: First, take
over the LUNs on the heterogeneous storage system in each data center and create eDevLUNs.
Then, create an asynchronous remote replication pair between each eDevLUN and each LUN on
a Huawei storage system deployed at the other site.
Heterogeneous Data Protection
After a heterogeneous storage system is taken over using SmartVirtualization, data on the LUNs
in the heterogeneous storage system may still be subject to damage due to viruses or other
reasons. To this end, HyperSnap can be used to create snapshots of eDevLUNs for backing up
the data on external LUNs. Damaged data on an external LUN can be swiftly recovered by
recovering the data on an eDevLUN from a specified snapshot point in time by means of quick snapshot rollback.
Heterogeneous Local HA
After a heterogeneous storage system is taken over using SmartVirtualization, service data is still
stored on the heterogeneous storage system. A variety of heterogeneous storage systems may be
incompatible with one another, which may cause service interruption and even data loss. The
HyperMirror feature can be enabled on the local storage system to create a mirror LUN for each
eDevLUN. Then, two mirror copies of each mirror LUN are saved on the local storage system.
Data on an external LUN is written to both mirror copies at the same time, preventing service
interruption and data loss.
Huawei's SmartMulti-Tenant allows users to create multiple virtual storage systems (vStores) in one physical storage system. With SmartMulti-Tenant, tenants can share hardware resources while safeguarding data security and confidentiality in a multi-protocol unified storage architecture.
SmartMulti-Tenant enables users to implement flexible, easy-to-manage, and cost-effective storage
sharing among multiple vStores in a multi-protocol unified storage infrastructure. SmartMulti-Tenant
supports performance tuning and data protection settings for each vStore to meet different SLA
requirements.
vStore-based service isolation: The development of cloud technology brings a higher sharing level of
underlying resources. There is also an increasing demand for data resource isolation. With
SmartMulti-Tenant, multiple vStores can be created in a physical storage system, providing
independent services and configuration space for each vStore, and isolating services, storage
resources, and networks among vStores. Different vStores can share the same hardware resources,
without affecting data security and privacy.
Example: An enterprise allocates a physical storage system to several business departments. These
business departments manage and allocate their own storage resources while meeting the requirement
for secure storage resource access and isolation.
2.2.4.3 SmartQoS
SmartQoS is an intelligent service quality control feature developed by Huawei. It dynamically
allocates storage system resources to meet the performance requirement of certain applications.
SmartQoS extends the information lifecycle management (ILM) strategy to control the performance level of each application within a storage system. SmartQoS is an essential add-on to a storage
system, especially when certain applications have demanding SLA requirements. In a storage system
serving two or more applications, SmartQoS helps derive the maximum value from the storage
system:
SmartQoS controls the performance level for each application, preventing interference between
applications and ensuring the performance of mission-critical applications.
SmartQoS prioritizes mission-critical applications in storage resource allocation by limiting the
resources allocated to non-critical applications.
SmartQoS applies to LUNs (block services) and file systems (file services).
SmartQoS dynamically allocates storage resources to ensure performance for mission-critical services
and high-priority users.
Ensuring Performance for Mission-Critical Services
SmartQoS is useful in specifying the performance objectives for different services to guarantee
the normal operation of mission-critical services.
You can ensure the performance of mission-critical services by setting I/O priorities or creating
SmartQoS traffic control policies.
The services running on a storage system can be categorized into the following types:
− Online Transaction Processing (OLTP) service is a mission-critical service and
requires excellent real-time performance.
− Archive and backup service involves a large amount of data but requires general real-
time performance.
The OLTP service runs from 08:00 to 00:00 (midnight), and the archive and backup service runs from 00:00 to 08:00.
Adequate system resources must be provided for those two types of services when they are
running in specific periods.
As the OLTP service is a mission-critical service, you can modify LUN I/O priorities to give a
higher priority to the OLTP service than the archive and backup service. This practice guarantees
the normal operation of the OLTP service and prevents the archive and backup service from
affecting the running of the OLTP service.
To meet service requirements, you can leverage the following two policies:
Setting two upper limits:
Traffic control policy A: Limits the bandwidth for the archive and backup service (for example, ≤ 50 MB/s) between 08:00 and 00:00 to reserve adequate system resources for the normal operation of the OLTP service during the daytime.
Traffic control policy B: Limits the IOPS for the OLTP service (for example, ≤ 200) between 00:00 and 08:00 to reserve adequate system resources for the normal operation of the archive and backup service at night.
Setting a lower limit:
Traffic control policy C: Sets the latency objective for the OLTP service (for example, ≤ 10 ms) between 08:00 and 00:00 to reserve adequate system resources for the normal operation of the OLTP service during the daytime.
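A small sketch of how such time-windowed traffic control policies could be represented, using an invented QosPolicy object (the real policies are configured on the storage system, not through code like this):

```python
# Sketch of the three time-windowed traffic control policies above (hypothetical
# policy objects; parameter names are assumptions for illustration).
from dataclasses import dataclass
from datetime import time

@dataclass
class QosPolicy:
    target: str
    metric: str        # "bandwidth_mbps", "iops" or "latency_ms"
    limit: float
    start: time
    end: time

    def active(self, now: time) -> bool:
        # Windows such as 08:00-00:00 wrap around midnight.
        if self.start <= self.end:
            return self.start <= now < self.end
        return now >= self.start or now < self.end

policies = [
    QosPolicy("archive-backup", "bandwidth_mbps", 50, time(8, 0), time(0, 0)),   # A
    QosPolicy("OLTP", "iops", 200, time(0, 0), time(8, 0)),                      # B
    QosPolicy("OLTP", "latency_ms", 10, time(8, 0), time(0, 0)),                 # C
]
print([p.target for p in policies if p.active(time(14, 30))])   # daytime policies
```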
Ensuring Performance for High-Priority Subscribers
To reduce the total cost of ownership (TCO) and maintain service continuity, some subscribers
tend to run their services on the storage platforms offered by a storage service provider instead of
building their own storage systems. However, storage resource preemption may occur among
different types of services with different service characteristics. This may prevent high-priority
subscribers from using adequate storage resources.
SmartQoS is useful in creating SmartQoS policies and setting I/O priorities for different
subscribers. This way, when resources become insufficient, high-priority subscribers can
maintain normal and satisfactory operation of their services.
2.2.4.4 SmartDedupe and SmartCompression
SmartDedupe and SmartCompression are the intelligent data deduplication and compression features
developed by Huawei.
SmartDedupe is a data reduction technology that removes redundant data blocks from a storage
system to reduce the physical storage space used by data and meet the increasing data storage
requirements. OceanStor storage systems support inline deduplication, that is, only new data is
deduplicated.
SmartCompression reorganizes data while maintaining data integrity to reduce data amount, save
storage space, and improve data transmission, processing, and storage efficiency. Storage systems
support inline compression, that is, only new data is compressed.
SmartDedupe and SmartCompression implement data deduplication and compression to reduce the
storage space occupied by data. In application scenarios such as databases, virtual desktops, and email
services, SmartDedupe and SmartCompression can be used independently or jointly to improve
storage efficiency as well as reduce investments and O&M costs.
Application Scenarios for SmartDedupe
Virtual Desktop Infrastructure (VDI) is a common application scenario for SmartDedupe. In VDI
applications, multiple virtual images are created on a storage system. These images contain a
large amount of duplicate data. As the amount of duplicate data increases, the storage space
provided by the storage system becomes insufficient for the normal operation of services.
SmartDedupe removes duplicate data between images to release storage resources for more
service data.
Application Scenarios for SmartCompression
Data compression occupies CPU resources, which increase with the amount of data to be
compressed.
Databases are the best application scenario for SmartCompression. To store a large amount of data in databases, it is worthwhile to trade a little service performance for a more than 65% increase in available storage space.
File services are also a common application scenario for SmartCompression. A typical example
is a file service system that is only busy for half of its service time and has a 50% compression
ratio for datasets.
Engineering, seismic, and geological data: With similar characteristics to database backups, these
types of data are stored in the same format but contain little duplicate data. Such data can be
compressed to save the storage space.
Application Scenarios for Using Both SmartDedupe and SmartCompression
SmartDedupe and SmartCompression can be used together to save more storage space in a wide
range of scenarios, such as data testing or development systems, file service systems, and
engineering data systems.
In VDI applications, multiple virtual images are created on a storage system. These images
contain a large amount of duplicate data. As the amount of duplicate data increases, the storage
space provided by the storage system becomes insufficient for the normal operation of services.
SmartDedupe and SmartCompression remove or compress duplicate data between images to
release storage resources for more service data.
If compression is disabled, the system directly applies for storage space to store data blocks. If
compression is enabled, compression will be performed for the data blocks before storage. The data
blocks will be compressed by the compression engine at the granularity of 512 bytes and then saved in
the system.
The compression engine combines two compression algorithms: one with a high compression speed but a lower compression ratio, and the other with a higher compression ratio but a lower compression speed. By configuring different execution ratios of the two algorithms, you can obtain different trade-offs between performance and data reduction. Only one compression algorithm can be selected for a storage pool. Changing the compression algorithm of a storage pool does not affect data that has already been compressed. During data reads, compressed data is decompressed using the same algorithm that was used to compress it.
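As a hedged illustration of a dual-algorithm engine, the sketch below uses two zlib compression levels as stand-ins for the fast and high-ratio algorithms and records which one was used so that reads decompress with the same choice. The product's actual algorithms and selection logic are not documented here.

```python
# Hedged sketch of a dual-algorithm compression engine; zlib levels are only
# stand-ins for the two algorithms described above.
import zlib, random

FAST, DENSE = 1, 9          # level 1: fast/lower ratio; level 9: slow/higher ratio

def compress_block(block: bytes, dense_ratio: float = 0.3):
    """Pick an algorithm per block based on the configured execution ratio."""
    level = DENSE if random.random() < dense_ratio else FAST
    return level, zlib.compress(block, level)

def decompress_block(level: int, payload: bytes) -> bytes:
    # Data is decompressed with the same algorithm that compressed it,
    # so the recorded level/algorithm ID travels with the block.
    return zlib.decompress(payload)

block = b"seismic " * 64                    # 512-byte sample block
level, payload = compress_block(block)
assert decompress_block(level, payload) == block
```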
3.1.2 SmartQoS
SmartQoS enables you to set upper limits on IOPS or bandwidth for certain applications. Based on the
upper limits, SmartQoS can accurately limit performance of these applications, preventing them from
contending for storage resources with critical applications.
SmartQoS extends the information lifecycle management (ILM) strategy to implement application
performance tiering in the block service. When multiple applications run on one storage system,
proper QoS configurations ensure the performance of critical services:
SmartQoS controls storage resource usage by limiting the performance upper limits of non-
critical applications so that critical applications have sufficient storage resources to achieve
performance objectives.
Some services are prone to traffic bursts or storms in specified time periods, for example, daily
backup, database sorting, monthly salary distribution, and periodic bill settlement. The traffic
bursts or storms will consume a large number of system resources. If the traffic bursts or storms
occur during production hours, interactive services will be affected. To avoid this, you can limit the maximum IOPS or bandwidth of these services during the traffic burst period to control the array resources consumed by the services, preventing production or interactive services from being affected.
As shown in the following figure, I/Os from application servers first enter I/O queues of volumes.
SmartQoS periodically processes I/Os waiting in the queues. It dequeues the head element in a
queue, and attempts to obtain tokens from a token bucket. If the number of remaining tokens in
the token bucket meets the token requirement of the head element, the system delivers the
element to another module for processing and continues to process the next head element. If the
number of remaining tokens in the token bucket does not meet the token requirement of the head
element, the system puts the head element back in the queue and stops I/O dequeuing.
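The dequeue-and-token-bucket flow just described can be sketched as follows; the queue, token bucket, and dispatch callback are illustrative stand-ins for the internal scheduler.

```python
# Sketch of the dequeue-and-token-bucket flow described above (illustrative
# queue/bucket objects, not the storage system's internal scheduler).
from collections import deque

class TokenBucket:
    def __init__(self, rate_per_tick, capacity):
        self.rate, self.capacity, self.tokens = rate_per_tick, capacity, capacity

    def refill(self):
        self.tokens = min(self.capacity, self.tokens + self.rate)

    def take(self, n):
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

def process_queue(queue: deque, bucket: TokenBucket, dispatch):
    """Dequeue head I/Os while tokens last; otherwise put the head back and stop."""
    while queue:
        io = queue.popleft()                 # dequeue the head element
        if bucket.take(io["tokens"]):        # enough tokens -> hand over the I/O
            dispatch(io)
        else:
            queue.appendleft(io)             # not enough tokens -> requeue and stop
            break

bucket = TokenBucket(rate_per_tick=100, capacity=100)
queue = deque({"id": i, "tokens": 40} for i in range(4))
process_queue(queue, bucket, dispatch=lambda io: print("dispatched", io["id"]))
# Two I/Os (80 tokens) are dispatched; the third is put back until the next tick.
bucket.refill()
process_queue(queue, bucket, dispatch=lambda io: print("dispatched", io["id"]))
```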
3.1.3 HyperSnap
HyperSnap is the snapshot feature in the block service that captures the state of volume data at a
specific point in time. The snapshots created using HyperSnap can be exported and used for restoring
volume data.
The system uses the ROW mechanism to create snapshots, which imposes no adverse impact on
volume performance.
The block service supports the consistency snapshot capability. Specifically, the block service can ensure that the snapshots of multiple volumes used by an upper-layer application are taken at the same point in time. Consistency snapshots are used for VM backup. A VM usually has multiple volumes attached. When a VM is backed up, all volume snapshots must be at the same point in time to ensure reliable data restoration.
3.1.4 HyperClone
HyperClone is the clone feature in the block service that provides the linked clone function to create
multiple clone volumes from one snapshot. Data on each clone volume is consistent with that of the
snapshot. Data writes and reads on a clone volume have no impact on the source snapshot or other
clone volumes.
The system supports a linked clone ratio of 1:2048, effectively improving storage space utilization.
A clone volume has all functions of a common volume. You can create snapshots for a clone volume,
use the snapshots to restore the clone volume, and clone the clone volume.
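A minimal sketch of linked clones sharing one snapshot, assuming hypothetical SnapshotImage and CloneVolume classes: each clone keeps only its own modified blocks, so writes on one clone affect neither the snapshot nor sibling clones.

```python
# Illustrative sketch of linked clones sharing one snapshot's data (assumed
# classes; real clone volumes are created and managed by the storage system).

class SnapshotImage:
    def __init__(self, blocks):
        self.blocks = dict(blocks)        # read-only point-in-time data

class CloneVolume:
    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.own_blocks = {}              # only this clone's new writes

    def write(self, lba, data):
        self.own_blocks[lba] = data       # never touches the snapshot or siblings

    def read(self, lba):
        # Unmodified blocks are served from the shared snapshot image.
        return self.own_blocks.get(lba, self.snapshot.blocks.get(lba))

snap = SnapshotImage({0: "base"})
clones = [CloneVolume(snap) for _ in range(3)]   # the product supports up to 1:2048
clones[0].write(0, "changed")
assert clones[0].read(0) == "changed" and clones[1].read(0) == "base"
```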
3.1.5 HyperReplication
HyperReplication is the asynchronous remote replication feature in the block service that periodically
synchronizes differential data on primary and secondary volumes of block service clusters. All the
data generated on primary volumes after the last synchronization will be synchronized to the
secondary volumes.
Periodic synchronization: Based on the preset synchronization period, the primary replication cluster
periodically initiates a synchronization task and breaks it down to each working node based on the
balancing policy. Each working node obtains the differential data generated at specified points in time
and synchronizes the differential data to the secondary end.
No differential logs: HyperReplication does not provide the differential log function. The LSM log
(ROW) mechanism supports data differences at multiple time points, saving memory space and
reducing impacts on host services.
Each logical address mapping entry (metadata) records the time point at which the data was written.
For write requests to the same address at the same time point, data is appended: it is written to a new address, new metadata is recorded, the old metadata is deleted, and the old data space is reclaimed.
For write requests to the same address at different time points, data is also appended: it is written to a new address and new metadata is recorded. If no snapshot exists at the original time point, the old metadata is deleted and its space is reclaimed; otherwise, the old metadata is retained.
The metadata mapping entries themselves can therefore identify the addresses of incremental data modifications within a specified time period.
You can deploy DR clusters as required. A DR cluster provides replication services and manages DR
nodes, cluster metadata, replication pairs, and replication consistency groups. DR nodes can be
deployed on the same servers as storage nodes or on independent servers. DR clusters have excellent
scalability. A single DR cluster contains three to 64 nodes. One system supports a maximum of eight
DR clusters. A single DR cluster supports 64000 volumes and 16000 consistency groups, meeting
future DR requirements.
After an asynchronous remote replication relationship is established between a primary volume at the
primary site and a secondary volume at the secondary site, initial synchronization is implemented.
After initial synchronization, the data status of the secondary volume becomes consistent. Then, I/Os
are processed as follows:
1. The primary volume receives a write request from a production host.
2. The system writes the data to the primary volume, and returns a write completion response to the
host.
3. The system automatically synchronizes incremental data from the primary volume to the
secondary volume at a user-defined interval, which ranges from 60 seconds to 1440 minutes in
the standard license and from 10 seconds to 1440 minutes in the advanced license. If the
synchronization mode is manual, you need to trigger synchronization manually. When the
synchronization starts, the system generates a synchronization snapshot for the primary volume
to ensure that the data read from the primary volume during the synchronization remains
unchanged.
4. The system generates a synchronization snapshot for the secondary volume to back up the secondary volume's data, in case the data becomes unavailable due to an exception during the synchronization.
5. During synchronization, the system copies data in the synchronization snapshot of the primary
volume to the secondary volume. After synchronization, the system automatically deletes the
synchronization snapshots of the primary and secondary volumes.
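The five steps above can be condensed into a minimal Python sketch. Volumes are modeled as dictionaries and the snapshot and rollback helpers are simplifications for illustration, not the actual HyperReplication interfaces.

def host_write(primary, address, data):
    # Steps 1-2: data lands on the primary volume only and completion is
    # returned to the host immediately (asynchronous replication).
    primary[address] = data
    return "write completed"

def sync_cycle(primary, secondary):
    # Steps 3-5: snapshot both volumes, copy differential data, then drop the snapshots.
    primary_snap = dict(primary)       # stable read source for the primary during sync
    secondary_snap = dict(secondary)   # rollback point if the sync is interrupted
    try:
        for address, data in primary_snap.items():
            if secondary.get(address) != data:   # copy only the differential data
                secondary[address] = data
    except Exception:
        secondary.clear()
        secondary.update(secondary_snap)         # fall back to the secondary's snapshot
        raise
    # After a successful cycle, both synchronization snapshots are deleted.
    del primary_snap, secondary_snap

primary_vol, secondary_vol = {}, {}
host_write(primary_vol, 0x10, b"new data")
sync_cycle(primary_vol, secondary_vol)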
3.1.6 HyperMetro
HyperMetro is the active-active storage feature that establishes active-active DR relationships
between two block service clusters in two data centers. It provides HyperMetro volumes by
virtualizing volumes in the two block service clusters and enables the HyperMetro volumes to be read
and written by hosts in the two data centers at the same time. If one data center fails, the other automatically takes over services without data loss or service interruption.
HyperMetro in the block service supports incremental synchronization. If a site fails, the site winning
arbitration continues to provide services. I/O requests change from the dual-write state to the single-
write state. After the faulty site recovers, incremental data can be synchronized to it to quickly restore
the system.
The block service supports logical write error handling. If the system is running properly but one site
fails to process a write I/O, the system will redirect the write I/O to a normal site for processing. After
the fault is rectified, incremental data can be synchronized from the normal site to the one that fails to
process the I/O. By doing so, upper-layer applications do not need to switch sites for I/O processing
upon logical write errors.
HyperMetro supports a wide range of upper-layer applications, including Oracle RAC and VMware.
It is recommended that the distance between two HyperMetro storage systems be less than 100 km in
database scenarios and be less than 300 km in VMware scenarios. For details about supported upper-
layer applications, access Storage Interoperability Navigator.
After receiving a write I/O request, the system writes the data into both the local and remote
volumes at the same time and returns a write success message to the host.
When HyperMetro functions properly but a storage pool at the local site is faulty, the active-
active relationship is disconnected, the remote site continues providing services, and volumes at
the local site cannot be read or written. After the I/O redirection function is enabled at the local
site, read and write I/Os delivered to the local site will be redirected to the remote site for
processing (only the SCSI protocol is supported).
HyperMetro pairs are created and managed by remote replication clusters (one volume can only be used to create one HyperMetro pair). Active-active storage relationships cannot be established among replication clusters in the block service. A new replication cluster can be deployed to meet customers' requirements for a dedicated replication cluster or when one cluster cannot meet service requirements. The block service supports a maximum of 8 replication clusters.
For example, when site A receives a write I/O, the mirroring process is as follows:
A host at site A delivers the write I/O request.
A pre-write log is recorded in the storage pool at site A.
The pre-write log is successfully processed.
The block cluster at site A writes data to the local storage pool and delivers the write request
to the remote cluster at the same time.
After the data is successfully written to the remote cluster, the remote cluster returns a write
success message to the local cluster.
Data is successfully written to both the local and remote clusters. The system deletes the
pre-write log, and returns a write success message to the host.
If the data fails to be written to either the local or the remote site, the active-active storage relationship is disconnected. Only the site to which the data was successfully written continues providing services, and the pre-write log is converted to a data change record. After the active-active storage relationship recovers, incremental data is synchronized between the two sites based on the data change records (see the sketch below).
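The mirroring flow above can be sketched as follows. This is a simplified illustration assuming dictionary-backed volumes; the pre-write log and change-record structures are modeled loosely and are not the product's actual data structures.

class HyperMetroPair:
    # Minimal dual-write sketch: pre-write log, writes to both sites, and
    # conversion of the log into a data change record when one site fails.
    def __init__(self, local, remote):
        self.local = local
        self.remote = remote
        self.prewrite_log = {}        # address -> pending data
        self.change_records = set()   # addresses to resynchronize after recovery
        self.active_active = True

    def write(self, address, data):
        self.prewrite_log[address] = data                     # record the pre-write log
        ok_local = self._try_write(self.local, address, data)
        ok_remote = self._try_write(self.remote, address, data)
        del self.prewrite_log[address]
        if ok_local and ok_remote:
            return "success"                                  # both sites written, log dropped
        # One site failed: the surviving site keeps serving and the delta is remembered.
        self.active_active = False
        self.change_records.add(address)
        return "success (degraded)" if (ok_local or ok_remote) else "error"

    @staticmethod
    def _try_write(site, address, data):
        try:
            site[address] = data
            return True
        except Exception:
            return False

pair = HyperMetroPair(local={}, remote={})
print(pair.write(0x20, b"block"))   # success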
2. Data Consistency Assurance
In the HyperMetro DR scenario, read and write operations can be performed concurrently at both sites, and the two sites can read and write the same volume at the same time. If different hosts concurrently read from or write to the same storage address on a volume, the storage layer must ensure data consistency between the two sites.
Traditional distributed storage systems use a distributed locking mechanism to resolve concurrent write I/O conflicts. In that solution, when a host accesses a volume, it applies for a lock from the cross-site distributed lock service. Data can be written to the volume only after the cross-site lock is obtained; other write requests can proceed only after the lock is released. The problem with this mechanism is that every write operation requires a cross-site lock, which increases cross-site interactions. In addition, concurrency is poor: even when two write requests target two different storage addresses, they are still processed serially, decreasing the efficiency of dual-write and affecting system performance.
HyperMetro uses an optimistic locking mechanism to reduce write conflicts. Write requests initiated by each host are processed independently without applying for locks; conflicts are not checked until the data write is submitted. When the block service detects that the data at the same storage address is being modified by two concurrent write requests, one of the write requests is forwarded to the other site and processed serially there, ensuring data consistency between the two sites.
As shown in the preceding figure, hosts in the two DCs both write data to the HyperMetro volume, and I/O dual writes are performed on the volume. A host in DC A delivers I/O 2 to modify the data at a storage address on the HyperMetro volume. At submission, the system detects that I/O 1 delivered by the host in DC B is also modifying the data at the same address (a local-scope lock can be used to detect whether data modifications conflict). In this case, I/O 2 is forwarded to DC B and is written to both DCs after I/O 1 is processed.
Because the optimistic locking mechanism does not require a cross-site lock service, write requests do not need to apply for locks from a distributed lock service, let alone a cross-site one, improving the concurrency performance of active-active clusters.
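A rough sketch of the optimistic approach described above, using a per-address version counter to stand in for the local-scope conflict check; the version numbers and the forward_to_peer callback are illustrative assumptions, not the actual mechanism.

def submit_write(versions, address, version_seen, io, forward_to_peer):
    # No lock is taken when the write starts; a conflict is detected only at
    # submission time, by comparing the version observed when the write began
    # with the current version of the address.
    current = versions.get(address, 0)
    if current != version_seen:
        # A concurrent write to the same address committed first: forward this
        # request to the peer site so the two writes are serialized there.
        return forward_to_peer(io)
    versions[address] = current + 1   # commit locally; the dual write then proceeds as usual
    return "committed"

versions = {0x30: 4}
print(submit_write(versions, 0x30, 4, io="I/O 1", forward_to_peer=lambda io: "forwarded"))  # committed
print(submit_write(versions, 0x30, 4, io="I/O 2", forward_to_peer=lambda io: "forwarded"))  # forwarded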
3. Cross-Site Bad Block Repair
If a storage pool at a site has bad data blocks, that is, multiple data copies in the storage pool
have bad blocks, you can use the data at the peer site to repair the bad blocks.
4. Performance Optimization
To ensure real-time data consistency of two sites, a write success message is returned to
hosts only when the data has been written to the storage systems at both sites. Real-time
dual-write increases the latency of active-active I/Os. To address this, HyperMetro employs
various I/O performance optimization solutions to mitigate the impact on write latency and
improve the overall active-active service performance.
Initial data synchronization performance optimization: In many cases, only a small amount of data has been written to the local volume before the initial synchronization of active-active mirroring data. To improve copy performance and reduce the impact on hosts, the initial synchronization is optimized when the remote volume contains no data: only the data that has been written to the local volume is synchronized to the remote site. Suppose the local volume is 1 TB in size but only 100 GB of data has been written to it. If the remote volume is a new volume with no data, only the 100 GB of data is synchronized to the remote volume during initial data synchronization.
FastWrite: A write I/O operation is generally divided into two steps: write allocation and
write execution. In this way, to perform a remote write operation, the local site needs to
communicate with the remote site twice. To reduce the communication latency between
sites and improve the write performance, HyperMetro combines write allocation and write
execution as one request and delivers it to the remote site. In addition, the interaction for the
write allocation completion is canceled. This halves the interactions of a cross-site write I/O operation. For example, if the RTT is 1 ms, FastWrite reduces the transmission time for delivering requests to the remote site from 2 ms to 1 ms (a short latency sketch follows this list).
Optimistic lock: When HyperMetro functions properly, both sites support host access. To
ensure data consistency, write operations need to be locked. In the traditional distributed
lock solution, each write request needs to obtain a cross-site distributed lock, increasing the
host write latency. To improve the write performance, HyperMetro uses the local optimistic
lock to replace the traditional distributed lock, reducing the time for cross-site
communication.
Load balancing: HyperMetro is in the active-active storage mode and both sites support host
access. You can set third-party multipathing software to the load balancing mode to balance
the read and write operations delivered to both sites, improving host service performance.
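As noted under FastWrite above, combining write allocation and write execution halves the cross-site round trips. A tiny illustrative Python check of that arithmetic:

def cross_site_write_latency_ms(rtt_ms, fast_write=False):
    # Two round trips (allocation, then execution) without FastWrite,
    # one combined round trip with FastWrite.
    round_trips = 1 if fast_write else 2
    return round_trips * rtt_ms

print(cross_site_write_latency_ms(1.0))                   # 2.0 ms without FastWrite
print(cross_site_write_latency_ms(1.0, fast_write=True))  # 1.0 ms with FastWrite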
Arbitration Mechanism
1. Dual Arbitration Mode
HyperMetro supports the static priority arbitration mode and quorum server mode. If a third-
place quorum server is faulty, the systems automatically switch to the static priority
arbitration mode. When the link between the two sites fails, the quorum function is still
available.
2. Static Priority Mode
When creating a HyperMetro pair, you can specify the preferred site of the pair. If the link
between the two sites is abnormal, the preferred site continues providing services. The
principles are as follows:
The storage at each site periodically sends a heartbeat message to the peer site to check
whether the peer cluster is working properly.
When the heartbeat between the local and remote clusters is abnormal, the pair provides services only at the preferred site (site A in this example).
If the preferred site A itself is faulty, site B also stops providing services, so services are interrupted.
3. Third-Place Arbitration Mode
HyperMetro provides consistency groups. If services running on multiple pairs are mutually
dependent, you can add the pairs into a consistency group. All member pairs in a
consistency group can be arbitrated to the same site when a link is faulty to ensure service
continuity. The arbitration is implemented as follows:
Preferred sites are specified independently for each consistency group: the preferred site of some consistency groups can be site A while that of others is site B.
If a link goes down, some services continue running at site A while others continue running at site B, so service performance is not degraded.
After the link is recovered, differential data is synchronized between the two sites by
consistency group.
3.2.4 Multi-Tenancy
3.2.5 SmartQoS
The object service provides SmartQoS to properly allocate system resources and deliver better service
capabilities.
3.2.7 WORM
Write Once Read Many (WORM) is a technology that makes data read-only once it has been written.
Users can set protection periods for objects. During protection periods, objects can be read but cannot
be modified or deleted. After protection periods expire, objects can be read or deleted but cannot be
modified. WORM is mandatory for archiving systems.
The WORM feature in the object service does not provide any privileged interfaces or methods to
delete or modify object data that has the WORM feature enabled.
WORM policies can be configured for buckets. Different buckets can be configured with different
WORM policies. In addition, you can specify different object name prefixes and protection periods in
WORM policies. For example, you can set a 100-day protection period for objects whose names start
with prefix1 and a 365-day protection period for objects whose names start with prefix2.
The object service uses built-in WORM clocks to time protection periods. After a WORM clock is set, the system times protection periods according to that clock rather than the local node time, so objects remain properly protected even if the local clock is changed. Each object's creation time and expiration time are measured by its WORM clock, so its protection period cannot be altered by changes to the local node time.
A WORM clock automatically adjusts its time according to the local node time:
If the local node time is earlier than the WORM clock time, the WORM clock moves backward by no more than 128 seconds per hour.
If the local node time is later than the WORM clock time, the WORM clock adjusts its time to the local node time (see the sketch below).
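A minimal Python sketch of the two adjustment rules above; the 128-second step comes from the text, while the plain-number time representation and the handling of the hourly interval are simplifications for illustration.

MAX_BACKWARD_PER_HOUR = 128  # seconds, per the rule above

def adjust_worm_clock(worm_clock, local_time, hours_elapsed=1):
    # Return the WORM clock value after one adjustment interval.
    if local_time >= worm_clock:
        return local_time                      # move forward to the local node time
    # Local time is earlier: creep backward by at most 128 seconds per elapsed hour.
    max_step = MAX_BACKWARD_PER_HOUR * hours_elapsed
    return max(local_time, worm_clock - max_step)

# Example: the node clock was set back 600 seconds; the WORM clock closes the
# gap gradually, by no more than 128 seconds per hour.
worm, node = 10_000, 9_400
for _ in range(5):
    worm = adjust_worm_clock(worm, node)
print(worm)   # converges to 9400 after about five hours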
Objects enabled with WORM have three states: unprotected, protected, and protection expired, as
shown in the following figure.
3.2.8 HyperReplication
HyperReplication is a remote replication feature provided in the object service that implements
asynchronous remote replication to periodically synchronize data between primary and secondary
storage systems for system DR. This minimizes service performance deterioration caused by the
latency of long-distance data transmission.
Remote replication is a core technology for DR and backup, as well as the basis for data synchronization and DR. It remotely maintains a data copy through the remote data connection function of storage devices located in different places. Even when a disaster occurs, the data backups on remote storage devices are not affected and can be used for data restoration, ensuring service continuity. Remote replication is divided into synchronous and asynchronous remote replication, depending on whether a client write request requires confirmation from the secondary storage system before completion.
Asynchronous remote replication: When a client sends data to the primary storage system, the primary
storage system writes the data. After the data is successfully written to the primary storage system, a
write success message is returned to the client. The primary storage system periodically synchronizes
data to the secondary storage system, minimizing service performance deterioration caused by the
latency of long-distance data transmission.
Synchronous remote replication: The client sends data to the primary storage system. The primary
storage system synchronizes the data to the secondary storage system in real time. After the data is
successfully written to both the primary and secondary storage systems, a write success message is
returned to the client. Synchronous remote replication maximizes data consistency between the
primary and secondary storage systems and reduces data loss in the event of a disaster.
Default cluster and non-default cluster: A region has only one default cluster; all other clusters are non-default clusters. The difference is that the LS and POE services in the default cluster are active and the default cluster has read and write permissions for LS and POE operations, whereas those services in a non-default cluster are standby and a non-default cluster has only read permission for LS and POE operations. Default and non-default clusters are unrelated to replication groups and to the primary and secondary clusters introduced below.
Primary and secondary clusters: Primary and secondary roles are defined per bucket. The cluster where a source bucket resides is its primary cluster, and the backup of the source bucket is stored in its secondary cluster. Therefore, a cluster is not inherently primary. Assume there are clusters Cluster1 and Cluster2 and buckets Bucket1 and Bucket2. For Bucket1, the primary cluster is Cluster1 and the secondary cluster is Cluster2; at the same time, for Bucket2, the primary cluster is Cluster2 and the secondary cluster is Cluster1. Primary and secondary clusters are unrelated to the default and non-default clusters: the default cluster is not necessarily a primary cluster, and a non-default cluster is not necessarily a secondary cluster.
Replication group: A replication group is the DR attribute of a bucket. It defines the primary and
secondary clusters, as well as the replication link of the bucket. The bucket and all objects in it are
synchronized between the primary and secondary clusters. If one cluster is faulty, data can be
recovered using backups in the other cluster. A bucket belongs to only one replication group. When
creating a bucket, you need to select the replication group to which the bucket belongs. Then, the
system performs remote replication for the bucket based on the replication group's definition. If you
do not select an owning replication group when creating a bucket, the system will add the bucket to
the default replication group. The system has only one default replication group, which is specified by
the user.
A replication relationship is established between the primary and secondary clusters. After the relationship is established, data written to the primary cluster is asynchronously replicated to the secondary cluster. The process is as follows:
The client puts an object to the primary cluster. After the object is successfully uploaded, a replication
log is generated. The log records information required by the replication task, such as the bucket name
and object name.
The synchronization task at the primary cluster reads the replication log, parses the bucket name and object name, reads the data, and writes it to the secondary site, completing the replication of one object. If the secondary cluster is faulty or the network between the primary and secondary clusters fails, replication tasks fail temporarily; the primary cluster retries until the object is replicated successfully.
The primary and secondary clusters are in the same region. After asynchronous replication is configured, data written to the primary cluster is synchronized to the secondary cluster. If the primary cluster becomes faulty, the secondary cluster can be promoted to primary with one click to continue providing services.
The object service can be accessed by the unified domain name and supports seamless failover after a
primary cluster is faulty. Users do not need to change the domain name or URL for accessing the
object service.
3.2.9 Protocol-Interworking
The feature of interworking between object and file protocols (Protocol-Interworking) provided by Huawei distributed storage adds NAS functions to the object storage system. By providing NFS access on top of the distributed object service, the storage system can receive I/O requests from standard NFS clients, parse the requests, convert requests for files into requests for objects, and use the storage capability of the object storage system to process them, with in-depth software optimization to improve customers' NFS experience.
In the object storage system, buckets are classified into common object buckets and file buckets. A
common object bucket can be accessed only through the object protocol. A file bucket can be
accessed through both the standard NFS protocol and object protocol.
The object storage system has a built-in NFS protocol parsing module. This module receives I/O requests from standard NFS clients, parses the requests, converts operations on files into operations on objects, and then uses built-in object clients to send the requests to the storage system for processing.
The object service performs the actual data reads and writes. The NFS module only parses and converts the NFS protocol and caches data; it does not store data. All data is stored in buckets of the object storage system, and the data protection level is determined by the bucket configuration.
All directories and files that customers see in the file system are objects in the object storage system. A directory is an object whose name is the directory's full path followed by a slash (/). Listing a directory is therefore converted to a ListObject request whose prefix is the parent directory and whose delimiter is /. Read/write access to a file is converted to read/write access to the object named after the file's full path. Protocol-Interworking parses the accessed file or directory to obtain the object name and converts the operation on the file or directory into an operation on an object or bucket.
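The file-to-object mapping can be illustrated with a small sketch. The to_object_key helper, the list_object callback, and the handling of the leading slash are assumptions made for the example, not the product's internal API.

def to_object_key(path, is_directory=False):
    # A file maps to the object named after its full path; a directory maps to
    # "<full path>/" (trailing slash), as described above.
    key = path.strip("/")
    return key + "/" if is_directory else key

def list_directory(list_object, bucket, directory):
    # Directory listing becomes a ListObject call whose prefix is the parent
    # directory and whose delimiter is "/".
    prefix = to_object_key(directory, is_directory=True)
    return list_object(bucket=bucket, prefix=prefix, delimiter="/")

print(to_object_key("/media/project1/", is_directory=True))   # media/project1/
print(to_object_key("/media/project1/clip.mov"))              # media/project1/clip.mov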
policies. Hot, warm, and cold data is online in real time, and applications are unaware of data
flow.
Distributed data and metadata management, elastically and effectively meeting future data
access requirements
The decoupled storage-compute big data solution adopts a fully distributed architecture. It
enables a linear growth in system capacity and performance by increasing storage nodes,
requiring no complex resource requirement plans. It can be easily expanded to contain thousands
of nodes and provide EB-level storage capacity, meeting storage demands of fast-growing
services. The native HDFS uses active and standby NameNodes and a single NameNode only
supports a maximum of 100 million files. Different from the native HDFS, the decoupled
storage-compute big data solution adopts a fully distributed NameNode mechanism, enabling a
single namespace to support ten billion files and the whole cluster to support trillions of files.
Full compatibility between EC and native HDFS semantics, helping you migrate services
smoothly
The native HDFS EC does not support interfaces such as append, truncate, hflush, and fsync.
Different from the native HDFS EC, EC adopted in the decoupled storage-compute big data
solution is fully compatible with native HDFS semantics, facilitating smooth service migration
and supporting a wide range of Huawei and third-party big data platforms. The solution even
supports the 22+2 large-ratio EC scheme with a utilization rate of 91.6%, significantly higher
than the utilization achieved by using the native HDFS EC and three-copy mechanism. This
reduces investment costs.
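The space utilization figures quoted above follow directly from the EC ratio, as the short calculation below shows (the 4+2 scheme is included only for comparison).

def ec_utilization(data_fragments, parity_fragments):
    # Usable share of raw capacity for an N+M erasure coding scheme.
    return data_fragments / (data_fragments + parity_fragments)

print(ec_utilization(22, 2))   # 22/24 = 0.9166..., the 91.6% quoted above
print(ec_utilization(4, 2))    # 0.666..., a common smaller-ratio EC scheme
print(1 / 3)                   # 0.333..., the three-copy mechanism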
Enterprise-grade reliability, ensuring service and data security
The decoupled storage-compute big data solution provides a reconstruction speed of 2 TB/hour,
preventing data loss caused by subsequent faults. The solution supports faulty and sub-healthy
disk identification and fault tolerance processing, token-based flow control, as well as silent data
corruption check, ensuring service and data security with enterprise-grade reliability.
3.3.2 SmartTier
Data stored in big data platforms can be classified into cold and hot data. For example, charging data
records (CDRs) and network access logs are frequently accessed by CDR query systems, accounting
systems, or customer behavior analysis systems on the day or in the month when they are generated
and thus become hot data. However, such data will be accessed less frequently or even no longer
accessed in the next month and thus become cold data. Assume that network access logs need to be
stored for 12 months and the logs are frequently accessed only in one month. Storing all the logs on
high-performance media is costly and storing them on low-performance media affects service
performance.
To address this, the HDFS service provides SmartTier, a storage tiering feature, to store hot and cold
data on different tiers. Hot data is stored on high-performance SSDs to ensure service performance
and cold data is stored on SATA disks to reduce costs, providing high energy efficiency.
Achieves optimal utilization and performance for both large and small files.
3.4.2 InfoTier
InfoTier, also named dynamic storage tiering (DST), can store files on storage devices with different
performance levels according to file properties, and can automatically migrate files between devices.
InfoTier meets users' requirements on file processing speed and storage capacity, ensures optimized
space utilization, enhances access performance, and reduces deployment costs.
InfoTier focuses on the following file properties: file name, file path, file size, creation time,
modification time, last access time, owning user/user group, I/O count, I/O popularity, and SSD
acceleration.
Storage Tier Composition
InfoTier enables files to be stored on different tiers based on file properties. A tier consists of one or more node pools; a node pool consists of multiple nodes and is the basic unit of a storage tier. A node pool is divided into multiple disk pools, and a partition is created for each disk pool. Nodes with different characteristics form node pools with different characteristics, which are combined into tiers of different performance levels to implement classified data management. After a node pool is deployed, the nodes in it can no longer be changed; to move a node to another node pool, delete it from its current node pool and then add it to the new one. After a node pool is created, it can be migrated from one storage tier to another without restriping the data in the node pool.
Disks of all the nodes in each node pool form disk pools based on disk type. The disk pool formed by SSDs is used to store small-file data if SSD acceleration is enabled, and the disk pool formed by HDDs is used to store data and metadata. Disk pool division depends on the disk configuration of the nodes. In typical configurations, one SSD is inserted into the first slot of each node and used by the underlying file system, so no disk pool composed of SSDs is available. To take full advantage of SSDs for reading and writing small files, configure SSDs in slots 2 to N.
After system deployment, an administrator can set tiers based on service requirements and
specify the mappings between node pools and tiers. A default tier exists in the system. If no tier
is added, all node pools belong to the default tier. To leverage the advantages of InfoTier, you are
advised to configure multiple tiers and corresponding file pool policies.
You are advised to associate node pools whose nodes have high disk processing capability and fast response with the tier that stores frequently accessed data. This accelerates the system's response to hotspot data and improves overall storage performance.
You are advised to associate node pools whose nodes have lower response speed but large storage capacity with the tier that stores less frequently accessed data. This makes full use of the strengths of different nodes and effectively reduces deployment and maintenance costs.
It is recommended that one tier consist of node pools of the same type. Users can configure the
type of node pools in a tier based on site requirements.
Restriping
Restriping means migrating data that has been stored to another tier or node pool.
The system periodically scans metadata and determines whether to restripe files that have been
stored in the storage system based on file pool policies. If files need to be restriped, the system
sends a restriping task.
Before restriping, the system determines whether the used space of the node pool in the target tier
reaches the read-only watermark. Restriping is implemented only when the used space is lower
than the read-only watermark. If the used spaces of all node pools in the target tier are higher
than the read-only watermark, restriping is stopped.
A restriping operation must not interrupt user access. If a user modifies data that is being
restriped, the restriping operation stops and rolls back. The data that has been restriped to the new
node pool is deleted and restriping will be performed later.
You can start another restriping operation only after the current one is complete. For example,
during the process of restriping a file from tier 1 to tier 2, if the file pool policy changes and the
file needs to be restriped to tier 3, you must wait until the restriping to tier 2 is complete and then
start the restriping to tier 3.
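A rough Python sketch of the watermark check performed before restriping, as described above; the pool fields and the 90% read-only watermark value are illustrative assumptions.

def choose_target_pool(target_tier_pools, read_only_watermark=0.90):
    # Restripe only to a node pool whose used-space ratio is below the read-only
    # watermark; if every pool in the target tier is at or above it, stop restriping.
    candidates = [pool for pool in target_tier_pools
                  if pool["used"] / pool["capacity"] < read_only_watermark]
    if not candidates:
        return None    # all pools above the read-only watermark: restriping stops
    return min(candidates, key=lambda pool: pool["used"] / pool["capacity"])

target_tier = [
    {"name": "pool-a", "capacity": 100, "used": 95},
    {"name": "pool-b", "capacity": 100, "used": 60},
]
print(choose_target_pool(target_tier))   # pool-b, the emptiest pool below the watermark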
Watermark Policy
InfoTier uses watermark policies to monitor the storage capacity of node pools. Based on the available capacity of each node pool, InfoTier determines where to store new data.
The watermark is the percentage of used capacity relative to the available capacity of a disk in a node pool; the available capacity of each disk is taken as the capacity of the smallest disk in the disk pool. Watermarks include the high watermark and the read-only watermark. Watermarks allow Scale-Out NAS to limit where a file is stored and restriped. In addition, you can enable spillover for a node pool to determine whether data can be written to other node pools when the read-only watermark is reached.
File Pool Policy
Administrators can create file pool policies to determine initial file storage locations and storage
tiers to which files are restriped.
Immediately after InfoTier is enabled, the storage matches a file pool policy and uses the file
pool policy to store and restripe files.
A file pool policy can be configured to be a combination of multiple parameters. A file pool
policy can be matched only when the file properties match all parameters of the file pool policy.
3.4.3 InfoAllocator
InfoAllocator is a resource control technology that restricts the available resources (including storage space and file quantity) of a specified user or user group in a directory. Using the InfoAllocator feature, administrators can:
Plan storage space or file quantity for users or user groups properly.
Manage storage space or file quantity for users or user groups.
Make statistics on and check file quantity or storage space capacity consumed by users or user groups.
Quota Types
Capacity quota: manages and monitors storage space usage.
File quantity quota: manages and monitors the number of files.
Quota Modes
Calculate quota: only monitors storage capacity or file quantity.
Mandatory quota: monitors and controls storage capacity or file quantity.
1. Relationship between thresholds of a mandatory quota
Recommended threshold: When the used storage space or file quantity reaches the
recommended threshold, the storage system does not restrict writes but only reports an
alarm.
Soft threshold: When the used storage space or file quantity reaches the soft threshold, the
storage system generates an alarm but allows data writing before the grace period expires.
However, after the grace period expires, the system forbids data writes and reports an alarm.
You need to configure the soft threshold and grace period at the same time.
2. Effective thresholds in multi-quota applications
After you set a quota for a user and its owning user group or for a directory and its parent
directory, or set different types of quotas for a directory, the quotas are all valid and the
quota that reaches the hard threshold takes effect first.
For example, quota A is configured for user group group1 and quota B is configured for user quota_user1, who belongs to group1. Both quotas are effective, and the threshold that is reached first takes effect (see the sketch after this list).
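A minimal sketch of how the recommended, soft, and hard thresholds of a mandatory quota interact, as referenced in the example above; the function signature and return values are assumptions for illustration only.

def check_quota(used, request, hard, soft=None, grace_expired=False, recommended=None):
    # Returns (allowed, alarm) for a write of `request` units against a mandatory quota.
    new_used = used + request
    if new_used > hard:
        return False, "hard threshold exceeded: write denied"
    if soft is not None and new_used > soft:
        if grace_expired:
            return False, "soft threshold exceeded and grace period over: write denied"
        return True, "soft threshold exceeded: alarm raised, write still allowed"
    if recommended is not None and new_used > recommended:
        return True, "recommended threshold exceeded: alarm only"
    return True, None

# 8 GB already used, 1 GB write, recommended/soft/hard thresholds of 8/9/10 GB.
print(check_quota(used=8, request=1, hard=10, soft=9, recommended=8))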
Effective Quotas
Common quota: the quota of a specified directory that can be used by a specified user, or the quota that can be used by all users of a user group for a specified directory. For example, if the hard threshold of a common quota for a user group is 10 GB, the total space used by all users in the user group can be at most 10 GB.
Default quota: the quota that any user in a user group can use for a specified empty directory. The default quota of the everyone user group applies to all users in the cluster that use the directory. For example, if the hard threshold of a default quota for a user group is 10 GB, each user in the user group can use at most 10 GB.
Associated quota: If a default quota is configured for a directory, when a user writes files to the directory or creates a subdirectory, the system automatically generates a new quota, called an associated quota. The associated quota is linked to the default quota, and the use of storage space or file quantity is limited by the default quota.
3.4.4 InfoLocker
Definition of InfoLocker
InfoLocker is a Write Once Read Many (WORM) feature that can be used to set a retention period for files. During the retention period, the files can be read but cannot be modified or deleted. After the retention period expires, the files can be deleted but still cannot be modified. This makes InfoLocker an essential feature for file archiving.
InfoLocker has the enterprise compliance mode and regulatory compliance mode. In enterprise
compliance mode, locked files can only be deleted by system administrators. In regulatory compliance
mode, no one can delete locked files.
There are four WORM file states, as described below.
Unprotected state: A file in this state can be modified or deleted. After the write permission for a file is disabled, the file enters the protected state.
Protected state: A file in this state can be read but cannot be deleted or modified. NOTE: However, the super administrator (admin) can execute the privileged deletion of locked files.
Appended state: After the write permission for an empty file in the protected state is enabled, the file enters the appended state. Data can be appended to a file in the appended state.
Expired state: Files in this state cannot be modified but can be deleted and read, and their properties can be viewed.
3.4.5 InfoStamper
InfoStamper is a directory-based snapshot function provided for scale-out file storage. It can create
snapshots for any directory (except the root directory) in a file system to provide precise on-demand
data protection for users. A single directory supports a maximum of 2048 snapshots while a system
supports a maximum of 8192 snapshots.
COW: Before protected snapshot data is changed, the original data is copied to another location or object and preserved for the snapshot, and the original data object is then replaced with the new data. Scale-Out NAS implements COW for metadata because: 1. one read operation and two write operations are needed during the COW process; 2. metadata is written rarely but read frequently; 3. metadata occupies little space.
ROW: Before protected snapshot data is changed, the new data is written to a new location or object without overwriting the original data. Scale-Out NAS uses ROW for file data; because the data volume is large, ROW reduces the impact on system performance.
In a file system, each file consists of metadata and data and each directory contains only metadata:
Metadata: defines data properties and includes dentries and index nodes. Dentries contain information
about file names, parent directories, and subdirectories and associate file names with inodes. Inodes
contain file size, creation time, access time, permissions, block locations, and other information.
Data: For a file, data is the content of the file. Scale-Out NAS divides data into stripes.
3.4.6 InfoScanner
InfoScanner is an antivirus feature. Scale-Out NAS provides Huawei Antivirus Agent and
interconnects with third-party antivirus software installed on external antivirus servers, thereby
protecting shared directories from virus attacks. The third-party antivirus software accesses shared
directories using the CIFS protocol and scans files in the directories for viruses (in real time or
periodically). If viruses are detected, the third-party antivirus software kills the viruses based on the
configured antivirus policy, providing continuous protection for data in storage.
InfoScanner works as follows:
The antivirus software is installed on the antivirus proxy server.
The antivirus server reads files from CIFS shares for virus scanning and isolation.
3.4.7 InfoReplicator
InfoReplicator provides the directory-level asynchronous remote replication function for Scale-Out
NAS. Folders or files can be periodically or manually replicated between directories in different
storage systems through IP links over a local area network (LAN) or wide area network (WAN),
saving a data duplicate of the local cluster to the remote cluster.
In InfoReplicator, a remote replication pair is a replication relationship that specifies data replication
source, destination, frequency, and other rules.
Synchronization is an operation that copies the data of the primary directory to the secondary
directory. Replication synchronizes data based on pairs to maintain data consistency between the
primary and secondary directories.
InfoReplicator supports two types of synchronization:
Full synchronization: copies all data of the primary directory to the secondary directory.
Incremental synchronization: copies only the data that has changed in the primary directory since the
beginning of the last synchronization.
InfoReplicator allows you to split a pair to suspend replication between directories in the pair. If you
want to resume synchronization between the directories in a split pair to keep directory data
consistent, manually start synchronization for the pair again. By so doing, the suspended
synchronization resumes instead of starting at the beginning. This is called resumable data
transmission.
When data is replicated for the first time from the primary directory to the secondary directory in a
replication pair, the storage system automatically creates a snapshot for the primary directory at the
replication point in time. When data is replicated from the primary directory to the secondary
directory again, the storage system creates a snapshot for the primary directory, compares it with the
last one, analyzes differences between the two snapshots, and synchronizes the changed data to the
secondary directory. In this way, the storage system can easily locate the changed data without the
need to traverse all its directories, improving data synchronization efficiency.
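One synchronization cycle of the snapshot-comparison approach described above can be sketched as follows; directories are modeled as path-to-content dictionaries and the replicate callback is an assumption for illustration.

def incremental_sync(primary_dir, last_snapshot, replicate):
    # Snapshot the primary directory, diff it against the previous snapshot,
    # and replicate only the changed or deleted entries.
    current_snapshot = dict(primary_dir)
    for path, data in current_snapshot.items():
        if last_snapshot.get(path) != data:        # new or modified entry
            replicate("write", path, data)
    for path in last_snapshot:
        if path not in current_snapshot:           # entry removed since the last cycle
            replicate("delete", path, None)
    return current_snapshot   # becomes `last_snapshot` for the next cycle

primary = {"/dir/a.txt": b"v2", "/dir/b.txt": b"v1"}
previous = {"/dir/a.txt": b"v1", "/dir/c.txt": b"v1"}
previous = incremental_sync(primary, previous, replicate=lambda op, path, data: print(op, path))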
Full synchronization refers to the process of fully replicating data from the primary directory to the
secondary directory. The initial synchronization of a pair uses the full synchronization mode.
Incremental synchronization indicates that only the incremental data that is changed after the previous
synchronization is complete and before the current synchronization is started is copied to the
secondary directory. After the initial full synchronization, each synchronization is in incremental
synchronization mode.
A replication zone is a collection of nodes that participate in remote replication. You can add nodes to
the replication zone by specifying front-end service IP addresses of the nodes in the replication zone.
Replication channel refers to a group of replication links between the replication zones of the primary
and secondary Scale-Out NAS systems.
You can create only one replication channel between two storage systems. This channel is shared by
all pairs that are used to replicate data between the two storage systems. This channel is also used to
authenticate and control traffic for all replication links between the two storage systems.
3.4.8 InfoRevive
InfoRevive is used to provide error tolerance for video surveillance systems. By using this feature,
when the number of faulty nodes or disks exceeds the upper limit, some video data can still be read
and new video data can still be written, protecting user data and improving the continuity of video
surveillance services.
InfoRevive supports the following operation modes:
Read Fault Tolerance Mode
When the number of faulty nodes or disks exceeds the upper limit, the system can still read part
of damaged video file data, enhancing data availability and security. This mode applies to the
scenario where video surveillance data has been written to the storage system and only read
operations are required.
Read and Write Fault Tolerance Mode
When the number of faulty nodes or disks exceeds the upper limit, the system can still read and
write part of damaged video file data, enhancing service continuity and availability. This mode
applies to the scenario where video surveillance data is not completely written or new
surveillance data needs to be written, and both write and read operations are required.
Assume that the faulty disks (shown in gray in the figure) are data disks, the 4+1 data protection level is used, and two data copies are damaged. With InfoRevive enabled, the system successfully reads only three copies of data and returns them; the two copies that fail to be read are padded with zeros and also returned. For writes, only three copies return write success; no zeros are padded for the two copies that fail to be written, and the write operation is treated as a successful stripe write.
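The read-fault-tolerance behavior in the example above can be sketched in a few lines; the stripe is modeled as a list of fragments (the "copies" in the text), with None marking an unreadable fragment, and the fragment size is an arbitrary illustrative value.

FRAGMENT_SIZE = 4  # bytes per fragment, purely illustrative

def read_stripe_with_fault_tolerance(fragments):
    # Readable fragments are returned as-is; each unreadable fragment (None) is
    # padded with zeros so partial video data is still returned instead of an error.
    recovered = []
    for fragment in fragments:
        recovered.append(fragment if fragment is not None else b"\x00" * FRAGMENT_SIZE)
    return b"".join(recovered)

# 4+1 protection with two damaged fragments, exceeding the single-failure tolerance.
fragments = [b"AAAA", None, b"CCCC", None]
print(read_stripe_with_fault_tolerance(fragments))   # zero-filled where reads failed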
3.4.9 InfoTurbo
InfoTurbo is a performance acceleration feature that supports intelligent prefetch, SMB3 Multichannel, and NFS protocol enhancement.
Intelligent prefetch provides a higher cache hit ratio for users in media assets scenarios. In latency-
sensitive scenarios, performance can be greatly improved.
The SMB3 Multichannel function greatly improves service performance and reliability. In addition, if
one channel fails, it transmits data over another channel to prevent services from being affected.
In CIFS file sharing scenarios, if a client that uses SMB 3.0 (delivered with Windows 8 and Windows
Server 2012 by default) is equipped with two or more GE/10GE/IB network ports of the same type or
with one GE/10GE/IB network port that supports Receive-Side Scaling (RSS), the client will set up
multiple channels with Scale-Out NAS. By bringing multi-core CPUs and bandwidth resources of
clients into full play, SMB3 Multichannel greatly improves service performance. In addition, after one
channel fails, SMB3 Multichannel transmits data over another channel, thereby improving service
reliability.
The NFS protocol enhancement feature is a performance acceleration feature provided by Scale-Out
NAS. By configuring multiple network ports and installing NFS protocol optimization plug-in
DFSClient on a client, concurrent connections can be established between the client and Scale-Out
NAS, thereby increasing the access bandwidth. Cache optimization is enabled for Mac OS X clients to
further improve access performance to adapt to 4K video editing in media assets scenarios.
4.1.2 Content
4.1.2.1 Project Information
Project information collection
Project information collection is the first step of planning and design and the basis for subsequent
activities. Comprehensive, timely, and accurate identification, filtering, and collection of raw
data are necessary for ensuring information correctness and effectiveness. Storage project
information to be collected involves live network devices, network topology, and service
information.
It also includes the schedule, project delivery time, and key time points. In the schedule, we need
to clarify the time needed to complete the work specified in the delivery scope of a certain phase,
tasks related to the delivery planned in each time period, and milestones as well as time points of
events.
Customer requirement collection: Collect the customer's current service pain points, whether the storage product (in terms of storage capacity and concurrency) meets service growth requirements, and an analysis of future system expansion.
Requirement analysis
Availability: indicates the probability and duration of normal system running during a certain
period. It is a comprehensive feature that measures the reliability, maintainability, and
maintenance support of the system.
Manageability:
− Integrated console: integrates the management functions of multiple devices and systems
and provides end-to-end integrated management tools to simplify administrator operations.
− Remote management: manage the system over the network from a remote console. The devices or systems do not need to be managed by personnel on site.
− Traceability: ensures that the management operation history and important events can be
recorded.
− Automation: The event-driven mode is used to implement automatic fault diagnosis,
periodic and automatic system check, and alarm message sending when the threshold is
exceeded.
Performance: Indicators of a physical system are designed based on the Service Level Agreement
(SLA) for the overall system and different users. Performance design includes not only
performance indicators required by normal services, but also performance requirements in
abnormal cases, such as the burst peak performance, fault recovery performance, and DR
switchover performance.
Security: Security design must provide all-round security protection for the entire system. The
following aspects must be included: physical layer security, network security, host security,
application security, virtualization security, user security, security management, and security
service. Multiple security protection and management measures are required to form a
hierarchical security design.
Cost: Cost is always an important factor. A good design should always focus on the total cost of
ownership (TCO). When calculating the TCO, consider all associated costs, including the
purchase cost, installation cost, energy cost, upgrade cost, migration cost, service cost,
breakdown cost, security cost, risk cost, reclamation cost, and handling cost. The cost and other
design principles need to be coordinated based on balance principles and best practices.
4.1.2.2 Hardware Planning
Storage device selection: Consider the following aspects: capacity, throughput, and IOPS. Different
scenarios have different requirements. The cost must be considered during the evaluation. If multiple
types of disks can meet the performance requirements, select the most cost-effective one.
Disk type: A disk type in a disk domain corresponds to a storage tier of a storage pool. If the disk
domain does not have a specific disk type, the corresponding storage tier cannot be created for a
storage pool.
Nominal capacity: Disk capacity is defined differently by disk vendors and operating systems. As a result, the nominal capacity of a disk differs from the capacity displayed in the operating system (see the conversion sketch after this list).
Disk capacity defined by disk manufacturers: 1 GB = 1,000 MB, 1 MB = 1,000 KB, 1 KB = 1,000 bytes
Disk capacity calculated by operating systems: 1 GB = 1,024 MB, 1 MB = 1,024 KB, 1 KB = 1,024
bytes
Hot spare capacity: The storage system provides hot spare space to take over data from failed member
disks.
RAID usage: indicates the capacity used by parity data at different RAID levels.
Disk bandwidth performance: The total bandwidth provided by the back-end disks of a storage device
is the sum of the bandwidth provided by all disks. The minimum value is recommended during device
selection.
RAID level: A number of RAID levels have been developed, but just a few of them are still in use.
I/O characteristics: Write operations consume most of disk resources. The read/write ratio describes
the ratio of read and write requests. The disk flushing ratio indicates the ratio of disk flushing
operations when the system responds to read/write requests.
Compatibility check: Use the Huawei Storage Interoperability Navigator to query the compatibility
between storage systems and application servers, switches, and cluster software, and evaluate whether
the live network environment meets the storage compatibility requirements.
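The nominal-capacity difference noted above comes purely from the two definitions of capacity units; a quick conversion sketch:

def nominal_to_os_gb(nominal_gb):
    # Convert a vendor-rated capacity (decimal GB) into the figure an operating
    # system reports using binary units (often still labeled "GB").
    return nominal_gb * 1000**3 / 1024**3

for size in (600, 1200, 3840):
    print(f"{size} GB nominal -> about {nominal_to_os_gb(size):.1f} GB shown by the OS")
# 1000^3 / 1024^3 is roughly 0.9313, so the OS shows about 7% less than the nominal value.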
4.1.2.3 Network Planning
Flash storage
Direct-connection network: An application server is connected to different controllers of a
storage system to form two paths for redundancy. The path between the application server and
the owning controller of LUNs is the optimal path and the other path is a standby path.
Single-switch network:
Switches increase the number of ports to allow more access paths. Moreover, switches extend the
transmission distance by connecting remote application servers to the storage system. As only
one switch is available in this mode, a single point of failure may occur. There are four paths
between the application server and storage system. The two paths between the application server
and the owning controller of LUNs are the optimal paths, and the other two paths are standby
paths. In normal cases, the two optimal paths are used for data transmission. If one optimal path
is faulty, UltraPath selects the other optimal path for data transmission. If both optimal paths are
faulty, UltraPath uses the two standby paths for data transmission. After an optimal path
recovers, UltraPath switches data transmission back to the optimal path again.
Dual-switch network:
With two switches, single points of failure can be prevented, boosting the network reliability.
There are four paths between the application server and storage system. UltraPath works in the same way as in the single-switch networking environment (see the path selection sketch below).
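The optimal/standby path selection described for the single- and dual-switch networks can be sketched as follows; the path dictionaries and field names are illustrative, not UltraPath's actual interface.

def select_transmission_paths(paths):
    # Use the available optimal paths (to the LUN's owning controller); if none
    # are left, fall back to the available standby paths.
    optimal = [p for p in paths if p["optimal"] and p["up"]]
    if optimal:
        return optimal
    return [p for p in paths if not p["optimal"] and p["up"]]

paths = [
    {"name": "P1", "optimal": True,  "up": False},   # failed optimal path
    {"name": "P2", "optimal": True,  "up": True},
    {"name": "P3", "optimal": False, "up": True},
    {"name": "P4", "optimal": False, "up": True},
]
print([p["name"] for p in select_transmission_paths(paths)])   # ['P2']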
Distributed storage
Management plane: interconnects with the customer's management network for system
management and maintenance.
BMC plane: connects to management ports of management or storage nodes to enable remote
device management.
Storage plane: an internal plane used for service data communication among all nodes in the storage system.
Service plane: interconnects with customers' applications and accesses storage devices through
standard protocols such as iSCSI and HDFS.
Replication plane: enables data synchronization and replication among replication nodes.
Arbitration plane: communicates with the HyperMetro quorum server. This plane is planned
when the HyperMetro function is planned for the block service.
Network port and VLAN planning:
On the firewall, allow the following ports: TCP ports (FTP (20), SSH (22), and iSCSI (3260)),
upper-layer network management port (5989), DeviceManager or CLI device connection port
(8080), DeviceManager service management port (8088), iSNS service port (24924), and UDP
port (SNMP (161)).
This example describes switch port planning when six nodes are deployed on a 10GE network,
and the service, storage, and management switches are deployed independently.
M-LAG implements link aggregation among multiple devices. In a dual-active system, one
device is connected to two devices through M-LAG to achieve device-level link reliability.
When the management network uses independent switches, the BMC switch and management
switch can be used independently or together.
The slides describe switch port planning when service and storage switches are used
independently and management and BMC switches are used independently.
4.1.2.4 Service Planning
Block service planning: Disk domain planning is applicable during hybrid flash storage service
planning. For details, see the product documentation. Disk domain planning, disk read/write policy
planning, and iSCSI CHAP planning are optional. Disk domain planning does not involve space
allocation but the number of disks and disk types. The disk domain space size depends on the number
of disks. A disk domain is a collection of disks. Disks in different disk domains are physically
isolated. In this way, faults and storage resources of different disk domains can be isolated.
File service planning: Disk domain planning and user authentication planning are optional. User
permission: Users with the full control permission can not only read and write directories but also
have permissions to modify directories and obtain all permissions of directories. Users with the
forbidden permission can view only shared directories and cannot perform operations on any
directory. File systems can be shared using NFS, CIFS, FTP, and HTTP protocols.
4.1.3 Tools
4.1.3.1 eService LLDesigner
Service engineers spend a lot of time on project planning and design, device installation, and device configuration during project delivery. How can this work be made more efficient?
LLDesigner: provides functions such as hardware configuration, device networking, and resource
allocation to quickly complete product planning and design.
LLDesigner supports free creation, creation by importing a BOQ, and creation by using a template. It
outputs the LLD document and configuration files. LLDesigner provides wizard-based, visualized,
standardized, and automated services.
4.1.3.2 Other Tools
Networking Assistant
Click the networking assistant, select a product model and configuration mode, and output the
networking diagram.
Energy consumption calculation
Enter the power calculator page, select a product and component type, and view the result.
Connecting disk enclosures to the controller enclosure expands the storage space. Observe the
following principles for cascading disk enclosures and then cascade them in the correct way.
Bend cables naturally and reserve at least 97 mm space in front of each enclosure for wrapping cables.
Standard and smart disk enclosures cannot be connected to one expansion loop.
If you want to connect two or more disk enclosures, create multiple loops according to the number of
expansion ports on the controller enclosure and allocate the disk enclosures evenly to the loops.
The number of disk enclosures connected to the expansion ports on the controller enclosure and to
the back-end ports cannot exceed the upper limit.
Connect the expansion module on controller A to expansion module A on each disk enclosure and the
expansion module on controller B to expansion module B on each disk enclosure.
A pair of SAS ports support connection of up to two SAS disk enclosures. One is recommended.
A pair of RDMA ports support connection of up to two smart disk enclosures. One is recommended.
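The following sketch illustrates the cascading principles above: disk enclosures of one type are spread evenly across loops (one loop per expansion port pair), and a loop never holds more than two enclosures. The function and its names are illustrative only.

# Illustrative allocation sketch for the cascading principles above.
def allocate_enclosures(enclosures, port_pairs, max_per_loop=2):
    """Round-robin disk enclosures of one type across the available loops."""
    loops = {pair: [] for pair in port_pairs}
    for index, enclosure in enumerate(enclosures):
        pair = port_pairs[index % len(port_pairs)]
        if len(loops[pair]) >= max_per_loop:
            raise ValueError("loop limit exceeded; add expansion ports or reduce enclosures")
        loops[pair].append(enclosure)
    return loops

if __name__ == "__main__":
    sas_loops = allocate_enclosures(["DAE0", "DAE1"], ["SAS_P0/P1", "SAS_P2/P3"])
    print(sas_loops)   # one enclosure per loop, the recommended layout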
4.2.1.1.4 Checking Hardware Installation and Powering On
Check that all components and cables are correctly installed and connected.
After hardware installation is complete, power on storage devices and check that they are working
properly. You can press the power buttons on all controller enclosures or remotely power on
controller enclosures.
Correct power-on sequence: switch on the external power supplies connected to all devices; press the
power button on either controller; switch on the switches; and then switch on the application servers.
4.2.1.1.5 Initializing the Storage System
After checking that the storage system is correctly powered on, initialize the storage system.
Initialization operations include: changing the management IP address, logging in to DeviceManager,
initializing the configuration wizard, configuring security policies, and handling alarms.
In the initial configuration wizard, configure basic information such as device information, time,
license, and alarms; create a storage pool; scan for UltraPath hosts; and allocate resources.
4.2.1.1.6 Security Policies
System security policies include the account, login, access control, and user account audit policies.
Proper settings of the security policies improve system security.
Configuring account policies: user name, password complexity, and validity period
Configuring login policies: password locking and idle account locking
Configuring authorized IP addresses: This function specifies the IP addresses that are allowed to
access DeviceManager to prevent unauthorized access. After access control is enabled,
DeviceManager is accessible only to the authorized IP addresses or IP address segment.
Configuring user account auditing: After account auditing is enabled, the system periodically sends
account audit alarms to remind the super administrator to audit the number, role, and status of
accounts to ensure account security.
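The authorized IP address idea can be illustrated with a minimal check like the one below: a client may reach DeviceManager only if its address falls inside a configured segment. The segment values are examples, and the helper is not a real DeviceManager interface.

# Minimal sketch of the authorized-IP idea: only addresses inside the
# configured segments may reach DeviceManager. Segment values are examples.
import ipaddress

AUTHORIZED_SEGMENTS = [
    ipaddress.ip_network("192.168.10.0/24"),   # example management segment
    ipaddress.ip_network("192.168.20.5/32"),   # example single maintenance host
]

def is_authorized(client_ip: str) -> bool:
    """Return True if the client address falls inside an authorized segment."""
    address = ipaddress.ip_address(client_ip)
    return any(address in segment for segment in AUTHORIZED_SEGMENTS)

if __name__ == "__main__":
    print(is_authorized("192.168.10.42"))   # True -> access allowed
    print(is_authorized("10.0.0.1"))        # False -> access denied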
4.2.1.1.7 Alarms and Events
To better manage and clear alarms and events, read this section to learn the alarming mechanism,
alarm and event notification methods, and alarm dump function. Alarm severities indicate the impact
of alarms on user services. In Huawei all-flash storage systems, alarm severities are classified into
critical, major, and warning in descending order.
Alarm notifications can be sent by email, SMS message, Syslog, or trap.
Email notification: allows alarms of specified severities to be sent to preset email addresses.
SMS notification: allows alarms and events of specified severities to be sent to preset mobile phones
by SMS.
Syslog notification: allows you to view storage system logs on a Syslog server.
Trap notification: You can modify the addresses that receive trap notifications based on service
requirements. The storage system's alarms and events are sent to the network management systems
or other storage systems specified as trap servers.
Alarm dump: automatically dumps alarm messages to a specified FTP or SFTP server when the
number of alarm messages exceeds a configurable threshold.
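The following sketch illustrates the notification and dump behavior described above: alarms at or above a configured severity are sent to the chosen channels, and the accumulated alarms are handed to a dump routine once their number exceeds a threshold. Channel handling is simplified to printing; none of this is a real storage system interface.

# Illustrative notification/dump sketch; channels simply print.
SEVERITY_ORDER = {"warning": 1, "major": 2, "critical": 3}

def notify(alarm, channels, min_severity="major"):
    """Send an alarm to each channel if its severity meets the configured floor."""
    if SEVERITY_ORDER[alarm["severity"]] >= SEVERITY_ORDER[min_severity]:
        for channel in channels:
            print(f"[{channel}] {alarm['severity'].upper()}: {alarm['message']}")

def dump_if_needed(alarms, threshold, dump):
    """Hand alarms to a dump routine (e.g. an FTP/SFTP upload) past the threshold."""
    if len(alarms) > threshold:
        dump(alarms)
        alarms.clear()

if __name__ == "__main__":
    history = []
    for i in range(5):
        alarm = {"severity": "major", "message": f"example alarm {i}"}
        history.append(alarm)
        notify(alarm, channels=["email", "trap"])
        dump_if_needed(history, threshold=3, dump=lambda a: print(f"dumping {len(a)} alarms"))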
4.2.1.2 Service Deployment
The basic service configuration process involves block service configuration and file service
configuration.
Before configuring the block service, plan the configuration and perform checks. Check whether the
software installation, network connection status, and initial configuration meet the configuration
requirements. To configure the basic block service on DeviceManager, create disk domains, storage
pools, LUNs, LUN groups, hosts, and host groups, and then map LUNs or LUN groups to hosts or
host groups. Some processes and steps vary depending on products. For example, you may not need to
create a disk domain for Huawei all-flash storage devices. For details about the configuration
procedure, see the product documentation of the corresponding product.
Before configuring the file service, plan the configuration and perform checks as well. To configure
the basic file service on DeviceManager, create disk domains, storage pools, and file systems, and
share and access the file systems with application servers. You can create quota trees and quotas for
file systems. Some processes and steps vary according to products. For details about the configuration
procedure, see the product documentation of the corresponding product.
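The following checklists mirror the configuration order described above for the basic block and file services. They are paraphrased step lists only and do not call any real DeviceManager API; the actual steps may differ by product.

# Ordered checklists only: real configuration is performed in DeviceManager.
BLOCK_SERVICE_STEPS = [
    "create disk domain (may be unnecessary on all-flash models)",
    "create storage pool",
    "create LUNs",
    "create LUN group",
    "create hosts and host group",
    "map LUNs or LUN group to hosts or host group",
]

FILE_SERVICE_STEPS = [
    "create disk domain",
    "create storage pool",
    "create file systems (optionally quota trees and quotas)",
    "share file systems (NFS/CIFS/FTP/HTTP)",
    "access shares from application servers",
]

def run_checklist(name, steps):
    """Print each configuration step in order."""
    for number, step in enumerate(steps, start=1):
        print(f"{name} step {number}: {step}")

if __name__ == "__main__":
    run_checklist("block", BLOCK_SERVICE_STEPS)
    run_checklist("file", FILE_SERVICE_STEPS)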
Reserved ports: idle ports. You are advised to run the shutdown command to disable these ports.
NIC port: It connects to the NIC port of each node.
Aggregation port: You are advised to use two GE ports to connect to the management network.
ETH management port: It connects to the ETH management port of the storage switch.
Cable connection in converged deployment (for block)
The figure describes the port usage of nodes when the storage network is 10GE/25GE and each
node is equipped with one 4-port 10GE/25GE NIC.
Cable connection in separated deployment (for block)
The figure describes the port usage of nodes when the storage network is 10GE/25GE and each
node is equipped with one 4-port 10GE/25GE NIC.
Object service node connection
The figure describes the port usage when the service network is GE, the storage network is
10GE, and each storage node is equipped with one 4-port 10GE/25GE NIC.
HDFS service node connection
The figure describes the port usage when the service and storage networks are 10GE and each
storage node is equipped with one 4-port 10GE NIC.
KVM signal cable connection
An idle VGA port must be connected with a KVM cable. The other end of the KVM cable is
bound to the mounting bar of the cabinet.
4.2.2.2 Service Deployment
The following describes port planning on a 48-port CE6800 service switch and a storage switch.
The deployment processes of the block service, file service, HDFS service, and object service are the
same. You select a service type when creating a storage pool, and you can import the licenses of
different services to specify the service type provided by each cluster.
For the block service, you need to create a VBS client before configuring services. For the object
service, you need to initialize the object service before configuring services.
The following describes the configuration processes of different services. The training materials apply
to many scenarios but the specific configuration may vary based on actual needs. For details, see the
corresponding basic service configuration guide.
Block storage configuration process
SCSI: The compute node must be configured with the VBS client, management network, and
front-end storage network. The front-end storage IP address and management IP address of the
added compute node must be able to communicate with the corresponding network planes of the
existing nodes in the cluster (see the sketch after this list).
iSCSI: The compute node must be configured with multipathing software, and an independent
service network is deployed between the host and the storage system. To configure the iSCSI
service, you need to plan the IP addresses for the nodes that provide the iSCSI service.
HDFS service configuration process
When configuring a Global zone/NameNode zone, you need to plan the IP address for the node
to provide the HDFS metadata service and data service for external systems.
Object storage configuration process
The object service uses an independent installation package and needs to be deployed during
object service initialization.
When configuring the service network for the object service, you need to plan the IP addresses
for the nodes to provide the object service.
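As a pre-check for the SCSI case above, the hypothetical sketch below verifies that a new compute node's management and front-end storage addresses fall on the same planes as the existing cluster nodes. The subnet values are examples, and subnet membership is only a planning check, not proof of reachability.

# Hypothetical plane pre-check for adding a compute node; subnets are examples.
import ipaddress

CLUSTER_PLANES = {
    "management": ipaddress.ip_network("192.168.100.0/24"),
    "front_end_storage": ipaddress.ip_network("172.16.0.0/24"),
}

def check_new_node(management_ip: str, storage_ip: str) -> dict:
    """Report whether each planned address falls on the matching cluster plane."""
    return {
        "management": ipaddress.ip_address(management_ip) in CLUSTER_PLANES["management"],
        "front_end_storage": ipaddress.ip_address(storage_ip) in CLUSTER_PLANES["front_end_storage"],
    }

if __name__ == "__main__":
    print(check_new_node("192.168.100.37", "172.16.0.37"))   # both True -> plane check passed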
5.2 Troubleshooting
5.2.1 Fault Overview
Storage system faults are classified into minor faults, major faults, and critical faults in terms of fault
impact.
Faults can be divided into storage faults and environment faults in terms of where they occur.
Storage fault
A storage system fault caused by hardware or software. The fault information can be obtained
from the alarm platform of the storage system.
Environment fault
A software or hardware fault that occurs when data is transferred from the host to the storage
system over a network. Such faults are caused by network links. The fault information can be
obtained from operating system logs, application logs, and switch logs.
Analyze the alarms of higher severities and then those of lower severities.
The alarm severity sequence from high to low is critical alarms, major alarms, and warnings.
Analyze common alarms and then uncommon alarms.
When analyzing an event, confirm whether it is an uncommon or common fault and then
determine its impact. Determine whether the fault occurred on only one component or on
multiple components.
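The analysis order above can be expressed as a simple sort: handle higher-severity alarms first, and within the same severity handle the more frequently seen alarms first. The alarm records below are examples only.

# Small sketch of the analysis order: severity first, then occurrence count.
SEVERITY_RANK = {"critical": 3, "major": 2, "warning": 1}

def analysis_order(alarms):
    """Sort alarms by severity (high to low), then by occurrence count (high to low)."""
    return sorted(alarms, key=lambda a: (SEVERITY_RANK[a["severity"]], a["count"]), reverse=True)

if __name__ == "__main__":
    alarms = [
        {"id": "A1", "severity": "major", "count": 12},
        {"id": "A2", "severity": "critical", "count": 1},
        {"id": "A3", "severity": "major", "count": 3},
    ]
    for alarm in analysis_order(alarms):
        print(alarm["id"], alarm["severity"], alarm["count"])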
To improve the emergency handling efficiency and reduce losses caused by emergency faults,
emergency handling must comply with the following principles:
− If a fault that may cause data loss occurs, stop host services or switch services to the standby
host, and back up the service data in time.
− During emergency handling, completely record all operations performed.
− Emergency handling personnel must participate in dedicated training courses and understand the
related technologies.
− Recover core services before recovering other services.