Isilon Cluster Shutdown
This article provides the procedure for properly shutting down your EMC Isilon cluster and
includes information about the risks associated with an improper cluster shutdown.
CAUTION!
Improperly shutting down the cluster can lead to data availability and integrity issues.
If nodes are improperly shut down, do not leave them without system power for longer than
the life of the NVRAM battery, which is approximately 3 to 5 days depending on the type of
node. If data is still stored in a node's journal and the node is without system power for
longer than the NVRAM battery life, you will lose that data and may have to rebuild the cluster.
Contact EMC Isilon Technical Support for assistance if you have questions about the
procedures or information in this article.
Procedure
The cluster shutdown procedure requires root credentials and serial console access to nodes in
the cluster. The procedure is divided into five phases.
Read the entire procedure before beginning the shutdown process so that you understand the
context and order for completing each step.
IMPORTANT!
If you are running a version of OneFS that has reached its end of service life (EOSL), such as
OneFS 5.5, upgrade to a supported version of OneFS. See the Isilon Supportability &
Compatibility Guide for more information.
These Phase 1 steps are to be performed approximately 4-8 weeks before the scheduled
shutdown. The purpose of this phase is to identify unknown or latent hardware or firmware
issues that could impede the shutdown procedure.
CAUTION!
EMC strongly advises that you follow all the steps in Phase 1 before shutting down your Isilon
cluster.
If circumstances require an immediate cluster-wide shutdown, you can shut down all nodes
simultaneously using the OneFS command-line interface or the OneFS Web administration
interface.
EMC strongly recommends following all the steps in Phase 3 to preserve data integrity in the
event of an emergency shutdown.
1. (Optional) Request an Isilon Health Check. This service evaluates the health of the
cluster to ensure that it is in a supportable, operational state. It is delivered by the
Remote Reactive (Customer Support) team and is available to all customers with an active
maintenance agreement for clusters running OneFS 6.5.5 or later. The Health Check is not
intended to fix cluster issues, or to assess the cluster's configuration, performance, or
workflow. If you meet these requirements, open a Service Request (SR) on the EMC Online
Support site requesting an "Isilon Health Check."
2. Perform a "cold reboot" of each node by completing the following steps:
a. Schedule a maintenance window for a cluster-wide reboot.
b. Shut down each node in your cluster one at a time. This process allows you to
identify memory errors or drive failure modes that are detected only when the
node is powered back on. To shut down each node:
i. NOTE: This process will be very disruptive to all connections, except
NFSv3. Contact Isilon support for assistance with instructions on a
longer process that will not disrupt client activity while the nodes are
being rebooted for this maintenance test.
ii. Open an SSH connection to any node. Shut down each node by running
the following command:
isi config
>>> shutdown <lnn>
iii. Verify that each node has powered off by confirming that the green
power indicator LED on the back of the node is no longer illuminated.
iv. Press the power button to power the node back on.
v. Verify that the node has rejoined the cluster and is healthy by running the
isi status -q command and looking for -OK- in the Health DASR
column of the output.
vi. If a node encounters problems indicated in the Health DASR column, or
fails to rejoin the cluster, resolve these issues before shutting down the
next node. An example of a problem is highlighted below. Node 1 has
rejoined the cluster successfully, but the Health DASR column indicates
that it needs attention.
mycluster-1# isi status -q
 ID |IP Address     |DASR |  In | Out |Total| Used / Size     |Used / Size
----+---------------+-----+-----+-----+-----+-----------------+-----------------
  1 |10.1.16.141    |-A-- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)| (No SSDs)
  2 |10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)| (No SSDs)
  3 |10.1.16.143    |-OK- |    0|  44K|  44K| 2.0G/ 2.8G( 69%)| (No SSDs)
  4 |10.1.16.144    |-OK- |    0|  512|  512| 2.0G/ 2.8G( 69%)| (No SSDs)
----+---------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:           |  98K| 208K| 306K| 7.9G/  11G( 69%)| (No SSDs)
c. Double-check the health of your entire cluster after you have rebooted each
node. Open an SSH connection to any node and run the isi status -q
command. Verify that every node's Health DASR column reads -OK-.
d. Resolve any hardware issues uncovered by the reboot before proceeding to the
next phase.
NOTE
If time does not permit a cold reboot of each node, you can still proactively
uncover some latent hardware issues by performing a rolling reboot, or "warm
reboot," by running the following command for each node:
isi config
>>> reboot <lnn>
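When rechecking cluster health after the reboots, you can spot nodes whose Health DASR column is not -OK- at a glance by filtering the isi status output. The grep patterns below are an assumption based on the sample output style shown in this article; the sketch runs the filter against sample rows embedded in a here-document, and on a live cluster you would pipe isi status -q into the same filter instead.

```shell
# Print only data rows (lines starting with a node number) whose
# Health DASR column is not -OK-. On a live cluster, run:
#   isi status -q | grep -E '^ *[0-9]+ *\|' | grep -v -- '-OK-'
# The rows below are sample data in the style of the article's output.
grep -E '^ *[0-9]+ *\|' <<'EOF' | grep -v -- '-OK-'
  1 |10.1.16.141    |-A-- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)| (No SSDs)
  2 |10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)| (No SSDs)
EOF
# → prints only the node 1 line, whose DASR column reads -A--
```

The `--` terminator keeps grep from interpreting the leading dash of `-OK-` as an option.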
These Phase 2 steps are to be performed on the day that you shut down your Isilon cluster.
During a cluster-wide shutdown, some factors may impede or delay the shutdown process; for
example, outstanding data writes to a node might affect the shutdown. The purpose of steps 1-2
is to ensure that all clients are disconnected from the cluster and that data is properly saved
from node journals to the file system before you run the shutdown command. If you have iSCSI
clients, shut down those clients before the iSCSI service is disabled.
Step 3 describes how to shut down each node in your cluster sequentially using a serial console.
This method is recommended because it enables you to verify that each node is properly shut
down before proceeding to the next node, and make adjustments or fix issues as needed to
ensure a proper cluster shutdown. However, this method may be time-consuming because it
requires connecting a serial console to each node to run the shutdown command. The section,
"Shut down all nodes in your cluster simultaneously," describes how to use the OneFS
command-line interface or the OneFS web administration interface to shut down your cluster.
This method is less time-consuming than step 3, but makes it more challenging to identify
nodes that encounter problems during the shutdown process.
1. Isilon recommends isolating the cluster from clients to ensure that write-heavy clients
do not impede the shutdown procedure. You can do this by disabling the client-facing
services running on your cluster. Perform the following procedure to disable client-
facing services:
a. Identify the client-facing services or protocols that are running on your cluster
by running the following commands for each client-facing service:
isi services apache2
isi services isi_hdfs_d
isi services isi_iscsi_d
isi services ndmpd
isi services nfs
isi services smb
isi services vsftpd
b. Document the services that are "enabled" on your cluster based on the output for
each command. For example, the output might show that the SMB service is enabled
whereas the NFS service is disabled.
CAUTION!
If you are disabling the iSCSI service, make sure that you have shut
down iSCSI clients before running the isi_iscsi_d disable
command. Disruption to a mounted iSCSI LUN may cause damage to the
client, which typically requires recovery from backup.
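The service check and the subsequent disabling can be scripted as a pair of small loops. This is a sketch that assumes the isi services syntax shown above and the <service> disable form implied by the iSCSI caution; the second service list is illustrative, so substitute the services that are actually enabled on your cluster.

```shell
# Step 1a: report the status of every client-facing service.
for svc in apache2 isi_hdfs_d isi_iscsi_d ndmpd nfs smb vsftpd; do
    isi services "$svc"
done

# Then disable the services you documented as enabled.
# Example list only; if isi_iscsi_d is among them, shut down
# all iSCSI clients before running the disable command.
for svc in smb nfs vsftpd; do
    isi services "$svc" disable
done
```

These commands run only on a cluster node, so run the first loop, record its output, and edit the second loop's list before running it.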
2. Move data writes stored in node journals to the file system by running the
isi_for_array isi_flush command. The output reports the flush status of each node.
NOTE
On a large cluster with a high number of outstanding writes, this step may take several
minutes to complete.
a. If a node fails to flush its data, the output identifies the nodes that failed
the flush command, for example node 1 and node 2.
CAUTION!
If you remove a power source from a node that has not flushed data from its
journal to the file system, the risk of data loss increases substantially. Contact
EMC Isilon Technical Support if you need assistance with the shutdown
procedure.
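If only some nodes failed the flush, you can retry on just those nodes rather than the whole cluster. This sketch assumes the -n node-list option of isi_for_array, with nodes 1 and 2 as an example of failed nodes.

```shell
# Retry the journal flush only on the nodes that reported a failure
# (nodes 1 and 2 here are an example). Repeat until every listed
# node flushes successfully before proceeding to step 3.
isi_for_array -n 1,2 isi_flush
```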
3. Shut down each node in the cluster sequentially and monitor the output. This approach
is recommended because it enables you to identify and resolve any issues before
shutting down the next node in the cluster. Shut down each node by performing the
following steps:
IMPORTANT!
Do not run the isi_for_array shutdown -p command to shut down your
cluster.
IMPORTANT!
Any node that panics or reboots at this step requires further investigation.
In particular, all nodes must flush data from the node journal to the file system
before proceeding.
CAUTION!
If you remove a power source from a node that has not flushed data from its journal to
the file system, the risk of data loss increases substantially. Contact EMC Isilon
Technical Support if you need assistance with the shutdown procedure.
a. Attach a serial console to the node.
b. Shut down the node by running the following command:
isi config
>>> shutdown
The console output confirms when the node has shut down successfully.
NOTE
If you do not have access to your nodes through a keyboard, video, mouse
(KVM) switch and must use a laptop instead, this step may take hours to
complete.
c. Watch the console and look for hardware-related failure events, and confirm
that the output indicates the node journal was saved successfully.
If you receive an error that the node journal did not save, you can manually save
the journal by performing the steps in Phase 3.
In case of an emergency, you can shut down all nodes in the cluster simultaneously. However,
this method is not recommended because it does not enable you to monitor the status and output
of each node if an issue occurs. If you choose to follow these steps, EMC strongly
recommends following all the steps in Phase 3 to verify that all nodes have shut down properly
after performing the procedures below.
IMPORTANT!
Any node that panics or reboots at this step requires further investigation. In particular,
all nodes must flush data from the node journal to the file system before proceeding.
CAUTION!
If you remove a power source from a node that has not flushed data from its journal to the file
system, the risk of data loss increases substantially. Contact EMC Isilon Technical Support if
you need assistance with the shutdown procedure.
To shut down all nodes in your cluster, use the OneFS command-line interface or the OneFS
web administration interface.
isi config
>>> shutdown all
IMPORTANT!
Do not run the isi_for_array shutdown -p command to shut down your
cluster.
Confirm that the nodes have properly shut down by looking at the power indicator light-
emitting diode (LED) on the back of each node. All power indicator LEDs should appear dark, or
OFF, which indicates that the node has successfully shut down.
CAUTION!
If a node has not successfully shut down and you disconnect the power source to the node, the
chance of data loss increases substantially. Recovering data requires a lengthy recovery
procedure, and sometimes a complete cluster rebuild.
Contact EMC Isilon Technical Support if you have any doubts about the success of the
shutdown operation.
If the node does not shut down or the journal is not saved
If the power indicator light on the back of the node is still illuminated, the node has not shut
down. If the node has not shut down, or if you receive console output indicating that the node
journal did not save properly (from Phase 2, step 3c), you will need to manually save the
journal to ensure that data is committed to disk before shutting down the node.
To manually save the journal and shut down the node, perform the following steps:
1. Attach a serial console to the node. Determine if the node is responsive to the command-
line interface.
a. If the node is responsive to the command-line interface, reboot the node by
running the following command:
isi config
>>> reboot
b. If the node is not responsive to the command-line interface, manually reboot the
node by pressing and holding the power button on the back of the node. This will
cause the node to power off. Wait 30 seconds and then press the power button
once to boot the node back up again. Proceed to the next step.
CAUTION!
Manually rebooting the node is advised for this step only. Do not manually power off
the node under any other condition; doing so can lead to data loss.
2. After rebooting the node, log back in and use the following steps to save the journal:
a. Attempt to gracefully shut down the node again by running the following
command:
isi config
>>> shutdown
b. If output still indicates that the journal did not save, manually save the journal by
running the isi_save_journal command.
c. If the journal still does not save, unmount the file system, /ifs, by running the
isi_kill_busy && umount /ifs command. Then force save the journal
by running the isi_save_journal -f command.
d. Verify the journal is saved by running the isi_checkjournal command.
e. Do not proceed to the next step until output indicates that the journal is
successfully saved. Contact EMC Isilon Technical Support if needed.
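The escalation in steps a-d above can be summarized as the following sequence, run on the affected node's serial console. This is a sketch of the commands named in the steps; run them one at a time and check each command's output before continuing, rather than executing the whole block blindly.

```shell
isi config                      # step a: attempt a graceful shutdown
# >>> shutdown                  #   (run shutdown at the isi config prompt)

isi_save_journal                # step b: manually save the journal
isi_kill_busy && umount /ifs    # step c: unmount /ifs if the save failed...
isi_save_journal -f             # ...then force-save the journal
isi_checkjournal                # step d: verify that the journal is saved
```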
After your cluster has successfully shut down and the nodes are powered off, only then can the
power source be disconnected from the cluster.
CAUTION!
If a node has not been successfully shut down, do not disconnect the node's power source.
Doing so may result in data loss, a lengthy recovery procedure, and sometimes a complete
cluster rebuild.
NVRAM batteries
When a client writes a file to a node, the writes are first stored in non-volatile RAM (NVRAM)
hosted on the node's journal card. Sometime later, OneFS commits those writes to disk. To
protect the data stored in NVRAM in the event of an unscheduled power outage, each node is
equipped with NVRAM batteries (two for redundancy). A node that is powered off but remains
connected to a power source will continue to refresh its NVRAM batteries. When the power
source is disconnected from the node, the NVRAM batteries will start to drain. Battery life in
the current generation of nodes (X200, S200, X400, and NL400) is approximately five days. In
the previous generation of nodes, NVRAM battery life is approximately three days.
EMC recommends properly shutting down nodes to avoid relying on NVRAM batteries for a
substantial length of time during a power outage.
NOTE
For more information about how Isilon uses NVRAM to preserve data integrity, see the
"Structure of the file system" section in the OneFS web administration and CLI administration
guides.
If the NVRAM batteries on a node drain completely, the node will boot to read-only mode and
stay in read-only mode for approximately 30 minutes until the NVRAM batteries fully charge.
When the batteries are recharged, the node will automatically return to normal read/write mode.
CAUTION!
If data is still stored in NVRAM because of an improper shutdown, and a node is without
system power for longer than the NVRAM battery life, you will experience data loss, a lengthy
recovery procedure, and sometimes a complete cluster rebuild.
These steps are to be performed when you are ready to restart your Isilon cluster. After
powering on the nodes, verify that the cluster is healthy by opening an SSH connection to any
node and running the isi status -q command. Confirm that the Health DASR column reads
-OK- for every node, as in the following example:
 ID |IP Address     |DASR |  In | Out |Total| Used / Size     |Used / Size
----+---------------+-----+-----+-----+-----+-----------------+-----------------
  1 |10.1.16.141    |-OK- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)| (No SSDs)
  2 |10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)| (No SSDs)
  3 |10.1.16.143    |-OK- |    0|  44K|  44K| 2.0G/ 2.8G( 69%)| (No SSDs)
  4 |10.1.16.144    |-OK- |    0|  512|  512| 2.0G/ 2.8G( 69%)| (No SSDs)
----+---------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:           |  98K| 208K| 306K| 7.9G/  11G( 69%)| (No SSDs)
4. Refer to the list of enabled services that was created in Phase 2, step 1b, and
re-enable the services that were disabled by running the isi services <service name>
enable command for each service.
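For example, if SMB, NFS, and FTP were enabled before the shutdown, the services can be re-enabled with a loop like the following. The service list here is illustrative, so substitute the list you recorded in Phase 2, step 1b.

```shell
# Re-enable the client-facing services recorded before the shutdown.
# Example list only; use the services that were enabled on your cluster.
for svc in smb nfs vsftpd; do
    isi services "$svc" enable
done
```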