Isilon Cluster Shutdown
This article provides the procedure for properly shutting down your EMC Isilon cluster and
includes information about the risks associated with an improper cluster shutdown.
CAUTION!
Improperly shutting down the cluster can lead to data availability and integrity issues.
If nodes are improperly shut down, do not leave them without system power for longer than
the life of the NVRAM battery, which is approximately 3 to 5 days depending on the type of
node. If data is still stored in a node's journal and the node is without system power for
longer than the NVRAM battery life, you will lose that data and may have to rebuild the cluster.
Contact EMC Isilon Technical Support for assistance if you have questions about the
procedures or information in this article.
Procedure
The cluster shutdown procedure requires root credentials and serial console access to nodes in
the cluster. The procedure is divided into five phases.
Read the entire procedure before beginning the shutdown process so that you understand the
context and order for completing each step.
IMPORTANT!
If you are running a version of OneFS that has reached its end of service life (EOSL), such as
OneFS 5.5, upgrade to a supported version of OneFS. See the Isilon Supportability &
Compatibility Guide for more information.
These Phase 1 steps are to be performed approximately 4-8 weeks before the scheduled
shutdown. The purpose of this phase is to identify unknown or latent hardware or firmware
issues that could impede the shutdown procedure.
CAUTION!
EMC strongly advises that you follow all the steps in Phase 1 before shutting down your Isilon
cluster.
If circumstances require an immediate cluster-wide shutdown, you can shut down all nodes
simultaneously using the OneFS command-line interface or the OneFS Web administration
interface.
EMC strongly recommends following all the steps in Phase 3 to preserve data integrity in the
event of an emergency shutdown.
1. (Optional) Request an Isilon Health Check. This service evaluates the health of the
cluster to ensure that it is in a supportable, operational state. It is delivered by the
Remote Reactive (Customer Support) team and is available to all customers with an active
maintenance agreement for clusters running OneFS 6.5.5 or later. The Health Check is not
intended to fix cluster issues, or to assess the cluster's configuration, performance, or
workflow. If you meet these requirements, open a Service Request (SR) on the EMC Online
Support site requesting an "Isilon Health Check."
2. Perform a "cold reboot" of each node by completing the following steps:
a. Schedule a maintenance window for a cluster-wide reboot.
b. Shut down each node in your cluster one at a time. This process allows you to
identify memory errors or drive failure modes that are detected only when the
node is powered back on. To shut down each node:
i. NOTE: This process will be very disruptive to all connections, except
NFSv3. Contact Isilon support for assistance with instructions on a
longer process that will not disrupt client activity while the nodes are
being rebooted for this maintenance test.
ii. Open an SSH connection to any node. Shut down each node by running
the following command:
isi config
>>> shutdown <lnn>
iii. Verify that each node has powered off by confirming that the green
power indicator LED on the back of the node is no longer illuminated.
iv. Press the power button to power the node back on.
v. Verify that the node has rejoined the cluster and is healthy by running the
isi status -q command and looking for -OK- in the Health DASR
column of the output.
vi. If a node encounters problems indicated in the Health DASR column, or
fails to rejoin the cluster, resolve these issues before shutting down the
next node. An example of a problem is highlighted below. Node 1 has
rejoined the cluster successfully, but the Health DASR column indicates
that it needs attention.
mycluster-1# isi status -q
 ID |IP Address     |DASR |  In | Out |Total| Used / Size     |Used / Size
----+---------------+-----+-----+-----+-----+-----------------+-----------------
  1 |10.1.16.141    |-A-- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)| (No SSDs)
  2 |10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)| (No SSDs)
  3 |10.1.16.143    |-OK- |    0|  44K|  44K| 2.0G/ 2.8G( 69%)| (No SSDs)
  4 |10.1.16.144    |-OK- |    0|  512|  512| 2.0G/ 2.8G( 69%)| (No SSDs)
----+---------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:           |  98K| 208K| 306K| 7.9G/  11G( 69%)| (No SSDs)
c. Double-check the health of your entire cluster after you have rebooted each
node. Open an SSH connection to any node and run the isi status -q
command. Verify that every node's Health DASR column reads -OK-.
d. Resolve any hardware issues uncovered by the reboot before proceeding to the
next phase.
NOTE
If time does not permit a cold reboot of each node, you can still proactively
uncover some latent hardware issues by performing a rolling reboot, or "warm
reboot," by running the following command for each node:
isi config
>>> reboot <lnn>
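When rechecking cluster health after the reboots, you can spot nodes whose Health DASR column is not -OK- at a glance by filtering the isi status output. The grep patterns below are an assumption based on the sample output style shown in this article; the sketch runs the filter against sample rows embedded in a here-document, and on a live cluster you would pipe isi status -q into the same filter instead.

```shell
# Print only data rows (lines starting with a node number) whose
# Health DASR column is not -OK-. On a live cluster, run:
#   isi status -q | grep -E '^ *[0-9]+ *\|' | grep -v -- '-OK-'
# The rows below are sample data in the style of the article's output.
grep -E '^ *[0-9]+ *\|' <<'EOF' | grep -v -- '-OK-'
  1 |10.1.16.141    |-A-- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)| (No SSDs)
  2 |10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)| (No SSDs)
EOF
# → prints only the node 1 line, whose DASR column reads -A--
```

The `--` terminator keeps grep from interpreting the leading dash of `-OK-` as an option.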
These Phase 2 steps are to be performed on the day that you shut down your Isilon cluster.
During a cluster-wide shutdown, some factors may impede or delay the shutdown process; for
example, outstanding data writes to a node might affect the shutdown. The purpose of steps 1-2
is to ensure that all clients are disconnected from the cluster and that data is properly saved
from node journals to the file system before you run the shutdown command. If you have iSCSI
clients, shut down those clients before the iSCSI service is disabled.
Step 3 describes how to shut down each node in your cluster sequentially using a serial console.
This method is recommended because it enables you to verify that each node is properly shut
down before proceeding to the next node, and make adjustments or fix issues as needed to
ensure a proper cluster shutdown. However, this method may be time-consuming because it
requires connecting a serial console to each node to run the shutdown command. The section,
"Shut down all nodes in your cluster simultaneously," describes how to use the OneFS
command-line interface or the OneFS web administration interface to shut down your cluster.
This method is less time-consuming than step 3, but makes it more challenging to identify
nodes that encounter problems during the shutdown process.
1. Isilon recommends isolating the cluster from clients to ensure that write-heavy clients
do not impede the shutdown procedure. You can do this by disabling the client-facing
services running on your cluster. Perform the following procedure to disable client-
facing services:
a. Identify the client-facing services or protocols that are running on your cluster
by running the following commands for each client-facing service:
isi services apache2
isi services isi_hdfs_d
isi services isi_iscsi_d
isi services ndmpd
isi services nfs
isi services smb
isi services vsftpd
b. Document the services that are "enabled" on your cluster based on the output for
each command. For example, the output might show that the SMB service is enabled
whereas the NFS service is disabled.
CAUTION!
If you are disabling the iSCSI service, make sure that you have shut
down iSCSI clients before running the isi_iscsi_d disable
command. Disruption to a mounted iSCSI LUN may cause damage to the
client, which typically requires recovery from backup.
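The service check and the subsequent disabling can be scripted as a pair of small loops. This is a sketch that assumes the isi services syntax shown above and the <service> disable form implied by the iSCSI caution; the second service list is illustrative, so substitute the services that are actually enabled on your cluster.

```shell
# Step 1a: report the status of every client-facing service.
for svc in apache2 isi_hdfs_d isi_iscsi_d ndmpd nfs smb vsftpd; do
    isi services "$svc"
done

# Then disable the services you documented as enabled.
# Example list only; if isi_iscsi_d is among them, shut down
# all iSCSI clients before running the disable command.
for svc in smb nfs vsftpd; do
    isi services "$svc" disable
done
```

These commands run only on a cluster node, so run the first loop, record its output, and edit the second loop's list before running it.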
2. Move data writes stored in node journals to the file system by running the
isi_for_array isi_flush command. The output reports the flush status of each node.
NOTE
On a large cluster with a high number of outstanding writes, this step may take several
minutes to complete.
a. If a node fails to flush its data, the output identifies the nodes that failed
the flush command, for example node 1 and node 2.
CAUTION!
If you remove a power source from a node that has not flushed data from its
journal to the file system, the risk of data loss increases substantially. Contact
EMC Isilon Technical Support if you need assistance with the shutdown
procedure.
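If only some nodes failed the flush, you can retry on just those nodes rather than the whole cluster. This sketch assumes the -n node-list option of isi_for_array, with nodes 1 and 2 as an example of failed nodes.

```shell
# Retry the journal flush only on the nodes that reported a failure
# (nodes 1 and 2 here are an example). Repeat until every listed
# node flushes successfully before proceeding to step 3.
isi_for_array -n 1,2 isi_flush
```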
3. Shut down each node in the cluster sequentially and monitor the output. This approach
is recommended because it enables you to identify and resolve any issues before
shutting down the next node in the cluster. Shut down each node by performing the
following steps:
IMPORTANT!
Do not run the isi_for_array shutdown -p command to shut down your
cluster.
IMPORTANT!
Any node that panics or reboots at this step requires further investigation.
In particular, all nodes must flush data from the node journal to the file system
before proceeding.
CAUTION!
If you remove a power source from a node that has not flushed data from its journal to
the file system, the risk of data loss increases substantially. Contact EMC Isilon
Technical Support if you need assistance with the shutdown procedure.
a. Attach a serial console to the node.
b. Shut down the node by running the following command:
isi config
>>> shutdown
The console output confirms when the node has shut down successfully.
NOTE
If you do not have access to your nodes through a keyboard, video, mouse
(KVM) switch and must use a laptop instead, this step may take hours to
complete.
c. Watch the console and look for hardware-related failure events, and confirm
that the output indicates the node journal was saved successfully.
If you receive an error that the node journal did not save, you can manually save
the journal by performing the steps in Phase 3.
In case of an emergency, you can shut down all nodes in the cluster simultaneously. However,
this method is not recommended because it does not enable you to monitor the status and output
of each node if an issue occurs. If you choose to follow these steps, EMC strongly
recommends following all the steps in Phase 3 to verify that all nodes have shut down properly
after performing the procedures below.
IMPORTANT!
Any node that panics or reboots at this step requires further investigation. In particular,
all nodes must flush data from the node journal to the file system before proceeding.
CAUTION!
If you remove a power source from a node that has not flushed data from its journal to the file
system, the risk of data loss increases substantially. Contact EMC Isilon Technical Support if
you need assistance with the shutdown procedure.
To shut down all nodes in your cluster, use the OneFS command-line interface or the OneFS
web administration interface.
isi config
>>> shutdown all
IMPORTANT!
Do not run the isi_for_array shutdown -p command to shut down your
cluster.
Confirm that the nodes have properly shut down by looking at the power indicator light-
emitting diode (LED) on the back of each node. All power indicator LEDs should appear dark, or
OFF, which indicates that the node has successfully shut down.
CAUTION!
If a node has not successfully shut down and you disconnect the power source to the node, the
chance of data loss increases substantially. Recovering data requires a lengthy recovery
procedure, and sometimes a complete cluster rebuild.
Contact EMC Isilon Technical Support if you have any doubts about the success of the
shutdown operation.
If the node does not shut down or the journal is not saved
If the power indicator light on the back of the node is still illuminated, the node has not shut
down. If the node has not shut down, or if you receive console output indicating that the node
journal did not save properly (from Phase 2, step 3c), you will need to manually save the
journal to ensure that data is committed to disk before shutting down the node.
To manually save the journal and shut down the node, perform the following steps:
1. Attach a serial console to the node. Determine if the node is responsive to the command-
line interface.
a. If the node is responsive to the command-line interface, reboot the node by
running the following command:
isi config
>>> reboot
b. If the node is not responsive to the command-line interface, manually reboot the
node by pressing and holding the power button on the back of the node. This will
cause the node to power off. Wait 30 seconds and then press the power button
once to boot the node back up again. Proceed to the next step.
CAUTION!
Manually rebooting the node is advised for this step only. Do not manually power off
the node under any other condition; doing so can lead to data loss.
2. After rebooting the node, log back in and use the following steps to save the journal:
a. Attempt to gracefully shut down the node again by running the following
command:
isi config
>>> shutdown
b. If output still indicates that the journal did not save, manually save the journal by
running the isi_save_journal command.
c. If the journal still does not save, unmount the file system, /ifs, by running the
isi_kill_busy && umount /ifs command. Then force save the journal
by running the isi_save_journal -f command.
d. Verify the journal is saved by running the isi_checkjournal command.
e. Do not proceed to the next step until output indicates that the journal is
successfully saved. Contact EMC Isilon Technical Support if needed.
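The escalation in steps a-d above can be summarized as the following sequence, run on the affected node's serial console. This is a sketch of the commands named in the steps; run them one at a time and check each command's output before continuing, rather than executing the whole block blindly.

```shell
isi config                      # step a: attempt a graceful shutdown
# >>> shutdown                  #   (run shutdown at the isi config prompt)

isi_save_journal                # step b: manually save the journal
isi_kill_busy && umount /ifs    # step c: unmount /ifs if the save failed...
isi_save_journal -f             # ...then force-save the journal
isi_checkjournal                # step d: verify that the journal is saved
```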
After your cluster has successfully shut down and the nodes are powered off, only then can the
power source be disconnected from the cluster.
CAUTION!
If a node has not been successfully shut down, do not disconnect the node's power source.
Doing so may result in data loss, a lengthy recovery procedure, and sometimes a complete
cluster rebuild.
NVRAM batteries
When a client writes a file to a node, the writes are first stored in non-volatile RAM (NVRAM)
hosted on the node's journal card. Sometime later, OneFS commits those writes to disk. To
protect the data stored in NVRAM in the event of an unscheduled power outage, each node is
equipped with NVRAM batteries (two for redundancy). A node that is powered off but remains
connected to a power source will continue to refresh its NVRAM batteries. When the power
source is disconnected from the node, the NVRAM batteries will start to drain. Battery life in
the current generation of nodes (X200, S200, X400, and NL400) is approximately five days. In
the previous generation of nodes, NVRAM battery life is approximately three days.
EMC recommends properly shutting down nodes to avoid relying on NVRAM batteries for a
substantial length of time during a power outage.
NOTE
For more information about how Isilon uses NVRAM to preserve data integrity, see the
"Structure of the file system" section in the OneFS web administration and CLI administration
guides.
If the NVRAM batteries on a node drain completely, the node will boot to read-only mode and
stay in read-only mode for approximately 30 minutes until the NVRAM batteries fully charge.
When the batteries are recharged, the node will automatically return to normal read/write mode.
CAUTION!
If data is still stored in NVRAM because of an improper shutdown, and a node is without
system power for longer than the NVRAM battery life, you will experience data loss, a lengthy
recovery procedure, and sometimes a complete cluster rebuild.
These steps are to be performed when you are ready to restart your Isilon cluster. After
powering on the nodes, verify that the cluster is healthy by opening an SSH connection to any
node and running the isi status -q command. Confirm that the Health DASR column reads
-OK- for every node, as in the following example:
 ID |IP Address     |DASR |  In | Out |Total| Used / Size     |Used / Size
----+---------------+-----+-----+-----+-----+-----------------+-----------------
  1 |10.1.16.141    |-OK- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)| (No SSDs)
  2 |10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)| (No SSDs)
  3 |10.1.16.143    |-OK- |    0|  44K|  44K| 2.0G/ 2.8G( 69%)| (No SSDs)
  4 |10.1.16.144    |-OK- |    0|  512|  512| 2.0G/ 2.8G( 69%)| (No SSDs)
----+---------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:           |  98K| 208K| 306K| 7.9G/  11G( 69%)| (No SSDs)
4. Refer to the list of enabled services that was created in Phase 2, step 1b, and
re-enable the services that were disabled by running the isi services <service name>
enable command for each service.
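For example, if SMB, NFS, and FTP were enabled before the shutdown, the services can be re-enabled with a loop like the following. The service list here is illustrative, so substitute the list you recorded in Phase 2, step 1b.

```shell
# Re-enable the client-facing services recorded before the shutdown.
# Example list only; use the services that were enabled on your cluster.
for svc in smb nfs vsftpd; do
    isi services "$svc" enable
done
```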