Commit 76d9fc2

Sunil Mushran authored and jlbec committed
ocfs2/cluster: Increase the live threshold for global heartbeat
We have seen isolated cases (very few, I might add) of o2hb not detecting all live nodes on startup. One plausible reasoning for it is that other node had a hb io delay at the same time. The live threshold set at 2 (as low as it can be) could be increased to ameliorate the situation.

But increasing the threshold directly affects mount time. Currently it takes around 5 secs to mount a volume in o2cb cluster with local heartbeat. Increasing the threshold will make mounts even slower.

As the issue itself is rare, we have left things as they are for the local heartbeat mode. However we can improve the situation for global heartbeat mode as in that mode, we start the heartbeat much before the mount.

This patch doubles the live threshold for the start of the first region in global heartbeat mode.

Addresses internal Oracle bug#10635585.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
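The trade-off between the live threshold and startup delay can be sketched with a little arithmetic. This is only a rough model, assuming the region becomes steady after `threshold + 1` heartbeat iterations (the count the patch stores in `hr_steady_iterations`). The function name `steady_wait_ms` and the `interval_ms` parameter are hypothetical; the commit does not state the heartbeat interval, so any value passed in is an assumption.

```c
#include <assert.h>

/* Stated in the commit message: the default live threshold is 2. */
#define LIVE_THRESHOLD 2

/*
 * Rough estimate, in milliseconds, of how long a region waits before it
 * is treated as steady: (threshold + 1) heartbeat iterations.
 * interval_ms is a hypothetical parameter, not a value from the commit.
 */
long steady_wait_ms(int live_threshold, long interval_ms)
{
	assert(live_threshold > 0 && interval_ms > 0);
	return (long)(live_threshold + 1) * interval_ms;
}
```

With an assumed 2000 ms interval, `steady_wait_ms(LIVE_THRESHOLD, 2000)` gives 6000 and `steady_wait_ms(LIVE_THRESHOLD << 1, 2000)` gives 10000. Under this model, doubling the threshold adds two more iterations of waiting, which is why the change is limited to the mode where the heartbeat starts well before the mount.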
1 parent 4da6dc2 commit 76d9fc2

1 file changed, +12 -1 lines changed

fs/ocfs2/cluster/heartbeat.c

Lines changed: 12 additions & 1 deletion
@@ -1690,6 +1690,7 @@ static ssize_t o2hb_region_dev_write(struct o2hb_region *reg,
 	struct file *filp = NULL;
 	struct inode *inode = NULL;
 	ssize_t ret = -EINVAL;
+	int live_threshold;
 
 	if (reg->hr_bdev)
 		goto out;
@@ -1766,8 +1767,18 @@ static ssize_t o2hb_region_dev_write(struct o2hb_region *reg,
 	 * A node is considered live after it has beat LIVE_THRESHOLD
 	 * times. We're not steady until we've given them a chance
 	 * _after_ our first read.
+	 * The default threshold is bare minimum so as to limit the delay
+	 * during mounts. For global heartbeat, the threshold doubled for the
+	 * first region.
 	 */
-	atomic_set(&reg->hr_steady_iterations, O2HB_LIVE_THRESHOLD + 1);
+	live_threshold = O2HB_LIVE_THRESHOLD;
+	if (o2hb_global_heartbeat_active()) {
+		spin_lock(&o2hb_live_lock);
+		if (o2hb_pop_count(&o2hb_region_bitmap, O2NM_MAX_REGIONS) == 1)
+			live_threshold <<= 1;
+		spin_unlock(&o2hb_live_lock);
+	}
+	atomic_set(&reg->hr_steady_iterations, live_threshold + 1);
 
 	hb_task = kthread_run(o2hb_thread, reg, "o2hb-%s",
 			      reg->hr_item.ci_name);
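
The selection logic in the second hunk can be exercised outside the kernel with a small userspace sketch. It is not the kernel code. `region_count` is a hypothetical stand-in for `o2hb_pop_count`, the region bitmap is modelled as a single `uint64_t`, the value 2 for the threshold is taken from the commit message, and the `o2hb_live_lock` locking is omitted because the sketch is single-threaded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stated in the commit message: the default live threshold is 2. */
#define LIVE_THRESHOLD 2

/* Count set bits in the bitmap (hypothetical stand-in for o2hb_pop_count). */
int region_count(uint64_t region_bitmap)
{
	int n = 0;

	while (region_bitmap) {
		region_bitmap &= region_bitmap - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Mirror of the patched logic: start from the default threshold, double
 * it only when global heartbeat is active and exactly one region is set
 * in the bitmap, then add one for the iteration after the first read.
 */
int steady_iterations(bool global_heartbeat, uint64_t region_bitmap)
{
	int live_threshold = LIVE_THRESHOLD;

	if (global_heartbeat && region_count(region_bitmap) == 1)
		live_threshold <<= 1;

	return live_threshold + 1;
}
```

In this model, local heartbeat always yields 3 iterations, global heartbeat with a single region in the bitmap yields 5, and global heartbeat with two or more regions falls back to 3, matching the commit's intent of doubling only for the first region.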
