Commit 1d9bd51

htejun authored and axboe committed
blk-mq: replace timeout synchronization with a RCU and generation based scheme
Currently, the blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes.

There's a complex dance around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE between the issue/completion and timeout paths; however, they don't have a synchronization point across request recycle instances and it isn't clear what the barriers add. blk_mq_check_expired() can easily read STARTED from the N-2nd instance, the deadline from the N-1st, and run blk_mark_rq_complete() against the Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce a 5 second delay before the time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out; the code just made the call on a recycled instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with an RCU and generation number based one.

1. Each request has a u64 generation + state value, which can be updated only by the request owner. Whenever a request becomes in-flight, the generation number gets bumped up too. This provides the basis for the timeout path to distinguish different recycle instances of the request. Also, marking a request in-flight and setting its deadline are protected with a seqcount so that the timeout path can fetch both values coherently.

2. The timeout path fetches the generation, state and deadline. If the verdict is timeout, it records the generation into a dedicated request abortion field and does an RCU wait.

3. The completion path is also protected by RCU (from the previous patch) and checks whether the current generation number and state match the abortion field. If so, it skips completion.

4. The timeout path, after the RCU wait, scans requests again and terminates the ones whose generation and state still match the ones requested for abortion. By now, the timeout path knows that either the generation number and state changed if it lost the race, or the completion path will yield to it, so it can safely time out the request.

While it's more lines of code, it's conceptually simpler, doesn't depend on direct use of subtle memory ordering or coherence, and hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary between the issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't removed yet as it's still used in other places. Future patches will move all state tracking to the new mechanism and remove all bitops in the hot paths.

Note that this patch adds a comment explaining a race condition in the BLK_EH_RESET_TIMER path. The race has always been there and this patch doesn't change it; the comment just documents the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead of free_request. Fixed.

v4: - Rebased on top of the hctx_lock() refactoring patch.
    - Added a comment explaining the use of hctx_lock() in the completion path.

v5: - Added comments requested by Bart.
    - Noted the pre-existing BLK_EH_RESET_TIMER race condition in the commit message.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
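For readers who want the shape of the scheme described in points 1-4 above without walking the full diff, the following is a condensed, illustrative sketch assembled from the hunks below. The fields and helpers (gstate_seq, gstate, deadline, blk_mq_rq_update_state(), blk_mq_rq_update_aborted_gstate(), blk_add_timer()) are the patch's own; issue_side() and timeout_side() are hypothetical wrappers added here only for illustration, and REQ_ATOM_* handling, the hctx iteration and error paths are omitted.

/*
 * Illustrative only -- condensed from blk_mq_start_request() and
 * blk_mq_check_expired() in the diff below; not a drop-in function.
 */

/* Issue side: advance the generation, mark in-flight and arm the timer
 * inside one seqcount write section so the timeout path can read
 * rq->gstate and rq->deadline as a matched pair. */
static void issue_side(struct request *rq)
{
	preempt_disable();
	write_seqcount_begin(&rq->gstate_seq);

	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);	/* bumps the generation */
	blk_add_timer(rq);				/* sets rq->deadline */

	write_seqcount_end(&rq->gstate_seq);
	preempt_enable();
}

/* Timeout side: take a coherent snapshot; if the request looks both
 * in-flight and overdue, record the generation for abortion.  After the
 * subsequent synchronize_rcu()/synchronize_srcu(), any racing completion
 * has either already changed rq->gstate or will observe
 * rq->aborted_gstate and back off, so the request can be timed out. */
static void timeout_side(struct request *rq)
{
	unsigned long gstate, deadline;
	int start;

	do {
		start = read_seqcount_begin(&rq->gstate_seq);
		gstate = READ_ONCE(rq->gstate);
		deadline = rq->deadline;
	} while (read_seqcount_retry(&rq->gstate_seq, start));

	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
	    time_after_eq(jiffies, deadline))
		blk_mq_rq_update_aborted_gstate(rq, gstate);
}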
1 parent 5197c05 commit 1d9bd51

File tree

7 files changed, +230 −79 lines changed

block/blk-core.c

Lines changed: 2 additions & 0 deletions
@@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);

block/blk-mq.c

Lines changed: 157 additions & 72 deletions
@@ -483,6 +483,7 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
+	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	if (rq->tag != -1)
@@ -530,6 +531,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 	if (rq->rq_flags & RQF_STATS) {
@@ -573,6 +576,36 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
 }
 
+static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+{
+	unsigned long flags;
+
+	/*
+	 * blk_mq_rq_aborted_gstate() is used from the completion path and
+	 * can thus be called from irq context.  u64_stats_fetch in the
+	 * middle of update on the same CPU leads to lockup.  Disable irq
+	 * while updating.
+	 */
+	local_irq_save(flags);
+	u64_stats_update_begin(&rq->aborted_gstate_sync);
+	rq->aborted_gstate = gstate;
+	u64_stats_update_end(&rq->aborted_gstate_sync);
+	local_irq_restore(flags);
+}
+
+static u64 blk_mq_rq_aborted_gstate(struct request *rq)
+{
+	unsigned int start;
+	u64 aborted_gstate;
+
+	do {
+		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
+		aborted_gstate = rq->aborted_gstate;
+	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
+
+	return aborted_gstate;
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:	the request being processed
@@ -590,8 +623,20 @@ void blk_mq_complete_request(struct request *rq)
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
+	/*
+	 * If @rq->aborted_gstate equals the current instance, timeout is
+	 * claiming @rq and we lost.  This is synchronized through
+	 * hctx_lock().  See blk_mq_timeout_work() for details.
+	 *
+	 * Completion path never blocks and we can directly use RCU here
+	 * instead of hctx_lock() which can be either RCU or SRCU.
+	 * However, that would complicate paths which want to synchronize
+	 * against us.  Let stay in sync with the issue path so that
+	 * hctx_lock() covers both issue and completion paths.
+	 */
 	hctx_lock(hctx, &srcu_idx);
-	if (!blk_mark_rq_complete(rq))
+	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+	    !blk_mark_rq_complete(rq))
 		__blk_mq_complete_request(rq);
 	hctx_unlock(hctx, srcu_idx);
 }
@@ -617,34 +662,32 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	blk_add_timer(rq);
-
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
-	 * Mark us as started and clear complete. Complete might have been
-	 * set if requeue raced with timeout, which then marked it as
-	 * complete. So be sure to clear complete again when we start
-	 * the request, otherwise we'll ignore the completion event.
+	 * Mark @rq in-flight which also advances the generation number,
+	 * and register for timeout.  Protect with a seqcount to allow the
+	 * timeout path to read both @rq->gstate and @rq->deadline
+	 * coherently.
 	 *
-	 * Ensure that ->deadline is visible before we set STARTED, such that
-	 * blk_mq_check_expired() is guaranteed to observe our ->deadline when
-	 * it observes STARTED.
+	 * This is the only place where a request is marked in-flight.  If
+	 * the timeout path reads an in-flight @rq->gstate, the
+	 * @rq->deadline it reads together under @rq->gstate_seq is
+	 * guaranteed to be the matching one.
 	 */
-	smp_wmb();
+	preempt_disable();
+	write_seqcount_begin(&rq->gstate_seq);
+
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
+	blk_add_timer(rq);
+
+	write_seqcount_end(&rq->gstate_seq);
+	preempt_enable();
+
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
-		/*
-		 * Coherence order guarantees these consecutive stores to a
-		 * single variable propagate in the specified order. Thus the
-		 * clear_bit() is ordered _after_ the set bit. See
-		 * blk_mq_check_expired().
-		 *
-		 * (the bits must be part of the same byte for this to be
-		 * true).
-		 */
+	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-	}
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -677,6 +720,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -774,6 +818,7 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
+	unsigned int nr_expired;
 };
 
 void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -801,6 +846,12 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
+		/*
+		 * As nothing prevents from completion happening while
+		 * ->aborted_gstate is set, this may lead to ignored
+		 * completions and further spurious timeouts.
+		 */
+		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
@@ -816,58 +867,61 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long deadline;
+	unsigned long gstate, deadline;
+	int start;
+
+	might_sleep();
 
 	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
-	/*
-	 * Ensures that if we see STARTED we must also see our
-	 * up-to-date deadline, see blk_mq_start_request().
-	 */
-	smp_rmb();
-
-	deadline = READ_ONCE(rq->deadline);
+	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
+	while (true) {
+		start = read_seqcount_begin(&rq->gstate_seq);
+		gstate = READ_ONCE(rq->gstate);
+		deadline = rq->deadline;
+		if (!read_seqcount_retry(&rq->gstate_seq, start))
+			break;
+		cond_resched();
+	}
 
-	/*
-	 * The rq being checked may have been freed and reallocated
-	 * out already here, we avoid this race by checking rq->deadline
-	 * and REQ_ATOM_COMPLETE flag together:
-	 *
-	 * - if rq->deadline is observed as new value because of
-	 *   reusing, the rq won't be timed out because of timing.
-	 * - if rq->deadline is observed as previous value,
-	 *   REQ_ATOM_COMPLETE flag won't be cleared in reuse path
-	 *   because we put a barrier between setting rq->deadline
-	 *   and clearing the flag in blk_mq_start_request(), so
-	 *   this rq won't be timed out too.
-	 */
-	if (time_after_eq(jiffies, deadline)) {
-		if (!blk_mark_rq_complete(rq)) {
-			/*
-			 * Again coherence order ensures that consecutive reads
-			 * from the same variable must be in that order. This
-			 * ensures that if we see COMPLETE clear, we must then
-			 * see STARTED set and we'll ignore this timeout.
-			 *
-			 * (There's also the MB implied by the test_and_clear())
-			 */
-			blk_mq_rq_timed_out(rq, reserved);
-		}
+	/* if in-flight && overdue, mark for abortion */
+	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		blk_mq_rq_update_aborted_gstate(rq, gstate);
+		data->nr_expired++;
+		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
+	 * completions that we lost to, they would have finished and
+	 * updated @rq->gstate by now; otherwise, the completion path is
+	 * now guaranteed to see @rq->aborted_gstate and yield.  If
+	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 */
+	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
+	    !blk_mark_rq_complete(rq))
+		blk_mq_rq_timed_out(rq, reserved);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
 	struct blk_mq_timeout_data data = {
 		.next		= 0,
 		.next_set	= 0,
+		.nr_expired	= 0,
 	};
+	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	/* A deadlock might occur if a request is stuck requiring a
@@ -886,14 +940,40 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
+	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * Wait till everyone sees ->aborted_gstate.  The
+		 * sequential waits for SRCUs aren't ideal.  If this ever
+		 * becomes a problem, we can add per-hw_ctx rcu_head and
+		 * wait in parallel.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->queue_rq_srcu);
+
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+	}
+
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
 	} else {
-		struct blk_mq_hw_ctx *hctx;
-
 		queue_for_each_hw_ctx(q, hctx, i) {
 			/* the hctx may be unmapped, so check it here */
 			if (blk_mq_hw_queue_mapped(hctx))
@@ -1893,6 +1973,22 @@ static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
+static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			       unsigned int hctx_idx, int node)
+{
+	int ret;
+
+	if (set->ops->init_request) {
+		ret = set->ops->init_request(set, rq, hctx_idx, node);
+		if (ret)
+			return ret;
+	}
+
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
+	return 0;
+}
+
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, unsigned int depth)
 {
@@ -1954,12 +2050,9 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		struct request *rq = p;
 
 		tags->static_rqs[i] = rq;
-		if (set->ops->init_request) {
-			if (set->ops->init_request(set, rq, hctx_idx,
-					node)) {
-				tags->static_rqs[i] = NULL;
-				goto fail;
-			}
+		if (blk_mq_init_request(set, rq, hctx_idx, node)) {
+			tags->static_rqs[i] = NULL;
+			goto fail;
 		}
 
 		p += rq_size;
@@ -2099,9 +2192,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	if (!hctx->fq)
 		goto sched_exit_hctx;
 
-	if (set->ops->init_request &&
-	    set->ops->init_request(set, hctx->fq->flush_rq, hctx_idx,
-				   node))
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
 		goto free_fq;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
@@ -3019,12 +3110,6 @@ static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 
 static int __init blk_mq_init(void)
 {
-	/*
-	 * See comment in block/blk.h rq_atomic_flags enum
-	 */
-	BUILD_BUG_ON((REQ_ATOM_STARTED / BITS_PER_BYTE) !=
-		     (REQ_ATOM_COMPLETE / BITS_PER_BYTE));
-
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	return 0;
